Using SVE intrinsics directly in your C code
Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate SIMD instructions. This lets you use the data types and operations available in the SIMD implementation, while allowing the compiler to handle instruction scheduling and register allocation.
These intrinsics are defined in the ARM C Language Extensions for SVE specification.
Introduction
The ARM C language extensions for SVE provide a set of types and accessors for SVE vectors and predicates, and a function interface for all relevant SVE instructions.
The function interface is more general than the underlying architecture, so not every function maps directly to an architectural instruction. The intention is to provide a regular interface and leave the compiler to pick the best mapping to SVE instructions.
The ARM C Language Extensions for SVE specification has a detailed description of this interface, and should be used as the primary reference. This section introduces a selection of features to help you get started with the ARM C Language Extensions (ACLE) for SVE.
Note
Please be aware of issue 1758, described in Known limitations in SVE support.

Header file inclusion
Translation units that use the ACLE should first include arm_sve.h, guarded by __ARM_FEATURE_SVE:

    #ifdef __ARM_FEATURE_SVE
    #include <arm_sve.h>
    #endif /* __ARM_FEATURE_SVE */
All functions and types that are defined in the header file have the prefix sv, to reduce the chance of collisions with other extensions.
SVE vector types
arm_sve.h defines the following C types to represent values in SVE vector registers. Each type describes the type of the elements within the vector:

    svint8_t    svuint8_t
    svint16_t   svuint16_t   svfloat16_t
    svint32_t   svuint32_t   svfloat32_t
    svint64_t   svuint64_t   svfloat64_t

For example, svint64_t represents a vector of 64-bit signed integers, and svfloat16_t represents a vector of half-precision floating-point numbers.
SVE predicate type
The extension also defines a single sizeless predicate type svbool_t, which has enough bits to control an operation on a vector of bytes.
The main use of predicates is to select elements in a vector. When the elements in the vector have N bytes, only the low bit in each sequence of N predicate bits is significant, as shown in the following table:
Table 32 Element selection by predicate type svbool_t (each cell shows the element selected by that predicate bit; "-" marks a bit that is ignored)

    Vector type   bit 0  bit 1  bit 2  bit 3  bit 4  bit 5  bit 6  bit 7  bit 8  ...
    svint8_t        0      1      2      3      4      5      6      7      8    ...
    svint16_t       0      -      1      -      2      -      3      -      4    ...
    svint32_t       0      -      -      -      1      -      -      -      2    ...
    svint64_t       0      -      -      -      -      -      -      -      1    ...
Limitations on how SVE ACLE types can be used
SVE is a vector-length agnostic architecture, allowing an implementation to choose a vector length of any multiple of 128 bits, up to a maximum of 2048 bits. Therefore, the sizes of SVE ACLE types are unknown at compile time, which limits how these types can be used.
Common situations where SVE types may be used include:

- as the type of an object with automatic storage duration
- as a function parameter or return type
- as the type in a (type) {value} compound literal
- as the target of a pointer or reference type
- as a template type argument.
Due to their unknown size at compile time, SVE types may not be used:

- to declare or define a static or thread-local storage variable
- as the type of an array element
- as the operand to a new expression
- as the type of object deleted by a delete expression
- as the argument to sizeof and _Alignof
- with pointer arithmetic on pointers to SVE objects (this affects the +, -, ++, and -- operators)
- as members of unions, structures and classes
- in standard library containers like std::vector.
For a comprehensive list of valid usage, refer to the ARM C Language Extensions for SVE specification.
Writing SVE ACLE functions
SVE ACLE functions have the form:

    svbase[_disambiguator][_type0][_type1]...[_predication]
where the function is built using the following:
base

For most functions, this is the lowercase name of the SVE instruction. Sometimes, letters indicating the type or size of data being operated on are dropped, where they can be implied from the argument types. Unsigned extending loads add a u to indicate that the data will be zero-extended, to differentiate them more explicitly from their signed equivalents.

disambiguator

This field distinguishes between different forms of a function, for example:
- to distinguish between addressing modes
- to distinguish forms that take a scalar rather than a vector as the final argument.
type0 type1 ...

A list of types for vectors and predicates, starting with the return type and followed by each argument type. For example, _s8, _u32, and _f32 represent a signed 8-bit integer, an unsigned 32-bit integer, and a single-precision 32-bit floating-point type, respectively. Predicate types are represented by _b8, _b16, and so on, for predicates suitable for 8-bit and 16-bit types respectively. A predicate type suitable for all element types is represented by _b. Where a type is not needed to disambiguate between variants of a base function, it is omitted.

predication
This suffix describes the inactive elements in the result of a predicated operation. It can be one of the following:

- z – Zero predication: set all inactive elements of the result to zero.
- m – Merge predication: copy all inactive elements from the first vector argument.
- x – "Don't care" predication: use this form when you do not care about the values of inactive elements. The compiler is then free to choose between zeroing, merging, or unpredicated forms to give the best code quality, but gives no guarantee of what data, if any, will be left in inactive elements.
Addressing modes
Load, store, prefetch, and ADR functions have arguments that describe the memory area being addressed. The first addressing argument is the base: either a single pointer to an element type, or a 32-bit or 64-bit vector of addresses. The second argument, when present, offsets the base (or bases) by some number of bytes, elements, or vectors. This offset argument can be an immediate constant value, a scalar argument, or a vector of offsets.
Not every combination of the above options exists. The following table gives examples of some common addressing mode disambiguators, and describes how to interpret the address arguments:
Table 33 Common addressing mode disambiguators

    Disambiguator            Interpretation
    _u32base                 The base argument is a vector of unsigned 32-bit addresses.
    _u64base                 The base argument is a vector of unsigned 64-bit addresses.
    _s32offset _s64offset    The offset argument is a vector of byte offsets. These offsets
    _u32offset _u64offset    are signed or unsigned 32-bit or 64-bit numbers.
    _s32index _s64index      The offset argument is a vector of element-sized indices. These
    _u32index _u64index      indices are signed or unsigned 32-bit or 64-bit numbers.
    _offset                  The offset argument is a scalar, and should be treated as a byte offset.
    _index                   The offset argument is a scalar, and should be treated as an index into an array of elements.
    _vnum                    The offset argument is a scalar, and should be treated as an index into an array of SVE vectors.
In the following example, the address of element i is &base[indices[i]]:

    svuint32_t svld1_gather_[s32]index[_u32](svbool_t pg, const uint32_t *base, svint32_t indices)
Operations involving vectors and scalars
All arithmetic functions that take two vector inputs have an alternative form that takes a vector and a scalar. Conceptually, this scalar is duplicated across a vector, and that vector is used as the second vector argument.
Similarly, arithmetic functions that take three vector inputs have an alternative form that takes two vectors and one scalar.
To differentiate these forms, the disambiguator _n is added to the form that takes a scalar.
Short forms
Sometimes, it is possible to omit part of the full name, and still uniquely identify the correct form of a function, by inspecting the argument types. Where this is possible, these simplified forms are provided as aliases to their fully named equivalents, and will be used for preference in the rest of this document.
In the ARM C Language Extensions for SVE specification, the portion that can be removed is enclosed in square brackets. For example, svclz[_s16]_m has the full name svclz_s16_m, and an overloaded alias, svclz_m.
Example – Naïve step-1 daxpy

daxpy is a BLAS (Basic Linear Algebra Subroutines) subroutine that operates on two arrays of double-precision floating-point numbers. A slice is taken of each of these arrays. For each element in these slices, the element (x) from the first array is multiplied by a constant (a), and then added to the element (y) from the second array. The result is stored back to the second array at the same index.

This example presents a step-1 daxpy implementation, where the indices of x and y start at 0 and increment by 1 each iteration. A C code implementation might look like this:

    void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
    {
        for (int64_t i = 0; i < n; ++i) {
            dy[i] = dx[i] * da + dy[i];
        }
    }
Here is an ACLE equivalent:
    void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
    {
        int64_t i = 0;
        svbool_t pg = svwhilelt_b64(i, n);                      // [1]
        do {
            svfloat64_t dx_vec = svld1(pg, &dx[i]);             // [2]
            svfloat64_t dy_vec = svld1(pg, &dy[i]);             // [2]
            svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da)); // [3]
            i += svcntd();                                      // [4]
            pg = svwhilelt_b64(i, n);                           // [1]
        } while (svptest_any(svptrue_b64(), pg));               // [5]
    }
The following steps explain this example:
[1]  Initialize a predicate register to control the loop. _b64 specifies a predicate for 64-bit elements. Conceptually, this operation creates an integer vector starting at i and incrementing by 1 in each subsequent lane. The predicate lane is active if this value is less than n. Therefore, this loop is safe, if inefficient, even if n ≤ 0. The same operation is used at the bottom of the loop, to update the predicate for the next iteration.
[2]  Load some values into an SVE vector, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. The number of lanes that are loaded depends on the vector width, which is only known at runtime.
[3]  Perform a floating-point multiply-add operation, and pass the result to a store. The _x on the MLA indicates we don't care about the result for inactive lanes. This gives the compiler maximum flexibility in choosing the most efficient instruction. The result of this operation is stored at address &dy[i], guarded by the loop predicate. Lanes where the predicate is false are not stored, and so the value in memory retains its prior value.
[4]  Increment i by the number of double-precision lanes in the vector.
[5]  ptest returns true if any lane of the (newly updated) predicate is active, which causes control to return to the start of the while loop if there is any work left to do.
"Ideal" assembler output:

    daxpy_1_1:
            MOV     Z2.D, D0                        // da
            MOV     X3, #0                          // i
            WHILELT P0.D, X3, X0                    // i, n
    loop:
            LD1D    Z1.D, P0/Z, [X1, X3, LSL #3]
            LD1D    Z0.D, P0/Z, [X2, X3, LSL #3]
            FMLA    Z0.D, P0/M, Z1.D, Z2.D
            ST1D    Z0.D, P0, [X2, X3, LSL #3]
            INCD    X3                              // i
            WHILELT P0.D, X3, X0                    // i, n
            B.ANY   loop
            RET
Example – Naïve general daxpy
This example presents a general daxpy implementation, where the indices of x and y start at 0 and are then incremented by unknown (but loop-invariant) strides each iteration.

    void daxpy(int64_t n, double da, double *dx, int64_t incx,
               double *dy, int64_t incy)
    {
        svint64_t incx_vec = svindex_s64(0, incx);                     // [1]
        svint64_t incy_vec = svindex_s64(0, incy);                     // [1]
        int64_t i = 0;
        svbool_t pg = svwhilelt_b64(i, n);                             // [2]
        do {
            svfloat64_t dx_vec = svld1_gather_index(pg, dx, incx_vec); // [3]
            svfloat64_t dy_vec = svld1_gather_index(pg, dy, incy_vec); // [3]
            svst1_scatter_index(pg, dy, incy_vec,
                                svmla_x(pg, dy_vec, dx_vec, da));      // [4]
            dx += incx * svcntd();                                     // [5]
            dy += incy * svcntd();                                     // [5]
            i += svcntd();                                             // [6]
            pg = svwhilelt_b64(i, n);                                  // [2]
        } while (svptest_any(svptrue_b64(), pg));                      // [7]
    }
The following steps explain this example:
[1]  For each of x and y, initialize a vector of indices, starting at 0 for the first lane and incrementing by incx and incy respectively in each subsequent lane.
[2]  Initialize or update the loop predicate.
[3]  Load a vector's worth of values, guarded by the loop predicate. Lanes where this predicate is false do not perform any load (and so will not generate a fault), and set the result value to 0.0. This time, a base + vector-of-indices gather load is used to load the required non-consecutive values.
[4]  Perform a floating-point multiply-add operation, and pass the result to a store. This time, the base + vector-of-indices scatter store is used to store each result at the correct index of the dy[] array.
[5]  Instead of using i to calculate the load address, increment the base pointers, multiplying the vector length by the stride.
[6]  Increment i by the number of double-precision lanes in the vector.
[7]  Test the loop predicate to work out whether there is any more work to do, and loop back if appropriate.