Overview

This overview topic is a short introduction to the Scalable Vector Extension version two (SVE2) for the Arm AArch64 architecture. It describes the extension concept, main features, application domains, and how to develop programs with SVE2. The topic also describes how SVE2 compares to SVE (and Neon).

Introducing SVE2

After Neon, which has a fixed 128-bit vector length, Arm designed SVE (Scalable Vector Extension), a new SIMD instruction set extension to AArch64 that allows flexible vector length implementations. SVE improves the architecture's suitability for HPC (High Performance Computing) applications, which require processing very large quantities of data.

SVE2 (Scalable Vector Extension version two) is a superset of SVE and Neon that extends data-level parallelism to more functional domains. SVE2 inherits the concept, vector registers, and operating principles of SVE. SVE and SVE2 define 32 scalable vector registers, and silicon partners can choose a suitable vector length for their hardware implementation, anywhere from 128 bits to 2048 bits in 128-bit increments. The advantage of SVE and SVE2 is that a single, vector-length-agnostic instruction set serves every implementation. This design enables developers to write and build software once, then run the same binaries on different AArch64 hardware with different SVE vector lengths, without rebuilding the binaries. Removing the requirement to rebuild binaries allows software to be ported more easily. In addition to the scalable vectors, SVE and SVE2 include:

  • Per-lane predication
  • Gather-load / Scatter-store
  • Speculative vectorization

The features listed above help vectorize and optimize loops when you process large datasets.
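
For example, the following minimal C sketch uses the ACLE intrinsics from arm_sve.h (the function name and parameters are illustrative) to add two arrays with whatever vector length the hardware implements, without encoding that length at compile time:

    #include <arm_sve.h>
    #include <stdint.h>

    // Vector-length-agnostic loop: the same binary runs on any SVE or SVE2
    // implementation, whatever vector length it provides.
    void add_arrays(float *dst, const float *a, const float *b, int64_t n)
    {
        for (int64_t i = 0; i < n; i += svcntw()) {        // svcntw(): 32-bit lanes per vector
            svbool_t    pg = svwhilelt_b32(i, n);           // predicate for lanes still in range
            svfloat32_t va = svld1(pg, &a[i]);              // predicated loads
            svfloat32_t vb = svld1(pg, &b[i]);
            svst1(pg, &dst[i], svadd_x(pg, va, vb));        // predicated add and store
        }
    }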

The main difference between SVE2 and SVE is the functional coverage of the instruction set. SVE was designed for HPC and ML (Machine Learning) applications. SVE2 extends the SVE instruction set to cover more data-processing domains, beyond HPC and ML. The SVE2 instruction set can also accelerate common algorithms used in the following application domains:

  • Computer vision
  • Multi-media
  • LTE baseband processing
  • Genomics
  • In-memory database
  • Web serving
  • General-purpose software

To help compilers vectorize more effectively for these domains, SVE2 adds vector-length-agnostic versions of most of the Neon fixed-point DSP (Digital Signal Processing) and media processing instructions.

What SVE and SVE2 have in common is that both enable large amounts of data to be collected and processed efficiently.

Neither SVE nor SVE2 is an extension of the Neon instruction set; both were designed from the ground up for better data parallelism. However, their hardware logic overlays the Neon hardware implementation: a microarchitecture that supports SVE or SVE2 also supports Neon, but software running on that microarchitecture does not have to use Neon in order to use SVE or SVE2.

Note:

An SVE2 architecture overview is available to next generation architecture licensees, but is not publicly available yet. For more information about SVE, see Introducing Scalable Vector Extension (SVE). For more information about Neon, see the Neon webpage.

SVE2 architecture fundamentals

Like SVE, SVE2 is based on scalable vectors. SVE and SVE2 add the following registers: 32 scalable vector registers Z0-Z31, 16 scalable predicate registers P0-P15, one First Fault predicate Register (FFR), and the scalable vector system control registers ZCR_ELx.

Scalable vector registers Z0-Z31


Each vector is between 128 and 2048 bits long, in 128-bit increments. The bottom 128 bits are shared with the fixed-length 128-bit Neon vectors V0-V31. The scalable vectors can:

  • Hold 64, 32, 16, and 8-bit elements.
  • Support integer, double-precision, single-precision, and half-precision floating-point elements.
  • Have their vector length configured for each Exception level (EL).
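
For illustration, the ACLE (Arm C Language Extensions) for SVE exposes these element types as sizeless vector types in C. The following sketch is illustrative only (the function names are invented here):

    #include <arm_sve.h>

    // One scalable vector type per element width; the number of lanes in each
    // vector depends on the implemented vector length.
    svfloat64_t splat_f64(float64_t x) { return svdup_n_f64(x); } // 64-bit double-precision lanes
    svfloat32_t splat_f32(float32_t x) { return svdup_n_f32(x); } // 32-bit single-precision lanes
    svfloat16_t splat_f16(float16_t x) { return svdup_n_f16(x); } // 16-bit half-precision lanes
    svint8_t    splat_s8(int8_t x)     { return svdup_n_s8(x); }  // 8-bit integer lanes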

Scalable predicate registers P0-P15


The predicate registers are mainly used as bit masks for data operations, where:

  • Each predicate register is 1/8 of the Zx length.
  • P0-P7 are governing predicates for load, store and arithmetic.
  • P8-P15 are additional predicates for loop management.
  • The First Fault Register (FFR) is a dedicated predicate register that records which elements of a speculative memory access completed successfully.

When the predicate registers are not used as bit masks for data operations, they are used as operands of predicate-manipulating and predicate-counting instructions.
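
For example, the following hedged C sketch (the function name and the choice of operation are illustrative) uses one predicate as a bit mask that governs a merging data operation, and uses predicates as operands of a predicate-counting operation:

    #include <arm_sve.h>
    #include <stdint.h>

    // Add 'bias' only to the negative elements of x, and return how many
    // elements were modified.
    uint64_t add_bias_to_negatives(float *x, int64_t n, float bias)
    {
        uint64_t modified = 0;
        for (int64_t i = 0; i < n; i += svcntw()) {
            svbool_t    in_range = svwhilelt_b32(i, n);          // lanes inside the array
            svfloat32_t vx       = svld1(in_range, &x[i]);
            svbool_t    is_neg   = svcmplt(in_range, vx, 0.0f);  // bit mask of negative lanes
            // Merging predication: only the lanes selected by is_neg are updated.
            svst1(in_range, &x[i], svadd_m(is_neg, vx, bias));
            modified += svcntp_b32(in_range, is_neg);            // predicates used as operands
        }
        return modified;
    }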

Scalable vector system control registers ZCR_ELx


The scalable vector system control registers ZCR_ELx configure and indicate the SVE implementation, most importantly the effective vector length:

  • The ZCR_ELx.LEN field sets the vector length for the current and lower Exception levels.
  • Most other bits are currently reserved for future use.
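
Application code running at EL0 cannot access ZCR_ELx directly; what it observes is the resulting effective vector length, for example through the ACLE counting intrinsics. A minimal, illustrative sketch:

    #include <arm_sve.h>
    #include <stdio.h>

    // Print the effective SVE vector length that results from the ZCR_ELx.LEN
    // settings made at higher Exception levels.
    int main(void)
    {
        unsigned long bytes = (unsigned long)svcntb();   // vector length in bytes
        printf("SVE vector length: %lu bits (%lu bytes)\n", bytes * 8, bytes);
        return 0;
    }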

SVE2 assembly syntax

SVE2 follows the same assembly syntax format as SVE. The instruction examples in the following section illustrate this format.


For more information and examples see the Arm Architecture Reference Manual Supplement – The Scalable Vector Extension (SVE) for Armv8-A.

Key SVE architecture features that SVE2 inherits:

  • Gather-load and scatter-store

    SVE provides flexible addressing modes that accept a vector base address or a vector offset, which enables loading data from non-contiguous memory locations into a single vector register. For example:

    LD1SB  Z0.S, P0/Z, [Z1.S, #4]   // Gather load of signed bytes to active 32-bit elements of Z0 from memory addresses generated by 32-bit vector base Z1 plus immediate index #4.
    LD1SB  Z0.D, P0/Z, [X0, Z1.D]  // Gather load of signed bytes to active elements of Z0 from memory addresses generated by a 64-bit scalar base X0 plus vector index in Z1.D.
  • Per-lane predication

    Operations work on individual vector lanes under the control of a governing predicate register. For example:

    ADD Z0.D, P0/M, Z1.D, Z2.D  // Add the active elements of Z1 and Z2 and put the result in Z0. P0 indicates which elements of the operands are active or inactive. The 'M' after P0 selects merging predication: inactive elements of Z0 keep the values they had before the ADD operation. A 'Z' after P0 would instead zero the inactive elements in the destination vector register.
  • Predicate-driven loop control and management

    Eliminate loop heads, tails, and other overhead by processing partial vectors: the predicate registers record which elements are active and inactive, so that in each iteration only the active elements perform the intended operations. For example:

    WHILELO P0.S, X8, X9  // Generate a predicate in P0 that, starting from the lowest numbered element, is true while the incrementing value of the first unsigned scalar operand X8 is lower than the second scalar operand X9, and false thereafter, up to the highest numbered element.
  • Vector partitioning for software-managed speculation

    SVE relaxes the vectorization restrictions that Neon places on speculative loads. SVE introduces first-fault vector load instructions, such as LDFF1B and LDFF1D, and the First Fault Register (FFR) to allow vector accesses to cross into invalid pages. For example:

    LDFF1D Z0.D, P0/Z, [Z1.D, #0]  // Gather load with first-faulting behaviour of doublewords to active elements of Z0 from memory addresses generated by the vector base Z1 plus 0. Inactive elements do not read Device memory or signal faults, and are set to zero in the destination vector. Elements that load successfully from valid memory leave their FFR bits set to true; the first element that faults, and all of the elements after it, have their FFR bits cleared to false.
    RDFFR P0.B // Read the first-fault register (FFR) and place in the destination predicate without predication.
  • Extended floating-point and bitwise horizontal reductions

    Floating-point sums can be performed in order or as a tree, trading repeatability against performance. For example:

    FADDP  Z0.S, P0/M, Z0.S, Z1.S  // Add pairs of adjacent floating-point elements within each source vector Z0 and Z1, and interleave the results from the two sources. The result values are destructively placed in the first source vector, Z0.
  • New features in SVE2

    • To achieve scalable performance, SVE2 builds on the SVE foundations, allowing vector implementations of up to 2048 bits.

    • Adds translations from Neon into SVE2: SVE2 provides scalable equivalents of most Neon instructions (a brief C intrinsics sketch follows this list), including:

      • Transformed Neon fixed-point operations, such as SABA (Signed absolute difference and accumulate) and SHADD (Signed halving addition).

      • Transformed Neon widening, narrowing, and pairwise operations, such as UADDLB (Unsigned add long - bottom) and UADDLT (Unsigned add long - top).

        Note that the element processing order changes: for widening and narrowing operations, SVE2 operates on interleaved even (bottom) and odd (top) elements, whereas Neon operates on the low and high halves of the vector.

      • Fixed-point complex arithmetic, for example: CMLA (Complex integer multiply-add with rotate).

      • Multi-precision arithmetic for large integer arithmetic and cryptography, for example: ADCLB (Add with carry long - bottom), ADCLT (Add with carry long - top), SM4E (SM4 encryption and decryption).

    • For backwards compatibility, Neon and VFP are still required in the latest architectures. Although SVE2 covers the functionality of SVE and Neon, it does not exclude Neon from the chip.
    • Optimizes for emerging applications beyond HPC: optimizations are provided for ML (for example UDOT), computer vision (for example TBL and TBX), baseband networking (for example CADD and CMLA), genomics (for example BDEP and BEXT), and server workloads (for example MATCH and NMATCH).

    • SVE2 enhances the performance in a general-purpose processor, without additional accelerators.
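
    As an illustration of these Neon-style operations in scalable form, the following hedged C sketch uses two of the SVE2 ACLE intrinsics that correspond to instructions named above (the function names are invented; building it requires an SVE2-enabled target, for example -march=armv8-a+sve2):

      #include <arm_sve.h>

      // SABA through its intrinsic: accumulate the lane-wise absolute
      // difference of a and b onto acc.
      svuint8_t abs_diff_accumulate(svuint8_t acc, svuint8_t a, svuint8_t b)
      {
          return svaba(acc, a, b);
      }

      // UADDLB through its intrinsic: widening add of the even (bottom)
      // 16-bit elements of a and b, producing 32-bit results.
      svuint32_t widen_add_bottom(svuint16_t a, svuint16_t b)
      {
          return svaddlb(a, b);
      }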

    Program with SVE2

    Software and libraries support

    To build an SVE or SVE2 program, you must choose a compiler that supports SVE and SVE2 features. GNU tools versions 8.0+ support SVE. Arm Compiler for Linux versions 18.0+ support SVE and versions 20.0+ support SVE and SVE2. Both compilers support optimizing C/C++/Fortran code.

    Arm Performance Libraries provide highly optimized math routines that you can link into your application. Arm Performance Libraries versions 19.3+ provide SVE-optimized math libraries.

    Arm Compiler for Linux (part of Arm Allinea Studio) consists of the Arm C/C++ Compiler, Arm Fortran Compiler, and Arm Performance Libraries.

    How to program for SVE2

    There are a few ways to write or generate SVE and SVE2 code: write assembly with SVE and SVE2 instructions, use intrinsics in C/C++/Fortran applications, let compilers auto-vectorize your code, and utilize the SVE-optimized libraries:

    • Write assembly code: you can write assembly files using SVE instructions, or use inline assembly in GNU style. For example:

              .globl  subtract_arrays         // -- Begin function 
              .p2align        2 
              .type   subtract_arrays,@function 
      subtract_arrays:               // @subtract_arrays 
              .cfi_startproc 
      // %bb.0: 
              orr     w9, wzr, #0x400 
              mov     x8, xzr 
              whilelo p0.s, xzr, x9 
      .LBB0_1:                       // =>This Inner Loop Header: Depth=1 
              ld1w    { z0.s }, p0/z, [x1, x8, lsl #2] 
              ld1w    { z1.s }, p0/z, [x2, x8, lsl #2] 
              sub     z0.s, z0.s, z1.s 
              st1w    { z0.s }, p0, [x0, x8, lsl #2] 
              incw    x8 
              whilelo p0.s, x8, x9 
              b.mi    .LBB0_1 
      // %bb.2: 
              ret 
      .Lfunc_end0: 
              .size   subtract_arrays, .Lfunc_end0-subtract_arrays 
              .cfi_endproc

      To program in assembly, you need to know the ABI (Application Binary Interface) standard updates for SVE and SVE2. Of these, the AAPCS (Procedure Call Standard for the Arm Architecture) is most relevant to programming in assembly, because it specifies the data types and register allocations. The AAPCS requires that:

      • Z0-Z7, P0-P3 are used for parameter and results passing
      • Z8-Z15, P4-P15 are callee-saved registers
      • Z16-Z31 are the corruptible registers

    • Use instruction functions (intrinsics): you can call functions directly from high-level languages such as C, C++, or Fortran that map onto the corresponding SVE instructions. These instruction functions, usually referred to as intrinsics, are detailed in the ACLE (Arm C Language Extensions) for SVE and are replaced with the corresponding instructions during compilation. The ACLE for SVE document also includes the full list of instruction functions for SVE2.

      For example, take the following code:

      //intrinsic_example.c
      #include <arm_sve.h>
      svuint64_t uaddlb_array(svuint32_t Zs1, svuint32_t Zs2)
      {
               // widening add of even elements
          svuint64_t result = svaddlb(Zs1, Zs2);
          return result;
      }

      Compile it using Arm C/C++ Compiler:

      armclang -O3 -S -march=armv8-a+sve2 -o intrinsic_example.s intrinsic_example.c

      This generates the following assembly:

      //intrinsic_example.s
      uaddlb_array:                           // @uaddlb_array
              .cfi_startproc
      // %bb.0:
              uaddlb  z0.d, z0.s, z1.s
              ret

      Note: Arm Compiler for Linux 20.0 was used for this example.

    • Auto-vectorization: C/C++/Fortran compilers, such as Arm Compiler for Linux and the GNU compilers for Arm platforms, can generate SVE and SVE2 code from C/C++/Fortran loops. To generate SVE or SVE2 code, you need to select the appropriate compiler options. For example, with armclang, one option that enables SVE2 optimizations is -march=armv8-a+sve2 (coupled with -armpl=sve, if you want to use the SVE version of the libraries). For more information about all of the supported options that enable SVE2 features, see the Arm C/C++ Compiler or Arm Fortran Compiler reference guide. A plain C loop that a compiler can vectorize in this way is sketched after this list.

    • Use libraries optimized for SVE and SVE2: highly optimized SVE libraries are already available, such as Arm Performance Libraries and Arm Compute Library. Arm Performance Libraries contain highly optimized implementations of BLAS, LAPACK, FFT, and math routines. You must install Arm Allinea Studio and include armpl.h in your code to be able to link to the ArmPL functions. To build the application with ArmPL using Arm Compiler for Linux, specify -armpl=<arg> on the command line. If you use the GNU tools, you need to include the ArmPL installation path on the command line. For more information, see the Arm Performance Libraries Get Started guide.
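
    As a simple illustration of auto-vectorization, the following plain C loop (illustrative; it uses no intrinsics or pragmas) is the kind of code that an SVE2-capable compiler can turn into a predicated, vector-length-agnostic loop when it is built with options such as -O3 -march=armv8-a+sve2:

      // saxpy_example.c
      // y[i] = a * x[i] + y[i]: a compiler targeting SVE2 can vectorize this loop
      // with predicated loads and stores and WHILELO-style loop control, so no
      // scalar tail loop is needed.
      void saxpy(float *restrict y, const float *restrict x, float a, long n)
      {
          for (long i = 0; i < n; i++) {
              y[i] = a * x[i] + y[i];
          }
      }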

    How to run SVE and SVE2 programs: hardware and models

    If SVE- or SVE2-enabled hardware is not available to you, you can use models and emulators to develop your code ahead of the hardware becoming available. There are a few models and emulators to choose from:

    • QEMU: Cross and native models, supporting modelling Arm AArch64 platforms with SVE.
    • Fast Models: Cross platform models, supporting modelling Arm AArch64 platforms with SVE (AEM with SVE2 support is available for lead partners).
    • ArmIE (Arm Instruction Emulator): runs directly on AArch64 platforms. Supports SVE, and versions 19.2+ also support SVE2.
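
    For example, typical invocations look like the following (illustrative only; the exact option names are assumptions here, so check the documentation for the tool version that you use):

      # Run under QEMU user-mode emulation, capping the vector length at 512 bits (assumed option)
      qemu-aarch64 -cpu max,sve-max-vq=4 ./myprogram

      # Run under ArmIE with a 256-bit vector length (assumed option)
      armie -msve-vector-bits=256 ./myprogram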

    How to port applications to SVE or SVE2

    For more information about porting your code to Arm or to Arm SVE-enabled hardware, see the HPC application porting guides.

    Check your knowledge