Vectorize memory access
The Mali GPU shader core load-store data cache has a wide data path capable of returning multiple values in a single clock cycle. It is important to make vector data accesses to get the highest access bandwidth from the data caches. Shader programs expose direct access to the underlying memory.
You must understand the following concepts:
- The load-store data cache.
- Vector data accesses.
- Thread quads.
How to optimize the use of vectorized memory accesses on Mali GPUs
Try using the following optimizations:
- For memory accesses with a single thread, use vector data types.
- Bifrost GPUs can run four neighboring threads in lockstep, which is known as a quad. You must access overlapping or sequential memory ranges across the four threads to allow load merging.
Things to avoid when optimizing your use of vectorized memory accesses on Mali GPUs
Arm recommends that you:
- Do not use scalar loads if vector loads are possible.
- Do not access divergent addresses across a thread quad where possible.
The negative impact of not using vectorized memory accesses correctly on Mali GPUs
Many types of common compute programs perform relatively light arithmetic on large data sets. Getting memory access correct for them can have a significant impact on performance.