The shader core instruction cache affects performance but is often overlooked. Because so many threads run concurrently on a shader core, how well your shader programs fit in this cache has a significant effect on performance.
You must understand the following concepts:
- Instruction caches.
- Early-zs testing.
How to optimize the use of instruction caches
Try using the following optimizations:
- Use shorter shaders with many threads over long shaders with few threads. A shorter program is more likely to be hot in the cache.
- Use shaders that do not have control-flow divergence. Divergence can reduce temporal locality and increase cache pressure.
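The effect of both recommendations can be illustrated with a toy model. The sketch below is not a model of any real Mali instruction cache: the LRU replacement policy, the cache size of 16 lines, the program lengths, and the function name `hit_rate` are all assumptions chosen for illustration. Each thread fetches one instruction line per step, and the thread offsets stand in for divergence: threads with equal offsets execute the same part of the program together, while threads with different offsets execute different parts.

```python
from collections import OrderedDict


def hit_rate(program_lines, cache_lines, thread_offsets, steps):
    """Return the fraction of instruction fetches that hit a toy LRU cache."""
    cache = OrderedDict()  # keys are cached line indices, least recently used first
    hits = 0
    fetches = 0
    for step in range(steps):
        for offset in thread_offsets:
            # Each thread walks the program cyclically from its own offset.
            line = (offset + step) % program_lines
            fetches += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)  # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
    return hits / fetches


# Short program that fits in the cache: only cold misses occur.
short = hit_rate(program_lines=8, cache_lines=16, thread_offsets=[0, 0, 0, 0], steps=1000)

# Long program, threads in step: one miss per line is shared by four threads.
long_coherent = hit_rate(program_lines=64, cache_lines=16, thread_offsets=[0, 0, 0, 0], steps=1000)

# Long program, threads spread across the code: lines are evicted before reuse.
long_divergent = hit_rate(program_lines=64, cache_lines=16, thread_offsets=[0, 16, 32, 48], steps=1000)

print(short, long_coherent, long_divergent)
```

In this model the short program hits almost every time, the long coherent program hits less often, and the long divergent program does worst. The absolute numbers mean nothing for real hardware; only the ordering reflects the recommendations above.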
Things to avoid when optimizing your use of instruction caches
Arm recommends that you:
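- Do not unroll loops too aggressively. Some unrolling can help, but each unrolled iteration increases the size of the program and therefore the pressure on the instruction cache.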
- Do not generate duplicate shader programs or pipeline binaries from identical source code.
- Beware of fragment shading with many visible layers in a tile. The shaders for all layers that are not killed by early-zs testing or Forward Pixel Killing (FPK) must be loaded and executed, which increases cache pressure. For more information about FPK, see https://community.arm.com/developer/tools-software/graphics/b/blog/posts/killing-pixels---a-new-optimization-for-shading-on-arm-mali-gpus
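One way to avoid duplicate programs is to key compiled programs on a hash of their source, so that identical source always maps to a single program object. The following is a minimal sketch under stated assumptions: `ShaderProgramCache` and `compile_fn` are hypothetical names, and `compile_fn` stands in for whatever call your graphics API or engine uses to build a program.

```python
import hashlib


class ShaderProgramCache:
    """Hand out one shared program object per unique (stage, source) pair."""

    def __init__(self, compile_fn):
        # compile_fn is a hypothetical callable wrapping the real compile step.
        self._compile_fn = compile_fn
        self._programs = {}
        self.compile_count = 0

    def get(self, stage, source):
        # Hash the stage and source together so identical text maps to one key.
        digest = hashlib.sha256()
        digest.update(stage.encode("utf-8"))
        digest.update(b"\0")  # separator so stage and source cannot run together
        digest.update(source.encode("utf-8"))
        key = digest.hexdigest()
        if key not in self._programs:
            self._programs[key] = self._compile_fn(stage, source)
            self.compile_count += 1
        return self._programs[key]


# Usage with a stand-in compiler that only returns a placeholder object.
cache = ShaderProgramCache(lambda stage, source: object())
first = cache.get("fragment", "void main() {}")
second = cache.get("fragment", "void main() {}")
other = cache.get("fragment", "void main() { discard; }")
```

Here `first` and `second` are the same object, so only two programs are built for the three requests. A real implementation would also need to include any compile options or specialization state in the key.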
How to debug instruction cache-related performance issues
Try the following debugging steps:
- Use the Mali Offline Compiler to statically determine the size of the programs that are generated for a given Mali GPU. See https://developer.arm.com/tools-and-software/graphics-and-gaming/mali-offline-compiler
- Use the Arm Mobile Studio tool suite to step through draw calls and visualize how many transparent layers build up in your render passes. See https://www.arm.com/products/development-tools/graphics/arm-mobile-studio
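The offline compiler step can be scripted so that it runs over every shader in a project. The sketch below assumes the compiler is installed as `malioc` on your PATH and that it accepts a `--core` option to select the target GPU; check both against `malioc --help` for your installed version. The function names and the default core name are illustrative only, and the report text is returned unparsed because its layout depends on the tool version.

```python
import shutil
import subprocess


def build_malioc_command(shader_path, core="Mali-G78"):
    """Build the argument list for one offline compile (option name assumed)."""
    return ["malioc", "--core", core, shader_path]


def report_shader(shader_path, core="Mali-G78"):
    """Return the compiler's text report, or None if malioc is not on PATH."""
    if shutil.which("malioc") is None:
        return None
    result = subprocess.run(
        build_malioc_command(shader_path, core),
        capture_output=True,  # collect the report instead of printing it
        text=True,
        check=False,  # a failed compile still returns its output for inspection
    )
    return result.stdout
```

Calling `report_shader("shader.frag")` for each shader and comparing the reports over time is one way to notice when a change makes a program grow.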