Multipass rendering

Multipass rendering is an important feature of Vulkan which enables applications to exploit the full power of tile-based architectures using the standard API.


You must understand the following concepts:

  • Anti-aliasing.
  • Vulkan APIs.
  • Using late-zs testing.
  • Render passes and subpasses.

Enabling powerful algorithms

Mali GPUs can take color attachments and depth attachments from one subpass, and use them as input attachments in a later subpass without going via main memory. This enables powerful algorithms, such as deferred shading or programmable blending, to be used efficiently. However, a few things must be set up correctly.

Per-pixel storage requirements

Most Mali GPUs are designed to render 16x16 pixel tiles with 128 bits per pixel of tile buffer color storage. Some GPUs, such as Mali-G72, increase this to up to 256 bits per pixel.

G-buffers, which require more color storage, can be used at the expense of requiring smaller tiles during fragment shading, which can reduce performance.

For example, a sensible G-buffer layout that fits neatly into a 128-bit budget could be:

  • Light: B10G11R11_UFLOAT
  • Albedo: RGBA8_UNORM
  • Normal: RGB10A2_UNORM
  • PBR material parameters/misc: RGBA8_UNORM
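The arithmetic behind this budget can be checked with a short C sketch. The `GBufferTarget` struct and the helper names are hypothetical, introduced only for illustration; the sizes follow from the channel widths of the four formats listed above, each of which packs into 32 bits per pixel.

```c
#include <stddef.h>

/* Hypothetical description of one G-buffer render target (illustration only). */
typedef struct {
    const char *name;
    const char *format;
    unsigned bits_per_pixel;
} GBufferTarget;

/* The example layout from the text: four formats of 32 bits each. */
static const GBufferTarget kExampleLayout[] = {
    { "Light",  "B10G11R11_UFLOAT", 32 },  /* 10 + 11 + 11 */
    { "Albedo", "RGBA8_UNORM",      32 },  /* 4 x 8 */
    { "Normal", "RGB10A2_UNORM",    32 },  /* 3 x 10 + 2 */
    { "PBR",    "RGBA8_UNORM",      32 },  /* 4 x 8 */
};

/* Sum the per-pixel color storage needed by all targets. */
static unsigned gbuffer_bits_per_pixel(const GBufferTarget *targets, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += targets[i].bits_per_pixel;
    }
    return total;
}

/* Returns nonzero when the layout fits in the given tile buffer budget. */
static int fits_tile_budget(unsigned bits, unsigned budget_bits)
{
    return bits <= budget_bits;
}
```

Adding a fifth 32-bit target would bring the total to 160 bits, which exceeds a 128-bit budget and would require smaller tiles during fragment shading.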

Image layouts

Multipass rendering is one of the few cases where image layout matters because it impacts the optimizations which the driver enables.

Here is a sample multipass layout which hits all the good paths:

Initial layouts

  • Light: UNDEFINED
  • Albedo: UNDEFINED
  • Normal: UNDEFINED
  • Depth: UNDEFINED

G-buffer pass (subpass #0) output attachments

To boost performance, make the eventual output, in this case the light attachment, occupy the first render target slot in the hardware. To do this, make the light attachment attachment #0 in the VkRenderPass.

So that opaque materials can write out their emissive parameters, light is included as an output of the G-buffer pass in this example. Writing the extra render target costs no additional bandwidth because the subpasses are merged.

Unlike a desktop GPU, there is no need to invent schemes to forward emissive light contributions through the other G-buffer attachments.

Lighting pass (subpass #1) input attachments

From the point that any pass starts to read from the tile buffer, optimize multipass performance by marking every depth or stencil attachment as read-only.

DEPTH_STENCIL_READ_ONLY is designed for this read-only depth or stencil testing. It can be used concurrently as an input attachment to the shader program for programmatic access to depth values.
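As a rough sketch of how this might look in the Vulkan C API, the lighting subpass can reference the depth attachment twice with the same read-only layout: once for depth testing and once as an input attachment. The attachment indices for albedo, normal, and depth, the names `kLightingInputs` and `make_lighting_subpass`, and the use of SHADER_READ_ONLY_OPTIMAL for the albedo and normal inputs are assumptions made for this example. Only the light attachment at index 0 and the read-only depth layout come from the text above.

```c
#include <vulkan/vulkan.h>

/* Assumed attachment order; the text only requires light to be attachment #0. */
enum { ATT_LIGHT = 0, ATT_ALBEDO = 1, ATT_NORMAL = 2, ATT_DEPTH = 3 };

/* Input attachments read by the lighting shader in subpass #1. */
static const VkAttachmentReference kLightingInputs[] = {
    { ATT_ALBEDO, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
    { ATT_NORMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
    { ATT_DEPTH,  VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL },
};

/* Lighting is blended into the light attachment. */
static const VkAttachmentReference kLightingColor = {
    ATT_LIGHT, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
};

/* The same depth image is bound for read-only depth testing. The layout
 * matches the input attachment reference above, as Vulkan requires when
 * one attachment is referenced more than once in a subpass. */
static const VkAttachmentReference kLightingDepth = {
    ATT_DEPTH, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL
};

static VkSubpassDescription make_lighting_subpass(void)
{
    VkSubpassDescription subpass = {0};
    subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpass.inputAttachmentCount = 3;
    subpass.pInputAttachments = kLightingInputs;
    subpass.colorAttachmentCount = 1;
    subpass.pColorAttachments = &kLightingColor;
    subpass.pDepthStencilAttachment = &kLightingDepth;
    return subpass;
}
```

Pipelines used in this subpass would also need depth writes disabled, because the attachment is in a read-only layout.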

Lighting pass (subpass #1) output attachments

Lighting that is computed during subpass #1 is blended on top of the pre-computed emissive data from subpass #0. If needed, the application can also blend transparent objects after the lighting passes have completed.

Subpass dependencies

Express dependencies between the subpasses with a VkSubpassDependency that sets the DEPENDENCY_BY_REGION_BIT flag. This flag tells the driver that each subpass depends only on the results of the previous subpasses at the same pixel coordinate.

For the example above, the subpass dependency setup would look like:

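VkSubpassDependency subpassDependency = {};
subpassDependency.srcSubpass = 0;  // G-buffer pass
subpassDependency.dstSubpass = 1;  // Lighting pass
// Subpass #0 writes color and depth, so wait on color output and depth tests.
subpassDependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT |
                                 VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                                 VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
// Subpass #1 reads input attachments, blends color, and tests depth.
subpassDependency.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT |
                                 VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT |
                                 VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                                 VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
subpassDependency.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                                  VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
subpassDependency.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT |
                                  VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                                  VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                                  VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT;
// Per-pixel dependency, which lets the subpasses stay in the tile buffer.
subpassDependency.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;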

Subpass merge considerations

The driver merges subpasses if the following conditions are met:

  • The color attachment data formats can be merged.
  • Merging can save a write-out or read-back. Two unrelated subpasses which do not share any data do not benefit from multipass and are not merged.
  • The number of unique VkAttachments used for input and color attachments in all considered subpasses is less than nine. Depth and stencil attachments do not count towards this limit.
  • The depth or stencil attachment does not change between subpasses.
  • Multisample counts are the same for all attachments.
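The conditions above can be summarized as a simple predicate. This is only an illustrative model of the documented rules, not the driver's actual logic. The `MergeCandidate` struct and `may_be_merged` function are hypothetical, and whether two color formats "can be merged" is treated as an input because the text does not define it.

```c
#include <stdbool.h>

/* Hypothetical summary of a group of subpasses considered for merging. */
typedef struct {
    bool color_formats_mergeable;        /* driver-specific; an input here */
    bool shares_attachment_data;         /* merging saves a write-out or read-back */
    unsigned unique_input_and_color_attachments; /* depth/stencil excluded */
    bool same_depth_stencil_attachment;  /* depth/stencil does not change */
    bool same_sample_count;              /* all attachments share one count */
} MergeCandidate;

/* "Less than nine" unique input and color attachments. */
enum { MAX_UNIQUE_ATTACHMENTS = 8 };

/* Returns true only when every documented condition holds. */
static bool may_be_merged(const MergeCandidate *c)
{
    return c->color_formats_mergeable
        && c->shares_attachment_data
        && c->unique_input_and_color_attachments <= MAX_UNIQUE_ATTACHMENTS
        && c->same_depth_stencil_attachment
        && c->same_sample_count;
}
```

For the G-buffer example in this section, the light, albedo, and normal attachments give three unique input and color attachments, well under the limit.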

How to optimize the use of multipass rendering with Vulkan

Try using the following optimization steps:

  • Use multipass.
  • Use a 128-bit G-buffer budget for color.
  • Use by-region dependencies between subpasses.
  • Use DEPTH_STENCIL_READ_ONLY image layout for depth after the G-buffer pass is done.
  • Use LAZILY_ALLOCATED memory to back images for every attachment except for the light buffer, which is the only texture that is written out to memory.
  • Follow the basic render pass best practices, with LOAD_OP_CLEAR or LOAD_OP_DONT_CARE for attachment loads and STORE_OP_DONT_CARE for transient stores.
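The last two steps might be sketched in the Vulkan C API as follows. The helper names `transient_attachment` and `find_lazy_memory_type` are hypothetical, and the single-sample count and discarded stencil contents are assumptions made for this example. Images backed by lazily allocated memory must also be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Describe a transient G-buffer attachment: cleared on load, never stored.
 * Assumes one sample per pixel and that stencil contents are not needed. */
static VkAttachmentDescription transient_attachment(VkFormat format,
                                                    VkImageLayout final_layout)
{
    VkAttachmentDescription desc = {0};
    desc.format = format;
    desc.samples = VK_SAMPLE_COUNT_1_BIT;
    desc.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
    desc.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    desc.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
    desc.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    desc.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    desc.finalLayout = final_layout;
    return desc;
}

/* Find a memory type that is allowed for the image (memory_type_bits comes
 * from VkMemoryRequirements) and has the LAZILY_ALLOCATED property.
 * Returns UINT32_MAX when no such type exists. */
static uint32_t find_lazy_memory_type(const VkPhysicalDeviceMemoryProperties *props,
                                      uint32_t memory_type_bits)
{
    for (uint32_t i = 0; i < props->memoryTypeCount; ++i) {
        const VkMemoryPropertyFlags flags = props->memoryTypes[i].propertyFlags;
        if ((memory_type_bits & (1u << i)) &&
            (flags & VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT)) {
            return i;
        }
    }
    return UINT32_MAX;
}
```

If no lazily allocated memory type is available, an application would typically fall back to a device-local memory type. The light buffer is different from the other attachments because it is written out, so it would use STORE_OP_STORE and regular memory.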

Multipass rendering steps to avoid

Do not store G-buffer data out to memory. Only the final color output needs to be written.

The negative impact of implementing multipass rendering incorrectly

Using multipass incorrectly forces the driver to use multiple physical passes and to send intermediate image data back to main memory between them, which loses all of the benefits of multipass rendering.

How to debug multipass rendering issues more effectively

Here are a couple of steps that you can try when debugging issues:

  • To determine if passes are being merged, refer to the GPU performance counters for information about the number of physical tiles that are rendered.
  • The GPU performance counters also provide information about the number of fragment threads using late-zs testing. A high late-zs count can indicate that your application is not using DEPTH_STENCIL_READ_ONLY correctly.
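One way to interpret the physical tile count is to compare it with the number of tiles a single pass over the framebuffer should produce. The sketch below assumes 16x16 pixel tiles and a counter value captured for one render pass. The function names are hypothetical, and the counter itself must be read with your profiling tool. If the G-buffer exceeds the tile buffer budget, the hardware uses smaller tiles and the estimate is inflated accordingly.

```c
#include <stdint.h>

/* Assumed tile edge length in pixels (16x16 tiles, as described earlier). */
enum { TILE_SIZE = 16 };

/* Number of tiles needed to cover the framebuffer once. */
static uint64_t tiles_per_pass(uint32_t width, uint32_t height)
{
    uint64_t tiles_x = ((uint64_t)width + TILE_SIZE - 1) / TILE_SIZE;
    uint64_t tiles_y = ((uint64_t)height + TILE_SIZE - 1) / TILE_SIZE;
    return tiles_x * tiles_y;
}

/* Estimate how many physical passes produced the observed tile count,
 * rounded to the nearest whole pass. Returns 0 for an empty framebuffer. */
static uint64_t estimated_physical_passes(uint64_t physical_tiles,
                                          uint32_t width, uint32_t height)
{
    uint64_t per_pass = tiles_per_pass(width, height);
    if (per_pass == 0) {
        return 0;
    }
    return (physical_tiles + per_pass / 2) / per_pass;
}
```

For a 1920x1080 framebuffer, one pass covers 120 x 68 = 8160 tiles. A two-subpass render pass that reports about 16320 physical tiles has therefore most likely not been merged.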