Introduction to Vulkan Render Passes

Vulkan graphics rendering is organized into render passes and subpasses. This article provides an introduction to these concepts and how to use them in the Vulkan API.

Render Passes

When a GPU renders a scene, it is configured with one or more render targets, or framebuffer attachments in Khronos terminology. The size and format of the attachments determine how graphics work is configured across the parallelism available on all modern GPUs. For example, on a tile-based renderer, the set of attachments is used to determine the way the image is divided into tiles. In Vulkan, a render pass is the set of attachments, the way they are used, and the rendering work that is performed using them. In a traditional API, a change to a new render pass might correspond to binding a new framebuffer.

Subpasses

During normal rendering, it is not possible for a fragment shader to access the attachments to which it is currently rendering: GPUs have optimized hardware for writing to the attachments, and accessing the attachment interferes with this. However, some common rendering techniques such as deferred shading rely on being able to access the result of previous rendering during shading. For a tile-based renderer, the results of previous rendering can efficiently stay on-chip if subsequent rendering operations are at the same resolution, and if only the data in the pixel currently being rendered is needed (accessing different pixels would require access to values outside the current tile, which breaks this optimization). The solution in Vulkan is to split the rendering operations into subpasses, which share the same resolution and tile arrangement, and which can access the results of previous subpasses.

In Vulkan, a render pass consists of one or more subpasses; for simple rendering operations, there may be only a single subpass in a render pass.

Creating a VkRenderPass

In Vulkan, a render pass is described by an opaque VkRenderPass object. This provides a template that is used when beginning a render pass inside a command buffer. The render pass is used with a compatible VkFramebuffer object that describes the specific images which will be used during execution of the render pass.

vkCreateRenderPass

Like many driver objects in Vulkan, a VkRenderPass object is created with a corresponding create function, vkCreateRenderPass():

VkResult vkCreateRenderPass()
VkDevice device Logical device used for rendering (from vkCreateDevice)
const VkRenderPassCreateInfo* pCreateInfo Parameters for creation
const VkAllocationCallbacks* pAllocator Host memory allocation callback (can be NULL)
VkRenderPass* pRenderPass Resulting render pass handle

As with many Vulkan creation functions, most parameters are passed through a creation structure. This approach makes it more efficient to create multiple identical objects, and provides a way to support type-safe additional parameters through extensions.

Many creation functions in Vulkan offer a callback for applications which wish to track host-side memory usage. While important for applications that wish to have precise control over resource allocation, and useful for debugging, in most cases this callback can be left as NULL to rely on the driver's default memory allocation scheme.

As with other Vulkan creation functions, the function returns an error code if anything goes wrong - although more information may be available through validation layers if the problem is an application error. The newly-created render pass description is returned via the pRenderPass pointer.

The interesting parameters are contained in the pCreateInfo structure.

VkRenderPassCreateInfo

struct VkRenderPassCreateInfo
VkStructureType sType Used for type safety and extensions, must be VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO
const void* pNext Allows extensions to provide extra parameters, must be NULL if not needed by an extension
VkRenderPassCreateFlags flags Reserved for future use (should be 0)
uint32_t attachmentCount Number of framebuffer attachments used in this render pass (across all subpasses)
const VkAttachmentDescription* pAttachments Description of the attachments (array of size attachmentCount)
uint32_t subpassCount Number of subpasses
const VkSubpassDescription* pSubpasses Description of subpasses (array of size subpassCount)
uint32_t dependencyCount Number of dependencies between subpass pairs
const VkSubpassDependency* pDependencies Descriptions of dependencies between subpasses (array of size dependencyCount)

For the purposes of this article, we will begin with a simple rendering operation with only a single subpass (a render pass always consists of at least one subpass). In this case, subpassCount can be 1 and dependencyCount can be 0 (so pDependencies can be NULL - we'll come back to describe how else dependencies are used below).

VkAttachmentDescription

An attachment corresponds to a single Vulkan VkImageView. A description of the attachment is provided to the render pass creation, which allows the render pass to be configured appropriately; the actual images to be used are provided when the render pass is used, via the VkFramebuffer. It is possible to associate multiple attachments with a render pass; these may be used for example as multiple render targets, or in separate subpasses. More commonly, a color framebuffer and a depth buffer are separate attachments in Vulkan. Therefore the pAttachments member of VkRenderPassCreateInfo points to an array of attachmentCount elements.

struct VkAttachmentDescription
VkAttachmentDescriptionFlags flags Used by extensions; can be 0
VkFormat format Image format of the attachment
VkSampleCountFlagBits samples Number of samples in the attachment (used for multi-sampling)
VkAttachmentLoadOp loadOp What should be done to access the attachment before rendering
VkAttachmentStoreOp storeOp What should be done with the attachment after rendering
VkAttachmentLoadOp stencilLoadOp In the case of a depth/stencil attachment, how to access the stencil contents before rendering
VkAttachmentStoreOp stencilStoreOp In the case of a depth/stencil attachment, what should be done with the stencil after rendering
VkImageLayout initialLayout Layout of the attachment when first used in the render pass
VkImageLayout finalLayout Layout of the attachment after use in the render pass

For a simple rendering operation, we might decide to create two attachments:

Field            Color attachment (pAttachments[0])          Depth attachment (pAttachments[1])
flags            0                                           0
format           VK_FORMAT_B8G8R8A8_UNORM                    VK_FORMAT_D16_UNORM
samples          1                                           1
loadOp           VK_ATTACHMENT_LOAD_OP_DONT_CARE             VK_ATTACHMENT_LOAD_OP_CLEAR
storeOp          VK_ATTACHMENT_STORE_OP_STORE                VK_ATTACHMENT_STORE_OP_DONT_CARE
stencilLoadOp    VK_ATTACHMENT_LOAD_OP_DONT_CARE             VK_ATTACHMENT_LOAD_OP_DONT_CARE
stencilStoreOp   VK_ATTACHMENT_STORE_OP_DONT_CARE            VK_ATTACHMENT_STORE_OP_DONT_CARE
initialLayout    VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL    VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
finalLayout      VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL    VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL

Stencil is special because the combined depth/stencil attachment is a single attachment. Here, we aren't using stencil, so the stencilLoadOp and stencilStoreOp are irrelevant. Note that the "DONT_CARE" ops do not guarantee that memory is left untouched: a "DONT_CARE" store op may not access memory on a tile-based renderer, but an immediate-mode renderer may actually use memory to implement rendering, so the attachment contents are simply undefined afterwards. Similarly, a "DONT_CARE" load op avoids the need to read the previous framebuffer contents in a tiler, but also avoids the need to perform an explicit clear of the memory, which may be costly for an immediate-mode renderer.

Note: We're assuming that the images have been transitioned from VK_IMAGE_LAYOUT_UNDEFINED (on creation) to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL and VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL before we use them, for example by using a VkImageMemoryBarrier.

There is a complication to this mechanism to be aware of: Consider the example of drawing a scene with two render passes, the second of which uses the results of the first (written with STORE_OP_STORE) as a (LOAD_OP_LOAD) input attachment but does not write to it. If this input attachment is still wanted after the second render pass, it must still have STORE_OP_STORE associated with it: using STORE_OP_DONT_CARE may cause some hardware to perform an optimization and discard the attachment content after the second render pass, even though the first render pass used STORE_OP_STORE. You may think of this as a cache discard of the output of the first render pass, where the cache line was previously considered to be valid. This is potentially a good performance enhancement, but it does mean that users need to be prepared for surprising behavior!

VkSubpassDescription

struct VkSubpassDescription
VkSubpassDescriptionFlags flags Reserved for future use, must be 0
VkPipelineBindPoint pipelineBindPoint Should be VK_PIPELINE_BIND_POINT_GRAPHICS
uint32_t inputAttachmentCount Number of input attachments to this subpass
const VkAttachmentReference* pInputAttachments Array of input attachments read by this subpass (array of size inputAttachmentCount)
uint32_t colorAttachmentCount Number of output attachments for this subpass
const VkAttachmentReference* pColorAttachments Array of color attachments written to by this subpass (array of size colorAttachmentCount)
const VkAttachmentReference* pResolveAttachments Attachments for antialiasing (NULL or array of size colorAttachmentCount)
const VkAttachmentReference* pDepthStencilAttachment One attachment reference describing the depth/stencil attachment
uint32_t preserveAttachmentCount Number of attachments preserved across this subpass
const uint32_t* pPreserveAttachments Array of attachment indices preserved across this subpass, of size preserveAttachmentCount, or NULL

In our first example, we only have a single subpass, and we'll render to it directly. We won't use pResolveAttachments (so we can set it to NULL) and we do not need to preserve any attachments (so preserveAttachmentCount can be 0 and pPreserveAttachments can be NULL). The fields we don't need now will be described in more detail below, but in our simple case we can configure the (single) subpass. Before we get there, we have one more level of Vulkan object to worry about:

VkAttachmentReference

struct VkAttachmentReference
uint32_t attachment Id of the attachment (index into VkRenderPassCreateInfo::pAttachments)
VkImageLayout layout Layout of the attachment during the subpass

In the example we're walking through, we have two attachments in total:

pColorAttachments[0] pDepthStencilAttachment
attachment 0 1
layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL

The layout can change between subpasses of a render pass, hence the need to describe it on a per-subpass basis.

Example render pass (complete create info)

In summary, in the simple render pass we've been using as an example, we have the following two attachments:

Attachment                                        At start of render pass                                              At end of render pass
Color attachment (VK_FORMAT_B8G8R8A8_UNORM)       VK_ATTACHMENT_LOAD_OP_DONT_CARE (All pixels will be overwritten)     VK_ATTACHMENT_STORE_OP_STORE (Write pixels to memory after rendering)
Depth(/stencil) attachment (VK_FORMAT_D16_UNORM)  VK_ATTACHMENT_LOAD_OP_CLEAR (Don't want previous depth values)       VK_ATTACHMENT_STORE_OP_DONT_CARE (Don't need depth after rendering)

In total, then, our simple render pass looks like this:

*pCreateInfo
  sType VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO
  pNext NULL
  flags 0
  attachmentCount 2
  pAttachments
    pAttachments[0]
      flags 0
      format VK_FORMAT_B8G8R8A8_UNORM
      samples 1
      loadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE
      storeOp VK_ATTACHMENT_STORE_OP_STORE
      stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE
      stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE
      initialLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
      finalLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    pAttachments[1]
      flags 0
      format VK_FORMAT_D16_UNORM
      samples 1
      loadOp VK_ATTACHMENT_LOAD_OP_CLEAR
      storeOp VK_ATTACHMENT_STORE_OP_DONT_CARE
      stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE
      stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE
      initialLayout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
      finalLayout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
  subpassCount 1
  pSubpasses
    pSubpasses[0]
      flags 0
      pipelineBindPoint VK_PIPELINE_BIND_POINT_GRAPHICS
      inputAttachmentCount 0
      pInputAttachments NULL
      colorAttachmentCount 1
      pColorAttachments
        pColorAttachments[0]
          attachment 0
          layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
      pResolveAttachments NULL
      pDepthStencilAttachment
        *pDepthStencilAttachment
          attachment 1
          layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
      preserveAttachmentCount 0
      pPreserveAttachments NULL
  dependencyCount 0
  pDependencies NULL

Fortunately, since render passes can be reused, you may not need to do this too often. We'll see later the flexibility exposed by this mechanism.

Creating a VkFramebuffer

A VkRenderPass is a template for how a render pass will be used. When we use the render pass, we need to provide the actual images which are to be used for rendering. The mechanism containing references to the actual images is a VkFramebuffer, which contains all the attachments used by the render pass.

vkCreateFramebuffer

Just as a VkRenderPass is created with vkCreateRenderPass(), a VkFramebuffer is created with vkCreateFramebuffer():

VkResult vkCreateFramebuffer()
VkDevice device Logical device used for rendering (from vkCreateDevice)
const VkFramebufferCreateInfo* pCreateInfo Parameters for creation
const VkAllocationCallbacks* pAllocator Host memory allocation callback (can be NULL)
VkFramebuffer* pFramebuffer Resulting frame buffer handle

Again, to allow extensibility and reusability, the parameters are passed through a pCreateInfo pointer. (Yes, here we go again!)

VkFramebufferCreateInfo

struct VkFramebufferCreateInfo
VkStructureType sType Used for type safety and extensions, must be VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO
const void* pNext Used for extensions; NULL if no extensions are used to add parameters
VkFramebufferCreateFlags flags Reserved for future use, must be 0
VkRenderPass renderPass The render pass (or a compatible one) with which the framebuffer will be used
uint32_t attachmentCount Number of attachments used in the render pass
const VkImageView* pAttachments Array of image views, which refer to actual images; array is of size attachmentCount
uint32_t width Width of framebuffer
uint32_t height Height of framebuffer
uint32_t layers Number of layers in framebuffer

Note that all the attachments used in the framebuffer are of the same width, height and number of layers - but that this is independent of the render pass, so the same render pass can be used with framebuffers of different sizes.

For our simple example, we need two image views: one referring to a VK_FORMAT_B8G8R8A8_UNORM image and one referring to a VK_FORMAT_D16_UNORM image. For efficiency, since we typically don't need the depth buffer to persist after rendering, the D16 image can be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT in its usage flags, and can be bound to memory with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT set. In this case, a tile-based renderer may be able to avoid allocating any memory for the depth buffer, since it is only used for rendering operations which occur on-chip.

Using a VkRenderPass

Now that we have a VkRenderPass and a VkFramebuffer, we can use them in the rendering process.

vkCmdBeginRenderPass

To begin a render pass instance in a command buffer, call vkCmdBeginRenderPass():

void vkCmdBeginRenderPass()
VkCommandBuffer commandBuffer Command buffer into which to insert the render pass
const VkRenderPassBeginInfo* pRenderPassBegin Arguments
VkSubpassContents contents Indication whether secondary command buffers are in use (see below)

A render pass can only be inserted into a primary command buffer.

Once a render pass has begun on a command buffer, subsequent commands submitted to that command buffer will execute within the first (and in the case of our example, only) subpass of the render pass instance. In our simple case, we could use just the one command buffer and record rendering commands directly into it. In this case, contents should be VK_SUBPASS_CONTENTS_INLINE.

VkRenderPassBeginInfo

As with many functions, Vulkan uses an info structure for reusability and extensibility.

struct VkRenderPassBeginInfo
VkStructureType sType Must be VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO
const void* pNext Used for extensions; must be NULL if no extension is used which adds to this struct
VkRenderPass renderPass The render pass description created by vkCreateRenderPass()
VkFramebuffer framebuffer The framebuffer containing the images for rendering, created by vkCreateFramebuffer()
VkRect2D renderArea Bounds of the rectangular area affected by the render pass
uint32_t clearValueCount Number of clear values
const VkClearValue* pClearValues Values used for clearing attachments (array of size clearValueCount)

renderArea is used for rendering a subset of the framebuffer, for example for partial updates of dirty areas of the screen. The application is responsible for clipping rendering to this area. Rendering to less than the entire framebuffer can incur a performance hit if the area is not aligned to the granularity reported by vkGetRenderAreaGranularity(), which for a tile-based renderer might be expected to correspond to the alignment of the tile grid. For most purposes, the render area can be set to the full width and height of the framebuffer.

pClearValues is indexed by the attachment number and used if the attachment has a loadOp of VK_ATTACHMENT_LOAD_OP_CLEAR. In the case of our simple example, we clear the depth attachment at the start of rendering, and the depth attachment is at index 1 in our attachment array - so we need pClearValues[1] to represent the value to which we want to clear the depth buffer.

union VkClearValue
VkClearColorValue color Value used when clearing color buffers
VkClearDepthStencilValue depthStencil Value used when clearing depth/stencil buffers

VkClearColorValue is a union of arrays of various channel types, with the format chosen by the attachment format being cleared. VkClearDepthStencilValue always has a float depth value, and a uint32_t stencil value. For our simple example, only the float depth value is relevant, and should be set to the depth value we want for our rendering.

vkCmdEndRenderPass

After the last rendering commands for the render pass instance have been submitted to the command buffer, the application must end the render pass instance:

void vkCmdEndRenderPass()
VkCommandBuffer commandBuffer

In this example, if we have been recording commands directly into the primary command buffer, the command buffer looks like this:

Command buffer
Previous render pass...
Current render pass
vkCmdBeginRenderPass()
vkCmdBind*...
vkCmdDraw*... etc.
vkCmdEndRenderPass()
Next render pass...

Multiple render passes can be inserted into the same command buffer, so long as one is ended before the next is begun. A render pass must both begin and end within a single primary command buffer (that is, a render pass cannot span multiple primary command buffers), so parallelism in command buffer building in this approach relies on parallel building of multiple render passes. In many rendering frameworks, this level of parallelism is still enough to allow the CPU cores to stay busy, and simplifies the task of resource management and state tracking.

Render passes and secondary command buffers

In some rendering scenarios, a large amount of work needs to be performed within a single rendering pass. For example, a large number of characters may be managed and animated by their own threads, but all appear on screen at once. This complicates the task of optimizing rendering order and minimizing state changes, but can still be necessary in some highly-parallel systems.

Vulkan's solution to this is to make use of secondary command buffers, which (for graphics rendering) are executed inside a render pass. A secondary command buffer is created by vkAllocateCommandBuffers() using a VkCommandBufferAllocateInfo with a level member of VK_COMMAND_BUFFER_LEVEL_SECONDARY.

Beginning a secondary command buffer

For graphics, the VkCommandBufferBeginInfo argument passed to vkBeginCommandBuffer() when beginning a secondary command buffer must have a valid pInheritanceInfo field:

VkResult vkBeginCommandBuffer()
VkCommandBuffer commandBuffer Command buffer to start using
const VkCommandBufferBeginInfo* pBeginInfo Arguments

struct VkCommandBufferBeginInfo
VkStructureType sType VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO
const void* pNext For extensions, should be NULL
VkCommandBufferUsageFlags flags Usage flags (see below)
const VkCommandBufferInheritanceInfo* pInheritanceInfo Inherited info (NULL for a primary command buffer, must be valid for a secondary)

VkCommandBufferBeginInfo::flags

flags has the following bit values:

enum VkCommandBufferUsageFlagBits
VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT Set if the command buffer will only ever be used once (potential optimization)
VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT Must be set for secondary graphics command buffers
VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT Set if the command buffer will be submitted more than once concurrently (set only if needed when reusing command buffers)

Secondary command buffers can also be used for compute, and in this case their operations do not fall within a render pass. For graphics, we must set RENDER_PASS_CONTINUE_BIT, may be able to set ONE_TIME_SUBMIT_BIT, and may need to set SIMULTANEOUS_USE_BIT. These options affect the way secondary command buffers are implemented - for example, some may make the difference between whether a separate copy must be made of a secondary command buffer before use, or whether the existing copy may be used indirectly.

VkCommandBufferBeginInfo::pInheritanceInfo

pInheritanceInfo is used to allow the secondary command buffer to be configured correctly for the render pass:

struct VkCommandBufferInheritanceInfo
VkStructureType sType VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO
const void* pNext Used for extensions, and should be NULL unless needed
VkRenderPass renderPass The render pass (or a compatible one) which will be active when the command buffer is used
uint32_t subpass The subpass of the render pass that this command buffer will be used in
VkFramebuffer framebuffer The framebuffer to be used (if known), or VK_NULL_HANDLE if unknown
VkBool32 occlusionQueryEnable Should be VK_TRUE if the primary command buffer might have a query active, and VK_FALSE otherwise
VkQueryControlFlags queryFlags Queries that can be used in the primary command buffer when this secondary command buffer executes; 0 if unused
VkQueryPipelineStatisticFlags pipelineStatistics Set of pipeline statistics that can be counted by a query; 0 if pipeline statistics queries are disabled

If the framebuffer is known at the time the command buffer is recorded (for example, if the same framebuffer is always used for generating a shadow map) then providing an explicit framebuffer may be more efficient; otherwise (if the framebuffer argument is VK_NULL_HANDLE) the framebuffer is determined by the render pass in the primary command buffer, which allows secondary command buffers to be reused with different (compatible) framebuffers determined by the primary command buffer that is using the secondary command buffer.

Rendering commands are recorded into the secondary command buffer in the same way as for a primary command buffer, and having multiple secondary command buffers allows multiple threads to record rendering commands concurrently without need for synchronization.

Invoking a secondary command buffer

When the secondary command buffers have been recorded, they can be invoked in a "parent" primary command buffer with vkCmdExecuteCommands():

void vkCmdExecuteCommands()
VkCommandBuffer commandBuffer Primary command buffer to be recorded into
uint32_t commandBufferCount Number of secondary command buffers to submit
const VkCommandBuffer* pCommandBuffers Array of commandBufferCount secondary command buffers to execute (in increasing array index order)

Secondary command buffers inside a subpass

Using the above techniques, work may be distributed as in the following example:

Thread 1: Record secondary command buffer A (frame 2), record secondary command buffer A (frame 3), ...
Thread 2: Record secondary command buffer B (frame 2), record secondary command buffer B (frame 3), ...
Thread 3: Record secondary command buffer C (frame 2), record secondary command buffer C (frame 3), ...
Thread 4: Record the primary command buffer for frame 1, then frame 2, then frame 3; each one contains:
    vkCmdBeginRenderPass()
    vkCmdExecuteCommands(A)
    vkCmdExecuteCommands(B)
    vkCmdExecuteCommands(C)
    vkCmdEndRenderPass()

Recording the primary command buffer should be faster than recording a significant amount of work into the secondary command buffers. However, there is typically some cost - especially for implementations which require the secondary command buffers to be copied into the primary command buffer. This approach also assumes that the secondary command buffers are at least double-buffered, and that the threads are suitably synchronized.

Since primary command buffers can be recorded in parallel and vkQueueSubmit() allows multiple command buffers to be submitted efficiently, exposing parallelism across secondary command buffers is not necessary in many applications, so this technique should be matched to the rendering work load. Note that it can also be possible to re-use secondary command buffers, although again this may carry some driver overhead (hopefully less than recording anew). Command buffer reuse should be used selectively, allowing for other optimizations such as frustum culling.


Destroying a VkRenderPass

Once a render pass is no longer needed, it can be destroyed as follows:

void vkDestroyRenderPass()
VkDevice device
VkRenderPass renderPass
const VkAllocationCallbacks* pAllocator

Note that it is up to the user to ensure that nothing is still rendering which referred to the render pass at the point vkDestroyRenderPass() is called - for example by using vkWaitForFences() with a VkFence handle previously passed to vkQueueSubmit().

Multi-sampling

Tiled rendering also provides a low-bandwidth way to implement antialiasing: we can render to the tiles normally, but average pixel values as part of the operation of writing the tile memory; this downsampling step is known as "resolving" the tile buffer.

Vulkan has the concept of a number of samples associated with an image. In a simple implementation the image might have several values stored at each pixel location; more complex implementations have compressed schemes. Therefore an image has a number of samples associated with it at image creation time. For multi-sampled rendering in Vulkan, the multi-sampled image is treated separately from the final single-sampled image; this provides separate control over what values need to reach memory, since - like the depth buffer - the multi-sampled image may only need to be accessed during the processing of a tile. For this reason, if the multi-sampled image is not required after the render pass, it can be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT and bound to an allocation created with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, as described above for depth buffers. The multi-sampled attachment storeOp can then be set to VK_ATTACHMENT_STORE_OP_DONT_CARE in the VkAttachmentDescription, so that (at least on tiled renderers) the full multi-sampled attachment does not need to be written to memory, which can save a lot of bandwidth.

To control multi-sampling, the index of an attached image view (in the pAttachments array of the VkFramebufferCreateInfo) with more than one sample should be used in the VkSubpassDescription's pColorAttachments array, and the index of a corresponding image view with exactly one sample should be placed in the corresponding index of the pResolveAttachments array; the multi-sampled image is then resolved to the single-sampled image at the end of the current subpass. To use pResolveAttachments for some attachments but not others, the attachment member of an entry in the pResolveAttachments array can be set to VK_ATTACHMENT_UNUSED to avoid resolving the corresponding multi-sampled image.

For example, if we had three multi-sampled attachments and only wanted the first and third to be resolved to single-sampled form, the VkSubpassDescription may have the following entries:

Index pColorAttachments[] pResolveAttachments[]
0 Index of first multi-sampled attachment Index of first single-sampled attachment
1 Index of second multi-sampled attachment VK_ATTACHMENT_UNUSED
2 Index of third multi-sampled attachment Index of second single-sampled attachment

Remember that if we don't want to resolve any attachments in the subpass, pResolveAttachments can simply be set to NULL. Multi-sampled images can also be resolved to a single-sample image with vkCmdResolveImage() - but this happens outside the render pass and requires a separate access to memory, so it is a much less efficient solution if it can be avoided. Note that you can write both the resolved and multi-sampled images out of the same render pass by setting the storeOp of both attachments to VK_ATTACHMENT_STORE_OP_STORE.

Resolving an image outside a render pass

On some occasions, the attachment containing all samples may need to be written to memory for later processing (for example, use in a later render pass as an input attachment). It is possible to resolve a multi-sampled image to a single-sampled one without using it as an attachment in a render pass using the vkCmdResolveImage() command.

However, please bear in mind that this should be the exception to normal rendering, not the default approach. Writing out the multi-sampled attachment to off-chip memory (rather than using VK_ATTACHMENT_STORE_OP_DONT_CARE) has a high bandwidth cost, and vkCmdResolveImage() itself must then read all this data back, process it, and write the single-sampled output. It is very much more efficient to perform resolve operations inside a render pass where possible.

Multiple subpasses

The render pass mechanism described so far is quite verbose for use with a single subpass. The reason for this is the flexibility that it provides when using multiple subpasses.

Some rendering techniques, notably deferred shading and deferred lighting, traverse the scene geometry once to create a framebuffer, then use the rendering results in the framebuffer for further rendering operations. The same applies to, for example, tone mapping effects applied after rendering. In a tiled renderer, because each of these operations requires access only to the current pixel and not the entire framebuffer, all of these operations can be performed consecutively on a per-tile basis, avoiding the need to write intermediate values out to memory. This can provide a significant bandwidth (and therefore power and performance) improvement. There is a graphical example of how deferred shading is evaluated on a tiler towards the end of the Understanding Tiling article.

Note that because the render area size is defined by the width and height fields of the VkFramebufferCreateInfo object, the render area of each attachment is effectively the same size, and this is true for all subpasses in a render pass. If a rendering technique requires reading values outside the current fragment area (which on a tiler would mean accessing rendered data outside the currently-rendering tile), separate render passes must be used.

Taking the example of deferred lighting, we might render the scene in three "subpasses":

  • The first subpass renders the geometry and stores the depth, normal vector and specular spread function.

  • The second subpass renders each light's bounds, accumulating a specular and diffuse color for each light that is calculated with the position, normal and specular spread function from the first subpass.

  • Finally, the scene geometry is processed again with conventional forward shading, picking up the light contributions from the results of the second subpass.

Since the shading in the first subpass is very simple, the shader run-time cost can be significantly reduced in this approach, although the degree of shader parallelism in the final subpass may still depend on fragment coverage. The related deferred shading technique can allow for better shader parallelism at the cost of reduced flexibility and increased intermediate storage requirements.

Multiple attachments for multiple subpasses

In our deferred lighting example, the depth buffer is used in all three subpasses. It should only be updated by the first, but the lighting subpass needs the depth attachment both to provide accurate bounds for a light and to calculate the shading position in world space, and the final subpass can inherit the depth buffer to avoid unnecessary overdraw.

In this case, our render pass might use the following attachments:

ID Field Value Notes
0 flags 0 Reserved
samples 1 Single-sampled
format VK_FORMAT_B8G8R8A8_UNORM Surface normal (BGR) + specular power (A)
loadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Assuming this will be completely overwritten
storeOp VK_ATTACHMENT_STORE_OP_DONT_CARE Intermediate storage (not written)
stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Unused
stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE Unused
initialLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
finalLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
1 flags 0 Reserved
samples 1 Single-sampled
format VK_FORMAT_D16_UNORM Depth
loadOp VK_ATTACHMENT_LOAD_OP_CLEAR Need empty depth buffer before use
storeOp VK_ATTACHMENT_STORE_OP_DONT_CARE Intermediate storage (not written)
stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Unused
stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE Unused
initialLayout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL Rendering to it
finalLayout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL Rendering to it
2 flags 0 Reserved
samples 1 Single-sampled
format VK_FORMAT_B8G8R8A8_UNORM Accumulated diffuse lighting contribution
loadOp VK_ATTACHMENT_LOAD_OP_CLEAR Accumulating, so start at 0
storeOp VK_ATTACHMENT_STORE_OP_DONT_CARE Intermediate storage (not written)
stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Unused
stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE Unused
initialLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
finalLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
3 flags 0 Reserved
samples 1 Single-sampled
format VK_FORMAT_B8G8R8A8_UNORM Accumulated specular lighting contribution
loadOp VK_ATTACHMENT_LOAD_OP_CLEAR Accumulating, so start with 0
storeOp VK_ATTACHMENT_STORE_OP_DONT_CARE Intermediate storage (not written)
stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Unused
stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE Unused
initialLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
finalLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
4 flags 0 Reserved
samples 1 Single-sampled
format VK_FORMAT_B8G8R8A8_UNORM Final output of rendering
loadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Assuming rendering the whole frame
storeOp VK_ATTACHMENT_STORE_OP_STORE Write output of rendering
stencilLoadOp VK_ATTACHMENT_LOAD_OP_DONT_CARE Unused
stencilStoreOp VK_ATTACHMENT_STORE_OP_DONT_CARE Unused
initialLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it
finalLayout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL Rendering to it

That is:

  • Attachment 0 holds the surface normal and specular factor output by the first subpass, and used by the second subpass.

  • Attachment 1 holds the depth buffer for the scene, and applies to all three subpasses.

  • Attachment 2 holds the diffuse contributions from light sources output by the second subpass and read by the third.

  • Attachment 3 holds the specular contributions from light sources output by the second subpass and read by the third.

  • Attachment 4 holds the final result of rendering generated by the third subpass.

Relating attachments to subpasses

To associate the way these attachments are used with each subpass, we need a more complex array of VkSubpassDescription objects to pass to the pSubpasses member of our VkRenderPassCreateInfo object:

pSubpasses[0]
  flags 0
  pipelineBindPoint VK_PIPELINE_BIND_POINT_GRAPHICS
  inputAttachmentCount 0
  pInputAttachments NULL
  colorAttachmentCount 1
  pColorAttachments
    pColorAttachments[0]
      attachment 0 (normal + specularity)
      layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
  pResolveAttachments NULL
  pDepthStencilAttachment
    *pDepthStencilAttachment
      attachment 1 (depth)
      layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
  preserveAttachmentCount 0
  pPreserveAttachments NULL
pSubpasses[1]
  flags 0
  pipelineBindPoint VK_PIPELINE_BIND_POINT_GRAPHICS
  inputAttachmentCount 1
  pInputAttachments
    pInputAttachments[0]
      attachment 0 (normal + specularity)
      layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
  colorAttachmentCount 2
  pColorAttachments
    pColorAttachments[0]
      attachment 2 (diffuse lighting)
      layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    pColorAttachments[1]
      attachment 3 (specular lighting)
      layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
  pResolveAttachments NULL
  pDepthStencilAttachment
    *pDepthStencilAttachment
      attachment 1 (depth)
      layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
  preserveAttachmentCount 0
  pPreserveAttachments NULL
pSubpasses[2]
  flags 0
  pipelineBindPoint VK_PIPELINE_BIND_POINT_GRAPHICS
  inputAttachmentCount 2
  pInputAttachments
    pInputAttachments[0]
      attachment 2 (diffuse lighting)
      layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    pInputAttachments[1]
      attachment 3 (specular lighting)
      layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
  colorAttachmentCount 1
  pColorAttachments
    pColorAttachments[0]
      attachment 4 (final output)
      layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
  pResolveAttachments NULL
  pDepthStencilAttachment
    *pDepthStencilAttachment
      attachment 1 (depth)
      layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
  preserveAttachmentCount 0
  pPreserveAttachments NULL

Since all but the final output color attachment in this example are used only as intermediate values, they can be created with the VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT set, and be bound to memory allocated with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT. Tiling hardware typically has limitations on the number and type of attachments which can be kept in flight concurrently, so despite this optimization, it is possible that implementations will have to spill intermediate results to main memory.

More complex arrangements of subpasses are possible. If an attachment is not used during a subpass, but is needed in previous and subsequent subpasses, the attachment should appear in the pPreserveAttachments array of the subpass. Implementations can change the order in which subpasses are evaluated (while preserving dependencies) in order to reduce the need for spilling. In the above example, attachment 0 is not preserved, and the implementation may use the same internal tile memory for both it and the final output attachment. It is also possible to use multi-sampling with these approaches, but this complicates the intermediate read operations and may make it more likely that tilers will have to spill to external memory.
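
The preservation rule can be illustrated with a small stand-alone helper. This is a simplified model rather than part of the Vulkan API: each subpass is reduced to a bitmask of the attachment indices it references, and an attachment is treated as needing preservation whenever it is referenced both before and after a subpass that does not use it. That is conservative, since it ignores whether the later use actually needs the earlier contents. The function name is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* used[i] has bit a set if subpass i references attachment a in any role.
 * Returns a bitmask of attachments that subpass s would list in
 * pPreserveAttachments under the simplified rule described above. */
uint32_t preserve_mask(const uint32_t *used, uint32_t subpass_count, uint32_t s)
{
    uint32_t before = 0;
    uint32_t after  = 0;

    assert(s < subpass_count);

    for (uint32_t i = 0; i < s; ++i)
        before |= used[i];
    for (uint32_t i = s + 1; i < subpass_count; ++i)
        after |= used[i];

    /* Used earlier and later, but not touched by subpass s itself. */
    return before & after & ~used[s];
}
```

For the deferred lighting example the masks are 0x03 (attachments 0 and 1), 0x0F (attachments 0 to 3) and 0x1E (attachments 1 to 4), and the helper returns 0 for every subpass, which matches the preserveAttachmentCount of 0 used throughout. If the lighting subpass did not bind the depth attachment (mask 0x0D), the helper would return 0x02 for it, meaning attachment 1 would have to be preserved across that subpass.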

Subpass dependencies

When multiple subpasses are in use, the driver needs to be told the relationship between them. A subpass can depend on operations which were submitted outside the current render pass, or be the source on which later rendering depends. Most commonly, the need is to ensure that the fragment shader from an earlier subpass has completed rendering (to the current tile, on a tiler) before the next subpass starts to try to read that data. An array of subpass dependencies - if there are any - is passed to VkRenderPassCreateInfo, defining a set of dependencies between "source" (the thing being waited on) and "destination" (the thing doing the waiting). Each subpass dependency is defined as follows:

struct VkSubpassDependency
uint32_t srcSubpass The index of the subpass being depended upon by dstSubpass
uint32_t dstSubpass The index of the subpass depending on srcSubpass
VkPipelineStageFlags srcStageMask Which pipeline stages must have completed for the dependency
VkPipelineStageFlags dstStageMask Which pipeline stages are waiting on the dependency
VkAccessFlags srcAccessMask Which access types must have completed for the dependency
VkAccessFlags dstAccessMask Which access types are waiting on the dependency
VkDependencyFlags dependencyFlags Other configuration about the dependency

Typically, for dependencies between fragment writes and fragment shader reads, we might expect the following settings:

srcStageMask VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT Fragment data has been written
dstStageMask VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT Don't start shading until data is available
srcAccessMask VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT Waiting for color data to be written
dstAccessMask VK_ACCESS_SHADER_READ_BIT Don't read things from the shader before ready
dependencyFlags VK_DEPENDENCY_BY_REGION_BIT Only need the current fragment (or tile) synchronized, not the whole framebuffer

In the case of our deferred lighting example, we have three subpasses, with dependencies between the first and second and between the second and third. That is, we need to set the dependencyCount member of our VkRenderPassCreateInfo to 2, and set the pDependencies member to point to the following array:

pDependencies[0]
  srcSubpass 0
  dstSubpass 1
  srcStageMask VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
  dstStageMask VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
  srcAccessMask VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
  dstAccessMask VK_ACCESS_SHADER_READ_BIT
  dependencyFlags VK_DEPENDENCY_BY_REGION_BIT
pDependencies[1]
  srcSubpass 1
  dstSubpass 2
  srcStageMask VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
  dstStageMask VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
  srcAccessMask VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
  dstAccessMask VK_ACCESS_SHADER_READ_BIT
  dependencyFlags VK_DEPENDENCY_BY_REGION_BIT

Using subpasses in a command buffer

As described above, vkCmdBeginRenderPass() and vkCmdEndRenderPass() wrap the render pass operations when recording to a VkCommandBuffer. After vkCmdBeginRenderPass() is called, subsequent commands are applied to the first subpass within the render pass.

To move operations to subsequent subpasses, vkCmdNextSubpass() should be called. Each call of this function moves operations to the next subpass index, in increasing order, until vkCmdEndRenderPass() is called. Synchronization between access to attachments described in subpass dependencies is handled automatically.

Using subpasses in shaders

In SPIR-V, the contents of an input attachment can be accessed with the OpImageRead operation, with an OpTypeImage that has a dim argument of SubpassData. The coordinate argument of the OpImageRead must be (0,0), and corresponds to accessing the input attachment at the current fragment location. When multi-sampling, the sample operand to OpImageRead can be used to access separate samples at the current fragment.

In GLSL, this functionality is exposed through the subpassLoad() function, with subpassInput types for the input attachments.

Summary

The Vulkan API acknowledges the fact that modern rendering techniques may perform multiple passes over the same image data, and is designed to ensure that these approaches are explicitly and efficiently supported on modern graphics hardware. The unfortunate consequence of this expressivity is the complexity of the description and the verbosity of simple examples, although the overhead in a practical, optimized renderer should be less significant.

In Vulkan, the render pass is an explicit concept within which rendering operations execute. A VkFramebuffer, with a list of associated attachments, is associated with the render pass when rendering work is recorded into a VkCommandBuffer. The render pass is divided into one or more subpasses, with explicitly-defined interactions between them. The VkRenderPass object that holds this explicit configuration can be shared between rendering operations, which limits its impact on real-world, complex applications. Providing this additional information to a driver can allow significantly reduced memory overhead, especially on tiled architectures, without the unpredictability of the heuristics applied to achieve good performance in more traditional APIs.

Additional reading

A simplified version of the content of this article may be found in a presentation on the subject at a UK developer event.