When a GPU renders a scene, it is configured with one or more render targets, or framebuffer attachments in Khronos terminology. The size and format of the attachments determine how graphics work is configured across the parallelism available on all modern GPUs. For example, on a tile-based renderer, the set of attachments is used to determine the way the image is divided into tiles. In Vulkan, a render pass is the set of attachments, the way they are used, and the rendering work that is performed using them. In a traditional API, a change to a new render pass might correspond to binding a new framebuffer.
During normal rendering, it is not possible for a fragment shader to access the attachments to which it is currently rendering: GPUs have optimized hardware for writing to the attachments, and accessing the attachments interferes with this. However, some common rendering techniques such as deferred shading rely on being able to access the result of previous rendering during shading. On a tile-based renderer, the results of previous rendering can efficiently stay on-chip if subsequent rendering operations are at the same resolution, and if only the data in the pixel currently being rendered is needed (accessing different pixels may require values from outside the current tile, which breaks this optimization). To help optimize deferred shading on tile-based renderers, Vulkan splits the rendering operations of a render pass into subpasses. All subpasses in a render pass share the same resolution and tile arrangement, and as a result, they can access the results of previous subpasses.
In Vulkan, a render pass consists of one or more subpasses; for simple rendering operations, there may be only a single subpass in a render pass.
In Vulkan, a render pass is described by an (opaque) VkRenderPass object. This provides a template that is used when beginning a render pass inside a command buffer. The render pass is used with a compatible VkFramebuffer object, which represents the set of images that will be used as attachments during execution of the render pass.
Like many driver objects in Vulkan, a VkRenderPass object is created with a corresponding create function, vkCreateRenderPass():
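Its declaration, as found in the Vulkan headers, is:

```c
VkResult vkCreateRenderPass(
    VkDevice                        device,
    const VkRenderPassCreateInfo*   pCreateInfo,
    const VkAllocationCallbacks*    pAllocator,
    VkRenderPass*                   pRenderPass);
```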
As with many Vulkan creation functions, most parameters are passed through a creation structure. This approach makes it more efficient to create multiple identical objects, and provides a way to support type-safe additional parameters through extensions.
Many creation methods in Vulkan offer a callback for applications that wish to track host-side memory usage. While important for applications that need precise control over resource allocation, and useful for debugging, in most cases this callback can be left as NULL so that the driver's default memory allocation scheme is used.
As with other Vulkan creation functions, the function returns an error code if anything goes wrong - although more information may be available through validation layers if the problem is an application error. The newly-created render pass description is returned via the pRenderPass pointer.
The interesting parameters are contained in the pCreateInfo structure.
For the purposes of this article, we will begin with a simple rendering operation with only a single subpass (a render pass always consists of at least one subpass). In this case, subpassCount can be 1 and dependencyCount can be 0 (so pDependencies can be NULL - we'll come back to describe how else dependencies are used below).
An attachment corresponds to a single Vulkan VkImageView. A description of the attachment is provided to the render pass creation, which allows the render pass to be configured appropriately; the actual images to be used are provided when the render pass is used, via the VkFramebuffer. It is possible to associate multiple attachments with a render pass; these may be used, for example, as multiple render targets or in separate subpasses. Most commonly, the color buffer and depth buffer are separate attachments in Vulkan. Therefore the pAttachments member of VkRenderPassCreateInfo points to an array of attachmentCount elements.
For a simple rendering operation, we might decide to create two attachments:
Color attachment (pAttachments[0])
Depth attachment (pAttachments[1])
Stencil is special because a combined depth/stencil attachment is a single attachment. Here we aren't using stencil, so the stencilLoadOp and stencilStoreOp are irrelevant. Note that a "DONT_CARE" store op does not guarantee that memory is never touched: a tile-based renderer may never need to access memory for such an attachment, but an immediate-mode renderer may actually use memory during rendering. Similarly, a "DONT_CARE" load op avoids the need to read the previous framebuffer contents on a tiler, but also avoids the need to perform an explicit clear of the memory, which may be costly on an immediate-mode renderer.
Note: We're assuming that the images have been transitioned from
VK_IMAGE_LAYOUT_UNDEFINED (on creation) to
VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL before we use them, for example by using a VkImageMemoryBarrier.
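Putting this together, the two attachment descriptions might look like the following sketch. The formats match the B8G8R8A8 color and D16 depth images used later; the load/store choices are illustrative (the color attachment is assumed to be fully overwritten and kept, the depth attachment cleared at the start and discarded at the end):

```c
const VkAttachmentDescription attachments[2] = {
    /* pAttachments[0]: color */
    {
        .format         = VK_FORMAT_B8G8R8A8_UNORM,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_DONT_CARE, /* every pixel will be overwritten */
        .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,    /* keep the rendered image */
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .finalLayout    = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    },
    /* pAttachments[1]: depth */
    {
        .format         = VK_FORMAT_D16_UNORM,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,      /* clear depth at the start */
        .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE, /* depth is not needed afterwards */
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,  /* D16 has no stencil aspect */
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
        .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    },
};
```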
There is a complication to this mechanism to be aware of. Consider drawing a scene with two render passes, the second of which uses the results of the first (written with STORE_OP_STORE) as an input attachment (LOAD_OP_LOAD) but does not write to it. If this input attachment is still wanted after the second render pass, it must still have STORE_OP_STORE associated with it: using STORE_OP_DONT_CARE allows some hardware to discard the attachment content after the second render pass, even though the first render pass used STORE_OP_STORE. You can think of this as a cache discard of the output of the first render pass, where the cache line was previously considered valid. This is potentially a good performance enhancement, but it does mean that users need to be prepared for surprising behavior!
In our first example, we only have a single subpass, and we'll render to it directly. We won't use pResolveAttachments (so we can set it to NULL) and we do not need to preserve any attachments (so preserveAttachmentCount can be 0 and pPreserveAttachments can be NULL). The fields we don't need now will be described in more detail below, but in our simple case we can configure the (single) subpass. Before we get there, we have one more level of Vulkan object to worry about:
In the example we're walking through, we have two attachments in total:
The layout can change between subpasses of a render pass, hence the need to describe it on a per-subpass basis.
In summary, in the simple render pass we've been using as an example, we have the following two attachments:
In total, then, our simple render pass looks like this:
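A minimal sketch of the whole configuration follows, assuming the two VkAttachmentDescription entries sit in an array named attachments (color at index 0, depth at index 1, as above) and that device is a valid VkDevice; error handling is omitted:

```c
/* Index 0 is the color attachment, index 1 the depth attachment. */
const VkAttachmentReference colorRef = {
    .attachment = 0,
    .layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
};
const VkAttachmentReference depthRef = {
    .attachment = 1,
    .layout     = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};

const VkSubpassDescription subpass = {
    .pipelineBindPoint       = VK_PIPELINE_BIND_POINT_GRAPHICS,
    .inputAttachmentCount    = 0,
    .pInputAttachments       = NULL,
    .colorAttachmentCount    = 1,
    .pColorAttachments       = &colorRef,
    .pResolveAttachments     = NULL,  /* no multi-sample resolve */
    .pDepthStencilAttachment = &depthRef,
    .preserveAttachmentCount = 0,
    .pPreserveAttachments    = NULL,
};

const VkRenderPassCreateInfo createInfo = {
    .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,
    .attachmentCount = 2,
    .pAttachments    = attachments,
    .subpassCount    = 1,
    .pSubpasses      = &subpass,
    .dependencyCount = 0,    /* single subpass: no dependencies needed */
    .pDependencies   = NULL,
};

VkRenderPass renderPass;
VkResult result = vkCreateRenderPass(device, &createInfo, NULL, &renderPass);
```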
Fortunately, since render passes can be reused, you may not need to do this too often. We'll see later the flexibility exposed by this mechanism.
A VkRenderPass is a template for how a render pass will be used. When we use the render pass, we need to provide the actual images which are to be used for rendering. The mechanism containing references to the actual images is a VkFramebuffer, which contains all the attachments used by the render pass.
As with vkCreateRenderPass() for a VkRenderPass, a VkFramebuffer is created with vkCreateFramebuffer():
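Its declaration is:

```c
VkResult vkCreateFramebuffer(
    VkDevice                         device,
    const VkFramebufferCreateInfo*   pCreateInfo,
    const VkAllocationCallbacks*     pAllocator,
    VkFramebuffer*                   pFramebuffer);
```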
Again, to allow extensibility and reusability, the parameters are passed through a pCreateInfo pointer. (Yes, here we go again!)
Note that all the attachments used in the framebuffer must have the same width, height, and number of layers - but this is independent of the render pass, so the same render pass can be used with framebuffers of different sizes.
For our simple example, we need two image views: one referring to a VK_FORMAT_B8G8R8A8_UNORM image and one referring to a VK_FORMAT_D16_UNORM image. For efficiency, since we typically don't need the depth buffer to persist after rendering, the D16 image can be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT in its usage flags, and can be bound to memory with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT set. In this case, a tile-based renderer may be able to avoid allocating any memory for the depth buffer, since it is only used for rendering operations which occur on-chip.
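A sketch of framebuffer creation for this example, assuming the renderPass created earlier, existing VkImageView handles colorView and depthView, and illustrative dimensions:

```c
/* Views appear in the same order as the render pass attachments. */
const VkImageView views[2] = { colorView, depthView };

const VkFramebufferCreateInfo fbInfo = {
    .sType           = VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO,
    .renderPass      = renderPass,  /* defines compatibility, not identity */
    .attachmentCount = 2,
    .pAttachments    = views,
    .width           = 1920,        /* illustrative size */
    .height          = 1080,
    .layers          = 1,
};

VkFramebuffer framebuffer;
VkResult result = vkCreateFramebuffer(device, &fbInfo, NULL, &framebuffer);
```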
Now that we have a VkRenderPass and a VkFramebuffer, we can use them in the rendering process.
To begin a render pass instance in a command buffer, call vkCmdBeginRenderPass():
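Its declaration is:

```c
void vkCmdBeginRenderPass(
    VkCommandBuffer                commandBuffer,
    const VkRenderPassBeginInfo*   pRenderPassBegin,
    VkSubpassContents              contents);
```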
A render pass can only begin (and end) in a primary command buffer.
Once a render pass has begun on a command buffer, subsequent commands submitted to that command buffer will execute within the first (and in the case of our example, only) subpass of the render pass instance. In our simple case, we could use just the one command buffer and record rendering commands directly into it. In this case, contents should be VK_SUBPASS_CONTENTS_INLINE.
As with many functions, Vulkan uses an info structure for reusability and extensibility.
renderArea is used for rendering to a subset of the framebuffer, for example for partial updates of dirty areas of the screen. The application is responsible for clipping rendering to this area, and rendering to less than the entire framebuffer can incur a performance cost if the area being drawn is not aligned to the granularity reported by vkGetRenderAreaGranularity() - which on a tile-based renderer might be expected to correspond to the alignment of the tile grid. For most purposes, the render area can be set to the full width and height of the framebuffer.
pClearValues is indexed by the attachment number and used if the attachment has a loadOp of VK_ATTACHMENT_LOAD_OP_CLEAR. In the case of our simple example, we clear the depth attachment at the start of rendering, and the depth attachment is at index 1 in our attachment array - so we need pClearValues to represent the value to which we want to clear the depth buffer.
VkClearColorValue is a union of arrays of various channel types, with the format chosen by the attachment format being cleared. VkClearDepthStencilValue always has a float depth value, and a uint32_t stencil value. For our simple example, only the float depth value is relevant, and should be set to the depth value we want for our rendering.
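For our example, the begin structure might be filled in as follows (renderPass, framebuffer, and commandBuffer are assumed from earlier; the depth clear value of 1.0 is the conventional far-plane choice, and entry 0 of the clear array is present only because clearValueCount must cover the highest attachment index that is cleared):

```c
VkClearValue clearValues[2];
clearValues[0].color.float32[0]     = 0.0f; /* index 0 unused: color loadOp is not CLEAR */
clearValues[1].depthStencil.depth   = 1.0f; /* clear depth to the far plane */
clearValues[1].depthStencil.stencil = 0;    /* ignored: no stencil aspect */

const VkRenderPassBeginInfo beginInfo = {
    .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,
    .renderPass      = renderPass,
    .framebuffer     = framebuffer,
    .renderArea      = { .offset = { 0, 0 }, .extent = { 1920, 1080 } }, /* full framebuffer */
    .clearValueCount = 2,
    .pClearValues    = clearValues,
};

vkCmdBeginRenderPass(commandBuffer, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);
```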
After the last rendering commands for the render pass instance have been submitted to the command buffer, the application must end the render pass instance:
In this example, if we have been recording commands direct to the primary command buffer, the command buffer looks like this:
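Schematically, the recording might look like the following sketch (cmdBufBeginInfo and beginInfo are assumed structures from the surrounding discussion, and the draw commands are placeholders):

```c
vkBeginCommandBuffer(commandBuffer, &cmdBufBeginInfo);
vkCmdBeginRenderPass(commandBuffer, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);
/* ... vkCmdBindPipeline(), vkCmdDraw*() etc. for the single subpass ... */
vkCmdEndRenderPass(commandBuffer);
vkEndCommandBuffer(commandBuffer);
```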
Multiple render passes can be inserted into the same command buffer, so long as one is ended before the next is begun. A render pass must both begin and end within a single primary command buffer (that is, a render pass cannot span multiple primary command buffers), so parallelism in command buffer building in this approach relies on parallel building of multiple render passes. In many rendering frameworks, this level of parallelism is still enough to allow the CPU cores to stay busy, and simplifies the task of resource management and state tracking.
In some rendering scenarios, a large amount of work needs to be performed within a single rendering pass. For example, a large number of characters may be managed and animated by their own threads, but all appear on screen at once. This complicates the task of optimizing rendering order and minimizing state changes, but can still be necessary in some highly-parallel systems.
Vulkan's solution to this is to make use of secondary command buffers, which (for graphics rendering) are executed inside a render pass. A secondary command buffer is created by vkAllocateCommandBuffers() using a VkCommandBufferAllocateInfo with a level member of VK_COMMAND_BUFFER_LEVEL_SECONDARY.
For graphics, the VkCommandBufferBeginInfo argument of vkBeginCommandBuffer when creating a secondary command buffer must have a valid pInheritanceInfo field:
flags can include the following bit values: VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT, and VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT.
Secondary command buffers can also be used for compute, in which case their operations do not fall within a render pass. For graphics, we must set RENDER_PASS_CONTINUE_BIT, may be able to set ONE_TIME_SUBMIT_BIT, and may need to set SIMULTANEOUS_USE_BIT. These options affect the way secondary command buffers are implemented - for example, they may determine whether a separate copy must be made of a secondary command buffer before use, or whether the existing copy can be referenced in place.
pInheritanceInfo is used to allow the secondary command buffer to be configured correctly for the render pass:
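A sketch of beginning a secondary command buffer for graphics, assuming the renderPass created earlier and an allocated secondary command buffer secondaryCommandBuffer:

```c
const VkCommandBufferInheritanceInfo inheritance = {
    .sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO,
    .renderPass  = renderPass,     /* the (compatible) render pass we will execute within */
    .subpass     = 0,              /* the subpass index within that render pass */
    .framebuffer = VK_NULL_HANDLE, /* may be the actual framebuffer, if known */
    .occlusionQueryEnable = VK_FALSE,
};

const VkCommandBufferBeginInfo beginInfo = {
    .sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
    .flags            = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,
    .pInheritanceInfo = &inheritance,
};

vkBeginCommandBuffer(secondaryCommandBuffer, &beginInfo);
```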
If the framebuffer is known at the time the secondary command buffer is recorded (for example, if the same framebuffer is always used for generating a shadow map), providing it explicitly may be more efficient. Otherwise (if the framebuffer member is VK_NULL_HANDLE), the framebuffer is determined by the render pass begun in the primary command buffer, which allows secondary command buffers to be reused with different (compatible) framebuffers determined by the primary command buffer that executes them.
Rendering commands are recorded into the secondary command buffer in the same way as for a primary command buffer, and having multiple secondary command buffers allows multiple threads to record rendering commands concurrently without need for synchronization.
When the secondary command buffers have been recorded, they can be invoked in a "parent" primary command buffer with vkCmdExecuteCommands():
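Its declaration is:

```c
void vkCmdExecuteCommands(
    VkCommandBuffer          commandBuffer,
    uint32_t                 commandBufferCount,
    const VkCommandBuffer*   pCommandBuffers);
```

Note that a subpass whose commands come from secondary command buffers must be begun with contents set to VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS rather than VK_SUBPASS_CONTENTS_INLINE.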
Using the above techniques, work may be distributed as in the following example:
Recording the primary command buffer should be faster than recording a significant amount of work into the secondary command buffers. However, there is typically some cost - especially for implementations which require the secondary command buffers to be copied into the primary command buffer. This approach also assumes that the secondary command buffers are at least double-buffered, and that the threads are suitably synchronized.
Since primary command buffers can be recorded in parallel and vkQueueSubmit() allows multiple command buffers to be submitted efficiently, exposing parallelism across secondary command buffers is not necessary in many applications, so this technique should be matched to the rendering workload. Note that it can also be possible to reuse secondary command buffers, although again this may carry some driver overhead (hopefully less than recording anew). Command buffer reuse should be applied selectively, bearing in mind that it can interact with other optimizations such as frustum culling.
Once a render pass is no longer needed, it can be deleted as follows:
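For example (device and renderPass assumed from earlier):

```c
/* The NULL pAllocator matches the allocator used at creation time. */
vkDestroyRenderPass(device, renderPass, NULL);
```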
Note that it is up to the user to ensure that nothing is still rendering which referred to the render pass at the point vkDestroyRenderPass() is called - for example by using vkWaitForFences() with a VkFence handle previously passed to vkQueueSubmit().
Tiled rendering also provides a low-bandwidth way to implement antialiasing: we can render to the tiles normally, but average pixel values as part of the operation of writing the tile memory; this downsampling step is known as "resolving" the tile buffer.
Vulkan has the concept of a number of samples associated with an image. In a simple implementation the image might have several values stored at each pixel location; more complex implementations have compressed schemes. Therefore an image has a number of samples associated with it at image creation time. For multi-sampled rendering in Vulkan, the multi-sampled image is treated separately from the final single-sampled image; this provides separate control over what values need to reach memory, since - like the depth buffer - the multi-sampled image may only need to be accessed during the processing of a tile. For this reason, if the multi-sampled image is not required after the render pass, it can be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT and bound to an allocation created with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, as described above for depth buffers. The multi-sampled attachment storeOp can then be set to VK_ATTACHMENT_STORE_OP_DONT_CARE in the VkAttachmentDescription, so that (at least on tiled renderers) the full multi-sampled attachment does not need to be written to memory, which can save a lot of bandwidth.
To control multi-sampling, the index of an attached image view (in the pAttachments array of the VkFramebufferCreateInfo) with more than one sample should be used in the VkSubpassDescription's pColorAttachments array, and the index of a corresponding image view with exactly one sample should be placed at the corresponding position in the pResolveAttachments array; the multi-sampled image is then resolved to the single-sampled image at the end of the subpass. To resolve some attachments but not others, an entry in the pResolveAttachments array can be set to VK_ATTACHMENT_UNUSED, which leaves the corresponding multi-sampled image unresolved.
For example, if we had three multi-sampled attachments and only wanted the first and third to be resolved to single-sampled form, the VkSubpassDescription may have the following entries:
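A sketch with illustrative attachment indices - here attachments 0 to 2 are the multi-sampled color attachments, and attachments 3 and 4 are the single-sampled resolve targets for the first and third of them:

```c
const VkAttachmentReference colorRefs[3] = {
    { .attachment = 0, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { .attachment = 1, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { .attachment = 2, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
};
const VkAttachmentReference resolveRefs[3] = {
    { .attachment = 3,                    .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { .attachment = VK_ATTACHMENT_UNUSED, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { .attachment = 4,                    .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
};

const VkSubpassDescription subpass = {
    .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
    .colorAttachmentCount = 3,
    .pColorAttachments    = colorRefs,
    .pResolveAttachments  = resolveRefs, /* entry 1 is not resolved */
};
```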
Remember that if we don't want to resolve any attachments in the subpass, pResolveAttachments can simply be set to NULL. Multi-sampled images can also be resolved to a single-sample image with vkCmdResolveImage() - but this happens outside the render pass and requires a separate access to memory, so it is a much less efficient solution if it can be avoided. Note that you can write both the resolved and multi-sampled images out of the same render pass by setting the storeOp of both attachments to VK_ATTACHMENT_STORE_OP_STORE.
On some occasions, the attachment containing all samples may need to be written to memory for later processing (for example, use in a later render pass as an input attachment). It is possible to resolve a multi-sampled image to a single-sampled one without using it as an attachment in a render pass using the vkCmdResolveImage() command.
However, please bear in mind that this should be the exception to normal rendering, not the default approach. Writing out the multi-sampled attachment to off-chip memory (rather than using VK_ATTACHMENT_STORE_OP_DONT_CARE) has a high bandwidth cost, and vkCmdResolveImage() itself must then read all this data back, process it, and write the single-sampled output. It is very much more efficient to perform resolve operations inside a render pass where possible.
The render pass mechanism described so far is quite verbose for use with a single subpass. The reason for this is the flexibility that it provides when using multiple subpasses.
Some rendering techniques, notably deferred shading and deferred lighting, traverse the scene geometry once to create a frame buffer, then use the rendering results in the framebuffer for further rendering operations. The same is true of, for example, applying tone-mapping effects after rendering. In a tiled renderer, because each of these operations requires access only to the current pixel and not the entire framebuffer, all of these operations can be performed consecutively on a per-tile basis, avoiding the need to write intermediate values out to memory. This can provide a significant bandwidth (and therefore power and performance) improvement. There is a graphical example of how deferred shading is evaluated on a tiler towards the end of the Understanding Tiling article.
Note that because the render area size is defined by the width and height fields of the VkFramebufferCreateInfo object, the render area of each attachment is effectively the same size, and this is true for all subpasses in a render pass. If a rendering technique requires reading values outside the current fragment area (which on a tiler would mean accessing rendered data outside the currently-rendering tile), separate render passes must be used.
Taking the example of deferred lighting, we might render the scene in three "subpasses":
The first subpass renders the geometry and stores the depth, normal vector and specular spread function.
The second subpass renders each light's bounds, accumulating a specular and diffuse color for each light that is calculated with the position, normal and specular spread function from the first subpass.
Finally, the scene geometry is processed again with conventional forward shading, picking up the light contributions from the results of the second subpass.
Since the shading in the first subpass is very simple, this approach can significantly reduce shader run-time cost, although the degree of shader parallelism in the final subpass may still depend on fragment coverage. The related deferred shading technique can allow better shader parallelism at the cost of reduced flexibility and increased intermediate storage requirements.
In our deferred lighting example, the depth buffer is used in all three subpasses; it should only be updated by the first, but the lighting subpass needs the depth attachment both to provide an accurate bounds for a light and to calculate the shading position in world space, and the final rendering pass can inherit the depth buffer to avoid unnecessary overdraw.
In this case, our render pass might use the following attachments:
Attachment 0 holds the surface normal and specular factor output by the first subpass, and used by the second subpass.
Attachment 1 holds the depth buffer for the scene, and applies to all three subpasses.
Attachment 2 holds the diffuse contributions from light sources output by the second subpass and read by the third.
Attachment 3 holds the specular contributions from light sources output by the second subpass and read by the third.
Attachment 4 holds the final result of rendering generated by the third subpass.
To associate the way these attachments are used with each subpass, we need a more complex array of VkSubpassDescription objects to pass to the pSubpasses member of our VkRenderPassCreateInfo object:
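One possible arrangement is sketched below; the layouts and the exact input-attachment usage are illustrative. The depth attachment appears in subpasses 1 and 2 in the read-only depth/stencil layout, which permits it to be used simultaneously for depth testing and (in subpass 1) as an input attachment:

```c
/* Attachment indices: 0 = normal + specular factor, 1 = depth,
 * 2 = diffuse light, 3 = specular light, 4 = final color. */
const VkAttachmentReference gbufOut[]  = {{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL}};
const VkAttachmentReference depthWrite =  {1, VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL};
const VkAttachmentReference lightIn[]  = {{0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
                                          {1, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL}};
const VkAttachmentReference lightOut[] = {{2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
                                          {3, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL}};
const VkAttachmentReference depthRead  =  {1, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL};
const VkAttachmentReference finalIn[]  = {{2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
                                          {3, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL}};
const VkAttachmentReference finalOut[] = {{4, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL}};

const VkSubpassDescription subpasses[3] = {
    { /* Subpass 0: write the G-buffer (normal/specular) and depth. */
      .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .colorAttachmentCount = 1, .pColorAttachments = gbufOut,
      .pDepthStencilAttachment = &depthWrite },
    { /* Subpass 1: read G-buffer and depth, accumulate per-light diffuse/specular. */
      .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 2, .pInputAttachments = lightIn,
      .colorAttachmentCount = 2, .pColorAttachments = lightOut,
      .pDepthStencilAttachment = &depthRead },
    { /* Subpass 2: forward pass reading the accumulated light contributions. */
      .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 2, .pInputAttachments = finalIn,
      .colorAttachmentCount = 1, .pColorAttachments = finalOut,
      .pDepthStencilAttachment = &depthRead },
};
```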
Since all but the final output color attachment in this example are used only as intermediate values, they can be created with the VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT set, and be bound to memory allocated with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT. Tiling hardware typically has limitations on the number and type of attachments which can be kept in flight concurrently, so despite this optimization, it is possible that implementations will have to spill intermediate results to main memory.
More complex arrangements of subpasses are possible. If an attachment is not used during a subpass, but is needed in previous and subsequent subpasses, the attachment should appear in the pPreserveAttachments array of the subpass. Implementations can change the order in which subpasses are evaluated (while preserving dependencies) in order to reduce the need for spilling. In the above example, attachment 0 is not preserved, and the implementation may use the same internal tile memory for both it and the final output attachment. It is also possible to use multi-sampling with these approaches, but this complicates the intermediate read operations and may make it more likely that tilers will have to spill to external memory.
When multiple subpasses are in use, the driver needs to be told the relationship between them. A subpass can depend on operations which were submitted outside the current render pass, or be the source on which later rendering depends. Most commonly, the need is to ensure that the fragment shader from an earlier subpass has completed rendering (to the current tile, on a tiler) before the next subpass starts to try to read that data. An array of subpass dependencies - if there are any - is passed to VkRenderPassCreateInfo, defining a set of dependencies between "source" (the thing being waited on) and "destination" (the thing doing the waiting). Each subpass dependency is defined as follows:
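The structure itself is defined as:

```c
typedef struct VkSubpassDependency {
    uint32_t                srcSubpass;  /* index of the subpass being waited on,
                                            or VK_SUBPASS_EXTERNAL */
    uint32_t                dstSubpass;  /* index of the subpass doing the waiting */
    VkPipelineStageFlags    srcStageMask;
    VkPipelineStageFlags    dstStageMask;
    VkAccessFlags           srcAccessMask;
    VkAccessFlags           dstAccessMask;
    VkDependencyFlags       dependencyFlags;
} VkSubpassDependency;
```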
Typically, for dependencies between fragment writes and fragment shader reads, we might expect the following settings:
In the cases of our deferred lighting example, we have three subpasses, and we have dependencies between the first and second and between the second and third. That is, we need to set the dependencyCount member of our VkRenderPassCreateInfo to 2, and set the pDependencies member of our VkRenderPassCreateInfo to point to the following array:
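A sketch of that array, using the typical fragment-write-to-fragment-read settings described above; VK_DEPENDENCY_BY_REGION_BIT expresses that each region (tile) only depends on the same region of the earlier subpass, which is what lets a tiler keep the data on-chip:

```c
const VkSubpassDependency dependencies[2] = {
    { /* Subpass 1 reads what subpass 0 wrote. */
      .srcSubpass      = 0,
      .dstSubpass      = 1,
      .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
      .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
      .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
      .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
      .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT },
    { /* Subpass 2 reads what subpass 1 wrote. */
      .srcSubpass      = 1,
      .dstSubpass      = 2,
      .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
      .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
      .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
      .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
      .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT },
};
```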
As described above, when recording to a VkCommandBuffer, vkCmdBeginRenderPass() and vkCmdEndRenderPass() wrap the render pass operations. After vkCmdBeginRenderPass() is called, subsequent commands are applied to the first subpass within the render pass.
To move operations to subsequent subpasses, vkCmdNextSubpass() should be called; it takes the command buffer and a VkSubpassContents value, just as vkCmdBeginRenderPass() does. Each call moves recording on to the next subpass index, in increasing order, until vkCmdEndRenderPass() is called. Synchronization of accesses to attachments described in the subpass dependencies is handled automatically.
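For the three-subpass deferred lighting example, recording inline might look like this sketch (cmd is an assumed VkCommandBuffer and beginInfo an assumed VkRenderPassBeginInfo; the draw commands are placeholders):

```c
vkCmdBeginRenderPass(cmd, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);
/* ... subpass 0: G-buffer geometry ... */
vkCmdNextSubpass(cmd, VK_SUBPASS_CONTENTS_INLINE);
/* ... subpass 1: light volumes ... */
vkCmdNextSubpass(cmd, VK_SUBPASS_CONTENTS_INLINE);
/* ... subpass 2: final forward shading ... */
vkCmdEndRenderPass(cmd);
```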
In SPIR-V, the contents of an input attachment can be accessed with the OpImageRead operation, with an OpTypeImage that has a dim argument of SubpassData. The coordinate argument of the OpImageRead must be (0,0), and corresponds to accessing the input attachment at the current fragment location. When multi-sampling, the sample operand to OpImageRead can be used to access separate samples at the current fragment.
In GLSL, this functionality is exposed through the subpassLoad() function, with subpassInput types for the subpasses.
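A minimal fragment shader sketch; the input_attachment_index, set, and binding values are illustrative and must match the pipeline layout and the subpass's pInputAttachments:

```glsl
#version 450

/* Hypothetical binding: input attachment 0 of the current subpass. */
layout (input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput inputColor;

layout (location = 0) out vec4 outColor;

void main()
{
    /* subpassLoad() always reads at the current fragment location. */
    outColor = subpassLoad(inputColor);
}
```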
The Vulkan API acknowledges the fact that modern rendering techniques may perform multiple passes over the same image data, and is designed to ensure that these approaches are explicitly and efficiently supported on modern graphics hardware. The unfortunate consequence of this expressivity is the complexity of the description and the verbosity of simple examples, although the overhead in a practical, optimized renderer should be less significant.
In Vulkan, the render pass is an explicit concept within which rendering operations execute. A VkFramebuffer, with a list of associated attachments, is associated with the render pass when rendering work is recorded into a VkCommandBuffer. The render pass is divided into one or more subpasses, with explicitly-defined interactions between them. The explicitly-configured VkRenderPass object can be reused between rendering operations, which limits the impact of this verbosity on real-world, complex applications. Providing this additional information to the driver can significantly reduce memory overhead, especially on tiled architectures, without the unpredictability of the heuristics applied to achieve good performance in more traditional APIs.
A simplified version of the content of this article may be found in a presentation on the subject at a UK developer event.