The Challenges of Porting Traha to Vulkan
Porting a game to Vulkan can be a difficult challenge! In this article, we’re going to discuss some of the challenges and optimizations required to bring functional and performant Vulkan support to Traha, an open-world Korean MMO developed by Moai Games in Gangnam, Seoul.
At the start of this project, Unreal Engine 4 provided a base level of Vulkan support, and through Samsung’s collaboration with Epic Games, the feature set and efficiency of the Vulkan implementation continues to improve all of the time!
We will be discussing some of the important engine changes, optimizations, bug fixes and driver workarounds that helped power this game, providing advice that you can take forward into your own projects and engines. These tips and tricks will primarily focus on improving efficiency and performance on mobile platforms, with specific reference to our partner GPU vendors, Arm and Qualcomm.
Baseline Performance Analysis
The first step with any new project is to get an idea of where we currently are. In this instance, the focus was on using Vulkan to unlock the full potential of the game. The baseline performance check was promising, with Vulkan already offering a performance gain of nearly 40% over GLES. However, there is always room for improvement, and with the goal of pushing overall performance even further, work began!
Initial Project Investigation
Along with establishing performance, there are a number of other areas that need investigating. This initial research provides insight into the extent of work required, and helps us build a roadmap for where the project needs to go:
Vulkan Validation Layers: As the API does not provide any native error-checking, we need to ensure that the game is thoroughly tested with validation enabled. This helps us identify would-be issues before they become a problem.
Device Compatibility: In order to get an idea of the performance budget and identify any device-specific issues, the game needs to be tested on a range of different devices.
Content Pipeline (Asset Creation and Conditioning): Many optimisations in these large projects require content changes. Gaining insight and understanding of the studios’ content pipeline is essential for ensuring an effective collaboration.
Distribution Requirements and Timeline: Similar to the above, understanding any limitations is important as well. This could influence later decisions such as shader caching or texture compression options.
Optimising the UE4 Vulkan Implementation
[Vulkan Tips] Descriptor Set Management
Descriptor Sets are a collection of bindings, used to create resource mappings between shaders and their backing resources, such as uniform buffers and image samplers. Managing these in Vulkan can be challenging! Orchestrating the correct system that minimises CPU overhead, improves cache efficiency and maximises GPU throughput for your given application can be achieved with a whole host of different approaches.
In Traha, it was clear that the number of per-frame invocations of vkUpdateDescriptorSets(..) was excessive and was having a noticeable impact on performance. There were a total of 220 calls, taking 3.554ms inside the driver, executing in the hot path every frame. Similarly, vkAllocateDescriptorSets(..) could also prove problematic at run-time.
The Arm Mali Best Practices Guide suggests avoiding calls to vkAllocateDescriptorSets in the hot path. This avoids not only excess copying of data and memory-management overhead, but also resource contention when accessing the backing memory (as Descriptor Pools are not pooled on Mali, but share the same memory).
Other tips include:
• Prefer reusing already-allocated descriptor sets, rather than updating them with the same information every time
• Prefer DYNAMIC_OFFSETS to updating descriptor sets for uniform buffers (UBOs) and storage buffers (SSBOs) if you plan on binding the same buffer with different offsets
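One way to act on the reuse advice above is to cache descriptor sets by the resources they bind. The following is only a minimal sketch of that idea, not the UE4 implementation: DescriptorSetCache, BindingKey and the integer DescriptorSetHandle are hypothetical stand-ins, and the comment marks where the real vkAllocateDescriptorSets and vkUpdateDescriptorSets calls would sit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for a VkDescriptorSet handle (hypothetical; a real engine stores the Vulkan handle).
using DescriptorSetHandle = std::uint32_t;

// Hypothetical cache key: the ids of the buffers/images bound at each slot of the set.
using BindingKey = std::vector<std::uint64_t>;

class DescriptorSetCache {
public:
    // Returns a set for these bindings, allocating and writing one only on a cache miss.
    DescriptorSetHandle acquire(const BindingKey& key) {
        assert(!key.empty());
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;  // Hit: no allocation and no update in the hot path.
        }
        // Miss: this is where vkAllocateDescriptorSets and vkUpdateDescriptorSets would run.
        DescriptorSetHandle handle = nextHandle_++;
        ++allocations_;
        cache_.emplace(key, handle);
        return handle;
    }

    // Number of cache misses so far, i.e. how many sets had to be created.
    std::size_t allocations() const { return allocations_; }

private:
    std::map<BindingKey, DescriptorSetHandle> cache_;
    DescriptorSetHandle nextHandle_ = 1;
    std::size_t allocations_ = 0;
};
```

With a cache like this, a draw call that binds the same resources as an earlier one pays only for a map lookup. A real cache would also need an eviction policy and must invalidate entries when a bound resource is destroyed, which this sketch leaves out.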
Redesigning the Descriptor Set Management Model
With this in mind, the Vulkan DescriptorSet model was re-engineered to instead make use of Dynamic Offsets for uniform buffer data. This allows a descriptor set to be prepared once, mapping shader uniform data to a specific Vulkan buffer, rather than a specific location within that buffer. With dynamic offsets, we can then specify the offsets into that uniform buffer when we call vkCmdBindDescriptorSets(), rather than needing to update or create a new DescriptorSet object each time we change the binding offsets (which would commonly happen when executing a draw call with new uniform data).
The advantage of this model is that we only need to prime descriptor sets for uniform data once: we can then make each buffer binding, within a descriptor set, point to different memory locations in our buffer at bind time, without having to update the offset with costly vkUpdateDescriptorSets or vkAllocateDescriptorSets calls.
This diagram below demonstrates the reduced overhead, with the diagram on the right showing the new system:
Setting Up Dynamic Offsets
There are a few requirements for setting up dynamic offsets:
- Create a large VkBuffer which will be used to store the uniform data for multiple subsequent draw calls (e.g. object transforms, material properties) – created with usage flags: VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT.
- Create the VkDescriptorSetLayout with a VkDescriptorSetLayoutBinding of type VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC.
- Allocate a descriptor set from a pool and update it once, setting the base offset to 0.
- Use vkCmdBindDescriptorSets(..) and assign a dynamic offset at bind time, pointing to the desired location of the uniform data within the previously allocated VkBuffer.
Note: The final offset into a buffer is the base offset, set in the DescriptorSet object, plus the specified dynamic offset.
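The steps above leave one piece of bookkeeping to the engine: deciding where in the large buffer each draw call's uniform data lives. A minimal sketch of that bookkeeping follows, assuming a hypothetical UniformRing class that is reset once per frame. It models only the offset arithmetic, so no Vulkan headers are needed, and synchronisation with the GPU is out of scope.

```cpp
#include <cassert>
#include <cstdint>

// Rounds value up to the next multiple of alignment (alignment must be a power of two).
inline std::uint32_t alignUp(std::uint32_t value, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical linear allocator over one large uniform VkBuffer.
class UniformRing {
public:
    UniformRing(std::uint32_t capacityBytes, std::uint32_t minOffsetAlignment)
        : capacity_(capacityBytes), alignment_(minOffsetAlignment), head_(0) {}

    // Reserves space for one draw call's uniform data and returns the
    // dynamic offset to supply when the descriptor set is bound.
    std::uint32_t allocate(std::uint32_t sizeBytes) {
        std::uint32_t offset = alignUp(head_, alignment_);
        assert(offset + sizeBytes <= capacity_);  // This sketch does not wrap or grow.
        head_ = offset + sizeBytes;
        return offset;
    }

    // Called once the GPU can no longer be reading the data (assumed: once per frame).
    void reset() { head_ = 0; }

private:
    std::uint32_t capacity_;
    std::uint32_t alignment_;
    std::uint32_t head_;
};
```

Each value returned by allocate() would be passed through the pDynamicOffsets parameter of vkCmdBindDescriptorSets. The alignment passed to the constructor is assumed to come from VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment, since dynamic uniform buffer offsets must be a multiple of that limit.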
Find more information on Vulkan Optimizations here, and check out our session from GDC 2019 on bringing maximum boost to mobile games (video).
Given the significant overhead of ~3.5ms, implementing this change had a respectable impact on performance. We can see from the CPU profile below that we can eliminate all 220 redundant calls to vkUpdateDescriptorSets by collapsing the same behaviour into the dynamic offset parameter of vkCmdBindDescriptorSets, which we are already invoking anyway!
This means that the full 3.5ms is saved (frame time falls from roughly 27ms to roughly 23.5ms), which equates to an increase from 37 to 42 FPS for this scene in Traha! The great advantage of these engine-level changes is that they can benefit a whole host of other projects too.
[Optimization] Pending Render Passes
Given that Vulkan is an explicit API, the hardware and driver will execute exactly the workload you give them, and will not spend overhead on additional optimisations that you may not have intended. As a result, the onus is on you to get things right.
We’re going to focus next on a “Lazy Evaluation” approach to use of the API.
As a fully dynamic game engine, Unreal Engine 4 has the responsibility of remaining adaptive. Internal, API-agnostic rendering commands are translated into Vulkan API calls through the Render Hardware Interface thread (Vulkan RHI). Sometimes, however, high-level engine draw commands are not the most efficient, leading to immediate overhead and, in quite a few cases, redundant operations:
We identified that in certain cases, the RHI would generate empty render passes. In an API like OpenGL ES, the significance of this would be negligible; however in Vulkan, the hardware will still go through the process and setup of moving memory around to prepare for rendering. This level of control is ideal when we need to squeeze the most out of the hardware, but isn’t great when applications generate unintended and wasteful operations.
The good news is that the solution to this problem is simple! When we receive a call to begin a render pass, we do not issue the Vulkan API command vkCmdBeginRenderPass() immediately. Instead, we wait to see whether any commands such as draw calls come through before the request to end the Render Pass arrives.
To achieve this, we store the requested information as a Pending Render Pass: this is essentially just a snapshot of the data we would have issued as part of the vkCmdBeginRenderPass() command, with a few other bits of contextual information that we may be able to use for other optimisations. If a draw call then comes, we can then begin our render pass at that point! This is a classic case of Lazy Evaluation: don’t do the work until you know that it’s going to be required.
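The deferral described above can be sketched in a few lines. This is an illustration only, with hypothetical names (LazyRecorder, RenderPassInfo) and strings standing in for recorded Vulkan commands. It also assumes that an empty pass has no side effects worth keeping, such as a clear of its attachments.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical snapshot of what would have been passed to vkCmdBeginRenderPass().
struct RenderPassInfo {
    std::string name;
};

class LazyRecorder {
public:
    // Remember the request, but do not record anything yet.
    void beginRenderPass(const RenderPassInfo& info) {
        pending_ = info;
        hasPending_ = true;
    }

    // The first draw proves the pass is needed, so the deferred begin is emitted now.
    void draw(int vertexCount) {
        flushPendingBegin();
        assert(inPass_);  // Draws are only valid inside a render pass.
        commands_.push_back("Draw " + std::to_string(vertexCount));
    }

    void endRenderPass() {
        if (hasPending_) {
            hasPending_ = false;  // Empty pass: neither begin nor end is recorded.
            return;
        }
        if (inPass_) {
            commands_.push_back("EndRenderPass");
            inPass_ = false;
        }
    }

    const std::vector<std::string>& commands() const { return commands_; }

private:
    void flushPendingBegin() {
        if (!hasPending_) return;
        commands_.push_back("BeginRenderPass " + pending_.name);
        hasPending_ = false;
        inPass_ = true;
    }

    RenderPassInfo pending_;
    bool hasPending_ = false;
    bool inPass_ = false;
    std::vector<std::string> commands_;
};
```

Recording an empty pass followed by a pass with one draw yields only three commands: the begin, the draw and the end of the second pass. The empty pass leaves no trace in the command list.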
After the change, the generated command buffer may look something like this:
The Arm Mali Best Practices Guide provides a few tips on optimizing render passes:
• Avoid vkCmdClearColorImage() and vkCmdClearDepthStencilImage(), and instead clear attachments at the start of a render pass.
• Ensure you only use LOAD_OP_LOAD and STORE_OP_STORE when they are entirely essential. Wasted operations can increase external memory bandwidth requirements.
In Traha, this was particularly problematic on the UI, with individual UI components requiring their own Render Pass. These passes would often appear empty if the object did not need to be rendered. This is a captured frame from the lobby screen:
Other Render Pass Suggestions
With an engine like UE4, we have the advantage of having both highly abstracted and contextual information about the work we want to achieve. This higher abstraction can make opportunities for certain other Render Pass optimisations easier to spot. A few ideas for other things we can do:
Render Pass Merge & Collapse : If a subsequent Render Pass shares the same attachments (input render targets) as a previous pass, then we can eliminate the need to split a Render Pass. This optimisation would also benefit the case above where the lobby screen rendering was split into a series of subsequent Render Passes which shared the same input attachments.
Optimize Load/Store : We can find out whether attachments have previously been rendered to or are needed at later stages during rendering. This determines whether we need to LOAD or STORE the associated attachment data. This is a more important topic that will be covered in more detail in other articles, but can have significant implications for bandwidth utilisation and therefore performance.
Pending Render Target Clear Operations : Similarly, if we need to clear a render target, then the RHI will issue a clear command. We often find that when this happens, the render target isn’t ever used until it is needed in a Render Pass, at which point, we can completely ignore the clear command and simply optimise the attachment load operation to LOAD_OP_CLEAR, providing an immediate performance improvement.
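The load/store and pending-clear ideas above can be combined into one small decision function. The sketch below is hypothetical: AttachmentTracker, TargetId and the two enums are stand-ins that mirror VK_ATTACHMENT_LOAD_OP_* and VK_ATTACHMENT_STORE_OP_*, and the two boolean inputs are assumed to come from the engine's knowledge of the frame.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

enum class LoadOp { Load, Clear, DontCare };  // Mirrors VK_ATTACHMENT_LOAD_OP_*.
enum class StoreOp { Store, DontCare };       // Mirrors VK_ATTACHMENT_STORE_OP_*.

using TargetId = std::uint32_t;  // Hypothetical render target id.

struct AttachmentOps {
    LoadOp load;
    StoreOp store;
};

class AttachmentTracker {
public:
    // Called where a standalone clear command would have been issued; nothing is recorded yet.
    void requestClear(TargetId target) { pendingClears_.insert(target); }

    // Called when a render pass that writes to the target is about to begin.
    AttachmentOps opsForPass(TargetId target, bool needsPreviousContents, bool readLater) {
        AttachmentOps ops;
        if (pendingClears_.erase(target) > 0) {
            ops.load = LoadOp::Clear;  // Fold the deferred clear into the pass itself.
        } else {
            ops.load = needsPreviousContents ? LoadOp::Load : LoadOp::DontCare;
        }
        // Only write the attachment back to memory if something reads it afterwards.
        ops.store = readLater ? StoreOp::Store : StoreOp::DontCare;
        return ops;
    }

private:
    std::set<TargetId> pendingClears_;
};
```

A complete version would also have to issue a real clear if the target is sampled or copied before any render pass uses it, since the deferred clear would otherwise never happen.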
[Optimization] Pending Pipeline Barriers
Another observation in Traha was the opportunity for pipelining improvements. From this GPU trace, captured using Streamline, we can see lots of Pipeline Bubbles in the fragment activity. This inefficiency is caused by the misuse of pipeline barriers. Barriers are used to define execution dependencies between pieces of work, and here they introduce dependencies where none are needed.
Instead, pipeline barriers and their dependent stages can be merged together. Merging is achieved by tracking any Pending Pipeline Barriers: we list all barrier commands that are issued until the moment they are needed (if any action command e.g. draw call, copy image etc. is issued) at which point we can optimise them in a number of ways:
Grouping vkCmdPipelineBarrier commands for multiple barriers into one call, reducing system overhead. Used when each independent barrier is still needed and represents a different dependency.
Collapsing and merging multiple pipeline barriers with shared dependencies into a single vkCmdPipelineBarrier call, where the srcStageMasks and dstStageMasks of each are collated. This reduces the number of small dependencies, reducing the total occurrence of gaps, “bubbles”, within the fragment pipeline.
With these changes made, we get a much cleaner trace and reduce overall frame time:
Eliminating redundant image layout transitions which alter the representation and usage of an image, allowing the hardware to optimise its storage. For example, an image being used as a render target needs to be in layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL, allowing the hardware to optimise store-only cache utilisation. Changing layout unnecessarily will result in an invalidation or flush of either tile memory or the texture cache, reducing the efficiency of subsequent work. We also avoid any execution dependencies that these transitions imposed, as they will no longer be necessary. There are two cases that are covered by this system:
Indirect Layout Transition: This is where an image is transitioned into an intermediate layout before its final layout, but no action command (Render, Clear, Copy) is executed in between layout transitions:
#1: VK_IMAGE_LAYOUT_GENERAL -> VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
#2: VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL -> VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
(Optimal - only need one)
#1: VK_IMAGE_LAYOUT_GENERAL -> VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
Cyclic Layout Transition: This is where an image is transitioned into a given layout and back again with no action command (Render, Clear, Copy) executing in between layout transitions:
#1: VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL -> VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
#2: VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL -> VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
(Optimal - Delete Both Layout transitions)
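To make the pending-barrier approach concrete, here is a minimal sketch that collects barriers and resolves them only when an action command is about to be recorded. All names (PendingBarriers, Barrier, MergedBarrier, the Layout enum) are hypothetical stand-ins for the Vulkan structures. The sketch models only barriers that were issued for layout transitions, and it collates stage masks with a simple bitwise OR, which is conservative rather than minimal.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class Layout { General, ColorAttachment, ShaderReadOnly };  // Stand-ins for VK_IMAGE_LAYOUT_*.

struct Barrier {
    std::uint32_t image;      // Hypothetical image id.
    Layout oldLayout;
    Layout newLayout;
    std::uint32_t srcStages;  // Bitmask, in the role of srcStageMask.
    std::uint32_t dstStages;  // Bitmask, in the role of dstStageMask.
};

// What a single merged vkCmdPipelineBarrier call would receive.
struct MergedBarrier {
    std::uint32_t srcStages = 0;
    std::uint32_t dstStages = 0;
    std::vector<Barrier> imageBarriers;
};

class PendingBarriers {
public:
    // Called instead of recording the barrier immediately.
    void add(const Barrier& b) { pending_.push_back(b); }

    // Called just before an action command (draw, copy, clear) is recorded.
    MergedBarrier flush() {
        // Chain the transitions of each image: keep the first old layout and the last new layout.
        std::map<std::uint32_t, Barrier> perImage;
        for (const Barrier& b : pending_) {
            auto it = perImage.find(b.image);
            if (it == perImage.end()) {
                perImage.emplace(b.image, b);
            } else {
                assert(it->second.newLayout == b.oldLayout);  // Transitions must form a chain.
                it->second.newLayout = b.newLayout;           // Indirect case: skip the middle layout.
                it->second.srcStages |= b.srcStages;
                it->second.dstStages |= b.dstStages;
            }
        }
        pending_.clear();

        MergedBarrier merged;
        for (const auto& entry : perImage) {
            const Barrier& b = entry.second;
            if (b.oldLayout == b.newLayout) continue;  // Cyclic case: drop both transitions.
            merged.srcStages |= b.srcStages;
            merged.dstStages |= b.dstStages;
            merged.imageBarriers.push_back(b);
        }
        return merged;
    }

private:
    std::vector<Barrier> pending_;
};
```

If flush() returns no image barriers, no barrier call needs to be recorded at all. Otherwise the result maps onto one vkCmdPipelineBarrier call whose stage masks are the collated values.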
Resolving Hardware Issues
Often with these complex projects, it is inevitable that we will run into some hardware-specific issues. These are often the result of errors in the driver and are a challenging reality of mobile development. We will always attempt to work-around these issues where possible, in order to preserve the developer’s true intention, without having to sacrifice features or significant performance as a result.
We work closely with our hardware partners to arrive at practical solutions and also ensure that these issues get resolved in future updates and device releases.
Mali Transaction Elimination
Arm Mali Midgard and Bifrost GPU architectures have a bandwidth-saving feature known as Transaction Elimination.
This feature is designed to reduce the transfer of on-tile data to main memory if the contents on that tile have not changed.
However, there can sometimes be problematic instances which result in rendering artefacts. These black squares represent tiles that should have been written back to main memory, but have instead been discarded. This was a corner case in which one of the previously discussed render pass optimisations for vkCmdClearColorImage caused a side effect, resulting in tiles being incorrectly flagged as unmodified. A workaround to force correct tile data was added in this instance: when the problem case arises, the earlier clear optimisation is reverted and a double clear is performed.
Take-away: Hardware-level optimisations can sometimes cause problems in unforeseen edge cases. Thorough testing and QA can help identify these situations and once the cause is understood, actions can be taken to avoid hitting the problem-case.
Adreno Update Descriptor Set Crash
Unfortunately, one of the optimisations for descriptor sets also resulted in problems on some Adreno-based devices. The cause was unclear, as the Vulkan behaviour and implementation were correct; the problem was most likely down to an issue within the driver. As a result, this specific optimisation had to be disabled on devices with problematic drivers.
Conclusion : Vulkan vs GLES Performance
The results for this Samsung GameDev support camp were very positive, with the performance of the Vulkan implementation being rock-solid. It not only provided a substantial 55% improvement over GLES, but also raised the core performance of the project.
The performance benefits were mostly seen in CPU-bound situations, and in situations where the GPU was also being under-utilised. This was most prevalent when the game was played in HIGH or MEDIUM settings. When playing in the highest settings, the main limiting factor became the GPU, primarily due to high fragment loads, so the difference between the APIs was not as significant.
Below are some of the performance results from a 1-minute idle test on the scene above:
This graph is the clearest proof you could ever want for the value of using Vulkan and for working with the Samsung Galaxy GameDev team. We’ve moved from the initial, pretty satisfying, 40 frames a second GLES version to an astounding Vulkan experience which spends most of its time locked at the phone’s refresh rate and all with no reduction in quality. Simply put – that’s as good as it gets!
Thanks to all of the GameDev engineers involved for their hard work and dedication on this project!
Sergey Pavliv, Fedir Nekrasov, Junsik Kong, Joonyong Park, Vadym Kosmin, Dohyun Kim, Oleksii Vasylenko, Seunghwan Lee, Inae Kim, Sangmin Lee, Munseoung Kang, Michael Parkin-White