Jet Set Vulkan : Reflecting on the move to Vulkan
Razvan Baraitaru | Atypical Games LLC
Calum Shields | Samsung Research UK (Galaxy GameDev)
The Vulkan API, developed by the members of The Khronos Group, is the first future-proof graphics API designed for cross-platform applications. Its low-level architecture lets developers take explicit control of parts of the graphics stack that OpenGL never exposed. Atypical Games, with support from Samsung, took on the task of implementing Vulkan support in their engine. In this blog, Razvan shares his experiences porting to Vulkan, and later Calum discusses some of the issues that cropped up in the Android implementation of “Sky Gamblers: Infinite Jets”.
Our engine has support for various APIs that we used in our games:
- OpenGL ES 2.0/3.0 on Android and Nintendo Switch
- OpenGL 4.0 on Windows (production only)
- Metal on iOS and macOS
The shading language is mostly composed of standard GLSL ES 3.0 features, which keeps the code portable. We have an intermediate language, based on GLSL, and an internal tool for exporting the shaders to all the shading languages.
The API calls are wrapped in an interface with separate implementations for OpenGL and Metal.
There are numerous reasons for choosing Vulkan. Primarily it creates an opportunity for better performance and compatibility across many of our target platforms - Android, Nintendo Switch and Windows. Also, we had in mind some possible future Stadia games where Vulkan is the only graphics API.
In the end, it became clear that implementing Vulkan would actually replace OpenGL on every platform other than Apple's.
With full support from Samsung we started implementing Vulkan into our engine. Samsung helped us to profile and find the potential bottlenecks in our code. This was especially useful for the Android build where the device fragmentation was our main concern (some of you may have seen the chart that shows there are over 20,000 different Android configurations in the wild).
Some good tutorials that I used as references were Sascha Willems' examples and demos, which cover the API extensively across multiple platforms and make a great starting point. At the time of writing there is also a collection of Vulkan samples officially provided by Khronos.
Due to the similarities with Metal, we chose MoltenVK on macOS as the main porting target. The Metal version was already running and we took advantage of the IDE integrated GPU debugging.
MoltenVK is the official library that allows Vulkan applications to run on top of Metal on Apple's macOS and iOS. SPIR-V shaders are automatically converted to Metal and the overhead is minimal. There were some issues while we were porting; specifically, compressed floating-point texture formats were not renderable, so we used 16-bit floating-point formats instead until MoltenVK added support for them.
Our engine has a series of enumerations that need to be implemented for each graphics API. These include all the types, formats and rasterization modes found in OpenGL and Metal. To name a few of them:
- Texture Formats
- Texture Address Modes
- Draw Modes
- Blend Modes
- Compare Functions
- Stencil Ops
They are pretty straightforward to define and I will present an example with different API implementations.
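As a stand-in for that example, here is a sketch of how one such enumeration might be translated for each API. The engine-side `CompareFunc` enum and the mapping functions are hypothetical names of ours; the Vulkan and OpenGL constants are defined locally (with their real `vulkan_core.h` / `gl3.h` values) so the fragment is self-contained:

```cpp
// Engine-side compare-function enum (hypothetical name, for illustration).
enum class CompareFunc {
    Never, Less, Equal, LessEqual, Greater, NotEqual, GreaterEqual, Always
};

// Local stand-ins for the per-API constants so the sketch compiles on its own;
// real code would pull these from <vulkan/vulkan.h> and <GLES3/gl3.h>.
enum VkCompareOp {
    VK_COMPARE_OP_NEVER = 0,
    VK_COMPARE_OP_LESS = 1,
    VK_COMPARE_OP_EQUAL = 2,
    VK_COMPARE_OP_LESS_OR_EQUAL = 3,
    VK_COMPARE_OP_GREATER = 4,
    VK_COMPARE_OP_NOT_EQUAL = 5,
    VK_COMPARE_OP_GREATER_OR_EQUAL = 6,
    VK_COMPARE_OP_ALWAYS = 7,
};
enum GLCompare : unsigned {
    GL_NEVER = 0x0200, GL_LESS = 0x0201, GL_EQUAL = 0x0202, GL_LEQUAL = 0x0203,
    GL_GREATER = 0x0204, GL_NOTEQUAL = 0x0205, GL_GEQUAL = 0x0206, GL_ALWAYS = 0x0207,
};

// One translation function per backend.
VkCompareOp ToVulkan(CompareFunc f) {
    switch (f) {
        case CompareFunc::Never:        return VK_COMPARE_OP_NEVER;
        case CompareFunc::Less:         return VK_COMPARE_OP_LESS;
        case CompareFunc::Equal:        return VK_COMPARE_OP_EQUAL;
        case CompareFunc::LessEqual:    return VK_COMPARE_OP_LESS_OR_EQUAL;
        case CompareFunc::Greater:      return VK_COMPARE_OP_GREATER;
        case CompareFunc::NotEqual:     return VK_COMPARE_OP_NOT_EQUAL;
        case CompareFunc::GreaterEqual: return VK_COMPARE_OP_GREATER_OR_EQUAL;
        case CompareFunc::Always:       return VK_COMPARE_OP_ALWAYS;
    }
    return VK_COMPARE_OP_ALWAYS;  // unreachable
}

unsigned ToOpenGL(CompareFunc f) {
    switch (f) {
        case CompareFunc::Never:        return GL_NEVER;
        case CompareFunc::Less:         return GL_LESS;
        case CompareFunc::Equal:        return GL_EQUAL;
        case CompareFunc::LessEqual:    return GL_LEQUAL;
        case CompareFunc::Greater:      return GL_GREATER;
        case CompareFunc::NotEqual:     return GL_NOTEQUAL;
        case CompareFunc::GreaterEqual: return GL_GEQUAL;
        case CompareFunc::Always:       return GL_ALWAYS;
    }
    return GL_ALWAYS;  // unreachable
}
```

The same pattern repeats for texture formats, blend modes, stencil ops and so on: one engine enum, one flat translation function per backend.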
For this, we needed to re-work our shader exporting tool. We chose to keep GLSL as the base shaders and adjust them to fit SPIR-V requirements: layout/set bindings and uniform buffer semantics were added.
Because there is no runtime reflection in Vulkan, we had to keep track of every binding and attribute at this stage. Later on, we used them to create pipeline layouts.
For pipeline layouts we first did the classic sorting of uniforms based on the update frequency:
1. SceneData (changes once per level load, or when the sun moves)
2. FrameData (changes once every frame, eg. camera position/direction)
3. TextureBindings (can change every draw call)
4. Custom shader parameters (can change every draw call)
5. BonesMatrix array (special uniform buffer for matrix palettes)
Doing this allows for clean uniform memory management with as close to zero unnecessary updates as possible.
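One way such frequency classes might map onto Vulkan descriptor set indices can be sketched as follows. The enum, function and set numbers here are our illustration, not the engine's actual layout; note also that mobile hardware often only guarantees `maxBoundDescriptorSets == 4`, so in this sketch the bone palette shares the per-draw set:

```cpp
// Hypothetical update-frequency classes matching the list above.
enum class UniformClass { SceneData, FrameData, TextureBindings, CustomParams, Bones };

// Illustrative layout: slow-changing data in low set numbers, per-draw data in
// high ones, so the low sets can stay bound across many draw calls. Many
// mobile GPUs only guarantee four bound descriptor sets, so the bone palette
// shares set 3 here (e.g. as a dynamic uniform buffer).
int DescriptorSetIndex(UniformClass c) {
    switch (c) {
        case UniformClass::SceneData:       return 0;  // rebuilt on load
        case UniformClass::FrameData:       return 1;  // updated once per frame
        case UniformClass::TextureBindings: return 2;  // may change per draw
        case UniformClass::CustomParams:    return 3;  // per draw
        case UniformClass::Bones:           return 3;  // per draw, dynamic offset
    }
    return -1;  // unreachable
}
```

Binding sets in this order means a draw call that only changes textures and custom parameters never has to rebind (or re-upload) the scene and frame data.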
Here's an example of what a base shader looks like in our engine (CausticRays):
The programmer can define both the vertex and the fragment shader in the same file. Each stage's arguments are declared at the beginning, and "Base.glsl" switches each of them on and off with the appropriate shading-language syntax.
For this shader we need the frame data and an animation parameter for the vertex shader and the first texture for the fragment shader.
The following is the exported GLSL code:
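The exported shader itself appeared as an image in the original post. As a stand-in, a fragment-stage excerpt of Vulkan-flavoured GLSL with the added layout/set bindings might look like this (names, set and binding numbers are illustrative, not the engine's actual output):

```glsl
#version 450

// Per-frame uniform block and first texture binding, with explicit
// set/binding decorations as required for SPIR-V (illustrative numbering).
layout(set = 1, binding = 0) uniform FrameData {
    mat4 viewProjection;
    vec4 cameraPosition;
} frame;

layout(set = 2, binding = 0) uniform sampler2D texture0;

layout(location = 0) in vec2 vTexCoord;
layout(location = 0) out vec4 outColor;

void main() {
    outColor = texture(texture0, vTexCoord);
}
```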
A script then sends these to glslangValidator to generate the SPIR-V.
Both Vulkan and Metal share the concept of render passes with clear or load at the beginning and store or resolve at the end. The twist in Vulkan comes from adding explicit execution and memory dependencies, causing a lot of confusion among programmers who are familiar with OpenGL. We will go into more detail on how we handled this later.
We had to use dynamic viewport and scissor states because our menu rendering relies on frequent changes to them.
Finally, we integrated AMD's Vulkan Memory Allocator for proper memory management of our buffer and texture allocations. That's 18K lines of quality code that's easy to use.
Validation Layer
The validation layer is designed to detect API misuse. We had about 100 errors on our first run and, in theory, if you solve all of them you will have a working build. The layer also tracks objects to find resource leaks and checks for thread safety. The latest builds will now also warn you of sub-optimal API use, even if it is technically correct as far as the spec is concerned.
Some of the common errors we had:
- Image formats in pipeline descriptors
- Image layouts in render pass descriptors
- Required vertex attributes not provided
- Missing requested features: samplerAnisotropy, depthClamp
Running Vulkan - Nintendo Switch
Since we already had a working Vulkan engine, it was easy to integrate it into our latest project, Titan Glory - Mech Combat. The only addition needed was the Nintendo Switch-specific surface extension.
With an NVIDIA GPU, an extra 2% improvement was achieved just by activating the VK_KHR_get_memory_requirements2 and VK_KHR_dedicated_allocation extensions in Vulkan Memory Allocator (#define VMA_DEDICATED_ALLOCATION 1 before including vk_mem_alloc.h).
There's not much info about VK_KHR_dedicated_allocation, but it turns out that some buffers and images (probably bigger ones), combined with specific GPUs, can benefit from having their own dedicated allocation. A single VkDeviceMemory object is created to fit just that one big resource, allowing the driver to make some hidden optimizations.
Compared to the OpenGL version we observed a 20% performance improvement on this platform using Vulkan. That’s seriously impressive!
Running Vulkan - Windows
This version is currently used for development only. We use SDL for window management and media layers. The current SDL version supports Vulkan and we only needed to add the specific Windows surface extension.
Having a Windows build helped us quickly solve the remaining Vulkan bugs. We found some synchronization issues that occurred because of the high framerate on PC but could also happen in the mobile versions.
Running Vulkan - Android
For the Android build we had to re-implement some of our classes using the NDK, because we had been using JNI with OpenGL. A Vulkan wrapper was added for best compatibility with early drivers. Samsung helped us optimize the build on their Galaxy devices.
Android Specific Optimisations
Porting a game to Vulkan can present unique challenges, especially when working with an engine originally designed around less explicit APIs such as OpenGL. Because of Vulkan's explicit design, the hardware and driver no longer spend effort on optimisations behind your back, so what worked before may not work for Vulkan, or may result in poorer performance. On the other hand, if you have an 'optimal' app it will run faster and with lower CPU overhead on Vulkan than on OpenGL.
Here we will go over optimisations that were made specifically for these reasons.
Mali Varying Buffer Limit
This is a driver limitation of Arm's Mali GPUs, whereby they can only allocate up to a maximum of 180MB for varying data. Realistically, the limit will very rarely be reached, but when it is it will cause a lot of frustration, because there is no elegant way of working around it. Arm has a blog post of its own on the subject if you would like to know more.
This technically affects GL as well, but the driver will work around the limitation for you, so you would likely never be aware of it. Vulkan, on the other hand, leaves it entirely in the hands of developers, with no easy way to discern how close to the limit you are. You will know immediately when you do go over it, however, which was the case for Infinite Jets, where geometry began to flicker.
As I already mentioned, there is no way to accurately track how much varying data is being used, leaving you to make an informed guess based on the number and size of the varyings used in a renderpass. The simplest solution, therefore, is to try reducing the precision of your varyings and/or to be more conservative with LODs, draw distances, etc.
If that isn’t enough, or you don’t want to compromise on your visuals, then your only option is incremental rendering. Basically, this involves keeping a rough track of how much memory is being used and, if you go over the limit, ending the renderpass with a storeOp of STORE and starting a new renderpass with a loadOp of LOAD - this is effectively what the GL driver does.
With Infinite Jets we tried using lower LOD levels and even outright removing objects to try and narrow down the culprit. This then led me to discover an unexpected cause…
Another limitation of the Mali driver is in how it handles indexed draws. Rather than allocating memory only for the indices you use, it allocates a range from the smallest index in the buffer to the largest. Thus, if you render a single triangle with one index positioned at the start of the buffer and another a thousand places in, then a thousand vertices' worth of varying memory will be allocated. At this point I'm sure you can see where this is going.
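A tiny helper makes the cost model concrete (our illustration of the behaviour described above, not driver code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Models the Mali allocation behaviour described above: varying memory is
// allocated for the whole index *range*, not just the indices actually used.
uint32_t AllocatedVertexRange(const std::vector<uint32_t>& indices) {
    if (indices.empty()) return 0;
    auto [lo, hi] = std::minmax_element(indices.begin(), indices.end());
    return *hi - *lo + 1;  // the driver allocates varyings for this whole span
}
```

A single triangle with indices {0, 1, 1000} therefore costs 1001 vertices of varying memory, while the same triangle with indices {500, 501, 502} costs only 3.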
Upon further inspection I realised that the game’s terrain rendering was using a square mesh that could have its complexity adjusted to meet the requirements of the height map. The problem, however, was that the indices were arranged in row by column order. So, no matter how many vertices were being used, the memory being allocated would be for the full range of the buffer.
After rearranging the vertices so the lowest LOD levels were at the front of the buffer, the flickering geometry disappeared.
Even better, we managed to kill two birds with one stone: there had been large spikes in the profile where vertex processing was observed, and these were now gone as well.
Tightly packing vertex indices is important for GPU efficiency.
vkQueuePresentKHR Blocking Render Thread
This is a curious issue that we have seen crop up a lot in games that are mostly CPU-bound: vkQueuePresentKHR takes an excessive amount of time to return because it is waiting on a fence to be signalled by the GPU driver. Below is a trace from Infinite Jets showing it taking over 16ms.
This happens because Android requires the driver to signal a fence once it’s done processing all the commands up to that point. The intent is to stop the CPU getting far ahead of the GPU but this can have the unfortunate consequence of reducing throughput.
The solution for this is to move vkQueue commands onto a separate thread so that command recording for the next frame isn’t delayed. Doing this for Infinite Jets yielded more than a 20% improvement in framerate on S10.
Something else to consider, however, is that queue commands are not thread-safe, so if you need to use them on the render thread (e.g. for transfer operations) it is worth considering a separate vkQueue for those; otherwise you may reintroduce stalls through mutexes.
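The present-thread pattern boils down to a single-producer worker queue. The sketch below is our illustration (not engine code): the render thread enqueues a closure that would perform the vkQueueSubmit/vkQueuePresentKHR for the finished frame, and a dedicated thread drains the queue, so a blocking present no longer stalls command recording:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Dedicated thread that runs queued "present" jobs in submission order.
class PresentThread {
public:
    PresentThread() : worker_([this] { Run(); }) {}
    ~PresentThread() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // drains any remaining jobs before returning
    }

    // Called from the render thread with a closure that performs the
    // vkQueueSubmit + vkQueuePresentKHR for the frame just recorded.
    void Enqueue(std::function<void()> presentJob) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(presentJob));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
            if (jobs_.empty()) {
                if (done_) return;
                continue;
            }
            auto job = std::move(jobs_.front());
            jobs_.pop();
            lock.unlock();
            job();  // may block on the driver's fence without stalling recording
            lock.lock();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // initialised last, after the members Run() touches
};
```

The per-frame resources captured by each closure (command buffers, semaphores, fences) must of course stay alive until the job has run, which the usual frames-in-flight bookkeeping already guarantees.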
Pipeline Barrier/Subpass Dependencies
Pipeline barriers are an important tool used for synchronisation between renderpasses. They specify execution dependencies between pipeline stages as well as the memory dependencies for images. Optimal usage is critical for performance because it allows the overlap of vertex and fragment work between renderpasses.
In Infinite Jets’ case, a naive dependency was being used for all renderpasses:
```cpp
VkSubpassDependency dependency{};
dependency.srcSubpass = VK_SUBPASS_EXTERNAL;
dependency.dstSubpass = 0;
dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.srcAccessMask = 0;
dependency.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
```
The problem here is that the code assumes the previous renderpass used the framebuffer images as colour attachments - but what if one of them was being read as a texture in the fragment stage? Furthermore, there is no dependency set after the renderpass to synchronise with a later renderpass that reads the attachments in the fragment stage. This kind of missing dependency won't affect performance, but it will introduce race conditions that can cause frame corruption and other visual artefacts.
The solution was to make better-informed decisions about how the dependencies are set: if an attachment is stored, it can be assumed that a subsequent pass will read it as a texture, and depth stage flags are added whenever depth attachments are loaded and stored.
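One possible shape for such a pair of dependencies is sketched below: an entry dependency covering the "attachment was sampled last pass" case, and an exit dependency making stored colour writes visible to later fragment-shader reads. This is our sketch, not the game's actual code; the Vulkan constants are defined locally (with their `vulkan_core.h` values) and `SubpassDependency` mirrors `VkSubpassDependency` so the fragment compiles on its own:

```cpp
#include <cstdint>

// Local stand-ins for the Vulkan constants used below (values match
// vulkan_core.h); real code would include <vulkan/vulkan.h>.
constexpr uint32_t VK_SUBPASS_EXTERNAL = ~0u;
constexpr uint32_t VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT         = 0x00000080;
constexpr uint32_t VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT = 0x00000400;
constexpr uint32_t VK_ACCESS_SHADER_READ_BIT                     = 0x00000020;
constexpr uint32_t VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT          = 0x00000100;

struct SubpassDependency {  // mirrors the fields of VkSubpassDependency
    uint32_t srcSubpass, dstSubpass;
    uint32_t srcStageMask, dstStageMask;
    uint32_t srcAccessMask, dstAccessMask;
};

// Entering the pass: the image we are about to write may have been sampled by
// the previous pass, so colour writes must wait for fragment-shader reads.
// A write-after-read hazard needs only an execution dependency, hence the
// empty srcAccessMask.
SubpassDependency EntryDependency() {
    return { VK_SUBPASS_EXTERNAL, 0,
             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
             VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
             0,
             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT };
}

// Leaving the pass: if the attachment is stored, assume a later pass samples
// it, so make colour writes visible to fragment-shader reads.
SubpassDependency ExitDependency() {
    return { 0, VK_SUBPASS_EXTERNAL,
             VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
             VK_ACCESS_SHADER_READ_BIT };
}
```

Keeping the stage masks this narrow is what lets the GPU overlap the next pass's vertex work with the current pass's fragment work, which is exactly the tiler-friendly behaviour the section above is after.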
Vulkan is a very flexible and efficient API. It may require more effort on the developer's part initially to get the best out of it, but it ends up being less work when shipping across multiple platforms. The most important aspect to keep in mind when optimising Vulkan is synchronisation: it is one of the biggest differences between Vulkan and less explicit APIs, and a common area of interest across all platforms.
The performance gains show up most clearly under heavy CPU usage, but even modest improvements come with reductions in CPU load, which matter a great deal for battery life on portable devices.
The figures in this table deserve some examination… The GPU load has risen only very slightly, because Vulkan is running at a nice steady 60 fps when GL was running at 56 fps. But the reduction in CPU load is very noteworthy. A game using 20% of the CPU is effectively running the CPU flat-out. That’s because if the CPU load is sustained at more than about 20% then the device will soon overheat. In that context, reducing the CPU load from roughly 13% to roughly 11% is actually most likely getting the power usage down by 10% - and that’s great for the user experience. Less heat, and longer battery life leads to happier players.
And now that we have the game running at 60 fps on a Galaxy Note 8 that means we have plenty of headroom for either better graphics or longer playing time on higher end devices like the Samsung Galaxy S20.