tutorials game, mobile
In this article, I would like to introduce a hardware optimisation technique called variable rate shading (VRS) and explain how it can benefit games on mobile phones.

Introduction

Traditionally, each pixel in a rendered image is shaded individually, meaning we can shade very high detail anywhere in the image, which, in theory, is great. In practice, however, this can lead to wasteful GPU calculations in areas where detail is less important. In some cases, you do not need 1x1 pixel shading to produce a high-quality image. For example, areas that represent unlit surfaces caused by shadows naturally contain less detail than brightly lit areas. Moreover, areas that are out of focus due to camera post-effects, or that are affected by motion blur, naturally do not contain high detail. In these cases we could benefit from letting multiple pixels be shaded by a single calculation (such as a 2x2 or 4x4 group of pixels) without losing any noticeable visual quality.

The high-resolution sky texture on the left looks very much like the lower-resolution sky texture on the right. This is due to the smooth colour gradients and lack of high-frequency colour variation. For those reasons, there is room for a lot of optimisation.

You could argue that optimisation is more essential for handheld devices, like mobile phones, than for stationary devices, like games consoles, for a couple of reasons. Firstly, the hardware in handheld devices is often less powerful than conventional hardware due to its smaller size and lower electrical power supply. The compact size of handheld hardware is also the reason it is more likely to suffer from temperature issues causing thermal throttling, where performance slows down significantly. Secondly, heavy graphics in games can quickly drain your phone's battery. So, it is crucial to keep GPU resource usage to a minimum when possible. Variable rate shading is a way to help do just that.

How does variable rate shading work?

In principle, variable rate shading is a very simple method which can be implemented without having to redesign an existing rendering pipeline. There are three ways to define the areas to be optimised using variable rate shading:

- Let an attachment in the form of an image serve as a mask.
- Execute the optimisation on a per-triangle basis.
- Base the VRS optimisation on a per-draw call.

Use an attachment as a mask

You can provide the GPU with an image that serves as a mask. The mask contains information about which areas need to be rendered in the traditional manner, shading each pixel individually, and which areas can be optimised by shading a group of pixels at once. The image below visualises such a mask by colour-coding the different areas: the blue area has no optimisation applied (1x1), as this is where the player focuses while driving. The green area is optimised by covering four pixels (2x2) with a single shading calculation, as this area contains less detail due to motion blur. The red area can be optimised even more (4x4), as it is affected by more aggressive motion blur. The yellow and purple areas are likewise shaded with fewer shading calculations.

The areas defined in the image above could be static, at least while the player is driving the boat at top speed, as the boat is positioned at the centre of the image at all times.
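Before moving on, here is a rough illustration of the mask-based approach using the Vulkan extension listed later in this article (VK_KHR_fragment_shading_rate). This is a hedged sketch, not code from the article: the attachment index and tile size are illustrative, and the mask image itself would be created with the VK_IMAGE_USAGE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_KHR usage flag.

    /* Hedged sketch: reference a low-resolution R8_UINT mask image as a fragment
     * shading rate attachment. Each texel of the mask selects the shading rate
     * for a tile of pixels. */
    VkAttachmentReference2 srRef = { VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2 };
    srRef.attachment = 1; /* illustrative index of the mask attachment */
    srRef.layout = VK_IMAGE_LAYOUT_FRAGMENT_SHADING_RATE_ATTACHMENT_OPTIMAL_KHR;

    VkFragmentShadingRateAttachmentInfoKHR srInfo =
        { VK_STRUCTURE_TYPE_FRAGMENT_SHADING_RATE_ATTACHMENT_INFO_KHR };
    srInfo.pFragmentShadingRateAttachment = &srRef;
    srInfo.shadingRateAttachmentTexelSize.width  = 16; /* one mask texel per 16x16 pixel tile */
    srInfo.shadingRateAttachmentTexelSize.height = 16;

    VkSubpassDescription2 subpass = { VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2 };
    subpass.pNext = &srInfo; /* chain the mask into the subpass description */
    /* ... fill in colour/depth attachments and create the render pass as usual ... */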
Returning to the racing example above: the level of optimisation could be reduced when another boat is passing by, or when the boat slows down and the motion blur is therefore gradually reduced. There are times when a more dynamic approach is needed, as it can sometimes be difficult to know beforehand which areas should be optimised and which should be shaded in the traditional manner. In those cases, it can be beneficial to generate the mask more dynamically by rendering the geometry of the scene in an extra pass: simply colour the geometric elements in the scene and pass the result to the GPU as a mask for the variable rate shading optimisation. If the scene is rendered using deferred lighting, an extra pass may not be needed, as the mask can be based on the geometry pass already required for deferred shading.

Optimisation based on primitives

Another way of using variable rate shading is to take advantage of other extensions that allow you to define the geometric elements to be optimised, rather than using a mask. This can be done on a per-triangle basis or per draw call. Defining geometric elements can be a more efficient approach, as there is no need to generate a mask and it requires less memory bandwidth. With the per-triangle extension, you define the optimisation level in the vertex shader. With the per-draw call method, the optimisation level is defined before the draw call takes place. Keep in mind that the three methods can be combined if needed.

The image below shows a rendering pass where all objects in a scene are shaded in different colours to define which areas should be shaded in the traditional manner (meaning no optimisation) and which areas contain less detail (and therefore need fewer GPU calculations). The areas shown can be defined by any of the three methods. In general, breaking a scene up into layers, where the elements nearest the camera have the least optimisation and layers in the background have the most, is an effective way to go about it. The image below shows the same scene, but this time we see the final output with VRS on and off. As you may have noticed, it is very hard to tell any difference when the VRS optimisation is turned on or off.

Experiences with variable rate shading so far

Some commercial games have already implemented variable rate shading successfully. The image below is from Wolfenstein: Youngblood. As you may have noticed, there is barely any visual difference with VRS on or off, but there is a measurable difference in frame rate. In fact, the game performs, on average, 10% or more better with VRS turned on. That may not sound like a lot, but considering that it is an easy optimisation to implement, that there is barely any noticeable change in visual quality, and that the boost comes on top of other optimisation techniques, it is actually not a bad performance gain after all. Other games have shown an even higher performance boost: for example, Gears Tactics gains up to 30% when using variable rate shading. The image below is from that game.

Virtual reality

Variable rate shading can benefit virtual reality as well. Not only does virtual reality by nature require two rendered images (one for each eye), but the player wearing the headset naturally pays most attention to the central area of the rendered image. The areas of the rendered image seen out of the corner of your eye naturally do not need the same amount of detail as the central area.
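Before continuing with virtual reality, here is a hedged sketch of the per-draw method described above, again using the Vulkan extension listed later in this article. The 2x2 rate and the combiner choices are illustrative; KEEP simply means the per-primitive and mask rates are ignored for these draws.

    /* Hedged sketch: per-draw shading rate via VK_KHR_fragment_shading_rate.
     * A 2x2 fragment size means one shading calculation covers four pixels. */
    VkExtent2D fragmentSize = { 2, 2 };
    VkFragmentShadingRateCombinerOpKHR combiners[2] = {
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* ignore the per-primitive rate */
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR  /* ignore the attachment (mask) rate */
    };
    vkCmdSetFragmentShadingRateKHR(cmd, &fragmentSize, combiners);

    /* Subsequent draws in this command buffer are shaded at the 2x2 rate. */
    vkCmdDraw(cmd, vertexCount, 1, 0, 0);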
Returning to virtual reality: this means that even though a static VRS mask can be used for a reasonable overall optimisation, using an eye tracker could result in an even more efficient optimisation and therefore a less noticeable quality reduction.

A consistently high frame rate is crucial for virtual reality. If the frame rate is not relatively consistent, or rendering performance suffers from a consistently low frame rate, it quickly becomes uncomfortable to wear a VR headset, and the player might even get dizzy and feel physically sick. By reducing GPU calculations, variable rate shading not only boosts the frame rate, it also uses less battery on mobile devices. This is a huge win for systems like Samsung Gear VR, where long battery life is much appreciated as the graphics are running on a Galaxy mobile phone.

The image below shows a variable rate shading mask generated by eye tracking technology for a virtual reality headset. The centre of the left and right images shades pixels in the traditional manner; the other colours represent different degrees of optimisation.

Which Samsung devices support variable rate shading?

All hardware listed here supports variable rate shading.

Mobile phones: Samsung Galaxy S22, S22+ and S22 Ultra
Tablets: Samsung Tab S8, S8+ and S8 Ultra

Both of the following graphics APIs support variable rate shading: Vulkan, and OpenGL ES 2.0 (and higher). The OpenGL extensions for the three ways of using variable rate shading are the following:

- GL_EXT_fragment_shading_rate_attachment, for sending a mask to the GPU.
- GL_EXT_fragment_shading_rate_primitive, for the per-triangle basis, where writing a value to gl_PrimitiveShadingRateEXT in the vertex shader defines the level of optimisation.
- GL_EXT_fragment_shading_rate, for the per-draw call, where glShadingRateEXT is called to define the optimisation level.

The extension that enables variable rate shading for Vulkan is VK_KHR_fragment_shading_rate.

Conclusion

In this article, we have established the following:

- Variable rate shading is a hardware feature and is fairly easy to implement, as it does not require any redesign of existing rendering pipelines.
- Variable rate shading is an optimisation technique which reduces GPU calculations by allowing a group of pixels to be shaded with the same colour rather than each pixel individually.
- Variable rate shading is particularly useful for mobile gaming as well as Samsung Gear VR, as it boosts performance and prolongs battery life.
- The level of optimisation can be defined by passing a mask to the GPU that contains areas of different optimisation levels.
- Some implementations have been shown to boost the frame rate by 10% or more, while other implementations increase the frame rate by up to 30%.

Note: some images in this post are courtesy of UL Solutions.

Additional resources on the Samsung Developers site

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account and subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Søren Klit Lambæk
tutorials game, mobile
With the increasing popularity of foldable phones such as the Galaxy Z Fold3 and Galaxy Z Flip3, apps on these devices are adopting their foldable features. In this blog, you can get started on how to utilize these foldable features in Android game apps. We focus on creating a Java file containing an implementation of the Android Jetpack WindowManager library that can be imported into game engines like Unity or Unreal Engine. This creates an interface allowing developers to retrieve information about the folding feature on the device. At the end of this blog, you can go deeper by heading over to Code Lab.

Android Jetpack WindowManager

Android Jetpack, in their own words, is "a suite of libraries to help developers follow best practices, reduce boilerplate code, and write code that works consistently across Android versions and devices so that developers can focus on the code they care about."

WindowManager is one of these libraries, and is intended to help application developers support new device form factors and multi-window environments. The library had its 1.0.0 release in January 2022, targeting foldable devices. According to its documentation, future versions will be extended to more display types and window features.

Creating the Android Jetpack WindowManager setup

As previously mentioned, we are creating a Java file that can be imported into either Unity or Unreal Engine 4 to create an interface for retrieving information on the folding feature and passing it over to the native or engine side of your application.

Set up the FoldableHelper class and data storage class

Create a file called FoldableHelper.java in Visual Studio or any source code editor. Let's start off by giving it a package name:

    package com.samsung.android.gamedev.foldable;

Next, let's import all the necessary libraries and classes in this file:

    // Android imports
    import android.app.Activity;
    import android.graphics.Rect;
    import android.os.Handler;
    import android.os.Looper;
    import android.util.Log;

    // Android Jetpack WindowManager imports
    import androidx.annotation.NonNull;
    import androidx.core.util.Consumer;
    import androidx.window.java.layout.WindowInfoTrackerCallbackAdapter;
    import androidx.window.layout.DisplayFeature;
    import androidx.window.layout.FoldingFeature;
    import androidx.window.layout.WindowInfoTracker;
    import androidx.window.layout.WindowLayoutInfo;
    import androidx.window.layout.WindowMetrics;
    import androidx.window.layout.WindowMetricsCalculator;

    // Java imports
    import java.util.List;
    import java.util.concurrent.Executor;

Start by creating a class, FoldableHelper, that is going to contain all of our helper functions. Then create variables to store a callback object as well as the WindowInfoTrackerCallbackAdapter and WindowMetricsCalculator. Let's also create a temporary declaration of the native function used to pass the data from Java to the native side of the application once we start working in the game engines.

    public class FoldableHelper {
        private static LayoutStateChangeCallback layoutStateChangeCallback;
        private static WindowInfoTrackerCallbackAdapter wit;
        private static WindowMetricsCalculator wmc;

        public static native void onLayoutChanged(FoldableLayoutInfo resultInfo);
    }

Let's create a storage class to hold the data received from the WindowManager library. An instance of this class will also be passed to the native code to transfer the data.
    public static class FoldableLayoutInfo {
        public static int UNDEFINED = -1;

        // Hinge orientation
        public static int HINGE_ORIENTATION_HORIZONTAL = 0;
        public static int HINGE_ORIENTATION_VERTICAL = 1;

        // State
        public static int STATE_FLAT = 0;
        public static int STATE_HALF_OPENED = 1;

        // Occlusion type
        public static int OCCLUSION_TYPE_NONE = 0;
        public static int OCCLUSION_TYPE_FULL = 1;

        Rect currentMetrics = new Rect();
        Rect maxMetrics = new Rect();

        int hingeOrientation = UNDEFINED;
        int state = UNDEFINED;
        int occlusionType = UNDEFINED;
        boolean isSeparating = false;
        Rect bounds = new Rect();
    }

Initialize the WindowInfoTracker

Since we are working in Java and the WindowManager library is written in Kotlin, we have to use the WindowInfoTrackerCallbackAdapter. This is an interface provided by Android to enable the use of the WindowInfoTracker from Java. The window info tracker is how we receive information about any folding features inside the window's bounds.

Next is to create the WindowMetricsCalculator, which lets us retrieve the window metrics of an activity. Window metrics consist of the window's current and maximum bounds.

We also create a new LayoutStateChangeCallback object. This object is passed into the window info tracker as a listener object and is called every time the layout of the device changes (for our purposes, when the foldable state changes).

    public static void init(Activity activity) {
        // Create window info tracker
        wit = new WindowInfoTrackerCallbackAdapter(WindowInfoTracker.Companion.getOrCreate(activity));
        // Create window metrics calculator
        wmc = WindowMetricsCalculator.Companion.getOrCreate();
        // Create callback object
        layoutStateChangeCallback = new LayoutStateChangeCallback(activity);
    }

Set up and attach the callback listener

In this step, let's attach the LayoutStateChangeCallback to the WindowInfoTrackerCallbackAdapter as a listener. The addWindowLayoutInfoListener function takes three parameters: the activity to attach the listener to, an executor, and a consumer of WindowLayoutInfo. We will set up the executor and consumer in a moment. The adding of the listener is kept separate from the initialization, since the first WindowLayoutInfo is not emitted until Activity.onStart has been called. As such, we likely do not need to attach the listener until during or after onStart, but we can still set up the WindowInfoTracker and WindowMetricsCalculator ahead of time.

    public static void start(Activity activity) {
        wit.addWindowLayoutInfoListener(activity, runOnUiThreadExecutor(), layoutStateChangeCallback);
    }

Now, let's create the executor for the listener. This executor is straightforward and simply runs the command on the main looper of our activity. It is possible to set this up to run on a custom thread; however, that is not covered in this blog. For more information, we recommend checking the official documentation for the Jetpack WindowManager.

    static Executor runOnUiThreadExecutor() {
        return new MyExecutor();
    }

    static class MyExecutor implements Executor {
        Handler handler = new Handler(Looper.getMainLooper());

        @Override
        public void execute(Runnable command) {
            handler.post(command);
        }
    }

Next, we create the basic layout of our LayoutStateChangeCallback. It consumes WindowLayoutInfo and implements Consumer<WindowLayoutInfo>. For now, let's simply lay out the class and give it some functionality a little later.
    static class LayoutStateChangeCallback implements Consumer<WindowLayoutInfo> {
        private final Activity activity;

        public LayoutStateChangeCallback(Activity activity) {
            this.activity = activity;
        }
    }

If the listener is no longer needed, we want a way to remove it, and the WindowInfoTrackerCallbackAdapter contains a function to do just that:

    public static void stop() {
        wit.removeWindowLayoutInfoListener(layoutStateChangeCallback);
    }

This just tidies things up for us and ensures that the listener is cleaned up when we no longer need it.

Next, we add some functionality to the LayoutStateChangeCallback class. We are going to process WindowLayoutInfo into the FoldableLayoutInfo we created previously. Using the Java Native Interface (JNI), we then send that information over to the native side using the function onLayoutChanged. Note: this doesn't actually do anything yet, but we cover how to set it up in Unreal Engine and in Unity through the Code Lab tutorials.

    static class LayoutStateChangeCallback implements Consumer<WindowLayoutInfo> {
        @Override
        public void accept(WindowLayoutInfo windowLayoutInfo) {
            FoldableLayoutInfo resultInfo = updateLayout(windowLayoutInfo, activity);
            onLayoutChanged(resultInfo);
        }
    }

Let's implement the updateLayout function to process WindowLayoutInfo and return a FoldableLayoutInfo. Firstly, create a FoldableLayoutInfo that will contain the processed information. Follow this up by getting the window metrics, both maximum metrics and current metrics.

    private static FoldableLayoutInfo updateLayout(WindowLayoutInfo windowLayoutInfo, Activity activity) {
        FoldableLayoutInfo retLayoutInfo = new FoldableLayoutInfo();

        WindowMetrics wm = wmc.computeCurrentWindowMetrics(activity);
        retLayoutInfo.currentMetrics = wm.getBounds();

        wm = wmc.computeMaximumWindowMetrics(activity);
        retLayoutInfo.maxMetrics = wm.getBounds();
    }

Get the DisplayFeatures present in the current window bounds using WindowLayoutInfo.getDisplayFeatures. Currently, the API only has one type of DisplayFeature, FoldingFeature; however, in the future there will likely be more as screen types evolve. At this point, let's use a for loop to iterate through the resulting list until it finds a FoldingFeature. Once it detects a folding feature, it starts processing its data: orientation, state, separation type, and its bounds. Then, store these data in the FoldableLayoutInfo we created at the start of the function call. You can learn more about these data in the Jetpack WindowManager documentation.
    private static FoldableLayoutInfo updateLayout(WindowLayoutInfo windowLayoutInfo, Activity activity) {
        FoldableLayoutInfo retLayoutInfo = new FoldableLayoutInfo();

        WindowMetrics wm = wmc.computeCurrentWindowMetrics(activity);
        retLayoutInfo.currentMetrics = wm.getBounds();

        wm = wmc.computeMaximumWindowMetrics(activity);
        retLayoutInfo.maxMetrics = wm.getBounds();

        List<DisplayFeature> displayFeatures = windowLayoutInfo.getDisplayFeatures();
        if (!displayFeatures.isEmpty()) {
            for (DisplayFeature displayFeature : displayFeatures) {
                FoldingFeature foldingFeature = (FoldingFeature) displayFeature;
                if (foldingFeature != null) {
                    if (foldingFeature.getOrientation() == FoldingFeature.Orientation.HORIZONTAL) {
                        retLayoutInfo.hingeOrientation = FoldableLayoutInfo.HINGE_ORIENTATION_HORIZONTAL;
                    } else {
                        retLayoutInfo.hingeOrientation = FoldableLayoutInfo.HINGE_ORIENTATION_VERTICAL;
                    }

                    if (foldingFeature.getState() == FoldingFeature.State.FLAT) {
                        retLayoutInfo.state = FoldableLayoutInfo.STATE_FLAT;
                    } else {
                        retLayoutInfo.state = FoldableLayoutInfo.STATE_HALF_OPENED;
                    }

                    if (foldingFeature.getOcclusionType() == FoldingFeature.OcclusionType.NONE) {
                        retLayoutInfo.occlusionType = FoldableLayoutInfo.OCCLUSION_TYPE_NONE;
                    } else {
                        retLayoutInfo.occlusionType = FoldableLayoutInfo.OCCLUSION_TYPE_FULL;
                    }

                    retLayoutInfo.isSeparating = foldingFeature.isSeparating();
                    retLayoutInfo.bounds = foldingFeature.getBounds();

                    return retLayoutInfo;
                }
            }
        }
        return retLayoutInfo;
    }

If there is no folding feature detected, the function simply returns the FoldableLayoutInfo without setting its data, leaving it with UNDEFINED (-1) values.

Conclusion

The Java file you have now created should be usable in new or existing Unity and Unreal Engine projects to provide access to the information on the folding feature. Continue learning about it in the Code Lab tutorials, which show how to use the file created here to implement flex mode detection and usage in game applications.

Additional resources on the Samsung Developers site

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account and subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
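To round off the tutorial, here is a hedged sketch of how a host Activity might drive the helper built above. The lifecycle wiring is an assumption on my part (the Code Lab tutorials cover the engine-specific integration); only FoldableHelper and its init/start/stop functions come from this article.

    // Hedged sketch: hypothetical host Activity driving the FoldableHelper built above.
    // Where exactly these calls live depends on your engine integration.
    public class GameActivity extends android.app.Activity {
        @Override
        protected void onCreate(android.os.Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            FoldableHelper.init(this);   // safe to do early: only creates the tracker objects
        }

        @Override
        protected void onStart() {
            super.onStart();
            FoldableHelper.start(this);  // WindowLayoutInfo is only emitted after onStart
        }

        @Override
        protected void onStop() {
            super.onStop();
            FoldableHelper.stop();       // detach the listener when the Activity is not visible
        }
    }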
Lochlann Henry Ramsay-Edwards
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

Android is enabling a host of useful new Vulkan extensions for mobile. These new extensions are set to improve the state of graphics APIs for modern applications, enabling new use cases and changing how developers can design graphics renderers going forward. I have already provided information about the 'maintenance extensions'. Another important group, which I explore in this blog, is the 'legacy support extensions'.

Vulkan is increasingly being used as a portable "HAL". The power and flexibility of the API allows for great layered implementations, and there is a lot of effort spent in the ecosystem enabling legacy graphics APIs to run efficiently on top of Vulkan. The bright future for driver developers is a world where GPU drivers only implement Vulkan, and where legacy APIs can be implemented on top of that driver. To that end, there are several features which are generally considered backwards today. They should not be used in new applications unless absolutely required. These extensions exist to facilitate old applications which need to keep running through API translation layers such as ANGLE, DXVK, Zink, and so on.

VK_EXT_transform_feedback

Speaking the name of this extension causes the general angst level to rise in a room of driver developers. In the world of Direct3D, this feature is also known as stream-out. The core feature of this extension is that whenever you render geometry, you can capture the resulting geometry data (position and vertex outputs) into a buffer. The key complication from an implementation point of view is that the result is ordered. This means there is no 1:1 relation of input vertex to output data, since this extension is supposed to work with indexed rendering as well as strip types (and even geometry shaders and tessellation, oh my!).

This feature was invented in a world before compute shaders were conceived. The only real method to perform buffer <-> buffer computation was to make use of transform feedback, vertex shaders and rasterization discard. Over time, the functionality of transform feedback was extended in various ways, but today it is essentially obsoleted by compute shaders. There are, however, two niches where this extension still makes sense: graphics debuggers and API translation layers. Transform feedback is extremely difficult to emulate in the more complicated cases.

Setting up shaders

In vertex-like shader stages, you need to set up which vertex outputs to capture to a buffer. The shader itself controls the memory layout of the output data. This is unlike other APIs, where you use the graphics API to specify which outputs to capture based on the name of the variable.
Here is an example Vulkan GLSL shader:

    #version 450

    layout(xfb_stride = 32, xfb_offset = 0, xfb_buffer = 0, location = 0) out vec4 vColor;
    layout(xfb_stride = 32, xfb_offset = 16, xfb_buffer = 0, location = 1) out vec4 vColor2;

    layout(xfb_buffer = 1, xfb_stride = 16) out gl_PerVertex
    {
        layout(xfb_offset = 0) vec4 gl_Position;
    };

    void main()
    {
        gl_Position = vec4(1.0);
        vColor = vec4(2.0);
        vColor2 = vec4(3.0);
    }

The resulting SPIR-V will then look something like:

    OpCapability TransformFeedback
    OpExecutionMode 4 Xfb
    OpDecorate 8(gl_PerVertex) Block
    OpDecorate 10 XfbBuffer 1
    OpDecorate 10 XfbStride 16
    OpDecorate 17(vColor) Location 0
    OpDecorate 17(vColor) XfbBuffer 0
    OpDecorate 17(vColor) XfbStride 32
    OpDecorate 17(vColor) Offset 0
    OpDecorate 20(vColor2) Location 1
    OpDecorate 20(vColor2) XfbBuffer 0
    OpDecorate 20(vColor2) XfbStride 32
    OpDecorate 20(vColor2) Offset 16

Binding transform feedback buffers

Once we have a pipeline which can emit transform feedback data, we need to bind buffers:

    vkCmdBindTransformFeedbackBuffersEXT(cmd, firstBinding, bindingCount, pBuffers, pOffsets, pSizes);

To enable a buffer to be captured, VK_BUFFER_USAGE_TRANSFORM_FEEDBACK_BUFFER_BIT_EXT is used.

Starting and stopping capture

Once we know where to write the vertex output data, we will begin and end captures. This needs to be done inside a render pass:

    vkCmdBeginTransformFeedbackEXT(cmd, firstCounterBuffer, counterBufferCount, pCounterBuffers, pCounterBufferOffsets);

A counter buffer allows us to handle scenarios where we end a transform feedback and continue capturing later. We would not necessarily know how many bytes were written by the last transform feedback, so it is critical that we can let the GPU maintain a byte counter for us.

    vkCmdDraw(cmd, …);
    vkCmdDrawIndexed(cmd, …);

Then we can start rendering. Vertex outputs are captured to the buffers in order.

    vkCmdEndTransformFeedbackEXT(cmd, firstCounterBuffer, counterBufferCount, pCounterBuffers, pCounterBufferOffsets);

Once we are done capturing, we end the transform feedback and, with the counter buffers, we can write the new buffer offsets into the counter buffer.

Indirectly drawing transform feedback results

This feature is a precursor to the more flexible indirect draw feature we have in Vulkan, but there was a time when this feature was the only efficient way to render transform feedback outputs. The fundamental problem is that we do not necessarily know exactly how many primitives have been rendered. Therefore, to avoid stalling the CPU, it was required to be able to indirectly render the results with a special-purpose API.

    vkCmdDrawIndirectByteCountEXT(cmd, instanceCount, firstInstance, counterBuffer, counterBufferOffset, counterOffset, vertexStride);

This works similarly to a normal indirect draw call, but instead of providing a vertex count, we give it a byte count and let the GPU perform the divide instead. This is nice, as otherwise we would have to dispatch a tiny compute kernel that converts a byte count to an indirect draw.

Queries

The offset counter is sort of like a query, but if the transform feedback buffers overflow, any further writes are ignored. The VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT query reports how many primitives were generated. It also lets you query how many primitives were attempted to be written. This makes it possible to detect overflow, if that is desirable.
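The queries paragraph above has no accompanying snippet, so here is a hedged sketch of how such a query might be recorded. The indexed query entry points are part of VK_EXT_transform_feedback; the pool parameters and stream index are illustrative.

    /* Hedged sketch: counting generated/written primitives for vertex stream 0
     * with VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT. */
    VkQueryPoolCreateInfo poolInfo = { VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO };
    poolInfo.queryType = VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT;
    poolInfo.queryCount = 1;
    vkCreateQueryPool(device, &poolInfo, NULL, &queryPool);

    /* Record around the capture; the last parameter is the vertex stream index. */
    vkCmdBeginQueryIndexedEXT(cmd, queryPool, 0, 0, 0);
    /* ... vkCmdBeginTransformFeedbackEXT / draws / vkCmdEndTransformFeedbackEXT ... */
    vkCmdEndQueryIndexedEXT(cmd, queryPool, 0, 0);

    /* The query result holds two 64-bit values, primitives written and primitives
     * needed, which lets the application detect overflow. */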
VK_EXT_line_rasterization

Line rasterization is a tricky subject and is not normally used in gaming applications, since lines do not scale with resolution and their exact behavior is not consistent across all GPU implementations. In the world of CAD, however, this feature is critical, and older OpenGL APIs had extensive support for quite fancy line rendering methods. This extension essentially brings back those workstation features. Advanced line rendering can occasionally be useful for debug tooling and visualization as well.

The feature zoo

    typedef struct VkPhysicalDeviceLineRasterizationFeaturesEXT {
        VkStructureType    sType;
        void*              pNext;
        VkBool32           rectangularLines;
        VkBool32           bresenhamLines;
        VkBool32           smoothLines;
        VkBool32           stippledRectangularLines;
        VkBool32           stippledBresenhamLines;
        VkBool32           stippledSmoothLines;
    } VkPhysicalDeviceLineRasterizationFeaturesEXT;

This extension supports a lot of different feature bits. I will try to summarize what they mean below.

Rectangular lines vs parallelogram

When rendering normal lines in core Vulkan, there are two ways lines can be rendered. If VkPhysicalDeviceLimits::strictLines is true, a line is rendered as if it were a true, oriented rectangle. This is essentially what you would get if you rendered a scaled and rotated rectangle yourself: the hardware just expands the line along the axis perpendicular to the line. In non-strict rendering, we get a parallelogram, where the line is extended either in the X or the Y direction. (From the Vulkan specification.)

Bresenham lines

Bresenham lines reformulate the line rendering algorithm so that each pixel has a diamond-shaped area around it, and coverage is based on the line intersecting and exiting that area. The advantage here is that rendering line strips avoids overdraw. Rectangle or parallelogram rendering does not guarantee this, which matters if you are rendering line strips with blending enabled. (From the Vulkan specification.)

Smooth lines

Smooth lines work like rectangular lines, except the implementation can render a little further out to create a smooth edge. The exact behavior is completely unspecified, and here we find the only instance of the word "aesthetic" in the entire specification, which is amusing. It is a wonderfully vague word to see in the Vulkan specification, which is otherwise no-nonsense normative. This feature is designed to work in combination with alpha blending, since the smooth coverage of the line rendering is multiplied into the alpha channel of render target 0's output.

Line stipple

A "classic" feature that will make most IHVs cringe a little. When rendering a line, it is possible to mask certain pixels in a pattern. A counter runs while rasterizing pixels in order, and with line stipple you control a divider and a mask which generate a fixed pattern for when to discard pixels. It is somewhat unclear whether this feature is really needed when it is possible to use discard in the fragment shader, but alas, legacy features from the early 90s are sometimes used. There were no shaders back in those days.
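Before looking at pipeline state, here is a hedged sketch of how the feature bits above might be queried at startup. This is the standard Vulkan feature-query pattern rather than anything specific to this article.

    /* Hedged sketch: query which line rasterization features the device exposes
     * before deciding which pipeline state to request. */
    VkPhysicalDeviceLineRasterizationFeaturesEXT lineFeatures =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_LINE_RASTERIZATION_FEATURES_EXT };
    VkPhysicalDeviceFeatures2 features2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2 };
    features2.pNext = &lineFeatures;

    vkGetPhysicalDeviceFeatures2(gpu, &features2);

    if (lineFeatures.bresenhamLines)
    {
        /* Safe to use VK_LINE_RASTERIZATION_MODE_BRESENHAM_EXT; remember to chain the
         * same (enabled) feature struct into VkDeviceCreateInfo::pNext as well. */
    }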
Configuring rasterization pipeline state

When creating a graphics pipeline, you can pass in some more data in the pNext of the rasterization state:

    typedef struct VkPipelineRasterizationLineStateCreateInfoEXT {
        VkStructureType             sType;
        const void*                 pNext;
        VkLineRasterizationModeEXT  lineRasterizationMode;
        VkBool32                    stippledLineEnable;
        uint32_t                    lineStippleFactor;
        uint16_t                    lineStipplePattern;
    } VkPipelineRasterizationLineStateCreateInfoEXT;

    typedef enum VkLineRasterizationModeEXT {
        VK_LINE_RASTERIZATION_MODE_DEFAULT_EXT = 0,
        VK_LINE_RASTERIZATION_MODE_RECTANGULAR_EXT = 1,
        VK_LINE_RASTERIZATION_MODE_BRESENHAM_EXT = 2,
        VK_LINE_RASTERIZATION_MODE_RECTANGULAR_SMOOTH_EXT = 3,
    } VkLineRasterizationModeEXT;

If line stipple is enabled, the line stipple factor and pattern can be baked into the pipeline, or made dynamic pipeline state using VK_DYNAMIC_STATE_LINE_STIPPLE_EXT. In the dynamic case, the line stipple factor and pattern can be modified dynamically with:

    vkCmdSetLineStippleEXT(cmd, factor, pattern);

VK_EXT_index_type_uint8

In OpenGL and OpenGL ES, we have support for 8-bit index buffers. Core Vulkan and Direct3D, however, only support 16-bit and 32-bit index buffers. Since emulating index buffer formats is impractical with indirect draw calls being a thing, we need to be able to bind 8-bit index buffers. This extension does just that. It is probably the simplest extension we have looked at so far:

    vkCmdBindIndexBuffer(cmd, indexBuffer, offset, VK_INDEX_TYPE_UINT8_EXT);
    vkCmdDrawIndexed(cmd, …);

Conclusion

I have been through the 'maintenance' and 'legacy support' extensions that are part of the new Vulkan extensions for mobile. In the next three blogs, I will go through what I see as the 'game-changing' extensions from Vulkan: the three that will help to transform your games during the development process.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games. The original version of this article can be viewed at Arm Community.

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps and games. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game, mobile
Anti-aliasing is an important addition to any game, improving visual quality by smoothing out the jagged edges of a scene. MSAA (multisample anti-aliasing) is one of the oldest methods to achieve this and is still the preferred solution for mobile. However, it is only suitable for forward rendering and, with mobile performance improving year over year, deferred rendering is becoming more common, necessitating the use of post-process AA. This leaves slim pickings, as such algorithms tend to be too expensive for mobile GPUs, with FXAA (fast approximate anti-aliasing) being the only 'cheap' option among them. FXAA may be performant enough, but it only has simple colour discontinuity shape detection, leading to an often unwanted softening of the image. Its kernel is also limited in size, so it struggles to anti-alias longer edges effectively.

Space Module scene with CMAA applied.

Conservative morphological anti-aliasing

Conservative morphological anti-aliasing (CMAA) is a post-process AA solution originally developed by Intel for their low-power integrated GPUs [1]. Its design goals are to be a better alternative to FXAA by:

- Being minimally invasive, so it can be acceptable as a replacement in a wide range of applications, including worst-case scenarios such as text, repeating patterns, certain geometries (power lines, mesh fences, foliage), and moving images.
- Running efficiently on low-to-medium-range GPU hardware, such as integrated GPUs (or, in our case, mobile GPUs).

We have repurposed this desktop-developed algorithm and come up with a hybrid between the original 1.3 version and the updated 2.0 version [2] to make the best use of mobile hardware. A demo app was created using Khronos' Vulkan Samples as a framework (the same could be done with GLES) to implement this experiment. The sample has a drop-down menu for easy switching between the different AA solutions and presents a frametime and bandwidth overlay.

CMAA has four basic logical steps:

1. Image analysis for colour discontinuities (afterwards stored in a local compressed 'edge' buffer). The method used is not unique to CMAA.
2. Extracting locally dominant edges with a small kernel (a unique variation of existing algorithms).
3. Handling of simple shapes.
4. Handling of symmetrical long edge shapes (a unique take on the original MLAA shape handling algorithm).

Pass 1

Edge detection result captured in RenderDoc.

A full-screen edge detection pass is done in a fragment shader, and the resulting colour discontinuity values are written into a colour attachment. Our implementation uses the pixels' luminance values to find edge discontinuities, for speed and simplicity. An edge exists if the contrast between neighbouring pixels is above an empirically determined threshold.

Pass 2

Neighbouring edges considered for local contrast adaptation.

A local contrast adaptation is performed for each detected edge by comparing the value from the previous pass against the values of its closest neighbours, creating a threshold from the average and largest of these, as described by the formula below. Any edge that passes the threshold is written into an image as a confirmed edge.

    threshold = (avg + avgXY) * (1.0 - nonDominantEdgeRemovalAmount) + maxE * nonDominantEdgeRemovalAmount;

nonDominantEdgeRemovalAmount is another empirically determined variable.

Pass 3

This pass collects all the edges for each pixel from the previous pass and packs them into a new image for the final pass. This pass also does the first part of edge blending.
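Stepping back for a moment, the edge test described for passes 1 and 2 can be sketched in GLSL roughly as follows. This is a hedged illustration, not the shipping shader from the demo app; the threshold value is a stand-in for the empirically determined constant mentioned above.

    // Hedged GLSL sketch of the pass 1 luminance edge test; illustrative only.
    #version 450
    layout(binding = 0) uniform sampler2D sceneColour;
    layout(location = 0) out vec4 outEdges;

    float luma(vec3 c) { return dot(c, vec3(0.299, 0.587, 0.114)); }

    void main()
    {
        ivec2 p = ivec2(gl_FragCoord.xy);
        float centre = luma(texelFetch(sceneColour, p, 0).rgb);
        float right  = luma(texelFetch(sceneColour, p + ivec2(1, 0), 0).rgb);
        float below  = luma(texelFetch(sceneColour, p + ivec2(0, 1), 0).rgb);

        const float threshold = 0.045; // stand-in for the empirically determined constant

        // One edge flag per direction; pass 2 then applies local contrast adaptation
        // using the threshold formula given above.
        outEdges = vec4(abs(centre - right) > threshold ? 1.0 : 0.0,
                        abs(centre - below) > threshold ? 1.0 : 0.0,
                        0.0, 0.0);
    }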
Continuing with pass 3: the detected edges are used to look for two, three and four edges meeting in a pixel, and the colours from the adjacent pixels are then blended in. This helps avoid the unnecessary blending of straight edges.

Pass 4

The final pass does long edge blending by identifying Z-shapes in the detected edges. For each detected Z-shape, the length of the edge is traced in both directions until it reaches the end or runs into a perpendicular edge. Pixel blending is then performed along the traced edges, proportional to the distance from the centre.

Before and after of Z-shape detection.

Results

The image comparison shows a typical scenario for AA. CMAA manages high-quality anti-aliasing while retaining sharpness on straight edges. CMAA demonstrates itself to be a superior solution to aliasing than FXAA by avoiding the latter's limitations. It maintains a crisper look to the overall image and won't smudge thin lines, all while still providing effective anti-aliasing on curved edges. It also provides a significant performance advantage on Qualcomm devices and only a small penalty on Arm.

The image comparison shows a weakness of FXAA, where it smudges thin-lined geometry into the background. CMAA shows no such weakness and retains the colour of the railing while adding anti-aliasing effectively.

MSAA is still a clear winner and our recommended solution if your game allows it to be resolved within a single render pass. For any case where that is impractical, CMAA is overall a better alternative than FXAA and should be strongly considered.

The graph shows the increase in frametime for each AA method across a range of Samsung devices.

Follow up

This site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.

References

1. Filip Strugar and Leigh Davies: Conservative Morphological Anti-Aliasing (CMAA), March 2014.
2. Filip Strugar and Adam T. Lake: Conservative Morphological Anti-Aliasing 2.0, April 2018.
Samsung GameDev Team
success story game, mobile
Introduction

In recent years, the mobile game market has been growing fast. Along with hardware upgrades, the implementation of mobile games has become more complicated, and the loading process and the display of some scenes consume a lot of CPU and GPU resources. So Samsung worked with game vendors, including the Tencent Game Performance Amelioration (MTGPA) team, to improve the game user experience based on SceneSDK.

SceneSDK focuses on performance optimization, with the game vendor's cooperation, by combining the ability of device manufacturers to control hardware resources with the ability of games to sync up scenario information. It can maximize the game experience. Currently it includes many items, such as scene guarantee and frequency reduction notification. It can get game information, send a mobile device's status to a game, and supports more than 40 games, including many popular titles.

Optimization solutions

Scene protection

Scene protection divides a game into different game scenes according to certain rules (such as coarse-grained loading, lobby, single game round, ultimate kill, aiming and shooting, and much more), and then provides finer-grained performance guarantees for different game scenes. Considering that hardware resources are limited, if the protection of all scenes were the same, the actual protection effect would not be ideal, because the hardware would be fully loaded from the beginning: the system's high-temperature protection would be triggered quickly, the CPU and GPU would be forced to reduce frequency, and rendering performance would become even worse. Therefore, the game can send game events, like loading, starting, lobby or scene loading, to SceneSDK on the mobile device's side.

The game information is sent in the JSON format {sceneId: value}. It is flexible and can be extended with more scene IDs if needed. The main value information is shown in the table below. After getting the scene info, the SceneSDK service changes the CPU/GPU frequency to improve game performance based on the different game scenarios.

During gameplay, it is necessary to classify the scenes according to their importance level. The strategy of hierarchical protection differs slightly according to the underlying adjustment capabilities of each manufacturer, but the core is that the highest-level scene is fully protected. The effect of protecting scenes of different levels is shown below: for the highest-priority (critical) scene, the FPS (frames per second) is more stable, while for the lowest-level scene, FPS slowly declines without affecting the experience. At the same time, protection during scene switching also takes effect in real time. As shown in the following figure, when the scene level is switched from low (1) to critical (3), the CPU frequency starts to increase, and the FPS also gradually starts to increase. The comparison is as follows.

Frequency reduction notification

The frequency reduction notification informs the game of a system CPU frequency reduction. The extent of CPU frequency reduction varies from manufacturer to manufacturer. Such adjustments inevitably result in a device that only just meets, or already fails to meet, the performance needs of the game, so that it starts to freeze or becomes stuck. Therefore, if the game can instantly adjust the configuration items related to performance consumption, by temporarily reducing or disabling some functions in response to such notifications, a stuck game can be avoided to ensure the best player experience.
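As a concrete illustration of the payload format, here is a hedged Java sketch of building the {sceneId: value} string described above. The exact entry point for handing it to SceneSDK is vendor-specific and not documented in this article, so sendToSceneSdk is a hypothetical placeholder; only the payload notation comes from the text.

    // Hedged sketch: building the {sceneId: value} payload described above.
    // sendToSceneSdk() is a hypothetical placeholder for the vendor-provided entry point.
    import org.json.JSONException;
    import org.json.JSONObject;

    public final class SceneReporter {
        public static final int SCENE_LEVEL_CRITICAL = 3; // illustrative value only

        public static void reportScene(int sceneId, int value) {
            try {
                JSONObject payload = new JSONObject();
                payload.put(String.valueOf(sceneId), value); // follows the article's {sceneId: value} notation
                sendToSceneSdk(payload.toString());
            } catch (JSONException ignored) {
            }
        }

        private static void sendToSceneSdk(String json) {
            // Hand the string to SceneSDK through whatever interface the vendor provides.
        }
    }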
Variable refresh rate

A key part of the technology used in the latest Samsung flagship models is the ability not only to run at conventional mobile refresh rate limits, but also to modify the refresh rate dynamically at runtime based on game requirements. A common misconception is that if a device can run at 120 fps, it should always do so. However, most games support multiple options lower than 120 fps, or do not support 120 fps at all. If a game does not support 120 fps, a 120 Hz refresh rate is not required and may cost more power. The ideal approach is to use the maximum refresh rate only when it has the greatest benefit. So when the game sends the JSON string {"target fps": value} to the mobile device, the device changes the refresh rate to save power by not running at the highest refresh rate unnecessarily.

Summary

This table shows the benefit of using SceneSDK at launch time.
This table shows better performance during two rounds of testing with SceneSDK on and SceneSDK off.
This table shows lower power consumption when the dynamic refresh rate is applied.

Overall, SceneSDK could be a good option to improve your game's performance. Please feel free to contact us if you need more information.
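As a closing note: independently of SceneSDK, Android 11 (API 30) and later also lets a game hint its intended frame rate directly to the platform. The sketch below is an addition of mine and is not part of SceneSDK; it simply illustrates the same idea of telling the system your target fps.

    // Hedged sketch (not part of SceneSDK): hinting the intended frame rate to Android
    // via Surface.setFrameRate, available on API 30 and later.
    import android.os.Build;
    import android.view.Surface;

    public final class FrameRateHint {
        public static void apply(Surface surface, float targetFps) {
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.R && surface != null && surface.isValid()) {
                surface.setFrameRate(targetFps, Surface.FRAME_RATE_COMPATIBILITY_DEFAULT);
            }
        }
    }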
Xiangguo Qi
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

In previous blogs, we have already explored two key Vulkan extension game changers that will be enabled by Android R: descriptor indexing and buffer device address. In this blog, we explore the third and final game changer, 'timeline semaphores'.

The introduction of timeline semaphores is a large improvement to the synchronization model of Vulkan and is a required feature in Vulkan 1.2. It solves some fundamental grievances with the existing synchronization APIs in Vulkan.

The problems with VkFence and VkSemaphore

In Vulkan without this extension, there are two distinct synchronization objects for dealing with CPU <-> GPU synchronization and GPU queue <-> GPU queue synchronization.

The VkFence object only deals with GPU -> CPU synchronization. Due to the explicit nature of Vulkan, you must keep track of when the GPU completes the work you submit to it:

    vkQueueSubmit(queue, …, fence);

This is how we would use a fence, and later this fence can be waited on. When the fence signals, we know it is safe to free resources, read back data written by the GPU, and so on. Overall, the VkFence interface was never a real problem in practice, except that it feels strange to have two entirely different API objects which essentially do the same thing.

VkSemaphore, on the other hand, has some quirks which make it difficult to use properly in sophisticated applications. A VkSemaphore by default is a binary semaphore. The fundamental problem with binary semaphores is that we can only wait for a semaphore once; after we have waited for it, it automatically becomes unsignaled again. This binary nature is very annoying to deal with when we use multiple queues. For example, consider a scenario where we perform some work in the graphics queue and want to synchronize that work with two different compute queues. If we know this scenario is coming up, we have to allocate two VkSemaphore objects, signal both, and wait for each of them in the different compute queues. This works, but we might not know up front that this scenario will play out. Often, when dealing with multiple queues, we have to be somewhat conservative and signal semaphore objects we never end up waiting for. This leads to another problem.

A signaled semaphore which is never waited for is basically a dead and useless semaphore and should be destroyed. We cannot reset a VkSemaphore object on the CPU, so we cannot ever signal it again if we want to recycle VkSemaphore objects. A workaround would be to wait for the semaphore on the GPU in a random queue just to unsignal it, but this feels like a gross hack. It could also potentially cause performance issues, as waiting for a semaphore is a full GPU memory barrier.

Object bloat is another considerable pitfall of the existing APIs. For every synchronization point we need, we require a new object. All these objects must be managed and their lifetimes considered. This creates a lot of annoying "bloat" for engines.
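To make the fence half of this concrete, here is a minimal sketch of the per-frame VkFence round trip the article is describing. The structure is standard Vulkan; the variable names are illustrative.

    /* Hedged sketch: the classic per-frame VkFence round trip described above. */
    VkFenceCreateInfo fenceInfo = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
    VkFence frameFence;
    vkCreateFence(device, &fenceInfo, NULL, &frameFence);

    /* Submit work and associate the fence with the submission. */
    vkQueueSubmit(queue, 1, &submitInfo, frameFence);

    /* Later: block until the GPU has finished, then recycle per-frame resources. */
    vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frameFence);
    /* ... free or reuse command buffers, read back results, and so on ... */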
The timeline: fixing object bloat, fixing multiple waits

The first observation we can make about a Vulkan queue is that submissions should generally complete in order. To signal a synchronization object in vkQueueSubmit, the GPU waits for all previously submitted work to the queue, which includes the signaling operation of previous synchronization objects. Rather than assigning one object per submission, we can synchronize in terms of the number of submissions. A plain uint64_t counter can be used for each queue. When a submission completes, the number is monotonically increased, usually by one each time. This counter is contained inside a single timeline semaphore object.

Rather than waiting for a specific synchronization object which matches a particular submission, we can wait for a single object and specify "wait until graphics queue submission #157 completes." We can wait for any value multiple times as we wish, so there is no binary semaphore problem. Essentially, for each VkQueue we can create a single timeline semaphore on startup and leave it alone (a uint64_t will not overflow until the heat death of the sun, do not worry about it). This is extremely convenient and makes it much easier to implement complicated dependency management schemes.

Unifying VkFence and VkSemaphore

Timeline semaphores can be used very effectively on the CPU as well:

    VkSemaphoreWaitInfoKHR info = { VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO_KHR };
    info.semaphoreCount = 1;
    info.pSemaphores = &semaphore;
    info.pValues = &value;
    vkWaitSemaphoresKHR(device, &info, timeout);

This completely removes the need to use VkFence. Another advantage of this method is that multiple threads can wait for a timeline semaphore; with VkFence, only one thread could access a VkFence at any one time.

A timeline semaphore can even be signaled from the CPU as well, although this feature feels somewhat niche. It allows use cases where you submit work to the GPU early, but then 'kick' the submission later using vkSignalSemaphoreKHR.
The accompanying sample demonstrates a particular scenario where this function might be useful:

    VkSemaphoreSignalInfoKHR info = { VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO_KHR };
    info.semaphore = semaphore;
    info.value = value;
    vkSignalSemaphoreKHR(device, &info);

Creating a timeline semaphore

When creating a semaphore, you can specify the type of semaphore and give it an initial value:

    VkSemaphoreCreateInfo info = { VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
    VkSemaphoreTypeCreateInfoKHR type_info = { VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO_KHR };
    type_info.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE_KHR;
    type_info.initialValue = 0;
    info.pNext = &type_info;
    vkCreateSemaphore(device, &info, NULL, &semaphore);

Signaling and waiting on timeline semaphores

When submitting work with vkQueueSubmit, you can chain another struct which provides counter values when using timeline semaphores, for example:

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &compute_queue_semaphore;
    submit.pWaitDstStageMask = &wait_stage;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores = &graphics_queue_semaphore;

    VkTimelineSemaphoreSubmitInfoKHR timeline = { VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO_KHR };
    timeline.waitSemaphoreValueCount = 1;
    timeline.pWaitSemaphoreValues = &wait_value;
    timeline.signalSemaphoreValueCount = 1;
    timeline.pSignalSemaphoreValues = &signal_value;
    submit.pNext = &timeline;

    signal_value++; // Generally, you bump the timeline value once per submission.
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

Out of order signal and wait

A strong requirement of Vulkan binary semaphores is that signals must be submitted before a wait on a semaphore can be submitted. This makes it easy to guarantee that deadlocks do not occur on the GPU, but it is also somewhat inflexible. In an application with many Vulkan queues and a task-based architecture, it is reasonable to submit work that is somewhat out of order. However, this still uses synchronization objects to ensure the right ordering when executing on the GPU. With timeline semaphores, the application can agree on the timeline values to use ahead of time, then go ahead and build commands and submit out of order. The driver is responsible for figuring out the submission order required to make it work. However, the application gets more ways to shoot itself in the foot with this approach. This is because it is possible to create a deadlock with multiple queues where queue A waits for queue B, and queue B waits for queue A at the same time.

Ease of porting

It is no secret that timeline semaphores are inherited largely from D3D12's fence objects. From a portability angle, timeline semaphores make it much easier to have compatibility across the APIs.

Caveats

As the specification stands right now, you cannot use timeline semaphores with swap chains. This is generally not a big problem, as synchronization with the swap chain tends to be an explicit operation renderers need to take care of. Another potential caveat to consider is that the timeline semaphore might not have a direct kernel equivalent on current platforms, which means some extra emulation to handle it, especially the out-of-order submission feature. As the timeline synchronization model becomes the de facto standard, I expect platforms to get more native support for it.
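As a small usage note, here is a hedged sketch of a common pattern that timeline semaphores enable: keeping at most two frames in flight by waiting on a per-queue timeline value, using only the calls shown above. The frame counter and limit are illustrative.

    /* Hedged sketch: throttle the CPU so no more than two frames are in flight,
     * using one timeline semaphore per queue and one value per submission. */
    uint64_t frame_index = 0;            /* incremented once per vkQueueSubmit */
    const uint64_t max_frames_in_flight = 2;

    if (frame_index > max_frames_in_flight)
    {
        uint64_t wait_value = frame_index - max_frames_in_flight;

        VkSemaphoreWaitInfoKHR wait_info = { VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO_KHR };
        wait_info.semaphoreCount = 1;
        wait_info.pSemaphores = &graphics_queue_semaphore;
        wait_info.pValues = &wait_value;

        /* Blocks until the submission that signaled 'wait_value' has completed,
         * so its per-frame resources can be reused. */
        vkWaitSemaphoresKHR(device, &wait_info, UINT64_MAX);
    }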
Conclusion

All three key Vulkan extension game changers improve the overall development and gaming experience by improving graphics and enabling new gaming use cases. We hope that we have given you enough samples to get you started as you try out these new Vulkan extensions to help bring your games to life.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games.

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps and games. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

Android R is enabling a host of useful Vulkan extensions for mobile, with three being key 'game changers'. These are set to improve the state of graphics APIs for modern applications, enabling new use cases and changing how developers can design graphics renderers going forward. You can expect to see these features across a variety of Android smartphones, such as the new Samsung Galaxy S21, and existing Samsung Galaxy S models like the Samsung Galaxy S20. The first blog explored the first game changer extension for Vulkan, 'descriptor indexing'. This blog explores the second game changer, 'buffer device address'.

VK_KHR_buffer_device_address

VK_KHR_buffer_device_address is a monumental extension that adds a unique feature to Vulkan that none of the competing graphics APIs support. Pointer support is something that has always been limited in graphics APIs, for good reason: pointers complicate a lot of things, especially for shader compilers. It is also near impossible to deal with plain pointers in legacy graphics APIs, which rely on implicit synchronization.

There are two key aspects to buffer_device_address (BDA). First, it is possible to query a GPU virtual address from a VkBuffer. This is a plain uint64_t. This address can be written anywhere you like: in uniform buffers, push constants, or storage buffers, to name a few. The key aspect which makes this extension unique is that a SPIR-V shader can load an address from a buffer and treat it as a pointer to storage buffer memory immediately. Pointer casting, pointer arithmetic and all sorts of clever trickery can be done inside the shader. There are many use cases for this feature. Some are performance-related, and some are new use cases that have not been possible before.

Getting the GPU virtual address (VA)

There are some hoops to jump through here. First, when allocating VkDeviceMemory, we must flag that the memory supports BDA:

    VkMemoryAllocateInfo info = {…};
    VkMemoryAllocateFlagsInfo flags = {…};
    flags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;
    vkAllocateMemory(device, &info, NULL, &memory);

Similarly, when creating a VkBuffer, we add the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_KHR usage flag. Once we have created a buffer, we can query the VA:

    VkBufferDeviceAddressInfoKHR info = {…};
    info.buffer = buffer;
    VkDeviceSize va = vkGetBufferDeviceAddressKHR(device, &info);

From here, this 64-bit value can be placed in a buffer. You can of course offset this VA. Alignment is never an issue, as shaders specify explicit alignment later.

A note on debugging

When using BDA, there are some extra features that drivers must support. Since a pointer does not necessarily exist when replaying an application capture in a debug tool, the driver must be able to guarantee that virtual addresses returned by the driver remain stable across runs. To that end, debug tools supply the expected VA, and the driver allocates that VA range. Applications do not care that much about this, but it is important to note that even if you can use BDA, you might not be able to debug with it.
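One step the snippets above skip over is that the feature has to be requested at device creation. Here is a hedged sketch using the Vulkan 1.2 core struct (the KHR-suffixed alias behaves the same); the full struct definition is shown right after this sketch.

    /* Hedged sketch: request bufferDeviceAddress when creating the device. The same
     * struct can be passed to vkGetPhysicalDeviceFeatures2 first to confirm support. */
    VkPhysicalDeviceBufferDeviceAddressFeatures bdaFeatures =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_BUFFER_DEVICE_ADDRESS_FEATURES };
    bdaFeatures.bufferDeviceAddress = VK_TRUE;

    VkDeviceCreateInfo deviceInfo = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
    deviceInfo.pNext = &bdaFeatures;  /* chain the feature request */
    /* ... queues, and the VK_KHR_buffer_device_address extension if not using core 1.2 ... */
    vkCreateDevice(gpu, &deviceInfo, NULL, &device);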
typedef struct VkPhysicalDeviceBufferDeviceAddressFeatures {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           bufferDeviceAddress;
    VkBool32           bufferDeviceAddressCaptureReplay;
    VkBool32           bufferDeviceAddressMultiDevice;
} VkPhysicalDeviceBufferDeviceAddressFeatures;

If bufferDeviceAddressCaptureReplay is supported, tools like RenderDoc can support BDA.

Using a pointer in a shader

In Vulkan GLSL, there is the GL_EXT_buffer_reference extension which allows us to declare a pointer type. A pointer like this can be placed in a buffer, or we can convert to and from integers:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require

layout(local_size_x = 64) in;

// These define pointer types.
layout(buffer_reference, std430, buffer_reference_align = 16) readonly buffer ReadVec4
{
    vec4 values[];
};

layout(buffer_reference, std430, buffer_reference_align = 16) writeonly buffer WriteVec4
{
    vec4 values[];
};

layout(buffer_reference, std430, buffer_reference_align = 4) readonly buffer UnalignedVec4
{
    vec4 value;
};

layout(push_constant, std430) uniform Registers
{
    ReadVec4 src;
    WriteVec4 dst;
} registers;

Placing raw pointers in push constants avoids all indirection for getting to a buffer. If the driver allows it, the pointers can be placed directly in GPU registers before the shader begins executing.

Not all devices support 64-bit integers, but it is possible to cast uvec2 <-> pointer, so doing address computation like this is fine:

uvec2 uadd_64_32(uvec2 addr, uint offset)
{
    uint carry;
    addr.x = uaddCarry(addr.x, offset, carry);
    addr.y += carry;
    return addr;
}

void main()
{
    uint index = gl_GlobalInvocationID.x;
    registers.dst.values[index] = registers.src.values[index];

    uvec2 addr = uvec2(registers.src);
    addr = uadd_64_32(addr, 20 * index);

    // Cast a uvec2 to an address and load a vec4 from it.
    // This address is aligned to 4 bytes.
    registers.dst.values[index + 1024] = UnalignedVec4(addr).value;
}

Pointer or offsets?

Using raw pointers is not always the best idea. A natural use case to consider for pointers is when you have tree or list structures in GPU memory. With pointers, you can jump around as much as you want, and even write new pointers to buffers. However, a pointer is 64-bit, and a typical performance consideration is to use 32-bit offsets (or even 16-bit offsets) if possible. Using offsets is the way to go if you can guarantee that all buffers live inside a single VkBuffer. On the other hand, the pointer approach can access any VkBuffer at any time without having to go through descriptors. Therein lies the key strength of BDA.

Extreme hackery: physical pointer as specialization constants

This is a life saver in certain situations where you are desperate to debug something without any available descriptor set. A black magic hack is to place a BDA inside a specialization constant. This allows for accessing a pointer without using any descriptors. Do note that this breaks all forms of pipeline caching and is only suitable for debug code. Do not ship this kind of code.
Perform this dark sorcery at your own risk:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require

layout(local_size_x = 64) in;

layout(constant_id = 0) const uint DEBUG_ADDR_LO = 0;
layout(constant_id = 1) const uint DEBUG_ADDR_HI = 0;

layout(buffer_reference, std430, buffer_reference_align = 4) buffer DebugCounter
{
    uint value;
};

void main()
{
    DebugCounter counter = DebugCounter(uvec2(DEBUG_ADDR_LO, DEBUG_ADDR_HI));
    atomicAdd(counter.value, 1u);
}

Emitting SPIR-V with buffer_device_address

In SPIR-V, there are some things to note. BDA is an especially useful feature for layering other APIs due to its extreme flexibility in how we access memory. Therefore, generating BDA code yourself is a reasonable use case to assume as well.

Enable BDA in shaders:

OpCapability PhysicalStorageBufferAddresses
OpExtension "SPV_KHR_physical_storage_buffer"

The memory model is PhysicalStorageBuffer64 and not Logical anymore:

OpMemoryModel PhysicalStorageBuffer64 GLSL450

The buffer reference types are declared basically just like SSBOs:

OpDecorate %_runtimearr_v4float ArrayStride 16
OpMemberDecorate %ReadVec4 0 NonWritable
OpMemberDecorate %ReadVec4 0 Offset 0
OpDecorate %ReadVec4 Block
OpDecorate %_runtimearr_v4float_0 ArrayStride 16
OpMemberDecorate %WriteVec4 0 NonReadable
OpMemberDecorate %WriteVec4 0 Offset 0
OpDecorate %WriteVec4 Block
OpMemberDecorate %UnalignedVec4 0 NonWritable
OpMemberDecorate %UnalignedVec4 0 Offset 0
OpDecorate %UnalignedVec4 Block

Declare a pointer to the blocks. PhysicalStorageBuffer is the storage class to use:

OpTypeForwardPointer %_ptr_PhysicalStorageBuffer_WriteVec4 PhysicalStorageBuffer
%_ptr_PhysicalStorageBuffer_ReadVec4 = OpTypePointer PhysicalStorageBuffer %ReadVec4
%_ptr_PhysicalStorageBuffer_WriteVec4 = OpTypePointer PhysicalStorageBuffer %WriteVec4
%_ptr_PhysicalStorageBuffer_UnalignedVec4 = OpTypePointer PhysicalStorageBuffer %UnalignedVec4

Load a physical pointer from PushConstant:

%55 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_WriteVec4 %registers %int_1
%56 = OpLoad %_ptr_PhysicalStorageBuffer_WriteVec4 %55

Access chain into it:

%66 = OpAccessChain %_ptr_PhysicalStorageBuffer_v4float %56 %int_0 %40

Aligned must be specified when dereferencing physical pointers. Pointers can have any arbitrary address and must be explicitly aligned, so the compiler knows what to do:

OpStore %66 %65 Aligned 16

For pointers, SPIR-V can bitcast between integers and pointers seamlessly, for example:

%61 = OpLoad %_ptr_PhysicalStorageBuffer_ReadVec4 %60
%70 = OpBitcast %v2uint %61
// do math on %70
%86 = OpBitcast %_ptr_PhysicalStorageBuffer_UnalignedVec4 %some_address

Conclusion

We have already explored two key Vulkan extension game changers through this blog and the previous one. The third and final part of this game changer blog series will explore 'timeline semaphores' and how developers can use this new extension to improve the development experience and enhance their games.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games. The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter.
Visit the Marketing Resources page for information on promoting and distributing your apps and games. Finally, our Developer Forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game
blog

Adaptive Scalable Texture Compression (ASTC) is an advanced lossy texture compression format, developed by Arm and AMD and released as a royalty-free open standard by the Khronos Group. It supports a wide range of 2D and 3D color formats with a flexible choice of bitrates, enabling content creators to compress almost any texture asset, using a level of compression appropriate to their quality and performance requirements. ASTC is increasingly becoming the texture compression format of choice for mobile 3D applications using the OpenGL ES and Vulkan APIs. ASTC's high compression ratios are a perfect match for the mobile market, which values smaller download sizes and optimized memory usage to improve energy efficiency and battery life.

ASTC 2D color formats and bitrates

astcenc 2.0

The 'astcenc' ASTC compression tool was first developed by Arm while ASTC was progressing through the Khronos standardization process seven years ago. astcenc has become widely used as the de facto reference encoder for ASTC, as it leverages all format features, including the full set of available block sizes and color profiles, to deliver the high-quality encoded textures that are possible when effectively using ASTC's flexible capabilities. Today, Arm is delighted to announce astcenc 2.0! This is a major update which provides multiple significant improvements for middleware and content creators.

Apache 2.0 open source license

The original astcenc software was released under an Arm end user license agreement. To make it easier for developers to use, adapt, and contribute to astcenc development, including integration of the compressor into application runtimes, Arm relicensed the astcenc 1.x source code on GitHub in January 2020 under the standard Apache 2.0 open source license. The new astcenc 2.0 source code is now also available on GitHub under Apache 2.0.

Compression performance

astcenc 1.x emphasized high image quality over fast compression speed. Some developers have told Arm they would love to use astcenc for its superior image quality, but compression was too slow to use in their tooling pipelines. The importance of this was reflected in the recent ASTC developer survey organized by Khronos, where developer responses rated compression speed above image quality in the list of factors that determine texture format choices. For version 2.0, Arm reviewed the heuristics and quality refinement passes used by the astcenc compressor, optimizing those that were adding value and removing those that simply did not justify their added runtime cost. In addition, hand-coded vectorized code was added to the most compute-intensive sections of the codec, supporting the SSE4.2 and AVX2 SIMD instruction sets. Overall, these optimizations have resulted in up to 3x faster compression times when using AVX2, while typically losing less than 0.1 dB PSNR in image quality. A very worthwhile tradeoff for most developers.

astcenc 2.0 - significantly faster ASTC encoding

Command line improvements

The tool now supports a clearer set of compression modes that directly map to the ASTC format profiles exposed by Khronos API support and API extensions. Textures compressed using the LDR compression modes (linear or sRGB) will be compatible with all hardware implementing OpenGL ES 3.2, the OpenGL ES KHR_texture_compression_astc_ldr extension, or the Vulkan ASTC optional feature. Textures compressed using the HDR compression mode will require hardware implementing an appropriate API extension, such as KHR_texture_compression_astc_hdr.
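As a rough sketch of what this looks like in practice (the file names below are placeholders, and the exact option set for your build is listed by the tool's built-in help), a linear LDR texture can be compressed with the -cl mode, with -cs and -ch covering sRGB LDR and HDR content, a block size such as 6x6, and one of the quality presets:

astcenc -cl kart_diffuse.png kart_diffuse.astc 6x6 -medium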
In addition, astcenc 2.0 now supports commonly requested input and output file formats:

- Loading LDR images in BMP, JPEG, PNG, and TGA formats
- Loading HDR images in OpenEXR and Radiance HDR formats
- Loading compressed textures in the ".astc" file format provided by astcenc, and the DDS and KTX container formats
- Storing LDR images into BMP, PNG, and TGA formats
- Storing HDR images into OpenEXR and Radiance HDR formats
- Storing compressed textures into the ".astc" file format provided by astcenc, and the DDS or KTX container formats

Core codec library

Finally, the core codec is now separable from the command line front-end logic, enabling the astcenc compressor to be integrated directly into applications as a library. The core codec library interface API provides a programmatic mechanism to manage codec configuration, texture compression, and texture decompression. This API enables use of the core codec library to process data stored in memory buffers, leaving file management to the application. It supports parallel processing, either compressing a single image with multiple threads or compressing multiple images in parallel.

Using astcenc 2.0

You can download astcenc 2.0 on GitHub today, with full source code and pre-built binaries available for Windows, macOS, and Linux hosts. For more information about using the tool, please refer to the project documentation:

- Getting started: learn about the high-level operation of the compressor.
- Format overview: learn about the ASTC data format and how the underlying encoding works.
- Efficient encoding: learn about using the command line to effectively compress textures, and the encoding and sampling needed to get functional equivalents to other texture formats that exist on the market today.

Arm has also published an ASTC guide, which gives an overview of the format and some of the available tools, including astcenc.

- Arm ASTC guide: an overview of ASTC and available ASTC tools.

If you have any questions, feedback, or pull requests, please get in touch via the GitHub issue tracker or the Arm Mali developer community forums:

https://github.com/arm-software/astc-encoder
https://community.arm.com/graphics/

Khronos® and Vulkan® are registered trademarks, and ANARI™, WebGL™, glTF™, NNEF™, OpenVX™, SPIR™, SPIR-V™, SYCL™, OpenVG™ and 3D Commerce™ are trademarks of The Khronos Group Inc. OpenXR™ is a trademark owned by The Khronos Group Inc. and is registered as a trademark in China, the European Union, Japan and the United Kingdom. OpenCL™ is a trademark of Apple Inc. and OpenGL® is a registered trademark and the OpenGL ES™ and OpenGL SC™ logos are trademarks of Hewlett Packard Enterprise used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.
Peter Harris