tutorials game, mobile
In this article, I would like to introduce a hardware optimisation technique called variable rate shading (VRS) and explain how it can benefit games on mobile phones.

Introduction

Traditionally, each pixel in a rendered image is shaded individually, meaning we can shade very high detail anywhere in the image, which, in theory, is great. In practice, however, this can lead to wasteful GPU calculations in areas where detail is less important. In some cases, you do not need 1x1 pixel shading to produce a high-quality image. For example, areas that represent unlit surfaces caused by shadows naturally contain less detail than brightly lit areas. Moreover, areas that are out of focus due to camera post-effects, or that are affected by motion blur, naturally do not contain high detail. In these cases we could benefit from letting multiple pixels be shaded by a single calculation (such as a 2x2 or 4x4 group of pixels) without losing any noticeable visual quality.

The high-resolution sky texture on the left looks very much like the lower-resolution sky texture on the right. This is due to the smooth colour gradients and lack of high-frequency colour variation. For those reasons, there is room for a lot of optimisation.

You could argue that optimisation is more essential for handheld devices, like mobile phones, than for stationary devices, like games consoles, for a couple of reasons. Firstly, the hardware in handheld devices is often less powerful than conventional hardware due to its smaller size and lower electrical power supply. The compact size of handheld hardware is also the reason it is more likely to suffer from temperature issues causing thermal throttling, where performance slows down significantly. Secondly, heavy graphics in games can quickly drain your phone's battery. So, it is crucial to keep GPU resource usage to a minimum when possible. Variable rate shading is a way to help do just that.

How does variable rate shading work?

In principle, variable rate shading is a very simple method which can be implemented without having to redesign an existing rendering pipeline. There are three ways to define the areas to be optimised using variable rate shading:

- Let an attachment in the form of an image serve as a mask.
- Execute the optimisation on a per-triangle basis.
- Base the VRS optimisation on a per-draw call.

Use an attachment as a mask

You can provide the GPU with an image that serves as a mask. The mask contains information about which areas need to be rendered in the traditional manner, shading each pixel individually, and which areas can be optimised by shading a group of pixels at once. The image below visualises such a mask by colour-coding the different areas: the blue area has no optimisation applied (1x1), as this is where the player focuses while driving. The green area is optimised by covering four pixels (2x2) with a single shading calculation, as this area contains less detail due to motion blur. The red area can be optimised even more (4x4), as it is affected by more aggressive motion blur. The yellow and purple areas are likewise shaded with fewer shading calculations.

The areas defined in the image above could be static, at least while the player is driving the boat at top speed, as the boat is positioned at the centre of the image at all times.
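Before moving on, here is a rough illustration of the mask-based approach using the Vulkan extension listed later in this article (VK_KHR_fragment_shading_rate). This is a hedged sketch, not code from the article: the attachment index and tile size are illustrative, and the mask image itself would be created with the VK_IMAGE_USAGE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_KHR usage flag.

    /* Hedged sketch: reference a low-resolution R8_UINT mask image as a fragment
     * shading rate attachment. Each texel of the mask selects the shading rate
     * for a tile of pixels. */
    VkAttachmentReference2 srRef = { VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2 };
    srRef.attachment = 1; /* illustrative index of the mask attachment */
    srRef.layout = VK_IMAGE_LAYOUT_FRAGMENT_SHADING_RATE_ATTACHMENT_OPTIMAL_KHR;

    VkFragmentShadingRateAttachmentInfoKHR srInfo =
        { VK_STRUCTURE_TYPE_FRAGMENT_SHADING_RATE_ATTACHMENT_INFO_KHR };
    srInfo.pFragmentShadingRateAttachment = &srRef;
    srInfo.shadingRateAttachmentTexelSize.width  = 16; /* one mask texel per 16x16 pixel tile */
    srInfo.shadingRateAttachmentTexelSize.height = 16;

    VkSubpassDescription2 subpass = { VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2 };
    subpass.pNext = &srInfo; /* chain the mask into the subpass description */
    /* ... fill in colour/depth attachments and create the render pass as usual ... */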
Returning to the racing example above: the level of optimisation could be reduced when another boat is passing by, or when the boat slows down and the motion blur is therefore gradually reduced. There are times when a more dynamic approach is needed, as it can sometimes be difficult to know beforehand which areas should be optimised and which should be shaded in the traditional manner. In those cases, it can be beneficial to generate the mask more dynamically by rendering the geometry of the scene in an extra pass: simply colour the geometric elements in the scene and pass the result to the GPU as a mask for the variable rate shading optimisation. If the scene is rendered using deferred lighting, an extra pass may not be needed, as the mask can be based on the geometry pass already required for deferred shading.

Optimisation based on primitives

Another way of using variable rate shading is to take advantage of other extensions that allow you to define the geometric elements to be optimised, rather than using a mask. This can be done on a per-triangle basis or per draw call. Defining geometric elements can be a more efficient approach, as there is no need to generate a mask and it requires less memory bandwidth. With the per-triangle extension, you define the optimisation level in the vertex shader. With the per-draw call method, the optimisation level is defined before the draw call takes place. Keep in mind that the three methods can be combined if needed.

The image below shows a rendering pass where all objects in a scene are shaded in different colours to define which areas should be shaded in the traditional manner (meaning no optimisation) and which areas contain less detail (and therefore need fewer GPU calculations). The areas shown can be defined by any of the three methods. In general, breaking a scene up into layers, where the elements nearest the camera have the least optimisation and layers in the background have the most, is an effective way to go about it. The image below shows the same scene, but this time we see the final output with VRS on and off. As you may have noticed, it is very hard to tell any difference when the VRS optimisation is turned on or off.

Experiences with variable rate shading so far

Some commercial games have already implemented variable rate shading successfully. The image below is from Wolfenstein: Youngblood. As you may have noticed, there is barely any visual difference with VRS on or off, but there is a measurable difference in frame rate. In fact, the game performs, on average, 10% or more better with VRS turned on. That may not sound like a lot, but considering that it is an easy optimisation to implement, that there is barely any noticeable change in visual quality, and that the boost comes on top of other optimisation techniques, it is actually not a bad performance gain after all. Other games have shown an even higher performance boost: for example, Gears Tactics gains up to 30% when using variable rate shading. The image below is from that game.

Virtual reality

Variable rate shading can benefit virtual reality as well. Not only does virtual reality by nature require two rendered images (one for each eye), but the player wearing the headset naturally pays most attention to the central area of the rendered image. The areas of the rendered image seen out of the corner of your eye naturally do not need the same amount of detail as the central area.
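Before continuing with virtual reality, here is a hedged sketch of the per-draw method described above, again using the Vulkan extension listed later in this article. The 2x2 rate and the combiner choices are illustrative; KEEP simply means the per-primitive and mask rates are ignored for these draws.

    /* Hedged sketch: per-draw shading rate via VK_KHR_fragment_shading_rate.
     * A 2x2 fragment size means one shading calculation covers four pixels. */
    VkExtent2D fragmentSize = { 2, 2 };
    VkFragmentShadingRateCombinerOpKHR combiners[2] = {
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* ignore the per-primitive rate */
        VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR  /* ignore the attachment (mask) rate */
    };
    vkCmdSetFragmentShadingRateKHR(cmd, &fragmentSize, combiners);

    /* Subsequent draws in this command buffer are shaded at the 2x2 rate. */
    vkCmdDraw(cmd, vertexCount, 1, 0, 0);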
Returning to virtual reality: this means that even though a static VRS mask can be used for a reasonable overall optimisation, using an eye tracker could result in an even more efficient optimisation and therefore a less noticeable quality reduction.

A consistently high frame rate is crucial for virtual reality. If the frame rate is not relatively consistent, or rendering performance suffers from a consistently low frame rate, it quickly becomes uncomfortable to wear a VR headset, and the player might even get dizzy and feel physically sick. By reducing GPU calculations, variable rate shading not only boosts the frame rate, it also uses less battery on mobile devices. This is a huge win for systems like Samsung Gear VR, where long battery life is much appreciated as the graphics are running on a Galaxy mobile phone.

The image below shows a variable rate shading mask generated by eye tracking technology for a virtual reality headset. The centre of the left and right images shades pixels in the traditional manner; the other colours represent different degrees of optimisation.

Which Samsung devices support variable rate shading?

All hardware listed here supports variable rate shading.

Mobile phones: Samsung Galaxy S22, S22+ and S22 Ultra
Tablets: Samsung Tab S8, S8+ and S8 Ultra

Both of the following graphics APIs support variable rate shading: Vulkan, and OpenGL ES 2.0 (and higher). The OpenGL extensions for the three ways of using variable rate shading are the following:

- GL_EXT_fragment_shading_rate_attachment, for sending a mask to the GPU.
- GL_EXT_fragment_shading_rate_primitive, for the per-triangle basis, where writing a value to gl_PrimitiveShadingRateEXT in the vertex shader defines the level of optimisation.
- GL_EXT_fragment_shading_rate, for the per-draw call, where glShadingRateEXT is called to define the optimisation level.

The extension that enables variable rate shading for Vulkan is VK_KHR_fragment_shading_rate.

Conclusion

In this article, we have established the following:

- Variable rate shading is a hardware feature and is fairly easy to implement, as it does not require any redesign of existing rendering pipelines.
- Variable rate shading is an optimisation technique which reduces GPU calculations by allowing a group of pixels to be shaded with the same colour rather than each pixel individually.
- Variable rate shading is particularly useful for mobile gaming as well as Samsung Gear VR, as it boosts performance and prolongs battery life.
- The level of optimisation can be defined by passing a mask to the GPU that contains areas of different optimisation levels.
- Some implementations have been shown to boost the frame rate by 10% or more, while other implementations increase the frame rate by up to 30%.

Note: some images in this post are courtesy of UL Solutions.

Additional resources on the Samsung Developers site

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account and subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Søren Klit Lambæk
tutorials game, mobile
With the increasing popularity of foldable phones such as the Galaxy Z Fold3 and Galaxy Z Flip3, apps on these devices are adopting their foldable features. In this blog, you can get started on how to utilize these foldable features in Android game apps. We focus on creating a Java file containing an implementation of the Android Jetpack WindowManager library that can be imported into game engines like Unity or Unreal Engine. This creates an interface allowing developers to retrieve information about the folding feature on the device. At the end of this blog, you can go deeper by heading over to Code Lab.

Android Jetpack WindowManager

Android Jetpack, in their own words, is "a suite of libraries to help developers follow best practices, reduce boilerplate code, and write code that works consistently across Android versions and devices so that developers can focus on the code they care about."

WindowManager is one of these libraries, and is intended to help application developers support new device form factors and multi-window environments. The library had its 1.0.0 release in January 2022, targeting foldable devices. According to its documentation, future versions will be extended to more display types and window features.

Creating the Android Jetpack WindowManager setup

As previously mentioned, we are creating a Java file that can be imported into either Unity or Unreal Engine 4 to create an interface for retrieving information on the folding feature and passing it over to the native or engine side of your application.

Set up the FoldableHelper class and data storage class

Create a file called FoldableHelper.java in Visual Studio or any source code editor. Let's start off by giving it a package name:

    package com.samsung.android.gamedev.foldable;

Next, let's import all the necessary libraries and classes in this file:

    // Android imports
    import android.app.Activity;
    import android.graphics.Rect;
    import android.os.Handler;
    import android.os.Looper;
    import android.util.Log;

    // Android Jetpack WindowManager imports
    import androidx.annotation.NonNull;
    import androidx.core.util.Consumer;
    import androidx.window.java.layout.WindowInfoTrackerCallbackAdapter;
    import androidx.window.layout.DisplayFeature;
    import androidx.window.layout.FoldingFeature;
    import androidx.window.layout.WindowInfoTracker;
    import androidx.window.layout.WindowLayoutInfo;
    import androidx.window.layout.WindowMetrics;
    import androidx.window.layout.WindowMetricsCalculator;

    // Java imports
    import java.util.List;
    import java.util.concurrent.Executor;

Start by creating a class, FoldableHelper, that is going to contain all of our helper functions. Then create variables to store a callback object as well as the WindowInfoTrackerCallbackAdapter and WindowMetricsCalculator. Let's also create a temporary declaration of the native function used to pass the data from Java to the native side of the application once we start working in the game engines.

    public class FoldableHelper {
        private static LayoutStateChangeCallback layoutStateChangeCallback;
        private static WindowInfoTrackerCallbackAdapter wit;
        private static WindowMetricsCalculator wmc;

        public static native void onLayoutChanged(FoldableLayoutInfo resultInfo);
    }

Let's create a storage class to hold the data received from the WindowManager library. An instance of this class will also be passed to the native code to transfer the data.
    public static class FoldableLayoutInfo {
        public static int UNDEFINED = -1;

        // Hinge orientation
        public static int HINGE_ORIENTATION_HORIZONTAL = 0;
        public static int HINGE_ORIENTATION_VERTICAL = 1;

        // State
        public static int STATE_FLAT = 0;
        public static int STATE_HALF_OPENED = 1;

        // Occlusion type
        public static int OCCLUSION_TYPE_NONE = 0;
        public static int OCCLUSION_TYPE_FULL = 1;

        Rect currentMetrics = new Rect();
        Rect maxMetrics = new Rect();

        int hingeOrientation = UNDEFINED;
        int state = UNDEFINED;
        int occlusionType = UNDEFINED;
        boolean isSeparating = false;
        Rect bounds = new Rect();
    }

Initialize the WindowInfoTracker

Since we are working in Java and the WindowManager library is written in Kotlin, we have to use the WindowInfoTrackerCallbackAdapter. This is an interface provided by Android to enable the use of the WindowInfoTracker from Java. The window info tracker is how we receive information about any folding features inside the window's bounds.

Next is to create the WindowMetricsCalculator, which lets us retrieve the window metrics of an activity. Window metrics consist of the window's current and maximum bounds.

We also create a new LayoutStateChangeCallback object. This object is passed into the window info tracker as a listener object and is called every time the layout of the device changes (for our purposes, when the foldable state changes).

    public static void init(Activity activity) {
        // Create window info tracker
        wit = new WindowInfoTrackerCallbackAdapter(WindowInfoTracker.Companion.getOrCreate(activity));
        // Create window metrics calculator
        wmc = WindowMetricsCalculator.Companion.getOrCreate();
        // Create callback object
        layoutStateChangeCallback = new LayoutStateChangeCallback(activity);
    }

Set up and attach the callback listener

In this step, let's attach the LayoutStateChangeCallback to the WindowInfoTrackerCallbackAdapter as a listener. The addWindowLayoutInfoListener function takes three parameters: the activity to attach the listener to, an executor, and a consumer of WindowLayoutInfo. We will set up the executor and consumer in a moment. The adding of the listener is kept separate from the initialization, since the first WindowLayoutInfo is not emitted until Activity.onStart has been called. As such, we likely do not need to attach the listener until during or after onStart, but we can still set up the WindowInfoTracker and WindowMetricsCalculator ahead of time.

    public static void start(Activity activity) {
        wit.addWindowLayoutInfoListener(activity, runOnUiThreadExecutor(), layoutStateChangeCallback);
    }

Now, let's create the executor for the listener. This executor is straightforward and simply runs the command on the main looper of our activity. It is possible to set this up to run on a custom thread; however, that is not covered in this blog. For more information, we recommend checking the official documentation for the Jetpack WindowManager.

    static Executor runOnUiThreadExecutor() {
        return new MyExecutor();
    }

    static class MyExecutor implements Executor {
        Handler handler = new Handler(Looper.getMainLooper());

        @Override
        public void execute(Runnable command) {
            handler.post(command);
        }
    }

Next, we create the basic layout of our LayoutStateChangeCallback. It consumes WindowLayoutInfo and implements Consumer<WindowLayoutInfo>. For now, let's simply lay out the class and give it some functionality a little later.
    static class LayoutStateChangeCallback implements Consumer<WindowLayoutInfo> {
        private final Activity activity;

        public LayoutStateChangeCallback(Activity activity) {
            this.activity = activity;
        }
    }

If the listener is no longer needed, we want a way to remove it, and the WindowInfoTrackerCallbackAdapter contains a function to do just that:

    public static void stop() {
        wit.removeWindowLayoutInfoListener(layoutStateChangeCallback);
    }

This just tidies things up for us and ensures that the listener is cleaned up when we no longer need it.

Next, we add some functionality to the LayoutStateChangeCallback class. We are going to process WindowLayoutInfo into the FoldableLayoutInfo we created previously. Using the Java Native Interface (JNI), we then send that information over to the native side using the function onLayoutChanged. Note: this doesn't actually do anything yet, but we cover how to set it up in Unreal Engine and in Unity through the Code Lab tutorials.

    static class LayoutStateChangeCallback implements Consumer<WindowLayoutInfo> {
        @Override
        public void accept(WindowLayoutInfo windowLayoutInfo) {
            FoldableLayoutInfo resultInfo = updateLayout(windowLayoutInfo, activity);
            onLayoutChanged(resultInfo);
        }
    }

Let's implement the updateLayout function to process WindowLayoutInfo and return a FoldableLayoutInfo. Firstly, create a FoldableLayoutInfo that will contain the processed information. Follow this up by getting the window metrics, both maximum metrics and current metrics.

    private static FoldableLayoutInfo updateLayout(WindowLayoutInfo windowLayoutInfo, Activity activity) {
        FoldableLayoutInfo retLayoutInfo = new FoldableLayoutInfo();

        WindowMetrics wm = wmc.computeCurrentWindowMetrics(activity);
        retLayoutInfo.currentMetrics = wm.getBounds();

        wm = wmc.computeMaximumWindowMetrics(activity);
        retLayoutInfo.maxMetrics = wm.getBounds();
    }

Get the DisplayFeatures present in the current window bounds using WindowLayoutInfo.getDisplayFeatures. Currently, the API only has one type of DisplayFeature, FoldingFeature; however, in the future there will likely be more as screen types evolve. At this point, let's use a for loop to iterate through the resulting list until it finds a FoldingFeature. Once it detects a folding feature, it starts processing its data: orientation, state, separation type, and its bounds. Then, store these data in the FoldableLayoutInfo we created at the start of the function call. You can learn more about these data in the Jetpack WindowManager documentation.
    private static FoldableLayoutInfo updateLayout(WindowLayoutInfo windowLayoutInfo, Activity activity) {
        FoldableLayoutInfo retLayoutInfo = new FoldableLayoutInfo();

        WindowMetrics wm = wmc.computeCurrentWindowMetrics(activity);
        retLayoutInfo.currentMetrics = wm.getBounds();

        wm = wmc.computeMaximumWindowMetrics(activity);
        retLayoutInfo.maxMetrics = wm.getBounds();

        List<DisplayFeature> displayFeatures = windowLayoutInfo.getDisplayFeatures();
        if (!displayFeatures.isEmpty()) {
            for (DisplayFeature displayFeature : displayFeatures) {
                FoldingFeature foldingFeature = (FoldingFeature) displayFeature;
                if (foldingFeature != null) {
                    if (foldingFeature.getOrientation() == FoldingFeature.Orientation.HORIZONTAL) {
                        retLayoutInfo.hingeOrientation = FoldableLayoutInfo.HINGE_ORIENTATION_HORIZONTAL;
                    } else {
                        retLayoutInfo.hingeOrientation = FoldableLayoutInfo.HINGE_ORIENTATION_VERTICAL;
                    }

                    if (foldingFeature.getState() == FoldingFeature.State.FLAT) {
                        retLayoutInfo.state = FoldableLayoutInfo.STATE_FLAT;
                    } else {
                        retLayoutInfo.state = FoldableLayoutInfo.STATE_HALF_OPENED;
                    }

                    if (foldingFeature.getOcclusionType() == FoldingFeature.OcclusionType.NONE) {
                        retLayoutInfo.occlusionType = FoldableLayoutInfo.OCCLUSION_TYPE_NONE;
                    } else {
                        retLayoutInfo.occlusionType = FoldableLayoutInfo.OCCLUSION_TYPE_FULL;
                    }

                    retLayoutInfo.isSeparating = foldingFeature.isSeparating();
                    retLayoutInfo.bounds = foldingFeature.getBounds();

                    return retLayoutInfo;
                }
            }
        }
        return retLayoutInfo;
    }

If there is no folding feature detected, the function simply returns the FoldableLayoutInfo without setting its data, leaving it with UNDEFINED (-1) values.

Conclusion

The Java file you have now created should be usable in new or existing Unity and Unreal Engine projects to provide access to the information on the folding feature. Continue learning about it in the Code Lab tutorials, which show how to use the file created here to implement flex mode detection and usage in game applications.

Additional resources on the Samsung Developers site

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account and subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
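To round off the tutorial, here is a hedged sketch of how a host Activity might drive the helper built above. The lifecycle wiring is an assumption on my part (the Code Lab tutorials cover the engine-specific integration); only FoldableHelper and its init/start/stop functions come from this article.

    // Hedged sketch: hypothetical host Activity driving the FoldableHelper built above.
    // Where exactly these calls live depends on your engine integration.
    public class GameActivity extends android.app.Activity {
        @Override
        protected void onCreate(android.os.Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            FoldableHelper.init(this);   // safe to do early: only creates the tracker objects
        }

        @Override
        protected void onStart() {
            super.onStart();
            FoldableHelper.start(this);  // WindowLayoutInfo is only emitted after onStart
        }

        @Override
        protected void onStop() {
            super.onStop();
            FoldableHelper.stop();       // detach the listener when the Activity is not visible
        }
    }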
Lochlann Henry Ramsay-Edwards
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

Android is enabling a host of useful new Vulkan extensions for mobile. These new extensions are set to improve the state of graphics APIs for modern applications, enabling new use cases and changing how developers can design graphics renderers going forward. I have already provided information about the 'maintenance extensions'. Another important group, which I explore in this blog, is the 'legacy support extensions'.

Vulkan is increasingly being used as a portable "HAL". The power and flexibility of the API allows for great layered implementations, and there is a lot of effort spent in the ecosystem enabling legacy graphics APIs to run efficiently on top of Vulkan. The bright future for driver developers is a world where GPU drivers only implement Vulkan, and where legacy APIs can be implemented on top of that driver. To that end, there are several features which are generally considered backwards today. They should not be used in new applications unless absolutely required. These extensions exist to facilitate old applications which need to keep running through API translation layers such as ANGLE, DXVK, Zink, and so on.

VK_EXT_transform_feedback

Speaking the name of this extension causes the general angst level to rise in a room of driver developers. In the world of Direct3D, this feature is also known as stream-out. The core feature of this extension is that whenever you render geometry, you can capture the resulting geometry data (position and vertex outputs) into a buffer. The key complication from an implementation point of view is that the result is ordered. This means there is no 1:1 relation of input vertex to output data, since this extension is supposed to work with indexed rendering as well as strip types (and even geometry shaders and tessellation, oh my!).

This feature was invented in a world before compute shaders were conceived. The only real method to perform buffer <-> buffer computation was to make use of transform feedback, vertex shaders and rasterization discard. Over time, the functionality of transform feedback was extended in various ways, but today it is essentially obsoleted by compute shaders. There are, however, two niches where this extension still makes sense: graphics debuggers and API translation layers. Transform feedback is extremely difficult to emulate in the more complicated cases.

Setting up shaders

In vertex-like shader stages, you need to set up which vertex outputs to capture to a buffer. The shader itself controls the memory layout of the output data. This is unlike other APIs, where you use the graphics API to specify which outputs to capture based on the name of the variable.
Here is an example Vulkan GLSL shader:

    #version 450

    layout(xfb_stride = 32, xfb_offset = 0, xfb_buffer = 0, location = 0) out vec4 vColor;
    layout(xfb_stride = 32, xfb_offset = 16, xfb_buffer = 0, location = 1) out vec4 vColor2;

    layout(xfb_buffer = 1, xfb_stride = 16) out gl_PerVertex
    {
        layout(xfb_offset = 0) vec4 gl_Position;
    };

    void main()
    {
        gl_Position = vec4(1.0);
        vColor = vec4(2.0);
        vColor2 = vec4(3.0);
    }

The resulting SPIR-V will then look something like:

    OpCapability TransformFeedback
    OpExecutionMode 4 Xfb
    OpDecorate 8(gl_PerVertex) Block
    OpDecorate 10 XfbBuffer 1
    OpDecorate 10 XfbStride 16
    OpDecorate 17(vColor) Location 0
    OpDecorate 17(vColor) XfbBuffer 0
    OpDecorate 17(vColor) XfbStride 32
    OpDecorate 17(vColor) Offset 0
    OpDecorate 20(vColor2) Location 1
    OpDecorate 20(vColor2) XfbBuffer 0
    OpDecorate 20(vColor2) XfbStride 32
    OpDecorate 20(vColor2) Offset 16

Binding transform feedback buffers

Once we have a pipeline which can emit transform feedback data, we need to bind buffers:

    vkCmdBindTransformFeedbackBuffersEXT(cmd, firstBinding, bindingCount, pBuffers, pOffsets, pSizes);

To enable a buffer to be captured, VK_BUFFER_USAGE_TRANSFORM_FEEDBACK_BUFFER_BIT_EXT is used.

Starting and stopping capture

Once we know where to write the vertex output data, we will begin and end captures. This needs to be done inside a render pass:

    vkCmdBeginTransformFeedbackEXT(cmd, firstCounterBuffer, counterBufferCount, pCounterBuffers, pCounterBufferOffsets);

A counter buffer allows us to handle scenarios where we end a transform feedback and continue capturing later. We would not necessarily know how many bytes were written by the last transform feedback, so it is critical that we can let the GPU maintain a byte counter for us.

    vkCmdDraw(cmd, …);
    vkCmdDrawIndexed(cmd, …);

Then we can start rendering. Vertex outputs are captured to the buffers in order.

    vkCmdEndTransformFeedbackEXT(cmd, firstCounterBuffer, counterBufferCount, pCounterBuffers, pCounterBufferOffsets);

Once we are done capturing, we end the transform feedback and, with the counter buffers, we can write the new buffer offsets into the counter buffer.

Indirectly drawing transform feedback results

This feature is a precursor to the more flexible indirect draw feature we have in Vulkan, but there was a time when this feature was the only efficient way to render transform feedback outputs. The fundamental problem is that we do not necessarily know exactly how many primitives have been rendered. Therefore, to avoid stalling the CPU, it was required to be able to indirectly render the results with a special-purpose API.

    vkCmdDrawIndirectByteCountEXT(cmd, instanceCount, firstInstance, counterBuffer, counterBufferOffset, counterOffset, vertexStride);

This works similarly to a normal indirect draw call, but instead of providing a vertex count, we give it a byte count and let the GPU perform the divide instead. This is nice, as otherwise we would have to dispatch a tiny compute kernel that converts a byte count to an indirect draw.

Queries

The offset counter is sort of like a query, but if the transform feedback buffers overflow, any further writes are ignored. The VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT query reports how many primitives were generated. It also lets you query how many primitives were attempted to be written. This makes it possible to detect overflow, if that is desirable.
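The queries paragraph above has no accompanying snippet, so here is a hedged sketch of how such a query might be recorded. The indexed query entry points are part of VK_EXT_transform_feedback; the pool parameters and stream index are illustrative.

    /* Hedged sketch: counting generated/written primitives for vertex stream 0
     * with VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT. */
    VkQueryPoolCreateInfo poolInfo = { VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO };
    poolInfo.queryType = VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT;
    poolInfo.queryCount = 1;
    vkCreateQueryPool(device, &poolInfo, NULL, &queryPool);

    /* Record around the capture; the last parameter is the vertex stream index. */
    vkCmdBeginQueryIndexedEXT(cmd, queryPool, 0, 0, 0);
    /* ... vkCmdBeginTransformFeedbackEXT / draws / vkCmdEndTransformFeedbackEXT ... */
    vkCmdEndQueryIndexedEXT(cmd, queryPool, 0, 0);

    /* The query result holds two 64-bit values, primitives written and primitives
     * needed, which lets the application detect overflow. */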
VK_EXT_line_rasterization

Line rasterization is a tricky subject and is not normally used in gaming applications, since lines do not scale with resolution and their exact behavior is not consistent across all GPU implementations. In the world of CAD, however, this feature is critical, and older OpenGL APIs had extensive support for quite fancy line rendering methods. This extension essentially brings back those workstation features. Advanced line rendering can occasionally be useful for debug tooling and visualization as well.

The feature zoo

    typedef struct VkPhysicalDeviceLineRasterizationFeaturesEXT {
        VkStructureType    sType;
        void*              pNext;
        VkBool32           rectangularLines;
        VkBool32           bresenhamLines;
        VkBool32           smoothLines;
        VkBool32           stippledRectangularLines;
        VkBool32           stippledBresenhamLines;
        VkBool32           stippledSmoothLines;
    } VkPhysicalDeviceLineRasterizationFeaturesEXT;

This extension supports a lot of different feature bits. I will try to summarize what they mean below.

Rectangular lines vs parallelogram

When rendering normal lines in core Vulkan, there are two ways lines can be rendered. If VkPhysicalDeviceLimits::strictLines is true, a line is rendered as if it were a true, oriented rectangle. This is essentially what you would get if you rendered a scaled and rotated rectangle yourself: the hardware just expands the line along the axis perpendicular to the line. In non-strict rendering, we get a parallelogram, where the line is extended either in the X or the Y direction. (From the Vulkan specification.)

Bresenham lines

Bresenham lines reformulate the line rendering algorithm so that each pixel has a diamond-shaped area around it, and coverage is based on the line intersecting and exiting that area. The advantage here is that rendering line strips avoids overdraw. Rectangle or parallelogram rendering does not guarantee this, which matters if you are rendering line strips with blending enabled. (From the Vulkan specification.)

Smooth lines

Smooth lines work like rectangular lines, except the implementation can render a little further out to create a smooth edge. The exact behavior is completely unspecified, and here we find the only instance of the word "aesthetic" in the entire specification, which is amusing. It is a wonderfully vague word to see in the Vulkan specification, which is otherwise no-nonsense normative. This feature is designed to work in combination with alpha blending, since the smooth coverage of the line rendering is multiplied into the alpha channel of render target 0's output.

Line stipple

A "classic" feature that will make most IHVs cringe a little. When rendering a line, it is possible to mask certain pixels in a pattern. A counter runs while rasterizing pixels in order, and with line stipple you control a divider and a mask which generate a fixed pattern for when to discard pixels. It is somewhat unclear whether this feature is really needed when it is possible to use discard in the fragment shader, but alas, legacy features from the early 90s are sometimes used. There were no shaders back in those days.
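Before looking at pipeline state, here is a hedged sketch of how the feature bits above might be queried at startup. This is the standard Vulkan feature-query pattern rather than anything specific to this article.

    /* Hedged sketch: query which line rasterization features the device exposes
     * before deciding which pipeline state to request. */
    VkPhysicalDeviceLineRasterizationFeaturesEXT lineFeatures =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_LINE_RASTERIZATION_FEATURES_EXT };
    VkPhysicalDeviceFeatures2 features2 = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2 };
    features2.pNext = &lineFeatures;

    vkGetPhysicalDeviceFeatures2(gpu, &features2);

    if (lineFeatures.bresenhamLines)
    {
        /* Safe to use VK_LINE_RASTERIZATION_MODE_BRESENHAM_EXT; remember to chain the
         * same (enabled) feature struct into VkDeviceCreateInfo::pNext as well. */
    }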
Configuring rasterization pipeline state

When creating a graphics pipeline, you can pass in some more data in the pNext of the rasterization state:

    typedef struct VkPipelineRasterizationLineStateCreateInfoEXT {
        VkStructureType             sType;
        const void*                 pNext;
        VkLineRasterizationModeEXT  lineRasterizationMode;
        VkBool32                    stippledLineEnable;
        uint32_t                    lineStippleFactor;
        uint16_t                    lineStipplePattern;
    } VkPipelineRasterizationLineStateCreateInfoEXT;

    typedef enum VkLineRasterizationModeEXT {
        VK_LINE_RASTERIZATION_MODE_DEFAULT_EXT = 0,
        VK_LINE_RASTERIZATION_MODE_RECTANGULAR_EXT = 1,
        VK_LINE_RASTERIZATION_MODE_BRESENHAM_EXT = 2,
        VK_LINE_RASTERIZATION_MODE_RECTANGULAR_SMOOTH_EXT = 3,
    } VkLineRasterizationModeEXT;

If line stipple is enabled, the line stipple factor and pattern can be baked into the pipeline, or made dynamic pipeline state using VK_DYNAMIC_STATE_LINE_STIPPLE_EXT. In the dynamic case, the line stipple factor and pattern can be modified dynamically with:

    vkCmdSetLineStippleEXT(cmd, factor, pattern);

VK_EXT_index_type_uint8

In OpenGL and OpenGL ES, we have support for 8-bit index buffers. Core Vulkan and Direct3D, however, only support 16-bit and 32-bit index buffers. Since emulating index buffer formats is impractical with indirect draw calls being a thing, we need to be able to bind 8-bit index buffers. This extension does just that. It is probably the simplest extension we have looked at so far:

    vkCmdBindIndexBuffer(cmd, indexBuffer, offset, VK_INDEX_TYPE_UINT8_EXT);
    vkCmdDrawIndexed(cmd, …);

Conclusion

I have been through the 'maintenance' and 'legacy support' extensions that are part of the new Vulkan extensions for mobile. In the next three blogs, I will go through what I see as the 'game-changing' extensions from Vulkan: the three that will help to transform your games during the development process.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games. The original version of this article can be viewed at Arm Community.

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps and games. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game, mobile
Anti-aliasing is an important addition to any game, improving visual quality by smoothing out the jagged edges of a scene. MSAA (multisample anti-aliasing) is one of the oldest methods to achieve this and is still the preferred solution for mobile. However, it is only suitable for forward rendering and, with mobile performance improving year over year, deferred rendering is becoming more common, necessitating the use of post-process AA. This leaves slim pickings, as such algorithms tend to be too expensive for mobile GPUs, with FXAA (fast approximate anti-aliasing) being the only 'cheap' option among them. FXAA may be performant enough, but it only has simple colour discontinuity shape detection, leading to an often unwanted softening of the image. Its kernel is also limited in size, so it struggles to anti-alias longer edges effectively.

Space Module scene with CMAA applied.

Conservative morphological anti-aliasing

Conservative morphological anti-aliasing (CMAA) is a post-process AA solution originally developed by Intel for their low-power integrated GPUs [1]. Its design goals are to be a better alternative to FXAA by:

- Being minimally invasive, so it can be acceptable as a replacement in a wide range of applications, including worst-case scenarios such as text, repeating patterns, certain geometries (power lines, mesh fences, foliage), and moving images.
- Running efficiently on low-to-medium-range GPU hardware, such as integrated GPUs (or, in our case, mobile GPUs).

We have repurposed this desktop-developed algorithm and come up with a hybrid between the original 1.3 version and the updated 2.0 version [2] to make the best use of mobile hardware. A demo app was created using Khronos' Vulkan Samples as a framework (the same could be done with GLES) to implement this experiment. The sample has a drop-down menu for easy switching between the different AA solutions and presents a frametime and bandwidth overlay.

CMAA has four basic logical steps:

1. Image analysis for colour discontinuities (afterwards stored in a local compressed 'edge' buffer). The method used is not unique to CMAA.
2. Extracting locally dominant edges with a small kernel (a unique variation of existing algorithms).
3. Handling of simple shapes.
4. Handling of symmetrical long edge shapes (a unique take on the original MLAA shape handling algorithm).

Pass 1

Edge detection result captured in RenderDoc.

A full-screen edge detection pass is done in a fragment shader, and the resulting colour discontinuity values are written into a colour attachment. Our implementation uses the pixels' luminance values to find edge discontinuities, for speed and simplicity. An edge exists if the contrast between neighbouring pixels is above an empirically determined threshold.

Pass 2

Neighbouring edges considered for local contrast adaptation.

A local contrast adaptation is performed for each detected edge by comparing the value from the previous pass against the values of its closest neighbours, creating a threshold from the average and largest of these, as described by the formula below. Any edge that passes the threshold is written into an image as a confirmed edge.

    threshold = (avg + avgXY) * (1.0 - nonDominantEdgeRemovalAmount) + maxE * nonDominantEdgeRemovalAmount;

nonDominantEdgeRemovalAmount is another empirically determined variable.

Pass 3

This pass collects all the edges for each pixel from the previous pass and packs them into a new image for the final pass. This pass also does the first part of edge blending.
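Stepping back for a moment, the edge test described for passes 1 and 2 can be sketched in GLSL roughly as follows. This is a hedged illustration, not the shipping shader from the demo app; the threshold value is a stand-in for the empirically determined constant mentioned above.

    // Hedged GLSL sketch of the pass 1 luminance edge test; illustrative only.
    #version 450
    layout(binding = 0) uniform sampler2D sceneColour;
    layout(location = 0) out vec4 outEdges;

    float luma(vec3 c) { return dot(c, vec3(0.299, 0.587, 0.114)); }

    void main()
    {
        ivec2 p = ivec2(gl_FragCoord.xy);
        float centre = luma(texelFetch(sceneColour, p, 0).rgb);
        float right  = luma(texelFetch(sceneColour, p + ivec2(1, 0), 0).rgb);
        float below  = luma(texelFetch(sceneColour, p + ivec2(0, 1), 0).rgb);

        const float threshold = 0.045; // stand-in for the empirically determined constant

        // One edge flag per direction; pass 2 then applies local contrast adaptation
        // using the threshold formula given above.
        outEdges = vec4(abs(centre - right) > threshold ? 1.0 : 0.0,
                        abs(centre - below) > threshold ? 1.0 : 0.0,
                        0.0, 0.0);
    }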
Continuing with pass 3: the detected edges are used to look for two, three and four edges meeting in a pixel, and the colours from the adjacent pixels are then blended in. This helps avoid the unnecessary blending of straight edges.

Pass 4

The final pass does long edge blending by identifying Z-shapes in the detected edges. For each detected Z-shape, the length of the edge is traced in both directions until it reaches the end or runs into a perpendicular edge. Pixel blending is then performed along the traced edges, proportional to the distance from the centre.

Before and after of Z-shape detection.

Results

The image comparison shows a typical scenario for AA. CMAA manages high-quality anti-aliasing while retaining sharpness on straight edges. CMAA demonstrates itself to be a superior solution to aliasing than FXAA by avoiding the latter's limitations. It maintains a crisper look to the overall image and won't smudge thin lines, all while still providing effective anti-aliasing on curved edges. It also provides a significant performance advantage on Qualcomm devices and only a small penalty on Arm.

The image comparison shows a weakness of FXAA, where it smudges thin-lined geometry into the background. CMAA shows no such weakness and retains the colour of the railing while adding anti-aliasing effectively.

MSAA is still a clear winner and our recommended solution if your game allows it to be resolved within a single render pass. For any case where that is impractical, CMAA is overall a better alternative than FXAA and should be strongly considered.

The graph shows the increase in frametime for each AA method across a range of Samsung devices.

Follow up

This site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.

References

1. Filip Strugar and Leigh Davies: Conservative Morphological Anti-Aliasing (CMAA), March 2014.
2. Filip Strugar and Adam T. Lake: Conservative Morphological Anti-Aliasing 2.0, April 2018.
Samsung GameDev Team
success story game, mobile
Introduction

In recent years, the mobile game market has been growing fast. Along with hardware upgrades, the implementation of mobile games has become more complicated, and the loading process and the display of some scenes consume a lot of CPU and GPU resources. So Samsung worked with game vendors, including the Tencent Game Performance Amelioration (MTGPA) team, to improve the game user experience based on SceneSDK.

SceneSDK focuses on performance optimization, with the game vendor's cooperation, by combining the ability of device manufacturers to control hardware resources with the ability of games to sync up scenario information. It can maximize the game experience. Currently it includes many items, such as scene guarantee and frequency reduction notification. It can get game information, send a mobile device's status to a game, and supports more than 40 games, including many popular titles.

Optimization solutions

Scene protection

Scene protection divides a game into different game scenes according to certain rules (such as coarse-grained loading, lobby, single game round, ultimate kill, aiming and shooting, and much more), and then provides finer-grained performance guarantees for different game scenes. Considering that hardware resources are limited, if the protection of all scenes were the same, the actual protection effect would not be ideal, because the hardware would be fully loaded from the beginning: the system's high-temperature protection would be triggered quickly, the CPU and GPU would be forced to reduce frequency, and rendering performance would become even worse. Therefore, the game can send game events, like loading, starting, lobby or scene loading, to SceneSDK on the mobile device's side.

The game information is sent in the JSON format {sceneId: value}. It is flexible and can be extended with more scene IDs if needed. The main value information is shown in the table below. After getting the scene info, the SceneSDK service changes the CPU/GPU frequency to improve game performance based on the different game scenarios.

During gameplay, it is necessary to classify the scenes according to their importance level. The strategy of hierarchical protection differs slightly according to the underlying adjustment capabilities of each manufacturer, but the core is that the highest-level scene is fully protected. The effect of protecting scenes of different levels is shown below: for the highest-priority (critical) scene, the FPS (frames per second) is more stable, while for the lowest-level scene, FPS slowly declines without affecting the experience. At the same time, protection during scene switching also takes effect in real time. As shown in the following figure, when the scene level is switched from low (1) to critical (3), the CPU frequency starts to increase, and the FPS also gradually starts to increase. The comparison is as follows.

Frequency reduction notification

The frequency reduction notification informs the game of a system CPU frequency reduction. The extent of CPU frequency reduction varies from manufacturer to manufacturer. Such adjustments inevitably result in a device that only just meets, or already fails to meet, the performance needs of the game, so that it starts to freeze or becomes stuck. Therefore, if the game can instantly adjust the configuration items related to performance consumption, by temporarily reducing or disabling some functions in response to such notifications, a stuck game can be avoided to ensure the best player experience.
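As a concrete illustration of the payload format, here is a hedged Java sketch of building the {sceneId: value} string described above. The exact entry point for handing it to SceneSDK is vendor-specific and not documented in this article, so sendToSceneSdk is a hypothetical placeholder; only the payload notation comes from the text.

    // Hedged sketch: building the {sceneId: value} payload described above.
    // sendToSceneSdk() is a hypothetical placeholder for the vendor-provided entry point.
    import org.json.JSONException;
    import org.json.JSONObject;

    public final class SceneReporter {
        public static final int SCENE_LEVEL_CRITICAL = 3; // illustrative value only

        public static void reportScene(int sceneId, int value) {
            try {
                JSONObject payload = new JSONObject();
                payload.put(String.valueOf(sceneId), value); // follows the article's {sceneId: value} notation
                sendToSceneSdk(payload.toString());
            } catch (JSONException ignored) {
            }
        }

        private static void sendToSceneSdk(String json) {
            // Hand the string to SceneSDK through whatever interface the vendor provides.
        }
    }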
Variable refresh rate

A key part of the technology used in the latest Samsung flagship models is the ability not only to run at conventional mobile refresh rate limits, but also to modify the refresh rate dynamically at runtime based on game requirements. A common misconception is that if a device can run at 120 fps, it should always do so. However, most games support multiple options lower than 120 fps, or do not support 120 fps at all. If a game does not support 120 fps, a 120 Hz refresh rate is not required and may cost more power. The ideal approach is to use the maximum refresh rate only when it has the greatest benefit. So when the game sends the JSON string {"target fps": value} to the mobile device, the device changes the refresh rate to save power by not running at the highest refresh rate unnecessarily.

Summary

This table shows the benefit of using SceneSDK at launch time.
This table shows better performance during two rounds of testing with SceneSDK on and SceneSDK off.
This table shows lower power consumption when the dynamic refresh rate is applied.

Overall, SceneSDK could be a good option to improve your game's performance. Please feel free to contact us if you need more information.
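As a closing note: independently of SceneSDK, Android 11 (API 30) and later also lets a game hint its intended frame rate directly to the platform. The sketch below is an addition of mine and is not part of SceneSDK; it simply illustrates the same idea of telling the system your target fps.

    // Hedged sketch (not part of SceneSDK): hinting the intended frame rate to Android
    // via Surface.setFrameRate, available on API 30 and later.
    import android.os.Build;
    import android.view.Surface;

    public final class FrameRateHint {
        public static void apply(Surface surface, float targetFps) {
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.R && surface != null && surface.isValid()) {
                surface.setFrameRate(targetFps, Surface.FRAME_RATE_COMPATIBILITY_DEFAULT);
            }
        }
    }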
Xiangguo Qi
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

In previous blogs, we have already explored two key Vulkan extension game changers that will be enabled by Android R: descriptor indexing and buffer device address. In this blog, we explore the third and final game changer, 'timeline semaphores'.

The introduction of timeline semaphores is a large improvement to the synchronization model of Vulkan and is a required feature in Vulkan 1.2. It solves some fundamental grievances with the existing synchronization APIs in Vulkan.

The problems with VkFence and VkSemaphore

In Vulkan without this extension, there are two distinct synchronization objects for dealing with CPU <-> GPU synchronization and GPU queue <-> GPU queue synchronization.

The VkFence object only deals with GPU -> CPU synchronization. Due to the explicit nature of Vulkan, you must keep track of when the GPU completes the work you submit to it:

    vkQueueSubmit(queue, …, fence);

This is how we would use a fence, and later this fence can be waited on. When the fence signals, we know it is safe to free resources, read back data written by the GPU, and so on. Overall, the VkFence interface was never a real problem in practice, except that it feels strange to have two entirely different API objects which essentially do the same thing.

VkSemaphore, on the other hand, has some quirks which make it difficult to use properly in sophisticated applications. A VkSemaphore by default is a binary semaphore. The fundamental problem with binary semaphores is that we can only wait for a semaphore once; after we have waited for it, it automatically becomes unsignaled again. This binary nature is very annoying to deal with when we use multiple queues. For example, consider a scenario where we perform some work in the graphics queue and want to synchronize that work with two different compute queues. If we know this scenario is coming up, we have to allocate two VkSemaphore objects, signal both, and wait for each of them in the different compute queues. This works, but we might not know up front that this scenario will play out. Often, when dealing with multiple queues, we have to be somewhat conservative and signal semaphore objects we never end up waiting for. This leads to another problem.

A signaled semaphore which is never waited for is basically a dead and useless semaphore and should be destroyed. We cannot reset a VkSemaphore object on the CPU, so we cannot ever signal it again if we want to recycle VkSemaphore objects. A workaround would be to wait for the semaphore on the GPU in a random queue just to unsignal it, but this feels like a gross hack. It could also potentially cause performance issues, as waiting for a semaphore is a full GPU memory barrier.

Object bloat is another considerable pitfall of the existing APIs. For every synchronization point we need, we require a new object. All these objects must be managed and their lifetimes considered. This creates a lot of annoying "bloat" for engines.
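To make the fence half of this concrete, here is a minimal sketch of the per-frame VkFence round trip the article is describing. The structure is standard Vulkan; the variable names are illustrative.

    /* Hedged sketch: the classic per-frame VkFence round trip described above. */
    VkFenceCreateInfo fenceInfo = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
    VkFence frameFence;
    vkCreateFence(device, &fenceInfo, NULL, &frameFence);

    /* Submit work and associate the fence with the submission. */
    vkQueueSubmit(queue, 1, &submitInfo, frameFence);

    /* Later: block until the GPU has finished, then recycle per-frame resources. */
    vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frameFence);
    /* ... free or reuse command buffers, read back results, and so on ... */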
The timeline: fixing object bloat, fixing multiple waits

The first observation we can make about a Vulkan queue is that submissions should generally complete in order. To signal a synchronization object in vkQueueSubmit, the GPU waits for all previously submitted work to the queue, which includes the signaling operation of previous synchronization objects. Rather than assigning one object per submission, we can synchronize in terms of the number of submissions. A plain uint64_t counter can be used for each queue. When a submission completes, the number is monotonically increased, usually by one each time. This counter is contained inside a single timeline semaphore object.

Rather than waiting for a specific synchronization object which matches a particular submission, we can wait for a single object and specify "wait until graphics queue submission #157 completes." We can wait for any value multiple times as we wish, so there is no binary semaphore problem. Essentially, for each VkQueue we can create a single timeline semaphore on startup and leave it alone (a uint64_t will not overflow until the heat death of the sun, do not worry about it). This is extremely convenient and makes it much easier to implement complicated dependency management schemes.

Unifying VkFence and VkSemaphore

Timeline semaphores can be used very effectively on the CPU as well:

    VkSemaphoreWaitInfoKHR info = { VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO_KHR };
    info.semaphoreCount = 1;
    info.pSemaphores = &semaphore;
    info.pValues = &value;
    vkWaitSemaphoresKHR(device, &info, timeout);

This completely removes the need to use VkFence. Another advantage of this method is that multiple threads can wait for a timeline semaphore; with VkFence, only one thread could access a VkFence at any one time.

A timeline semaphore can even be signaled from the CPU as well, although this feature feels somewhat niche. It allows use cases where you submit work to the GPU early, but then 'kick' the submission later using vkSignalSemaphoreKHR.
The accompanying sample demonstrates a particular scenario where this function might be useful:

    VkSemaphoreSignalInfoKHR info = { VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO_KHR };
    info.semaphore = semaphore;
    info.value = value;
    vkSignalSemaphoreKHR(device, &info);

Creating a timeline semaphore

When creating a semaphore, you can specify the type of semaphore and give it an initial value:

    VkSemaphoreCreateInfo info = { VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
    VkSemaphoreTypeCreateInfoKHR type_info = { VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO_KHR };
    type_info.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE_KHR;
    type_info.initialValue = 0;
    info.pNext = &type_info;
    vkCreateSemaphore(device, &info, NULL, &semaphore);

Signaling and waiting on timeline semaphores

When submitting work with vkQueueSubmit, you can chain another struct which provides counter values when using timeline semaphores, for example:

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.waitSemaphoreCount = 1;
    submit.pWaitSemaphores = &compute_queue_semaphore;
    submit.pWaitDstStageMask = &wait_stage;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores = &graphics_queue_semaphore;

    VkTimelineSemaphoreSubmitInfoKHR timeline = { VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO_KHR };
    timeline.waitSemaphoreValueCount = 1;
    timeline.pWaitSemaphoreValues = &wait_value;
    timeline.signalSemaphoreValueCount = 1;
    timeline.pSignalSemaphoreValues = &signal_value;
    submit.pNext = &timeline;

    signal_value++; // Generally, you bump the timeline value once per submission.
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

Out of order signal and wait

A strong requirement of Vulkan binary semaphores is that signals must be submitted before a wait on a semaphore can be submitted. This makes it easy to guarantee that deadlocks do not occur on the GPU, but it is also somewhat inflexible. In an application with many Vulkan queues and a task-based architecture, it is reasonable to submit work that is somewhat out of order. However, this still uses synchronization objects to ensure the right ordering when executing on the GPU. With timeline semaphores, the application can agree on the timeline values to use ahead of time, then go ahead and build commands and submit out of order. The driver is responsible for figuring out the submission order required to make it work. However, the application gets more ways to shoot itself in the foot with this approach. This is because it is possible to create a deadlock with multiple queues where queue A waits for queue B, and queue B waits for queue A at the same time.

Ease of porting

It is no secret that timeline semaphores are inherited largely from D3D12's fence objects. From a portability angle, timeline semaphores make it much easier to have compatibility across the APIs.

Caveats

As the specification stands right now, you cannot use timeline semaphores with swap chains. This is generally not a big problem, as synchronization with the swap chain tends to be an explicit operation renderers need to take care of. Another potential caveat to consider is that the timeline semaphore might not have a direct kernel equivalent on current platforms, which means some extra emulation to handle it, especially the out-of-order submission feature. As the timeline synchronization model becomes the de facto standard, I expect platforms to get more native support for it.
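As a small usage note, here is a hedged sketch of a common pattern that timeline semaphores enable: keeping at most two frames in flight by waiting on a per-queue timeline value, using only the calls shown above. The frame counter and limit are illustrative.

    /* Hedged sketch: throttle the CPU so no more than two frames are in flight,
     * using one timeline semaphore per queue and one value per submission. */
    uint64_t frame_index = 0;            /* incremented once per vkQueueSubmit */
    const uint64_t max_frames_in_flight = 2;

    if (frame_index > max_frames_in_flight)
    {
        uint64_t wait_value = frame_index - max_frames_in_flight;

        VkSemaphoreWaitInfoKHR wait_info = { VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO_KHR };
        wait_info.semaphoreCount = 1;
        wait_info.pSemaphores = &graphics_queue_semaphore;
        wait_info.pValues = &wait_value;

        /* Blocks until the submission that signaled 'wait_value' has completed,
         * so its per-frame resources can be reused. */
        vkWaitSemaphoresKHR(device, &wait_info, UINT64_MAX);
    }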
Conclusion

All three key Vulkan extension game changers improve the overall development and gaming experience by improving graphics and enabling new gaming use cases. We hope that we have given you enough samples to get you started as you try out these new Vulkan extensions to help bring your games to life.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games.

The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter. Visit the marketing resources page for information on promoting and distributing your apps and games. Finally, our developer forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game, mobile
The Samsung Developers team works with many companies in the mobile and gaming ecosystems. We're excited to support our partner, Arm, as they bring timely and relevant content to developers looking to build games and high-performance experiences. This Vulkan Extensions series will help developers get the most out of the new and game-changing Vulkan extensions on Samsung mobile devices.

Android R is enabling a host of useful Vulkan extensions for mobile, with three being key 'game changers'. These are set to improve the state of graphics APIs for modern applications, enabling new use cases and changing how developers can design graphics renderers going forward. You can expect to see these features across a variety of Android smartphones, such as the new Samsung Galaxy S21, and existing Samsung Galaxy S models like the Samsung Galaxy S20. The first blog explored the first game changer extension for Vulkan, 'descriptor indexing'. This blog explores the second game changer, 'buffer device address'.

VK_KHR_buffer_device_address

VK_KHR_buffer_device_address is a monumental extension that adds a unique feature to Vulkan that none of the competing graphics APIs support. Pointer support is something that has always been limited in graphics APIs, for good reason: pointers complicate a lot of things, especially for shader compilers. It is also near impossible to deal with plain pointers in legacy graphics APIs, which rely on implicit synchronization.

There are two key aspects to buffer_device_address (BDA). First, it is possible to query a GPU virtual address from a VkBuffer. This is a plain uint64_t. This address can be written anywhere you like: in uniform buffers, push constants, or storage buffers, to name a few. The key aspect which makes this extension unique is that a SPIR-V shader can load an address from a buffer and treat it as a pointer to storage buffer memory immediately. Pointer casting, pointer arithmetic and all sorts of clever trickery can be done inside the shader. There are many use cases for this feature. Some are performance-related, and some are new use cases that have not been possible before.

Getting the GPU virtual address (VA)

There are some hoops to jump through here. First, when allocating VkDeviceMemory, we must flag that the memory supports BDA:

    VkMemoryAllocateInfo info = {…};
    VkMemoryAllocateFlagsInfo flags = {…};
    flags.flags = VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT_KHR;
    vkAllocateMemory(device, &info, NULL, &memory);

Similarly, when creating a VkBuffer, we add the VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT_KHR usage flag. Once we have created a buffer, we can query the VA:

    VkBufferDeviceAddressInfoKHR info = {…};
    info.buffer = buffer;
    VkDeviceSize va = vkGetBufferDeviceAddressKHR(device, &info);

From here, this 64-bit value can be placed in a buffer. You can of course offset this VA. Alignment is never an issue, as shaders specify explicit alignment later.

A note on debugging

When using BDA, there are some extra features that drivers must support. Since a pointer does not necessarily exist when replaying an application capture in a debug tool, the driver must be able to guarantee that virtual addresses returned by the driver remain stable across runs. To that end, debug tools supply the expected VA, and the driver allocates that VA range. Applications do not care that much about this, but it is important to note that even if you can use BDA, you might not be able to debug with it.
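One step the snippets above skip over is that the feature has to be requested at device creation. Here is a hedged sketch using the Vulkan 1.2 core struct (the KHR-suffixed alias behaves the same); the full struct definition is shown right after this sketch.

    /* Hedged sketch: request bufferDeviceAddress when creating the device. The same
     * struct can be passed to vkGetPhysicalDeviceFeatures2 first to confirm support. */
    VkPhysicalDeviceBufferDeviceAddressFeatures bdaFeatures =
        { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_BUFFER_DEVICE_ADDRESS_FEATURES };
    bdaFeatures.bufferDeviceAddress = VK_TRUE;

    VkDeviceCreateInfo deviceInfo = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
    deviceInfo.pNext = &bdaFeatures;  /* chain the feature request */
    /* ... queues, and the VK_KHR_buffer_device_address extension if not using core 1.2 ... */
    vkCreateDevice(gpu, &deviceInfo, NULL, &device);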
typedef struct VkPhysicalDeviceBufferDeviceAddressFeatures {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           bufferDeviceAddress;
    VkBool32           bufferDeviceAddressCaptureReplay;
    VkBool32           bufferDeviceAddressMultiDevice;
} VkPhysicalDeviceBufferDeviceAddressFeatures;

If bufferDeviceAddressCaptureReplay is supported, tools like RenderDoc can support BDA.

Using a pointer in a shader

In Vulkan GLSL, there is the GL_EXT_buffer_reference extension which allows us to declare a pointer type. A pointer like this can be placed in a buffer, or we can convert to and from integers:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require

layout(local_size_x = 64) in;

// These define pointer types.
layout(buffer_reference, std430, buffer_reference_align = 16) readonly buffer ReadVec4
{
    vec4 values[];
};

layout(buffer_reference, std430, buffer_reference_align = 16) writeonly buffer WriteVec4
{
    vec4 values[];
};

layout(buffer_reference, std430, buffer_reference_align = 4) readonly buffer UnalignedVec4
{
    vec4 value;
};

layout(push_constant, std430) uniform Registers
{
    ReadVec4 src;
    WriteVec4 dst;
} registers;

Placing raw pointers in push constants avoids all indirection for getting to a buffer. If the driver allows it, the pointers can be placed directly in GPU registers before the shader begins executing.

Not all devices support 64-bit integers, but it is possible to cast uvec2 <-> pointer, so doing address computation like this is fine:

uvec2 uadd_64_32(uvec2 addr, uint offset)
{
    uint carry;
    addr.x = uaddCarry(addr.x, offset, carry);
    addr.y += carry;
    return addr;
}

void main()
{
    uint index = gl_GlobalInvocationID.x;
    registers.dst.values[index] = registers.src.values[index];

    uvec2 addr = uvec2(registers.src);
    addr = uadd_64_32(addr, 20 * index);

    // Cast a uvec2 to an address and load a vec4 from it.
    // This address is aligned to 4 bytes.
    registers.dst.values[index + 1024] = UnalignedVec4(addr).value;
}

Pointer or offsets?

Using raw pointers is not always the best idea. A natural use case to consider for pointers is when you have tree or list structures in GPU memory. With pointers, you can jump around as much as you want, and even write new pointers to buffers. However, a pointer is 64-bit, and a typical performance consideration is to use 32-bit offsets (or even 16-bit offsets) if possible. Using offsets is the way to go if you can guarantee that all buffers live inside a single VkBuffer. On the other hand, the pointer approach can access any VkBuffer at any time without having to go through descriptors. Therein lies the key strength of BDA.

Extreme hackery: physical pointer as specialization constants

This is a life saver in certain situations where you are desperate to debug something without any available descriptor set. A black magic hack is to place a BDA inside a specialization constant. This allows for accessing a pointer without using any descriptors. Do note that this breaks all forms of pipeline caching and is only suitable for debug code. Do not ship this kind of code.
Perform this dark sorcery at your own risk:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require

layout(local_size_x = 64) in;

layout(constant_id = 0) const uint DEBUG_ADDR_LO = 0;
layout(constant_id = 1) const uint DEBUG_ADDR_HI = 0;

layout(buffer_reference, std430, buffer_reference_align = 4) buffer DebugCounter
{
    uint value;
};

void main()
{
    DebugCounter counter = DebugCounter(uvec2(DEBUG_ADDR_LO, DEBUG_ADDR_HI));
    atomicAdd(counter.value, 1u);
}

Emitting SPIR-V with buffer_device_address

In SPIR-V, there are some things to note. BDA is an especially useful feature for layering other APIs due to its extreme flexibility in how we access memory. Therefore, generating BDA code yourself is a reasonable use case to assume as well.

Enable BDA in shaders:

OpCapability PhysicalStorageBufferAddresses
OpExtension "SPV_KHR_physical_storage_buffer"

The memory model is PhysicalStorageBuffer64 and not Logical anymore:

OpMemoryModel PhysicalStorageBuffer64 GLSL450

The buffer reference types are declared basically just like SSBOs:

OpDecorate %_runtimearr_v4float ArrayStride 16
OpMemberDecorate %ReadVec4 0 NonWritable
OpMemberDecorate %ReadVec4 0 Offset 0
OpDecorate %ReadVec4 Block
OpDecorate %_runtimearr_v4float_0 ArrayStride 16
OpMemberDecorate %WriteVec4 0 NonReadable
OpMemberDecorate %WriteVec4 0 Offset 0
OpDecorate %WriteVec4 Block
OpMemberDecorate %UnalignedVec4 0 NonWritable
OpMemberDecorate %UnalignedVec4 0 Offset 0
OpDecorate %UnalignedVec4 Block

Declare a pointer to the blocks. PhysicalStorageBuffer is the storage class to use:

OpTypeForwardPointer %_ptr_PhysicalStorageBuffer_WriteVec4 PhysicalStorageBuffer
%_ptr_PhysicalStorageBuffer_ReadVec4 = OpTypePointer PhysicalStorageBuffer %ReadVec4
%_ptr_PhysicalStorageBuffer_WriteVec4 = OpTypePointer PhysicalStorageBuffer %WriteVec4
%_ptr_PhysicalStorageBuffer_UnalignedVec4 = OpTypePointer PhysicalStorageBuffer %UnalignedVec4

Load a physical pointer from PushConstant:

%55 = OpAccessChain %_ptr_PushConstant__ptr_PhysicalStorageBuffer_WriteVec4 %registers %int_1
%56 = OpLoad %_ptr_PhysicalStorageBuffer_WriteVec4 %55

Access chain into it:

%66 = OpAccessChain %_ptr_PhysicalStorageBuffer_v4float %56 %int_0 %40

Aligned must be specified when dereferencing physical pointers. Pointers can have any arbitrary address and must be explicitly aligned, so the compiler knows what to do:

OpStore %66 %65 Aligned 16

For pointers, SPIR-V can bitcast between integers and pointers seamlessly, for example:

%61 = OpLoad %_ptr_PhysicalStorageBuffer_ReadVec4 %60
%70 = OpBitcast %v2uint %61
// do math on %70
%86 = OpBitcast %_ptr_PhysicalStorageBuffer_UnalignedVec4 %some_address

Conclusion

We have already explored two key Vulkan extension game changers through this blog and the previous one. The third and final part of this game changer blog series will explore 'timeline semaphores' and how developers can use this new extension to improve the development experience and enhance their games.

Follow up

Thanks to Hans-Kristian Arntzen and the team at Arm for bringing this great content to the Samsung Developers community. We hope you find this information about Vulkan extensions useful for developing your upcoming mobile games. The Samsung Developers site has many resources for developers looking to build for and integrate with Samsung devices and services. Stay in touch with the latest news by creating a free account or by subscribing to our monthly newsletter.
Visit the Marketing Resources page for information on promoting and distributing your apps and games. Finally, our Developer Forum is an excellent way to stay up-to-date on all things related to the Galaxy ecosystem.
Arm Developers
tutorials game
blog

Adaptive Scalable Texture Compression (ASTC) is an advanced lossy texture compression format, developed by Arm and AMD and released as a royalty-free open standard by the Khronos Group. It supports a wide range of 2D and 3D color formats with a flexible choice of bitrates, enabling content creators to compress almost any texture asset, using a level of compression appropriate to their quality and performance requirements. ASTC is increasingly becoming the texture compression format of choice for mobile 3D applications using the OpenGL ES and Vulkan APIs. ASTC's high compression ratios are a perfect match for the mobile market, which values smaller download sizes and optimized memory usage to improve energy efficiency and battery life.

ASTC 2D color formats and bitrates

astcenc 2.0

The 'astcenc' ASTC compression tool was first developed by Arm while ASTC was progressing through the Khronos standardization process seven years ago. astcenc has become widely used as the de facto reference encoder for ASTC, as it leverages all format features, including the full set of available block sizes and color profiles, to deliver the high-quality encoded textures that are possible when effectively using ASTC's flexible capabilities. Today, Arm is delighted to announce astcenc 2.0! This is a major update which provides multiple significant improvements for middleware and content creators.

Apache 2.0 open source license

The original astcenc software was released under an Arm end user license agreement. To make it easier for developers to use, adapt, and contribute to astcenc development, including integration of the compressor into application runtimes, Arm relicensed the astcenc 1.x source code on GitHub in January 2020 under the standard Apache 2.0 open source license. The new astcenc 2.0 source code is now also available on GitHub under Apache 2.0.

Compression performance

astcenc 1.x emphasized high image quality over fast compression speed. Some developers have told Arm they would love to use astcenc for its superior image quality, but compression was too slow to use in their tooling pipelines. The importance of this was reflected in the recent ASTC developer survey organized by Khronos, where developer responses rated compression speed above image quality in the list of factors that determine texture format choices. For version 2.0, Arm reviewed the heuristics and quality refinement passes used by the astcenc compressor, optimizing those that were adding value and removing those that simply did not justify their added runtime cost. In addition, hand-coded vectorized code was added to the most compute-intensive sections of the codec, supporting the SSE4.2 and AVX2 SIMD instruction sets. Overall, these optimizations have resulted in up to 3x faster compression times when using AVX2, while typically losing less than 0.1 dB PSNR in image quality. A very worthwhile tradeoff for most developers.

astcenc 2.0 - significantly faster ASTC encoding

Command line improvements

The tool now supports a clearer set of compression modes that directly map to the ASTC format profiles exposed by Khronos API support and API extensions. Textures compressed using the LDR compression modes (linear or sRGB) will be compatible with all hardware implementing OpenGL ES 3.2, the OpenGL ES KHR_texture_compression_astc_ldr extension, or the Vulkan ASTC optional feature. Textures compressed using the HDR compression mode will require hardware implementing an appropriate API extension, such as KHR_texture_compression_astc_hdr.
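As a rough sketch of what this looks like in practice (the file names below are placeholders, and the exact option set for your build is listed by the tool's built-in help), a linear LDR texture can be compressed with the -cl mode, with -cs and -ch covering sRGB LDR and HDR content, a block size such as 6x6, and one of the quality presets:

astcenc -cl kart_diffuse.png kart_diffuse.astc 6x6 -medium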
In addition, astcenc 2.0 now supports commonly requested input and output file formats:

- Loading LDR images in BMP, JPEG, PNG, and TGA formats
- Loading HDR images in OpenEXR and Radiance HDR formats
- Loading compressed textures in the ".astc" file format provided by astcenc, and the DDS and KTX container formats
- Storing LDR images into BMP, PNG, and TGA formats
- Storing HDR images into OpenEXR and Radiance HDR formats
- Storing compressed textures into the ".astc" file format provided by astcenc, and the DDS or KTX container formats

Core codec library

Finally, the core codec is now separable from the command line front-end logic, enabling the astcenc compressor to be integrated directly into applications as a library. The core codec library interface API provides a programmatic mechanism to manage codec configuration, texture compression, and texture decompression. This API enables use of the core codec library to process data stored in memory buffers, leaving file management to the application. It supports parallel processing, either compressing a single image with multiple threads or compressing multiple images in parallel.

Using astcenc 2.0

You can download astcenc 2.0 on GitHub today, with full source code and pre-built binaries available for Windows, macOS, and Linux hosts. For more information about using the tool, please refer to the project documentation:

- Getting started: learn about the high-level operation of the compressor.
- Format overview: learn about the ASTC data format and how the underlying encoding works.
- Efficient encoding: learn about using the command line to effectively compress textures, and the encoding and sampling needed to get functional equivalents to other texture formats that exist on the market today.

Arm has also published an ASTC guide, which gives an overview of the format and some of the available tools, including astcenc.

- Arm ASTC guide: an overview of ASTC and available ASTC tools.

If you have any questions, feedback, or pull requests, please get in touch via the GitHub issue tracker or the Arm Mali developer community forums:

https://github.com/arm-software/astc-encoder
https://community.arm.com/graphics/

Khronos® and Vulkan® are registered trademarks, and ANARI™, WebGL™, glTF™, NNEF™, OpenVX™, SPIR™, SPIR-V™, SYCL™, OpenVG™ and 3D Commerce™ are trademarks of The Khronos Group Inc. OpenXR™ is a trademark owned by The Khronos Group Inc. and is registered as a trademark in China, the European Union, Japan and the United Kingdom. OpenCL™ is a trademark of Apple Inc. and OpenGL® is a registered trademark and the OpenGL ES™ and OpenGL SC™ logos are trademarks of Hewlett Packard Enterprise used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.
Peter Harris