OpenGL ES Usage Recommendations
This OpenGL ES usage guide assumes the reader is already familiar with the API, but wants to know how to use it effectively across a wide range of Galaxy devices. It also assumes the reader is familiar with the fundamentals of Tile Based Rendering (TBR) GPU architectures commonly found in mobile devices. If you are new to OpenGL ES, you can get started with our recommended SDKs that introduce the API concepts through code examples. You can also learn more about GPU architectures here and here.
Before reading this document, we recommend familiarizing yourself with our Game Asset Optimization Recommendations.
Understand your target
When developing high-performance applications, it's essential to understand the capabilities and performance characteristics of the APIs and hardware you are targeting.
For OpenGL ES, this includes:
Maximum graphics API version
Capabilities, e.g. maximum texture resolution
Like any performance-focused API, there are situations in OpenGL ES where undefined behavior can occur if the API user doesn't adhere to the specification. KHR_debug/Debug Output (OpenGL ES 3.2), glGetError() and graphics API debugging tools are useful for identifying API misuse. Debug checking carries a cost in an OpenGL ES implementation, so consider GL_KHR_no_error in production code. It is also important to test your application across a wide range of devices, chipsets and GPU architectures to identify bugs as early as possible in your development cycle.
Consider GPU architecture differences.
Check capabilities and extensions at run-time. Implement fall-backs.
Test your application on lots of devices.
Please see our Game Asset Optimization Recommendations.
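As a minimal sketch of the run-time extension check recommended above, the helper below looks for an extension name in a space-separated extension list, such as the string returned by glGetString(GL_EXTENSIONS). The function name hasExtension is hypothetical, and the sketch only covers the string matching; obtaining the list from a live context is left to the application.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `name` appears as a whole token in a
// space-separated extension list. Whole-token matching avoids the false
// positives of a substring search (one extension name can be a prefix of
// another).
bool hasExtension(const std::string& extensionList, const std::string& name) {
    std::istringstream tokens(extensionList);
    std::string token;
    while (tokens >> token) {
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

A fall-back path can then be selected once at start-up, for example by testing for GL_OES_get_program_binary before enabling a program binary cache on an OpenGL ES 2.0 device.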
Efficient buffer updates
OpenGL ES API call submission is asynchronous from GPU execution. This can lead to scenarios where an application needs to modify a buffer for frame N whilst an in-flight GPU task for frame N-1 still needs to read the previous buffer data. A simple solution to this problem is for the driver to synchronize CPU and GPU access by stalling the pipeline until the in-flight GPU tasks complete.
Alternatively, an OpenGL ES driver may orphan buffers to avoid stalling. Orphaning duplicates buffers to ensure in-flight frames can access the original data from the orphaned buffer. A newly-allocated buffer tied to the GL handle can then be modified by the application developer independently of the in-flight GPU render.
- Buffer orphaning is silent, but can cause out-of-memory errors!
Batch buffer updates to reduce the likelihood of orphaning.
If buffers are updated at different frequencies, e.g. cloth simulation vertex positions and texture coordinates, consider splitting into separate buffers.
When loading data from files, prefer mapped buffer updates (minimal copy overhead).
Consider multi-buffering to explicitly manage buffer access.
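The multi-buffering recommendation can be sketched as a small ring of buffer slots. This is an illustrative CPU-side model only: BufferRing and its methods are hypothetical names, and the count of completed GPU frames is assumed to come from something like fence sync objects polled by the application.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: rotates through N buffer slots so the CPU does not
// overwrite a slot that an in-flight frame may still be reading.
template <std::size_t N>
class BufferRing {
public:
    // Slot the CPU should write for the given frame number.
    std::size_t slotForFrame(std::uint64_t frame) const { return frame % N; }

    // Record that `frame` submitted GPU work that reads from its slot.
    void markSubmitted(std::uint64_t frame) {
        lastUse_[slotForFrame(frame)] = frame + 1;
    }

    // `completedFrames` is the number of frames the GPU has fully finished
    // (assumed to be tracked by the application, e.g. with fences).
    // The slot is safe once the last frame that read it has completed.
    bool isSafeToWrite(std::uint64_t frame, std::uint64_t completedFrames) const {
        return lastUse_[slotForFrame(frame)] <= completedFrames;
    }

private:
    // 0 means "never used"; otherwise (index of last frame using the slot) + 1.
    std::array<std::uint64_t, N> lastUse_{};
};
```

With three slots, frame 3 reuses the slot written for frame 0, so it only has to wait if the GPU is more than two frames behind the CPU.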
Efficient texture updates
Texture uploading refers to a host (CPU) to device (GPU) transfer. In OpenGL ES, this operation is performed by glTexImage* calls, or by glTexSubImage* calls on storage allocated with glTexStorage*. These upload calls block while host memory is copied to driver-owned memory. Once the copy has completed, the driver will asynchronously copy this data to GPU-owned memory that can be sampled. In most OpenGL ES implementations, the layout of the GPU accessible texture is optimized during the copy process. Driver optimizations, such as tiling and twiddling, make the GPU accessible texture more cache efficient to sample.
glTexImage allocates driver-owned storage for a specific mipmap level with a specific dimension and format. It sets the internal format for that specific mipmap level, and then optionally uploads pixel data to the allocated memory block. One of the issues with glTexImage is that it is possible to change the internal format or size on-the-fly simply by calling glTexImage again. As the format can change, it would be wasteful for a driver to perform a copy between driver and GPU memory if the format might change later. Because of this, most drivers defer the driver memory to device memory copy until the first draw that samples the texture. Deferring the transfer to draw time can introduce framerate stutters.
glTexStorage also allocates driver-owned memory but creates all the mipmaps up-front given the base level's size. Unlike glTexImage, it does not initialize the memory. Memory is initialized with a call to glTexSubImage. As glTexStorage textures have an immutable format, the driver can kick asynchronous driver memory to device memory copies immediately after a glTexSubImage call completes. Kicking the copies up front avoids draw time stutters.
Texture warm-up is a technique that avoids draw-time frame rate stutters when glTexStorage and immutable format textures aren't available - for example, when targeting a device that only supports OpenGL ES 2.0. Because it is possible to modify the texture's properties for all mipmap levels, the driver has to defer the copy to GPU-owned memory until it can guarantee the format and size that will be used for rendering. This guarantee can only occur at draw-call time, which means that even if initializing textures with glTexImage calls at load-time, there may be some frame-rate stuttering occurring for every draw call that references a new texture object. This stuttering occurs because the GPU has to wait until the texture has been fully uploaded before being able to sample from it. To overcome this limitation and reduce the amount of frame stuttering, we can issue off-screen dummy draw calls that sample the texture when the level is loaded. If this technique is applied for all the textures in the scene, this cause of frame-rate stuttering will be avoided.
Texture storage allocated with glTexStorage and initialized with glTexSubImage has an immutable format. Since parameters such as width, height and internal format are constant for all mipmap levels, the driver is able to copy the memory to GPU-owned memory immediately, instead of waiting to copy at draw-call time. This avoids having to issue dummy draw calls to force the host-to-device memory transfer. However, it is still important to ensure that textures are created at the loading/initialization stage for maximum efficiency, regardless of whether immutable or mutable texture objects are in use. If immutable texture objects are created in the main rendering loop, there may still be some frame-rate stuttering because the host-to-device memory transfer has to complete before the texture can be sampled.
Allocating texture objects in the render loop can cause frame-rate stuttering
Using mutable texture objects can cause frame-rate stuttering due to the host to device memory transfer being deferred to draw call time.
Always upload textures at load time. Avoid uploading them in the render loop.
When available, use immutable format textures.
When immutable format textures aren't supported (OpenGL ES 2.0), use a texture warm up to avoid framerate stutters.
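When allocating immutable storage, glTexStorage2D takes the number of mipmap levels as an explicit parameter. A minimal sketch for computing the level count of a full mipmap chain from the base level's size follows; the function name fullMipLevelCount is hypothetical, and the formula used is floor(log2(max(width, height))) + 1.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of levels in a full mipmap chain, i.e.
// floor(log2(max(width, height))) + 1. Returns 0 for an empty texture.
std::uint32_t fullMipLevelCount(std::uint32_t width, std::uint32_t height) {
    std::uint32_t largest = std::max(width, height);
    std::uint32_t levels = 0;
    while (largest > 0) {
        ++levels;        // count the current level
        largest >>= 1;   // each level halves the largest dimension
    }
    return levels;
}
```

The result can be passed as the levels argument at load time, after which each level is filled with glTexSubImage2D.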
Similarly to buffers, texture orphaning may be performed by OpenGL ES drivers to avoid pipeline stalls. As textures tend to be accessed later in the graphics pipeline than buffers, texture orphaning is more common in OpenGL ES drivers than buffer orphaning.
A common cause of texture orphaning is poor texture atlas management. High-frequency atlas updates may cause a driver to orphan a texture more than once, which can cause unexpected out-of-memory errors. If regions of an atlas are updated at different frequencies, consider splitting them into separate textures:
Atlas A: Static data
Atlas B: Update on level load
Atlas C: Update every frame
Texture orphaning is silent, but can cause out-of-memory errors!
Texture arrays may be implemented as a single, contiguous block of memory in OpenGL ES drivers. This means that modifying any texel in a texture array may cause the entire array to be orphaned! This cost can be avoided by splitting a texture array into multiple texture arrays that match their update frequency.
Batch texture updates to avoid repeated orphaning.
If texture contents are updated at different frequencies, consider splitting into separate textures.
Consider using multi-buffering to explicitly manage texture access.
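One possible way to batch texture updates is to accumulate the regions modified during a frame and upload their bounding rectangle with a single glTexSubImage2D call. The sketch below only shows the rectangle merging; Rect and unionOfDirtyRects are hypothetical names, and it assumes the application keeps a CPU-side copy of the texture so the texels between dirty regions can be re-sent.

```cpp
#include <algorithm>
#include <vector>

struct Rect {
    int x, y, width, height;
};

// Hypothetical helper: smallest rectangle enclosing every dirty region.
// Returns a zero-area rectangle when nothing is dirty.
Rect unionOfDirtyRects(const std::vector<Rect>& dirty) {
    if (dirty.empty()) {
        return Rect{0, 0, 0, 0};
    }
    int x0 = dirty[0].x;
    int y0 = dirty[0].y;
    int x1 = dirty[0].x + dirty[0].width;
    int y1 = dirty[0].y + dirty[0].height;
    for (const Rect& r : dirty) {
        x0 = std::min(x0, r.x);
        y0 = std::min(y0, r.y);
        x1 = std::max(x1, r.x + r.width);
        y1 = std::max(y1, r.y + r.height);
    }
    return Rect{x0, y0, x1 - x0, y1 - y0};
}
```

Whether one larger upload beats several small ones depends on how sparse the dirty regions are, so this trade-off is worth measuring on target devices.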
GLSL ES supports precision qualifiers (lowp, mediump and highp). Precision qualifier hints enable developers to tell compilers where reduced precision can be used to improve the performance of ALU operations and, in turn, reduce the power consumption of the GPU.
It's valid for compilers to promote the requested precision of a variable, for example to use 32-bit floating point precision when lowp is specified by the developer. Compilers tend to do this when the instructions introduced for precision conversion cause more overhead than running the calculations at full precision.
- Beware of compilers promoting precision. A shader that runs perfectly on device A (promoted precision) may have artefacts on device B (honours precision qualifier)
Use reduced precision to improve performance and reduce power consumption.
Beware of compilers promoting precision. Test on lots of devices to catch shader precision artefacts early.
Beware of rendering errors that are obscured by low-precision operations on some devices.
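To get a feel for the artefacts that appear when a device honours a reduced precision qualifier, values can be quantized on the CPU. The sketch below assumes, purely for illustration, that mediump is implemented as a 16-bit half float with 10 stored mantissa bits; actual implementations vary, and range limits and denormals are ignored. The function name quantizeToHalfMantissa is hypothetical.

```cpp
#include <cmath>

// Hypothetical helper: rounds a value to 11 significant bits (10 stored
// mantissa bits plus the implicit leading bit), approximating the mantissa
// of a half-precision float. Exponent range and denormals are not modelled.
float quantizeToHalfMantissa(float value) {
    if (value == 0.0f || !std::isfinite(value)) {
        return value;
    }
    int exponent = 0;
    float mantissa = std::frexp(value, &exponent);  // |mantissa| in [0.5, 1)
    float scaled = std::nearbyint(std::ldexp(mantissa, 11));
    return std::ldexp(scaled, exponent - 11);
}
```

Under this assumption, integers above 2048 cannot all be represented, so a value such as a texel coordinate of 2049 shifts by one. This is the kind of error that stays hidden on a device that silently promotes to 32-bit precision.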
In OpenGL ES, the hardware instruction sequences required to execute shaders on a given device may be specific both to the GPU and to the driver version. To avoid the complexity of developers generating binaries for every possible GPU and driver combination at development time, OpenGL ES instead requires shaders to be compiled and linked at run-time from GLSL source.
Compiling and linking shaders is a very expensive operation, especially in complex games that make use of a lot of shaders. This type of operation requires a lot of processing and memory. If not done at the right time, it can lead to frame stuttering. To avoid stuttering, it is recommended to compile and link shaders at load time.
To improve performance, Android devices support the EGL_ANDROID_blob_cache extension. This extension caches shaders when they are first compiled. A hash of the shader source is used to uniquely identify a shader when a driver is called to compile it again. If a matching binary is available in the cache, the driver loads it instead of redundantly re-compiling the source. This mechanism isn't exposed to applications but is used automatically by the driver. The problem with this extension is that the shader cache is very small and its size cannot be queried. Because of this, complex games cannot rely on this mechanism to cache all shader binaries between application runs.
To decrease frame-rate stuttering and improve performance, it is preferable to cache shaders manually. OpenGL ES 3.0 makes it possible to generate and use program binaries. In OpenGL ES 2.0, this is exposed via the OES_get_program_binary extension. This functionality enables us to compile the shaders on the device and then retrieve the program binary from the driver, which can be stored for later reuse. Because the binary is generated on the device that will consume it, this avoids binary compatibility issues across GPU vendors.
Compile all your shaders during the loading stage
Capture the program binary by calling glGetProgramBinary and save it to the file system for later reuse
During later runs use glProgramBinary to load the binary version of the shader instead of re-compiling the source
There are some cases where an OTA update can change the driver and cause the new driver's binary format to be incompatible with cached program binaries. A program binary incompatibility can be detected by querying the program's link status (GL_LINK_STATUS) after calling glProgramBinary. If the link status reports failure, the binary should be considered invalid and a new binary should be compiled from shader source.
Compile all your shaders at load time.
Cache binaries after compilation to reduce load times of subsequent runs.
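The load, validate and fall-back flow described above can be sketched as follows. To keep the example runnable without a GL context, the GL and file-system calls are represented by hypothetical hooks; ProgramLoader, LoadResult and loadProgram are illustrative names, not part of any API.

```cpp
#include <functional>
#include <vector>

using Binary = std::vector<unsigned char>;

// Hypothetical hooks standing in for the real GL and file-system calls.
struct ProgramLoader {
    // Reads a cached binary from storage; returns false if none exists.
    std::function<bool(Binary&)> readCachedBinary;
    // Stands in for glProgramBinary followed by a GL_LINK_STATUS query.
    std::function<bool(const Binary&)> loadBinary;
    // Stands in for compile + link + glGetProgramBinary.
    std::function<Binary()> compileFromSource;
    // Writes the fresh binary to storage for subsequent runs.
    std::function<void(const Binary&)> writeCachedBinary;
};

enum class LoadResult { FromCache, Recompiled };

LoadResult loadProgram(const ProgramLoader& loader) {
    Binary cached;
    if (loader.readCachedBinary(cached) && loader.loadBinary(cached)) {
        return LoadResult::FromCache;
    }
    // No cache entry, or the driver rejected it (e.g. after an OTA update):
    // rebuild from source and refresh the cache.
    Binary fresh = loader.compileFromSource();
    loader.writeCachedBinary(fresh);
    return LoadResult::Recompiled;
}
```

Running this flow for every program during the loading stage keeps both compilation and cache validation out of the render loop.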
View frustum culling
The cheapest draw the driver and GPU will ever process is the draw that is never submitted. To avoid redundant driver and GPU processing, a common rendering engine optimization is to submit a draw to the graphics API only if it falls within, or intersects, the bounds of the view frustum. View frustum culling is usually cheap to execute on the CPU and should always be considered when rendering complex 3D scenes.
- Always try to use view frustum culling to avoid redundant driver and GPU processing.
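A minimal sketch of a CPU-side view frustum test using bounding spheres is shown below. It assumes the six frustum planes have already been extracted, have unit-length normals and point into the frustum; Plane, Sphere and isSphereVisible are hypothetical names.

```cpp
#include <array>

// Plane in the form n.p + d = 0; points inside the frustum satisfy
// n.p + d >= 0. The normal is assumed to be unit length.
struct Plane {
    float nx, ny, nz, d;
};

struct Sphere {
    float x, y, z, radius;
};

// Returns false only when the sphere lies entirely behind one of the planes.
// Spheres inside or intersecting the frustum are reported as visible, so the
// test is conservative (it may keep a few draws that are just outside).
bool isSphereVisible(const std::array<Plane, 6>& frustum, const Sphere& s) {
    for (const Plane& p : frustum) {
        float distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (distance < -s.radius) {
            return false;
        }
    }
    return true;
}
```

A draw is submitted only when its bounding sphere passes this test, which costs at most six dot products per object.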
Minimizing overdraw
Many GPU architectures rasterize and perform fragment shading operations for a pixel in the order that primitives are submitted. When more than one overlapping draw is opaque, this can lead to overdraw, where shaded framebuffer values are wastefully calculated for fragments which are later overwritten.
Game developers can combat overdraw by enabling depth testing and submitting opaque draws from front to back. This enables GPUs to perform depth tests early in the GPU pipeline then reject fragments from the additional pipeline stages if they are obscured by a primitive previously coloured at that fragment location.
- Sort opaque draws from front to back. Ordering doesn’t have to be perfect to give a good overdraw reduction. Find an algorithm that gives a good balance between CPU sorting overhead and overdraw reduction.
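Front-to-back ordering can be sketched as a sort on each draw's distance from the camera. OpaqueDraw and sortFrontToBack are hypothetical names, and the depth value is assumed to be computed by the application, for example from the centre of the object's bounding volume.

```cpp
#include <algorithm>
#include <vector>

struct OpaqueDraw {
    int id;           // application-defined draw identifier
    float viewDepth;  // distance from the camera (assumed precomputed)
};

// Orders opaque draws nearest-first so that early depth testing can reject
// fragments of the draws submitted after them.
void sortFrontToBack(std::vector<OpaqueDraw>& draws) {
    std::sort(draws.begin(), draws.end(),
              [](const OpaqueDraw& a, const OpaqueDraw& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```

If a full sort is too expensive, a cheaper approximation such as grouping draws into a few coarse depth buckets may still remove most of the overdraw.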
Minimize state changes
The cost of changing state depends on the state being modified, driver behavior and hardware behavior. This can range from a negligible impact to a significant performance penalty, such as stalling the pipeline until all in-flight operations complete. Minimizing state changes reduces the chance of performance penalties being incurred.
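One common way to reduce state changes is to filter out redundant ones before they reach the driver. The sketch below tracks the last program requested and reports whether a real bind (for example glUseProgram) is needed; ProgramBindingCache is a hypothetical name, and the approach assumes all program binds go through the cache so it never becomes stale.

```cpp
#include <cstdint>

// Hypothetical wrapper that remembers the last program bound through it.
class ProgramBindingCache {
public:
    // Returns true when the caller needs to issue the real bind call,
    // and false when the requested program is already current.
    bool needsBind(std::uint32_t program) {
        if (hasBound_ && program == current_) {
            return false;
        }
        current_ = program;
        hasBound_ = true;
        return true;
    }

private:
    std::uint32_t current_ = 0;
    bool hasBound_ = false;
};
```

Sorting draws so that those sharing a program are adjacent makes such a filter more effective, at the cost of competing with depth-based ordering.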
Tile-based GPU architectures are optimized for transforming and binning draws in a single render pass. Switching framebuffer bindings before all draws have been issued may cause partial renders to be kicked, which can introduce overdraw and costly system memory store/load operations that waste memory bandwidth.
- Bind a framebuffer once for a render pass, submit all draws to it.
Opaque, transparent and alpha test/discard draws
Blending and alpha test/discard operations enable transparent objects to be easily rendered by games. Commonly, draws in a game have the following properties:
|Draw type|Depth write enabled|Depth test enabled|
In addition to the general case described in Minimizing overdraw, overdraw can also be introduced by alpha test/discard and transparent draws being overwritten by opaque draws that are closer to the camera. To ensure early depth test rejection is used effectively, it is important for games to submit all draws that write to the depth buffer first. Doing so builds up the contents of the depth buffer so that early depth testing can reject subsequent obscured fragments.
Alpha test/discard draws may stall the pipeline when depth or stencil testing is enabled. This is because subsequent draws require the depth and stencil buffers to be up to date, and these buffers will not be updated until fragment visibility is known.
Submit all opaque draws in a render pass first.
Avoid draws with the discard keyword in fragment shaders. If required, render them after opaque and before transparent.
Submit transparent/blended draws last.
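The submission order recommended above can be sketched as a stable sort on draw type. DrawType, Draw and orderByDrawType are hypothetical names; the sketch only groups draws by type and preserves the existing order within each group.

```cpp
#include <algorithm>
#include <vector>

// Enumerator values encode the submission order: opaque first,
// alpha test/discard second, transparent/blended last.
enum class DrawType { Opaque = 0, AlphaTest = 1, Transparent = 2 };

struct Draw {
    int id;         // application-defined draw identifier
    DrawType type;
};

// Groups draws by type while keeping the relative order inside each group,
// so any earlier depth-based sorting is not disturbed.
void orderByDrawType(std::vector<Draw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const Draw& a, const Draw& b) {
                         return static_cast<int>(a.type) < static_cast<int>(b.type);
                     });
}
```

Because the sort is stable, opaque draws that were already ordered front to back stay in that order after grouping.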
Mobile GPUs are optimized for System-on-Chip (SoC) area and power consumption. Some stages of the pipeline historically considered fixed function, e.g. blending, may be emulated on shader cores to reduce the SoC area occupied by the GPU. To emulate this functionality, many drivers compile the game’s shader code up front then patch the binary with additional state and operations before the draw call is executed. Shader patching is also used to configure the inputs and outputs of a draw.
Patching is usually very efficient and is unlikely to introduce performance bottlenecks.
- Keep state changes to a minimum.
Don't stall the pipeline
Efficient OpenGL ES drivers behave in such a way that work is queued on the CPU and processed by the GPU in a pipelined fashion. This ensures that the GPU remains busy and has work to process as soon as it completes its current task. This in turn means that the GPU is busy processing older workloads while the CPU still queues up work for the GPU to process. To keep the pipeline full with minimal latency, the Android window surface manager controls the number of buffers queued up for rendering. The purpose of double- or triple-buffering is to ensure the GPU can continue writing to new surfaces while the compositor is reading from a previously written surface.
Stalling the pipeline can occur when a backwards pipeline dependency has been introduced. In the case of glReadPixels, the pipeline is drained and the CPU blocks waiting for the GPU to complete rendering. This is because all queued draw commands in the GL driver queue have to be processed before downloading the pixels. This synchronization can be eliminated by using a Pixel Buffer Object (PBO). When a buffer is bound to GL_PIXEL_PACK_BUFFER, a call to glReadPixels returns immediately and the pixel data is copied into the buffer asynchronously.
A PBO works in such a way that you provide the buffer in advance and the data gets copied asynchronously when the GPU is ready. You can check (in most cases a few frames later) when the operation is complete by using a fence sync object on the CPU side. The downside of using PBOs is that you have to wait for the data to arrive, and the memory transfer itself can be expensive.
CPU and GPU work is executed in parallel. Avoid API calls that require one to block and wait for the other.
Prefer PBOs when glReadPixels functionality is required, e.g. capturing screenshots.
Framebuffer upload and resolve
Mobile devices have limited memory bandwidth. Additionally, data transfers over the memory bus are power intensive, so it's best to transfer as little data as possible.
In 3D graphics rendering, a framebuffer may need more than one attachment. In many cases, only some attachment data needs to be preserved - all other attachment data is temporary. For example, a colour buffer may be required for the rendered image and a depth buffer may be needed to ensure primitives are rendered in the intended order. In this scenario, the depth data doesn't need to be preserved so writing it from GPU memory to system memory wastes bandwidth. Additionally, the colour and depth buffer contents from frame N-1 may not be required for frame N. As uploading this data would redundantly use memory bandwidth, we want to tell the driver those operations aren't required.
Avoiding redundant attachment uploads
In many OpenGL ES drivers, a glClear operation for an attachment tells the driver that the previous attachment contents aren't required and a fast clear path should be taken. In this case, the clear indicates that the attachment upload operation can be optimized out.
Avoiding redundant attachment resolves
Depending on the version of OpenGL ES you are targeting, you may need to use an extension to inform the driver that an attachment's output can be discarded.
OpenGL ES 3.0 and newer: glInvalidateFramebuffer
OpenGL ES 2.0: GL_EXT_discard_framebuffer
If you don't need to upload an attachment's previous contents at the start of a frame, use glClear. This will benefit most mobile GPUs. Consider removing glClear calls when a GPU vendor recommends against the use of glClear.
If you don't need to preserve/resolve an attachment, inform the driver with glInvalidateFramebuffer or glDiscardFramebufferEXT. This benefits all GPU architectures.
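Choosing between the two discard mechanisms listed above can be sketched as a small decision function. DiscardPath and chooseDiscardPath are hypothetical names, and the inputs are assumed to come from the application's own version and extension queries at start-up.

```cpp
// Which mechanism, if any, is available for telling the driver that an
// attachment's contents do not need to be preserved.
enum class DiscardPath {
    InvalidateFramebuffer,   // core in OpenGL ES 3.0 and newer
    DiscardFramebufferEXT,   // GL_EXT_discard_framebuffer on OpenGL ES 2.0
    Unavailable
};

// `glesMajorVersion` and `hasDiscardFramebufferExt` are assumed to be
// determined once at start-up from run-time capability checks.
DiscardPath chooseDiscardPath(int glesMajorVersion, bool hasDiscardFramebufferExt) {
    if (glesMajorVersion >= 3) {
        return DiscardPath::InvalidateFramebuffer;
    }
    if (hasDiscardFramebufferExt) {
        return DiscardPath::DiscardFramebufferEXT;
    }
    return DiscardPath::Unavailable;
}
```

The selected path would typically be applied to temporary attachments, such as a depth buffer, once the last draw that needs them has been submitted for the render pass.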
- Arm Mali Best Practices (includes both OpenGL ES and Vulkan APIs)
- Arm Mali OpenGL ES developer resources
- Arm Mali OpenGL ES sample code