Understand the limitations of your target hardware and how to profile the GPU to optimize the rendering of your graphics. Try these tips and best practices to reduce the workload on the GPU.
Find many more best practices in the free e-book, Optimize your game performance for console and PC.
Use draw call batching
To draw a GameObject onscreen, Unity issues a draw call to the graphics API (e.g., OpenGL, Vulkan, or Direct3D). Each draw call is resource intensive.
State changes between draw calls, such as switching materials, can cause performance overhead on the CPU side. PC and console hardware can push many draw calls, but the overhead of each call is still high enough to warrant trying to reduce them. On mobile devices, draw call optimization is vital. You can achieve this with draw call batching.
Draw call batching minimizes these state changes and reduces the CPU cost of rendering objects. Unity can combine multiple objects into fewer batches using several techniques with the High Definition Render Pipeline (HDRP) or Universal Render Pipeline (URP):
- SRP batching: Enable the SRP Batcher in the Pipeline Asset under Advanced. When using compatible shaders, the SRP Batcher reduces the GPU setup between draw calls and makes material data persistent in GPU memory. This can also speed up your CPU rendering times significantly. Use fewer shader variants with minimal keywords to improve SRP batching. Consult SRP documentation to see how your project can take advantage of this rendering workflow.
- GPU instancing: If you have a large number of identical objects with the same mesh and material, use GPU instancing to batch them through graphics hardware. To enable GPU instancing, select your material in the Project window, then check Enable instancing in the Inspector.
- Static batching: For nonmoving geometry, Unity can reduce draw calls for meshes sharing the same material. This is more efficient than dynamic batching, but uses more memory. Mark all meshes that never move as Batching Static in the Inspector. Unity combines all static meshes into one large mesh at build time. The StaticBatchingUtility class also allows you to create these static batches at runtime (for example, after generating a procedural level of nonmoving parts).
- Dynamic batching: For small meshes, Unity can group and transform vertices on the CPU, then draw them all in one go. Note, however, that you should not use this unless you have enough low-poly meshes (no more than 300 vertices each and 900 total vertex attributes). Otherwise, enabling it will waste CPU time looking for small meshes to batch.
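The runtime static batching mentioned above can be sketched as follows. This is a minimal example, assuming `levelRoot` is a parent GameObject holding procedurally generated, nonmoving geometry (the name is illustrative, not from the original):

```csharp
using UnityEngine;

// Sketch: after generating a procedural level, combine its nonmoving
// parts into a static batch at runtime via StaticBatchingUtility.
// The children must not move after combining, and they batch most
// effectively when they share materials.
public class ProceduralLevelBatcher : MonoBehaviour
{
    public GameObject levelRoot; // parent of the generated static geometry

    void Start()
    {
        // Combines all meshes under levelRoot into a static batch,
        // as if they had been marked Batching Static at build time.
        StaticBatchingUtility.Combine(levelRoot);
    }
}
```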
You can maximize batching in a few simple ways:
- Use as few textures in a scene as possible. Fewer textures require fewer unique materials, making them easier to batch. Additionally, use Texture Atlases wherever possible.
- Always bake lightmaps at the largest atlas size possible. Fewer lightmaps require fewer material state changes, but keep an eye on the memory footprint.
- Be careful not to instance materials unintentionally. Accessing Renderer.material in scripts duplicates the material and returns a reference to the new copy. This breaks any existing batch that already includes the material. If you wish to access the batched object’s material, use Renderer.sharedMaterial instead.
- Keep an eye on the number of static and dynamic batch counts versus the total draw call count by using the Profiler or the rendering stats during optimization.
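The sharedMaterial pitfall above is easy to hit in scripts. A minimal sketch of the safe and unsafe access patterns:

```csharp
using UnityEngine;

public class MaterialAccessExample : MonoBehaviour
{
    void Start()
    {
        var rend = GetComponent<Renderer>();

        // Avoid: Renderer.material silently duplicates the material on
        // first access and returns the unique copy, which removes this
        // object from any batch that used the original material.
        // Material copy = rend.material;

        // Prefer: Renderer.sharedMaterial reads the material asset that
        // all renderers using it share, so existing batches stay intact.
        Material shared = rend.sharedMaterial;
        Debug.Log(shared.name);
    }
}
```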
Refer to the Draw call batching documentation for more information.
Check the Frame Debugger
Use the Frame Debugger to freeze playback on a single frame and step through the process of how Unity constructs a scene. In doing so, you can identify optimization opportunities. Look for GameObjects that render unnecessarily, and disable them to reduce draw calls per frame.
One main advantage of the Frame Debugger is that you can relate a draw call to a specific GameObject in the scene. This makes it easier to investigate issues that external frame debuggers cannot tie back to your scene.
Note: The Frame Debugger does not show timing for individual draw calls or state changes. While only native GPU profilers can give you detailed draw call and timing information, the Frame Debugger can still be very helpful for debugging pipeline problems or batching issues.
Read the Frame Debugger documentation for more details.
Optimize fill rate and reduce overdraw
Fill rate refers to the number of pixels that the GPU can render to the screen each second. If your game is limited by fill rate, this means that it’s trying to draw more pixels per frame than the GPU can handle.
Drawing on top of the same pixel multiple times is called overdraw. Overdraw decreases fill rate and costs extra memory bandwidth. The most common causes of overdraw are:
- Overlapping opaque or transparent geometry
- Complex shaders, often with multiple render passes
- Unoptimized particles
- Overlapping UI elements
While you should minimize its effect, there is no one-size-fits-all approach to solving overdraw. Experiment with the following techniques to reduce its impact.
Reduce the batch count
As with other platforms, optimization on consoles will often mean reducing draw call batches. Here are a few techniques that might help:
- Use Occlusion culling to remove objects hidden behind foreground objects and reduce overdraw. Be aware that this requires additional CPU processing, so use the Profiler to verify that moving work from the GPU to CPU is actually beneficial.
- GPU instancing can also reduce your batches if you have many objects that share the same mesh and material. Limiting the number of distinct models in your scene can improve performance; done artfully, you can build a complex scene without making it look repetitive.
- The SRP Batcher can reduce the GPU setup between draw calls by batching Bind and Draw GPU commands. To benefit from this SRP batching, use as many materials as needed, but restrict them to a small number of compatible shaders (e.g., Lit and Unlit shaders in URP and HDRP).
Pay attention to culling
Culling occurs for each camera and can have a large impact on performance, especially when multiple cameras are enabled concurrently. Unity uses two types of culling:
- Frustum culling is performed automatically on every camera. It ensures that GameObjects outside of the View Frustum are not rendered to save on performance.
- You can set per-layer culling distances manually via Camera.layerCullDistances. This allows you to cull small GameObjects at a distance shorter than the default farClipPlane property. To do this, organize GameObjects into Layers. Use the .layerCullDistances array to assign each of the 32 layers a value less than the farClipPlane (or use 0 to default to the farClipPlane).
- Unity culls by layer first. It only keeps the GameObjects on layers that the Camera uses. Afterward, Frustum culling removes any GameObjects outside the Camera Frustum.
- Frustum culling is performed as a series of jobs that harness available worker threads. Each layer culling test is quick (essentially just a bit mask operation). However, this cost could still add up with a large number of GameObjects. If this becomes a problem for your project, you might need to implement some system to divide your world into “sectors” and disable sectors that are outside the Camera Frustum in order to relieve some of the pressure on Unity’s Layer/Frustum culling system.
- Occlusion culling removes any GameObjects from the Game view if the Camera cannot see them. Objects hidden behind other objects can potentially still render and cost resources. Use Occlusion culling to discard them.
- For example, rendering a room is unnecessary if a door is closed and the Camera cannot see into the room. If you enable Occlusion culling, it can significantly increase performance but also use more disk space, CPU time, and RAM. Unity bakes the Occlusion data during the build and then needs to load it from disk to RAM while loading a scene.
- While Frustum culling outside the Camera view is automatic, Occlusion culling is a baked process. Simply mark your objects as Static, Occluders, or Occludees, then bake via the Window > Rendering > Occlusion culling dialog.
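The per-layer culling distances described above can be set up in a short script. This is a sketch; the "SmallProps" layer name is an assumption for illustration:

```csharp
using UnityEngine;

// Sketch: cull GameObjects on a "SmallProps" layer at 50 m instead of
// waiting for the camera's farClipPlane. A value of 0 in the array
// means "use the default farClipPlane" for that layer.
public class LayerCullingSetup : MonoBehaviour
{
    void Start()
    {
        var cam = GetComponent<Camera>();

        // One entry per layer; Unity requires exactly 32 entries.
        float[] distances = new float[32];
        distances[LayerMask.NameToLayer("SmallProps")] = 50f;
        cam.layerCullDistances = distances;
    }
}
```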
See the Working with Occlusion culling tutorial to learn more.
Leverage dynamic resolution
The Allow Dynamic Resolution Camera setting lets you dynamically scale individual render targets to reduce the workload on the GPU. When the application’s frame rate drops, you can gradually scale down the resolution to maintain a consistent frame rate.
Unity triggers this scaling if performance data suggests that the frame rate is about to decrease as a result of being GPU bound. You can also preemptively trigger this scaling manually with a script. This is useful if you are approaching a GPU-intensive section of the application. If scaled gradually, dynamic resolution can be almost unnoticeable.
Refer to the Dynamic resolution manual page for a list of supported platforms.
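Preemptive scaling from a script uses ScalableBufferManager. A minimal sketch, assuming Allow Dynamic Resolution is enabled on the Camera and the platform supports it; the 0.7 scale factor is an illustrative value, not a recommendation:

```csharp
using UnityEngine;

// Sketch: manually scale dynamic render targets down before entering a
// GPU-intensive section, then restore full resolution afterward.
public class DynamicResolutionController : MonoBehaviour
{
    public void EnterHeavySection()
    {
        // Scales the width and height of dynamic render targets to 70%.
        ScalableBufferManager.ResizeBuffers(0.7f, 0.7f);
    }

    public void ExitHeavySection()
    {
        // Back to native resolution.
        ScalableBufferManager.ResizeBuffers(1.0f, 1.0f);
    }
}
```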
Check multiple Camera views
Sometimes you need to render from more than one point of view during your game. For instance, it’s common in a first-person shooter (FPS) to draw the player’s weapon and the environment separately with different fields of view (FOV). This prevents the foreground objects from looking distorted through the wide-angle FOV of the background.
You could use Camera Stacking in URP to render more than one Camera view. However, there is still significant culling and rendering done for each camera. Each camera incurs some overhead, whether it’s doing meaningful work or not.
Only use Camera components required for rendering. On mobile platforms, every active camera can use up to 1 ms of CPU time, even when rendering nothing.
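Because even an idle camera costs CPU time, keep secondary cameras disabled until their view is needed. A minimal sketch, where the minimap use case is an illustrative assumption:

```csharp
using UnityEngine;

// Sketch: toggle a secondary camera (e.g., a minimap view) on and off
// so it only pays per-frame culling and setup costs while visible.
public class SecondaryCameraToggle : MonoBehaviour
{
    public Camera secondaryCamera; // assumed assigned in the Inspector

    public void SetViewVisible(bool visible)
    {
        if (secondaryCamera.enabled != visible)
            secondaryCamera.enabled = visible;
    }
}
```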
Use Level of Detail
As objects move into the distance, Level of Detail (LOD) can adjust or switch them to use lower-resolution meshes with simpler materials and shaders. This reduces the GPU workload.
See the Working with LODs course on Unity Learn for more details.
Profile post-processing effects
Profile your post-processing effects to see their cost on the GPU. Some fullscreen effects, like Bloom and Depth of Field, can be expensive, so it’s worth experimenting until you find a desired balance between visual quality and performance.
The cost of post-processing does not fluctuate much at runtime. Once you’ve determined your Volume Overrides, allot post-processing effects a static portion of your total frame budget.
Avoid Tessellation shaders
Tessellation subdivides a shape into smaller versions of itself, which can enhance detail through increased geometry. Though there are cases where Tessellation makes sense, such as the tree bark in the Unity demo Book of the Dead, try to avoid it on consoles because it is expensive on the GPU.
Read more about the Book of the Dead demo here.
Replace geometry shaders with compute shaders
Like Tessellation shaders, geometry and vertex shaders can run twice per frame on the GPU – once during the depth prepass, and again during the shadow pass.
If you want to generate or modify vertex data on the GPU, a compute shader is often the better choice, especially compared to a geometry shader. Doing the work in a compute shader means that the vertex shader that actually renders the geometry can operate much quicker.
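The dispatch side of such a compute pass can be sketched in C#. The "CSMain" kernel name, the "positionBuffer" binding, and the vertex count are hypothetical; the compute shader itself would write positions that the rendering vertex shader then reads as a StructuredBuffer:

```csharp
using UnityEngine;

// Sketch: generate vertex data in a compute shader each frame instead
// of using a geometry shader, so the rendering vertex shader stays cheap.
public class ComputeVertexGenerator : MonoBehaviour
{
    public ComputeShader vertexGenShader; // assumed assigned in the Inspector
    ComputeBuffer positionBuffer;
    const int vertexCount = 65536;

    void Start()
    {
        // One float3 position per vertex.
        positionBuffer = new ComputeBuffer(vertexCount, sizeof(float) * 3);
    }

    void Update()
    {
        int kernel = vertexGenShader.FindKernel("CSMain");
        vertexGenShader.SetBuffer(kernel, "positionBuffer", positionBuffer);
        // 64 threads per group is a common wavefront-friendly group size.
        vertexGenShader.Dispatch(kernel, vertexCount / 64, 1, 1);
    }

    void OnDestroy() => positionBuffer?.Release();
}
```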
Learn more about Shader core concepts.
Aim for good wavefront occupancy
When you send a draw call to the GPU, that work splits into many wavefronts that the GPU distributes throughout its available SIMDs. Each SIMD has a maximum number of wavefronts that can be running at the same time.
Wavefront occupancy refers to how many wavefronts are currently in use relative to the maximum. It measures how well you are using the GPU’s potential. The profiling tools for console development show wavefront occupancy in great detail.
In the above example from Unity’s Book of the Dead, vertex shader wavefronts appear in green and pixel shader wavefronts appear in blue. On the bottom graph, many vertex shader wavefronts appear without much pixel shader activity. This shows an underutilization of the GPU.
If you’re doing a lot of vertex shader work that doesn’t result in pixels, that might indicate an inefficiency. While low wavefront occupancy is not necessarily bad, it’s a metric you can use to start optimizing your shaders and checking for other bottlenecks. For instance, if you have a stall due to memory or compute operations, increasing occupancy can enhance performance. On the other hand, too many in-flight wavefronts can cause cache thrashing and decrease performance.
Utilize async compute
If you have intervals where you are underutilizing the GPU, you can leverage async compute to move compute shader work in parallel to your graphics queue. For example, during shadow map generation, the GPU performs depth-only rendering. Very little pixel shader work takes place at this point and many wavefronts remain unoccupied.
If you can synchronize some compute shader work with the depth-only rendering, this makes for a better overall use of the GPU. The unused wavefronts can help with Screen Space Ambient Occlusion (SSAO) or any task that complements the current work.
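Submitting work to the async compute queue can be sketched with a CommandBuffer. This is a minimal example, assuming async compute support on the target platform; the SSAO shader, its "CSMain" kernel, and the 1920x1080 dispatch size are illustrative placeholders:

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Sketch: run a compute pass on the background compute queue so it can
// overlap with depth-only graphics work such as shadow map generation.
public class AsyncComputeExample : MonoBehaviour
{
    public ComputeShader ssaoShader; // assumed assigned in the Inspector

    void Update()
    {
        if (!SystemInfo.supportsAsyncCompute)
            return;

        var cmd = new CommandBuffer { name = "Async SSAO" };
        cmd.DispatchCompute(ssaoShader, ssaoShader.FindKernel("CSMain"),
                            1920 / 8, 1080 / 8, 1);

        // Executes on a background compute queue, in parallel with the
        // graphics queue, instead of serializing with it.
        Graphics.ExecuteCommandBufferAsync(cmd, ComputeQueueType.Background);
        cmd.Release();
    }
}
```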
Watch this session on Optimizing performance for high-end consoles from Unite.