GPU Lightmapper: A Technical Deep Dive
The Lighting Team is going all-in on iteration speed. We designed the Progressive Lightmapper with that goal in mind. Our goal is to provide quick feedback on any changes that you make to the lighting in your project. In 2018.3 we introduced a preview of the GPU version of the Progressive Lightmapper. Now we’re heading towards the feature and visual quality parity with its CPU sibling. We aim to make the GPU version an order of magnitude faster than the CPU version. This brings interactive lightmapping to artistic workflows, with great improvements to team productivity.
With this in mind, we have chosen to use RadeonRays: an open source ray tracing library from AMD. Unity and AMD have collaborated on the GPU Lightmapper to implement several key features and optimizations. Namely: power sampling, rays compaction, and custom BVH traversal.
The design goal of the GPU Lightmapper was to offer the same features of the CPU Lightmapper while achieving higher performance:
- Unbiased interactive lightmapping
- Feature parity between CPU and GPU backends
- Compute based solution
- Wavefront path tracing for maximum performance
We know that iteration time is the key to empowering artists to improve visual quality and unleash creativity. Interactive lightmapping is the goal here. Not just impressive overall bake times, we also want the user experience to offer immediate feedback.
We needed to solve a bunch of interesting problems to achieve this. In this post, we will explore some of the decisions we have made.
For the Lightmapper to offer progressive updates to the user, we needed to make some design decisions.
We don’t cache irradiance or visibility when doing direct lighting (direct lighting could be cached and reused for indirect lighting). In general, we don’t cache any data and prefer computation steps that are small enough to not create stalls, and provide a progressive and interactive display while baking.
Scenes can potentially be very large and contain many lightmaps. To ensure that work is spent where it offers the most benefit to the user, it is important to focus baking on the currently visible area. To do this, we first detect which of the lightmaps contain most unconverged visible texels on a screen, then we render those lightmaps and prioritize the visible texels (off-screen texels will be baked once all the visible ones have converged).
A texel is defined as visible if it’s in the current camera frustum and if it isn’t occluded by any Scene static geometries.
We do this culling on the GPU (to take advantage of fast ray tracing). Here is the flow of a culling job.
The culling jobs have two outputs:
- A culling map buffer, storing whether each texel of the lightmap is visible. This culling map buffer is then used by the rendering jobs.
- An integer representing the number of visible texels for the current lightmap. This integer will be asynchronously read back by the CPU to adjust lightmap scheduling in the future.
In the video below, we can see the effect of the culling. The bake is stopped midway for demo purposes. So when the Scene view moves, we can see not yet baked texels (i.e. black) that aren’t visible from the initial camera position and direction.
For performance reasons, the visibility information is updated only every time the camera state ‘stabilizes’. Also, supersampling isn’t taken into account.
GPUs are optimized for taking huge batches of data and performing the same operation on all of it; they’re optimized for throughput. What’s more, the GPU achieves this acceleration while being more power- and cost-efficient than a many-core CPU. However, GPUs are not as good as a CPUs in terms of latency (intentionally, by the design of the hardware). That’s why we use a data-driven pipeline with no CPU-GPU sync points to get the most out of the GPU’s inherently parallel computation nature.
However, raw performance isn’t enough. User experience is what matters, and we measure it in visual impact over time aka convergence rate. So we also need efficient algorithms.
GPUs are meant to be used on large data sets, and they‘re capable of high throughput at the cost of latency. Also, they’re usually driven by a queue of commands filled ahead of time by the CPU. The goal of that continuous stream of large commands is to make sure we can saturate the GPU with work. Let’s look at the key recipes we are using to maximize throughput and thus raw performance.
The way we approach the GPU lightmapping data pipeline is based on the following principles:
1. We prepare the data once.
At this point, CPU and GPU might be in sync in order to reduce memory allocation.
2. Once the bake has started, no CPU-GPU sync points are allowed.
The CPU is sending a predefined workload to the GPU. This workload will be over-conservative in some cases (for example using 4 bounces but all indirect rays finished after the 2nd bounce then we still have enqueued kernels that will be executed but early out).
3. The GPU cannot spawn rays nor kernels.
Rather, it might be asked to process empty jobs (or very small ones). To handle those cases efficiently, kernels are written in a way where data and instruction coherency is maximized. We handle this via data `compaction`, more on this later.
4. We don’t want any CPU-GPU sync points, nor any sort of GPU bubbles once the bake has started.
For example, some OpenCL commands can create small GPU bubbles (i.e. moments where the GPU have nothing to process), such as clEnqueueFillBuffer or clEnqueueReadBuffer (even in the asynchronous versions), so we avoid them as much as possible. Also, data processing needs to remain on the GPU for as long as possible (i.e. rendering and compositing up to completion). When we need to bring data back to the CPU for additional processing, we will do so asynchronously and neither have it send back to the GPU again. For example, seam stitching is a CPU post-process at the moment.
5. CPU will adapt the GPU load in an asynchronous fashion.
Changing the lightmap being rendered when the camera view changes or when a lightmap is fully converged will incur some latency. CPU threads generate and handle those readback events using a lockless queue to avoid mutex contention.
One of the key features of the GPU architecture is wide SIMD instruction support. SIMD stands for Single Instruction Multiple Data. A set of instructions will be executed sequentially in lockstep on a given amount of data inside of what is called a warp/wavefront. The size of those wavefronts/warps is 64, 32 or 16 values (depending on the GPU architecture). Therefore a single instruction will apply the same transformation to multiple data - single instruction multiple data. However, for greater flexibility, the GPU is also able to support divergent code paths in its SIMD implementation. To do this it can disable some threads while working on a subset before rejoining. This is called SIMT: Single instruction multiple threads. However, this comes at a cost as divergent code paths within a wavefront/warp will only profit from a fraction of the SIMD unit. Read this excellent blog post for more info.
Finally, a neat extension of the SIMT idea is the ability of the GPU to keep around many warps/wavefronts per SIMD core. If a wavefront/warp is waiting for slow memory access, the scheduler can switch to another wavefront/warp and continue working on that in the meantime (providing there is enough pending work). For this to really work however, the amount of resources needed per context needs to be low, so that the occupancy (the amount of pending work) can be high.
Summing up we should aim for:
- Many threads in flight
- Avoiding divergent branches
- Good occupancy
Having good occupancy is all about the kernel code and is too broad of a subject to be a part of this blog post. Here are some great resources:
- Understanding Latency Hiding on GPUs by Vasily Volkov (NVIDIA)
- Intro to GPU Scalarization by Francesco Cifariello (Unity Technologies)
In general, the goal is to use local resources sparsely, especially vector registers and local shared memory.
Let’s take a look at what could be the flow for baking direct lighting on the GPU. This section mostly covers lightmaps however, Light Probes work in a very similar way, except that they don’t have visibility or occupancy data.
There are a few problems here:
- Lightmap occupancy in that example is 44% (4 occupied texels over 9), so only 44% of the GPU threads will actually produce usable work! On top of that, useful data is sparse in memory so we will pay for bandwidth even for unoccupied texels. In practice, lightmap occupancy is usually between 50% to 70% hence a huge potential gain.
- Data set is too small. The example is showing a 3x3 lightmap for simplicity but even the common case of a 512x512 lightmap will be a too small data set for recent GPUs to attain top efficiency.
- In an earlier section, we talked about view prioritization and the culling job. The two points above are even truer as some occupied texels won’t be baked because they are not currently visible in the Scene view, lowering occupancy and overall data set even more.
How do we solve that? As part of a collaboration with AMD, ray compaction was added. The idea vastly improves both ray tracing and shading performance. In short, the idea is to create all the ray definitions in contiguous memory allowing all the threads in a warp/wavefront to work on hot data.
In practice you also need each ray to know the index of the texel it is related to, we store this in the ray payload. Also, we store the global compacted ray count.
Here is the flow with compaction:
Both the kernels that are shading and tracing the rays can now run only on hot memory and with minimal divergence in code paths.
What’s next? Well, we haven’t solved the fact that the data set could be too small for the GPU, especially if view prioritization is enabled. The next idea is to decorrelate the generation of rays from the gbuffer representation. With the naive approach, we only generate one ray per texel. Since we will eventually want to generate more rays anyway, we might as well generate several rays per texels up front. In this way, we can create more meaningful work for the GPU to chew on. Here is the flow:
Before compaction we generate many rays per texel and we call this expansion. We also generate meta information that is used in the gather step to accumulate into the correct destination texel.
Both the expansion and gather kernels are not executed very often. In practice we expand and then shade every light (for direct) or process all bounces (for indirect), to finally gather only once.
With these techniques we achieve our goal: we generate enough work to saturate the GPU and we spend bandwidth only on texels that matter.
These are the benefits of shooting multiple rays per texel:
- The set of active rays will always be a large data set even in view prioritization mode.
- Preparation, tracing, and shading are all working on very coherent data as the expansion kernel will create rays targeting the same texel in continuous memory.
- The expansion kernel handles occupancy and visibility, making the preparation kernel much simpler and thus faster.
- The size of the expanded/working data set buffers is decoupled from the size of the lightmap.
- The number of rays we shoot per texel can be driven by any algorithm, a natural expansion is going to be adaptive sampling.
Indirect lighting uses very similar ideas, albeit more complex:
With indirect light we have to perform multiple bounces, each one can discard random rays. Thus we do compaction iteratively to keep working on hot data.
The heuristic we currently use favors an equal amount of rays per texel. The goal is to get a very progressive output. However, a natural extension of this would be to improve these heuristics by using adaptive sampling, so to shoot more rays where the current results are noisy. Also, the heuristic could aim for a greater coherency, both in memory and in thread group execution, by being aware of the wavefront/warp size of the hardware.
Assets from ArchVizPRO baked with GPU Lightmapper.
There are many use cases for transparency/translucency. A common way to handle transparency and translucency is to cast a ray, detect intersection, fetch material and schedule a new ray if the encountered material is translucent or transparent. However, in our case, the GPU cannot spawn rays for performance reasons (please refer to the `Data-driven pipeline` section above). Also, we can’t reasonably ask the CPU to schedule enough rays in advance so we are sure that we handle the worst possible case, as this would be a major performance hit.
Thus we went for a hybrid solution. We handle translucency and transparency differently allowing to solve the issues above:
Transparency (when a material is not opaque because of holes in it). In that case, the ray can either go through or bounce off the material based on a probability distribution. Thus the workload prepared in advance by the CPU does not need to change, we are still Scene independent.
Translucency (when a material is filtering the light that goes through it). In that case, we approximate and do not consider refraction. In other words, we let the material color the light, but not change its direction. This allows us to handle translucency while walking the BVH, meaning we can handle easily a large number of cutout materials and scale very well with translucency complexity in the Scene.
However, there is a quirk; BVH traversal is out of order:
In the case of occlusion rays, this is actually fine as we are only interested in the attenuation from translucence of each intersected triangle along the ray. As multiplication is commutative, out of order BVH traversal is not a problem.
However for intersection rays what we want is to be able to stop on a triangle (in a probabilistic way when the triangle is transparent) and to collect translucence attenuation for each triangle from the ray origin to the hit point. As BVH traversal is out of order the solution we have chosen is to first only run the intersection to find the hit point, and mark the ray if any translucency was hit. For every marked ray, we thus generate an extra occlusion ray from the intersection ray origin to the intersection ray hit. To do this efficiently we use compaction when generating the occlusion rays, that means one will only pay the extra cost if the intersection ray was marked as needing translucency handling.
All of that was possible thanks to the open source nature of RadeonRays which was forked and customized to our needs as part of the collaboration with AMD.
We have seen what we do in regard to raw performance, great! However, it is only the first part of the puzzle. High samples per second are great but what really matters, in the end, is the bake time. In other words, we want to get the maximum out of every ray we cast. This last statement is actually the root of decades of ongoing research. Here are some great resources:
Ray Tracing: The Rest of Your Life
Unity GPU Lightmapper is a pure diffuse lightmapper. This simplifies the interaction of the light with the materials a lot and also helps dampen fireflies and noise. However, there is still a lot we can do to improve the convergence rate. Here are some of the techniques that we use:
Russian roulette
At each bounce, we probabilistically kill the path based on accumulated albedo. One can find a great explanation in Eric Veach’s thesis (page 67).
Environment Multiple Importance Sampling (MIS)
HDR environments that exhibit high variance can cause a considerable amount of noise in the output, requiring huge sample counts to produce pleasing results. Therefore, we apply a combination of sampling strategies specifically tailored to evaluate the environment by analyzing it first, identifying important areas, and sampling accordingly. This approach, which is not exclusive to environmental sampling, is generally known as multiple importance sampling and was initially proposed in Eric Veach’s thesis (page 252). This was done in collaboration with Unity Labs Grenoble.
Many lights
At each bounce, we probabilistically select one direct light and we limit the number of lights affecting surfaces with a spatial grid structure. This was done in collaboration with AMD. We are currently investigating deeper in the many light problem as light selection sampling is critical to quality.
Denoising
Noise is removed by using an AI denoiser trained on outputs from a path tracer. See Jesper Mortensen’s Unity GDC 2019 presentation.
We have seen how a data-driven pipeline, the attention to raw performance and efficient algorithms are combined together to offer an interactive lightmapping experience with the GPU Lightmapper. Please note that the GPU Lightmapper is in active development and is constantly being improved.
The Lighting Team
PS: If you think this was a fun read, and are interested in taking up a new challenge, we’re currently looking for a Lighting Developer in Copenhagen, so get in touch!
---
Want to learn how to optimize graphics in Unity? Check out this tutorial.