What you will get from this page: Updated scripting performance and optimization tips from Ian Dundore that reflect the evolution of Unity’s architecture to support data-oriented design.
Unity’s evolving, and the old tricks might no longer be the best ways to squeeze performance out of the engine. In this article, you’ll get a rundown of a few Unity changes (from Unity 5.x to 2019.x) and how you can take advantage of them.
Scripting performance part I
One of the most difficult tasks during optimization is choosing how to optimize code once a hot spot has been discovered. Many factors are involved: platform-specific details of the operating system, CPU, and GPU; threading; memory access; distribution of input data; parameters that need scaling, and others. It’s difficult to know in advance which optimization will produce the biggest real-world benefit.
Generally, it’s a good idea to prototype optimizations in small test projects – you can iterate much faster. However, isolating code into a test project poses its own challenge: simply isolating a piece of code changes the environment in which it runs. Thread timings may differ; the managed heap may be smaller or less fragmented. Therefore, you must be careful in designing your tests.
Start by considering the inputs to your code and how the code reacts when you change its inputs.
- How does it react to highly coherent data that’s located serially in memory? How does it handle cache-incoherent data?
- How much code have you removed from the loop in which your code runs? Have you altered the usage of the processor’s instruction cache?
- What hardware is it running on? How well does that hardware implement branch prediction? How well does it execute micro-operations out of order? Does it have SIMD support?
- If you have a heavily multithreaded system and it’s running on more cores versus fewer, how does the system react?
- What are the scaling parameters of your code? Does it scale linearly as its input set grows, or more than linearly?
Effectively, you must think about exactly what your test harness measures.
Implications: Sample test
As an example, consider the following test of a simple operation: comparing two strings (see image above).
When the C# APIs compare two strings, they perform locale-specific conversions to ensure that different characters match their counterparts from different cultures – and that is pretty slow.
While most string APIs in C# are culture-sensitive, one is not: String.Equals.
If you open up String.cs from Unity’s Mono repository on GitHub and look at String.Equals, you’ll see a very simple function. It performs a couple of checks before passing control to a function called EqualsHelper, a private function that you cannot call directly without using C# reflection.
EqualsHelper is a simple method. It walks through the strings 4 bytes at a time, comparing the raw bytes of the input strings. If it finds a mismatch, it stops and returns false.
But there are other ways to check for string equality. The most innocuous-looking is an overload of String.Equals that accepts two parameters: the string to compare against, and an enum called StringComparison.
The single-parameter overload of String.Equals does only a little work before passing control to EqualsHelper. So what does the two-parameter overload do?
The two-parameter overload’s code performs a few additional checks before entering a large switch statement. This statement tests the value of the input StringComparison enum. Since we’re seeking parity with the single-parameter overload, we want an ordinal comparison – a byte-by-byte comparison. In this case, control will flow past 4 checks before arriving at the StringComparison.Ordinal case, where the code then looks very similar to the single-parameter String.Equals overload. This means that if you used the two-parameter overload of String.Equals instead of the single-parameter overload, the processor would perform a few extra comparison operations. One could expect this to be slower, but it’s worth testing.
You don’t want to stop at testing just String.Equals when you’re interested in all the ways to compare strings for equality. There is an overload of String.Compare that can perform ordinal comparisons, as well as a method called String.CompareOrdinal, which itself has two different overloads.
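Collected together, the variants under test look like this. This is a sketch of the candidate APIs, not the benchmark harness itself; for ordinal equality, all of them should agree on the result:

```csharp
using System;

static class StringEqualityVariants
{
    // Each variant performs (or can be configured to perform) an ordinal
    // equality check. The article's benchmark measures which is fastest.
    public static void Demo()
    {
        string x = "hello", y = "hello";

        bool viaEquals        = x.Equals(y);                                   // single-parameter String.Equals
        bool viaEqualsOrdinal = x.Equals(y, StringComparison.Ordinal);         // two-parameter overload
        bool viaCompare       = string.Compare(x, y, StringComparison.Ordinal) == 0;
        bool viaCompareOrd    = string.CompareOrdinal(x, y) == 0;
        bool viaOperator      = x == y;                                        // == calls String.Equals internally

        Console.WriteLine(viaEquals && viaEqualsOrdinal && viaCompare && viaCompareOrd && viaOperator);
    }
}
```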
See part II for more!
Scripting performance part II
As a reference implementation, let’s also write a simple hand-coded version: a little function with a length check that iterates over each character of the two input strings and checks whether each pair is equal.
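A minimal sketch of that hand-coded reference might look like the following (the exact code used in the talk isn’t reproduced here):

```csharp
// Hand-coded reference: a length check, then a character-by-character walk.
static bool StringsEqual(string a, string b)
{
    if (a.Length != b.Length)
        return false;

    for (int i = 0; i < a.Length; i++)
    {
        if (a[i] != b[i])
            return false;
    }
    return true;
}
```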
After examining the code for all of these, there are four different test cases that seem immediately useful:
- Two identical strings, to test worst-case performance.
- Two strings of random characters but of identical length, to bypass length checks.
- Two strings of random characters and identical length with an identical first character, to bypass an interesting optimization found only in String.CompareOrdinal.
- Two strings of random characters and differing lengths, to test best-case performance.
After running a few tests, String.Equals is the clear winner. This remains true regardless of platform, or scripting runtime version, or whether we’re using Mono or IL2CPP.
It’s worth noting that String.Equals is the method used by the string equality operator, ==, so don’t run out and change a == b to a.Equals(b) all over your code!
Actually, examining the results, it’s striking how much worse the hand-coded reference implementation is. Examining the generated IL2CPP code, we can see that Unity injects a number of array bounds checks and null checks when our code is cross-compiled.
These can be disabled. In your Unity install folder, find the IL2CPP subfolder. Inside that subfolder, you'll find IL2CPPSetOptionAttributes.cs. Drag this file into your project, and you'll get access to the Il2CppSetOptionAttribute.
You can decorate types and methods with this attribute. You can configure it to disable automatic null checks, automatic array bounds checks, or both. Sometimes this can speed up code by a substantial amount. In this particular test case, it speeds up the hand-coded string comparison method by about 20%.
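Usage looks like this – once the file is in your project, the attribute lives in the Unity.IL2CPP.CompilerServices namespace. The method body here is the illustrative hand-coded comparison, not Unity’s own code:

```csharp
using Unity.IL2CPP.CompilerServices;

public static class FastStringCompare
{
    // Disable IL2CPP's automatic null checks and array/string bounds checks
    // for this method only. Only do this for code you have verified never
    // dereferences null or indexes out of range.
    [Il2CppSetOption(Option.NullChecks, false)]
    [Il2CppSetOption(Option.ArrayBoundsChecks, false)]
    public static bool Equals(string a, string b)
    {
        if (a.Length != b.Length)
            return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i])
                return false;
        return true;
    }
}
```

The attribute can be applied per-method, as here, or to a whole type; scoping it as narrowly as possible limits the risk of removing checks from code that actually needs them.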
The Transform component is used by more of Unity’s other systems – Animation, Physics, UI, Rendering – than almost any other. Since Unity 2017.4 and 2018.1, the Transform system has been built on top of two key concepts: the TransformHierarchy and the TransformChangeDispatch.
If we examine the memory layout of a Unity scene, each root Transform corresponds to a contiguous data buffer. This buffer, called a TransformHierarchy, contains data for all of the Transforms beneath a root Transform. We pack transform data into a TransformHierarchy so that we can perform Transform calculations efficiently. For example, when calculating the world position of a GameObject, having all local translation / rotation / scaling Transform data packed in one well laid-out data structure improves performance by minimizing cache misses.
In addition, a TransformHierarchy also stores metadata about each Transform it contains. This metadata includes a bitmask to track which of Unity’s other systems is interested in changes to a specific Transform. It also includes a bitmask which indicates whether a Transform is “dirty” for a specific system – if the Transform’s position, rotation, or scale has changed since the last time the Transform was marked as being “clean” by that system.
With this data, Unity can now create a list of dirty Transforms for each of its other internal systems. The system that handles these queries for dirty Transforms in a multithreaded manner is called TransformChangeDispatch. For example, the Physics system can query TransformChangeDispatch to fetch a list of Transforms whose data has changed since the last time the Physics system ran a FixedUpdate.
However, to assemble this list of changed Transforms, TransformChangeDispatch cannot simply iterate over all Transforms in a scene. That would be very slow in scenes containing many Transforms – especially since, in many cases, very few of them will have changed.
To fix this, TransformChangeDispatch tracks a list of dirty TransformHierarchy structures. Whenever a Transform changes, it marks itself as dirty, marks its children as dirty, and then registers its TransformHierarchy with TransformChangeDispatch. When another system inside Unity requests a list of changed Transforms, TransformChangeDispatch iterates over each Transform in each dirty TransformHierarchy structure. Transforms with the appropriate dirty bits set are added to a list, which is returned to the system making the request.
Because of this architecture, the more you split up your hierarchies, the more granularly Unity can track changes. While a large hierarchy of Transforms that never change may be perfectly acceptable, a single constantly changing Transform inside a mostly static hierarchy forces Unity to scan many unchanged Transforms uselessly.
Moreover, TransformChangeDispatch uses Unity’s internal multithreading system to split up the work of examining TransformHierarchy structures, and the smallest unit of work is one hierarchy. Splitting your hierarchies into reasonably sized chunks therefore also lets that work spread across more threads.
The scanning for dirty flags across TransformHierarchy structures incurs some overhead each time a system needs to request a list of changed Transforms from TransformChangeDispatch. Most of Unity’s internal systems request updates once per frame, immediately before they run. For example, the Animation system requests updates immediately before it evaluates all the active Animators in your scene. Similarly, the rendering system requests updates for all active Renderers in your scene before it begins culling the list of visible objects.
The physics system works differently from the other systems. Since Unity 2017.2, it has been built on top of TransformChangeDispatch. Each time you perform a Raycast, Unity needs to query TransformChangeDispatch for a list of changed Transforms and apply those changes to the physics world. That can be expensive, depending on how large your Transform hierarchies are and how your code calls the physics APIs. However, if the TransformChangeDispatch query is skipped, the Raycast may run against out-of-date data.
Unity lets you choose which behaviour to use via the Physics.autoSyncTransforms setting. This can be specified either in the Physics settings of the Unity Editor or at runtime by setting the Physics.autoSyncTransforms property.
From Unity 2017.2 to Unity 2018.2, Physics.autoSyncTransforms defaults to true. In this case, Unity will automatically synchronize the physics world to Transform updates each time you call a physics query API like Raycast or Spherecast.
From Unity 2018.3 onwards, Physics.autoSyncTransforms defaults to false. In this case, the physics system will only query the TransformChangeDispatch system for changes at two specific times: immediately before running FixedUpdate, where physics simulation is performed, and (if there is any rigidbody performing RigidbodyInterpolation) before Update, where the interpolated physics simulation result is written back into the Unity scene.
Setting Physics.autoSyncTransforms to false will eliminate spikes due to TransformChangeDispatch and Physics scene updates from Physics queries. However, changes to Colliders will not be synchronized into the Physics scene until the next FixedUpdate. This means that if you disable autoSyncTransforms, move a Collider and then call Raycast with a Ray directed at the Collider’s new position, the Raycast might not hit the Collider. This is because the Raycast is operating on the last-updated version of the physics scene which has not yet been updated with the Collider’s new position.
This can result in some issues in your project. Before you perform any physics query such as Raycast, you can call Physics.SyncTransforms to force the physics system to synchronize the physics world with the Unity scene. A recommended approach is to call Physics.SyncTransforms once and then perform all of your physics queries in a batch.
The example above illustrates the difference between scattered and batched physics queries.
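With autoSyncTransforms disabled, the batched pattern might look like this sketch; the pendingRays array stands in for whatever query data your gameplay code accumulates:

```csharp
using UnityEngine;

public class BatchedQueries : MonoBehaviour
{
    // Illustrative: rays gathered by other gameplay code during the frame.
    public Ray[] pendingRays;

    void Update()
    {
        // Synchronize the physics world with the scene exactly once...
        Physics.SyncTransforms();

        // ...then run every query against the freshly synchronized world.
        foreach (Ray ray in pendingRays)
        {
            if (Physics.Raycast(ray, out RaycastHit hit, 100f))
            {
                Debug.Log($"Hit {hit.collider.name}");
            }
        }
    }
}
```

The scattered anti-pattern would interleave Collider movement and queries, either missing moved Colliders or (with autoSyncTransforms enabled) paying for a physics-world resync on every individual query.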
Transforms: performance of scattered vs batched physics query
The performance difference between these two examples is striking and becomes even more dramatic when a scene contains only small Transform hierarchies (see example above).
Furthermore, if your project performs a lot of raycasts, consider batching your physics queries into a job via, for example, the RaycastCommand API. This allows the queries to run in parallel off the main thread. Note that if Physics.autoSyncTransforms is set to false, you still need to call Physics.SyncTransforms before scheduling a RaycastCommand job to ensure your scene is up to date.
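A sketch of that pattern, using the RaycastCommand constructor as it existed in the Unity versions this article covers (pre-2022):

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

public class BatchedRaycastJob : MonoBehaviour
{
    void RunBatch(Vector3[] origins, Vector3 direction)
    {
        // With autoSyncTransforms disabled, sync before scheduling the job
        // so the queries see up-to-date collider positions.
        Physics.SyncTransforms();

        var commands = new NativeArray<RaycastCommand>(origins.Length, Allocator.TempJob);
        var results  = new NativeArray<RaycastHit>(origins.Length, Allocator.TempJob);

        for (int i = 0; i < origins.Length; i++)
            commands[i] = new RaycastCommand(origins[i], direction, 100f);

        // Schedule the raycasts to run in parallel across worker threads.
        JobHandle handle = RaycastCommand.ScheduleBatch(commands, results, minCommandsPerJob: 32);
        handle.Complete();

        for (int i = 0; i < results.Length; i++)
        {
            if (results[i].collider != null)
                Debug.Log($"Ray {i} hit {results[i].collider.name}");
        }

        commands.Dispose();
        results.Dispose();
    }
}
```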
If you have a C# subsystem that needs to access TransformChangeDispatch from script, the only way is through the Transform.hasChanged property, which is built internally on top of the TransformChangeDispatch system. After gathering the list of changed Transforms on the main thread in this fashion, you can process them in a job using the IJobParallelForTransform API.
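One possible shape for that pattern, assuming a tracked-Transform array supplied by your own subsystem:

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Jobs;

// Illustrative job: reads each changed transform's position in parallel.
struct ReadPositionsJob : IJobParallelForTransform
{
    public void Execute(int index, TransformAccess transform)
    {
        Vector3 p = transform.position;
        // ... do per-transform work here ...
    }
}

public class ChangedTransformTracker : MonoBehaviour
{
    public Transform[] tracked;

    void LateUpdate()
    {
        // Gather changed Transforms on the main thread via hasChanged.
        var changed = new List<Transform>();
        foreach (Transform t in tracked)
        {
            if (t.hasChanged)
            {
                changed.Add(t);
                t.hasChanged = false; // reset the flag for the next frame
            }
        }
        if (changed.Count == 0) return;

        // Hand the changed set to a job via TransformAccessArray.
        var access = new TransformAccessArray(changed.ToArray());
        new ReadPositionsJob().Schedule(access).Complete();
        access.Dispose();
    }
}
```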
The Audio system
Internally, Unity uses a system called FMOD to play AudioClips. FMOD runs on its own threads, which are responsible for decoding and mixing audio together. However, audio playback isn’t entirely free. There is some work performed on the main thread for each Audio Source active in a scene. Also, on platforms with fewer cores (such as older mobile phones), FMOD’s audio threads can compete for processor cores with Unity’s main and rendering threads.
On each frame, Unity loops over all active Audio Sources. For each Audio Source, Unity calculates the distance between the audio source and the active audio listener, and a few other parameters. This data is used to calculate volume attenuation, doppler shift, and other effects that can affect individual Audio Sources.
A common issue comes from the "Mute" checkbox on an Audio Source (see image above). You might think that setting Mute to true would eliminate any computation related to the muted Audio Source – but it doesn’t!
Instead, the “Mute” setting simply clamps the Volume parameter to zero after all other volume-related calculations, including the distance check, are performed. Unity will also submit the muted Audio Source to FMOD, which will then ignore it. The calculation of Audio Source parameters and the submission of Audio Sources to FMOD will show up as AudioSystem.Update in the Unity Profiler.
If you notice a lot of time allocated to that Profiler marker, check to see if you have a lot of active Audio Sources which are muted. If they are, consider disabling the muted Audio Source components instead of muting them, or disabling their GameObject. You can also call AudioSource.Stop, which will stop playback.
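A minimal helper along those lines – the method name is illustrative, not a Unity API:

```csharp
using UnityEngine;

public static class AudioSourceMuteHelper
{
    // Instead of leaving a muted AudioSource active (which still pays
    // per-frame parameter calculation and FMOD submission costs),
    // stop playback and disable the component.
    public static void SilenceCheaply(AudioSource source)
    {
        source.Stop();          // stops playback...
        source.enabled = false; // ...and skips the per-frame update entirely
    }
}
```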
Another thing you can do is clamp the voice count in Unity’s Audio Settings. To do this, you can call AudioSettings.GetConfiguration, which returns a structure containing two values of interest: the virtual voice count, and the real voice count.
Reducing the number of Virtual Voices will reduce the number of Audio Sources which FMOD will examine when determining which Audio Sources to actually play back. Reducing the Real Voice count will reduce the number of Audio Sources which FMOD actually mixes together to produce your game’s audio.
To change the number of Virtual or Real voices that FMOD uses, you should change the appropriate values in the AudioConfiguration structure returned by AudioSettings.GetConfiguration, then reset the Audio system with the new configuration by passing the AudioConfiguration structure as a parameter to AudioSettings.Reset. Note that this interrupts audio playback, so do this when players won’t notice the change, such as during a loading screen or at startup time.
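That flow might look like this sketch; the voice counts used here are illustrative values, not recommendations:

```csharp
using UnityEngine;

public static class AudioVoiceConfig
{
    // Reduce FMOD's voice counts. AudioSettings.Reset interrupts audio
    // playback, so call this at startup or behind a loading screen.
    public static void ClampVoices()
    {
        AudioConfiguration config = AudioSettings.GetConfiguration();
        config.numVirtualVoices = 128; // sources FMOD considers for playback
        config.numRealVoices    = 24;  // sources FMOD actually mixes
        AudioSettings.Reset(config);
    }
}
```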
There are two different systems that can be used to play animations in Unity: the Animator system and the Animation system.
By “Animator system” we mean the system involving the Animator component, which is attached to GameObjects in order to animate them, and the AnimatorController asset, which is referenced by one or more Animators. This system was historically called Mecanim.
In an Animator Controller, you define states. These states can be either an Animation Clip or Blend Tree. States can be organized into Layers. For each frame, the active state on each Layer is evaluated, and the results from each active Layer are blended together and applied to the animated model. When transitioning between two states, both states are evaluated.
The other system is one we call the “Animation system,” represented by the Animation component. Each frame, each active Animation component linearly iterates through all the curves in its attached Animation Clip, evaluates those curves, and applies the results.
The difference between these two systems is not just features, but also underlying implementation details.
The Animator system is heavily multithreaded. Moreover, the Animator system bakes an animation clip into a “streamed animation clip,” in which multiple curves are stored in the same stream and keys for all curves are sorted by time, a layout designed to reduce cache misses. In general, it scales well as the number of curves in its Animation Clips increases. Therefore, it performs very well when evaluating complex animations with a large number of curves. However, it has a fairly high overhead cost.
The Animation system, having relatively few features, has almost no overhead. Its performance scales less well with the number of curves in the Animation Clips being played.
This difference is most striking when the two systems are compared when playing back identical Animation Clips (see image above).
When playing back Animation Clips, try to choose the system that best suits the complexity of your content and the hardware on which your game will be running. Test your animations on your lowest-end hardware.
Generic vs humanoid rig
By default, Unity imports animated models with the Generic Rig, but developers often switch to the Humanoid Rig when animating a character. However, using the Humanoid Rig comes at a cost.
The Humanoid Rig brings two additional features to the Animator System: inverse kinematics (IK) and animation retargeting (which allows you to reuse animations across different avatars).
However, even if you don’t use IK or animation retargeting, the Animator of a Humanoid-rigged character computes IK and retargeting data every frame. This consumes about 30–50% more CPU time than the Generic Rig, where these calculations are not performed.
If you don’t need to take advantage of the specific features that the Humanoid Rig offers, you should use the Generic Rig.
Object pooling is a key strategy for avoiding performance spikes during gameplay. However, Animators have historically been difficult to use with object pooling. Whenever an Animator’s GameObject is enabled, it must build a list of pointers to the memory address of the properties the Animator is animating. This means querying the hierarchy for a component’s specific field, which can be expensive. This process is called an Animator Rebind, and it shows up in the Unity Profiler as Animator.Rebind.
The Animator Rebind is unavoidable at least once per scene. It involves recursively traversing all children of the GameObject the Animator is attached to, hashing their names, and comparing those hashes to the hash of each animation curve’s target path. Children that the hierarchy is not animating therefore add extra cost to the binding process. Avoiding an Animator on a GameObject with a huge number of children it’s not animating helps Rebind performance.
Rebinding a MonoBehaviour is more expensive than rebinding built-in classes such as Transform. The Animator component scans the fields of the MonoBehaviour to create a list sorted by the hash of each field’s name. Then, for each animation curve animating a field of that MonoBehaviour, a binary search is performed on the sorted list. To reduce Rebind time, keep the fields of the MonoBehaviours you are animating simple, and avoid large nested serializable structures.
After the inevitable initial Rebind, you should pool your GameObjects. Instead of enabling and disabling the whole GameObject, enable and disable just the Animator component to avoid rebinding.
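A pooling sketch built on that idea; the class and method names are illustrative:

```csharp
using UnityEngine;

// Pooling sketch: keep the GameObject active and toggle only the Animator,
// so the expensive Animator.Rebind happens once, not on every reuse.
public class PooledAnimatedObject : MonoBehaviour
{
    Animator animator;

    void Awake()
    {
        animator = GetComponent<Animator>();
    }

    public void ReturnToPool()
    {
        animator.enabled = false; // stops evaluation; bindings are kept
        // also disable renderers, colliders, etc. as your pool requires
    }

    public void TakeFromPool()
    {
        animator.enabled = true;  // resumes playback without a full Rebind
    }
}
```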