What are you looking for?
Engine & platform
ML-Agents v2.0 release: Now supports training complex cooperative behaviors
May 5, 2021|15 Min
ML-Agents v2.0 release: Now supports training complex cooperative behaviors

About one year ago, we announced the release of the ML-Agents v1.0 Unity package, which was verified for the 2020.2 Editor release. Today, we’re delighted to announce the v2.0 release of the ML-Agents Unity package, currently on track to be verified for the 2021.2 Editor release. Over this past year, we’ve made more than fifteen key updates to the ML-Agents GitHub project, including improvements to the user workflow, new training algorithms and features, and a significant performance boost. In this blog post, we will highlight three core developments: The ability to train cooperative behaviors, enable agents to observe various entities in their environment, and harness task parameterization to support training multiple tasks. These advancements combined move ML-Agents closer to fully supporting complex cooperative environments.

In our 2020 end-of-year blog post, we provided a brief summary of all the progress we had made from our v1.0 release in May 2020 through December of that same year. We also unpacked three main algorithmic improvements that we had planned to focus on for the first half of 2021: Cooperative multi-agent behavior, the capacity of an agent to observe a varying number of entities, and establishing a single model to solve several tasks. We can now proudly say that all three major improvements are available in ML-Agents.

In addition to these three features, we made the following changes to the main ML-Agents package:

Added a number of capabilities that were previously part of our companion extensions Unity package – i.e., Grid Sensors component and Match-3 game boards

Enhanced memory allocation during inference. In some of our demo scenes, we observed up to 98% reduction.

Removed deprecated APIs and reduced our API footprint. These API-breaking adjustments necessitated our version upgrade from 1.x to 2.0. See our Release notes and Migration guide for further details on upgrading with ease.

In the remainder of this blog post, we will expand on the roles of cooperative behaviors, variable length observation, and task parameterization, along with two incremental improvements: Promotion of features from the extensions package, and overarching performance. We will also provide an update on our ML-Agents Cloud offering and share a preview of our exciting new game environment that will highlight complex cooperative behaviors, ahead of its release in just a few short weeks.

Training cooperative behaviors

In many environments, such as multiplayer games like Among Us, the players in the game must collaborate to solve the tasks at hand. While it may have been previously possible to train ML-Agents with multiple agents in the scene, you could not define specific agent groups with mutual goals up until Release 15 (March 2020). ML-Agents now explicitly supports training cooperative behaviors. This way, groups of agents can work toward a common goal, with the success of each individual tied to the success of the whole group.

In such scenarios, agents typically receive rewards as a group. So if a team of agents wins a game against an opposing team, everyone is rewarded, even the agents who did not directly contribute to this win, which makes it difficult to learn what to do independently. This is why we developed a novel multi-agent trainer (dubbed MA-POCA for Multi-Agent POsthumous Credit Assignment; full arXiv paper coming soon) to train a centralized critic – a neural network that acts as a “coach” for the whole group of agents.

With this addition, you can continue to give rewards to the team as a whole, but the agents will also learn how to best contribute to their shared achievement. Agents can even receive individual awards, so that they stay motivated and help each other attain their goals. During an episode, you can add or remove agents from the group, such as when agents spawn or die in a game. If agents are removed mid-episode, they will still be able to understand whether or not their actions contributed to the team winning later on. This empowers them to put the group first in their actions; even if they end up removing themselves from the game through self-sacrifice or other gameplay decision-making. By combining MA-POCA with self-play, you can also train teams of agents to play against each other.

More than this, we developed two new sample environments: Cooperative Push Block and Dungeon Escape. Cooperative Push Block showcases a task that requires multiple agents to complete. The video below exhibits Dungeon Escape, in which one agent must slay the dragon, causing it to be removed mid-episode, so that its teammates can pick up the key and escape the dungeon.

Read through our documentation for details on how to implement cooperative agents into your project.

Observing varying numbers of entities

One of the most commonly requested features for the toolkit has been to enable game characters’ reactions to varying numbers of entities. In video games, characters often learn how to deal with several enemies or items at once. To meet this demand, Release 15 (March 2020) now makes it possible to specify an arbitrary length array of observations called the “observation buffer.” Agents can learn how to utilize an arbitrary-sized buffer through an Attention Module that encodes and processes a varying number of observations.

The Attention Module serves as a great solution in situations where a game character must learn to avoid projectiles, for example, but the number of projectiles in the scene is not fixed. In this video, each projectile is represented by four values: Two for positioning and two for speed. For each projectile in the scene, these four values are appended to a buffer of projectile observations. The agent can then learn to ignore the projectiles that are not on a collision trajectory, and instead pay extra attention to the more dangerous projectiles.

What’s more, agents can learn the importance of entities based on their relations across entities in the scene. For instance, if agents must learn how to sort tiles in ascending order, they will be able to figure out which tile is the next correct tile based on the information of all other tiles. This new environment, dubbed Sorter, is now available as part of our example environments that you can download and use to get started.

Read through our documentation for details on how to implement variable length observations into your project.

Task parameterization: One model to rule them all

Video game characters often encounter several tasks in different game modes. One way to approach this challenge is to train multiple behaviors separately and then swap between them. However, it is preferable to train a single model that can complete multiple tasks at once. After all, a single model lowers the memory footprint in the final game, and by extension, shortens overall training time since the model can reuse some parts of the neural network across multiple tasks. To this end, we added the ability for a single model to encode multiple behaviors using HyperNetworks in our latest release (Release 17).

In practice, we use a new type of observation called a “goal signal,” as well as a small neural network called a “HyperNetwork,” to generate some of the weights of another bigger neural network. This bigger network is the one that informs the agent’s behavior and enables the behavior’s neural network to have different weights, depending on the goal of the agent, while maintaining some shared pieces across goals when necessary.

The following video shows an agent solving two tasks present in the ML-Agents examples (WallJump and PushBlock) at the same time. If the bottom color is green, the agent must push the block into the green zone. But if the top-right square is green, the agent must jump over the wall onto the green square.

Read through our documentation for details on how to implement task parameterization using goal signals in your project.

Features promoted from the extensions package
Grid Sensors
ML-Agents called Grid Sensor

In November 2020, we wrote about how Eidos developed a new type of sensor in ML-Agents called Grid Sensor. This Grid Sensor implementation was added to our extensions package at the time, before we went on to iterate on the implementation and promote it to this latest release of the main ML-Agents package.


In Release 10 (November 2020) of ML-Agents, we introduced a new Match-3 environment and added utilities to our extensions package to enable the training of Match-3 games. We’ve since partnered with Code Monkey to release a tutorial video. Similar to Grid Sensors, we made our utilities for training Match-3 games a part of the core ML-Agents package in our latest release.

Performance improvements

Our goal is to keep on improving ML-Agents. After hearing your feedback on the amount of memory allocated during inference, we promptly made significant allocation reductions. The table below shows a comparison of the garbage collection metrics (kilobytes per Academy step) in two of our example scenes between versions 1.0 (released May 2020) and 2.0 (released April 2021). These metrics exclude the memory used by Barracuda (the Unity Inference Engine that ML-Agents relies on for cross-platform inference):

a comparison of the garbage collection metrics in two of our example scenes between versions 1.0 and 2.0
ML-Agents Cloud: An update

In our v1.0 blog post, we first shared some details on ML-Agents Cloud. Our ML-Agents Cloud service lets you kick off multiple training sessions that run on our cloud infrastructure in parallel, so you can complete your experimentation in a timely manner. Today, ML-Agents Cloud’s core functionality gives you the ability to:

Upload your game builds with ML-Agents implemented (C#).

Start, pause, resume and stop training experiments. You can launch multiple experiments at the same time and leverage high-end machines to spawn many concurrent Unity instances for each training experiment – all with faster completion times.

Download results from multiple training experiments.

Throughout the rest of 2021, we plan to accelerate the development of ML-Agents Cloud, based on Alpha customer feedback. Additional functionalities will focus on the ability to visualize your results, manage your experiments from a web UI, and harness hyper-parameter tuning. In fact, we are still accepting applicants to the Alpha program today. If you are interested in signing up, please register here.

What’s coming up next?

In this post, we outlined three core improvements that move ML-Agents closer toward supporting complex cooperative games. We demonstrated each of these three improvements in isolation, and also discussed the sample environments recently added to the toolkit. What we did not yet reveal is another upcoming showcase environment called Dodgeball. Dodgeball is a team versus team game that highlights the way that all three features work together. Agents must reason in complex environments to solve multiple modes, cooperate with teammates, and observe varying entities in a scene. We plan to release this environment in the coming weeks, alongside another dedicated blog post. Till then, check out this sneak peek of our agents training to play Dodgeball.

Thank you

On behalf of the entire Unity ML-Agents team, we want to thank you all for joining us on this journey with ongoing support over the years.

If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions and encourage you to apply here.

Finally, we would love to hear your feedback! For any feedback regarding the Unity ML-Agents toolkit, please fill out the following survey or email us directly at ml-agents@unity3d.com. If you encounter any issues, do not hesitate to reach out to us on the ML-Agents GitHub issues page. For any other general comments or questions, please let us know on the Unity ML-Agents forums.