diff --git a/README.md b/README.md index dee7a1331..cff9eb1ab 100644 --- a/README.md +++ b/README.md @@ -9,149 +9,39 @@
- - Python Versions - - - Python Versions - - - License - - - Code Style - - - MyPy - - - ArXiv - -Open In Colab + +![Python Version](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Finstadeepai%2FMava%2Fdevelop%2Fpyproject.toml) +[![Tests](https://github.com/instadeepai/Mava/actions/workflows/ci.yaml/badge.svg)](https://github.com/instadeepai/Mava/actions/workflows/ci.yaml) +[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0) +[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) +[![MyPy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/) +[![ArXiv](https://img.shields.io/badge/ArXiv-2410.01706-b31b1b.svg)](https://arxiv.org/abs/2410.01706) +[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/instadeepai/Mava/blob/develop/examples/Quickstart.ipynb)
+ ## Welcome to Mava! 🦁

-[**Installation**](#installation-) | [**Quickstart**](#quickstart-) +[**Installation**](#installation-) | [**Getting started**](#getting-started-)

-Mava provides simplified code for quickly iterating on ideas in multi-agent reinforcement learning (MARL) with useful implementations of MARL algorithms in JAX allowing for easy parallelisation across devices with JAX's `pmap`. Mava is a project originating in the Research Team at [InstaDeep](https://www.instadeep.com/). - -To join us in these efforts, please feel free to reach out, raise issues or read our [contribution guidelines](#contributing-) (or just star 🌟 to stay up to date with the latest developments)! +Mava allows researchers to experiment with multi-agent reinforcement learning (MARL) at lightning speed. The single-file JAX implementations are built for rapid research iteration: hack, modify, and test new ideas fast. Our [state-of-the-art algorithms][sable] scale seamlessly across devices. Created for researchers, by the Research Team at [InstaDeep](https://www.instadeep.com). -## Overview 🦜 +## Highlights 🦜 -Mava currently offers the following building blocks for MARL research: - -- 🥑 **Implementations of MARL algorithms**: Implementations of multi-agent PPO systems that follow both the Centralised Training with Decentralised Execution (CTDE) and Decentralised Training with Decentralised Execution (DTDE) MARL paradigms. -- 🐬 **Environment Wrappers**: Example wrappers for mapping Jumanji environments to an environment that is compatible with Mava. At the moment, we support [Robotic Warehouse][jumanji_rware] and [Level-Based Foraging][jumanji_lbf] with plans to support more environments soon. We have also recently added support for the SMAX environment from [JaxMARL][jaxmarl]. -- 🎓 **Educational Material**: [Quickstart notebook][quickstart] to demonstrate how Mava can be used and to highlight the added value of JAX-based MARL. +- 🥑 **Implementations of MARL algorithms**: Implementations of current state-of-the-art MARL algorithms that are distributed and make effective use of available accelerators.
+- ๐Ÿฌ **Environment Wrappers**: We provide first class support to a few JAX based MARL environment suites through the use of wrappers, however new environments can be easily added by using existing wrappers as a guide. - ๐Ÿงช **Statistically robust evaluation**: Mava natively supports logging to json files which adhere to the standard suggested by [Gorsane et al. (2022)][toward_standard_eval]. This enables easy downstream experiment plotting and aggregation using the tools found in the [MARL-eval][marl_eval] library. - -## Performance and Speed ๐Ÿš€ - -### SMAX -For comparing Mavaโ€™s stability to other JAX-based baseline algorithms, we train Mavaโ€™s recurrent IPPO and MAPPO systems on a broad range of [SMAX][smax] tasks. In all cases we do not rerun baselines but instead take results for final win rates from the [JaxMARL technical report](https://arxiv.org/pdf/2311.10090.pdf). For the full SMAX experiments results, please see the following [page](docs/smax_benchmark.md). - -

- - legend - -

- -

- - Mava ff mappo tiny 2ag - - - Mava ff mappo tiny 4ag - - - Mava ff mappo small 4ag - -
-

Mava Recurrent IPPO and MAPPO performance on the 3s5z, 6h_vs_8z and 3s5z_vs_3s6z SMAX tasks.
-

- -### Robotic Warehouse - -All of the experiments below were performed using an NVIDIA Quadro RTX 4000 GPU with 8GB Memory. - -In order to show the utility of end-to-end JAX-based MARL systems and JAX-based environments we compare the speed of Mava against [EPyMARL][epymarl] as measured in total training wallclock time on simple [Robotic Warehouse][rware] (RWARE) tasks with 2 and 4 agents. Our aim is to illustrate the speed increases that are possible with using end-to-end Jax-based systems and we do not necessarily make an effort to achieve optimal performance. For EPyMARL, we use the hyperparameters as recommended by [Papoudakis et al. (2020)](https://arxiv.org/pdf/2006.07869.pdf) and for Mava we performed a basic grid search. In both cases, systems were trained up to 20 million total environment steps using 16 vectorised environments. - -

- - legend - -

- -

- - Mava ff mappo tiny 2ag - - - Mava ff mappo tiny 4ag - - - Mava ff mappo small 4ag - -
-

Mava feedforward MAPPO performance on the tiny-2ag, tiny-4ag and small-4ag RWARE tasks.
-

- - -### 📌 An important note on the differences in converged performance - -In order to benefit from the wallclock speed-ups afforded by JAX-based systems it is required that environments also be written in JAX. It is for this reason that Mava does not use the exact same version of the RWARE environment as EPyMARL but instead uses a JAX-based implementation of RWARE found in [Jumanji][jumanji_rware], under the name RobotWarehouse. One of the notable differences in the underlying environment logic is that RobotWarehouse will not attempt to resolve agent collisions but will instead terminate an episode when agents do collide. In our experiments, this appeared to make the environment more challenging. For this reason we show the performance of Mava on Jumanji with and without termination upon collision indicated with `w/o collision` in the figure legends. For a more detailed discussion, please see the following [page](docs/jumanji_rware_comparison.md). - -### Level-Based Foraging -Mava also supports [Jumanji][jumanji_lbf]'s LBF. We evaluate Mava's recurrent MAPPO system on LBF, against [EPyMARL][epymarl] (we used original [LBF](https://github.com/semitable/lb-foraging) for EPyMARL) in 2 and 4 agent settings up to 20 million timesteps. Both systems were trained using 16 vectorized environments. For the EPyMARL systems we use a NVIDIA A100 GPU and for the Mava systems we use a GeForce RTX 3050 laptop GPU with 4GB of memory. To show how Mava can generalise to different hardware, we also train the Mava systems on a TPU v3-8. We plan to publish comprehensive performance benchmarks for all Mava's algorithms across various LBF scenarios soon. -

- - legend - -

- -

- - Mava ff mappo tiny 2ag - - - Mava ff mappo small 4ag - -
-

Mava Recurrent MAPPO performance on the 2s-8x8-2p-2f-coop, and 15x15-4p-3fz Level-Based Foraging tasks.
-

- -### 🧨 Steps per second experiments using vectorised environments - -Furthermore, we illustrate the speed of Mava by showing the steps per second as the number of parallel environments is increased. These steps per second scaling plots were computed using a standard laptop GPU, specifically an RTX-3060 GPU with 6GB memory. -

- - Mava sps - - - Mava ff mappo speed comparison - -
-

Mava steps per second scaling with increased vectorised environments and total training run time for 20M environment steps.
-

- - -## Code Philosophy 🧘 - -The current code in Mava is adapted from [PureJaxRL][purejaxrl] which provides high-quality single-file implementations with research-friendly features. In turn, PureJaxRL is inspired by the code philosophy from [CleanRL][cleanrl]. Along this vein of easy-to-use and understandable RL codebases, Mava is not designed to be a modular library and is not meant to be imported. Our repository focuses on simplicity and clarity in its implementations while utilising the advantages offered by JAX such as `pmap` and `vmap`, making it an excellent resource for researchers and practitioners to build upon. +- 🖥️ **JAX Distribution Architectures for Reinforcement Learning**: Mava supports both [Podracer][anakin_paper] architectures for scaling RL systems. The first of these is _Anakin_, which can be used when environments are written in JAX. This enables end-to-end JIT compilation of the full MARL training loop for fast experiment run times on hardware accelerators. The second is _Sebulba_, which can be used when environments are not written in JAX. Sebulba is particularly useful for RL experiments in which a hardware accelerator interacts with many CPU cores at a time. +- ⚡ **Blazingly fast experiments**: Together, these features give very short experiment run times, especially compared to non-JAX-based MARL libraries. ## Installation 🎬 -At the moment Mava is not meant to be installed as a library, but rather to be used as a research tool. - -You can use Mava by cloning the repo and pip installing as follows: +At the moment Mava is not meant to be installed as a library, but rather to be used as a research tool. We recommend cloning the Mava repo and pip installing as follows: ```bash git clone https://github.com/instadeepai/mava.git @@ -162,18 +52,18 @@ pip install -e . We have tested `Mava` on Python 3.11 and 3.12, but earlier versions may also work.
Specifically, we use Python 3.10 for the Quickstart notebook on Google Colab since Colab uses Python 3.10 by default. Note that because the installation of JAX differs depending on your hardware accelerator, we advise users to explicitly install the correct JAX version (see the [official installation guide](https://github.com/google/jax#installation)). For more in-depth installation guides including Docker builds and virtual environments, please see our [detailed installation guide](docs/DETAILED_INSTALL.md). -## Quickstart ⚡ +## Getting started ⚡ -To get started with training your first Mava system, simply run one of the system files. e.g., +To get started with training your first Mava system, simply run one of the system files: ```bash -python mava/systems/ff_ippo.py +python mava/systems/ppo/anakin/ff_ippo.py ``` -Mava makes use of Hydra for config management. In order to see our default system configs please see the `mava/configs/` directory. A benefit of Hydra is that configs can either be set in config yaml files or overwritten from the terminal on the fly. +Mava makes use of [Hydra](https://github.com/facebookresearch/hydra) for config management. Our default system configs can be found in the `mava/configs/` directory. A benefit of Hydra is that configs can either be set in YAML config files or overridden from the terminal on the fly.
For an example of running a system on the Level-based Foraging environment, the above code can simply be adapted as follows: ```bash -python mava/systems/ff_ippo.py env=lbf +python mava/systems/ppo/anakin/ff_ippo.py env=lbf ``` Different scenarios can also be run by making the following config updates from the terminal: @@ -182,11 +72,72 @@ Different scenarios can also be run by making the following config updates from -python mava/systems/ff_ippo.py env=rware env/scenario=tiny-4ag +python mava/systems/ppo/anakin/ff_ippo.py env=rware env/scenario=tiny-4ag ``` -Additionally, we also have a [Quickstart notebook][quickstart] that can be used to quickly create and train your first Multi-agent system. +Additionally, we also have a [Quickstart notebook][quickstart] that can be used to quickly create and train your first multi-agent system. +
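The Anakin architecture described in the highlights compiles environment stepping and action selection into one JAX program. Below is a minimal sketch of that idea, not Mava's implementation: the toy environment, the linear policy and every name in it (`env_reset`, `env_step`, `policy`, `rollout_one_env`, `batched_rollout`) are hypothetical. `jax.vmap` vectorises over parallel environments, `jax.lax.scan` runs the time loop inside the compiled function, and `jax.jit` compiles the whole rollout; sharding across devices with `pmap` is omitted.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy sizes for the sketch.
NUM_AGENTS = 2
NUM_ENVS = 4
ROLLOUT_LEN = 8


def env_reset(key):
    # Toy JAX environment (hypothetical): one scalar position per agent.
    return jax.random.uniform(key, (NUM_AGENTS,), minval=-1.0, maxval=1.0)


def env_step(state, actions):
    # Pure function with no Python side effects, so it can be traced and
    # compiled together with the policy. Reward is -|position| per agent.
    next_state = state + 0.1 * actions
    reward = -jnp.abs(next_state)
    return next_state, reward


def policy(params, obs):
    # Hypothetical linear policy with parameters shared by all agents.
    return jnp.tanh(params["w"] * obs + params["b"])


def rollout_one_env(params, state):
    # lax.scan keeps the time loop inside the compiled program.
    def step_fn(carry, _):
        actions = policy(params, carry)
        next_state, reward = env_step(carry, actions)
        return next_state, reward

    return jax.lax.scan(step_fn, state, None, length=ROLLOUT_LEN)


@jax.jit
def batched_rollout(params, key):
    # vmap vectorises reset and rollout across NUM_ENVS environments;
    # params are broadcast (in_axes=None), states are batched (axis 0).
    keys = jax.random.split(key, NUM_ENVS)
    states = jax.vmap(env_reset)(keys)
    return jax.vmap(rollout_one_env, in_axes=(None, 0))(params, states)


params = {"w": jnp.array(-1.0), "b": jnp.array(0.0)}
final_states, rewards = batched_rollout(params, jax.random.PRNGKey(0))
print(rewards.shape)  # (4, 8, 2): (NUM_ENVS, ROLLOUT_LEN, NUM_AGENTS)
```

Because the environment is itself a JAX function, the rollout never returns to Python between steps. This is why Anakin needs JAX-based environments, whereas Sebulba steps non-JAX environments on CPU.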

Algorithms

+ +Mava has implementations of multiple on- and off-policy multi-agent algorithms that follow the independent learners (IL), centralised training with decentralised execution (CTDE) and heterogeneous agent learning paradigms. Beyond these learning paradigms, we also include implementations that follow the Anakin and Sebulba architectures to enable scalable training by default. Which architecture is relevant for a given problem depends on whether the environment being used is written in JAX or not. For more information on these architectures, please see the [Podracer paper][anakin_paper]. -## Advanced Usage 👽 +| Algorithm | Variants | Continuous | Discrete | Anakin | Sebulba | Paper | Docs | +|------------|----------------|------------|----------|--------|---------|-------|------| +| PPO | [`ff_ippo.py`](mava/systems/ppo/anakin/ff_ippo.py) | ✅ | ✅ | ✅ | ✅ | [Link](https://arxiv.org/abs/2011.09533) | [Link](mava/systems/ppo/README.md) | +| | [`ff_mappo.py`](mava/systems/ppo/anakin/ff_mappo.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2103.01955) | [Link](mava/systems/ppo/README.md) | +| | [`rec_ippo.py`](mava/systems/ppo/anakin/rec_ippo.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2011.09533) | [Link](mava/systems/ppo/README.md) | +| | [`rec_mappo.py`](mava/systems/ppo/anakin/rec_mappo.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2103.01955) | [Link](mava/systems/ppo/README.md) | +| Q Learning | [`rec_iql.py`](mava/systems/q_learning/anakin/rec_iql.py) | | ✅ | ✅ | | [Link](https://arxiv.org/abs/1511.08779) | [Link](mava/systems/q_learning/README.md) | +| | [`rec_qmix.py`](mava/systems/q_learning/anakin/rec_qmix.py) | | ✅ | ✅ | | [Link](https://arxiv.org/abs/1803.11485) | [Link](mava/systems/q_learning/README.md) | +| SAC | [`ff_isac.py`](mava/systems/sac/anakin/ff_isac.py) | ✅ | | ✅ | | [Link](https://arxiv.org/abs/1801.01290) | [Link](mava/systems/sac/README.md) | +| | [`ff_masac.py`](mava/systems/sac/anakin/ff_masac.py) |
✅ | | ✅ | | | [Link](mava/systems/sac/README.md) | +| | [`ff_hasac.py`](mava/systems/sac/anakin/ff_hasac.py) | ✅ | | ✅ | | [Link](https://arxiv.org/abs/2306.10715) | [Link](mava/systems/sac/README.md) | +| MAT | [`mat.py`](mava/systems/mat/anakin/mat.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2205.14953) | [Link](mava/systems/mat/README.md) | +| Sable | [`ff_sable.py`](mava/systems/sable/anakin/ff_sable.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2410.01706) | [Link](mava/systems/sable/README.md) | +| | [`rec_sable.py`](mava/systems/sable/anakin/rec_sable.py) | ✅ | ✅ | ✅ | | [Link](https://arxiv.org/abs/2410.01706) | [Link](mava/systems/sable/README.md) | +
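The IL and CTDE paradigms listed above differ mainly in what the critic sees during training. Here is a minimal JAX sketch of that difference, assuming a hypothetical linear critic; `init_critic`, `critic_value`, `independent_values` and `centralised_values` are illustrative names, not Mava functions. In the IL case (as in IPPO) the critic scores each agent's own observation, with one parameter set shared across agents as an assumption of the sketch. In the CTDE case (as in MAPPO) a single critic scores the concatenated joint observation. In both cases the actors would still act on local observations only.

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes for the sketch.
NUM_AGENTS = 3
OBS_DIM = 4


def init_critic(key, input_dim):
    # Hypothetical linear critic: value = x @ w + b.
    w = jax.random.normal(key, (input_dim,)) * 0.1
    return {"w": w, "b": jnp.zeros(())}


def critic_value(params, x):
    # Scalar value estimate for one input vector.
    return x @ params["w"] + params["b"]


def independent_values(params, obs):
    # IL: each agent's value depends only on its own observation.
    # Critic parameters are shared across agents here (an assumption).
    # obs: (NUM_AGENTS, OBS_DIM) -> values: (NUM_AGENTS,)
    return jax.vmap(critic_value, in_axes=(None, 0))(params, obs)


def centralised_values(params, obs):
    # CTDE: the critic conditions on the joint observation during training.
    joint_obs = obs.reshape(-1)  # (NUM_AGENTS * OBS_DIM,)
    value = critic_value(params, joint_obs)
    # Every agent receives the same centralised value estimate.
    return jnp.full((obs.shape[0],), value)


key_il, key_ctde, key_obs = jax.random.split(jax.random.PRNGKey(0), 3)
obs = jax.random.normal(key_obs, (NUM_AGENTS, OBS_DIM))
il_values = independent_values(init_critic(key_il, OBS_DIM), obs)
ctde_values = centralised_values(init_critic(key_ctde, NUM_AGENTS * OBS_DIM), obs)
print(il_values.shape, ctde_values.shape)  # (3,) (3,)
```

The only structural change between the two critics is the input size (`OBS_DIM` versus `NUM_AGENTS * OBS_DIM`), which is why centralised critics can grow expensive as the number of agents increases.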

Environments

-Mava can be used in a wide array of advanced systems. As an example, we demonstrate recording experience data from one of our PPO systems into a [Flashbax](https://github.com/instadeepai/flashbax) `Vault`. This vault can then easily be integrated into offline MARL systems, such as those found in [OG-MARL](https://github.com/instadeepai/og-marl). See the [Advanced README](./examples/advanced_usage/README.md) for more information. +These are the environments that Mava supports _out of the box_. To add a new environment, please use the [existing wrapper implementations](mava/wrappers/) as an example. We also indicate whether each environment is implemented in JAX or not. JAX-based environments can be used with algorithms that follow the Anakin distribution architecture, while non-JAX environments can be used with algorithms that follow the Sebulba architecture. + + +| Environment | Action space | JAX | Non-JAX | Paper | JAX Source | Non-JAX Source | +|---------------------------------|---------------------|-----|-------|-------|------------|----------------| +| Multi-Robot Warehouse | Discrete | ✅ | ✅ | [Link](http://arxiv.org/abs/2006.07869) | [Link](https://github.com/instadeepai/jumanji/tree/main/jumanji/environments/routing/robot_warehouse) | [Link](https://github.com/semitable/robotic-warehouse) | +| Level-based Foraging | Discrete | ✅ | ✅ | [Link](https://arxiv.org/abs/2006.07169) | [Link](https://github.com/instadeepai/jumanji/tree/main/jumanji/environments/routing/lbf) | [Link](https://github.com/semitable/lb-foraging) | +| StarCraft Multi-Agent Challenge | Discrete | ✅ | ✅ | [Link](https://arxiv.org/abs/1902.04043) | [Link](https://github.com/FLAIROx/JaxMARL/tree/main/jaxmarl/environments/smax) | [Link](https://github.com/uoe-agents/smaclite) | +| Multi-Agent Brax | Continuous | ✅ | | [Link](https://arxiv.org/abs/2003.06709) | [Link](https://github.com/FLAIROx/JaxMARL/tree/main/jaxmarl/environments/mabrax) | | +| Matrax | Discrete | ✅ | |
[Link](https://www.cs.toronto.edu/~cebly/Papers/_download_/multirl.pdf) | [Link](https://github.com/instadeepai/matrax) | | +| Multi Particle Environments | Discrete/Continuous | ✅ | | [Link](https://arxiv.org/abs/1706.02275) | [Link](https://github.com/FLAIROx/JaxMARL/tree/main/jaxmarl/environments/mpe) | | + +## Performance and Speed 🚀 +We have performed a rigorous benchmark across 45 scenarios from 6 environment suites to validate the performance of Mava's algorithm implementations. For more detailed results, please see our [Sable paper][sable]; for all hyperparameters, please see the following [website](https://sites.google.com/view/sable-marl). +

+ + Mava performance across 15 Robot Warehouse environments + + Mava performance across 7 Level Based Foraging environments + + Mava performance across 11 Smax environments + + Mava performance across 4 Connector environments + + Mava performance across 5 MaBrax environments + + Mava performance across 3 Multi-Particle environments +
+ + Legend + +

Mava's algorithm performance: each algorithm was tuned for 40 trials with the Tree-structured Parzen Estimator (TPE) optimizer and benchmarked over 10 seeds for each scenario. Environments, from top left: Multi-Robot Warehouse (aggregated over 15 scenarios), Level-based Foraging (aggregated over 7 scenarios), StarCraft Multi-Agent Challenge in JAX (aggregated over 11 scenarios), Connector (aggregated over 4 scenarios), Multi-Agent Brax (aggregated over 5 scenarios) and Multi Particle Environments (aggregated over 3 scenarios).
+

+ +## Code Philosophy 🧘 + +The original code in Mava was adapted from [PureJaxRL][purejaxrl], which provides high-quality single-file implementations with research-friendly features. In turn, PureJaxRL is inspired by the code philosophy from [CleanRL][cleanrl]. Along this vein of easy-to-use and understandable RL codebases, Mava is not designed to be a modular library and is not meant to be imported. Our repository focuses on simplicity and clarity in its implementations while utilising the advantages offered by JAX such as `pmap` and `vmap`, making it an excellent resource for researchers and practitioners to build upon. A notable difference between Mava and CleanRL is that Mava provides small utilities for heavily reused components, such as networks and logging; we've found that this, together with Hydra configs, greatly improves the readability of the algorithms. ## Contributing 🤝 @@ -196,17 +147,16 @@ Please read our [contributing docs](docs/CONTRIBUTING.md) for details on how to We plan to iteratively expand Mava in the following increments: -- 🌴 Support for more environments. -- 🔁 More robust recurrent systems. -- 🌳 Support for non JAX-based environments. -- 🦾 Support for off-policy algorithms. -- 🎛 Continuous action space environments and algorithms. +- [x] Support for more environments. +- [x] More robust recurrent systems. +- [x] Support for non-JAX-based environments. +- [ ] Add Sebulba versions of more algorithms. +- [x] Support for off-policy algorithms. +- [x] Continuous action space environments and algorithms. +- [ ] Allow systems to easily scale across multiple TPUs/GPUs. Please do follow along as we develop this next phase! -## TensorFlow 2 Mava: -Originally Mava was written in Tensorflow 2. Support for the TF2-based framework and systems has now been fully **deprecated**. If you would still like to use it, please install `v0.1.3` of Mava (i.e. `pip install id-mava==0.1.3`).
- ## See Also 🔎 **InstaDeep's MARL ecosystem in JAX.** In particular, we suggest users check out the following sister repositories: @@ -260,3 +210,4 @@ The development of Mava was supported with Cloud TPUs from Google's [TPU Researc [toward_standard_eval]: https://arxiv.org/pdf/2209.10485.pdf [marl_eval]: https://github.com/instadeepai/marl-eval [smax]: https://github.com/FLAIROx/JaxMARL/tree/main/jaxmarl/environments/smax +[sable]: https://arxiv.org/pdf/2410.01706 diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md index 791237ddc..b7a7a1461 100644 --- a/docs/CONTRIBUTING.md +++ b/docs/CONTRIBUTING.md @@ -39,7 +39,7 @@ pre-commit run --all-files ## Naming Conventions ### Branch Names -We name our feature and bugfix branches as follows - `feature/[BRANCH-NAME]`, `bugfix/[BRANCH-NAME]` or `maintenance/[BRANCH-NAME]`. Please ensure `[BRANCH-NAME]` is hyphen delimited. +We name our feature and bugfix branches as follows: `feat/[BRANCH-NAME]` or `fix/[BRANCH-NAME]`. Please ensure `[BRANCH-NAME]` is hyphen-delimited. ### Commit Messages We follow the conventional commits [standard](https://www.conventionalcommits.org/en/v1.0.0/). diff --git a/docs/DETAILED_INSTALL.md b/docs/DETAILED_INSTALL.md index 28547c8aa..04a499b72 100644 --- a/docs/DETAILED_INSTALL.md +++ b/docs/DETAILED_INSTALL.md @@ -1,12 +1,11 @@ # Detailed installation guide -### Conda virtual environment +### Virtual environment with uv -We recommend using `conda` for package management. These instructions should allow you to install and run mava. +We recommend using [uv](https://docs.astral.sh/uv/) for package management. These instructions should allow you to install and run Mava. -1. Create and activate a virtual environment +1. Install `uv` ```bash -conda create -n mava python=3.12 -conda activate mava +curl -LsSf https://astral.sh/uv/install.sh | sh ``` 2. Clone mava @@ -15,19 +14,22 @@ git clone https://github.com/instadeepai/Mava.git cd mava ``` -3. Install the dependencies +3.
Create and activate a virtual environment and install requirements ```bash -pip install -e . +uv venv -p=3.12 +source .venv/bin/activate +uv pip install -e . ``` -4. Install jax on your accelerator. The example below is for an NVIDIA GPU, please the [official install guide](https://github.com/google/jax#installation) for other accelerators +4. Install JAX for your accelerator. The example below is for an NVIDIA GPU; please see the [official install guide](https://github.com/google/jax#installation) for other accelerators. +Note that the JAX version we use will change over time; please check [requirements.txt](../requirements/requirements.txt) for our latest tested JAX version. ```bash -pip install "jax[cuda12]==0.4.30" +uv pip install "jax[cuda12]==0.4.30" ``` 5. Run a system! ```bash -python mava/systems/ppo/ff_ippo.py env=rware +python mava/systems/ppo/anakin/ff_ippo.py env=rware ``` ### Docker @@ -50,4 +52,4 @@ If you are having trouble with dependencies we recommend using our docker image -For example, `make run example=mava/systems/ppo/ff_ippo.py`. +For example, `make run example=mava/systems/ppo/anakin/ff_ippo.py`. - Alternatively, run bash inside a docker container with mava installed by running `make bash`, and from there systems can be run as follows: `python dir/to/system.py`. + Alternatively, run bash inside a docker container with Mava installed by running `make bash`, and from there systems can be run as follows: `python dir/to/system.py`.
diff --git a/docs/images/algo_images/sable-arch.png b/docs/images/algo_images/sable-arch.png new file mode 100644 index 000000000..1fd92c6c8 Binary files /dev/null and b/docs/images/algo_images/sable-arch.png differ diff --git a/docs/images/benchmark_results/connector.png b/docs/images/benchmark_results/connector.png new file mode 100644 index 000000000..5931b9c59 Binary files /dev/null and b/docs/images/benchmark_results/connector.png differ diff --git a/docs/images/benchmark_results/lbf.png b/docs/images/benchmark_results/lbf.png new file mode 100644 index 000000000..34be25000 Binary files /dev/null and b/docs/images/benchmark_results/lbf.png differ diff --git a/docs/images/benchmark_results/legend.jpg b/docs/images/benchmark_results/legend.jpg new file mode 100644 index 000000000..3b9070a56 Binary files /dev/null and b/docs/images/benchmark_results/legend.jpg differ diff --git a/docs/images/benchmark_results/mabrax.png b/docs/images/benchmark_results/mabrax.png new file mode 100644 index 000000000..13d8edff4 Binary files /dev/null and b/docs/images/benchmark_results/mabrax.png differ diff --git a/docs/images/benchmark_results/mpe.png b/docs/images/benchmark_results/mpe.png new file mode 100644 index 000000000..76157b247 Binary files /dev/null and b/docs/images/benchmark_results/mpe.png differ diff --git a/docs/images/benchmark_results/rware.png b/docs/images/benchmark_results/rware.png new file mode 100644 index 000000000..91d9edf46 Binary files /dev/null and b/docs/images/benchmark_results/rware.png differ diff --git a/docs/images/benchmark_results/smax.png b/docs/images/benchmark_results/smax.png new file mode 100644 index 000000000..a4aeddd47 Binary files /dev/null and b/docs/images/benchmark_results/smax.png differ diff --git a/docs/images/lbf_results/15x15-4p-3f_rec_mappo.png b/docs/images/lbf_results/15x15-4p-3f_rec_mappo.png deleted file mode 100644 index fb01f398e..000000000 Binary files a/docs/images/lbf_results/15x15-4p-3f_rec_mappo.png and /dev/null 
differ diff --git a/docs/images/lbf_results/2s-8x8-2p-2f-coop_rec_mappo.png b/docs/images/lbf_results/2s-8x8-2p-2f-coop_rec_mappo.png deleted file mode 100644 index 081527a38..000000000 Binary files a/docs/images/lbf_results/2s-8x8-2p-2f-coop_rec_mappo.png and /dev/null differ diff --git a/docs/images/lbf_results/legend_rec_mappo.png b/docs/images/lbf_results/legend_rec_mappo.png deleted file mode 100644 index 489499f78..000000000 Binary files a/docs/images/lbf_results/legend_rec_mappo.png and /dev/null differ diff --git a/docs/images/rware_results/ff_ippo/small-4ag.png b/docs/images/rware_results/ff_ippo/small-4ag.png deleted file mode 100644 index a43b2d6d9..000000000 Binary files a/docs/images/rware_results/ff_ippo/small-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/ff_ippo/tiny-2ag.png b/docs/images/rware_results/ff_ippo/tiny-2ag.png deleted file mode 100644 index df43e2077..000000000 Binary files a/docs/images/rware_results/ff_ippo/tiny-2ag.png and /dev/null differ diff --git a/docs/images/rware_results/ff_ippo/tiny-4ag.png b/docs/images/rware_results/ff_ippo/tiny-4ag.png deleted file mode 100644 index 3962e8e73..000000000 Binary files a/docs/images/rware_results/ff_ippo/tiny-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/main_readme/legend.png b/docs/images/rware_results/ff_mappo/main_readme/legend.png deleted file mode 100644 index c7239b719..000000000 Binary files a/docs/images/rware_results/ff_mappo/main_readme/legend.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/main_readme/small-4ag-1.png b/docs/images/rware_results/ff_mappo/main_readme/small-4ag-1.png deleted file mode 100644 index a899f0070..000000000 Binary files a/docs/images/rware_results/ff_mappo/main_readme/small-4ag-1.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/main_readme/tiny-2ag-1.png b/docs/images/rware_results/ff_mappo/main_readme/tiny-2ag-1.png deleted file mode 100644 index 
6cd8086d2..000000000 Binary files a/docs/images/rware_results/ff_mappo/main_readme/tiny-2ag-1.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/main_readme/tiny-4ag-1.png b/docs/images/rware_results/ff_mappo/main_readme/tiny-4ag-1.png deleted file mode 100644 index 0a89c9dcd..000000000 Binary files a/docs/images/rware_results/ff_mappo/main_readme/tiny-4ag-1.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/small-4ag.png b/docs/images/rware_results/ff_mappo/small-4ag.png deleted file mode 100644 index 5ecfbbdfa..000000000 Binary files a/docs/images/rware_results/ff_mappo/small-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/tiny-2ag.png b/docs/images/rware_results/ff_mappo/tiny-2ag.png deleted file mode 100644 index e16f4bbfa..000000000 Binary files a/docs/images/rware_results/ff_mappo/tiny-2ag.png and /dev/null differ diff --git a/docs/images/rware_results/ff_mappo/tiny-4ag.png b/docs/images/rware_results/ff_mappo/tiny-4ag.png deleted file mode 100644 index 59f259c5c..000000000 Binary files a/docs/images/rware_results/ff_mappo/tiny-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_ippo/small-4ag.png b/docs/images/rware_results/rec_ippo/small-4ag.png deleted file mode 100644 index edab2f32c..000000000 Binary files a/docs/images/rware_results/rec_ippo/small-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_ippo/tiny-2ag.png b/docs/images/rware_results/rec_ippo/tiny-2ag.png deleted file mode 100644 index 82f2e25e2..000000000 Binary files a/docs/images/rware_results/rec_ippo/tiny-2ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_ippo/tiny-4ag.png b/docs/images/rware_results/rec_ippo/tiny-4ag.png deleted file mode 100644 index d224507dd..000000000 Binary files a/docs/images/rware_results/rec_ippo/tiny-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_mappo/small-4ag.png 
b/docs/images/rware_results/rec_mappo/small-4ag.png deleted file mode 100644 index 534847212..000000000 Binary files a/docs/images/rware_results/rec_mappo/small-4ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_mappo/tiny-2ag.png b/docs/images/rware_results/rec_mappo/tiny-2ag.png deleted file mode 100644 index 2927ca5cb..000000000 Binary files a/docs/images/rware_results/rec_mappo/tiny-2ag.png and /dev/null differ diff --git a/docs/images/rware_results/rec_mappo/tiny-4ag.png b/docs/images/rware_results/rec_mappo/tiny-4ag.png deleted file mode 100644 index ee5f390a4..000000000 Binary files a/docs/images/rware_results/rec_mappo/tiny-4ag.png and /dev/null differ diff --git a/docs/images/smax_results/10m_vs_11m.png b/docs/images/smax_results/10m_vs_11m.png deleted file mode 100644 index c50c8cccc..000000000 Binary files a/docs/images/smax_results/10m_vs_11m.png and /dev/null differ diff --git a/docs/images/smax_results/27m_vs_30m.png b/docs/images/smax_results/27m_vs_30m.png deleted file mode 100644 index 2ed84c583..000000000 Binary files a/docs/images/smax_results/27m_vs_30m.png and /dev/null differ diff --git a/docs/images/smax_results/2s3z.png b/docs/images/smax_results/2s3z.png deleted file mode 100644 index ca34009ee..000000000 Binary files a/docs/images/smax_results/2s3z.png and /dev/null differ diff --git a/docs/images/smax_results/3s5z.png b/docs/images/smax_results/3s5z.png deleted file mode 100644 index bc4f6fb6d..000000000 Binary files a/docs/images/smax_results/3s5z.png and /dev/null differ diff --git a/docs/images/smax_results/3s5z_vs_3s6z.png b/docs/images/smax_results/3s5z_vs_3s6z.png deleted file mode 100644 index db06e43fe..000000000 Binary files a/docs/images/smax_results/3s5z_vs_3s6z.png and /dev/null differ diff --git a/docs/images/smax_results/3s_vs_5z.png b/docs/images/smax_results/3s_vs_5z.png deleted file mode 100644 index db63fc843..000000000 Binary files a/docs/images/smax_results/3s_vs_5z.png and /dev/null differ diff 
--git a/docs/images/smax_results/5m_vs_6m.png b/docs/images/smax_results/5m_vs_6m.png deleted file mode 100644 index a52b5fb7d..000000000 Binary files a/docs/images/smax_results/5m_vs_6m.png and /dev/null differ diff --git a/docs/images/smax_results/6h_vs_8z.png b/docs/images/smax_results/6h_vs_8z.png deleted file mode 100644 index e76ae9bf2..000000000 Binary files a/docs/images/smax_results/6h_vs_8z.png and /dev/null differ diff --git a/docs/images/smax_results/legend.png b/docs/images/smax_results/legend.png deleted file mode 100644 index ed607b332..000000000 Binary files a/docs/images/smax_results/legend.png and /dev/null differ diff --git a/docs/images/speed_results/ff_mappo_speed_comparison.png b/docs/images/speed_results/ff_mappo_speed_comparison.png deleted file mode 100644 index 44f7ee821..000000000 Binary files a/docs/images/speed_results/ff_mappo_speed_comparison.png and /dev/null differ diff --git a/docs/images/speed_results/mava_sps_results.png b/docs/images/speed_results/mava_sps_results.png deleted file mode 100644 index 8393ea2bb..000000000 Binary files a/docs/images/speed_results/mava_sps_results.png and /dev/null differ diff --git a/docs/images/speed_results/speed.png b/docs/images/speed_results/speed.png new file mode 100644 index 000000000..0099d3319 Binary files /dev/null and b/docs/images/speed_results/speed.png differ diff --git a/docs/jumanji_rware_comparison.md b/docs/jumanji_rware_comparison.md deleted file mode 100644 index 3d041ef12..000000000 --- a/docs/jumanji_rware_comparison.md +++ /dev/null @@ -1,74 +0,0 @@ -# Differences in performance using Jumanji's version of RWARE - -There is a core difference in the way collisions are handled in the stateless JAX-based implementation of RWARE (called RobotWarehouse) found in [Jumanji][jumanji_rware] and the [original RWARE][original_rware] environment. - -As mentioned in the original repo, collisions are handled as follows: - > The dynamics of the environment are also of particular interest. 
Like a real, 3-dimensional warehouse, the robots can move beneath the shelves. Of course, when the robots are loaded, they must use the corridors, avoiding any standing shelves. -> ->Any collisions are resolved in a way that allows for maximum mobility. When two or more agents attempt to move to the same location, we prioritise the one that also blocks others. Otherwise, the selection is done arbitrarily. The visuals below demonstrate the resolution of various collisions. - -In contrast to the collision resolution strategy above, the current version of the Jumanji implementation will not handle collisions dynamically but instead terminates an episode upon agent collision. In our experience, this appeared to make the task at hand more challenging and made it easier for agents to get trapped in local optima where episodes are never rolled out for the maximum length. - -To investigate this, we ran our algorithms on a version of Jumanji's RWARE where episodes do not terminate upon agent collision, but rather multiple agents are allowed to occupy the same grid position. This setup is not identical to that of the original environment but represents a closer version to its dynamics, allowing agents to easily reach the end of an episode. - -Please see below for Mava's recurrent and feedforward implementations of IPPO and MAPPO on the regular version of Jumanji as well as the adapted version of Jumanji without termination upon agent collision. - -

-

Mava feedforward MAPPO performance on the tiny-2ag, tiny-4ag and small-4ag RWARE tasks.
-

- -

-

Mava feedforward IPPO performance on the tiny-2ag, tiny-4ag and small-4ag RWARE tasks.
-

- -

-

Mava recurrent IPPO performance on the tiny-2ag, tiny-4ag and small-4ag RWARE tasks.
-

- -

-

Mava recurrent MAPPO performance on the tiny-2ag, tiny-4ag and small-4ag RWARE tasks.
-

- - -[jumanji_rware]: https://instadeepai.github.io/jumanji/environments/robot_warehouse/ -[original_rware]: https://github.com/semitable/robotic-warehouse diff --git a/docs/smax_benchmark.md b/docs/smax_benchmark.md deleted file mode 100644 index dc840b51f..000000000 --- a/docs/smax_benchmark.md +++ /dev/null @@ -1,43 +0,0 @@ -# StarCraft Multi-Agent Challenge in JAX - -We trained Mava’s recurrent systems on eight SMAX scenarios. The outcomes were then compared to the final win rates reported by [Rutherford et al., 2023](https://arxiv.org/pdf/2311.10090.pdf). To ensure fair comparisons we also train Mava's system up to 10 million timesteps with 64 vectorised environments. - -
-
- -

- legend -

- SMAX result plots (one per scenario): 2s3z, 3s_vs_5z, 3s5z_vs_3s6z, 3s5z, 5m_vs_6m, 6h_vs_8z, 10m_vs_11m, 27m_vs_30m
diff --git a/mava/systems/mat/README.md b/mava/systems/mat/README.md new file mode 100644 index 000000000..8f2bc67bf --- /dev/null +++ b/mava/systems/mat/README.md @@ -0,0 +1,6 @@ +# Multi-agent Transformer + +We provide an implementation of the Multi-agent Transformer (MAT) algorithm in JAX. MAT casts cooperative multi-agent reinforcement learning as a sequence modelling problem in which agent observations and actions are treated as sequences. At each timestep, the observations of all agents are encoded, and these encoded observations are then used for auto-regressive action selection. + +## Relevant paper: +* [Multi-Agent Reinforcement Learning is a Sequence Modeling Problem](https://arxiv.org/pdf/2205.14953) diff --git a/mava/systems/ppo/README.md b/mava/systems/ppo/README.md new file mode 100644 index 000000000..75754f19a --- /dev/null +++ b/mava/systems/ppo/README.md @@ -0,0 +1,17 @@ +# Proximal Policy Optimization + +We provide the following four multi-agent extensions to [PPO](https://arxiv.org/pdf/1707.06347) following the Anakin architecture. + +* [ff-IPPO](../../systems/ppo/anakin/ff_ippo.py) +* [ff-MAPPO](../../systems/ppo/anakin/ff_mappo.py) +* [rec-IPPO](../../systems/ppo/anakin/rec_ippo.py) +* [rec-MAPPO](../../systems/ppo/anakin/rec_mappo.py) + +In all cases, IPPO denotes an implementation that follows the independent learners MARL paradigm, while MAPPO denotes an implementation that follows the centralised training with decentralised execution paradigm by using a centralised critic during training. The `ff` and `rec` prefixes in the system names indicate whether the policy networks are plain MLPs or include a [GRU](https://arxiv.org/pdf/1406.1078) memory module that helps agents learn despite partial observability in the environment.
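The IPPO/MAPPO distinction above comes down to what each agent's critic is conditioned on; in both cases the actors act on local observations only. The minimal sketch below, in plain Python rather than Mava's JAX code, illustrates this under the assumption that the centralised critic receives the concatenation of all agents' observations. The function names are hypothetical and are not part of Mava's API.

```python
from typing import List

# One agent's local observation, represented here as a flat list of floats.
Observation = List[float]


def ippo_critic_inputs(observations: List[Observation]) -> List[Observation]:
    """Independent learners (IPPO): each agent's critic sees only its own observation."""
    return [list(obs) for obs in observations]


def mappo_critic_inputs(observations: List[Observation]) -> List[Observation]:
    """CTDE (MAPPO): every agent's critic sees the same centralised input.

    Assumption for this sketch: the centralised input is the concatenation
    of all agents' observations. An environment-provided global state is a
    common alternative.
    """
    global_state = [x for obs in observations for x in obs]
    return [list(global_state) for _ in observations]


# Three agents, each with a 2-dimensional local observation.
obs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ippo_inputs = ippo_critic_inputs(obs)    # [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mappo_inputs = mappo_critic_inputs(obs)  # 3 copies of [1.0, 0.0, 0.0, 1.0, 0.5, 0.5]
```

Because the centralised critic is only needed to compute value estimates during training, execution stays decentralised: at test time each agent still only needs its own observation to pick an action.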
+ +In addition to the Anakin-based implementations, we also include a Sebulba-based implementation of [ff-IPPO](../../systems/ppo/sebulba/ff_ippo.py), which can be used with environments that are not written in JAX and that adhere to the Gymnasium API. + +## Relevant papers: +* [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347) +* [The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games](https://arxiv.org/pdf/2103.01955) +* [Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?](https://arxiv.org/pdf/2011.09533) diff --git a/mava/systems/q_learning/README.md b/mava/systems/q_learning/README.md new file mode 100644 index 000000000..eef858cd9 --- /dev/null +++ b/mava/systems/q_learning/README.md @@ -0,0 +1,14 @@ +# Q Learning + +We provide two Q-learning-based systems, one following the independent learners paradigm and one following the centralised training with decentralised execution paradigm: + +* [rec-IQL](../../systems/q_learning/anakin/rec_iql.py) +* [rec-QMIX](../../systems/q_learning/anakin/rec_qmix.py) + +`rec-IQL` is a multi-agent version of DQN that uses double DQN and has a GRU memory module, while `rec-QMIX` is a JAX implementation of QMIX that uses monotonic value function decomposition. + +## Relevant papers: +* [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/pdf/1312.5602) +* [Multiagent Cooperation and Competition with Deep Reinforcement Learning](https://arxiv.org/pdf/1511.08779) +* [QMIX: Monotonic Value Function Factorisation for +Deep Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/1803.11485) diff --git a/mava/systems/sable/README.md b/mava/systems/sable/README.md new file mode 100644 index 000000000..92e19b693 --- /dev/null +++ b/mava/systems/sable/README.md @@ -0,0 +1,24 @@ +# Sable + +Sable is an algorithm that was developed by the research team at InstaDeep.
It also casts MARL as a sequence modelling problem and leverages the [advantage decomposition theorem](https://arxiv.org/pdf/2108.08612) through auto-regressive action selection to obtain convergence guarantees, and it can scale to thousands of agents by leveraging the memory efficiency of Retentive Networks. + +We provide two Anakin-based implementations of Sable: +* [ff-sable](../../systems/sable/anakin/ff_sable.py) +* [rec-sable](../../systems/sable/anakin/rec_sable.py) + +Here the `ff` prefix indicates that the algorithm retains no memory over time and treats only the agents as the sequence dimension, while `rec` indicates that the algorithm maintains memory over both agents and time, providing long-context memory in partially observable environments. + +For an overview of how the algorithm works, please see the diagram below. For more details, please see our associated [paper](https://arxiv.org/pdf/2410.01706). + +

+ + Sable Arch + +

+ +*Sable architecture and execution.* The encoder receives all agent observations $o_t^1,\dots,o_t^N$ from the current timestep $t$ along with a hidden state $h\_{t-1}^{\text{enc}}$ representing past timesteps, and produces encoded observations $\hat{o}\_t^1,\dots,\hat{o}\_t^N$, observation-values $v \left( \hat{o}\_t^1 \right),\dots,v \left( \hat{o}\_t^N \right)$, and a new hidden state $h_t^{\text{enc}}$. +The decoder performs recurrent retention over the most recently selected action $a_t^{m-1}$, followed by cross-attention with the encoded observations, producing the next action $a_t^m$. The initial hidden states for recurrence over agents in the decoder at the current timestep are $( h\_{t-1}^{\text{dec}\_1},h\_{t-1}^{\text{dec}\_2})$, and by the end of the decoding process, the decoder generates the updated hidden states $(h_t^{\text{dec}_1},h_t^{\text{dec}_2})$. + +## Relevant papers: +* [Performant, Memory Efficient and Scalable Multi-Agent Reinforcement Learning](https://arxiv.org/pdf/2410.01706) +* [Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/pdf/2307.08621) diff --git a/mava/systems/sac/README.md b/mava/systems/sac/README.md new file mode 100644 index 000000000..af7b27411 --- /dev/null +++ b/mava/systems/sac/README.md @@ -0,0 +1,16 @@ +# Soft Actor-Critic + +We provide the following three multi-agent extensions to the Soft Actor-Critic (SAC) algorithm. + +* [ff-ISAC](../../systems/sac/anakin/ff_isac.py) +* [ff-MASAC](../../systems/sac/anakin/ff_masac.py) +* [ff-HASAC](../../systems/sac/anakin/ff_hasac.py) + +`ISAC` is an implementation that follows the independent learners MARL paradigm, while `MASAC` follows the centralised training with decentralised execution paradigm by using a centralised critic during training. `HASAC` follows the heterogeneous-agent learning paradigm through sequential policy updates. The `ff` prefix in the algorithm names indicates that the algorithms use MLP-based policy networks.
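The sequential policy updates that distinguish HASAC can be sketched as follows. This is a minimal plain-Python illustration, not Mava's implementation: it assumes, as in the heterogeneous-agent RL literature, that a fresh random agent order is drawn for every update round and that each agent's update may condition on the agents already updated in that round. `sequential_policy_update` and the `update_agent` callback are hypothetical names.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical callback: (agent_id, ids_already_updated_this_round) -> None.
# In a real system this would run one policy-gradient step for `agent_id`.
UpdateFn = Callable[[int, Tuple[int, ...]], None]


def sequential_policy_update(
    num_agents: int, update_agent: UpdateFn, rng: random.Random
) -> List[int]:
    """One round of heterogeneous-agent style sequential updates (sketch).

    Agents are visited in a freshly drawn random order. Each agent is
    updated knowing which agents were already updated in this round, so
    its update can take their new policies into account.
    """
    order = list(range(num_agents))
    rng.shuffle(order)  # assumption: a new permutation every round
    updated: List[int] = []
    for agent_id in order:
        update_agent(agent_id, tuple(updated))
        updated.append(agent_id)
    return order


# Record the calls instead of performing real gradient updates.
calls: List[Tuple[int, Tuple[int, ...]]] = []
order = sequential_policy_update(
    num_agents=3,
    update_agent=lambda agent_id, done: calls.append((agent_id, done)),
    rng=random.Random(0),
)
# For every k: calls[k] == (order[k], tuple(order[:k]))
```

By contrast, ISAC and MASAC would update all agents' policies simultaneously; the ordering only matters in the heterogeneous-agent setting, where later agents adapt to the already-updated policies of earlier ones.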
+ +## Relevant papers +* [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/pdf/1801.01290) +* [Multi-Agent Actor-Critic for Mixed +Cooperative-Competitive Environments](https://arxiv.org/pdf/1706.02275) +* [Robust Multi-Agent Control via Maximum Entropy +Heterogeneous-Agent Reinforcement Learning](https://arxiv.org/pdf/2306.10715)