From 8cac3864ad1e9e071b551bad27a9c4517507cffc Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 9 Oct 2023 18:23:01 +0000 Subject: [PATCH 01/26] add disclaimber for double quotes --- README.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 1be096c2e..3ad226414 100644 --- a/README.md +++ b/README.md @@ -126,7 +126,15 @@ To use the Docker container as an interactive virtual environment, you can run a \ --keep_container_alive true ``` -2. Open a bash terminal + Note: You may have to use double quotes around `algorithmic-efficiency` in the mounting `-v` flag. If the above command fails try replacing the following line: + ```bash + -v $HOME/algorithmic-efficiency:/algorithmic-efficiency2 \ + ``` + with + ``` + -v $HOME"/algorithmic-efficiency:/algorithmic-efficiency" \ + ``` + - Open a bash terminal ```bash docker exec -it /bin/bash ``` From 8482b6a294ef1f9025f399c45c5fc3a8aa75cbd5 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 9 Oct 2023 23:00:58 +0000 Subject: [PATCH 02/26] init change log --- CHANGELOG.md | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 CHANGELOG.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 000000000..b69b92969 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,3 @@ +# Change log + +# TODO: algorithmic-efficiency 0.1.0 \ No newline at end of file From 18a24575858b205ad8d564f295eb558f8efed2f2 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 9 Oct 2023 23:20:31 +0000 Subject: [PATCH 03/26] add some faqs --- README.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/README.md b/README.md index 3ad226414..2dea5d39d 100644 --- a/README.md +++ b/README.md @@ -235,3 +235,28 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example. + +# FAQS +## Setup +### Why do I get a warning that GPU is not found? +If running with pytorch, we intentionally hide the GPUs from jax. So please disregard the following warning: +``` +W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.) +``` + +## Platform +### My machine only has one GPU. How can I use this repo? +You can run this repo on a machine with arbitrary number of GPUs. 
However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines/` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this +please reduce the batchsizes for the submission. + +### How do I run this on my SLURM cluster? +You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). +### How can I run this on my AWS/GCP/Azure cloud project? +Yes you can use this repo on your cloud project. As noted above we recommend +Docker or Apptainer to ensure a similar environment as our scoring environment. +## Submissions +### Can submission be structured using multiple files? + +### How can I install custom dependencies? +### How can I know if my code can be run on benchmarking hardware? +### Are we allowed to use our own hardware to self-report the results? \ No newline at end of file From 2f5069193e12d2fe310e6ad578e7b6edf04aed0b Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 9 Oct 2023 23:22:55 +0000 Subject: [PATCH 04/26] edit --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 2dea5d39d..6c643f7fe 100644 --- a/README.md +++ b/README.md @@ -256,7 +256,6 @@ Yes you can use this repo on your cloud project. As noted above we recommend Docker or Apptainer to ensure a similar environment as our scoring environment. ## Submissions ### Can submission be structured using multiple files? - ### How can I install custom dependencies? ### How can I know if my code can be run on benchmarking hardware? ### Are we allowed to use our own hardware to self-report the results? \ No newline at end of file From 07c1d100e19435347802c71fa69a913cac4c4c0c Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 22:47:24 +0000 Subject: [PATCH 05/26] add some more faqs --- README.md | 18 ++++++++++++++---- getting_started.md | 2 +- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 6c643f7fe..444bf2827 100644 --- a/README.md +++ b/README.md @@ -252,10 +252,20 @@ please reduce the batchsizes for the submission. ### How do I run this on my SLURM cluster? You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). ### How can I run this on my AWS/GCP/Azure cloud project? -Yes you can use this repo on your cloud project. As noted above we recommend -Docker or Apptainer to ensure a similar environment as our scoring environment. +Below instructions are for GCP. Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. +1. If you don't have an VM instance yet, we recommmend to use the "Deep Learning on Linux" Image in Boot disk options. +2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. + ## Submissions ### Can submission be structured using multiple files? -### How can I install custom dependencies? +Yes, your submission can be structured using multiple files. +### Can I install custom dependencies? 
+You may use custom dependencies as long as they do not conflict with any of the pinned packages in setup.cfg. +To include your custom dependencies in your submission, please include them in a requirements.txt file. ### How can I know if my code can be run on benchmarking hardware? -### Are we allowed to use our own hardware to self-report the results? \ No newline at end of file +The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md). +Please monitor your submission's memory usage so that it does not exceed the available memory +on the competition hardware. +### Are we allowed to use our own hardware to self-report the results? +No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for +a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set) \ No newline at end of file diff --git a/getting_started.md b/getting_started.md index 2942e632b..96e58edab 100644 --- a/getting_started.md +++ b/getting_started.md @@ -13,7 +13,7 @@ To get started you will have to make a few decisions and install the repository 1. Decide if you would like to develop your submission in either Pytorch or Jax. 2. Set up your workstation or VM. We recommend to use a setup similar to the [benchmarking hardware](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#benchmarking-hardware). The specs on the benchmarking machines are: - - 8 V100 GPUs + - 8 V100 GPUs - 240 GB in RAM - 2 TB in storage (for datasets). 3. Install the algorithmic package and dependencies, see [Installation](./README.md#installation). From 2c85991b4dda059dad436b0f13b59e880c6cba97 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:15:53 +0000 Subject: [PATCH 06/26] add disclaimer on conformer pytorch workload --- CHANGELOG.md | 8 +++++++- README.md | 11 ++++++++++- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index b69b92969..76dfa2b9a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,9 @@ # Change log -# TODO: algorithmic-efficiency 0.1.0 \ No newline at end of file +# TODO: algorithmic-efficiency 0.1.0 +First release of AlgoPerf benchmarking code. +Disclaimer: The Conformer Pytorch workload has memory fragmentation issue after upgrading to +Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration, +which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch +submission times will be about 2x compared to an identical jax submissions. +Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). \ No newline at end of file diff --git a/README.md b/README.md index 444bf2827..0d1953ff7 100644 --- a/README.md +++ b/README.md @@ -229,13 +229,22 @@ The rules for the MLCommons Algorithmic Efficency benchmark can be found in the If you are interested in contributing to the work of the working group, feel free to [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/), open issues. See our [CONTRIBUTING.md](CONTRIBUTING.md) for MLCommons contributing guidelines and setup and workflow instructions. -# Note on shared data pipelines between JAX and PyTorch +# Disclaimers + +# Shared data pipelines between JAX and PyTorch The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads are using the same TensorFlow input pipelines. 
Due to differences in how Jax and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads. Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example. +# Conformer workload 2x slower in Pytorch vs Jax +The Conformer Pytorch workload has memory fragmentation issue after upgrading to +Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch +memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this +means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submissions. +Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). + # FAQS ## Setup ### Why do I get a warning that GPU is not found? From f6ba321b95cbcad6e068000a0f0667f80243ee5b Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:20:12 +0000 Subject: [PATCH 07/26] update readme --- CHANGELOG.md | 10 ++++++---- README.md | 12 +++++++++--- 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 76dfa2b9a..2a06c043f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,9 +1,11 @@ # Change log # TODO: algorithmic-efficiency 0.1.0 -First release of AlgoPerf benchmarking code. -Disclaimer: The Conformer Pytorch workload has memory fragmentation issue after upgrading to +First release of AlgoPerf benchmarking code. + +**Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch -submission times will be about 2x compared to an identical jax submissions. -Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). \ No newline at end of file +submission times will be about 2x compared to an identical jax submission. + +Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). 
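The memory allocation configuration referred to above is PyTorch's `PYTORCH_CUDA_ALLOC_CONF` environment variable. The exact value used for the Conformer workload is tracked in the issue above and may differ from this sketch, but the setting is of roughly this form:

```bash
# Illustrative only: the precise allocator options used for the Conformer
# PyTorch workload are tracked in issue #497 and may differ from this example.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256
```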
\ No newline at end of file diff --git a/README.md b/README.md index 0d1953ff7..e849bfca0 100644 --- a/README.md +++ b/README.md @@ -29,9 +29,12 @@ - [Getting Started](#getting-started) - [Rules](#rules) - [Contributing](#contributing) +- [Diclaimers](#disclaimers) +- [FAQS](#faqs) - [Citing AlgoPerf Benchmark](#citing-algoperf-benchmark) + ## Installation You can install this package and dependences in a [python virtual environment](#virtual-environment) or use a [Docker/Singularity/Apptainer container](#install-in-docker) (recommended). @@ -231,14 +234,14 @@ If you are interested in contributing to the work of the working group, feel fre # Disclaimers -# Shared data pipelines between JAX and PyTorch +## Shared data pipelines between JAX and PyTorch The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads are using the same TensorFlow input pipelines. Due to differences in how Jax and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads. Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example. -# Conformer workload 2x slower in Pytorch vs Jax +## Conformer workload 2x slower in Pytorch vs Jax The Conformer Pytorch workload has memory fragmentation issue after upgrading to Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this @@ -277,4 +280,7 @@ Please monitor your submission's memory usage so that it does not exceed the ava on the competition hardware. ### Are we allowed to use our own hardware to self-report the results? No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for -a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set) \ No newline at end of file +a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set) + +# Citing AlgoPerf Benchmark +Todo: how to cite the algoperf benchmark? 
\ No newline at end of file From cc953cf005e91aa83ae2ec7d1a8bf4bb932fde22 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:37:07 +0000 Subject: [PATCH 08/26] typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e849bfca0..cb3c09c83 100644 --- a/README.md +++ b/README.md @@ -280,7 +280,7 @@ Please monitor your submission's memory usage so that it does not exceed the ava on the competition hardware. ### Are we allowed to use our own hardware to self-report the results? No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for -a compute sponsorship offering a free evaluation on the full benchmark set, see [Rules](./RULES.md#qualification-set) +a compute sponsorship offering a free evaluation on the full benchmark set, see the [Rules](./RULES.md#qualification-set). # Citing AlgoPerf Benchmark Todo: how to cite the algoperf benchmark? \ No newline at end of file From 3760f28a2764fdc6aba5fbbce4f741390d85cf69 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:37:43 +0000 Subject: [PATCH 09/26] format --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index cb3c09c83..99f159a03 100644 --- a/README.md +++ b/README.md @@ -272,7 +272,7 @@ Below instructions are for GCP. Depending on your virtual machine, you may have ### Can submission be structured using multiple files? Yes, your submission can be structured using multiple files. ### Can I install custom dependencies? -You may use custom dependencies as long as they do not conflict with any of the pinned packages in setup.cfg. +You may use custom dependencies as long as they do not conflict with any of the pinned packages in `algorithmic-efficiency/setup.cfg`. To include your custom dependencies in your submission, please include them in a requirements.txt file. ### How can I know if my code can be run on benchmarking hardware? The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md). From b7bf1ca2b289a8cc942d9460a0328a97a1cceedc Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:38:37 +0000 Subject: [PATCH 10/26] formatting --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 99f159a03..c7f34f5bf 100644 --- a/README.md +++ b/README.md @@ -258,7 +258,7 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_ ## Platform ### My machine only has one GPU. How can I use this repo? -You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines/` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this +You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this please reduce the batchsizes for the submission. ### How do I run this on my SLURM cluster? 
From 03c48230fea525b806295cb49ba1d08dbededba4 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:38:59 +0000 Subject: [PATCH 11/26] clarify --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c7f34f5bf..9af8fa445 100644 --- a/README.md +++ b/README.md @@ -258,7 +258,7 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_ ## Platform ### My machine only has one GPU. How can I use this repo? -You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this +You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this please reduce the batchsizes for the submission. ### How do I run this on my SLURM cluster? From f32444c05b1e99ba5a8f7d875329bf1a0dca30cd Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:42:53 +0000 Subject: [PATCH 12/26] fix --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9af8fa445..ae81aba64 100644 --- a/README.md +++ b/README.md @@ -258,8 +258,10 @@ W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_ ## Platform ### My machine only has one GPU. How can I use this repo? -You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. To solve this -please reduce the batchsizes for the submission. +You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit' +on the benchmarking hardware, so if you are using fewer +GPUs with higher per gpu memory, please monitor your memory usage +to make make sure it will fit on a 8 16GB V100 GPUs. ### How do I run this on my SLURM cluster? You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). 
From e6ef936e93447124c444248f3381c72d01d03059 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:52:22 +0000 Subject: [PATCH 13/26] gcp instructions --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ae81aba64..348e963f4 100644 --- a/README.md +++ b/README.md @@ -266,8 +266,9 @@ to make make sure it will fit on a 8 16GB V100 GPUs. ### How do I run this on my SLURM cluster? You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). ### How can I run this on my AWS/GCP/Azure cloud project? -Below instructions are for GCP. Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. -1. If you don't have an VM instance yet, we recommmend to use the "Deep Learning on Linux" Image in Boot disk options. + Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you can: +1. If you don't have an VM instance yet, we recommmend creating a +new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. ## Submissions From 6d2ba7e2ed3560fb7e7bc97af51d54ba79fa5ab2 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Wed, 11 Oct 2023 23:54:27 +0000 Subject: [PATCH 14/26] typo --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 348e963f4..990d094bf 100644 --- a/README.md +++ b/README.md @@ -245,8 +245,8 @@ While this issue might not affect all setups, we currently implement a different The Conformer Pytorch workload has memory fragmentation issue after upgrading to Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this -means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submissions. -Tracking issue here: see issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). +means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submission. +Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). # FAQS ## Setup @@ -266,7 +266,7 @@ to make make sure it will fit on a 8 16GB V100 GPUs. ### How do I run this on my SLURM cluster? You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). ### How can I run this on my AWS/GCP/Azure cloud project? - Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you can: + Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following. 1. 
If you don't have an VM instance yet, we recommmend creating a new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. From 687c9a28fa32d9cc3aea98f53bdf4c09bbe7c49b Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Thu, 12 Oct 2023 19:36:51 +0000 Subject: [PATCH 15/26] update --- README.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/README.md b/README.md index 990d094bf..adea6b574 100644 --- a/README.md +++ b/README.md @@ -249,14 +249,8 @@ means that the Conformer Pytorch submission times will be roughly 2x compared to Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). # FAQS -## Setup -### Why do I get a warning that GPU is not found? -If running with pytorch, we intentionally hide the GPUs from jax. So please disregard the following warning: -``` -W1003 ... xla_bridge.py:463] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.) -``` -## Platform +## Setup and Platform ### My machine only has one GPU. How can I use this repo? You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit' on the benchmarking hardware, so if you are using fewer From 9d7e05db67a1dd392ee6546885dd21e13131be9f Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Thu, 12 Oct 2023 20:09:30 +0000 Subject: [PATCH 16/26] update conformer --- CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 2a06c043f..ea3c7c046 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,6 @@ First release of AlgoPerf benchmarking code. **Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch -submission times will be about 2x compared to an identical jax submission. +submission times will be about 2x slower. Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). \ No newline at end of file From f1e3bb09d31bed4391efd6600c11a0a93bba51d1 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Thu, 12 Oct 2023 20:13:18 +0000 Subject: [PATCH 17/26] update --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index adea6b574..5a0f146a8 100644 --- a/README.md +++ b/README.md @@ -241,11 +241,11 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. 
Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example. -## Conformer workload 2x slower in Pytorch vs Jax +## Conformer Pytorch OOM The Conformer Pytorch workload has memory fragmentation issue after upgrading to Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this -means that the Conformer Pytorch submission times will be roughly 2x compared to an identical jax submission. +means that the Conformer Pytorch submission times will be roughly 2x slower. Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). # FAQS From 442a152e2dd71a2ce0135aed2841b1ec33d5fd09 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Fri, 13 Oct 2023 18:59:33 +0000 Subject: [PATCH 18/26] fix --- CHANGELOG.md | 7 ++++--- README.md | 2 +- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index ea3c7c046..c3112b9d7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,8 +4,9 @@ First release of AlgoPerf benchmarking code. **Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to -Pytorch 2.0.1. To circumvent this issues we have tuned the pytorch memory allocation configuration, -which slows down the workload by a factor of 2x. For submitters, this means that the Conformer Pytorch -submission times will be about 2x slower. +Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch +memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this +means that the Conformer Pytorch submission times will be roughly 2x slower. +Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). \ No newline at end of file diff --git a/README.md b/README.md index 5a0f146a8..3dce37924 100644 --- a/README.md +++ b/README.md @@ -243,7 +243,7 @@ While this issue might not affect all setups, we currently implement a different ## Conformer Pytorch OOM The Conformer Pytorch workload has memory fragmentation issue after upgrading to -Pytorch 2.0.1, which led to out of memory errors. To circumvent this issues we have tuned the pytorch +Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this means that the Conformer Pytorch submission times will be roughly 2x slower. 
Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). From d047d05903332eb418e798f7ed3bcce01e61250e Mon Sep 17 00:00:00 2001 From: Chandramouli Shama Sastry Date: Sun, 15 Oct 2023 00:24:50 +0000 Subject: [PATCH 19/26] conformer oom fixes --- .../librispeech_pytorch/models.py | 298 ++++-------------- .../librispeech_pytorch/workload.py | 12 +- submission_runner.py | 4 +- .../librispeech_conformer/compare.py | 10 +- 4 files changed, 78 insertions(+), 246 deletions(-) diff --git a/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py b/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py index 665c3c894..2da7dcfb3 100644 --- a/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py +++ b/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py @@ -71,7 +71,10 @@ def forward(self, x): class Subsample(nn.Module): - def __init__(self, encoder_dim: int = 0, input_dropout_rate: float = 0.0): + def __init__(self, + encoder_dim: int = 0, + input_dropout_rate: float = 0.0, + num_bins: int = 80): super().__init__() self.encoder_dim = encoder_dim self.input_dropout_rate = input_dropout_rate @@ -81,7 +84,10 @@ def __init__(self, encoder_dim: int = 0, input_dropout_rate: float = 0.0): self.conv2 = Conv2dSubsampling( input_channels=encoder_dim, output_channels=encoder_dim) - self.linear = nn.LazyLinear(out_features=self.encoder_dim, bias=True) + self.linear = nn.Linear( + in_features=self.encoder_dim * num_bins // 4, + out_features=self.encoder_dim, + bias=True) self.pos_encode = AddPositionalEmbedding(embedding_dim=self.encoder_dim) self.dropout = nn.Dropout(p=self.input_dropout_rate) @@ -123,6 +129,7 @@ def __init__(self, self.kernel = nn.Parameter( torch.nn.init.xavier_uniform_(torch.empty(*self.filter_shape))) self.bias = nn.Parameter(torch.zeros(output_channels)) + self.register_buffer('paddings_kernel', torch.ones([1, 1, 1])) def get_same_padding(self, input_shape): in_height, in_width = input_shape[2:] @@ -162,15 +169,11 @@ def forward(self, inputs, paddings): input_length = paddings.shape[1] stride = self.filter_stride[0] pad_len = (input_length + stride - 1) // stride * stride - input_length - padded_paddings = torch.cat([ - paddings[:, None, :], - torch.zeros( - size=(paddings.shape[0], 1, pad_len), device=paddings.device) - ], - dim=2) + padded_paddings = F.pad( + paddings[:, None, :], (0, pad_len), mode='constant', value=0) out_padding = F.conv1d( input=padded_paddings, - weight=torch.ones([1, 1, 1], device=paddings.device), + weight=self.paddings_kernel, stride=self.filter_stride[:1]) out_padding = out_padding.squeeze(dim=1) outputs = outputs * (1 - out_padding[:, None, :, None]) @@ -184,11 +187,15 @@ def __init__(self, config: ConformerConfig): self.config = config self.ln = LayerNorm(dim=config.encoder_dim) - self.linear1 = nn.LazyLinear( + self.linear1 = nn.Linear( + in_features=config.encoder_dim, out_features=config.encoder_dim * config.feed_forward_expansion_factor, bias=True) self.dropout1 = nn.Dropout(p=config.feed_forward_dropout_rate) - self.linear2 = nn.LazyLinear(out_features=config.encoder_dim, bias=True) + self.linear2 = nn.Linear( + in_features=config.encoder_dim * config.feed_forward_expansion_factor, + out_features=config.encoder_dim, + bias=True) if config.feed_forward_residual_dropout_rate is None: feed_forward_residual_dropout_rate = 0.1 @@ -253,217 +260,32 @@ def forward(self, inputs): return inputs * scale -class 
MHSAwithQS(nn.MultiheadAttention): - # pylint: disable=locally-disabled, use-a-generator, line-too-long, invalid-name +class MHSAwithQS(nn.Module): + def __init__(self, config: ConformerConfig): - super().__init__( - embed_dim=config.encoder_dim, - num_heads=config.num_attention_heads, - dropout=config.attention_dropout_rate, - bias=True, - batch_first=True) + super().__init__() + self.embed_dim = config.encoder_dim + self.num_heads = config.num_attention_heads + self.dropout = config.attention_dropout_rate + self.in_proj = nn.Linear(config.encoder_dim, 3 * config.encoder_dim) + self.out_proj = nn.Linear(config.encoder_dim, config.encoder_dim) self.qs = QueryScaler(dim=config.encoder_dim // config.num_attention_heads) - def _scaled_in_proj_weight(self): - # Scale the query projection weight. - qs_input = self.in_proj_weight[:self.embed_dim].view( - self.num_heads, self.embed_dim // self.num_heads, -1).transpose(1, 2) - in_proj_queryW_scaled = self.qs(qs_input).transpose( - 1, 2).view(*self.in_proj_weight[:self.embed_dim].shape) - in_proj_weight = torch.cat( - [in_proj_queryW_scaled, self.in_proj_weight[self.embed_dim:]]) - return in_proj_weight - - def _scaled_in_proj_bias(self): - # Scale the query bias. - in_proj_queryb_scaled = self.qs(self.in_proj_bias[:self.embed_dim].view( - self.num_heads, self.embed_dim // self.num_heads)).view(-1) - in_proj_bias = torch.cat( - [in_proj_queryb_scaled, self.in_proj_bias[self.embed_dim:]]) - return in_proj_bias - - def forward(self, - query, - key, - value, - key_padding_mask=None, - need_weights: bool = True, - attn_mask=None, - average_attn_weights: bool = True): - r""" - Args: - query: Query embeddings of shape :math:`(L, E_q)` for unbatched input, :math:`(L, N, E_q)` when ``batch_first=False`` - or :math:`(N, L, E_q)` when ``batch_first=True``, where :math:`L` is the target sequence length, - :math:`N` is the batch size, and :math:`E_q` is the query embedding dimension ``embed_dim``. - Queries are compared against key-value pairs to produce the output. - See "Attention Is All You Need" for more details. - key: Key embeddings of shape :math:`(S, E_k)` for unbatched input, :math:`(S, N, E_k)` when ``batch_first=False`` - or :math:`(N, S, E_k)` when ``batch_first=True``, where :math:`S` is the source sequence length, - :math:`N` is the batch size, and :math:`E_k` is the key embedding dimension ``kdim``. - See "Attention Is All You Need" for more details. - value: Value embeddings of shape :math:`(S, E_v)` for unbatched input, :math:`(S, N, E_v)` when - ``batch_first=False`` or :math:`(N, S, E_v)` when ``batch_first=True``, where :math:`S` is the source - sequence length, :math:`N` is the batch size, and :math:`E_v` is the value embedding dimension ``vdim``. - See "Attention Is All You Need" for more details. - key_padding_mask: If specified, a mask of shape :math:`(N, S)` indicating which elements within ``key`` - to ignore for the purpose of attention (i.e. treat as "padding"). For unbatched `query`, shape should be :math:`(S)`. - Binary and byte masks are supported. - For a binary mask, a ``True`` value indicates that the corresponding ``key`` value will be ignored for - the purpose of attention. For a float mask, it will be directly added to the corresponding ``key`` value. - need_weights: If specified, returns ``attn_output_weights`` in addition to ``attn_outputs``. - Default: ``True``. - attn_mask: If specified, a 2D or 3D mask preventing attention to certain positions. 
Must be of shape - :math:`(L, S)` or :math:`(N\cdot\text{num\_heads}, L, S)`, where :math:`N` is the batch size, - :math:`L` is the target sequence length, and :math:`S` is the source sequence length. A 2D mask will be - broadcasted across the batch while a 3D mask allows for a different mask for each entry in the batch. - Binary, byte, and float masks are supported. For a binary mask, a ``True`` value indicates that the - corresponding position is not allowed to attend. For a byte mask, a non-zero value indicates that the - corresponding position is not allowed to attend. For a float mask, the mask values will be added to - the attention weight. - average_attn_weights: If true, indicates that the returned ``attn_weights`` should be averaged across - heads. Otherwise, ``attn_weights`` are provided separately per head. Note that this flag only has an - effect when ``need_weights=True``. Default: ``True`` (i.e. average weights across heads) - - Outputs: - - **attn_output** - Attention outputs of shape :math:`(L, E)` when input is unbatched, - :math:`(L, N, E)` when ``batch_first=False`` or :math:`(N, L, E)` when ``batch_first=True``, - where :math:`L` is the target sequence length, :math:`N` is the batch size, and :math:`E` is the - embedding dimension ``embed_dim``. - - **attn_output_weights** - Only returned when ``need_weights=True``. If ``average_attn_weights=True``, - returns attention weights averaged across heads of shape :math:`(L, S)` when input is unbatched or - :math:`(N, L, S)`, where :math:`N` is the batch size, :math:`L` is the target sequence length, and - :math:`S` is the source sequence length. If ``average_attn_weights=False``, returns attention weights per - head of shape :math:`(\text{num\_heads}, L, S)` when input is unbatched or :math:`(N, \text{num\_heads}, L, S)`. - - .. note:: - `batch_first` argument is ignored for unbatched inputs. - """ - is_batched = query.dim() == 3 - if key_padding_mask is not None: - _kpm_dtype = key_padding_mask.dtype - if _kpm_dtype != torch.bool and not torch.is_floating_point( - key_padding_mask): - raise AssertionError( - "only bool and floating types of key_padding_mask are supported") - why_not_fast_path = '' - if not is_batched: - why_not_fast_path = f"input not batched; expected query.dim() of 3 but got {query.dim()}" - elif query is not key or key is not value: - # When lifting this restriction, don't forget to either - # enforce that the dtypes all match or test cases where - # they don't! - why_not_fast_path = "non-self attention was used (query, key, and value are not the same Tensor)" - elif self.in_proj_bias is not None and query.dtype != self.in_proj_bias.dtype: - why_not_fast_path = f"dtypes of query ({query.dtype}) and self.in_proj_bias ({self.in_proj_bias.dtype}) don't match" - elif self.in_proj_weight is not None and query.dtype != self.in_proj_weight.dtype: - # this case will fail anyway, but at least they'll get a useful error message. 
- why_not_fast_path = f"dtypes of query ({query.dtype}) and self.in_proj_weight ({self.in_proj_weight.dtype}) don't match" - elif self.training: - why_not_fast_path = "training is enabled" - elif not self.batch_first: - why_not_fast_path = "batch_first was not True" - elif self.bias_k is not None: - why_not_fast_path = "self.bias_k was not None" - elif self.bias_v is not None: - why_not_fast_path = "self.bias_v was not None" - elif self.dropout: - why_not_fast_path = f"dropout was {self.dropout}, required zero" - elif self.add_zero_attn: - why_not_fast_path = "add_zero_attn was enabled" - elif not self._qkv_same_embed_dim: - why_not_fast_path = "_qkv_same_embed_dim was not True" - elif attn_mask is not None: - why_not_fast_path = "attn_mask was not None" - elif query.is_nested and key_padding_mask is not None: - why_not_fast_path = "key_padding_mask is not supported with NestedTensor input" - elif self.num_heads % 2 == 1: - why_not_fast_path = "num_heads is odd" - elif torch.is_autocast_enabled(): - why_not_fast_path = "autocast is enabled" - - if not why_not_fast_path: - tensor_args = ( - query, - key, - value, - self.in_proj_weight, - self.in_proj_bias, - self.out_proj.weight, - self.out_proj.bias, - ) - # We have to use list comprehensions below because TorchScript does not support - # generator expressions. - if torch.overrides.has_torch_function(tensor_args): - why_not_fast_path = "some Tensor argument has_torch_function" - elif not all([(x is None or x.is_cuda or 'cpu' in str(x.device)) - for x in tensor_args]): - why_not_fast_path = "some Tensor argument is neither CUDA nor CPU" - elif torch.is_grad_enabled() and any( - [x is not None and x.requires_grad for x in tensor_args]): - why_not_fast_path = ( - "grad is enabled and at least one of query or the " - "input/output projection weights or biases requires_grad") - if not why_not_fast_path: - # Scale the query bias parameter and the query projection weight. - in_proj_weight = self._scaled_in_proj_weight() - in_proj_bias = self._scaled_in_proj_bias() - return torch._native_multi_head_attention( - query, - key, - value, - self.embed_dim, - self.num_heads, - in_proj_weight, - in_proj_bias, - self.out_proj.weight, - self.out_proj.bias, - key_padding_mask if key_padding_mask is not None else attn_mask, - need_weights, - average_attn_weights, - 1 if key_padding_mask is not None else - 0 if attn_mask is not None else None) - any_nested = query.is_nested or key.is_nested or value.is_nested - assert not any_nested, ("MultiheadAttention does not support NestedTensor outside of its fast path. 
" + - f"The fast path was not hit because {why_not_fast_path}") - - if self.batch_first and is_batched: - # make sure that the transpose op does not affect the "is" property - if key is value: - if query is key: - query = key = value = query.transpose(1, 0) - else: - query, key = [x.transpose(1, 0) for x in (query, key)] - value = key - else: - query, key, value = [x.transpose(1, 0) for x in (query, key, value)] - - if not self._qkv_same_embed_dim: - attn_output, attn_output_weights = F.multi_head_attention_forward( - query, key, value, self.embed_dim, self.num_heads, - self.in_proj_weight, self.in_proj_bias, - self.bias_k, self.bias_v, self.add_zero_attn, - self.dropout, self.out_proj.weight, self.out_proj.bias, - training=self.training, - key_padding_mask=key_padding_mask, need_weights=need_weights, - attn_mask=attn_mask, use_separate_proj_weight=True, - q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight, - v_proj_weight=self.v_proj_weight, average_attn_weights=average_attn_weights) - else: - # Scale the query bias parameter and the query projection weight. - in_proj_weight = self._scaled_in_proj_weight() - in_proj_bias = self._scaled_in_proj_bias() - attn_output, attn_output_weights = F.multi_head_attention_forward( - query, key, value, self.embed_dim, self.num_heads, - in_proj_weight, in_proj_bias, - self.bias_k, self.bias_v, self.add_zero_attn, - self.dropout, self.out_proj.weight, self.out_proj.bias, - training=self.training, - key_padding_mask=key_padding_mask, need_weights=need_weights, - attn_mask=attn_mask, average_attn_weights=average_attn_weights) - if self.batch_first and is_batched: - return attn_output.transpose(1, 0), attn_output_weights - else: - return attn_output, attn_output_weights + def forward(self, inputs, key_padding_mask=None): + batch_size, seq_len, embed_dim = inputs.shape + q, k, v = self.in_proj(inputs).split(self.embed_dim, dim=2) + q = self.qs(q.view(batch_size, seq_len, self.num_heads, -1)).transpose(1, 2) + k = k.view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2) + v = v.view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2) + out = F.scaled_dot_product_attention( + query=q, + key=k, + value=v, + attn_mask=~key_padding_mask[:, None, None], + dropout_p=self.dropout, + ).transpose(1, 2).reshape(batch_size, seq_len, embed_dim) + out = self.out_proj(out) + return out class MultiHeadedSelfAttention(nn.Module): @@ -483,12 +305,9 @@ def __init__(self, config: ConformerConfig): def forward(self, outputs, paddings): outputs = self.ln(outputs) - outputs, _ = self.self_attention( - query=outputs, - key=outputs, - value=outputs, - key_padding_mask=paddings==1, - need_weights=False, + outputs = self.self_attention( + outputs, + key_padding_mask=paddings == 1, ) outputs = self.dropout(outputs) return outputs @@ -504,18 +323,29 @@ def __init__(self, config: ConformerConfig): self.register_buffer('running_var', running_var) self.scale = nn.Parameter(torch.zeros(config.encoder_dim)) self.bias = nn.Parameter(torch.zeros(config.encoder_dim)) - self.register_buffer('momentum', - torch.FloatTensor([config.batch_norm_momentum])) - self.register_buffer('epsilon', - torch.FloatTensor([config.batch_norm_epsilon])) + self.register_buffer('dim', torch.FloatTensor([config.encoder_dim])) - # self.momentum = config.batch_norm_momentum - # self.epsilon = config.batch_norm_epsilon - # self.dim = config.encoder_dim + self.momentum = config.batch_norm_momentum + self.epsilon = config.batch_norm_epsilon def forward(self, inputs, input_paddings): #inputs: 
NHD #padding: NH + """ + Alternatively: + inputs[input_paddings==0] = F.batch_norm( + input = inputs[input_paddings==0], + running_mean = self.running_mean, + running_var = self.running_var, + weight = 1+self.scale, + bias = self.bias, + training = self.training, + momentum=1-self.momentum, + eps=self.epsilon + ) + inputs.masked_fill(input_paddings[...,None] != 0, 0) + return inputs + """ mask = 1 - input_paddings[:, :, None] if self.training: count = mask.sum() @@ -627,7 +457,9 @@ def __init__(self, config: ConformerConfig): else: input_dropout_rate = config.input_dropout_rate self.subsample = Subsample( - encoder_dim=config.encoder_dim, input_dropout_rate=input_dropout_rate) + encoder_dim=config.encoder_dim, + input_dropout_rate=input_dropout_rate, + num_bins=preprocessing_config.num_bins) self.conformers = nn.ModuleList( [ConformerBlock(config) for _ in range(config.num_encoder_layers)]) diff --git a/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/workload.py b/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/workload.py index 24f4eb1fc..c4f4a1247 100644 --- a/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/workload.py +++ b/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/workload.py @@ -47,8 +47,11 @@ def init_model_fn( input_dropout_rate. """ torch.random.manual_seed(rng[0]) - # Disable cudnn benchmark to avoid OOM errors. + # Configure torch backends to avoid OOM errors. torch.backends.cudnn.benchmark = False + torch.backends.cuda.enable_flash_sdp(False) + torch.backends.cuda.enable_mem_efficient_sdp(False) + torch.backends.cuda.enable_math_sdp(True) model = conformer_model.ConformerEncoderDecoder( conformer_model.ConformerConfig( attention_residual_dropout_rate=dropout_rate, @@ -57,13 +60,6 @@ def init_model_fn( input_dropout_rate=aux_dropout_rate, use_specaug=self.use_specaug)) self.ctc_loss = torch.nn.CTCLoss(blank=0, reduction='none') - # Run model once to initialize lazy layers. - # Run the initialization in eval mode to disable BN tracking. 
- model = model.eval() - t = MAX_INPUT_LENGTH - wave = torch.randn((2, t)) - pad = torch.zeros_like(wave) - _ = model(wave, pad) conformer_model.initialize(model) self._param_shapes = param_utils.pytorch_param_shapes(model) self._param_types = param_utils.pytorch_param_types(self._param_shapes) diff --git a/submission_runner.py b/submission_runner.py index 2289d39d3..d7f28bc22 100644 --- a/submission_runner.py +++ b/submission_runner.py @@ -206,9 +206,9 @@ def train_once( model_params, model_state = workload.init_model_fn( model_init_rng, dropout_rate, aux_dropout_rate) if FLAGS.framework == 'pytorch' and FLAGS.torch_compile: - compile_error_workloads = ['ogbg', 'criteo1tb'] + compile_error_workloads = ['librispeech_conformer', 'ogbg', 'criteo1tb'] eager_backend_workloads = [ - 'librispeech_conformer', 'librispeech_deepspeech' + 'librispeech_deepspeech' ] aot_eager_backend_workloads = [] if FLAGS.workload in compile_error_workloads: diff --git a/tests/modeldiffs/librispeech_conformer/compare.py b/tests/modeldiffs/librispeech_conformer/compare.py index 1d243d83e..d414001dd 100644 --- a/tests/modeldiffs/librispeech_conformer/compare.py +++ b/tests/modeldiffs/librispeech_conformer/compare.py @@ -38,11 +38,15 @@ def sd_transform(sd): out = {} for k in sd: if 'Attention' in ''.join(k): - if 'in_proj' in k[-1]: - new_key = k[:-1] + if 'Dense_0' in k[-2]: + # In-proj + new_key = k[:-2] chunks = sd[k].chunk(3) for t, c in zip(['query', 'key', 'value'], chunks): - out[new_key + (t, k[-1].split('_')[-1])] = c + out[new_key + (t, k[-1])] = c + elif 'Dense_1' in k[-2]: + # Out-proj + out[(*k[:-2], 'out', k[-1])] = sd[k] else: out[k] = sd[k] else: From 28a1ff039a8f096fbb9360825e6ede88a79030ce Mon Sep 17 00:00:00 2001 From: Chandramouli Shama Sastry Date: Sun, 15 Oct 2023 00:43:45 +0000 Subject: [PATCH 20/26] style fix --- submission_runner.py | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/submission_runner.py b/submission_runner.py index d7f28bc22..6577204f2 100644 --- a/submission_runner.py +++ b/submission_runner.py @@ -207,9 +207,7 @@ def train_once( model_init_rng, dropout_rate, aux_dropout_rate) if FLAGS.framework == 'pytorch' and FLAGS.torch_compile: compile_error_workloads = ['librispeech_conformer', 'ogbg', 'criteo1tb'] - eager_backend_workloads = [ - 'librispeech_deepspeech' - ] + eager_backend_workloads = ['librispeech_deepspeech'] aot_eager_backend_workloads = [] if FLAGS.workload in compile_error_workloads: logging.warning( From 2bc09d9a04691c54755469833d5c9b626ca615d0 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Fri, 20 Oct 2023 05:04:18 +0000 Subject: [PATCH 21/26] correct num_test_examples --- algorithmic_efficiency/workloads/fastmri/workload.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/algorithmic_efficiency/workloads/fastmri/workload.py b/algorithmic_efficiency/workloads/fastmri/workload.py index ecfa27547..4677dc2bb 100644 --- a/algorithmic_efficiency/workloads/fastmri/workload.py +++ b/algorithmic_efficiency/workloads/fastmri/workload.py @@ -51,7 +51,7 @@ def num_validation_examples(self) -> int: @property def num_test_examples(self) -> int: - return 3581 + return 3548 @property def eval_batch_size(self) -> int: From b2af85a0307b798f165c042f9b96cfcbda1d8b28 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Fri, 20 Oct 2023 22:58:00 +0000 Subject: [PATCH 22/26] readme fixes --- CHANGELOG.md | 11 ++--------- README.md | 37 ++++++++++++++++++++++--------------- 2 files changed, 24 insertions(+), 24 deletions(-) diff --git 
a/CHANGELOG.md b/CHANGELOG.md index c3112b9d7..b71e42e01 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,12 +1,5 @@ # Change log -# TODO: algorithmic-efficiency 0.1.0 -First release of AlgoPerf benchmarking code. - -**Disclaimer**: The Conformer Pytorch workload has memory fragmentation issue after upgrading to -Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch -memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this -means that the Conformer Pytorch submission times will be roughly 2x slower. -Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). +## TODO: algorithmic-efficiency 0.1.0 -Tracking issue: [issue/497](https://github.com/mlcommons/algorithmic-efficiency/issues/497). \ No newline at end of file +First release of AlgoPerf benchmarking code. diff --git a/README.md b/README.md index 3dce37924..9803f38be 100644 --- a/README.md +++ b/README.md @@ -129,7 +129,7 @@ To use the Docker container as an interactive virtual environment, you can run a \ --keep_container_alive true ``` - Note: You may have to use double quotes around `algorithmic-efficiency` in the mounting `-v` flag. If the above command fails try replacing the following line: + Note: You may have to use double quotes around `algorithmic-efficiency` [path] in the mounting `-v` flag. If the above command fails try replacing the following line: ```bash -v $HOME/algorithmic-efficiency:/algorithmic-efficiency2 \ ``` @@ -241,13 +241,6 @@ The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example. -## Conformer Pytorch OOM -The Conformer Pytorch workload has memory fragmentation issue after upgrading to -Pytorch 2.0.1, which led to out of memory errors. To circumvent this issue we have tuned the pytorch -memory allocation configuration, which slows down the workload by a factor of roughly 2x. For submitters, this -means that the Conformer Pytorch submission times will be roughly 2x slower. -Tracking in issue/497(https://github.com/mlcommons/algorithmic-efficiency/issues/497). - # FAQS ## Setup and Platform @@ -261,7 +254,7 @@ to make make sure it will fit on a 8 16GB V100 GPUs. You may run into issues with `sudo` and `docker` on a SLURM cluster. 
To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). ### How can I run this on my AWS/GCP/Azure cloud project? Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following. -1. If you don't have an VM instance yet, we recommmend creating a +1. If you don't have a VM instance yet, we recommend creating a new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. @@ -270,14 +263,28 @@ new Compute Instance with the "Deep Learning on Linux" Image in Boot disk option Yes, your submission can be structured using multiple files. ### Can I install custom dependencies? You may use custom dependencies as long as they do not conflict with any of the pinned packages in `algorithmic-efficiency/setup.cfg`. -To include your custom dependencies in your submission, please include them in a requirements.txt file. +To include your custom dependencies in your submission, please include them in a requirements.txt file. Please refer to the [Software dependencies](https://github.com/mlcommons/algorithmic-efficiency/blob/main/RULES.md#software-dependencies) section of our rules. ### How can I know if my code can be run on benchmarking hardware? The benchmarking hardware specifications are documented in the [Getting Started Document](./getting_started.md). -Please monitor your submission's memory usage so that it does not exceed the available memory -on the competition hardware. +We recommend monitoring your submission's memory usage so that it does not exceed the available memory +on the competition hardware. We also recommend to do a dry run using a cloud instance. ### Are we allowed to use our own hardware to self-report the results? -No. However you are allowed to use your own hardware to report the best hyperparameter point to qualify for -a compute sponsorship offering a free evaluation on the full benchmark set, see the [Rules](./RULES.md#qualification-set). +You only have to use the competition hardware for runs that are directly involved in the scoring procedure. This includes all runs for the self-tuning ruleset, but only the runs of the best hyperparameter configuration in each study for the external tuning ruleset. For example, you could use your own (different) hardware to tune your submission and identify the best hyperparameter configuration (in each study) and then only run this configuration (i.e. 5 runs, one for each study) on the competition hardware. # Citing AlgoPerf Benchmark -Todo: how to cite the algoperf benchmark? \ No newline at end of file +If you use the **AlgoPerf** Benchmark in your work, please consider citing: + +> [George E. Dahl, Frank Schneider, Zachary Nado, et al.
+> **Benchmarking Neural Network Training Algorithms**
+> *arXiv 2306.07179*](http://arxiv.org/abs/2306.07179) + +```bibtex +@misc{dahl2023algoperf, + title={{Benchmarking Neural Network Training Algorithms}}, + author={Dahl, George E. and Schneider, Frank and Nado, Zachary and Agarwal, Naman and Sastry, Chandramouli Shama and Hennig, Philipp and Medapati, Sourabh and Eschenhagen, Runa and Kasimbeg, Priya and Suo, Daniel and Bae, Juhan and Gilmer, Justin and Peirson, Abel L. and Khan, Bilal and Anil, Rohan and Rabbat, Mike and Krishnan, Shankar and Snider, Daniel and Amid, Ehsan and Chen, Kongtao and Maddison, Chris J. and Vasudev, Rakshith and Badura, Michal and Garg, Ankush and Mattson, Peter}, + year={2023}, + eprint={2306.07179}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +``` \ No newline at end of file From 2e1ffd40c9d3fc7935b693e245c5928b53963c7c Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 23 Oct 2023 18:10:02 +0000 Subject: [PATCH 23/26] add space after heading --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 9803f38be..f4c467665 100644 --- a/README.md +++ b/README.md @@ -244,6 +244,7 @@ While this issue might not affect all setups, we currently implement a different # FAQS ## Setup and Platform + ### My machine only has one GPU. How can I use this repo? You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit' on the benchmarking hardware, so if you are using fewer From 9cd436282ac34a7c3b2d33ff6574b59767e69466 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 23 Oct 2023 18:13:30 +0000 Subject: [PATCH 24/26] fix typos --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index f4c467665..6ffbab6f7 100644 --- a/README.md +++ b/README.md @@ -246,15 +246,15 @@ While this issue might not affect all setups, we currently implement a different ## Setup and Platform ### My machine only has one GPU. How can I use this repo? -You can run this repo on a machine with arbitrary number of GPUs. However, the default batchsizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batchsizes for the submission. Note that your final submission must 'fit' +You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes in our reference algorithms `algorithmic-efficiency/baselines` and `algorithmic-efficiency/reference_algorithms` are tuned for a machine with 8 16GB V100 GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. 
Note that your final submission must 'fit' on the benchmarking hardware, so if you are using fewer -GPUs with higher per gpu memory, please monitor your memory usage -to make make sure it will fit on a 8 16GB V100 GPUs. +GPUs with higher per GPU memory, please monitor your memory usage +to make sure it will fit on 8xV100 GPUs with 16GB of VRAM per card. ### How do I run this on my SLURM cluster? You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (previously Singularity), see this [section](using-singularity/apptainer-instead-of-docker). ### How can I run this on my AWS/GCP/Azure cloud project? - Depending on your virtual machine, you may have to install install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following. + Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following. 1. If you don't have a VM instance yet, we recommend creating a new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options. 2. To install the NVIDIA Docker toolkit, you can use `scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit. From c096464dd88c664061368e72d78ff04168f0fdd0 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 23 Oct 2023 18:25:03 +0000 Subject: [PATCH 25/26] use full validation set --- algorithmic_efficiency/workloads/fastmri/input_pipeline.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/algorithmic_efficiency/workloads/fastmri/input_pipeline.py b/algorithmic_efficiency/workloads/fastmri/input_pipeline.py index 8f6ddafd1..df59098e4 100644 --- a/algorithmic_efficiency/workloads/fastmri/input_pipeline.py +++ b/algorithmic_efficiency/workloads/fastmri/input_pipeline.py @@ -180,7 +180,7 @@ def load_fastmri_split(global_batch_size, h5_paths = glob.glob(file_pattern) elif split == 'validation': file_pattern = os.path.join(data_dir, _VAL_DIR, '*.h5') - h5_paths = sorted(glob.glob(file_pattern))[:100] + h5_paths = sorted(glob.glob(file_pattern)) elif split == 'test': # The fastmri validation set is split into a validation and test set file_pattern = os.path.join(data_dir, _VAL_DIR, '*.h5') From 730be20c1317b06c682f3709fd09e0336b351326 Mon Sep 17 00:00:00 2001 From: Priya Kasimbeg Date: Mon, 23 Oct 2023 18:26:28 +0000 Subject: [PATCH 26/26] revert fastmri input pipeline change --- algorithmic_efficiency/workloads/fastmri/input_pipeline.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/algorithmic_efficiency/workloads/fastmri/input_pipeline.py b/algorithmic_efficiency/workloads/fastmri/input_pipeline.py index df59098e4..8f6ddafd1 100644 --- a/algorithmic_efficiency/workloads/fastmri/input_pipeline.py +++ b/algorithmic_efficiency/workloads/fastmri/input_pipeline.py @@ -180,7 +180,7 @@ def load_fastmri_split(global_batch_size, h5_paths = glob.glob(file_pattern) elif split == 'validation': file_pattern = os.path.join(data_dir, _VAL_DIR, '*.h5') - h5_paths = sorted(glob.glob(file_pattern)) + h5_paths = sorted(glob.glob(file_pattern))[:100] elif split == 'test': # The fastmri validation set is split into a validation and test set file_pattern = os.path.join(data_dir, _VAL_DIR, '*.h5')
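
The `submission_runner.py` hunks above move `librispeech_conformer` into `compile_error_workloads` and leave only `librispeech_deepspeech` on the eager backend, but the surrounding dispatch logic is not shown in the diff. As a rough orientation, a per-workload `torch.compile` selection of this kind can be sketched as follows; only the three workload lists are taken from the diff, while the function name `maybe_compile`, its signature, and the warning message are illustrative assumptions rather than the repository's actual code.

```python
# Minimal sketch (not the repository's implementation) of choosing a
# torch.compile strategy per workload. The three lists mirror the state of
# submission_runner.py after the patches above.
import logging

import torch

compile_error_workloads = ['librispeech_conformer', 'ogbg', 'criteo1tb']
eager_backend_workloads = ['librispeech_deepspeech']
aot_eager_backend_workloads = []


def maybe_compile(model: torch.nn.Module, workload: str) -> torch.nn.Module:
  if workload in compile_error_workloads:
    # Known to fail under torch.compile: run the model eagerly, uncompiled.
    logging.warning('torch.compile disabled for %s.', workload)
    return model
  if workload in eager_backend_workloads:
    # Capture the graph but skip backend code generation.
    return torch.compile(model, backend='eager')
  if workload in aot_eager_backend_workloads:
    # Ahead-of-time autograd tracing, still without code generation.
    return torch.compile(model, backend='aot_eager')
  # Default path: full compilation with the inductor backend.
  return torch.compile(model)
```

Under a dispatch like this, moving a workload from `eager_backend_workloads` into `compile_error_workloads` means it is no longer wrapped by `torch.compile` at all, which is consistent with the warning branch shown for `compile_error_workloads` in the diff above.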