ATorch is an extension library for PyTorch developed by Ant Group's AI Infrastructure team. By decoupling model definition from training optimization strategy, ATorch provides an efficient and easy-to-use model training experience. Its design principle is to minimally disrupt the native PyTorch programming style. Through its API, ATorch offers performance optimizations in areas such as I/O, preprocessing, computation, and communication (including automatic optimization). ATorch has supported large-scale pretraining of LLMs with over 100 billion parameters on thousands of A100/H100 GPUs.
- Easy-to-use interface
  - auto_accelerate API (see the sketch after this list)
  - ATorchTrainer (ongoing work)
- Solutions for large-scale model training
  - Efficient large model initialization, checkpoint save/load, and restart with elastic resources
- Automatic/semi-automatic optimization
  - Acceleration Engine for automatic optimization
  - Semi-automatic optimization with support for custom optimization strategies
- Hybrid parallelism support (arbitrary combination of fsdp/zero/ddp/tp/sp/pp)
- High-performance operators
  - Flash Attention 2 with custom mask support
  - Transformer ops
  - High-performance MoE
- Sub-graph compilation
- Checkpointing
- Mixed precision
- Communication optimization
  - Cached sharding
- Effective optimizers for fast training convergence
- IO/Preprocessing
  - CPU/GPU coworker to speed up data preprocessing
  - IO optimization for different datasets
- Elasticity and fault tolerance
  - Hardware error detection and migration (with DLRover)
  - GPU elastic training support
  - HangDetector (detects a hung distributed training job and automatically restarts it)
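As a taste of the interface, here is a minimal sketch of the auto_accelerate API. The toy model, dataset, loss function, and keyword arguments are illustrative assumptions; consult the auto_accelerate API documentation for the authoritative signature.

```python
import torch
from atorch.auto import auto_accelerate

# Toy model and dataset so the sketch is self-contained.
model = torch.nn.Linear(16, 4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16))

def loss_func(batch, output):
    # Hypothetical loss signature (batch, model output); adapt to your task.
    return output.float().pow(2).mean()

# auto_accelerate returns a status flag, the optimized training objects,
# and the strategy it settled on.
status, result, best_strategy = auto_accelerate(
    model,
    optim_func=torch.optim.AdamW,          # optimizer created on the (possibly wrapped) model
    dataset=dataset,
    loss_func=loss_func,
    model_input_format="unpack_sequence",  # how a batch is passed to the model (assumed option)
    optim_args={"lr": 1e-3},
)
assert status  # False means no usable optimization strategy was found

# Train with the optimization-applied objects returned in `result`.
model, optim, dataloader = result.model, result.optim, result.dataloader
```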
ATorch supports PyTorch 1.12 and later; version 2.1 or above is preferred. For example, you can use the docker image easydl/atorch:iml_pt210, which has PyTorch 2.1 installed.
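For instance, you can start an interactive container from that image as follows (the --gpus flag assumes the NVIDIA Container Toolkit is installed on the host):

```bash
docker run -it --gpus all easydl/atorch:iml_pt210 bash
```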
Install atorch in any PyTorch-preinstalled environment (such as a container created from the docker image above) with pip:

```bash
pip install atorch
```
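You can then confirm the installation by inspecting the installed package metadata:

```bash
pip show atorch
```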
Alternatively, build and install ATorch from source:

```bash
# clone repository
git clone https://github.com/intelligent-machine-learning/dlrover.git
cd dlrover/atorch
# build package
sh dev/scripts/build.sh
# install the created package from the dist directory
# (the version in the wheel filename may differ)
pip install dist/atorch-0.1.0.dev0-py3-none-any.whl
```
- To run the auto_accelerate examples:

```bash
cd dlrover/atorch/examples/auto_accelerate

# Single-process training
python train.py --model_type toy

# Distributed training
python -m atorch.distributed.run --nproc_per_node 2 train.py --model_type llama --distributed --load_strategy --use_fsdp --use_amp --use_module_replace --use_checkpointing
```
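In the distributed example, the --use_* flags are arguments of the example's train.py that select semi-automatic optimization strategies (FSDP, mixed precision, module replacement, activation checkpointing). For multi-node runs, the launcher is started on every node; the sketch below assumes atorch.distributed.run mirrors torchrun's rendezvous flags, which is an assumption worth checking against `python -m atorch.distributed.run --help`:

```bash
# Hypothetical two-node launch; flag names assume a torchrun-compatible interface.
python -m atorch.distributed.run \
    --nnodes 2 --nproc_per_node 8 \
    --node_rank 0 --master_addr "$MASTER_ADDR" --master_port 29500 \
    train.py --model_type llama --distributed
```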
Contributions are welcome! If you have any suggestions, ideas, or bug reports, please open an issue or submit a pull request.
We leverage GitHub Actions to automate our development, release, and deployment workflows. Please check out this documentation to see how the automated workflows operate.