
🚀 Parallel Training Methods for AI

Sam Foreman
Intro to AI-driven Science on Supercomputers
2024-11-05

👋 Hands On

  1. Submit an interactive job:

    qsub -A ALCFAITP -q by-node -l select=1 -l walltime=01:00:00,filesystems=eagle:home -I
  2. On Sophia, set the proxy environment variables:

    export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
    export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
    export http_proxy="http://proxy.alcf.anl.gov:3128"
    export https_proxy="http://proxy.alcf.anl.gov:3128"
    export ftp_proxy="http://proxy.alcf.anl.gov:3128"
  3. Clone repos:

    1. saforem2/wordplay:

      git clone https://github.com/saforem2/wordplay
      cd wordplay
    2. saforem2/ezpz:

      git clone https://github.com/saforem2/ezpz deps/ezpz
  4. Set up Python:

    export PBS_O_WORKDIR=$(pwd) && source deps/ezpz/src/ezpz/bin/utils.sh
    ezpz_setup_python
    ezpz_setup_job
  5. Install {ezpz, wordplay}:

    python3 -m pip install -e deps/ezpz --require-virtualenv
    python3 -m pip install -e . --require-virtualenv
  6. Set up (or disable) wandb:

    # to setup:
    wandb login
    # to disable:
    export WANDB_DISABLED=1
  7. Test Distributed Setup:

    mpirun -n "${NGPUS}" python3 -m ezpz.test_dist

    See: ezpz/test_dist.py

  8. Prepare Data:

    python3 data/shakespeare_char/prepare.py
  9. Launch Training:

    mpirun -n "${NGPUS}" python3 -m wordplay \
        train.backend=DDP \
        train.eval_interval=100 \
        data=shakespeare \
        train.dtype=bf16 \
        model.batch_size=64 \
        model.block_size=1024 \
        train.max_iters=1000 \
        train.log_interval=10 \
        train.compile=false
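What step 7's sanity check verifies, conceptually, is that every rank can communicate: each rank contributes a value, an all-reduce combines them, and all ranks end up with the identical result. Below is a toy stand-in (threads play the role of MPI ranks; this is not ezpz's actual implementation, which uses `torch.distributed` launched via `mpirun`):

```python
import threading
import queue

WORLD_SIZE = 4
inbox = queue.Queue()                                  # gather channel to "rank 0"
outboxes = [queue.Queue() for _ in range(WORLD_SIZE)]  # one broadcast channel per rank
results = {}

def rank_fn(rank: int) -> None:
    inbox.put(rank + 1)                   # each rank contributes a value
    if rank == 0:                         # "rank 0" reduces, then broadcasts
        total = sum(inbox.get() for _ in range(WORLD_SIZE))
        for q in outboxes:
            q.put(total)
    results[rank] = outboxes[rank].get()  # every rank receives the same sum

threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # every rank sees 1 + 2 + 3 + 4 = 10
```

If any rank cannot reach the others, a check like this hangs or errors instead of printing a uniform result — which is exactly why it runs before the real training job.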
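Step 8's prepare script builds a character-level dataset from the Shakespeare text. The core idea can be sketched as follows (the names and sample text here are illustrative, not the actual `prepare.py` internals): collect the unique characters as a vocabulary, then map text to integer ids and back.

```python
text = "First Citizen: Before we proceed any further, hear me speak."

chars = sorted(set(text))                     # vocabulary: the unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text                    # round-trip check
print(f"vocab size: {len(chars)}, encoded length: {len(ids)}")
```

The real script additionally splits the corpus into train/validation arrays and saves them to disk for the training run.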
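Step 9 selects `train.backend=DDP` (distributed data parallel): every rank holds a full model replica, computes gradients on its own shard of each batch, the gradients are all-reduced (averaged) across ranks, and every rank applies the identical update, keeping the replicas in sync. A torch-free toy sketch of that update rule (the dataset, learning rate, and loop are illustrative, not wordplay's actual training loop):

```python
data = [(x, 2.0 * x) for x in range(8)]                    # toy dataset: y = 2x
world_size = 4
shards = [data[r::world_size] for r in range(world_size)]  # one shard per rank

w = 0.0       # every rank starts from the same weight
lr = 0.01
for _ in range(200):
    # each "rank" computes the gradient of MSE loss on its local shard
    local_grads = [
        sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        for shard in shards
    ]
    # all-reduce: average gradients across ranks, then apply the same update
    g_avg = sum(local_grads) / world_size
    w -= lr * g_avg
print(f"learned w = {w:.4f}")  # converges toward 2.0
```

Because every rank applies the same averaged gradient, the result matches single-process training on the full batch while the gradient computation is split across workers.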

🎒 Homework

Submit proof that you successfully followed the instructions above and launched a distributed data-parallel training run.

Proof can be any of:

  • The output printed to your terminal during the run
  • A path to a logfile on the ALCF filesystems containing the output from a run
  • A screenshot of:
    • the text printed out from the run
    • a graph from the W&B Run
    • anything that clearly shows you were able to run the example
  • A URL to a W&B Run or W&B Report
  • etc.