🚀 Parallel Training Methods for AI

Sam Foreman
Intro to AI-driven Science on Supercomputers
2024-11-05

👋 Hands On

  1. Submit an interactive job:

    qsub -A ALCFAITP -q by-node -l select=1 -l walltime=01:00:00,filesystems=eagle:home -I
  2. On Sophia, set the proxy environment variables:

    export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
    export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
    export http_proxy="http://proxy.alcf.anl.gov:3128"
    export https_proxy="http://proxy.alcf.anl.gov:3128"
    export ftp_proxy="http://proxy.alcf.anl.gov:3128"
  3. Clone repos:

    1. saforem2/wordplay:

      git clone https://github.com/saforem2/wordplay
      cd wordplay
    2. saforem2/ezpz:

      git clone https://github.com/saforem2/ezpz deps/ezpz
  4. Set up Python:

    export PBS_O_WORKDIR=$(pwd) && source deps/ezpz/src/ezpz/bin/utils.sh
    ezpz_setup_python
    ezpz_setup_job
  5. Install {ezpz, wordplay}:

    python3 -m pip install -e deps/ezpz --require-virtualenv
    python3 -m pip install -e . --require-virtualenv
  6. Set up (or disable) wandb:

    # to setup:
    wandb login
    # to disable:
    export WANDB_DISABLED=1
  7. Test Distributed Setup:

    mpirun -n "${NGPUS}" python3 -m ezpz.test_dist

    See: ezpz/test_dist.py (a minimal plain-PyTorch version of this kind of check is sketched after this list)

  8. Prepare Data:

    python3 data/shakespeare_char/prepare.py
  9. Launch Training (see the DDP sketch after this list):

    mpirun -n "${NGPUS}" python3 -m wordplay \
        train.backend=DDP \
        train.eval_interval=100 \
        data=shakespeare \
        train.dtype=bf16 \
        model.batch_size=64 \
        model.block_size=1024 \
        train.max_iters=1000 \
        train.log_interval=10 \
        train.compile=false
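
For a sense of what a distributed sanity check does at the framework level, below is a minimal sketch in plain PyTorch. This is not the ezpz code itself, and the filename minimal_test_dist.py is made up here. It assumes one process per GPU, the nccl backend, and that RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are already set in the environment (which is what the ezpz job setup in step 4 is meant to take care of). Each rank contributes its rank id to an all_reduce; if the sum comes back correct on every rank, the processes can talk to each other.

    # minimal_test_dist.py -- hypothetical stand-in for `python3 -m ezpz.test_dist`
    # Assumes RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are already exported.
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")   # rendezvous via env:// by default
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        torch.cuda.set_device(rank % torch.cuda.device_count())

        # Every rank contributes its rank id; the summed result verifies
        # that all ranks can actually communicate.
        t = torch.tensor([float(rank)], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        expected = world_size * (world_size - 1) / 2
        print(f"[{rank}/{world_size}] all_reduce sum = {t.item()} (expected {expected})")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

If those environment variables are in place, it can be launched the same way as the real test, e.g. mpirun -n "${NGPUS}" python3 minimal_test_dist.py.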
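
For context on what train.backend=DDP selects in the last step: with distributed data parallelism, every rank holds a full copy of the model, processes its own shard of each batch, and gradients are averaged across ranks before each optimizer step. The sketch below is not the wordplay training loop; it is a generic PyTorch DistributedDataParallel example with a toy linear model and random data (the filename ddp_sketch.py is made up), shown only to illustrate the moving parts behind the mpirun launch above.

    # ddp_sketch.py -- generic DDP example, not the wordplay trainer
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        device = rank % torch.cuda.device_count()
        torch.cuda.set_device(device)

        model = nn.Linear(1024, 1024).to(device)
        model = DDP(model, device_ids=[device])    # handles gradient all-reduce
        opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
        loss_fn = nn.MSELoss()

        for step in range(10):
            # In real data-parallel training each rank would load a distinct
            # shard of the dataset; random tensors stand in for that here.
            x = torch.randn(64, 1024, device=device)
            y = torch.randn(64, 1024, device=device)
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()                        # gradients averaged across ranks
            opt.step()
            if rank == 0:
                print(f"step {step}: loss = {loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()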

🎒 Homework

Submit proof that you were able to follow the instructions above and launch a distributed data parallel training run.

Where proof can be any of:

  • The contents printed out to your terminal during the run
  • A path to a logfile containing the output from a run on the ALCF filesystems
  • A screenshot of:
    • the text printed out from the run
    • a graph from the W&B Run
    • anything that clearly shows you were able to run the example
  • A URL to a W&B Run or W&B Report
  • etc.