End-to-end PyTorch training job for multi-node GPU training on a Kubernetes cluster.
This Helm chart deploys a StatefulSet of N replicas, as defined in the chart's values.yaml. Resources and node affinity may also be defined there, and the chart is generic enough to be ported to training jobs written in most programming languages.
- Kubernetes Cluster (Tested on >= v1.20)
- Two or more nodes with at least one NVIDIA GPU per node (tested on the Ampere architecture)
- NVIDIA NGC Account
Using the pytorch/distributed-training/values.yaml file, set the parameters as needed for your cluster. The most notable options are node affinity and resources: node affinity schedules all pods onto nodes matching the affinity you define, and resources allocates the defined resources to each pod.
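For example, a minimal sketch of the relevant values.yaml entries might look like the following. The keys shown (replicas, resources, affinity) follow standard Kubernetes and Helm conventions, but the exact key names and node labels are assumptions here, so check the chart's own values.yaml and your cluster's GPU node labels.

# Sketch only: verify key names against the chart's values.yaml
replicas: 2                  # N pods in the StatefulSet, typically one per GPU node

resources:
  limits:
    nvidia.com/gpu: 1        # one GPU per pod

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present   # example label; use your cluster's GPU node labels
              operator: In
              values:
                - "true"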
helm install pytorch-distributed pytorch/distributed-training \
--set imageCredentials.password=$NGC_API_KEY \
--set imageCredentials.email=$NGC_EMAIL
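Once the chart is installed, you can confirm that the StatefulSet replicas were scheduled and follow the training output with standard kubectl commands (the pod name below is a placeholder; actual names depend on the release name):

# List the StatefulSet and its pods
kubectl get statefulsets
kubectl get pods

# Stream training output from one replica
kubectl logs -f <pod-name>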
The training arguments follow the PyTorch image classification reference training scripts. You can specify them as shown below in the pytorch/distributed-training/values.yaml file.
Example:
args:
- python
- train.py
- --model=resnet18
- --batch-size=1024
- --epochs=10
- --lr=0.5
- --momentum=0.9
- --weight-decay=0.00002
- --lr-step-size=30
- --lr-gamma=0.1
- --print-freq=1
- --val-resize-size=28
- --val-crop-size=28
- --train-crop-size=28
- --output-dir=/workspace/data/resnet18
- --amp
Reference training scripts for image classification. They serve as a log of how to train specific models and provide baseline training and evaluation scripts to quickly bootstrap research.
Unless otherwise noted, all models have been trained on 8x V100 GPUs with the following parameters:
Parameter | value |
---|---|
--batch_size | 32 |
--epochs | 90 |
--lr | 0.1 |
--momentum | 0.9 |
--wd, --weight-decay | 1e-4 |
--lr-step-size | 30 |
--lr-gamma | 0.1 |
Since AlexNet and the original VGG architectures do not include batch normalization, the default initial learning rate --lr 0.1 is too high.
python train.py \
--model $MODEL --lr 1e-2
Here $MODEL is one of alexnet, vgg11, vgg13, vgg16 or vgg19. Note that vgg11_bn, vgg13_bn, vgg16_bn, and vgg19_bn include batch normalization and thus are trained with the default parameters.
The weights of the GoogLeNet model are ported from the original paper rather than trained from scratch.
The weights of the Inception V3 model are ported from the original paper rather than trained from scratch.
Since the model expects tensors with a size of N x 3 x 299 x 299, use the following command to validate it:
python train.py --model inception_v3 \
--test-only --weights Inception_V3_Weights.IMAGENET1K_V1
python train.py --model $MODEL
Here $MODEL is one of resnet18, resnet34, resnet50, resnet101 or resnet152.
python train.py \
--model $MODEL --epochs 100
Here $MODEL is one of resnext50_32x4d or resnext101_32x8d.
Note that the above command corresponds to a single node with 8 GPUs. If you use a different number of GPUs and/or a different batch size, then the learning rate should be scaled accordingly. For example, the pretrained model provided by torchvision was trained on 8 nodes, each with 8 GPUs (for a total of 64 GPUs), with --batch_size 16 and --lr 0.4, instead of the current defaults which are respectively batch_size=32 and lr=0.1: the global batch size grows from 8 x 32 = 256 to 64 x 16 = 1024, a 4x increase, matching the 4x increase in learning rate from 0.1 to 0.4.
python train.py \
--model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004 \
--lr-step-size 1 --lr-gamma 0.98
python train.py \
--model $MODEL --epochs 600 --opt rmsprop --batch-size 128 --lr 0.064 \
--wd 0.00001 --lr-step-size 2 --lr-gamma 0.973 --auto-augment imagenet --random-erase 0.2
Here $MODEL is one of mobilenet_v3_large or mobilenet_v3_small.
Then we averaged the parameters of the last 3 checkpoints that improved the Acc@1. See #3182 and #3354 for details.
The weights of the B0-B4 variants are ported from Ross Wightman's timm repo.
The weights of the B5-B7 variants are ported from Luke Melas' EfficientNet-PyTorch repo.
All models were trained using bicubic interpolation and each has custom crop and resize sizes. To validate the models, use the following commands:
python train.py --model efficientnet_b0 --test-only --weights EfficientNet_B0_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b1 --test-only --weights EfficientNet_B1_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b2 --test-only --weights EfficientNet_B2_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b3 --test-only --weights EfficientNet_B3_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b4 --test-only --weights EfficientNet_B4_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b5 --test-only --weights EfficientNet_B5_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b6 --test-only --weights EfficientNet_B6_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b7 --test-only --weights EfficientNet_B7_Weights.IMAGENET1K_V1
python train.py \
--model $MODEL --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 \
--train-crop-size $TRAIN_SIZE --model-ema --val-crop-size $EVAL_SIZE --val-resize-size $EVAL_SIZE \
--ra-sampler --ra-reps 4
Here $MODEL is one of efficientnet_v2_s and efficientnet_v2_m.
Note that the Small variant had a $TRAIN_SIZE of 300 and a $EVAL_SIZE of 384, while the Medium variant used 384 and 480 respectively.
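For example, substituting those sizes into the command above gives the full command for the Small variant:

python train.py \
--model efficientnet_v2_s --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 \
--train-crop-size 300 --model-ema --val-crop-size 384 --val-resize-size 384 \
--ra-sampler --ra-reps 4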
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 4 nodes, each with 8 GPUs (for a total of 32 GPUs), and --batch_size 32.
The weights of the Large variant are ported from the original paper rather than trained from scratch. See the EfficientNet_V2_L_Weights entry for their exact preprocessing transforms.
python train.py \
--model $MODEL --epochs 100 --batch-size 128 --wd 0.00005 --lr=0.8 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here $MODEL is one of regnet_x_400mf, regnet_x_800mf, regnet_x_1_6gf, regnet_y_400mf, regnet_y_800mf and regnet_y_1_6gf. Please note we used learning rate 0.4 for regnet_y_400mf to get the same Acc@1 as [the paper](https://arxiv.org/abs/2003.13678).
python train.py \
--model $MODEL --epochs 100 --batch-size 64 --wd 0.00005 --lr=0.4 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here $MODEL is one of regnet_x_3_2gf, regnet_x_8gf, regnet_x_16gf, regnet_y_3_2gf and regnet_y_8gf.
python train.py \
--model $MODEL --epochs 100 --batch-size 32 --wd 0.00005 --lr=0.2 \
--lr-scheduler=cosineannealinglr --lr-warmup-method=linear \
--lr-warmup-epochs=5 --lr-warmup-decay=0.1
Here $MODEL is one of regnet_x_32gf, regnet_y_16gf and regnet_y_32gf.
python train.py \
--model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
--lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
--lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra \
--clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs), and --batch_size 64.
python train.py \
--model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
--lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
--lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet \
--clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 256.
python train.py \
--model vit_l_16 --epochs 600 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr \
--lr-warmup-method linear --lr-warmup-epochs 5 --label-smoothing 0.1 --mixup-alpha 0.2 \
--auto-augment ta_wide --random-erase 0.1 --weight-decay 0.00002 --norm-weight-decay 0.0 \
--clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema --val-resize-size 232
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 64.
python train.py \
--model vit_l_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3 \
--lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
--lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra \
--clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs), and --batch_size 64.
python train.py \
--model $MODEL --batch-size 128 --opt adamw --lr 1e-3 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.05 --norm-weight-decay 0.0 \
--train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4
Here $MODEL is one of convnext_tiny, convnext_small, convnext_base and convnext_large. Note that each variant had its --val-resize-size optimized in a post-training step; see their Weights entry for their exact value.
Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 64.
python train.py \
--model $MODEL --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 \
--norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 \
--lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear --lr-warmup-epochs 20 \
--lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 --clip-grad-norm 5.0 \
--cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic --auto-augment ta_wide \
--model-ema --ra-sampler --ra-reps 4 --val-resize-size 224
Here $MODEL
is one of swin_t
, swin_s
or swin_b
.
Note that --val-resize-size
was optimized in a post-training step, see their Weights
entry for the exact value.
python train.py \
--model=$MODEL --batch-size=128 \
--lr=0.5 --lr-scheduler=cosineannealinglr --lr-warmup-epochs=5 --lr-warmup-method=linear \
--auto-augment=ta_wide --epochs=600 --random-erase=0.1 --weight-decay=0.00002 \
--norm-weight-decay=0.0 --label-smoothing=0.1 --mixup-alpha=0.2 --cutmix-alpha=1.0 \
--train-crop-size=176 --model-ema --val-resize-size=232 --ra-sampler --ra-reps=4
Here $MODEL is either shufflenet_v2_x1_5 or shufflenet_v2_x2_0.
The models shufflenet_v2_x0_5 and shufflenet_v2_x1_0 were contributed by the community. See PR-849 for details.
Automatic Mixed Precision (AMP) training on GPU for PyTorch can be enabled with the torch.cuda.amp module. Mixed precision training makes use of both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor Cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput, generally without loss in model accuracy. Mixed precision training also often allows larger batch sizes. GPU automatic mixed precision training for PyTorch Vision can be enabled via the --amp flag, as in the example below.
python train.py \
--model resnext50_32x4d --epochs 100 --amp
For all post training quantized models, the settings are:
- num_calibration_batches: 32
- num_workers: 16
- batch_size: 32
- eval_batch_size: 128
- backend: 'fbgemm'
python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' --model="$MODEL"
Here $MODEL is one of googlenet, inception_v3, resnet18, resnet50, resnext101_32x8d, shufflenet_v2_x0_5 and shufflenet_v2_x1_0.
Here are the commands that we use to quantize the shufflenet_v2_x1_5 and shufflenet_v2_x2_0 models.
# For shufflenet_v2_x1_5
python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' \
--model=shufflenet_v2_x1_5 --weights="ShuffleNet_V2_X1_5_Weights.IMAGENET1K_V1" \
--train-crop-size 176 --val-resize-size 232 --data-path /datasets01_ontap/imagenet_full_size/061417/
# For shufflenet_v2_x2_0
python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' \
--model=shufflenet_v2_x2_0 --weights="ShuffleNet_V2_X2_0_Weights.IMAGENET1K_V1" \
--train-crop-size 176 --val-resize-size 232 --data-path /datasets01_ontap/imagenet_full_size/061417/
For Mobilenet-v2, the model was trained with quantization aware training; the settings used are:
- num_workers: 16
- batch_size: 32
- eval_batch_size: 128
- backend: 'qnnpack'
- learning-rate: 0.0001
- num_epochs: 90
- num_observer_update_epochs: 4
- num_batch_norm_update_epochs: 3
- momentum: 0.9
- lr_step_size: 30
- lr_gamma: 0.1
- weight-decay: 0.0001
python train_quantization.py --model='mobilenet_v2'
Training converges at about 10 epochs.
For Mobilenet-v3 Large, the model was trained with quantization aware training; the settings used are:
- num_workers: 16
- batch_size: 32
- eval_batch_size: 128
- backend: 'qnnpack'
- learning-rate: 0.001
- num_epochs: 90
- num_observer_update_epochs: 4
- num_batch_norm_update_epochs: 3
- momentum: 0.9
- lr_step_size: 30
- lr_gamma: 0.1
- weight-decay: 0.00001
python train_quantization.py --model='mobilenet_v3_large' \
--wd 0.00001 --lr 0.001
For post-training quantization, the device is set to CPU. For quantization-aware training, the device is set to CUDA.
To evaluate quantized models using the pre-trained weights:
python train_quantization.py --device='cpu' --test-only --backend='<backend>' --model='<model_name>'
For inception_v3 you need to pass the following extra parameters:
--val-resize-size 342 --val-crop-size 299 --train-crop-size 299
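For example, combining the evaluation command above with these extra parameters (using the fbgemm backend listed for the post-training quantized models):

python train_quantization.py --device='cpu' --test-only --backend='fbgemm' --model='inception_v3' \
--val-resize-size 342 --val-crop-size 299 --train-crop-size 299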