Multi-node PyTorch Training With Kubernetes

End-to-end PyTorch training job for multi-node GPU training on a Kubernetes cluster.

Intro

This Helm chart deploys a StatefulSet of N replicas, as defined in the chart's values.yaml. The values file is also where resources and node affinity may be defined, and the setup is generic enough to port to most programming languages.

Prerequisites

  1. Kubernetes Cluster (Tested on >= v1.20)
  2. Two or more nodes with at least one NVIDIA GPU per node (Tested on Ampere architecture)
  3. NVIDIA NGC Account

Install

Using the pytorch/distributed-training/values.yaml file, set the parameters as needed for your cluster. The most notable options are node affinity and resources: node affinity schedules every pod onto nodes that match your affinity rules, and the resources block sets the compute resources requested by each pod. A sketch of the relevant fields is shown below.
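
The exact key names depend on the chart's values.yaml, so the replicas, resources, and affinity fields below are a minimal, illustrative sketch rather than the chart's actual schema; the nvidia.com/gpu.product label is the one published by NVIDIA GPU feature discovery, and the GPU model shown is only an example.

replicas: 2                           # one pod per GPU node
resources:
  limits:
    nvidia.com/gpu: 1                 # GPUs allocated to each pod
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB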

helm install pytorch-distributed pytorch/distributed-training \
    --set imageCredentials.password=$NGC_API_KEY \
    --set imageCredentials.email=$NGC_EMAIL

The training arguments follow the PyTorch image classification reference training scripts. You can specify them in the pytorch/distributed-training/values.yaml file as shown below.

Example:

args:
  - python
  - train.py
  - --model=resnet18
  - --batch-size=1024
  - --epochs=10
  - --lr=0.5
  - --momentum=0.9
  - --weight-decay=0.00002
  - --lr-step-size=30
  - --lr-gamma=0.1
  - --print-freq=1
  - --val-resize-size=28
  - --val-crop-size=28
  - --train-crop-size=28
  - --output-dir=/workspace/data/resnet18
  - --amp

Image classification reference training scripts

Reference training scripts for image classification. They serve as a log of how to train specific models and provide baseline training and evaluation scripts to quickly bootstrap research.

Unless otherwise noted, all models have been trained on 8x V100 GPUs with the following parameters:

Parameter               Value
--batch_size            32
--epochs                90
--lr                    0.1
--momentum              0.9
--wd, --weight-decay    1e-4
--lr-step-size          30
--lr-gamma              0.1
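
Spelled out as a single command, these defaults correspond to something like the following sketch ($MODEL is any model name from the sections below; the data path and the distributed-launch flags for the 8 GPUs are omitted):

python train.py\
    --model $MODEL --batch-size 32 --epochs 90 --lr 0.1 --momentum 0.9\
    --weight-decay 1e-4 --lr-step-size 30 --lr-gamma 0.1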

AlexNet and VGG

Since AlexNet and the original VGG architectures do not include batch normalization, the default initial learning rate --lr 0.1 is too high.

python train.py\
    --model $MODEL --lr 1e-2

Here $MODEL is one of alexnet, vgg11, vgg13, vgg16 or vgg19. Note that vgg11_bn, vgg13_bn, vgg16_bn, and vgg19_bn include batch normalization and thus are trained with the default parameters.

GoogLeNet

The weights of the GoogLeNet model are ported from the original paper rather than trained from scratch.

Inception V3

The weights of the Inception V3 model are ported from the original paper rather than trained from scratch.

Since the model expects tensors with a size of N x 3 x 299 x 299, use the following command to validate it:

python train.py --model inception_v3\
      --test-only --weights Inception_V3_Weights.IMAGENET1K_V1

ResNet

python train.py --model $MODEL

Here $MODEL is one of resnet18, resnet34, resnet50, resnet101 or resnet152.

ResNext

python train.py\
    --model $MODEL --epochs 100

Here $MODEL is one of resnext50_32x4d or resnext101_32x8d. Note that the above command corresponds to a single node with 8 GPUs. If you use a different number of GPUs and/or a different batch size, then the learning rate should be scaled accordingly. For example, the pretrained model provided by torchvision was trained on 8 nodes, each with 8 GPUs (for a total of 64 GPUs), with --batch_size 16 and --lr 0.4, instead of the current defaults, which are batch_size=32 and lr=0.1, respectively.
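
The numbers above are consistent with linear learning-rate scaling, i.e. scaling the learning rate in proportion to the effective global batch size (GPUs x per-GPU batch size); a quick sanity check, assuming that rule:

# Defaults:        8 GPUs x batch 32 =  256 images per step at lr 0.1
# Pretrained run: 64 GPUs x batch 16 = 1024 images per step, 4x larger,
# so the learning rate is scaled by 4x as well: 0.1 -> 0.4
awk 'BEGIN { print 0.1 * (64 * 16) / (8 * 32) }'   # prints 0.4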

MobileNetV2

python train.py\
     --model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004\
     --lr-step-size 1 --lr-gamma 0.98

MobileNetV3 Large & Small

python train.py\
     --model $MODEL --epochs 600 --opt rmsprop --batch-size 128 --lr 0.064\
     --wd 0.00001 --lr-step-size 2 --lr-gamma 0.973 --auto-augment imagenet --random-erase 0.2

Here $MODEL is one of mobilenet_v3_large or mobilenet_v3_small.

Then we averaged the parameters of the last 3 checkpoints that improved the Acc@1. See #3182 and #3354 for details.

EfficientNet-V1

The weights of the B0-B4 variants are ported from Ross Wightman's timm repo.

The weights of the B5-B7 variants are ported from Luke Melas' EfficientNet-PyTorch repo.

All models were trained using bicubic interpolation, and each has custom crop and resize sizes. To validate the models use the following commands:

python train.py --model efficientnet_b0 --test-only --weights EfficientNet_B0_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b1 --test-only --weights EfficientNet_B1_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b2 --test-only --weights EfficientNet_B2_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b3 --test-only --weights EfficientNet_B3_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b4 --test-only --weights EfficientNet_B4_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b5 --test-only --weights EfficientNet_B5_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b6 --test-only --weights EfficientNet_B6_Weights.IMAGENET1K_V1
python train.py --model efficientnet_b7 --test-only --weights EfficientNet_B7_Weights.IMAGENET1K_V1

EfficientNet-V2

python train.py \
--model $MODEL --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.00002 --norm-weight-decay 0.0 \
--train-crop-size $TRAIN_SIZE --model-ema --val-crop-size $EVAL_SIZE --val-resize-size $EVAL_SIZE \
--ra-sampler --ra-reps 4

Here $MODEL is one of efficientnet_v2_s and efficientnet_v2_m. Note that the Small variant had a $TRAIN_SIZE of 300 and a $EVAL_SIZE of 384, while the Medium variant used 384 and 480, respectively.

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 4 nodes, each with 8 GPUs (for a total of 32 GPUs), and --batch_size 32.
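
Putting the placeholders together for the Small variant, the shell variables would be set as follows (values taken from the note above; shown for illustration):

MODEL=efficientnet_v2_s
TRAIN_SIZE=300     # training crop size for the Small variant
EVAL_SIZE=384      # validation resize/crop size for the Small variant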

The weights of the Large variant are ported from the original paper rather than trained from scratch. See the EfficientNet_V2_L_Weights entry for their exact preprocessing transforms.

RegNet

Small models

python train.py\
     --model $MODEL --epochs 100 --batch-size 128 --wd 0.00005 --lr=0.8\
     --lr-scheduler=cosineannealinglr --lr-warmup-method=linear\
     --lr-warmup-epochs=5 --lr-warmup-decay=0.1

Here $MODEL is one of regnet_x_400mf, regnet_x_800mf, regnet_x_1_6gf, regnet_y_400mf, regnet_y_800mf and regnet_y_1_6gf. Please note we used learning rate 0.4 for regnet_y_400mf to get the same Acc@1 as the paper (https://arxiv.org/abs/2003.13678).

Medium models

python train.py\
     --model $MODEL --epochs 100 --batch-size 64 --wd 0.00005 --lr=0.4\
     --lr-scheduler=cosineannealinglr --lr-warmup-method=linear\
     --lr-warmup-epochs=5 --lr-warmup-decay=0.1

Here $MODEL is one of regnet_x_3_2gf, regnet_x_8gf, regnet_x_16gf, regnet_y_3_2gf and regnet_y_8gf.

Large models

python train.py\
     --model $MODEL --epochs 100 --batch-size 32 --wd 0.00005 --lr=0.2\
     --lr-scheduler=cosineannealinglr --lr-warmup-method=linear\
     --lr-warmup-epochs=5 --lr-warmup-decay=0.1

Here $MODEL is one of regnet_x_32gf, regnet_y_16gf and regnet_y_32gf.

Vision Transformer

vit_b_16

python train.py\
    --model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs), and --batch_size 64.

vit_b_32

python train.py\
    --model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 256.

vit_l_16

python train.py\
    --model vit_l_16 --epochs 600 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr\
    --lr-warmup-method linear --lr-warmup-epochs 5 --label-smoothing 0.1 --mixup-alpha 0.2\
    --auto-augment ta_wide --random-erase 0.1 --weight-decay 0.00002 --norm-weight-decay 0.0\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema --val-resize-size 232

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 64.

vit_l_32

python train.py\
    --model vit_l_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs), and --batch_size 64.

ConvNeXt

python train.py\
--model $MODEL --batch-size 128 --opt adamw --lr 1e-3 --lr-scheduler cosineannealinglr \
--lr-warmup-epochs 5 --lr-warmup-method linear --auto-augment ta_wide --epochs 600 --random-erase 0.1 \
--label-smoothing 0.1 --mixup-alpha 0.2 --cutmix-alpha 1.0 --weight-decay 0.05 --norm-weight-decay 0.0 \
--train-crop-size 176 --model-ema --val-resize-size 232 --ra-sampler --ra-reps 4

Here $MODEL is one of convnext_tiny, convnext_small, convnext_base and convnext_large. Note that each variant had its --val-resize-size optimized in a post-training step; see each variant's Weights entry for the exact value.

Note that the above command corresponds to training on a single node with 8 GPUs. For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs), and --batch_size 64.

SwinTransformer

python train.py\
--model $MODEL --epochs 300 --batch-size 128 --opt adamw --lr 0.001 --weight-decay 0.05 \
--norm-weight-decay 0.0 --bias-weight-decay 0.0 --transformer-embedding-decay 0.0 \
--lr-scheduler cosineannealinglr --lr-min 0.00001 --lr-warmup-method linear \
--lr-warmup-epochs 20 --lr-warmup-decay 0.01 --amp --label-smoothing 0.1 --mixup-alpha 0.8 \
--clip-grad-norm 5.0 --cutmix-alpha 1.0 --random-erase 0.25 --interpolation bicubic \
--auto-augment ta_wide --model-ema --ra-sampler --ra-reps 4 --val-resize-size 224

Here $MODEL is one of swin_t, swin_s or swin_b. Note that --val-resize-size was optimized in a post-training step, see their Weights entry for the exact value.

ShuffleNet V2

python train.py \
--model=$MODEL --batch-size=128 \
--lr=0.5 --lr-scheduler=cosineannealinglr --lr-warmup-epochs=5 --lr-warmup-method=linear \
--auto-augment=ta_wide --epochs=600 --random-erase=0.1 --weight-decay=0.00002 \
--norm-weight-decay=0.0 --label-smoothing=0.1 --mixup-alpha=0.2 --cutmix-alpha=1.0 \
--train-crop-size=176 --model-ema --val-resize-size=232 --ra-sampler --ra-reps=4

Here $MODEL is either shufflenet_v2_x1_5 or shufflenet_v2_x2_0.

The models shufflenet_v2_x0_5 and shufflenet_v2_x1_0 were contributed by the community. See PR-849 for details.

Mixed precision training

Automatic Mixed Precision (AMP) training on GPU for PyTorch can be enabled with torch.cuda.amp.

Mixed precision training makes use of both FP32 and FP16 precisions where appropriate. FP16 operations can leverage the Tensor Cores on NVIDIA GPUs (Volta, Turing or newer architectures) for improved throughput, generally without loss in model accuracy. Mixed precision training also often allows larger batch sizes. GPU automatic mixed precision training for PyTorch Vision can be enabled via the --amp flag.

python train.py\
    --model resnext50_32x4d --epochs 100 --amp

Quantized

Post training quantized models

For all post training quantized models, the settings are:

  1. num_calibration_batches: 32
  2. num_workers: 16
  3. batch_size: 32
  4. eval_batch_size: 128
  5. backend: 'fbgemm'

python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' --model="$MODEL"

Here $MODEL is one of googlenet, inception_v3, resnet18, resnet50, resnext101_32x8d, shufflenet_v2_x0_5 and shufflenet_v2_x1_0.

Quantized ShuffleNet V2

Here are the commands that we use to quantize the shufflenet_v2_x1_5 and shufflenet_v2_x2_0 models.

# For shufflenet_v2_x1_5
python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' \
    --model=shufflenet_v2_x1_5 --weights="ShuffleNet_V2_X1_5_Weights.IMAGENET1K_V1" \
    --train-crop-size 176 --val-resize-size 232 --data-path /datasets01_ontap/imagenet_full_size/061417/

# For shufflenet_v2_x2_0
python train_quantization.py --device='cpu' --post-training-quantize --backend='fbgemm' \
    --model=shufflenet_v2_x2_0 --weights="ShuffleNet_V2_X2_0_Weights.IMAGENET1K_V1" \
    --train-crop-size 176 --val-resize-size 232 --data-path /datasets01_ontap/imagenet_full_size/061417/

QAT MobileNetV2

For MobileNetV2, the model was trained with quantization-aware training; the settings used are:

  1. num_workers: 16
  2. batch_size: 32
  3. eval_batch_size: 128
  4. backend: 'qnnpack'
  5. learning-rate: 0.0001
  6. num_epochs: 90
  7. num_observer_update_epochs: 4
  8. num_batch_norm_update_epochs: 3
  9. momentum: 0.9
  10. lr_step_size: 30
  11. lr_gamma: 0.1
  12. weight-decay: 0.0001

python train_quantization.py --model='mobilenet_v2'

Training converges at about 10 epochs.

QAT MobileNetV3

For MobileNetV3 Large, the model was trained with quantization-aware training; the settings used are:

  1. num_workers: 16
  2. batch_size: 32
  3. eval_batch_size: 128
  4. backend: 'qnnpack'
  5. learning-rate: 0.001
  6. num_epochs: 90
  7. num_observer_update_epochs: 4
  8. num_batch_norm_update_epochs: 3
  9. momentum: 0.9
  10. lr_step_size: 30
  11. lr_gamma: 0.1
  12. weight-decay: 0.00001

python train_quantization.py --model='mobilenet_v3_large' \
    --wd 0.00001 --lr 0.001

For post-training quantization, the device is set to CPU. For quantization-aware training, the device is set to CUDA.
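
For example, a quantization-aware training run of the MobileNetV2 recipe above on GPU would pass the device flag explicitly (shown for illustration; 'cuda' is the standard PyTorch device string):

python train_quantization.py --model='mobilenet_v2' --device='cuda'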

Command to evaluate quantized models using the pre-trained weights:

python train_quantization.py --device='cpu' --test-only --backend='<backend>' --model='<model_name>'

For inception_v3 you need to pass the following extra parameters:

--val-resize-size 342 --val-crop-size 299 --train-crop-size 299
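
For example, combining the two, evaluating the quantized inception_v3 model with the fbgemm backend looks like:

python train_quantization.py --device='cpu' --test-only --backend='fbgemm' --model='inception_v3' \
    --val-resize-size 342 --val-crop-size 299 --train-crop-size 299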
