This tutorial shows experiments to test the fault tolerance and elasticity of DLRover ElasticJob. In the experiments, we use the chaos engineering toolkit chaosblade to simulate fault scenarios.
- Create a k8s cluster and configure cluster credentials on your local computer.
- Deploy DLRover ElasticJob on the k8s cluster following the tutorial.
- Build the image with chaosblade installed, like the example (see the sketch after this list).
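For reference, the prerequisite steps could look roughly like the sketch below; the minikube cluster, the dlrover namespace creation, and the image name and Dockerfile path are assumptions that should be replaced with the values from the deployment tutorial and the example.

# A rough sketch of the prerequisites; names and paths below are assumptions.
minikube start                      # any k8s cluster with configured credentials works
kubectl create namespace dlrover    # the namespace used by the commands in this tutorial
# Deploy the DLRover ElasticJob controller into the cluster following the deployment tutorial.
# Build a training image with chaosblade installed (the Dockerfile path is hypothetical):
docker build -t <your-registry>/dlrover-chaos-test:latest -f examples/pytorch/mnist/chaos.dockerfile .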
We conduct experiments to simulate the following scenarios:
- The Pod is preempted.
- The Pod is a straggler.
- The Pod is placed on a fault node.
- The Pod network breaks down during training.
- The training process crashes in the Pod.
In the experiments, we submit a job with the example, and the command in the worker spec is:
command:
- /bin/bash
- -c
- "dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
--nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
--training_data /data/mnist_png/training/ \
--validation_data /data/mnist_png/testing/"
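The job can be submitted and its Pods watched with standard kubectl commands; a short sketch, reusing the job YAML path that appears later in this tutorial:

# Submit the job and watch its Pods come up.
kubectl -n dlrover apply -f examples/pytorch/mnist/choas_test_job.yaml
kubectl -n dlrover get pods -w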
The Pods of the job are:
chaos-test-edljob-worker-0 1/1 Running 0 85s
chaos-test-edljob-worker-1 1/1 Running 0 85s
chaos-test-edljob-worker-2 1/1 Running 0 85s
chaos-test-edljob-worker-3 1/1 Running 0 85s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 89s
We kill worker-0 with the following command to simulate that the Pod is preempted:
kubectl -n dlrover delete pod chaos-test-edljob-worker-0
After killing worker-0, the job Pods are:
chaos-test-edljob-worker-1 1/1 Running 0 2m3s
chaos-test-edljob-worker-2 1/1 Running 0 2m3s
chaos-test-edljob-worker-3 1/1 Running 0 2m3s
chaos-test-edljob-worker-4 1/1 Running 0 30s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 2m7s
Then, we can check the log of a worker to see whether the training resumes.
kubectl -n dlrover logs chaos-test-edljob-worker-1
loss = 2.298487901687622, step = 0
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
loss = 2.195965051651001, step = 20
loss = 1.2307546138763428, step = 40
loss = 0.6579511761665344, step = 60
loss = 1.0608341693878174, step = 80
loss = 0.7761049270629883, step = 100
In this experiment, we set the number of worker replicas of the job to 4 and
use chaosblade to load the CPU of worker-1 to 90%
with the command
chaosblade-1.7.2/blade create cpu load --cpu-percent 90
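To confirm that the chaos experiment is active and to clean it up afterwards, something like the following sketch can be run inside the worker; the <UID> placeholder stands for the experiment ID printed by blade create:

# Inside worker-1: check that the CPU is loaded, then remove the chaos experiment.
top -b -n 1 | head -n 15                  # CPU usage should be close to the injected 90%
./chaosblade-1.7.2/blade destroy <UID>    # <UID> is returned by the blade create command above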
If you use the image registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:torch201-mnist,
you can use chaosblade to create a chaos experiment by
sh examples/pytorch/mnist/start_chaos.sh cpu-overload
and set the command in the ElasticJob YAML like the example:
command:
- /bin/bash
- -c
- "(bash examples/pytorch/mnist/start_chaos.sh cpu-overload &) && \
dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
--nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
--training_data /data/mnist_png/training/ \
--validation_data /data/mnist_png/testing/"
After submitting an ElasticJob to the k8s cluster by
kubectl -n dlrover apply -f examples/pytorch/mnist/choas_test_job.yaml
we can see that worker-1 exits with errors like
elasticjob-torch-mnist-debug-dlrover-master 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-0 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-1 0/1 Error 0 3h17m
torch-mnist-debug-edljob-worker-2 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-3 0/1 Completed 0 3h17m
torch-mnist-debug-edljob-worker-4 0/1 Completed 0 3h10m
From the log of worker-1 (kubectl -n dlrover logs torch-mnist-debug-edljob-worker-1),
we can see that worker-1 fails because it is a straggler. If we do not want worker-1 to fail for
being a straggler, we can remove --exclude-straggler from the command and use
dlrover-run --network-check only, as in the sketch after the traceback below.
[2023-09-26 03:52:20,235] [INFO] [training.py:707:run] Fault nodes are: [] and stragglers are: [1].
Traceback (most recent call last):
File "/usr/local/bin/dlrover-run", line 8, in <module>
sys.exit(main())
...
File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 733, in run
raise RuntimeError("The node is a straggler and exits.")
RuntimeError: The node is a straggler and exits.
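For reference, a sketch of the worker command without straggler exclusion; all flags are taken from the worker spec above, only --exclude-straggler is removed so a detected straggler is not evicted:

# The same worker command as above, with --exclude-straggler removed.
dlrover-run --network-check --nnodes=3:$WORKER_NUM \
    --nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
    examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
    --training_data /data/mnist_png/training/ \
    --validation_data /data/mnist_png/testing/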
To check for stragglers, we can look in the master log at the elapsed time each node takes to run the check task.
kubectl -n dlrover logs elasticjob-torch-mnist-debug-dlrover-master | grep elapsed
Round 0: The node elapsed time are {2: 20.307, 3: 20.265, 0: 206.872, 1: 151.752}
Round 1: The node elapsed time are {2: 20.307, 3: 20.265, 0: 23.174, 1: 135.961}
Round 2: The node elapsed time are {2: 21.491, 0: 22.685, 3: 20.889, 1: 23.097}
From the log, the elapsed time of worker-1 is much larger than that of the other nodes in the first 2 rounds.
After worker-1 fails, the ElasticJob relaunches a new Pod, worker-4, to replace the failed one, and in
round 2 the elapsed times of all nodes show no significant difference. Note that the index is the WORKER_RANK
of the node, and the WORKER_RANK of worker-4 is the same as that of worker-1.
In this experiment, we set the number of worker replicas of the job to 4 and
use chaosblade to kill the process that runs run_network_check.py
to simulate a fault node,
and set the command in the ElasticJob YAML like the example:
command:
- /bin/bash
- -c
- "(bash examples/pytorch/mnist/start_chaos.sh kill-process &) && \
dlrover-run --network-check --exclude-straggler --nnodes=3:$WORKER_NUM \
--nproc_per_node=2 --max_restarts=3 --rdzv_conf pend_timeout=600 \
examples/pytorch/mnist/cnn_train.py --num_epochs 5 \
--training_data /data/mnist_png/training/ \
--validation_data /data/mnist_png/testing/"
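If you want to trigger the fault by hand instead of through start_chaos.sh, a rough equivalent is to kill the network-check process inside the worker yourself; the pkill pattern below and the availability of pkill in the image are assumptions:

# Manually kill the network-check process in worker-1 (pattern and pkill availability are assumptions).
kubectl -n dlrover exec chaos-test-edljob-worker-1 -- pkill -9 -f run_network_check.py

After the network-check process is killed, worker-1 fails and a new worker-4 is relaunched to replace it: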
chaos-test-edljob-worker-0 1/1 Running 0 12m
chaos-test-edljob-worker-1 0/1 Error 0 12m
chaos-test-edljob-worker-2 1/1 Running 0 12m
chaos-test-edljob-worker-3 1/1 Running 0 12m
chaos-test-edljob-worker-4 1/1 Running 0 3m59s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 12m
The worker-1 fails with the message
Traceback (most recent call last):
....
File "/usr/local/lib/python3.8/site-packages/dlrover/python/elastic_agent/torch/training.py", line 732, in run
raise RuntimeError("The node network is breakdown.")
RuntimeError: The node network is breakdown.
From the master log (kubectl -n dlrover logs elasticjob-chaos-test-dlrover-master | grep "The node status"),
we can see that worker-1 fails the check in the first 2 rounds. After worker-4 starts to replace worker-1,
all nodes are normal.
Round 1: The node status are {1: False, 2: True, 3: True, 0: False}.
Round 2: The node status are {1: False, 2: True, 3: True, 0: True}.
Round 3: The node status are {3: True, 0: True, 1: True, 2: True}.
In this experiment, we set the number of worker replicas of the job to 4 and use chaosblade to set the network loss rate to 100% to simulate that the network of a node breaks down.
We watch the log of worker-1 to check whether the training has started. The training has started once
lines like loss=..., step=... appear
in the log.
After the training starts, we inject a 100% network loss rate
into worker-1 to simulate that the network of worker-1 is broken.
kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
./chaosblade-1.7.2/blade create network loss --percent 100 --interface eth0
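To confirm that the packet loss is in effect before worker-1 fails, you can ping worker-1's Pod IP from another worker; a sketch that assumes ping is available in the image (<WORKER_1_POD_IP> is the IP shown by the first command):

# Get the Pod IP of worker-1, then ping it from worker-0; the ping should report 100% packet loss.
kubectl -n dlrover get pod chaos-test-edljob-worker-1 -o wide
kubectl -n dlrover exec -it chaos-test-edljob-worker-0 -- ping -c 3 <WORKER_1_POD_IP>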
Then, worker-1 fails and a new worker-4 starts to replace it.
chaos-test-edljob-worker-0 1/1 Running 0 4m39s
chaos-test-edljob-worker-1 0/1 Error 0 4m39s
chaos-test-edljob-worker-2 1/1 Running 0 4m39s
chaos-test-edljob-worker-3 1/1 Running 0 4m39s
chaos-test-edljob-worker-4 1/1 Running 0 17s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 4m43s
The log of worker-0 shows that the training also resumes after worker-4 starts.
loss = 0.24101698398590088, step = 0
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
loss = 0.4646361768245697, step = 20
In this experiment, we set the number of worker replicas of the job to 4 and use chaosblade to kill a training process to simulate a GPU error.
We watch the log of worker-1 to check whether the training has started. The training has started once
lines like loss=..., step=... appear
in the log.
After the training starts, we look up a training process in worker-1.
kubectl -n dlrover exec -it chaos-test-edljob-worker-1 bash
ps -aux | grep cnn_train.py
Then, we can kill a training process with kill -9 ${PID}. All workers
are still running, and we can see from the log that the training restarts (see the sketch after the Pod list below).
chaos-test-edljob-worker-0 1/1 Running 0 3m4s
chaos-test-edljob-worker-1 1/1 Running 0 3m4s
chaos-test-edljob-worker-2 1/1 Running 0 3m4s
chaos-test-edljob-worker-3 1/1 Running 0 3m4s
elasticjob-chaos-test-dlrover-master 1/1 Running 0 3m9s
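To confirm that the training restarts inside the same Pods, you can look for a new rendezvous with an increased restart_count in the worker log; a small sketch:

# After the kill, the elastic agent restarts the training processes in place;
# a rendezvous line with a higher restart_count and fresh "loss = ..." lines should appear.
kubectl -n dlrover logs chaos-test-edljob-worker-1 | grep -E "restart_count|loss ="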
In this experiment, we use the example
to submit an elastic training job. In the job, we set min_node=3
and max_node=$WORKER_NUM,
where $WORKER_NUM is the number of worker replicas. The ElasticJob sets the number of replicas
into the environment variable WORKER_NUM.
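You can verify the value injected by the ElasticJob by printing the environment variable inside a running worker; a sketch that assumes printenv is available in the image:

# WORKER_NUM is set by the ElasticJob to the number of worker replicas.
kubectl -n dlrover exec torch-mnist-edljob-worker-0 -- printenv WORKER_NUM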
At first, there are 3 running workers and 1 pending worker due to insufficient resources.
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 57s
torch-mnist-edljob-worker-0 1/1 Running 0 47s
torch-mnist-edljob-worker-1 1/1 Running 0 47s
torch-mnist-edljob-worker-2 1/1 Running 0 47s
torch-mnist-edljob-worker-3 0/1 Pending 0 47s
After about 2 minutes, we can see from the log that the training starts on the 3 running workers.
[2023-09-27 02:23:21,097] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=192.168.0.71
master_port=36725
group_rank=0
group_world_size=3
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[6, 6]
global_world_sizes=[6, 6]
rank 1 is initialized local_rank = 1
loss = 2.3198373317718506, step = 0
loss = 2.2946105003356934, step = 0
loss = 1.7543025016784668, step = 20
Then, we kill another job to release resources, and worker-3 starts.
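The exact command depends on how the competing job was created; if it is another ElasticJob, releasing its resources can look like the sketch below, where the job name other-job is hypothetical:

# Delete a competing ElasticJob to free resources for the pending worker (job name is hypothetical).
kubectl -n dlrover delete elasticjob other-job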
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 5m39s
torch-mnist-edljob-worker-0 1/1 Running 0 5m34s
torch-mnist-edljob-worker-1 1/1 Running 0 5m34s
torch-mnist-edljob-worker-2 1/1 Running 0 5m34s
torch-mnist-edljob-worker-3 1/1 Running 0 5m34s
From the log of worker-0, we can see the training starts with group_world_size=4.
[2023-09-27 02:25:43,362] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=192.168.0.71
master_port=58241
group_rank=0
group_world_size=4
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[8, 8]
global_world_sizes=[8, 8]
rank 1 is initialized local_rank = 1rank 0 is initialized local_rank = 0
loss = 2.2984073162078857, step = 0
loss = 2.1407980918884277, step = 20
loss = 1.1324385404586792, step = 40
loss = 0.4783979058265686, step = 60
loss = 0.5714012384414673, step = 80
loss = 0.6941334009170532, step = 100
In this experiment, we use the example
to submit an elastic training job. In the job, we set min_node=3
and max_node=$WORKER_NUM,
where $WORKER_NUM is the number of worker replicas. The ElasticJob sets the number of replicas
into the environment variable WORKER_NUM.
At first, there are 4 running workers.
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 2m43s
torch-mnist-edljob-worker-0 1/1 Running 0 2m38s
torch-mnist-edljob-worker-1 1/1 Running 0 2m38s
torch-mnist-edljob-worker-2 1/1 Running 0 2m38s
torch-mnist-edljob-worker-3 0/1 Running 0 2m38s
Then, we use chaosblade to make worker-1 fail.
kubectl -n dlrover exec -it torch-mnist-edljob-worker-1 bash
./chaosblade-1.7.2/blade create process kill --process dlrover-run --signal 1
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 4m43s
torch-mnist-edljob-worker-0 1/1 Running 0 4m38s
torch-mnist-edljob-worker-1 0/1 Error 0 4m38s
torch-mnist-edljob-worker-2 1/1 Running 0 4m38s
torch-mnist-edljob-worker-3 1/1 Running 0 4m38s
From the log of worker-0, we can see that the training restores the model and data sampler
from the checkpoint and starts with group_world_size=3.
[2023-09-27 03:18:00,815] [INFO] [training.py:344:_rendezvous] [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=192.168.0.66
master_port=39705
group_rank=0
group_world_size=3
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[6, 6]
global_world_sizes=[6, 6]
[2023-09-27 03:18:05,957] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
[2023-09-27 03:18:05,958] [INFO] [sampler.py:153:load_state_dict] Load epoch = 0, completed num = 51200, num_samples = 1467
loss = 0.2617453336715698, step = 0
loss = 0.2548859417438507, step = 20
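The sampler restore lines can be pulled directly from the worker log, e.g.:

# Show the data sampler restore lines from the worker log.
kubectl -n dlrover logs torch-mnist-edljob-worker-0 | grep load_state_dict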
We conduct experiments with a TensorFlow distributed job using parameter servers (PS) to test the fault tolerance of the worker and the PS.
We can submit a TensorFlow PS job using the example. The job will launch 1 chief, 1 worker and 1 PS.
kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml
deepctr-manual-scale-edljob-chief-0 1/1 Running 0 88s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 88s
deepctr-manual-scale-edljob-worker-0 1/1 Running 0 88s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 99s
We use kubectl
to kill a worker.
kubectl -n dlrover delete pod deepctr-manual-scale-edljob-worker-0
After worker-0 is killed, the job relaunches worker-1 to replace the failed node.
NAME READY STATUS RESTARTS AGE
deepctr-manual-scale-edljob-chief-0 1/1 Running 0 2m57s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 2m57s
deepctr-manual-scale-edljob-worker-1 1/1 Running 0 60s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 3m8s
After the job runs for about 4 minutes, chief-0 fails with OOM due to an insufficient memory configuration. The job relaunches chief-1 with more memory to replace it.
deepctr-manual-scale-edljob-chief-0 0/1 OOMKilled 0 4m53s
deepctr-manual-scale-edljob-chief-1 1/1 Running 0 64s
deepctr-manual-scale-edljob-ps-0 1/1 Running 0 4m53s
deepctr-manual-scale-edljob-worker-1 1/1 Running 0 2m56s
We can view the memory of chief-0 and chief-1 by
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-0 -o yaml | grep memory
>>>
memory: 4Gi
memory: 4Gi
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 -o yaml | grep memory
>>>
memory: 8Gi
memory: 8Gi
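If you only want the memory limit of a single container, a jsonpath query avoids the grep; a sketch that assumes each Pod has one container:

# Print only the container memory limits of chief-0 and chief-1 (assumes one container per Pod).
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-0 -o jsonpath='{.spec.containers[0].resources.limits.memory}'
kubectl -n dlrover get pod deepctr-manual-scale-edljob-chief-1 -o jsonpath='{.spec.containers[0].resources.limits.memory}'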
We can view the log of chief-1 to check whether the training resumes.
[2023-03-20 11:51:10,774] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op.
[2023-03-20 11:51:11,302] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
[2023-03-20 11:51:14,279] [INFO][global_step_hook.py:39:before_run] global_step: 126
We kill ps-0 with kubectl -n dlrover delete pod deepctr-manual-scale-edljob-ps-0.
The job relaunches ps-1 to replace the killed ps-0.
deepctr-manual-scale-edljob-chief-0 0/1 OOMKilled 0 10m
deepctr-manual-scale-edljob-chief-1 1/1 Running 0 7m1s
deepctr-manual-scale-edljob-ps-1 1/1 Running 0 109s
deepctr-manual-scale-edljob-worker-1 0/1 OOMKilled 0 8m53s
deepctr-manual-scale-edljob-worker-2 1/1 Running 0 4m13s
elasticjob-deepctr-manual-scale-dlrover-master 1/1 Running 0 11m
From the log of the chief, we can see that the training job restores the model from the latest checkpoint and continues training it.
[2023-09-26 19:24:00,861] [INFO][saver.py:1531:restore] Restoring parameters from /nas/deepctr/model.ckpt-126
[2023-09-26 19:24:03,473] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op.
[2023-09-26 19:24:03,580] [INFO] [resource.py:164:report_resource] Report Resource CPU : 0.98, Memory 7146, GPU []
[2023-09-26 19:24:03,670] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op.
[2023-09-26 19:24:07,665] [INFO][basic_session_run_hooks.py:627:_save] Saving checkpoints for 126 into /nas/deepctr/model.ckpt.
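When you are done with the experiments, the jobs can be cleaned up by deleting the ElasticJob resources; a sketch that assumes the job names match the Pod-name prefixes shown above:

# Delete the ElasticJobs created in this tutorial (names inferred from the Pod prefixes above).
kubectl -n dlrover delete elasticjob chaos-test torch-mnist torch-mnist-debug deepctr-manual-scale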