I’m reading through the Autotuner code and found this function:

DeepSpeed/deepspeed/autotuning/autotuner.py
Lines 278 to 302 in f743fec

It computes

`total_gpus = self.exp_num_nodes * self.exp_num_gpus`

based on the autotuning config. If ZeRO is enabled, then depending on which stages are enabled, `optimizer_mem`, `gradients_mem`, and/or `params_mem` get sharded across the GPUs. But then if `self.mp_size()` (for tensor parallelism, right?) is greater than 1, the total memory usage is divided again by the tensor-parallel degree. So if ZeRO and tensor parallelism are both enabled, this is double-dipping, right? With N GPUs, we can’t get the per-GPU memory usage any smaller than 1/N of the total. I’m not sure if

- there’s a bug here,
- the value of the `num_gpus` flag is supposed to be reduced by the amount of tensor parallelism, or
- I’m not understanding this correctly.
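To make the concern concrete, here is a minimal sketch (not the actual DeepSpeed code; the function name, the ZeRO-3 assumption, and the numbers are made up for illustration) of the arithmetic as I read it from the function:

```python
# Sketch of the per-GPU memory estimate as described above, assuming
# ZeRO stage 3 so that params, gradients, and optimizer states are all
# sharded across every GPU. Hypothetical helper, not DeepSpeed's API.
def estimated_mem_per_gpu(params_mem, gradients_mem, optimizer_mem,
                          total_gpus, mp_size):
    # ZeRO-3: each memory component is sharded across all total_gpus.
    sharded = (params_mem + gradients_mem + optimizer_mem) / total_gpus
    # The total is then divided again by the tensor-parallel degree,
    # which is where the apparent double counting comes in.
    return sharded / mp_size

# Example: 16 GPUs total, tensor-parallel degree 4, arbitrary sizes in bytes.
total_gpus, mp_size = 16, 4
params, grads, optim = 10e9, 10e9, 40e9

per_gpu = estimated_mem_per_gpu(params, grads, optim, total_gpus, mp_size)
lower_bound = (params + grads + optim) / total_gpus  # 1/N of the total

print(per_gpu)      # 0.9375e9 -- 1/64 of the total memory
print(lower_bound)  # 3.75e9   -- the smallest per-GPU share possible with 16 GPUs
assert per_gpu < lower_bound  # estimate is optimistic by a factor of mp_size
```

Under these assumptions the estimate comes out a factor of `mp_size` below what any real placement on 16 GPUs could achieve, which is the "double-dipping" I'm asking about.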