Question about using Autotuner with ZeRO and tensor parallelism #6796

Open
rlanday opened this issue Nov 27, 2024 · 0 comments
rlanday commented Nov 27, 2024

I’m reading through the Autotuner code and found this function:

def get_instantiation_memory_required_per_gpu(self, zero_stage):
    num_params = self.get_model_num_params()
    total_gpus = self.exp_num_nodes * self.exp_num_gpus
    fp16_enabled = self.fp16_enabled()
    if not num_params:
        return 0
    # assume the model uses Adam optimizer
    # ZeroStageEnum.disabled:
    params_mem = num_params * (2 if fp16_enabled else 4)
    gradients_mem = num_params * (2 if fp16_enabled else 4)
    optimizer_mem = num_params * (16 if fp16_enabled else 8)

    if zero_stage >= ZeroStageEnum.optimizer_states:
        optimizer_mem = optimizer_mem / total_gpus

    if zero_stage >= ZeroStageEnum.gradients:
        gradients_mem = gradients_mem / total_gpus

    if zero_stage >= ZeroStageEnum.weights:
        params_mem = params_mem / total_gpus

    mem_per_gpu = (params_mem + gradients_mem + optimizer_mem) / self.mp_size()

    return mem_per_gpu

It computes

total_gpus = self.exp_num_nodes * self.exp_num_gpus

based on the autotuning config. If ZeRO is enabled, then depending on which stage is enabled, optimizer_mem, gradients_mem, and/or params_mem get sharded across those GPUs. But then, if self.mp_size() (the tensor-parallel degree, right?) is greater than 1, the total memory is divided again by the tensor-parallel degree. So if ZeRO and tensor parallelism are both enabled, this double-counts the savings, right? With N GPUs, we can't get the per-GPU memory usage any smaller than 1/N of the total (see the sketch after the list below). I'm not sure whether:

  1. there's a bug here,
  2. the value of the num_gpus flag is supposed to be reduced by the tensor-parallel degree, or
  3. I'm not understanding this correctly.
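To make the concern concrete, here is a rough standalone sketch of the same arithmetic. The 2/2/16-byte multipliers are copied from the function above (Adam, fp16); the model size, GPU count, and mp_size are made-up numbers for illustration only:

# Hypothetical setup: 1B-parameter model, fp16, 8 GPUs total,
# ZeRO stage 3, tensor-parallel degree (mp_size) of 2.
num_params = 1_000_000_000
total_gpus = 8          # exp_num_nodes * exp_num_gpus
mp_size = 2             # tensor-parallel degree

# Per-state byte counts taken from the function above (fp16 + Adam).
params_mem = num_params * 2
gradients_mem = num_params * 2
optimizer_mem = num_params * 16

# ZeRO stage 3 shards all three states across all 8 GPUs...
optimizer_mem /= total_gpus
gradients_mem /= total_gpus
params_mem /= total_gpus

# ...and then the total is divided by mp_size again.
mem_per_gpu = (params_mem + gradients_mem + optimizer_mem) / mp_size

unsharded_mem = num_params * (2 + 2 + 16)
print(mem_per_gpu / unsharded_mem)  # 0.0625 == 1/16 == 1/(total_gpus * mp_size), not 1/8

So the estimate comes out to 1/(total_gpus * mp_size) of the unsharded footprint, which is smaller than 8 physical GPUs could ever achieve, unless the num_gpus value is meant to count only the data-parallel GPUs.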