Question about using Autotuner with ZeRO and tensor parallelism #6796

Open
rlanday opened this issue Nov 27, 2024 · 0 comments
rlanday commented Nov 27, 2024

I’m reading through the Autotuner code and found this function:

def get_instantiation_memory_required_per_gpu(self, zero_stage):
    num_params = self.get_model_num_params()
    total_gpus = self.exp_num_nodes * self.exp_num_gpus
    fp16_enabled = self.fp16_enabled()
    if not num_params:
        return 0
    # assume the model uses Adam optimizer
    # ZeroStageEnum.disabled:
    params_mem = num_params * (2 if fp16_enabled else 4)
    gradients_mem = num_params * (2 if fp16_enabled else 4)
    optimizer_mem = num_params * (16 if fp16_enabled else 8)

    if zero_stage >= ZeroStageEnum.optimizer_states:
        optimizer_mem = optimizer_mem / total_gpus

    if zero_stage >= ZeroStageEnum.gradients:
        gradients_mem = gradients_mem / total_gpus

    if zero_stage >= ZeroStageEnum.weights:
        params_mem = params_mem / total_gpus

    mem_per_gpu = (params_mem + gradients_mem + optimizer_mem) / self.mp_size()

    return mem_per_gpu

It computes

total_gpus = self.exp_num_nodes * self.exp_num_gpus

based on the autotuning config. If ZeRO is enabled, then depending on which stage is enabled, optimizer_mem, gradients_mem, and/or params_mem get sharded across those GPUs. But then, if self.mp_size() (the tensor-parallel degree, right?) is greater than 1, the total memory is divided again by the tensor-parallel degree. So if ZeRO and tensor parallelism are both enabled, this double-counts the savings, right? With N GPUs, we can't get the per-GPU memory usage any smaller than 1/N of the total (see the sketch after the list below). I'm not sure whether:

  1. there's a bug here,
  2. the value of the num_gpus flag is supposed to be reduced by the tensor-parallel degree, or
  3. I'm not understanding this correctly.
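To make the concern concrete, here is a rough standalone sketch of the same arithmetic. The 2/2/16-byte multipliers are copied from the function above (Adam, fp16); the model size, GPU count, and mp_size are made-up numbers for illustration only:

# Hypothetical setup: 1B-parameter model, fp16, 8 GPUs total,
# ZeRO stage 3, tensor-parallel degree (mp_size) of 2.
num_params = 1_000_000_000
total_gpus = 8          # exp_num_nodes * exp_num_gpus
mp_size = 2             # tensor-parallel degree

# Per-state byte counts taken from the function above (fp16 + Adam).
params_mem = num_params * 2
gradients_mem = num_params * 2
optimizer_mem = num_params * 16

# ZeRO stage 3 shards all three states across all 8 GPUs...
optimizer_mem /= total_gpus
gradients_mem /= total_gpus
params_mem /= total_gpus

# ...and then the total is divided by mp_size again.
mem_per_gpu = (params_mem + gradients_mem + optimizer_mem) / mp_size

unsharded_mem = num_params * (2 + 2 + 16)
print(mem_per_gpu / unsharded_mem)  # 0.0625 == 1/16 == 1/(total_gpus * mp_size), not 1/8

So the estimate comes out to 1/(total_gpus * mp_size) of the unsharded footprint, which is smaller than 8 physical GPUs could ever achieve, unless the num_gpus value is meant to count only the data-parallel GPUs.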