
Retrieve CUDA available memory via torch.cuda.mem_get_info() #4847

Open · wants to merge 4 commits into master
Conversation

@XuehaiPan (Contributor) commented Dec 20, 2023

This PR refactors the available_memory() method of the CUDA accelerator to use free, total = torch.cuda.mem_get_info(). It also removes the hard dependency on pynvml.
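For concreteness, a minimal sketch of the refactored method (the exact signature in the accelerator class may differ; the device_index parameter here is an assumption):

```python
import torch

def available_memory(self, device_index=None):
    # torch.cuda.mem_get_info() wraps cudaMemGetInfo and returns
    # (free_bytes, total_bytes) for the given device; it honors
    # CUDA_VISIBLE_DEVICES without any index remapping.
    free, _total = torch.cuda.mem_get_info(device_index)
    return free
```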

Related PR:

The torch.cuda.mem_get_info() function was added two years ago (May 26th, 2021). We already rely on torch.cuda.is_bf16_supported(), which was added later (August 26th, 2021), without any torch version check in the adjacent method. So we can assume torch.cuda.mem_get_info() is always available in the torch versions we support.
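Should a defensive check ever be wanted, it is a one-line guard (a hedged sketch; by the argument above it should never trigger):

```python
import torch

# torch.cuda.mem_get_info was added in May 2021 (per the dates above), so
# this guard only matters for torch builds older than anything supported.
if hasattr(torch.cuda, "mem_get_info"):
    free, total = torch.cuda.mem_get_info()
else:
    raise RuntimeError("this torch build lacks torch.cuda.mem_get_info()")
```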

Rationale

  1. The official NVML Python binding package on PyPI is nvidia-ml-py, not pynvml. See the documentation at https://pypi.org/project/pynvml:

    This is a wrapper around the NVML library. For information about the NVML library, see the NVML developer page http://developer.nvidia.com/nvidia-management-library-nvml

    As of version 11.0.0, the NVML-wrappers used in pynvml are identical to those published through nvidia-ml-py.

  2. Depending on pynvml adds an extra dependency, and it can break users' Python environments when nvidia-ml-py is already installed, because both distributions ship the same pynvml module. Relying on torch.cuda.mem_get_info() adds no extra dependency.

  3. Handling the CUDA_VISIBLE_DEVICES environment variable correctly is complex: the variable can be a comma-separated list of integers or of UUID strings, but the current remapping code (shown below) only supports integers. torch.cuda.mem_get_info() calls the CUDA API directly, so it needs no index conversion between CUDA and NVML device ordinals.

```python
import os

def _get_nvml_gpu_id(self, torch_gpu_id):
    """
    credit: https://discuss.pytorch.org/t/making-pynvml-match-torch-device-ids-cuda-visible-devices/103020

    Remap torch device id to nvml device id, respecting CUDA_VISIBLE_DEVICES.
    If the latter isn't set, return the same id.
    """
    # if CUDA_VISIBLE_DEVICES is used, automagically remap the id since pynvml ignores this env var
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        # NOTE: int() raises ValueError when the entries are UUID strings
        ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
        return ids[torch_gpu_id]  # remap
    else:
        return torch_gpu_id
```
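For comparison, a remap that also tolerated UUID entries would have to query NVML itself. The sketch below is illustrative only (the helper name is hypothetical, and it assumes the pynvml module shipped by nvidia-ml-py):

```python
import pynvml  # the module shipped by nvidia-ml-py

def _nvml_index_for_entry(entry: str) -> int:
    # Hypothetical helper: resolve one CUDA_VISIBLE_DEVICES entry
    # (a plain integer or a "GPU-<uuid>" string, possibly truncated)
    # to the physical NVML index.
    if not entry.startswith("GPU-"):
        return int(entry)
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            if isinstance(uuid, bytes):  # older bindings return bytes
                uuid = uuid.decode()
            if uuid.startswith(entry):  # CUDA accepts UUID prefixes
                return i
        raise ValueError(f"no NVML device matches {entry!r}")
    finally:
        pynvml.nvmlShutdown()
```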

For example, CUDA accepts (possibly truncated) device UUIDs in CUDA_VISIBLE_DEVICES; the integer-based remap above raises ValueError on them, while torch.cuda.mem_get_info() keeps working:

```console
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-3cd9eb06-03f4-3b39-2f7b-48ee826b0a26)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-611f484b-7a5a-f1ae-5aac-64d2ddad1ab6)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-ba171e16-8df7-e1c4-5468-2ee35e18d1f0)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-66bd9aec-436e-24eb-91e8-d31d6370d8f0)
GPU 4: NVIDIA GeForce RTX 3090 (UUID: GPU-9cc6b251-34a2-db9d-4ca0-7532f951aad2)
GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-a6c609c1-078d-e47e-b418-8008e61a8cf6)
GPU 6: NVIDIA GeForce RTX 3090 (UUID: GPU-be37798a-62fb-ebee-90d2-01b018d81c6d)
GPU 7: NVIDIA GeForce RTX 3090 (UUID: GPU-8b2e78db-cff8-bb89-d9fd-64f1633df658)

$ export CUDA_VISIBLE_DEVICES="GPU-ba171e16,GPU-611f484b,GPU-3cd9eb06"

$ ipython
In [1]: import torch

In [2]: torch.cuda.memory_allocated(0)
Out[2]: 0

In [3]: torch.cuda.get_device_properties(0).total_memory
Out[3]: 25447170048

In [4]: torch.cuda.mem_get_info(0)
Out[4]: (510328832, 25447170048)

In [5]: from nvitop import CudaDevice

In [6]: cuda0 = CudaDevice(0)
   ...: cuda0
Out[6]: CudaDevice(cuda_index=0, nvml_index=2, name="NVIDIA GeForce RTX 3090", total_memory=24.00GiB)

In [7]: cuda0.memory_free()
Out[7]: 510328832

In [8]: cuda0.memory_used()
Out[8]: 24936841216

In [9]: cuda0.memory_total()
Out[9]: 25769803776
```
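The free-memory figures agree: torch.cuda.mem_get_info(0) reports the same 510328832 free bytes that NVML reports for the physical device (nvml_index=2), with no manual remapping. A quick cross-check (assuming nvitop is installed, as in the session above):

```python
import torch
from nvitop import CudaDevice

# torch.cuda.mem_get_info() already honors CUDA_VISIBLE_DEVICES, so its
# free-memory figure matches NVML's for the same (remapped) device.
free, _total = torch.cuda.mem_get_info(0)
print(free, CudaDevice(0).memory_free())  # both printed 510328832 above
```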

@mrwyattii (Contributor) commented

Hi @XuehaiPan - thank you for the contribution. If I recall correctly, we had to use pynvml because we were getting inaccurate memory information from torch in some scenarios. @jeffra may be able to comment more on this.

Either way, I will try out this branch and see if that is still the case. In particular, this code is necessary for FastGen and DeepSpeed-MII.

@loadams (Contributor) commented Jan 2, 2024

> Hi @XuehaiPan - thank you for the contribution. If I recall correctly, we had to use pynvml because we were getting inaccurate memory information from torch in some scenarios. @jeffra may be able to comment more on this.
>
> Either way, I will try out this branch and see if that is still the case. In particular, this code is necessary for FastGen and DeepSpeed-MII.

If we aren't able to switch over, would it at least make sense to move to the nvidia-ml-py package, since it is more regularly updated and at least matches the CUDA version?
