Retrieve CUDA available memory via torch.cuda.mem_get_info()
#4847
+5
−41
This PR refactors the `available_memory()` method of the CUDA accelerator to use `free, total = torch.cuda.mem_get_info()`. It also removes the hard dependency on `pynvml`.

Related PR:
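The refactor can be sketched as follows. This is a minimal illustration, not the exact DeepSpeed method body; the signature and fallback behavior here are assumptions.

```python
import torch

def available_memory(device_index=None):
    # Sketch of the refactored method: torch.cuda.mem_get_info() wraps
    # cudaMemGetInfo and returns (free_bytes, total_bytes) for the given
    # device, so no pynvml/NVML calls are needed.
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info(device_index)
        return free
    return None  # no CUDA device present (illustrative fallback)
```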
The `torch.cuda.mem_get_info()` function was added two years ago (May 26th, 2021). We already rely on `torch.cuda.is_bf16_supported()` without a `torch` version check in the next method below, and that function was added on August 26th, 2021. So we can assume `torch.cuda.mem_get_info()` is always available for the `torch` versions we support.

Rationale
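If an older `torch` ever needed to be supported, a feature check could guard the call. This guard is not part of this PR and is shown only as a hedge:

```python
import torch

# Hypothetical guard, not in this PR: detect whether the API exists
# (it is missing only on torch older than mid-2021 releases).
HAS_MEM_GET_INFO = hasattr(torch.cuda, "mem_get_info")
```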
The official NVML Python binding package on PyPI is `nvidia-ml-py`, not `pynvml`. See the documentation at https://pypi.org/project/pynvml.

Depending on `pynvml` adds an extra dependency. It can also break a user's Python environment if they have `nvidia-ml-py` installed, because both `pynvml` and `nvidia-ml-py` provide the `pynvml` module. Relying on `torch.cuda.mem_get_info()` adds no extra dependency.

Handling the `CUDA_VISIBLE_DEVICES` environment variable is very complex. The variable can be a comma-separated list of integers or UUID strings, and currently we only support integers. `torch.cuda.mem_get_info()` calls the CUDA API directly, which needs no index conversion between CUDA and NVML.

DeepSpeed/accelerator/cuda_accelerator.py, lines 156 to 169 in 6d7b44a
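The index-translation burden that an NVML-based path carries can be illustrated with a hypothetical helper. `nvml_index_for` is not in the codebase; it only demonstrates the mapping, and it handles integer entries only (UUID entries would require an NVML lookup, which is exactly why the CUDA-API path is simpler):

```python
import os

def nvml_index_for(torch_index):
    # With CUDA_VISIBLE_DEVICES="1,0", torch device 0 maps to physical
    # GPU 1. torch.cuda.mem_get_info(0) already queries the right GPU,
    # whereas NVML uses physical indices and needs this translation.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return torch_index  # no masking: indices coincide
    # Integer entries only; a UUID entry here would need NVML to resolve.
    return int(visible.split(",")[torch_index])
```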