
pyNVML won't work on a Jetson, is there a workaround #400

Open
JasonAtNvidia opened this issue Sep 16, 2020 · 14 comments · May be fixed by #402

Comments

@JasonAtNvidia

There is no NVML library on aarch64 NVIDIA Jetson. That breaks many libraries that rely on pyNVML, such as cuxfilter. The geospatial and cuxfilter libraries are among the most requested for Jetson and I'd love to make them work. Is there a way to use Numba functions to replace pyNVML in this library?
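For example, the device-count and memory queries can be made through Numba without NVML; a rough sketch (not necessarily what a fix would look like):

    from numba import cuda

    # Number of visible GPUs, without touching NVML.
    n_gpus = len(cuda.gpus)

    # Free/total memory of the current device, also NVML-free.
    free_mem, total_mem = cuda.current_context().get_memory_info()
    print(n_gpus, total_mem)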

@quasiben
Member

So we use pynvml in two places (a rough sketch of both calls follows below):

  1. getting the number of GPUs in the machine -- this is easy to do
  2. getting the CPU affinity for GPUs -- this will be challenging to replace. Do Jetson devices have more than one GPU?
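
For reference, a minimal sketch of how pyNVML is typically used for both queries (illustrative only; the exact calls in dask-cuda may differ):

    import pynvml

    pynvml.nvmlInit()

    # 1. Number of GPUs in the machine.
    n_gpus = pynvml.nvmlDeviceGetCount()

    # 2. CPU affinity for a given GPU, returned as a bitmask of CPU cores.
    #    The second argument is the number of 64-bit words in the mask;
    #    4 words covers up to 256 CPUs (an arbitrary choice here).
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, 4)

    pynvml.nvmlShutdown()
    print(n_gpus, list(affinity))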

@JasonAtNvidia
Author

The probability of a Jetson with a discrete GPU is ultra low, and we can say that they don't exist outside of NVIDIA DRIVE units. We could easily wrap the affinity functionality in a check such as if "tegra" in platform.uname().release, which would indicate a Jetson device.
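
A minimal sketch of that check (the eventual helper in a fix may differ):

    import platform

    def _is_tegra():
        # L4T kernel release strings contain "tegra", e.g. "4.9.140-tegra".
        return "tegra" in platform.uname().release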

@jakirkham
Member

It might be possible to detect affinity through hwloc.

@quasiben
Member

Alternatively, if there is only one GPU on a Jetson, does device affinity do anything?

@JasonAtNvidia
Author

Theoretically there is no device affinity on a Jetson: the GPU and CPU share the same RAM and don't have to communicate over the PCI bus.

@pentschev
Member

Do any of the Jetson boards have multiple GPUs @JasonAtNvidia? Note that dask-cuda targets a one-process-per-GPU model for parallelism, and if none of the boards have multiple GPUs you may not have much use for dask-cuda anyway.

If there are multi-GPU Jetsons, is there a reliable way to query whether the system is running on a Jetson? We can certainly add some conditions and work around pyNVML; we do something similar for the DGXs in

import os

def _get_dgx_name():
    product_name_file = "/sys/class/dmi/id/product_name"
    dgx_release_file = "/etc/dgx-release"
    # We verify `product_name_file` to check it's a DGX, and check
    # if `dgx_release_file` exists to confirm it's not a container.
    if not os.path.isfile(product_name_file) or not os.path.isfile(dgx_release_file):
        return None
    for line in open(product_name_file):
        return line

Those are only used for tests today, though.

@JasonAtNvidia
Author

There are Jetson boards with multiple-GPU capability; DRIVE units are the most common. They have a Xavier SoM and a Turing daughter board.

The Linux4Tegra (L4T) distribution has a file, /etc/nv_tegra_release, that contains the version. You could also check for the existence of the /sys/class/tegra-firmware/ directory to verify you are running on a Jetson (that directory exists inside a container, whereas nv_tegra_release does not).
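
A sketch combining those signals into a single check (hypothetical helper name; the firmware directory is the more reliable test inside containers):

    import os

    def _probably_jetson():
        # /sys/class/tegra-firmware/ is visible both on the host and in containers,
        # while /etc/nv_tegra_release only exists on the host.
        return (
            os.path.isdir("/sys/class/tegra-firmware")
            or os.path.isfile("/etc/nv_tegra_release")
        )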

@pentschev
Member

> There are Jetson boards with multiple-GPU capability; DRIVE units are the most common. They have a Xavier SoM and a Turing daughter board.

Sorry for the late reply here @JasonAtNvidia. When you say multiple-GPU capability, do you mean that you can address each process with CUDA_VISIBLE_DEVICES=0, CUDA_VISIBLE_DEVICES=1, and so on? Or how do you choose which GPU the application should use?

> The Linux4Tegra (L4T) distribution has a file, /etc/nv_tegra_release, that contains the version. You could also check for the existence of the /sys/class/tegra-firmware/ directory to verify you are running on a Jetson (that directory exists inside a container, whereas nv_tegra_release does not).

As long as we can choose each GPU correctly, those checks should let us detect the platform and work around the pyNVML requirement. As soon as you confirm we can indeed use CUDA_VISIBLE_DEVICES for each Dask worker, I can submit a PR to address this.
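
For context, this is roughly the one-process-per-GPU pattern dask-cuda relies on: each worker process is launched with a different CUDA_VISIBLE_DEVICES value, so device 0 inside every process maps to a distinct physical GPU. A simplified sketch (Numba is used here only to confirm a device is visible):

    import os
    import subprocess

    N_GPUS = 2  # hypothetical multi-GPU Jetson / DRIVE board

    for i in range(N_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))
        # Each process sees exactly one GPU and addresses it as device 0.
        subprocess.run(
            ["python", "-c", "from numba import cuda; cuda.detect()"],
            env=env,
            check=True,
        )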

@JasonAtNvidia
Author

@pentschev
Yes, Jetson devices respond to the CUDA_VISIBLE_DEVICES environment variable.

I do not have a multi-GPU Jetson device to test with, but I am able to verify that CUDA_VISIBLE_DEVICES=0 succeeds and CUDA_VISIBLE_DEVICES=1 results in an error that no device is found. I will try to find a multi-GPU device to test with.

pentschev linked a pull request on Sep 25, 2020 that will close this issue.
@pentschev
Member

@JasonAtNvidia I just pushed #402. This should work with Tegra, but I don't have access to a Tegra device to test on; it would be great if you could test it when you have a chance.

@JasonAtNvidia
Author

@pentschev
I think your patch is good. It builds and loads on the Jetson device, and I believe these are the three functions you touched with the patch:

>>> dask_cuda.utils.get_gpu_count()
1
>>> dask_cuda.utils._is_tegra()
True
>>> dask_cuda.utils.get_device_total_memory()
16582901760

@pentschev
Member

@JasonAtNvidia those are the correct functions. It would be interesting to know whether you can go further and run some Dask computation as well, but as I mentioned before, you won't see much benefit from dask-cuda with a single GPU versus just using the library you're computing with (e.g., CuPy, cuDF) on its own.
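
For anyone who wants to try a quick end-to-end check on the device, something along these lines should exercise the worker (a minimal sketch with a plain Dask array; on a single-GPU board it mostly confirms the cluster starts, rather than showing any speedup):

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask.array as da

    if __name__ == "__main__":
        # One worker per visible GPU; on a Jetson that means a single worker.
        cluster = LocalCUDACluster()
        client = Client(cluster)

        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        print(x.sum().compute())

        client.close()
        cluster.close()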

pentschev added the "feature request" label on Jan 8, 2021.
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
