GPU 0 is always used in a multi-GPU setup #139

Open
nikolai-franke opened this issue Oct 19, 2023 · 6 comments

System:

  • OS version: Red Hat Enterprise Linux (RHEL) 8.x
  • Python version: 3.9 and 3.10
  • SAPIEN version: sapien==2.2.2
  • Environment: Server with xvfb

Describe the bug
In a multi-GPU setup, SAPIEN always uses GPU 0 in addition to the GPU specified by CUDA_VISIBLE_DEVICES.

To Reproduce

  1. Run a modified examples/robotics/basic_robot.py (the only difference is that the Viewer is removed; see https://pastebin.com/abuJeuVG and the sketch after this list) with CUDA_VISIBLE_DEVICES=0.
  2. Run the same modified script with CUDA_VISIBLE_DEVICES=1.
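
For reference, a minimal sketch of what the viewer-less script looks like (based on the stock basic_robot.py; the URDF path here is a placeholder, and the pastebin version may differ in details):

```python
import sapien.core as sapien

engine = sapien.Engine()
# the renderer is expected to pick the GPU selected by CUDA_VISIBLE_DEVICES
renderer = sapien.SapienRenderer()
engine.set_renderer(renderer)

scene = engine.create_scene()
scene.set_timestep(1 / 240.0)
scene.add_ground(0)

loader = scene.create_urdf_loader()
loader.fix_root_link = True
robot = loader.load("path/to/robot.urdf")  # placeholder path

# step the simulation without ever creating a Viewer
for _ in range(1000):
    scene.step()
```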

Expected behavior
When checking GPU usage, only the selected GPU should appear in use. For CUDA_VISIBLE_DEVICES=0, that is the case. For CUDA_VISIBLE_DEVICES=1, both GPU 0 and GPU 1 get used.

Screenshots
CUDA_VISIBLE_DEVICES=0: [screenshot "cuda_0": GPU usage showing only GPU 0 in use]
CUDA_VISIBLE_DEVICES=1: [screenshot "cuda_1": GPU usage showing both GPU 0 and GPU 1 in use]

Additional context
Even though GPU 0 is only used a little when CUDA_VISIBLE_DEVICES=1, this usage quickly adds up when running many parallel simulations. I am using ManiSkill2 for reinforcement learning on an HPC node with 4 Nvidia A100 GPUs, and this bug severely limits the number of parallel environments I can run. Running many parallel environments also becomes slow, since GPU 0 is touched by every single simulation environment instead of only the quarter of the simulations assigned to it.

fbxiang (Collaborator) commented Oct 23, 2023

You may try passing offscreen_only=True to the SapienRenderer constructor. This behavior will be changed in the future (to make the CUDA device take higher priority than on-screen rendering).
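
A minimal sketch of that workaround (only the offscreen_only flag comes from the comment above; the rest is boilerplate setup):

```python
import sapien.core as sapien

# Request a purely offscreen renderer so that no on-screen
# (display-attached) device is initialized.
renderer = sapien.SapienRenderer(offscreen_only=True)

engine = sapien.Engine()
engine.set_renderer(renderer)
```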

nikolai-franke (Author) commented

Passing offscreen_only=True doesn't make a difference.

fbxiang (Collaborator) commented Nov 11, 2023

I cannot figure out what is causing the issue. I think you should set the PCI id of the device you want to use directly. This method requires a bit of setup but should never fail. First, before creating anything with SAPIEN, run sapien.SapienRenderer.set_log_level("info"). Next, run your code. You will see a table listing the devices visible to Vulkan; each of your GPUs will have a PciBus field, which is unique to each physical GPU. Then, when you create the SapienRenderer, pass device="pci:x", where x is the PciBus id shown in the log. This should bypass all other checks.
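
Put together, the sequence looks like this (the "x" is a placeholder for whatever PciBus id your log actually prints):

```python
import sapien.core as sapien

# 1. Enable verbose logging BEFORE creating any SAPIEN objects; the log
#    prints a table of Vulkan-visible devices with a PciBus column that
#    uniquely identifies each physical GPU.
sapien.SapienRenderer.set_log_level("info")

# 2. Pass the desired PciBus id explicitly ("x" is a placeholder for the
#    value taken from the log table).
renderer = sapien.SapienRenderer(device="pci:x")

engine = sapien.Engine()
engine.set_renderer(renderer)
```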

nikolai-franke (Author) commented

Thank you very much for your answer! Sadly, the result is still exactly the same: GPU 0 always gets used, even when selecting another GPU via its PCI address.

fbxiang (Collaborator) commented Nov 22, 2023

Are you using sapien==2.2.2? I have verified that the GPU selection feature is working. You can try sapien.SapienRenderer.set_log_level("info") before creating the renderer; it will list all available GPUs in the console and tell you which GPU is selected for rendering. Since an incorrect PCI id would result in an error, my guess is that some other program is running on your GPU 0 and it is not the SAPIEN renderer.
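
One quick way to check that (not SAPIEN-specific; assumes nvidia-smi is on PATH): the default nvidia-smi output ends with a per-process table showing which PIDs hold memory on each GPU, so you can see whether the process on GPU 0 is actually the SAPIEN one.

```python
import subprocess

# Print nvidia-smi's default report; the trailing process table lists
# PID, process name, and memory usage per GPU.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```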

balazsgyenes commented
I'm actually having the same issue.
