GPU Server Loses GPU #1952
Possible workaround I am testing today: open an ipython session, run the following, and leave it open in a separate terminal. The idea is to keep the GPU device in use (with a small tensor resident on the GPU) and prevent the GPU from detaching. Code to run in ipython:

```python
import torch

# Allocate a small tensor on the GPU so the device stays in use.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
x.device
```
That workaround seems to have been working for me so far today.
Thanks for the update @StanHatko! I've added it to the AAW issue backlog and will be assessing it at a later date.
This workaround usually works (it had always worked for me until today), but on one server today it failed and the GPU still detached. Hopefully such failures with the workaround remain rare, but they can occur.
The workaround failed on another GPU server. It seems the workaround basically no longer works, at least as of today.
But after restarting those servers and not using the workaround, the GPU worked. So today the situation was inverted: problems occurred with the workaround active but not without it (just using the server normally).
The issue occurred for me just now without the workaround running (so it can occur in both cases), though it seems less frequent today when the workaround is not active.
I'm currently trying the modification below to the workaround to keep the GPU device active and stop it from detaching. So far it seems to be working, but that could be a coincidence.

```python
import time

import torch

# Keep a small tensor on the GPU and touch it periodically
# so the device is never idle.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        x = x + 0.01
        time.sleep(0.5)
```
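For anyone else testing the keep-alive approach, a possible variant (my own sketch, not part of the workaround above) wraps the loop in error handling so a detach event is logged with a timestamp instead of silently killing the session; the explicit `torch.cuda.synchronize()` call is an assumption added so any asynchronous CUDA failure surfaces on the iteration where it happens.

```python
# Hypothetical variant of the keep-alive loop above (not from the original
# comment): log the time of a CUDA failure instead of letting the session die.
import datetime
import time

import torch

d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        try:
            x = x + 0.01
            # Force any asynchronous CUDA error to surface here.
            torch.cuda.synchronize(d)
        except RuntimeError as err:
            print(f"{datetime.datetime.now().isoformat()}: GPU keep-alive failed: {err}")
            break
        time.sleep(0.5)
```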
In the past couple of days I've encountered GPU servers suddenly losing the GPU. This occurred very rarely in the past, but yesterday and today it is happening very frequently and is making GPU servers close to unusable.
It occurs in the following situation: if a process using the GPU exits (either normally at the end of the program or via ctrl-c) and a new task that uses the GPU starts, there's a good chance the GPU will no longer be available for the new task. An existing nvidia-smi -l 1 process will continue to run and report 0 GPU usage, but if it is terminated and restarted, nvidia-smi will not work, producing the error shown in the screenshot.
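A quick way to check this from a fresh process (my own sketch, not from the issue report) is to see whether CUDA is still visible to a newly started Python interpreter after the previous GPU task has exited:

```python
# Hypothetical quick check (not from the issue report): run in a fresh
# Python process after the previous GPU task has exited.
import torch

if torch.cuda.is_available():
    # Device is still attached and visible to the new process.
    print("GPU visible:", torch.cuda.get_device_name(0))
else:
    # Mirrors the failure described above: the new task cannot see the GPU.
    print("No GPU available to this process")
```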