GPU Server Loses GPU #1952
Possible workaround I am testing today: open an ipython session, run the following, and leave it open in a separate terminal. The idea is to keep the GPU device in use (with a small tensor resident on the GPU) and prevent the GPU from detaching. Code to run in ipython:

```python
import torch

# Allocate a small tensor on the GPU so the device stays in use.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
x.device
```
That workaround seems to have been working for me so far today.
Thanks for the update @StanHatko! I've added it to the AAW issue backlog and will be assessing it at a later date.
This workaround usually works (it had always worked for me until today), but on one server today it failed and the GPU still detached. Hopefully such failures with the workaround remain rare, but they can occur.
The workaround failed on another GPU server. It seems the workaround basically no longer works, at least as of today.
But after restarting those servers and not using the workaround, the GPU worked. So today the situation was inverted: problems occurred with the workaround active but not without it (just using the server normally).
The issue occurred for me just now without the workaround running (so it can occur in both cases), though it seems less frequent today when the workaround is not active.
I'm currently trying the modification below to the workaround to keep the GPU device active and stop it from detaching. So far it seems to be working, but that could be a coincidence.

```python
import time

import torch

# Keep a small tensor on the GPU and touch it periodically
# so the device is never idle.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        x = x + 0.01
        time.sleep(0.5)
```
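For anyone else testing the keep-alive approach, a possible variant (my own sketch, not part of the workaround above) wraps the loop in error handling so a detach event is logged with a timestamp instead of silently killing the session; the explicit `torch.cuda.synchronize()` call is an assumption added so any asynchronous CUDA failure surfaces on the iteration where it happens.

```python
# Hypothetical variant of the keep-alive loop above (not from the original
# comment): log the time of a CUDA failure instead of letting the session die.
import datetime
import time

import torch

d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        try:
            x = x + 0.01
            # Force any asynchronous CUDA error to surface here.
            torch.cuda.synchronize(d)
        except RuntimeError as err:
            print(f"{datetime.datetime.now().isoformat()}: GPU keep-alive failed: {err}")
            break
        time.sleep(0.5)
```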
In the past couple of days I've encountered GPU servers suddenly losing the GPU. This occurred very rarely in the past, but yesterday and today it is happening very frequently and is making GPU servers close to unusable.
It occurs in the following situation: if a process using the GPU exits (either normally at the end of the program or via ctrl-c) and a new task that uses the GPU starts, there's a good chance the GPU will no longer be available for the new task. An existing nvidia-smi -l 1 process will continue to run and report 0 GPU usage, but if it is terminated and restarted, nvidia-smi will not work, producing the error shown in the screenshot.
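A quick way to check this from a fresh process (my own sketch, not from the issue report) is to see whether CUDA is still visible to a newly started Python interpreter after the previous GPU task has exited:

```python
# Hypothetical quick check (not from the issue report): run in a fresh
# Python process after the previous GPU task has exited.
import torch

if torch.cuda.is_available():
    # Device is still attached and visible to the new process.
    print("GPU visible:", torch.cuda.get_device_name(0))
else:
    # Mirrors the failure described above: the new task cannot see the GPU.
    print("No GPU available to this process")
```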