Query device name from pytorch if only device index is given #500
Conversation
@Narsil: can you please help to review?
Rebased on top of the latest main. @Narsil, @ArthurZucker, @SunMarc: can you please help to review?
Resolved conflict with 2331974. @Narsil @muellerzr @SunMarc: can this PR please be reviewed?
@dvrogozh Can you stop calling it a bug everywhere? It's not a bug; it's a breaking change you are proposing, which you introduced in torch==2.5. The new behavior may be more user friendly on non-CUDA accelerators, but it is nonetheless a breaking change and should be treated as such. This PR introduces a dependency on torch itself, since we would rely on torch to produce the correct string, therefore I cannot take the code as-is. The raison d'être of this code is to provide simple validation.
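For context, a hypothetical sketch (not the actual safetensors code) of the kind of torch-independent validation being referred to here, where the library checks the device specification itself rather than relying on torch to produce a string; the function name and exact checks are illustrative only:

```python
# Hypothetical illustration only -- not the actual safetensors code.
# Validate the device specification without importing torch, so the
# library does not depend on torch's string representation of a device.
import re


def validate_device(device):
    """Accept an int index or a well-formed device string."""
    if isinstance(device, int):
        if device < 0:
            raise ValueError(f"Device index must be non-negative, got {device}")
        return device
    if isinstance(device, str):
        # e.g. "cpu", "cuda", "cuda:0", "xpu:1", "mps", ...
        if re.fullmatch(r"[a-z]+(:\d+)?", device):
            return device
        raise ValueError(f"Invalid device string: {device!r}")
    raise TypeError(f"Unsupported device specification: {device!r}")
```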
I made a cleaner implementation, imho: #509. Can you check that it fixes your issue? You're also more than welcome to steal the code from said PR so we can merge your PR instead of mine and you get the credit.
Co-authored-by: Dmitry Rogozhkin <[email protected]>
Fixes: huggingface#499
Fixes: huggingface/transformers#31941

In some cases only a device index is given when querying a device. In this case both PyTorch and Safetensors returned 'cuda:N' by default, which causes runtime failures if the user actually runs on a non-CUDA device and does not have CUDA at all. This was recently addressed on the PyTorch side by [1]: starting from PyTorch 2.5, calling 'torch.device(N)' returns the current device instead of the CUDA device.

This commit makes a similar change to Safetensors: if only a device index is given, Safetensors queries and returns the device by calling 'torch.device(N)'. The change is backward compatible since this call returns 'cuda:N' on PyTorch <= 2.4, which matches the previous Safetensors behavior.

See [1]: pytorch/pytorch#129119

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Unfortunately it does not. It fails with "RuntimeError: Invalid device string: '0'". See the PyTorch part of the stack at #509 (review).
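For reference, the error is straightforward to reproduce with torch.device directly, assuming the #509 code path at that point forwarded the bare index as a string:

```python
# torch accepts an integer index or a "type[:index]" string,
# but not a bare index passed as a string.
import torch

torch.device(0)         # OK: cuda:0 on torch <= 2.4, current accelerator on >= 2.5
torch.device("cuda:0")  # OK: explicit device type and index
torch.device("0")       # RuntimeError: Invalid device string: '0'
```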
@Narsil: I reworked my PR on top of your proposal made in #509 to abstract the device string parsing (add …).
The logic you kept is the logic I want to get rid of. It's still wrong to depend on torch internals (here, the string representation of the resolved device) in that specific part.
I am fine with this as long as your change addresses the problem. I gave the modified version of #509 a try and it works for me now. We can proceed with your variant if you believe it's more aligned with the safetensors design.
Perfect, done. Thanks again for raising awareness about the upcoming new behavior!
Superseded by #509
Fixes: #499
Fixes: huggingface/transformers#31941
In some cases only a device index is given when querying a device. In this case both PyTorch and Safetensors returned 'cuda:N' by default, which causes runtime failures if the user actually runs on a non-CUDA device and does not have CUDA at all. This was recently addressed on the PyTorch side by [1]: starting from PyTorch 2.5, calling 'torch.device(N)' returns the current device instead of the CUDA device.
This commit makes a similar change to Safetensors: if only a device index is given, Safetensors queries and returns the device by calling 'torch.device(N)'. The change is backward compatible since this call returns 'cuda:N' on PyTorch <= 2.4, which matches the previous Safetensors behavior.
See [1]: pytorch/pytorch#129119
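A minimal sketch of the behavior this describes, not the actual safetensors patch; the `resolve_device` helper name is hypothetical:

```python
# Minimal sketch, assuming the change described above; not the real patch.
import torch


def resolve_device(device):
    """Turn a user-supplied device (int index or string) into a torch.device."""
    # Previously safetensors hard-coded f"cuda:{index}" for a bare integer
    # index, which breaks on machines without CUDA. Deferring to torch keeps
    # 'cuda:N' on torch <= 2.4 and resolves a bare index to the current
    # accelerator (cuda, xpu, ...) on torch >= 2.5.
    return torch.device(device)


# resolve_device(0)       -> cuda:0 on torch <= 2.4, current accelerator on >= 2.5
# resolve_device("xpu:1") -> xpu:1
```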
CC: @guangyey @jgong5 @faaany @muellerzr @SunMarc @Narsil