
fix: add cuda backend support for to_raggedtensor and from_raggedtensor functions #3263

Open
wants to merge 11 commits into main
Conversation

@maxymnaumchyk (Collaborator) commented Oct 1, 2024

No description provided.

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 11.36364% with 39 lines in your changes missing coverage. Please review.

Project coverage is 82.18%. Comparing base (b749e49) to head (5cfccda).
Report is 176 commits behind head on main.

Files with missing lines                          Patch %   Lines
src/awkward/operations/ak_to_raggedtensor.py      13.33%    26 Missing ⚠️
src/awkward/operations/ak_from_raggedtensor.py     7.14%    13 Missing ⚠️

Additional details and impacted files

Files with missing lines                          Coverage          Δ
src/awkward/operations/ak_from_raggedtensor.py    23.07% <7.14%>    (ø)
src/awkward/operations/ak_to_raggedtensor.py      21.81% <13.33%>   (ø)

... and 157 files with indirect coverage changes

@maxymnaumchyk (Collaborator, Author) commented Oct 1, 2024

@jpivarski while trying to make the to_raggedtensor function keep the device of the original Awkward Array, I stumbled upon an issue. The thing is, TensorFlow automatically selects the GPU for computation if one is available. If I run the following code on a GPU machine, it does correctly return a tensor on the CPU:

import tensorflow as tf

def function():
    with tf.device('CPU:0'):
        a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        return a

a = function()
a.device

>>/job:localhost/replica:0/task:0/device:CPU:0

However, if I try to do the same with the to_raggedtensor function, the intermediate ragged tensor is allocated on the CPU (the print on line 78 of the file says that it's on the CPU), but the resulting tensor is allocated on the GPU:

to_raggedtensor(ak.Array([[[1.1, 2.2], [3.3]], [], [[4.4, 5.5]]]))[0][0].device
>>/job:localhost/replica:0/task:0/device:GPU:0

Should I make the function use a TensorFlow policy and automatically select a device or create some kind of workaround?

@jpivarski (Member) commented:

ak.to_raggedtensor should return a RaggedTensor on the same device as the Awkward Array, as a view (no copy) if possible. That may mean that the implementation needs to specify non-default arguments of the RaggedTensor constructor (or use the with block) in order to control it.

If this is not possible and TensorFlow returns an object whose backend depends on what hardware is available (a terrible practice! shame on TensorFlow!), then we'll have to explain that (apologetically) in our documentation.
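For illustration, a minimal sketch of that kind of device pinning, assuming the Awkward Array's backend determines the target device; the helper name _to_ragged_on_device and its arguments are hypothetical, not the PR's actual code:

import awkward as ak
import tensorflow as tf

def _to_ragged_on_device(array, content, offsets):
    # Hypothetical helper: pick the TensorFlow device that matches the
    # Awkward Array's backend ("cuda" -> GPU, anything else -> CPU).
    # "GPU:0" hardcodes the first GPU; a real implementation might map
    # the array's device index, too.
    device = "GPU:0" if ak.backend(array) == "cuda" else "CPU:0"
    # Constructing inside tf.device pins the result, so TensorFlow's
    # automatic GPU placement cannot override it.
    with tf.device(device):
        return tf.RaggedTensor.from_row_splits(values=content, row_splits=offsets)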

@maxymnaumchyk marked this pull request as ready for review October 16, 2024 15:09
@jpivarski (Member) left a comment:

This is looking good! I added some possible changes—actually, "things to think about" because you know the TensorFlow situation better than I do.

This could also use tests. Would it be sufficient to copy the to/from raggedtensor tests from the tests/ directory to tests-cuda/ and replace NumPy arrays with CuPy arrays?

Just as you can run the normal tests with

python -m pytest tests

you can run the CUDA tests with

python -m pytest tests-cuda

on a computer with an Nvidia GPU.
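A sketch of what one such copied test might look like, reusing the example array from earlier in this thread; the test name and assertions are assumptions, not the PR's actual tests:

import pytest

import awkward as ak

tf = pytest.importorskip("tensorflow")

def test_to_raggedtensor_round_trip_cuda():
    # Build the array on the CUDA backend instead of the default CPU one.
    array = ak.Array([[[1.1, 2.2], [3.3]], [], [[4.4, 5.5]]], backend="cuda")
    ragged = ak.to_raggedtensor(array)
    # Round-trip back to Awkward and compare contents as plain lists.
    assert ak.to_list(ak.from_raggedtensor(ragged)) == ak.to_list(array)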

src/awkward/operations/ak_from_raggedtensor.py: two review threads (outdated, resolved)
@jpivarski (Member) left a comment:

This is good! Except maybe for the case of more than 10 GPUs: see below. Once that's fixed, this would be ready to merge.

@@ -79,9 +63,18 @@ def _impl(array):


def _tensor_to_np_or_cp(array, device):
    import tensorflow as tf

    if device.endswith("GPU", 0, -2):
@jpivarski (Member):

I had to check the documentation on str.endswith, but it seems that this is equivalent to

    if device[:-2].endswith("GPU"):

(though I think the latter is easier to understand because slicing is more well-known than the extra arguments of str.endswith).

However, are you assuming that the GPU number is one digit? That is, will the above code break for a computer with 10 GPUs?
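A quick check confirms the concern (the device strings here follow the format quoted later in this thread):

one_digit = "/job:localhost/replica:0/task:0/device:GPU:9"
two_digits = "/job:localhost/replica:0/task:0/device:GPU:14"

# endswith("GPU", 0, -2) ignores only the last two characters, which is
# exactly ":9" in the one-digit case but only "14" in the two-digit case,
# leaving a trailing ":" before the comparison.
print(one_digit.endswith("GPU", 0, -2))   # True
print(two_digits.endswith("GPU", 0, -2))  # False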

If the format for the 15th GPU is "GPU-14" (zero-indexed), then maybe you want

Suggested change:
-if device.endswith("GPU", 0, -2):
+if device.split("-")[0] == "GPU":

(and if lowercase is possible, you can also add a .upper() in the chain).

But before you accept the suggestion above, is it really a hyphen? If there's only one GPU, would there be no hyphen? (Note that device.split("-")[0] is equal to device if there is no hyphen, so the same code may be fine.)

@maxymnaumchyk (Collaborator, Author) commented Oct 21, 2024:

Thanks for catching that; I hadn't thought about that case! If there's only one GPU, then the device looks like this:
/job:localhost/replica:0/task:0/device:GPU:0
So I think if device.split(":")[-2].upper() == "GPU": will work for all cases.
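For completeness, how that expression behaves on the device strings quoted in this thread:

for device in (
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:14",
):
    # split(":")[-2] grabs the second-to-last colon-separated field,
    # which is the device type regardless of how many digits the index has.
    print(device.split(":")[-2].upper() == "GPU")
# prints: False, True, True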

@jpivarski (Member):

It will, but it relies on TensorFlow never changing the text to end with "GPU" rather than "GPU:0". All of this is about trying to write something defensively, so that neither our incomplete knowledge of the upstream library (TensorFlow) nor possible changes in that library will cause our code to break. By "break," I mean "do the wrong thing without an error message." Failing with an error message if TensorFlow changes would be fine.

Given that what we expect from TensorFlow is a string like

/job:localhost/replica:0/task:0/device:CPU:0

or

/job:localhost/replica:0/task:0/device:GPU:0

or

/job:localhost/replica:0/task:0/device:GPU:14

this would be a safe way to catch it:

import re

m = re.match(".*:(CPU|GPU):[0-9]+", device)
if m is None:
    raise NotImplementedError(f"TensorFlow device has an unexpected format: {device!r}")
if m.groups()[0] == "GPU":
    ...

It also expresses to the future maintainer (or code reviewer) what you know about what TensorFlow gives you. (The import needs to be in the import section.)
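For illustration, the pattern applied to the example strings above behaves as follows:

import re

for device in (
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:14",
    "some-unexpected-format",
):
    m = re.match(".*:(CPU|GPU):[0-9]+", device)
    # Prints the captured device type, or None when the format is unexpected
    # (the real code raises NotImplementedError in that case).
    print(None if m is None else m.groups()[0])
# prints: CPU, GPU, None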
