Deserialize bytes to array on the GPU directly #436

Open
goelayu opened this issue Aug 14, 2024 · 5 comments

Comments

@goelayu

goelayu commented Aug 14, 2024

I am using kvikio to read an array stored in a file on disk directly onto the GPU. On the GPU, I want to deserialize the file contents into an array.

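import os
import cupy as cp
import kvikio

size = os.path.getsize(file_path)
with kvikio.CuFile(file_path, "r") as f:
    tensor = cp.empty(size // 4, dtype=cp.float32)  # assumes float32 data; cp.empty defaults to float64
    f.read(tensor)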

My understanding is that the deserialization should happen on the GPU itself, i.e., with no host CPU involved.
However, when I profile the above code using nsys, I don't see any activity on the GPU corresponding to the deserialization.
Also, when looking at the CPU utilization of my code, it seems that the CPU is doing the work of deserialization.
Why is this the case?

@jakirkham
Member

For this use case, it might be worth looking at KvikIO's fromfile & tofile API.

There are some examples at the top of this PR: #135

Admittedly, this doesn't answer the question of why things are not working in your case, but maybe it provides a path forward.

@goelayu
Author

goelayu commented Aug 15, 2024

Thanks for the pointer. My objective is to understand the differences and compare the performance of cupy.fromfile and kvikio.numpy.fromfile.

I believe the cupy API 1) reads the file and deserializes the bytes into an array on the host (using numpy.fromfile) and 2) copies the array to the GPU using cudaMemcpy.
The kvikio API, on the other hand, 1) first copies the data to the GPU and then 2) deserializes the data into an array on the GPU itself.

However, when profiling the code using both nsys for GPU computations and cProfile for host computations, it seems to me that the deserialization is happening on the host in both cases.

Profiler output for cupy.fromfile

 tottime  percall  cumtime  percall filename:lineno(function)
 2.672    2.672    2.672    2.672 {built-in method numpy.fromfile}

The numpy.fromfile method takes care of the deserialization and accounts for most of the runtime.

Profiler output for kvikio.numpy.fromfile

 tottime  percall  cumtime  percall filename:lineno(function)
 0.000    0.000    3.376    3.376 /lib/python3.10/site-packages/kvikio/cufile.py:206(read)
 3.371    3.371    3.371    3.371 /lib/python3.10/site-packages/kvikio/cufile.py:44(get)
 1.707    1.707    1.707    1.707 /lib/python3.10/site-packages/kvikio/cufile.py:70(__init__)

Looks like most of the time is spent inside the kvikio.cuFile methods, implying that the deserialization is being performed on the host (also note that the nsys profiler shows no extra GPU computations, resulting in the same conclusion).
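
One caveat when reading these numbers: cProfile's default timer measures wall-clock time, so time a call spends blocked (for example, waiting on I/O) is attributed to that call even if the CPU is mostly idle. The sketch below illustrates this with `time.sleep` as a hypothetical stand-in for a blocking read; the helper names are made up for illustration and are not part of KvikIO or CuPy.

```python
import cProfile
import pstats
import time


def blocking_read_stand_in(seconds=0.05):
    """Hypothetical stand-in for a call that blocks without using the CPU."""
    time.sleep(seconds)


def profile_wall_vs_cpu(func):
    """Return (time reported by cProfile, CPU time actually consumed)."""
    cpu_start = time.process_time()
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    cpu_used = time.process_time() - cpu_start
    # total_tt is the sum of tottime over all profiled functions.
    return pstats.Stats(profiler).total_tt, cpu_used


profiled_seconds, cpu_seconds = profile_wall_vs_cpu(blocking_read_stand_in)
print(f"profiled: {profiled_seconds:.3f}s, CPU: {cpu_seconds:.3f}s")
```

A large tottime inside a read call is therefore compatible with the host simply waiting for the transfer to finish; comparing it against `time.process_time()` is one way to tell the two situations apart.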

To summarize, my questions are:

  1. Is my understanding of the semantic differences between cupy.fromfile and kvikio.numpy.fromfile accurate?
  2. If so, why am I not seeing the deserialization being offloaded to the GPU?

@jakirkham
Member

Could you please share the code used in the second case? It is hard to comment on what is happening there without knowing what was done.

@madsbk
Member

madsbk commented Aug 16, 2024

When reading a binary file, cupy.fromfile doesn't do any computation unless it has to convert the data to little-endian, so the deserialization is essentially free.
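
To illustrate why this kind of "deserialization" costs almost nothing, here is a minimal standard-library sketch (not CuPy's or KvikIO's code). It assumes the file holds raw float32 values; `bytes_to_float32` is a hypothetical helper. When the file's byte order matches the machine's, the bytes are only reinterpreted; actual work happens only when the byte order has to be swapped.

```python
import array
import struct
import sys


def bytes_to_float32(raw, file_byteorder="little"):
    """Interpret raw bytes as float32 values."""
    if file_byteorder == sys.byteorder:
        # Same byte order: a zero-copy reinterpretation of the buffer.
        return memoryview(raw).cast("f")
    # Different byte order: copy the data and swap the bytes of each element.
    swapped = array.array("f")
    swapped.frombytes(raw)
    swapped.byteswap()
    return swapped


raw_le = struct.pack("<3f", 1.0, 2.5, -3.0)  # little-endian float32 payload
print(list(bytes_to_float32(raw_le, "little")))
```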

If GDS isn't available, cupy.fromfile and kvikio.numpy.fromfile do exactly the same thing: they first read from disk into a bounce buffer and then copy to the device.

Now, if GDS is available and the data is larger than KVIKIO_GDS_THRESHOLD (1 MiB), KvikIO will not use a bounce buffer but instead use GDS to write directly to device memory (skipping the CPU).
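
The decision described above can be sketched as a small function. This is only an illustration of the rule as stated in this comment, not KvikIO's actual code: `choose_read_path` and its return labels are hypothetical, and how a size exactly equal to the threshold is treated is an assumption.

```python
import os

DEFAULT_GDS_THRESHOLD = 1024 * 1024  # 1 MiB, the default mentioned above


def choose_read_path(nbytes, gds_available, threshold=None):
    """Return which (hypothetically named) path a read of nbytes would take."""
    if threshold is None:
        # KVIKIO_GDS_THRESHOLD is given in bytes.
        threshold = int(os.environ.get("KVIKIO_GDS_THRESHOLD", DEFAULT_GDS_THRESHOLD))
    if gds_available and nbytes > threshold:
        return "gds-direct"  # storage to device memory, no host bounce buffer
    return "host-bounce-buffer"  # read into host memory, then copy to device


print(choose_read_path(8 * 1024 * 1024, gds_available=True, threshold=1024 * 1024))
print(choose_read_path(8 * 1024 * 1024, gds_available=False, threshold=1024 * 1024))
```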

However, even when GDS isn't available, kvikio.numpy.fromfile typically outperforms cupy.fromfile when using multiple threads. Try setting the environment variable KVIKIO_NTHREADS.
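
As a rough intuition for why more threads can help, the sketch below splits one large read into chunks served by a thread pool. It is not KvikIO's implementation: `parallel_read`, its parameters, and the chunk size are hypothetical, it reads into host memory only, and it relies on `os.pread`, which is available on Unix-like systems.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path, nthreads=4, chunk_size=4 * 1024 * 1024):
    """Read a whole file into a bytearray using several threads."""
    size = os.path.getsize(path)
    buffer = bytearray(size)
    view = memoryview(buffer)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        pos = offset
        while pos < end:
            data = os.pread(fd, end - pos, pos)  # positional read, no shared seek
            if not data:
                break  # unexpected end of file
            view[pos:pos + len(data)] = data
            pos += len(data)
        return pos - offset

    try:
        with ThreadPoolExecutor(max_workers=nthreads) as pool:
            total = sum(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return buffer, total


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    payload = os.urandom(1_000_003)
    tmp.write(payload)
data, nbytes_read = parallel_read(tmp.name, nthreads=4, chunk_size=64 * 1024)
os.remove(tmp.name)
print(nbytes_read)
```

With KvikIO itself no such code is needed; the thread count is controlled through the KVIKIO_NTHREADS environment variable, set before the process starts.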

@jakirkham
Member

Thanks Mads! 🙏

Do we document somewhere how to check whether KvikIO is able to use GDS? Think this might be a useful diagnostic test for Akshay (and future users) to run through to confirm they have a working configuration
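
One possible starting point, offered as a sketch rather than documented guidance from this thread: KvikIO has a compatibility mode in which it uses POSIX I/O instead of cuFile/GDS, and `kvikio.defaults.compat_mode()` reports that setting. The helper below is hypothetical, and the exact return type of `compat_mode()` may differ between KvikIO versions, so the value is only printed, not interpreted.

```python
def describe_kvikio_mode():
    """Return a short, human-readable description of KvikIO's compat mode."""
    try:
        import kvikio.defaults
    except Exception as exc:  # kvikio missing or its native libraries failed to load
        return f"kvikio not importable: {exc!r}"
    try:
        mode = kvikio.defaults.compat_mode()
    except Exception as exc:  # API differs in this version (assumption)
        return f"could not query compat mode: {exc!r}"
    # A value indicating compatibility mode is ON means reads go through
    # POSIX I/O and a host buffer rather than GDS.
    return f"compat_mode={mode!r}"


status = describe_kvikio_mode()
print(status)
```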
