Deserialize bytes to array on the GPU directly #436

goelayu · 2024-08-14T18:27:29Z

I am using kvikio to read an array stored on a file on the disk, directly onto the GPU. On the GPU I want to deserialize the file content into an array.

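import os
import cupy as cp
import kvikio

size = os.path.getsize(file_path)
with kvikio.CuFile(file_path, 'r') as f:
  tensor = cp.empty(size // 4, dtype=cp.float32)  # assumes 4-byte float32 data, so the buffer matches the file size
  f.read(tensor)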

My understanding is that the deserialization should happen on the GPU itself, i.e., there is no host CPU involved.
However when I profile the above code using nsys, I don't see any activity on the GPU corresponding to the deserialization.
Also, when looking at the CPU utilization of my code, it seems that the CPU is doing the work of deserialization.
Why is this the case?


jakirkham · 2024-08-15T02:51:58Z

For this use case, it might be worth looking at KvikIO's fromfile & tofile API

There are some examples at the top of this PR: #135

Admittedly this doesn't answer the question as to why things are not working in your case, but maybe it provides a path forward

goelayu · 2024-08-15T19:55:37Z

Thanks for the pointer. My objective is to understand the differences and compare the performance of cupy.fromfile and kvikio.numpy.fromfile.

I believe the cupy API 1) reads the file and deserializes the bytes to an array on the host itself (using numpy.fromfile) and 2) copies the array to the GPU using cudaMemcpy.
The kvikio API, on the other hand, 1) first copies the data to the GPU and then 2) deserializes the data into an array, on the GPU itself.

However, when profiling the code using both nsys for GPU computations and cProfile for host computations, it seems to me that the deserialization is happening on the host in both cases.

Profiler output for cupy.fromfile

 tottime  percall  cumtime  percall filename:lineno(function)
 2.672    2.672    2.672    2.672 {built-in method numpy.fromfile}

The numpy.fromfile method takes care of the deserialization and accounts for most of the runtime.

Profiler output for kvikio.numpy.fromfile

tottime  percall  cumtime  percall filename:lineno(function)
 0.000    0.000    3.376    3.376 /lib/python3.10/site-packages/kvikio/cufile.py:206(read)
 3.371    3.371    3.371    3.371 /lib/python3.10/site-packages/kvikio/cufile.py:44(get)
 1.707    1.707    1.707    1.707 /lib/python3.10/site-packages/kvikio/cufile.py:70(__init__)

Looks like most of the time is spent inside the kvikio.cuFile methods, implying that the deserialization is being performed on the host (also note that the nsys profiler shows no extra GPU computations, resulting in the same conclusion).
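One caveat when reading these numbers: cProfile reports wall-clock time by default, so time spent waiting on I/O inside a call is charged to that call even when the CPU is idle. The sketch below illustrates this with a hypothetical blocking_read function that only sleeps; it is a stand-in for a blocking read and does not use KvikIO.

```python
import cProfile
import pstats
import time


def blocking_read():
    # Hypothetical stand-in for a call that waits on storage:
    # it sleeps and performs no computation.
    time.sleep(0.2)


cpu_before = time.process_time()
profiler = cProfile.Profile()
profiler.enable()
blocking_read()
profiler.disable()
cpu_s = time.process_time() - cpu_before

# total_tt is the sum of tottime over all profiled functions (wall-clock).
wall_s = pstats.Stats(profiler).total_tt

print(f"profiled wall time: {wall_s:.3f} s")
print(f"process CPU time:   {cpu_s:.3f} s")
```

Comparing the profiled time against time.process_time() (or a CPU utilization monitor) helps separate waiting from host-side computation.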

To summarize, my questions are:

Is my understanding of the semantic differences between cupy.fromfile and kvikio.numpy.fromfile accurate?
If so, why am I not seeing the deserialization being offloaded to the GPU?

jakirkham · 2024-08-15T22:29:08Z

Could you please share the code used in the second case? It is hard to comment on what is happening there without knowing what was done

madsbk · 2024-08-16T06:37:08Z

When reading a binary file, cupy.fromfile doesn't do any computation unless it has to convert the data to little-endian, so the deserialization is essentially free.
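To make the "essentially free" point concrete, here is a minimal host-side sketch using only the Python standard library. It assumes the file holds raw float32 values; the helper name bytes_as_float32 is hypothetical and is not part of CuPy or KvikIO. When the file's byte order matches the machine's, the bytes are simply reinterpreted in place; work is only done when a byte swap is needed.

```python
import array
import struct
import sys


def bytes_as_float32(buf, file_byteorder="little"):
    """View raw bytes as float32 values (hypothetical helper)."""
    if file_byteorder == sys.byteorder:
        # Same byte order: reinterpret the existing buffer, no copy, no compute.
        return memoryview(buf).cast("f")
    # Different byte order: copy the data and swap the bytes of every element.
    swapped = array.array("f")
    swapped.frombytes(bytes(buf))
    swapped.byteswap()
    return memoryview(swapped)


raw = bytearray(struct.pack("<4f", 1.0, 2.5, -3.25, 4.0))
print(bytes_as_float32(raw, "little").tolist())
```

The same idea applies on the device: once the bytes are in GPU memory with the right dtype, the array is ready to use and no extra kernel has to run, which is consistent with nsys showing no additional GPU computation.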

If GDS isn't available, cupy.fromfile and kvikio.numpy.fromfile do exactly the same thing: they first read from disk into a bounce buffer on the host, and then copy to the device.

Now, if GDS is available and the data is larger than KVIKIO_GDS_THRESHOLD (1 MiB), KvikIO will not use a bounce buffer but instead use GDS to write directly to device memory (skipping the CPU).

However, even when GDS isn't available, kvikio.numpy.fromfile typically outperforms cupy.fromfile when using multiple threads. Try setting the environment variable KVIKIO_NTHREADS.
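As a rough illustration of why multiple threads can help on the host path, the sketch below splits a file into chunks and reads them concurrently into one preallocated buffer. It uses only the Python standard library; the function parallel_read and its nthreads and chunk_size parameters are hypothetical, and this is not KvikIO's implementation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path, nthreads=4, chunk_size=1 << 20):
    """Read a file into one preallocated buffer using several threads."""
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)

    def read_chunk(offset):
        # Each task opens its own handle so seeks do not interfere.
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            target = view[offset:offset + chunk_size]
            done = 0
            while done < len(target):
                n = f.readinto(target[done:])
                if not n:
                    break  # unexpected end of file
                done += n
        return done

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        total = sum(pool.map(read_chunk, range(0, size, chunk_size)))
    if total != size:
        raise IOError(f"expected {size} bytes, read {total}")
    return buf
```

Blocking reads release the GIL, so the chunks can be in flight at the same time; whether that translates into higher throughput depends on the storage device and file system.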

jakirkham · 2024-08-16T06:46:04Z

Thanks Mads! 🙏

Do we document somewhere how to check whether KvikIO is able to use GDS? Think this might be a useful diagnostic test for Akshay (and future users) to run through to confirm they have a working configuration

jakirkham mentioned this issue Aug 15, 2024

Deserialize bytes to array on the GPU directly cupy/cupy#8488

Closed
