
Tracking support of new cuFile features #204

Open
3 of 4 tasks
madsbk opened this issue Apr 25, 2023 · 11 comments

Comments

@madsbk
Member

madsbk commented Apr 25, 2023

Meta Issue to track support of new cuFile features.

@madsbk madsbk changed the title Tracking GDS 1.7/CUDA 12.x features support Tracking cuFile features support Apr 27, 2023
@madsbk madsbk changed the title Tracking cuFile features support Tracking support of new cuFile features Apr 27, 2023
@jakirkham
Member

C++ support for Batch IO was done in PR ( #220 ), right? Or is this about Python support?

@madsbk
Member Author

madsbk commented Jun 27, 2023

C++ support for Batch IO was done in PR ( #220 ), right? Or is this about Python support?

Yes, I've updated the issue.

rapids-bot bot pushed a commit that referenced this issue May 7, 2024
Hi there,

Thanks for this great repository! I want to use cuFile async IO in my research project and noticed this kvikio repo. The initial support was added in #259 and is tracked in #204, but the Python interface hasn't been done yet. So I exported write_async and read_async to the CuFile Python class and added a test case. This will be very helpful for my project, where I want to run PyTorch training computation and simultaneously load tensors from the SSDs. I created this PR in the hope that it is useful for your repository as well and keeps the Python interface current.

Please let me know your thoughts. Thank you.

Best Regards,
Kun

Authors:
  - Kun Wu (https://github.com/K-Wu)
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #376
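
The use case that PR targets, overlapping training computation with loads from disk, can be sketched with KvikIO's existing non-blocking pread API (the sketch below deliberately sticks to pread rather than the new read_async/write_async bindings; the file name and sizes are hypothetical):

```python
import cupy
import kvikio

# Hypothetical tensor buffer and file; sizes are illustrative only.
buf = cupy.empty(1_000_000, dtype=cupy.float32)

with kvikio.CuFile("tensors.bin", "r") as f:
    # pread is non-blocking: it submits the read to KvikIO's thread pool
    # and returns an IOFuture immediately.
    future = f.pread(buf)

    # ... run training computation here while the read is in flight ...

    nbytes = future.get()  # block until the read has finished
    assert nbytes == buf.nbytes
```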
@fstrug

fstrug commented Oct 8, 2024

Is there any timeline for batch IO support in python?

@madsbk
Member Author

madsbk commented Oct 10, 2024

Is there any timeline for batch IO support in python?

No, not at the moment but we could prioritize it. Do you have a particular use case in mind?

@fstrug

fstrug commented Oct 10, 2024

Do you have a particular use case in mind?

We are developing a tool for high-energy particle physicists that will use the GPU to read data directly into GPU memory for later use in an analysis. Our data is stored row-wise, so the bytes for any column are divided into many small baskets spread throughout the length of the file. To get a column of data out of the file and into an array, we perform many small CuFile.pread() calls (~300 reads of ~10^5 bytes each) at different offsets. With the CuFile API calls being FIFO, it seems that a batch API call would be the performant way to launch these reads.
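
Roughly, the access pattern looks like the sketch below (the file name, offsets, and basket sizes are made up):

```python
import cupy
import kvikio

# Hypothetical basket layout for one column: ~300 small (offset, nbytes) pairs.
baskets = [(i * 1_000_000, 100_000) for i in range(300)]

dest = cupy.empty(sum(n for _, n in baskets), dtype=cupy.uint8)

with kvikio.CuFile("events.bin", "r") as f:
    futures = []
    pos = 0
    for file_offset, nbytes in baskets:
        # One small non-blocking pread per basket, each at a different offset.
        futures.append(f.pread(dest[pos:pos + nbytes], nbytes, file_offset=file_offset))
        pos += nbytes
    total = sum(fut.get() for fut in futures)  # wait for all reads to finish
```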

@madsbk
Member Author

madsbk commented Oct 11, 2024

Yes, sounds like the batch API could be useful.

Currently with CuFile.pread(), are you using the thread pool by setting KVIKIO_NTHREADS or by calling kvikio.defaults.num_threads_reset()?
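
For reference, both ways of sizing the pool look roughly like this (KVIKIO_NTHREADS has to be set before KvikIO creates its thread pool; the runtime call lives in the kvikio.defaults module):

```python
import os

# Option 1: environment variable, set before importing kvikio so the
# default thread pool picks it up.
os.environ["KVIKIO_NTHREADS"] = "16"

import kvikio.defaults

# Option 2: resize the pool at runtime (this resets the existing pool).
kvikio.defaults.num_threads_reset(16)
```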

@fstrug

fstrug commented Oct 11, 2024

I have tried adjusting the thread pool size. When doing many CuFile.pread() calls, performance decreases as I increase the thread pool beyond the default size. Setting KVIKIO_GDS_THRESHOLD=4096 so that all my calls should go to the thread pool does not affect performance. This may have something to do with running in compatibility mode (KVIKIO_COMPAT_MODE=True), but setting it to False gives me worse performance. The docs mention that KvikIO and cuFile have separate compatibility mode settings, but I don't see how to check whether cuFile is in compatibility mode. I installed libcufile before installing kvikio, yet kvikio uses KVIKIO_COMPAT_MODE=True by default.

Does the thread pool have a 1:1 correspondence with the number of CUDA threads that kvikio will use? In some of my checks with KVIKIO_COMPAT_MODE=False, the read times scaled more weakly with the number of reading threads than I would have expected. With KVIKIO_COMPAT_MODE=True, I get better performance, and there is still some scaling with the size of the thread pool.

The tests below were run on a server with a 20GB slice of an 80GB A100.

(Benchmark plots attached.)

@madsbk
Member Author

madsbk commented Oct 12, 2024

KVIKIO_COMPAT_MODE=True means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

KVIKIO_NTHREADS isn't related to CUDA threads. It is the maximum number of POSIX threads that KvikIO will use concurrently to call POSIX read/write.

Could you try with fewer threads, maybe KVIKIO_NTHREADS=8 or KVIKIO_NTHREADS=16?
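
If it helps, the current settings can be inspected from Python along these lines; the accessor names below come from the kvikio.defaults module and may differ slightly between KvikIO versions:

```python
import kvikio.defaults

# True (or an "on" compat-mode value in newer releases) means the POSIX
# thread-pool path is used instead of GDS.
print("compat mode:   ", kvikio.defaults.compat_mode())
# Reads/writes smaller than this threshold bypass GDS and use POSIX IO.
print("GDS threshold: ", kvikio.defaults.gds_threshold())
# Size of the thread pool used for POSIX IO.
print("thread pool:   ", kvikio.defaults.get_num_threads())
```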


PS: I am away all of next week, so might not be able to reply until the week after.

@fstrug

fstrug commented Oct 21, 2024

KVIKIO_COMPAT_MODE=True means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

For our environments, we start from a base conda image with some version of CUDA installed (currently 12.2). The default setting for this value in my environment is True, implying libcufile may not have been found. When running the C++ examples, I find they are unable to find libcufile and even some other dependencies like bs_thread_pool. I set my PATH to contain /home/fstrug/.conda/envs/kvikio-env/bin and LD_LIBRARY_PATH to contain /home/fstrug/.conda/envs/kvikio-env/lib. I can see that libcufile is installed within the conda environment (/home/fstrug/.conda/envs/img_cuda12.2-kvikio/lib/libcufile.so), so it isn't clear to me why these aren't being picked up when building kvikio. Is there a way to explicitly check whether the Python module is unable to find cufile as well? I don't see one in the docs.

Could you try with fewer threads, maybe KVIKIO_NTHREADS=8 or KVIKIO_NTHREADS=16?

Even with these values, I am still not seeing performance improvements. Often the pread is only reading 10^3 bytes at a time, which might be a barrier to seeing performance increases with more threads.

@madsbk
Member Author

madsbk commented Oct 23, 2024

Yes, reading ~1 kB chunks is very small. How many of the columns do you need? It might be better to read big chunks of the columns and transpose in memory, even if it means you have to read some unneeded columns.
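
As a rough sketch of what I mean (the file name, offset, and span below are made up): read one large contiguous range covering the needed baskets with a single pread, then slice the individual baskets out of GPU memory:

```python
import cupy
import kvikio

# Hypothetical: one contiguous 64 MiB range that covers all baskets of interest.
start, span = 0, 64 * 1024 * 1024

buf = cupy.empty(span, dtype=cupy.uint8)
with kvikio.CuFile("events.bin", "r") as f:
    f.pread(buf, span, file_offset=start).get()  # one large read

# Slice individual baskets out of the in-memory buffer (offsets illustrative),
# instead of issuing one ~1 kB pread per basket.
basket = buf[10_000:11_000]
```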

@fstrug

fstrug commented Oct 28, 2024

There can be thousands of columns, and users usually only need to read a small subset of them (~5). Designing an algorithm to optimize the reads based on the columns requested is something we've considered, but there may be a better path forward for us with batch functionality. If cuFile performs the reads with GPU threads when not in compatibility mode, I presume that a large contiguous read is already effectively being 'batched' across threads during execution?
