
Tracking support of new cuFile features #204

Open
3 of 4 tasks
madsbk opened this issue Apr 25, 2023 · 11 comments

Comments

@madsbk
Member

madsbk commented Apr 25, 2023

Meta Issue to track support of new cuFile features.

@madsbk madsbk changed the title Tracking GDS 1.7/CUDA 12.x features support Tracking cuFile features support Apr 27, 2023
@madsbk madsbk changed the title Tracking cuFile features support Tracking support of new cuFile features Apr 27, 2023
@jakirkham
Member

C++ support for Batch IO was done in PR ( #220 ), right? Or is this about Python support?

@madsbk
Member Author

madsbk commented Jun 27, 2023

C++ support for Batch IO was done in PR ( #220 ), right? Or is this about Python support?

Yes, I've updated the issue.

rapids-bot bot pushed a commit that referenced this issue May 7, 2024
Hi there,

Thanks for this great repository! I want to use cuFile async IO in my research project and noticed this kvikio repo. The initial support was added in #259 and is tracked in #204, but the Python interface hasn't been done yet. So I exported write_async and read_async to the CuFile Python class and added a test case. This will be very helpful for my project, where I want to run PyTorch training computation and simultaneously load tensors from the SSDs. I created this PR in the hope that it is useful for your repository as well and keeps the Python interface current.

Please let me know your thoughts. Thank you.

Best Regards,
Kun

Authors:
  - Kun Wu (https://github.com/K-Wu)
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #376
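
The use case that PR targets, overlapping training computation with loads from disk, can be sketched with KvikIO's existing non-blocking pread API (the sketch below deliberately sticks to pread rather than the new read_async/write_async bindings; the file name and sizes are hypothetical):

```python
import cupy
import kvikio

# Hypothetical tensor buffer and file; sizes are illustrative only.
buf = cupy.empty(1_000_000, dtype=cupy.float32)

with kvikio.CuFile("tensors.bin", "r") as f:
    # pread is non-blocking: it submits the read to KvikIO's thread pool
    # and returns an IOFuture immediately.
    future = f.pread(buf)

    # ... run training computation here while the read is in flight ...

    nbytes = future.get()  # block until the read has finished
    assert nbytes == buf.nbytes
```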
@fstrug

fstrug commented Oct 8, 2024

Is there any timeline for batch IO support in python?

@madsbk
Member Author

madsbk commented Oct 10, 2024

Is there any timeline for batch IO support in python?

No, not at the moment but we could prioritize it. Do you have a particular use case in mind?

@fstrug

fstrug commented Oct 10, 2024

Do you have a particular use case in mind?

We are developing a tool for high-energy particle physicists that will use the GPU to read data directly into GPU memory for later use in an analysis. Our data is stored row-wise, so the bytes for any column are divided into many small baskets spread throughout the length of the file. To get a column of data out of the file and into an array, we perform many small CuFile.pread() calls (~300 reads of ~10^5 bytes each) at different offsets. With the CuFile API calls being FIFO, it seems that a batch API call would be the performant way to launch these reads.
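
Roughly, the access pattern looks like the sketch below (the file name, offsets, and basket sizes are made up):

```python
import cupy
import kvikio

# Hypothetical basket layout for one column: ~300 small (offset, nbytes) pairs.
baskets = [(i * 1_000_000, 100_000) for i in range(300)]

dest = cupy.empty(sum(n for _, n in baskets), dtype=cupy.uint8)

with kvikio.CuFile("events.bin", "r") as f:
    futures = []
    pos = 0
    for file_offset, nbytes in baskets:
        # One small non-blocking pread per basket, each at a different offset.
        futures.append(f.pread(dest[pos:pos + nbytes], nbytes, file_offset=file_offset))
        pos += nbytes
    total = sum(fut.get() for fut in futures)  # wait for all reads to finish
```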

@madsbk
Member Author

madsbk commented Oct 11, 2024

Yes, sounds like the batch API could be useful.

Currently with CuFile.pread(), are you using the thread pool by setting KVIKIO_NTHREADS or by calling kvikio.defaults.num_threads_reset()?
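
For reference, both ways of sizing the pool look roughly like this (KVIKIO_NTHREADS has to be set before KvikIO creates its thread pool; the runtime call lives in the kvikio.defaults module):

```python
import os

# Option 1: environment variable, set before importing kvikio so the
# default thread pool picks it up.
os.environ["KVIKIO_NTHREADS"] = "16"

import kvikio.defaults

# Option 2: resize the pool at runtime (this resets the existing pool).
kvikio.defaults.num_threads_reset(16)
```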

@fstrug

fstrug commented Oct 11, 2024

I have tried adjusting the thread pool size. When doing many CuFile.pread() calls, performance decreases as I increase the thread pool beyond the default size. Setting KVIKIO_GDS_THRESHOLD=4096 so that all my calls should go to the thread pool does not affect performance. This may have something to do with running in compatibility mode (KVIKIO_COMPAT_MODE=True), but setting it to False gives me worse performance. The docs mention that KvikIO and cuFile have separate compatibility mode settings, but I don't see how to check whether cuFile is in compatibility mode. I installed libcufile before installing kvikio, yet kvikio uses KVIKIO_COMPAT_MODE=True by default.

Does the thread pool have a 1:1 correspondence with the number of CUDA threads that kvikio will use? In some of my checks with KVIKIO_COMPAT_MODE=False, the read times scaled more weakly with the number of reading threads than I would have expected. With KVIKIO_COMPAT_MODE=True, I get better performance, and there is still some scaling with the size of the thread pool.

The tests below were run on a server with a 20GB slice of an 80GB A100.

(Benchmark plots attached.)

@madsbk
Member Author

madsbk commented Oct 12, 2024

KVIKIO_COMPAT_MODE=True means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

KVIKIO_NTHREADS isn't related to CUDA threads. It is the maximum number of POSIX threads that KvikIO will use concurrently to call POSIX read/write.

Could you try with fewer threads, maybe KVIKIO_NTHREADS=8 or KVIKIO_NTHREADS=16?
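
If it helps, the current settings can be inspected from Python along these lines; the accessor names below come from the kvikio.defaults module and may differ slightly between KvikIO versions:

```python
import kvikio.defaults

# True (or an "on" compat-mode value in newer releases) means the POSIX
# thread-pool path is used instead of GDS.
print("compat mode:   ", kvikio.defaults.compat_mode())
# Reads/writes smaller than this threshold bypass GDS and use POSIX IO.
print("GDS threshold: ", kvikio.defaults.gds_threshold())
# Size of the thread pool used for POSIX IO.
print("thread pool:   ", kvikio.defaults.get_num_threads())
```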


PS: I am away all of next week, so might not be able to reply until the week after.

@fstrug

fstrug commented Oct 21, 2024

KVIKIO_COMPAT_MODE=True means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

For our environments, we start from a base conda image with some version of CUDA installed (currently 12.2). The default setting for this value in my environment is True, implying libcufile may not have been found. When running the C++ examples, I find they are unable to find libcufile and even some other dependencies like bs_thread_pool. I set my PATH to contain /home/fstrug/.conda/envs/kvikio-env/bin and LD_LIBRARY_PATH to contain /home/fstrug/.conda/envs/kvikio-env/lib. I can see that libcufile is installed within the conda environment (/home/fstrug/.conda/envs/img_cuda12.2-kvikio/lib/libcufile.so), so it isn't clear to me why these aren't being picked up when building kvikio. Is there a way to explicitly check whether the Python module is unable to find cufile as well? I don't see one in the docs.

Could you try with fewer threads, maybe KVIKIO_NTHREADS=8 or KVIKIO_NTHREADS=16?

Even with these values, I am still not seeing performance improvements. Often the pread is only reading 10^3 bytes at a time, which might be a barrier to seeing performance increases with more threads.

@madsbk
Member Author

madsbk commented Oct 23, 2024

Yes, reading ~1 kB chunks is very small. How many of the columns do you need? It might be better to read big chunks of the columns and transpose in memory, even if it means you have to read some unneeded columns.
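
As a rough sketch of what I mean (the file name, offset, and span below are made up): read one large contiguous range covering the needed baskets with a single pread, then slice the individual baskets out of GPU memory:

```python
import cupy
import kvikio

# Hypothetical: one contiguous 64 MiB range that covers all baskets of interest.
start, span = 0, 64 * 1024 * 1024

buf = cupy.empty(span, dtype=cupy.uint8)
with kvikio.CuFile("events.bin", "r") as f:
    f.pread(buf, span, file_offset=start).get()  # one large read

# Slice individual baskets out of the in-memory buffer (offsets illustrative),
# instead of issuing one ~1 kB pread per basket.
basket = buf[10_000:11_000]
```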

@fstrug

fstrug commented Oct 28, 2024

There can be thousands of columns, and users usually only need to read a small subset of them (~5). Designing an algorithm to optimize the reads based on the columns requested is something we've considered, but there may be a better path forward for us with batch functionality. If cuFile performs the reads with GPU threads when not in compatibility mode, I presume that a large contiguous read is already effectively being 'batched' across threads during execution?
