Tracking support of new cuFile features #204
Comments
C++ support for Batch IO was done in PR #220, right? Or is this about Python support?

Yes, updated the issue.
Hi there,

Thanks for this great repository! I want to use cuFile async IO in my research project and noticed this kvikio repo. Initial support was added in #259 and is tracked in #204, but the Python interface hadn't been implemented yet. So I exposed write_async and read_async on the CuFile Python class and added test cases. This will be very helpful for my project, where I want to run PyTorch training computation while simultaneously loading tensors from the SSDs. I created this PR because hopefully it could be helpful for your repository as well, keeping the Python interface current. Please let me know your thoughts. Thank you.

Best regards,
Kun

Authors:
- Kun Wu (https://github.com/K-Wu)
- Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)

URL: #376
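The overlap described above (training compute running while the next tensors load from disk) can be sketched with the standard library alone. This is not kvikio's actual API — read_async there operates on CUDA streams — just an illustration of the double-buffering pattern it enables; all file names and sizes here are fabricated:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Write a small file of "tensor" chunks to stand in for data on an SSD.
CHUNK = 1024
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(CHUNK * 4))
tmp.close()

def read_chunk(path, i):
    """Blocking read of chunk i; with kvikio this would be an async read."""
    with open(path, "rb") as f:
        f.seek(i * CHUNK)
        return f.read(CHUNK)

def process(data):
    """Stand-in for a training step on the previously loaded chunk."""
    return sum(data) % 251

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(read_chunk, tmp.name, 0)      # prefetch chunk 0
    for i in range(1, 5):
        data = pending.result()                         # wait for in-flight read
        if i < 4:
            pending = pool.submit(read_chunk, tmp.name, i)  # overlap next read
        results.append(process(data))                   # "compute" while IO runs
os.unlink(tmp.name)
print(len(results))  # 4 chunks processed
```

The key point is that the read for chunk i+1 is submitted before chunk i is processed, so IO and compute proceed concurrently.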
Is there any timeline for batch IO support in Python?

No, not at the moment, but we could prioritize it. Do you have a particular use case in mind?
We are working on developing a tool for high energy particle physicists that will use the GPU to read and store data directly on the GPU for later use in an analysis. Our data is stored row-wise, so the bytes for any column are divided into many small baskets that are spread throughout the length of the file. To get a column of data out of the file and into an array, we are performing many small reads.
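The access pattern described above — one column scattered across many small baskets — can be sketched with the standard library. The layout (3 columns, 4 baskets of 64 bytes, interleaved row-wise) is entirely made up for illustration:

```python
import os
import tempfile

# Fabricated row-wise layout: baskets interleaved through the file as
# (col0 b0, col1 b0, col2 b0, col0 b1, ...), each basket 64 bytes.
BASKET, NCOLS, NBASKETS = 64, 3, 4
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for b in range(NBASKETS):
        for c in range(NCOLS):
            f.write(bytes([c]) * BASKET)  # fill each basket with its column id

def column_offsets(col):
    """Byte offsets of every basket belonging to one column."""
    return [(b * NCOLS + col) * BASKET for b in range(NBASKETS)]

def read_column(path, col):
    """One small pread per basket -- the many-small-reads pattern."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return b"".join(os.pread(fd, BASKET, off) for off in column_offsets(col))
    finally:
        os.close(fd)

col1 = read_column(path, 1)
os.unlink(path)
print(len(col1))  # 256 bytes gathered from 4 scattered baskets
```

Each column fetch costs NBASKETS separate reads, which is exactly the per-call overhead a batch API could amortize.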
Yes, it sounds like the batch API could be useful here.
I have tried adjusting the thread pool size. Performance decreases when I increase the thread pool size from the default while doing many small reads. Does the thread pool have a 1:1 correspondence with the number of CUDA threads that kvikio will use? In some of my checks, the read times scaled more weakly with the number of reader threads than I would have expected. I am working on a server with a 20GB slice of an 80GB A100 in the tests below.
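As I understand it, kvikio's pool is a host-side thread pool (configurable via KVIKIO_NTHREADS, with reads split into KVIKIO_TASK_SIZE-sized tasks); it is not tied to CUDA threads. The weak scaling with tiny tasks can be reproduced in spirit with a stdlib-only micro-benchmark — the file size, task size, and thread counts below are arbitrary, and the timings themselves are illustrative, not assertions about kvikio:

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

# Read one file in fixed-size tasks with different pool sizes. With tiny
# tasks, per-task overhead can dominate, which is one way a bigger pool
# can end up slower than a smaller one.
SIZE, TASK = 1 << 20, 4096  # 1 MiB file, 4 KiB tasks
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(os.urandom(SIZE))

def timed_read(nthreads):
    fd = os.open(path, os.O_RDONLY)  # pread is thread-safe on a shared fd
    offsets = range(0, SIZE, TASK)
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        chunks = list(pool.map(lambda off: os.pread(fd, TASK, off), offsets))
    os.close(fd)
    return time.perf_counter() - t0, sum(map(len, chunks))

for n in (1, 4, 16):
    elapsed, nbytes = timed_read(n)
    print(f"{n:2d} threads: {elapsed * 1e3:6.1f} ms, {nbytes} bytes")
os.unlink(path)
```

Larger tasks per thread generally amortize the dispatch overhead better than more threads do.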
Could you try with fewer threads?

PS: I am away all of next week, so I might not be able to reply until the week after.
For our environments, we start from a base conda image that has some version of CUDA installed (currently 12.2). The default setting for this value in my environment is
Even with these values, I am still not seeing performance improvements.
Yes, reading 1k chunks is very small. How many of the columns do you need? It might be better to read big chunks of the columns and transpose in memory, even if that means reading some unneeded columns.
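The suggestion above — one big read, then slicing columns out in memory — can be sketched against the same kind of fabricated row-wise layout as before (3 interleaved columns, 64-byte baskets; all numbers illustrative):

```python
import os
import tempfile

# Fabricated row-wise file: baskets of 3 columns interleaved, 64 bytes each.
BASKET, NCOLS, NBASKETS = 64, 3, 4
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for b in range(NBASKETS):
        for c in range(NCOLS):
            f.write(bytes([c]) * BASKET)

def read_columns_coalesced(path, wanted):
    """One large sequential read, then slice the wanted columns out in
    memory. Reads unneeded bytes, but replaces many small preads with
    a single big one."""
    with open(path, "rb") as f:
        blob = f.read()  # one big read instead of NBASKETS * len(wanted)
    out = {c: bytearray() for c in wanted}
    for b in range(NBASKETS):
        for c in wanted:
            start = (b * NCOLS + c) * BASKET
            out[c] += blob[start:start + BASKET]
    return {c: bytes(v) for c, v in out.items()}

cols = read_columns_coalesced(path, [0, 2])
os.unlink(path)
print({c: len(v) for c, v in cols.items()})
```

The trade-off is wasted bandwidth on unwanted columns versus far fewer IO calls; which wins depends on how sparse the wanted columns are.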
There can be ~1000s of columns, and users usually need to read only a small subset of these (~5). Designing an algorithm to optimize the reads based on the columns requested is something we've considered, but there may be a better path forward for us with batch functionality.
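A middle ground between per-basket reads and reading whole column ranges is to coalesce nearby requests before submitting them. The helper below is a pure-Python stand-in for that planning step (it is not part of kvikio or cuFile; offsets and the gap threshold are fabricated):

```python
def coalesce(requests, max_gap):
    """Merge (offset, size) read requests whose gaps are at most max_gap,
    trading some wasted bytes for far fewer IO calls -- the merged list is
    what one might hand to a batch-read API."""
    merged = []
    for off, size in sorted(requests):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            last_off, last_size = merged[-1]
            merged[-1] = (last_off, max(last_size, off + size - last_off))
        else:
            merged.append((off, size))
    return merged

# Five small basket reads collapse into three larger requests.
reqs = [(0, 64), (80, 64), (4096, 64), (4200, 64), (9000, 64)]
print(coalesce(reqs, max_gap=128))  # [(0, 144), (4096, 168), (9000, 64)]
```

Tuning max_gap lets you interpolate between the two extremes discussed above: 0 keeps every small read, a huge value degenerates to one big read.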
Meta Issue to track support of new cuFile features.