Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pinned bounce buffer in the CSV reader #13750

Closed

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Jul 25, 2023

Description

Decompressed input data in the CSV reader is copied to the device in chunks.
This PR modifies the process to copy each chunk to a pinned bounce buffer, from which the data is then copied to the device. This should lead to higher throughput of the H2D copies.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added cuIO cuIO issue Performance Performance related issue non-breaking Non-breaking change labels Jul 25, 2023
@vuule vuule self-assigned this Jul 25, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 25, 2023
@vuule vuule added improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 25, 2023
@vuule vuule changed the title Use pinned bounce buffer to copy input data to device Use pinned bounce buffer in the CSV reader Jul 25, 2023
@vuule
Copy link
Contributor Author

vuule commented Jul 25, 2023

No perf numbers because my dev system has oddly low throughput that does not change with pinned memory.
Profiles from a lab system show big improvement.

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 25, 2023

Thank you @vuule for preparing this option for switching read_csv to use a pinned buffer for mmap file reads. Here are my quick benchmarking results read files containing LLM-inspired data. I compared recent 23.08 commit d4c2d1ccee0 with fc7b81690b18 on this branch. Solid points use the pageable mmap baseline and open circles use the pinned mmap proposal.

image

The CSV format is the only one expected to change from this diff. We see perhaps a 15% runtime increase from the largest file sizes. I will update this issue after we discuss the merits internally.

@vuule
Copy link
Contributor Author

vuule commented Jul 31, 2023

Closing based on results in the comment above

@vuule vuule closed this Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants