
Rework read_csv IO to avoid reading whole input with a single host_read #16826

Merged: 24 commits into rapidsai:branch-24.12 on Sep 28, 2024

Conversation

@vuule (Contributor) commented Sep 18, 2024

Description

Issue #13797

The CSV reader currently ingests all input data with a single call to host_read.
This is a problem for a few reasons:

  1. With cudaHostRegister we cannot reliably copy from the mapped region to the GPU without running into issues when registered and unregistered areas are mixed. The reader cannot know the datasource implementation details needed to avoid this problem.
  2. The reader currently performs the H2D copies manually, so there are no multi-threaded or pinned-memory optimizations. Using device_read has the potential to outperform the manual copies.

This PR changes read_csv IO to perform small host_reads to get data such as the BOM and the first row. Most of the data is then read in chunks using device_read calls. We can further reduce the number of host_reads by moving some of the host-side processing to the GPU.

There are no significant changes in performance. Performance improvements are likely to come from future changes such as increasing the kvikIO thread pool size.
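
For illustration, here is a minimal sketch of the read pattern described above, assuming the public cudf::io::datasource interface; the function name, chunk size, and offset handling are illustrative, not the reader's actual code:

#include <cudf/io/datasource.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch: a small host_read for the header, then chunked device_read calls for
// the bulk of the byte range. Assumes source.supports_device_read() is true.
void read_range_sketch(cudf::io::datasource& source,
                       std::size_t range_offset,
                       std::size_t range_size,
                       rmm::cuda_stream_view stream)
{
  // Small host_read to inspect the start of the data (e.g. a UTF-8 BOM).
  [[maybe_unused]] auto const header =
    source.host_read(range_offset, std::min<std::size_t>(4, range_size));

  rmm::device_uvector<char> d_data(range_size, stream);
  constexpr std::size_t chunk_size = 32 << 20;  // illustrative chunk size
  for (std::size_t pos = 0; pos < range_size; pos += chunk_size) {
    auto const bytes = std::min(chunk_size, range_size - pos);
    // device_read copies straight into device memory; pinned-buffer and
    // multi-threaded optimizations live inside the datasource/kvikIO layer.
    source.device_read(range_offset + pos,
                       bytes,
                       reinterpret_cast<uint8_t*>(d_data.data()) + pos,
                       stream);
  }
}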

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label on Sep 18, 2024
@vuule self-assigned this on Sep 18, 2024
@vuule added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Sep 18, 2024
@vuule changed the base branch from branch-24.10 to branch-24.12 on September 19, 2024
@vuule marked this pull request as ready for review on September 19, 2024
@vuule requested a review from a team as a code owner on September 19, 2024
@mythrocks (Contributor) left a comment

A couple of questions.

I'm not familiar with this code. I'm not sure I'll do this justice on my first read of it.

cpp/src/io/csv/reader_impl.cu:
rmm::device_uvector<char> d_data{
  (load_whole_file) ? data.size() : std::min(buffer_size * 2, data.size()), stream};
d_data.resize(0, stream);
auto pos = range_begin;
Contributor:
I think I'm not reading this correctly: Where is pos modified?

@vuule (Contributor, Author):
Way down, on line 393. I messed with the control flow here as little as I could; this code is very fragile and not well documented.
The core change is the addition of the byte_range_offset parameter, so the reader can read from a source or host buffer that contains the whole file.
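
As a rough illustration of that change (the lambda and variable names below are hypothetical, not the actual code in reader_impl.cu): positions stay relative to the selected byte range, and byte_range_offset is only added when the read is issued against a source that holds the whole file.

// Hypothetical sketch: `pos` is relative to the byte range, while the source
// (or host buffer) contains the entire file, so reads are shifted by
// byte_range_offset at the point where they are issued.
auto const read_chunk = [&](std::size_t pos, std::size_t size, uint8_t* dst) {
  return source.device_read(byte_range_offset + pos, size, dst, stream);
};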

@mythrocks (Contributor):
As a complete aside, I was wondering if there is any value in making the constructor of cudf::io::detail::csv::selected_rows_offsets explicit. I realize that it wasn't modified here.

@vuule (Contributor, Author) commented Sep 25, 2024

> As a complete aside, I was wondering if there is any value in making the constructor of cudf::io::detail::csv::selected_rows_offsets explicit. I realize that it wasn't modified here.

Made it explicit 👍
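
For reference, the change amounts to marking the converting constructor explicit; the definition below is only a hypothetical illustration, not the actual selected_rows_offsets interface:

#include <rmm/device_uvector.hpp>

#include <cstdint>
#include <utility>

// Hypothetical illustration only (not the real class definition):
// explicit prevents accidental implicit conversions from a device_uvector
// into selected_rows_offsets at call sites.
struct selected_rows_offsets {
  explicit selected_rows_offsets(rmm::device_uvector<uint64_t>&& offsets)
    : all{std::move(offsets)}
  {
  }
  rmm::device_uvector<uint64_t> all;
};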


// None of the parameters for row selection is used, we are parsing the entire file
bool const load_whole_file =
  range_offset == 0 && range_size == 0 && skip_rows <= 0 && skip_end_rows <= 0 && num_rows == -1;
Contributor:
Curious why these aren't all equality checks. Under what scenario would skip_rows < 0 || skip_end_rows < 0?

@vuule (Contributor, Author):
Negative values mean "no value". We could modernize this with std::optional.
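
A sketch of what the std::optional variant could look like (parameter names are illustrative; the reader currently encodes "no value" as a non-positive sentinel):

#include <cstddef>
#include <optional>

// With std::optional, "no value" is explicit instead of a negative sentinel,
// and the whole-file check becomes a series of has_value() tests.
bool load_whole_file(std::optional<std::size_t> range_offset,
                     std::optional<std::size_t> range_size,
                     std::optional<std::size_t> skip_rows,
                     std::optional<std::size_t> skip_end_rows,
                     std::optional<std::size_t> num_rows)
{
  return !range_offset.has_value() && !range_size.has_value() && !skip_rows.has_value() &&
         !skip_end_rows.has_value() && !num_rows.has_value();
}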

@mythrocks (Contributor) left a comment:
LGTM!

@karthikeyann (Contributor) left a comment:
Looks good to me. Minor nit below.

bom_buffer->size()};
if (has_utf8_bom(bom_chars)) { data_start_offset += sizeof(UTF8_BOM); }
} else {
constexpr auto find_data_start_chunk_size = 4ul * 1024;
Contributor:
Suggestion for the future: for wide CSVs, if this turns out to take a lot of time, we could double find_data_start_chunk_size after a couple of iterations when no terminator is found.
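
A sketch of that idea, assuming a hypothetical helper around host_read (the function name and loop structure are illustrative, not the reader's actual code):

#include <cudf/io/datasource.hpp>

#include <algorithm>
#include <cstddef>

// Sketch: probe for the first row terminator, doubling the probe size whenever
// a chunk contains no terminator, so very wide rows need fewer host_reads.
std::size_t find_data_start_sketch(cudf::io::datasource& source,
                                   std::size_t offset,
                                   char terminator)
{
  std::size_t chunk_size = 4ul * 1024;  // initial probe size, as in the reader
  std::size_t pos        = offset;
  while (pos < source.size()) {
    auto const buffer = source.host_read(pos, std::min(chunk_size, source.size() - pos));
    auto const* chars = reinterpret_cast<char const*>(buffer->data());
    for (std::size_t i = 0; i < buffer->size(); ++i) {
      if (chars[i] == terminator) { return pos + i + 1; }
    }
    pos += buffer->size();
    chunk_size *= 2;  // the suggested doubling
  }
  return source.size();
}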

@vuule (Contributor, Author) commented Sep 28, 2024

/merge

@rapids-bot merged commit e2bcbb8 into rapidsai:branch-24.12 on Sep 28, 2024
100 checks passed
rapids-bot pushed a commit that referenced this pull request on Oct 4, 2024:
…ource` (#16865)

Depends on #16826

A set of fixes that improve robustness of the non-GDS file input path:

1. Avoid registering beyond the byte range - addresses problems when reading adjacent byte ranges from multiple threads (GH only).
2. Allow reading data outside of the memory mapped region. This prevents issues with very long rows in CSV or JSON input.
  3. Copy host data when the range being read is only partially registered. This avoids errors when trying to copy the host data range to the device (GH only); a sketch of this check follows below.

It also modifies the datasource class hierarchy to avoid reuse of direct file `host_read`s.
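
As a rough sketch of fix 3 in the list above (the parameter names are hypothetical, not the actual memory_mapped_source internals): a read is only served straight from the cudaHostRegister'ed mapping when the requested range is fully covered by the registered region; otherwise it falls back to a plain host read plus copy.

#include <cstddef>

// Hypothetical helper: true only when [offset, offset + size) lies entirely
// inside the registered part of the mapping, so an H2D copy from the mapped
// pointer is safe; otherwise the caller stages the data through host_read.
bool is_fully_registered(std::size_t offset,
                         std::size_t size,
                         std::size_t register_offset,
                         std::size_t register_size)
{
  return offset >= register_offset && offset + size <= register_offset + register_size;
}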

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Basit Ayantunde (https://github.com/lamarrr)
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Bradley Dice (https://github.com/bdice)

URL: #16865
Labels
improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue)
4 participants