-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework read_csv
IO to avoid reading whole input with a single host_read
#16826
Rework read_csv
IO to avoid reading whole input with a single host_read
#16826
Conversation
…into rework-read_csv-ingest
…rework-read_csv-ingest
…into rework-read_csv-ingest
…rework-read_csv-ingest
…rework-read_csv-ingest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions.
I'm not familiar with this code. I'm not sure I'll do this justice on my first read of it.
rmm::device_uvector<char> d_data{ | ||
(load_whole_file) ? data.size() : std::min(buffer_size * 2, data.size()), stream}; | ||
d_data.resize(0, stream); | ||
auto pos = range_begin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm not reading this correctly: Where is pos
modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Way down in line 393. I messed with the control flow here as little as I could, this code is very fragile and not very well documented.
Core change is the addition on the byte_range_offset
parameter, to be able to read from a source of host buffer that contains the whole file.
As a complete aside, I was wondering if there is any value in making the constructor of |
Made it explicit 👍 |
|
||
// None of the parameters for row selection is used, we are parsing the entire file | ||
bool const load_whole_file = | ||
range_offset == 0 && range_size == 0 && skip_rows <= 0 && skip_end_rows <= 0 && num_rows == -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why these aren't all equality checks. Under what scenario would skip_rows < 0 || skip_end_rows < 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative values mean "no value". We could modernize this with std::optional
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
minor nit.
cpp/src/io/csv/reader_impl.cu
Outdated
bom_buffer->size()}; | ||
if (has_utf8_bom(bom_chars)) { data_start_offset += sizeof(UTF8_BOM); } | ||
} else { | ||
constexpr auto find_data_start_chunk_size = 4ul * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion for future:
For wide CSVs, if this turn out to take a lot of time, we could double find_data_start_chunk_size
number after couple of loops if it can't find terminator.
/merge |
…ource` (#16865) Depends on #16826 Set of fixes that improve robustness on the non-GDS file input: 1. Avoid registering beyond the byte range - addresses problems when reading adjacent byte ranges from multiple threads (GH only). 2. Allow reading data outside of the memory mapped region. This prevents issues with very long rows in CSV or JSON input. 3. Copy host data when the range being read is only partially registered. This avoids errors when trying to copy the host data range to the device (GH only). Modifies the datasource class hierarchy to avoid reuse of direct file `host_read`s Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Basit Ayantunde (https://github.com/lamarrr) - Mads R. B. Kristensen (https://github.com/madsbk) - Bradley Dice (https://github.com/bdice) URL: #16865
Description
Issue #13797
The CSV reader ingests all input data with single call to host_read.
This is a problem for a few reasons:
cudaHostRegister
we cannot reliably copy from the mapped region to the GPU without issues with mixing registered and unregistered areas. The reader can't know the datasource implementation details needed to avoid this issue.device_read
has the potential to outperform manual copies.This PR changes
read_csv
IO to perform smallhost_read
s to get the data like BOM and first row. Most of the data is then read in chunks usingdevice_read
calls. We can further remove host_reads by moving some of the host processing to the GPU.No significant changes in performance. We are likely to get performance improvements from future changes like increasing the kvikIO thread pool size.
Checklist