-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use "ranger" to prevent grid stride loop overflow #10368
Comments
The fix in basically all of these cases is quite simple: just make the index a |
I'd love to just add an algorithm to do this. |
Or even a simple range helper for |
I think the general approach should be:
|
Partially addresses #10368 Specifically: - `valid_if` - `scatter` - `rolling_window` - `compute_column_kernel` (ast stuff) - `replace_nulls` (fixed-width and strings) The majority of the fixes are simply making the indexing variable a `std::size_t` instead of a `cudf::size_type`. Although scatter had an additional place it was overflowing outside the kernel. I didn't add tests for these fixes, but each of them were individually tested locally to make sure they actually manifested the issue and then were verified with the fixes. Authors: - https://github.com/nvdbaranec Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) - Nghia Truong (https://github.com/ttnghia) URL: #10448
This issue has been labeled |
Still relevant. |
Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf. |
This issue has been labeled |
Still relevant. |
This issue has been labeled |
Still relevant. |
This issue has been labeled |
@GregoryKimball I have now created a PR to use ranger in libcuspatial. You guys could use this as an example if you want to do the same in libcudf. rapidsai/cuspatial#1178 |
wrt attempting to find locations where this might be happening. In host code, clang and gcc will warn if you add #include <cstdint>
int what(int upper)
{
int i = 0; // no warning if this is a std::int64_t
unsigned int stride = 10;
while (i < upper) {
i = i + stride; // clang warns for this, so does gcc
}
i = 0;
while (i < upper) {
i += stride; // gcc warns for this, clang does not.
}
return i;
} |
This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13996
…joins (#13971) See #10368 (and more recently #13771 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #13971
This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) URL: #14107
(updated Aug 2023)
Background
We found a kernel indexing overflow issue, first discovered in the
fused_concatenate
kernels (#10333) and this issue is present in a number of our CUDA kernels that take the following form:If we have an output_size of say 1.2 billion and a grid size that's the same, the following happens: Some late thread id, say 1.19 billion attempts to add 1.2 billion (blockDim.x * gridDim.x) and overflows the size_type (signed 32 bits).
We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.
Part 1 - First pass fix kernels with this issue
copying/concatenate.cu
fused_concatenate_kernel
valid_if.cuh
valid_if_kernel
scatter.cu
marking_bitmask_kernel
replace/nulls.cu
replace_nulls_strings
replace/nulls.cu
replace_nulls
rolling/rolling_detail.cuh
gpu_rolling
rolling/jit/kernel.cu
gpu_rolling_new
transform/compute_column.cu
compute_column_kernel
copying/concatenate.cu
fused_concatenate_string_offset_kernel
replace/replace.cu
replace_strings_first_pass
replace_strings_second_pass
replace_kernel
copying/concatenate.cu
concatenate_masks_kernel
fused_concatenate_string_offset_kernel
fused_concatenate_string_chars_kernel
fused_concatenate_kernel
(int64)hash/helper_functions.cuh
init_hashtbl
null_mask.cu
set_null_mask_kernel
copy_offset_bitmask
count_set_bits_kernel
transform/row_bit_count.cu
compute_row_sizes
multibyte_split.cu
multibyte_split_init_kernel
multibyte_split_seed_kernel
(auto??)multibyte_split_kernel
io/utilities/parsing_utils.cu
count_and_set_positions
(uint64_t)conditional_join_kernels.cuh
compute_conditional_join_output_size
conditional_join
merge.cu
materialize_merged_bitmask_kernel
partitioning.cu
compute_row_partition_numbers
compute_row_output_locations
copy_block_partitions
json_path.cu
get_json_object_kernel
tdigest
compute_percentiles_kernel
(int)strings/attributes.cu
count_characters_parallel_fn
strings/convert/convert_urls.cu
url_decode_char_counter
(int)url_decode_char_replacer
(int)text/subword/data_normalizer.cu
kernel_data_normalizer
(uint32_t)text/subword/subword_tokenize.cu
kernel_compute_tensor_metadata
(uint32_t)text/subword/wordpiece_tokenizer.cu
init_data_and_mark_word_start_and_ends
(uint32_t)mark_string_start_and_ends
(uint32_t)kernel_wordpiece_tokenizer
(uint32_t)Part 2 - Take another pass over more challenging kernels
gridDim.x
orblockDim.x
to find more examplesPart 3 - Use ranger to prevent grid stride loop overflow
Additional information
There are also a number of kernels that have this pattern but probably don't ever overflow because they are indexing by bitmask words. (Example)
Additional, In this kernel,
source_idx
probably overflows, but harmlessly.A snippet of code to see this in action:
Note: rmm may mask out of bounds accesses in some cases, so it's helpful to run with the plain cuda allocator.
The text was updated successfully, but these errors were encountered: