Long string optimization for string column parsing in JSON reader #13803

karthikeyann · 2023-08-02T19:15:42Z

Description

In old code, 1 thread per string is allocated for parsing a string column.
For longer strings (>1024), the runtime of 1-thread-per-string to decode is taking too long even for few strings.

In this change, 1 warp per string is used for parsing for strings length <=1024 and 1 block per string for string length >1024. If max string length < 128, 1 thread per string is used as usual.

256 threads_per_block is used for both kernels.
Code for 1-warp-per-string and 1-block-per-string is similar, but only varies with warp-wide and block-wide primitives for reduction and scan operations. shared memory usage will differ slightly too.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…_string_perf

vuule

Just a few small comments, looks great otherwise.

cpp/tests/io/json_test.cpp

cpp/src/io/utilities/data_casting.cu

…_string_perf

karthikeyann · 2023-09-13T21:30:42Z

/ok to test

wence- · 2023-09-18T13:25:49Z

/ok to test

wence-

Some minor nitpicks:

two, I think, want addressed before merge (even if only to convince me that there is "nothing to see here"). Specifically:

possible integer overflow in the grid stride loop
logic difference in the two places UTF16 surrogate codepoints are handled

cpp/src/io/utilities/data_casting.cu

wence- · 2023-09-18T15:10:35Z

cpp/src/io/utilities/data_casting.cu

+
+    // This is indeed a UTF16 surrogate pair
+    if (hex_val >= UTF16_HIGH_SURROGATE_BEGIN && hex_val < UTF16_HIGH_SURROGATE_END &&
+        hex_low_val >= UTF16_LOW_SURROGATE_BEGIN && hex_low_val < UTF16_LOW_SURROGATE_END) {


question: ‏Is it valid to have two successive UTF-16 codepoints where the first is contained in [HIGH_SURROGATE_BEGIN, HIGH_SURROGATE_END) and the second is not contained in [LOW_SURROGATE_BEGIN, LOW_SURROGATE_END)? If not, then I think we are missing a parse error here for this case.

AFAIK, it's considered as error for UTF-16.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal.

https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF_(surrogates)

It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in the format of UTF-16 by using a code unit equal to the code point. The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.

OK, so if I understand correctly, we want logic that is something like:

hex_val = parse_unicode_hex(stream); if (is_high_surrogate(hex_val)) { auto hex_val_low = parse_unicode_hex(stream); if (is_low_surrogate(hex_val_low)) { // valid surrogate pair, decode } else { // invalid surrogate pair, parse error // I think this case is missing return {bytes, parse_error}; } } else { // non-surrogate pair, decode as normal }

But I think the parse error in the case that we have a pair where the first is a surrogate pair and the second is not handled. Except that maybe this is deliberate per:

The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.

So you're just "carrying on" and letting an invalid character appear in the output stream. Is that right?

If so, maybe

suggestion: Add a comment explaining this parsing behaviour.

That's right.
added a comment at the place where this is skipped, not thrown as error.

cpp/src/io/utilities/data_casting.cu

…_string_perf

wence- · 2023-09-19T16:12:12Z

/ok to test

wence- · 2023-09-19T16:12:44Z

Thanks for your hard work here!

karthikeyann · 2023-09-19T16:24:14Z

/ok to test

karthikeyann · 2023-09-19T17:50:15Z

Final benchmark:

~70-80% runtime reduction for long strings. (overall range 30%-89%)

json_read_string_column (single column) (branch-23.10 vs this PR)

single column with max length string, json file size 512MB

[0] Quadro GV100

max_length	is_fixed_length	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff
16	1	247.072 ms	0.67%	244.442 ms	0.21%	-2629.677 us	-1.06%
32	1	161.946 ms	0.48%	160.267 ms	0.18%	-1678.671 us	-1.04%
64	1	160.508 ms	0.07%	161.867 ms	0.44%	1.358 ms	0.85%
128	1	225.849 ms	0.43%	162.441 ms	0.36%	-63408.243 us	-28.08%
256	1	291.920 ms	0.33%	144.881 ms	0.47%	-147039.083 us	-50.37%
512	1	317.194 ms	0.20%	132.372 ms	0.48%	-184822.043 us	-58.27%
1024	1	353.383 ms	0.23%	125.344 ms	0.37%	-228039.301 us	-64.53%
4096	1	235.406 ms	0.16%	121.457 ms	0.50%	-113949.635 us	-48.41%
16384	1	123.327 ms	0.03%	135.223 ms	0.49%	11.896 ms	9.65%
65536	1	272.949 ms	0.04%	142.435 ms	0.27%	-130513.632 us	-47.82%
262144	1	951.804 ms	0.03%	150.038 ms	0.41%	-801766.458 us	-84.24%
1048576	1	3.243 s	0.03%	347.749 ms	0.29%	-2895555.682 us	-89.28%
10485760	1	15.121 s	inf%	2.737 s	0.08%	-12384781.445 us	-81.90%
104857600	1	59.740 s	inf%	18.052 s	inf%	-41687603.516 us	-69.78%
16	0	341.065 ms	0.20%	338.887 ms	0.18%	-2177.289 us	-0.64%
32	0	203.926 ms	0.41%	202.162 ms	0.26%	-1764.044 us	-0.87%
64	0	148.662 ms	0.06%	148.158 ms	0.46%	-504.291 us	-0.34%
128	0	179.354 ms	0.80%	188.805 ms	0.14%	9.450 ms	5.27%
256	0	243.734 ms	0.44%	153.771 ms	0.50%	-89962.469 us	-36.91%
512	0	287.616 ms	0.26%	141.374 ms	0.10%	-146241.812 us	-50.85%
1024	0	315.399 ms	0.07%	132.822 ms	0.49%	-182577.268 us	-57.89%
4096	0	296.523 ms	0.20%	126.245 ms	0.47%	-170277.805 us	-57.42%
16384	0	114.834 ms	0.12%	128.630 ms	0.30%	13.796 ms	12.01%
65536	0	210.012 ms	0.34%	145.012 ms	0.17%	-64999.411 us	-30.95%
262144	0	697.045 ms	0.04%	145.651 ms	0.10%	-551394.577 us	-79.10%
1048576	0	2.579 s	0.01%	292.832 ms	0.11%	-2285863.312 us	-88.64%
10485760	0	11.416 s	inf%	2.323 s	0.02%	-9093176.855 us	-79.65%
104857600	0	41.310 s	inf%	10.164 s	inf%	-31145317.383 us	-75.39%

json_read_string_column (multiple columns) (branch-23.10 vs this PR)

64 column in json, json file size 512MB

[0] Quadro GV100

max_length	is_fixed_length	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff
16	1	375.037 ms	0.37%	395.404 ms	0.24%	20.367 ms	5.43%
32	1	312.670 ms	0.13%	335.295 ms	0.02%	22.624 ms	7.24%
64	1	172.643 ms	0.12%	179.064 ms	0.34%	6.421 ms	3.72%
128	1	129.609 ms	0.49%	161.565 ms	0.30%	31.956 ms	24.66%
256	1	136.144 ms	0.42%	150.128 ms	0.48%	13.984 ms	10.27%
512	1	169.633 ms	0.35%	141.509 ms	0.02%	-28123.434 us	-16.58%
1024	1	267.141 ms	0.02%	144.759 ms	0.49%	-122382.312 us	-45.81%
4096	1	868.568 ms	0.11%	150.625 ms	0.10%	-717943.390 us	-82.66%
16384	1	3.227 s	0.02%	429.324 ms	0.26%	-2797406.024 us	-86.69%
65536	1	5.988 s	inf%	1.177 s	0.07%	-4811603.776 us	-80.35%
262144	1	18.942 s	inf%	4.006 s	inf%	-14936076.111 us	-78.85%
1048576	1	35.072 s	inf%	10.725 s	inf%	-24346885.742 us	-69.42%
10485760	1	186.692 s	inf%	24.209 s	inf%	-162482992.188 us	-87.03%
16	0	446.509 ms	0.30%	463.299 ms	0.30%	16.790 ms	3.76%
32	0	310.329 ms	0.26%	325.206 ms	0.29%	14.878 ms	4.79%
64	0	229.938 ms	0.19%	244.593 ms	0.41%	14.655 ms	6.37%
128	0	141.235 ms	0.34%	164.888 ms	0.46%	23.653 ms	16.75%
256	0	125.721 ms	0.24%	162.090 ms	0.19%	36.369 ms	28.93%
512	0	143.359 ms	0.39%	154.406 ms	0.40%	11.047 ms	7.71%
1024	0	210.963 ms	0.39%	151.757 ms	0.48%	-59205.129 us	-28.06%
4096	0	640.442 ms	0.04%	179.280 ms	0.22%	-461162.570 us	-72.01%
16384	0	2.338 s	0.04%	279.270 ms	0.07%	-2058996.429 us	-88.06%
65536	0	4.671 s	inf%	929.830 ms	0.05%	-3741301.990 us	-80.09%
262144	0	14.595 s	inf%	3.134 s	0.03%	-11460648.486 us	-78.53%
1048576	0	32.698 s	inf%	11.862 s	inf%	-20835319.336 us	-63.72%
10485760	0	125.915 s	inf%	16.261 s	inf%	-109654444.336 us	-87.09%

karthikeyann · 2023-09-19T17:50:43Z

Thanks to all the reviewers for the great inputs! Many suggestions lead to great performance improvements 🚀
@vuule @elstehle @wence- @robertmaynard

vuule · 2023-09-19T22:12:15Z

/merge

karthikeyann added 3 commits August 2, 2023 22:21

warp per string parsing of string columns (unicode)

91742af

remove dependency of data_casting.cuh in write_json.cu

51e70e2

cleanup

318b4a3

karthikeyann added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 2, 2023

karthikeyann added 13 commits August 7, 2023 17:36

try load balancing with global counter for string index

363d5ab

fix intra-warp divergence issue with cub::WarpScan stuck

58e0d6c

remove unnecessary WarpReduce, reduce shmem usage

c0edf8f

cleanup comments, unused code

086dfa9

add block per string algorithm

0aa2c0e

cleanup, kernel name

57ea056

add BLOCK_SIZE to block kernel

7e4cfd2

clean up, add constants

6622460

add long string json test

efe7898

remove debug prints

589e0a3

Merge branch 'branch-23.10' of github.com:rapidsai/cudf into enh-json…

4f8e413

…_string_perf

comment

d3dc8cf

style fix, add constants

e17589e

karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Aug 17, 2023

karthikeyann marked this pull request as ready for review August 17, 2023 19:30

karthikeyann requested a review from a team as a code owner August 17, 2023 19:30

karthikeyann requested review from harrism and vuule August 17, 2023 19:30

karthikeyann changed the title ~~Long string optimization for string column parsing in JSON reader~~ Long string optimization for string column parsing in JSON reader Sep 12, 2023

vuule approved these changes Sep 12, 2023

View reviewed changes

cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved

cpp/src/io/utilities/data_casting.cu Outdated Show resolved Hide resolved

cpp/src/io/utilities/data_casting.cu Outdated Show resolved Hide resolved

karthikeyann added 6 commits September 13, 2023 20:30

reorg json type test code

d0c8612

add error cases for parse_data

56d7fb6

address review comments (vuule)

d088e8e

Merge branch 'branch-23.10' of github.com:rapidsai/cudf into enh-json…

eabb7a8

…_string_perf

fix review comments, remove nvtx ranges

79b4f38

fix unit test cases nullability

403a374

elstehle approved these changes Sep 15, 2023

View reviewed changes

Merge branch 'branch-23.10' into enh-json_string_perf

af334fc

wence- requested changes Sep 18, 2023

View reviewed changes

harrism removed their request for review September 18, 2023 23:22

karthikeyann added 2 commits September 19, 2023 21:07

address review comments, split code for string type

72d23fb

Merge branch 'branch-23.10' of github.com:rapidsai/cudf into enh-json…

c8e1f69

…_string_perf

wence- approved these changes Sep 19, 2023

View reviewed changes

add comments, style fix

d0a5e23

karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer 3 - Ready for Review Ready for review by team labels Sep 19, 2023

rapids-bot bot merged commit 97501d8 into rapidsai:branch-23.10 Sep 20, 2023
54 of 55 checks passed

davidwendt mentioned this pull request Sep 20, 2023

[BUG] Memcheck error found in JSON_TEST JsonReaderTest.ErrorStrings #14141

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long string optimization for string column parsing in JSON reader #13803

Long string optimization for string column parsing in JSON reader #13803

karthikeyann commented Aug 2, 2023 •

edited

Loading

vuule left a comment

karthikeyann commented Sep 13, 2023

wence- commented Sep 18, 2023

wence- left a comment

wence- Sep 18, 2023

karthikeyann Sep 19, 2023

wence- Sep 19, 2023

karthikeyann Sep 19, 2023 •

edited

Loading

wence- commented Sep 19, 2023

wence- commented Sep 19, 2023

karthikeyann commented Sep 19, 2023

karthikeyann commented Sep 19, 2023 •

edited

Loading

karthikeyann commented Sep 19, 2023 •

edited

Loading

vuule commented Sep 19, 2023

Long string optimization for string column parsing in JSON reader #13803

Long string optimization for string column parsing in JSON reader #13803

Conversation

karthikeyann commented Aug 2, 2023 • edited Loading

Description

Checklist

vuule left a comment

Choose a reason for hiding this comment

karthikeyann commented Sep 13, 2023

wence- commented Sep 18, 2023

wence- left a comment

Choose a reason for hiding this comment

wence- Sep 18, 2023

Choose a reason for hiding this comment

karthikeyann Sep 19, 2023

Choose a reason for hiding this comment

wence- Sep 19, 2023

Choose a reason for hiding this comment

karthikeyann Sep 19, 2023 • edited Loading

Choose a reason for hiding this comment

wence- commented Sep 19, 2023

wence- commented Sep 19, 2023

karthikeyann commented Sep 19, 2023

karthikeyann commented Sep 19, 2023 • edited Loading

Final benchmark:

json_read_string_column (single column) (branch-23.10 vs this PR)

[0] Quadro GV100

json_read_string_column (multiple columns) (branch-23.10 vs this PR)

[0] Quadro GV100

karthikeyann commented Sep 19, 2023 • edited Loading

vuule commented Sep 19, 2023

karthikeyann commented Aug 2, 2023 •

edited

Loading

karthikeyann Sep 19, 2023 •

edited

Loading

karthikeyann commented Sep 19, 2023 •

edited

Loading

karthikeyann commented Sep 19, 2023 •

edited

Loading