-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Compute] Pre-solve chunked indices before merging chunks in Sort kernels #44084
Comments
cc @felipecrv |
Two potential downsides to this approach:
Experimenting will tell whether this can be beneficial. |
I made some initial experiments on this and I came to the following conclusion:
I might dedicate some time to this. |
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Sep 24, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Sep 24, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Sep 25, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Sep 25, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Sep 25, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Nov 18, 2024
pitrou
added a commit
to pitrou/arrow
that referenced
this issue
Nov 26, 2024
pitrou
added a commit
that referenced
this issue
Nov 26, 2024
### Rationale for this change When merge-sorting the chunks of a chunked array or table, we would currently repeatedly resolve the chunk indices for each individual value lookup. This requires `O(n*log k)` chunk resolutions with `n` being the chunked array or table length, and `k` the number of chunks. Instead, this PR translates the logical indices to physical all at once, without even requiring expensive chunk resolution as the logical indices are initially chunk-partitioned. This change yields significant speedups on chunked array and table sorting: ``` benchmark baseline contender change % counters ChunkedArraySortIndicesInt64Narrow/1048576/100 345.419 MiB/sec 628.334 MiB/sec 81.905 {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 242, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/0/1/32 25.997M items/sec 44.550M items/sec 71.366 {'family_index': 3, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Wide/32768/10000 91.182 MiB/sec 153.756 MiB/sec 68.625 {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2067, 'null_percent': 0.01} ChunkedArraySortIndicesInt64Wide/32768/10 96.536 MiB/sec 161.648 MiB/sec 67.449 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2238, 'null_percent': 10.0} TableSortIndicesInt64Narrow/1048576/100/1/32 24.290M items/sec 40.513M items/sec 66.791 {'family_index': 3, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/100 90.030 MiB/sec 149.633 MiB/sec 66.203 {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2017, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/0 91.982 MiB/sec 152.840 MiB/sec 66.163 {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2115, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/8388608/100 240.335 MiB/sec 387.423 MiB/sec 61.201 {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 21, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/2 172.376 MiB/sec 274.133 MiB/sec 59.032 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3770, 'null_percent': 50.0} TableSortIndicesInt64Wide/1048576/4/1/32 7.407M items/sec 11.621M items/sec 56.904 {'family_index': 4, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/100/1/32 5.788M items/sec 9.062M items/sec 56.565 {'family_index': 4, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/0/1/32 5.785M items/sec 9.049M items/sec 56.409 {'family_index': 4, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/32768/2 194.743 MiB/sec 291.432 MiB/sec 49.649 {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4340, 'null_percent': 50.0} TableSortIndicesInt64Narrow/1048576/4/1/32 25.686M items/sec 38.087M items/sec 48.279 {'family_index': 3, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/0/8/32 5.766M items/sec 8.374M items/sec 45.240 {'family_index': 4, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/0/16/32 5.752M items/sec 8.352M items/sec 45.202 {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/32768/10000 121.253 MiB/sec 175.286 MiB/sec 44.562 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2673, 'null_percent': 0.01} TableSortIndicesInt64Wide/1048576/100/2/32 5.549M items/sec 7.984M items/sec 43.876 {'family_index': 4, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/1048576/100 69.599 MiB/sec 99.666 MiB/sec 43.200 {'family_index': 1, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Wide/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/0/1/4 55.940M items/sec 79.984M items/sec 42.982 {'family_index': 3, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 37, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/100/16/32 5.554M items/sec 7.909M items/sec 42.417 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/10 127.758 MiB/sec 181.407 MiB/sec 41.992 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2856, 'null_percent': 10.0} TableSortIndicesInt64Wide/1048576/100/8/32 5.572M items/sec 7.775M items/sec 39.548 {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/100 119.600 MiB/sec 166.454 MiB/sec 39.176 {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2667, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/0/2/32 5.781M items/sec 8.016M items/sec 38.669 {'family_index': 4, 'per_family_instance_index': 8, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/100/1/4 52.252M items/sec 72.193M items/sec 38.162 {'family_index': 3, 'per_family_instance_index': 21, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 35, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/0 121.868 MiB/sec 168.364 MiB/sec 38.152 {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2691, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/4/2/32 5.017M items/sec 6.720M items/sec 33.934 {'family_index': 4, 'per_family_instance_index': 7, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 25.0} ChunkedArraySortIndicesInt64Wide/8388608/100 54.785 MiB/sec 72.642 MiB/sec 32.593 {'family_index': 1, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Wide/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/4/8/32 4.222M items/sec 5.483M items/sec 29.861 {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/32768/10 146.866 MiB/sec 190.314 MiB/sec 29.583 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesString/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3494, 'null_percent': 10.0} TableSortIndicesInt64Wide/1048576/4/16/32 4.225M items/sec 5.433M items/sec 28.599 {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0} TableSortIndicesInt64Narrow/1048576/100/16/32 2.193M items/sec 2.711M items/sec 23.652 {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0} ChunkedArraySortIndicesString/32768/100 156.401 MiB/sec 191.910 MiB/sec 22.704 {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesString/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3488, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/4/1/4 47.342M items/sec 58.062M items/sec 22.644 {'family_index': 3, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/32768/0 161.457 MiB/sec 195.782 MiB/sec 21.259 {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesString/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/4/16/32 1.915M items/sec 2.309M items/sec 20.561 {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0} TableSortIndicesInt64Narrow/1048576/0/16/32 2.561M items/sec 3.079M items/sec 20.208 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0} ChunkedArraySortIndicesString/32768/10000 157.786 MiB/sec 189.412 MiB/sec 20.043 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesString/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3539, 'null_percent': 0.01} ChunkedArraySortIndicesString/32768/2 139.241 MiB/sec 164.172 MiB/sec 17.904 {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesString/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3155, 'null_percent': 50.0} TableSortIndicesInt64Narrow/1048576/0/8/32 2.595M items/sec 3.038M items/sec 17.081 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/4/8/32 1.999M items/sec 2.298M items/sec 14.936 {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/8388608/100 81.026 MiB/sec 93.120 MiB/sec 14.926 {'family_index': 2, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesString/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/100/8/32 2.382M items/sec 2.719M items/sec 14.168 {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0} ChunkedArraySortIndicesString/1048576/100 107.722 MiB/sec 122.229 MiB/sec 13.467 {'family_index': 2, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesString/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 77, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/100/2/32 4.019M items/sec 4.477M items/sec 11.383 {'family_index': 3, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/4/1/4 11.595M items/sec 12.791M items/sec 10.314 {'family_index': 4, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/0/1/4 9.231M items/sec 10.181M items/sec 10.294 {'family_index': 4, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0} ``` However, performance also regresses when the input is all-nulls (which is probably rare): ``` benchmark baseline contender change % counters ChunkedArraySortIndicesString/32768/1 5.636 GiB/sec 4.336 GiB/sec -23.068 {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesString/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 127778, 'null_percent': 100.0} ChunkedArraySortIndicesInt64Narrow/32768/1 3.963 GiB/sec 2.852 GiB/sec -28.025 {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 91209, 'null_percent': 100.0} ChunkedArraySortIndicesInt64Wide/32768/1 4.038 GiB/sec 2.869 GiB/sec -28.954 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94090, 'null_percent': 100.0} ``` ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. * GitHub Issue: #44084 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Issue resolved by pull request 44217 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the enhancement requested
In the chunked sort kernels (for ChunkedArray and Table), the most expensive step can be the recursive merge of sorted chunks after each individual chunk was sorted.
Currently, this merge step resolves chunked indices every time an access is made to read a value. This means chunked resolution is computed
O(n*log2(k))
times (wheren
is the input length andk
is the number of chunks).However, we could instead compute chunked indices after sorting the individual chunks. Then there would be no chunk resolution when merging, just direct accesses through
ResolvedChunk
s.Component(s)
C++
The text was updated successfully, but these errors were encountered: