[Enhancement] split chunk of HashTable (backport #51175) #52118

Closed
wants to merge 1 commit

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Oct 18, 2024

Why I'm doing:

@          0x2f9d4b5  malloc
@          0x8f9c745  operator new()
@          0x2ddb2ee  std::vector<>::_M_range_insert<>()
@          0x2dde914  starrocks::BinaryColumnBase<>::append()
@          0x365fa4e  starrocks::NullableColumn::append()
@          0x37458f9  starrocks::JoinHashTable::append_chunk()
@          0x3c86e80  starrocks::HashJoinBuilder::append_chunk()
@          0x3c8100c  starrocks::HashJoiner::append_chunk_to_ht()
@          0x3ab6649  starrocks::pipeline::HashJoinBuildOperator::push_chunk()
@          0x3a6769c  starrocks::pipeline::PipelineDriver::process()
@          0x3a58b9e  starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@          0x305ebac  starrocks::ThreadPool::dispatch_thread()
@          0x305882a  starrocks::Thread::supervise_thread()

JoinHashTable::build_chunk is a single Chunk that contains all data from the build side, which means it can be very large in particular cases. As a result, it can easily hit memory allocation failures when jemalloc or the OS cannot allocate a large contiguous memory region, as in the stack trace above.

The particular cases can be:

  • using a string column on the build side
  • using an array column on the build side

What I'm doing:

Split that chunk into multiple smaller segments (usually 131072 rows each) to avoid this issue:

  • Introduce a SegmentedChunk and SegmentedColumn to replace the original Chunk and Column
  • They are not transparent replacements, but they implement most of the required interfaces, so only minimal code changes are needed
  • To handle addressing (mapping a global offset to a segment offset), we translate the index just-in-time, e.g. offset % segment_size, rather than maintaining an index for it. This is efficient enough with a static segment_size.
  • We use a static segment_size rather than a dynamic one, which is easier to implement and more efficient
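The segmented storage and just-in-time offset translation described above can be sketched as follows. This is an illustrative toy, not the actual StarRocks SegmentedColumn; all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a segmented column: rows are stored in fixed-size
// segments, so no single allocation has to hold the whole build side.
constexpr size_t kSegmentSize = 131072; // power of two, so / and % are cheap

template <typename T>
class SegmentedColumnSketch {
public:
    void append(const T& value) {
        if (segments_.empty() || segments_.back().size() == kSegmentSize) {
            segments_.emplace_back();
            segments_.back().reserve(kSegmentSize); // each allocation stays bounded
        }
        segments_.back().push_back(value);
    }

    // Translate the global offset just-in-time instead of maintaining an index.
    const T& at(size_t global_offset) const {
        size_t segment_id = global_offset / kSegmentSize;
        size_t offset_in_segment = global_offset % kSegmentSize;
        return segments_[segment_id][offset_in_segment];
    }

    size_t size() const {
        return segments_.empty()
                   ? 0
                   : (segments_.size() - 1) * kSegmentSize + segments_.back().size();
    }

private:
    std::vector<std::vector<T>> segments_; // each segment is a small contiguous block
};
```

Because kSegmentSize is a compile-time power of two, the division and modulo compile down to shifts and masks, which is why a static segment size is cheaper than a dynamic one.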

Potential downside and considerations of this approach:

  • When generating output for JoinHashMap, it needs to randomly copy data from the build_chunk according to build_index. With SegmentedChunk, since the memory is no longer contiguous, we must look up the segment first and then the record within it. To mitigate this, we use the SegmentedChunkVisitor wherever possible to reduce the overhead by eliminating the virtual function call
  • The key_column of JoinHashMap can no longer reuse the columns of build_chunk, since their memory layouts differ: key_column uses a contiguous column, while build_chunk is segmented. This introduces some memory overhead and memory-copy overhead.
    • Why not make the key_column segmented? The overhead would be relatively larger for the probe procedure, and it would require changing a lot of code, which is beyond the scope of this PR, so we chose the simpler path
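The visitor idea above can be illustrated with a minimal sketch (names are hypothetical, not the actual SegmentedChunkVisitor API): instead of paying a virtual call per copied row, the concrete segment layout is resolved once and the per-row work is an inlined lambda:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kSegmentSize = 131072;

// Illustrative gather over a segmented int column: for each entry of
// build_index, locate the segment first, then the record within it.
// The consume lambda is inlined, so there is no virtual call per row.
template <typename Consume>
void visit_rows(const std::vector<std::vector<int32_t>>& segments,
                const std::vector<uint32_t>& build_index,
                Consume&& consume) {
    for (uint32_t idx : build_index) {
        const auto& seg = segments[idx / kSegmentSize]; // locate segment
        consume(seg[idx % kSegmentSize]);               // then the record in it
    }
}
```

A usage sketch: `visit_rows(segments, build_index, [&](int32_t v) { out.push_back(v); });` copies the selected build rows into `out` with one indirection per row but no virtual dispatch.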

Performance

Running ./shuffle_chunk_bench
Run on (104 X 3200.25 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x52)
  L1 Instruction 32 KiB (x52)
  L2 Unified 1024 KiB (x52)
  L3 Unified 36608 KiB (x2)
Load Average: 100.59, 89.61, 83.99
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
bench_chunk_clone 3992519949343932 ns     21223730 ns            1 items_per_second=192.992k/s
bench_segmented_chunk_clone 3992510186082870 ns     22087674 ns            1 items_per_second=185.443k/s

bench_segmented_chunk_clone is still slower than the regular chunk clone, mostly due to unpredictable random memory access during the copy. Considering that it solves the memory allocation problem, I think it's worth doing.

We can further optimize performance by making the memory access more sequential.
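One possible way to make the access more sequential (an illustrative sketch of the idea, not part of this PR) is to stable-sort the gather indices by segment id before copying, so each segment is traversed in order. Note the caveat that a real join would also have to permute its output rows accordingly:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kSegmentSize = 131072;

// Illustrative: reorder gather indices so rows in the same segment are
// visited together, improving locality during the copy. A stable sort by
// segment id keeps the within-segment order intact.
std::vector<uint32_t> group_by_segment(std::vector<uint32_t> indices) {
    std::stable_sort(indices.begin(), indices.end(),
                     [](uint32_t a, uint32_t b) {
                         return a / kSegmentSize < b / kSegmentSize;
                     });
    return indices;
}
```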

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

This is an automatic backport of pull request #51175 done by [Mergify](https://mergify.com).

Signed-off-by: Murphy <[email protected]>
(cherry picked from commit 5dd0cc5)

# Conflicts:
#	be/src/column/binary_column.h
#	be/src/exec/join_hash_map.cpp
#	be/src/exec/join_hash_map.h
#	be/src/exec/join_hash_map.tpp
#	be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.cpp
#	be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.h
#	be/src/exec/spill/mem_table.cpp
Contributor Author

mergify bot commented Oct 18, 2024

Cherry-pick of 5dd0cc5 has failed:

On branch mergify/bp/branch-3.2/pr-51175
Your branch is up to date with 'origin/branch-3.2'.

You are currently cherry-picking commit 5dd0cc5154.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/bench/shuffle_chunk_bench.cpp
	modified:   be/src/column/binary_column.cpp
	modified:   be/src/column/column_helper.cpp
	modified:   be/src/column/column_helper.h
	modified:   be/src/column/const_column.h
	modified:   be/src/column/nullable_column.h
	modified:   be/src/column/vectorized_fwd.h
	modified:   be/src/storage/chunk_helper.cpp
	modified:   be/src/storage/chunk_helper.h
	modified:   be/test/storage/chunk_helper_test.cpp

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   be/src/column/binary_column.h
	both modified:   be/src/exec/join_hash_map.cpp
	both modified:   be/src/exec/join_hash_map.h
	both modified:   be/src/exec/join_hash_map.tpp
	both modified:   be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.cpp
	both modified:   be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.h
	both modified:   be/src/exec/spill/mem_table.cpp

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Contributor Author

mergify bot commented Oct 18, 2024

@mergify[bot]: Backport conflict, please resolve the conflict and resubmit the PR
