[Enhancement] split chunk of HashTable (backport #51175) #52118

Closed
wants to merge 1 commit

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Oct 18, 2024

Why I'm doing:

@          0x2f9d4b5  malloc
@          0x8f9c745  operator new()
@          0x2ddb2ee  std::vector<>::_M_range_insert<>()
@          0x2dde914  starrocks::BinaryColumnBase<>::append()
@          0x365fa4e  starrocks::NullableColumn::append()
@          0x37458f9  starrocks::JoinHashTable::append_chunk()
@          0x3c86e80  starrocks::HashJoinBuilder::append_chunk()
@          0x3c8100c  starrocks::HashJoiner::append_chunk_to_ht()
@          0x3ab6649  starrocks::pipeline::HashJoinBuildOperator::push_chunk()
@          0x3a6769c  starrocks::pipeline::PipelineDriver::process()
@          0x3a58b9e  starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@          0x305ebac  starrocks::ThreadPool::dispatch_thread()
@          0x305882a  starrocks::Thread::supervise_thread()

JoinHashTable::build_chunk is a single Chunk that contains all data from the build side, which means it can be very large in particular cases. As a result, it can easily hit memory allocation failures when jemalloc or the OS cannot allocate a large contiguous memory region, as in the stack trace above.

The particular cases can be:

  • using a string column on the build side
  • using an array column on the build side

What I'm doing:

Split that chunk into multiple smaller segments (usually 131072 rows each) to avoid this issue:

  • Introduce a SegmentedChunk and SegmentedColumn to replace the original Chunk and Column
  • They are not transparent replacements, but they implement most of the required interfaces, so only minimal code changes are needed
  • To handle addressing (mapping a global offset to a segment offset), we translate the index just-in-time, e.g. offset % segment_size, rather than maintaining an index for it. This is efficient enough with a static segment_size.
  • We use a static segment_size rather than a dynamic one, which is easier to implement and more efficient
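The segmented storage and just-in-time offset translation described above can be sketched as follows. This is an illustrative toy, not the actual StarRocks SegmentedColumn; all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a segmented column: rows are stored in fixed-size
// segments, so no single allocation has to hold the whole build side.
constexpr size_t kSegmentSize = 131072; // power of two, so / and % are cheap

template <typename T>
class SegmentedColumnSketch {
public:
    void append(const T& value) {
        if (segments_.empty() || segments_.back().size() == kSegmentSize) {
            segments_.emplace_back();
            segments_.back().reserve(kSegmentSize); // each allocation stays bounded
        }
        segments_.back().push_back(value);
    }

    // Translate the global offset just-in-time instead of maintaining an index.
    const T& at(size_t global_offset) const {
        size_t segment_id = global_offset / kSegmentSize;
        size_t offset_in_segment = global_offset % kSegmentSize;
        return segments_[segment_id][offset_in_segment];
    }

    size_t size() const {
        return segments_.empty()
                   ? 0
                   : (segments_.size() - 1) * kSegmentSize + segments_.back().size();
    }

private:
    std::vector<std::vector<T>> segments_; // each segment is a small contiguous block
};
```

Because kSegmentSize is a compile-time power of two, the division and modulo compile down to shifts and masks, which is why a static segment size is cheaper than a dynamic one.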

Potential downside and considerations of this approach:

  • When generating output for JoinHashMap, it needs to randomly copy data from the build_chunk according to build_index. With SegmentedChunk, since the memory is no longer contiguous, we must look up the segment first and then the record within it. To mitigate this, we use the SegmentedChunkVisitor wherever possible to reduce the overhead by eliminating the virtual function call
  • The key_column of JoinHashMap can no longer reuse the columns of build_chunk, since their memory layouts differ: key_column uses a contiguous column, while build_chunk is segmented. This introduces some memory overhead and memory-copy overhead.
    • Why not make the key_column segmented? The overhead would be relatively larger for the probe procedure, and it would require changing a lot of code, which is beyond the scope of this PR, so we chose the simpler path
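The visitor idea above can be illustrated with a minimal sketch (names are hypothetical, not the actual SegmentedChunkVisitor API): instead of paying a virtual call per copied row, the concrete segment layout is resolved once and the per-row work is an inlined lambda:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kSegmentSize = 131072;

// Illustrative gather over a segmented int column: for each entry of
// build_index, locate the segment first, then the record within it.
// The consume lambda is inlined, so there is no virtual call per row.
template <typename Consume>
void visit_rows(const std::vector<std::vector<int32_t>>& segments,
                const std::vector<uint32_t>& build_index,
                Consume&& consume) {
    for (uint32_t idx : build_index) {
        const auto& seg = segments[idx / kSegmentSize]; // locate segment
        consume(seg[idx % kSegmentSize]);               // then the record in it
    }
}
```

A usage sketch: `visit_rows(segments, build_index, [&](int32_t v) { out.push_back(v); });` copies the selected build rows into `out` with one indirection per row but no virtual dispatch.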

Performance

Running ./shuffle_chunk_bench
Run on (104 X 3200.25 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x52)
  L1 Instruction 32 KiB (x52)
  L2 Unified 1024 KiB (x52)
  L3 Unified 36608 KiB (x2)
Load Average: 100.59, 89.61, 83.99
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
bench_chunk_clone 3992519949343932 ns     21223730 ns            1 items_per_second=192.992k/s
bench_segmented_chunk_clone 3992510186082870 ns     22087674 ns            1 items_per_second=185.443k/s

bench_segmented_chunk_clone is still slower than the regular chunk clone, mostly due to unpredictable random memory access during the copy. Considering that it solves the memory allocation problem, I think it's worth doing.

We can further optimize performance by making the memory access more sequential.
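One possible way to make the access more sequential (an illustrative sketch of the idea, not part of this PR) is to stable-sort the gather indices by segment id before copying, so each segment is traversed in order. Note the caveat that a real join would also have to permute its output rows accordingly:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kSegmentSize = 131072;

// Illustrative: reorder gather indices so rows in the same segment are
// visited together, improving locality during the copy. A stable sort by
// segment id keeps the within-segment order intact.
std::vector<uint32_t> group_by_segment(std::vector<uint32_t> indices) {
    std::stable_sort(indices.begin(), indices.end(),
                     [](uint32_t a, uint32_t b) {
                         return a / kSegmentSize < b / kSegmentSize;
                     });
    return indices;
}
```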

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

This is an automatic backport of pull request #51175 done by [Mergify](https://mergify.com).

Signed-off-by: Murphy <[email protected]>
(cherry picked from commit 5dd0cc5)

# Conflicts:
#	be/src/column/binary_column.h
#	be/src/exec/join_hash_map.cpp
#	be/src/exec/join_hash_map.h
#	be/src/exec/join_hash_map.tpp
#	be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.cpp
#	be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.h
#	be/src/exec/spill/mem_table.cpp
Contributor Author

mergify bot commented Oct 18, 2024

Cherry-pick of 5dd0cc5 has failed:

On branch mergify/bp/branch-3.2/pr-51175
Your branch is up to date with 'origin/branch-3.2'.

You are currently cherry-picking commit 5dd0cc5154.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/bench/shuffle_chunk_bench.cpp
	modified:   be/src/column/binary_column.cpp
	modified:   be/src/column/column_helper.cpp
	modified:   be/src/column/column_helper.h
	modified:   be/src/column/const_column.h
	modified:   be/src/column/nullable_column.h
	modified:   be/src/column/vectorized_fwd.h
	modified:   be/src/storage/chunk_helper.cpp
	modified:   be/src/storage/chunk_helper.h
	modified:   be/test/storage/chunk_helper_test.cpp

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   be/src/column/binary_column.h
	both modified:   be/src/exec/join_hash_map.cpp
	both modified:   be/src/exec/join_hash_map.h
	both modified:   be/src/exec/join_hash_map.tpp
	both modified:   be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.cpp
	both modified:   be/src/exec/pipeline/hashjoin/spillable_hash_join_build_operator.h
	both modified:   be/src/exec/spill/mem_table.cpp

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Contributor Author

mergify bot commented Oct 18, 2024

@mergify[bot]: Backport conflict, please resolve the conflict and resubmit the PR
