[Enhancement] Skip rebalancing scan ranges for hdfs backend selector by default when using datacache. #51996

GavinMar · 2024-10-16T11:54:11Z

Why I'm doing:

We currently use a consistent hashing algorithm to select backends for HDFS scan ranges, which cannot guarantee that the scan ranges are evenly distributed across all backends. So we rebalance scan ranges from one backend to another when the bytes assigned to the former exceed the average by more than 10%.
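
To make the mechanism concrete, here is a minimal sketch of a consistent hash ring with virtual nodes, plus a check for backends that sit more than 10% above the average. It is an illustration under assumptions, not the HDFSBackendSelector code: the class name `ConsistentHashSketch`, the FNV-1a-style hash, the `select` and `overloaded` helpers, and the scan range key format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; this is not the HDFSBackendSelector code.
final class ConsistentHashSketch {
    // Ring position -> backend name. Each backend appears once per virtual node.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    ConsistentHashSketch(List<String> backends, int virtualNodes) {
        if (backends.isEmpty() || virtualNodes <= 0) {
            throw new IllegalArgumentException("need at least one backend and one virtual node");
        }
        for (String backend : backends) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(backend + "#" + i), backend);
            }
        }
    }

    // FNV-1a-style 64-bit hash over UTF-16 code units (assumption: any stable hash would do).
    static long hash(String key) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // The first virtual node at or after the key's hash owns the scan range; wrap around at the end.
    String select(String scanRangeKey) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(hash(scanRangeKey));
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }

    // Backends whose assigned bytes exceed the average by more than `tolerance` (0.10 means 10%).
    static List<String> overloaded(Map<String, Long> assignedBytes, double tolerance) {
        long total = 0;
        for (long bytes : assignedBytes.values()) {
            total += bytes;
        }
        double limit = (double) total / assignedBytes.size() * (1.0 + tolerance);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : assignedBytes.entrySet()) {
            if (e.getValue() > limit) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Because `select` depends only on the key and the set of backends, the same scan range always maps to the same backend as long as the backend list is unchanged. That stable mapping is what a per-backend cache relies on.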

However, this may cause random cache misses, because the same scan range may be rebalanced to a different backend. So even if the same query is executed multiple times, it still cannot fully hit the cache each time. This leads to significant performance degradation in many scenarios.
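
The effect can be reproduced with a toy model. The sketch below assumes a greedy rebalancer that moves a scan range to the least-loaded backend once its preferred backend would exceed the imbalance limit, and it assumes, purely for illustration, that the ranges are processed in a different order on the second run. The class `RebalanceCacheMissSketch`, the `assign` method, and the `maxImbalanceRatio` parameter are hypothetical names, and the `preferred` function stands in for the consistent hash result.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of rebalancing; not the actual StarRocks algorithm.
final class RebalanceCacheMissSketch {
    // Assigns equally sized ranges. A range goes to its preferred backend unless that
    // would push the backend above average * maxImbalanceRatio; then it is moved.
    static Map<String, String> assign(List<String> ranges, List<String> backends,
                                      Function<String, String> preferred,
                                      double maxImbalanceRatio) {
        Map<String, Integer> load = new LinkedHashMap<>();
        for (String backend : backends) {
            load.put(backend, 0);
        }
        double limit = (double) ranges.size() / backends.size() * maxImbalanceRatio;
        Map<String, String> placement = new LinkedHashMap<>();
        for (String range : ranges) {
            String target = preferred.apply(range);
            if (load.get(target) + 1 > limit) {
                target = leastLoaded(load); // rebalanced away from the hashed backend
            }
            load.merge(target, 1, Integer::sum);
            placement.put(range, target);
        }
        return placement;
    }

    // First backend with the smallest current load.
    private static String leastLoaded(Map<String, Integer> load) {
        String best = null;
        for (Map.Entry<String, Integer> e : load.entrySet()) {
            if (best == null || e.getValue() < load.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With backends `be0` and `be1`, a ratio of 1.1, and a `preferred` function that sends ranges `a`, `c`, and `e` to `be1` and `b` to `be0`, processing the ranges as `a, c, e, b` moves `e` to `be0`, while processing them as `e, a, c, b` leaves `e` on `be1` and moves `c` instead. Data cached for `e` on one backend during the first run is therefore missed on the second.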

What I'm doing:

Because consistent hashing uses a large number of virtual nodes, it usually does not produce significant deviations in data distribution. So we skip rebalancing scan ranges by default when datacache is in use.

We also add a session variable to override this default behavior in special cases.
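
The resulting decision can be written as a small predicate. This is a sketch that mirrors the `!forceReBalance && enableDataCache` skip condition quoted in the review thread of this PR; the class name `RebalanceDecisionSketch` and the method `shouldRebalance` are hypothetical and do not appear in the change itself.

```java
// Sketch of the skip decision; the names here are illustrative, not from the PR.
final class RebalanceDecisionSketch {
    // Rebalancing is skipped only when datacache is enabled and the
    // force flag is not set, so the result is the negation of that condition.
    static boolean shouldRebalance(boolean forceReBalance, boolean enableDataCache) {
        return forceReBalance || !enableDataCache;
    }
}
```

When datacache is disabled the predicate is true regardless of the flag, so forcing a rebalance only changes the outcome when the cache is enabled.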

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

…by default when using datacache. Signed-off-by: GavinMar <[email protected]>

sonarcloud · 2024-10-16T12:01:15Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

github-actions · 2024-10-16T13:31:13Z

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2024-10-16T13:31:47Z

[FE Incremental Coverage Report]

✅ pass : 8 / 10 (80.00%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	com/starrocks/qe/SessionVariable.java	2	4	50.00%	[2736, 2737]
🔵	com/starrocks/qe/HDFSBackendSelector.java	6	6	100.00%	[]

github-actions · 2024-10-16T13:32:03Z

[BE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

zombee0 · 2024-10-18T02:11:19Z

fe/fe-core/src/main/java/com/starrocks/qe/HDFSBackendSelector.java

+ boolean enableDataCache = ConnectContext.get() != null ? ConnectContext.get().getSessionVariable().
+ isEnableScanDataCache() : false;
+ // If force-rebalancing is not specified and cache is used, skip the rebalancing directly.
+ if (!forceReBalance && enableDataCache) {


Maybe we need to add a note to the user guide that forceReBalance only takes effect when the cache is enabled.

OK, it is set to invisible for now. If it becomes necessary to expose it to users, we will add documentation to explain it.

github-actions · 2024-10-18T02:33:07Z

@Mergifyio backport branch-3.3

mergify · 2024-10-18T02:33:26Z

backport branch-3.3

✅ Backports have been created

#52072 [Enhancement] Skip rebalancing scan ranges for hdfs backend selector by default when using datacache. (backport #51996) has been created for branch branch-3.3


[Enhancement] Skip rebalancing scan ranges for hdfs backend selector …

e162bd3

…by default when using datacache. Signed-off-by: GavinMar <[email protected]>

mergify bot assigned GavinMar Oct 16, 2024

github-actions bot added the 3.3 label Oct 16, 2024

Smith-Cruise approved these changes Oct 16, 2024

View reviewed changes

zombee0 approved these changes Oct 18, 2024

View reviewed changes

dirtysalt approved these changes Oct 18, 2024

View reviewed changes

Youngwb approved these changes Oct 18, 2024

View reviewed changes

Youngwb merged commit fe00c0b into StarRocks:main Oct 18, 2024
69 of 70 checks passed

github-actions bot removed the 3.3 label Oct 18, 2024

mergify bot pushed a commit that referenced this pull request Oct 18, 2024

[Enhancement] Skip rebalancing scan ranges for hdfs backend selector …

740f839

…by default when using datacache. (#51996) Signed-off-by: GavinMar <[email protected]> (cherry picked from commit fe00c0b)

mergify bot mentioned this pull request Oct 18, 2024

[Enhancement] Skip rebalancing scan ranges for hdfs backend selector by default when using datacache. (backport #51996) #52072

Merged

42 tasks

wanpengfei-git pushed a commit that referenced this pull request Oct 18, 2024

[Enhancement] Skip rebalancing scan ranges for hdfs backend selector …

0666b60

…by default when using datacache. (backport #51996) (#52072) Co-authored-by: Gavin <[email protected]>

github-actions bot added the 3.3-merged label Oct 18, 2024
