Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] merge full/sample statistics collect #52693

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

Seaven
Copy link
Contributor

@Seaven Seaven commented Nov 7, 2024

Why I'm doing:

  1. want to merge sample/full statistics together, save in column_statistics table
  2. want to improve full statistics collect performance
  3. want to improve some metric in sample statsitcs

What I'm doing:

This is 1st PR
Fulll/Sample statistics collect process

  1. split count(1)/min/max from full statistics query, use meta query to collect it.
  2. collect ndv/count null by full statistics query only

Serious issues:
For high cardinality(NDV > 0.1%), sample statistics will get severely distorted NDV:

  1. the unpartition table: the HLL-NDV max value should be the min(row_count * 0.1%, 20w)
  2. the partition table: the HLL-NDV max value should be the partition_nums * min(row_count * 0.1%, 20w)

we will handle the question later, maybe not use hll or use other algorithm

modify code:

  1. refactor some code, for merge sample/full statistic code later
  2. add HyperStatisticsJob, to refactor the sample/full statistics job process
  3. update column statistics query process,only query column_statistics table

the HyperStatisticsJob process, same as FullStatisticsJob

  1. collect mertic by query, and save batch in FE (refactor to HyperQueryJob, FullQueryJob, and later will add SampleQueryJob)
  2. insert batch value to column_statistics

RoadMap

next step:

  1. support Analyze stmt work on partition
  2. update Sample Statistics NDV algorithm
  3. remove SampleStatisticsJob/FullStatisticsJob code

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@Seaven Seaven requested a review from a team as a code owner November 7, 2024 07:48
@mergify mergify bot assigned Seaven Nov 7, 2024
@Seaven Seaven changed the title [Enhancement] support meta statistics [Enhancement] merge full/sample statistics collect Nov 13, 2024
@Seaven Seaven requested a review from a team as a code owner November 14, 2024 08:58
@@ -2098,6 +2098,9 @@ public class Config extends ConfigBase {
"we would use sample statistics instead of full statistics")
public static double statistic_sample_collect_ratio_threshold_of_first_load = 0.1;

@ConfField(mutable = true)
public static boolean statistic_use_meta_statistics = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put some comment on it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's temp config on main, I will remove it in next PR

import java.util.List;
import java.util.Map;

public class HyperStatisticsCollectJob extends StatisticsCollectJob {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hyper ?
I guess it's actually a regular sample-like collection job

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is hyper, because Full/Sample always use it


import java.util.List;

public class ColumnClassifier {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment

Copy link
Contributor Author

@Seaven Seaven Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor from com.starrocks.statistic.sample.ColumnSampleManager, will save only one in next pr. For classifiy different column, different column type need different collect way


public abstract class ColumnStats {

protected final String columnName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider add columnId ? as SR has already supported rename column

Copy link
Contributor Author

@Seaven Seaven Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor from com.starrocks.statistic.sample.ColumnStats, will save only one in next pr. I think add columnId a complex work, don't update it in this PR

import java.util.List;
import java.util.stream.Collectors;

public class SubFieldColumnStats extends PrimitiveTypeColumnStats {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to use it ? how to store them in memory if there're thousands of fields in a struct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor from com.starrocks.statistic.sample.SubFieldColumnStats, will save only one in next pr. it's a tool-class for generate statistics SQL, don't store it in memory.

@@ -1457,6 +1457,7 @@ private void executeAnalyze(AnalyzeStmt analyzeStmt, AnalyzeStatus analyzeStatus
statsConnectCtx.getSessionVariable().setStatisticCollectParallelism(
context.getSessionVariable().getStatisticCollectParallelism());
statsConnectCtx.setThreadLocalInfo();
statsConnectCtx.setStatisticsConnection(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be set in the StatisticExecutor, is it necessary to set it here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not all statistic SQL need set the variable, only collect job need

if (table.isTemporaryTable()) {
context.setSessionId(((OlapTable) table).getSessionId());
}
context.getSessionVariable().setEnableAnalyzePhasePruneColumns(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to restore these variables? what if the connection is reused by other jobs ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think don't need, because ConnectContext of statistics is create by itself

int subStart = 0;
int pos = 0;
int subEnd;
while ((subEnd = columnName.indexOf(".", pos)) > 0 && type.isStructType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this piece of code is too complex to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to split&vaild struct column, like a.b.c.d

Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Signed-off-by: Seaven <[email protected]>
Copy link

sonarcloud bot commented Nov 25, 2024

Quality Gate Failed Quality Gate failed

Failed conditions
6.2% Duplication on New Code (required ≤ 3%)
B Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Copy link

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link

[FE Incremental Coverage Report]

fail : 13 / 22 (59.09%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/statistic/StatisticsCollectJobFactory.java 3 7 42.86% [114, 115, 117, 134]
🔵 com/starrocks/qe/SessionVariable.java 1 2 50.00% [1505]
🔵 com/starrocks/statistic/StatisticExecutor.java 7 11 63.64% [146, 157, 163, 164]
🔵 com/starrocks/common/Config.java 1 1 100.00% []
🔵 com/starrocks/qe/StmtExecutor.java 1 1 100.00% []

Copy link

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants