[Enhancement] merge full/sample statistics collect #52693

Seaven · 2024-11-07T07:48:44Z

Why I'm doing:

We want to merge sample and full statistics and store both in the column_statistics table.
We want to improve the performance of full statistics collection.
We want to improve some metrics in sample statistics.

What I'm doing:

This is the first PR in the series.
Full/sample statistics collection process

Split count(1)/min/max out of the full statistics query and collect them with a meta query.
Collect NDV and the null count with the full statistics query only.
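A minimal sketch of how the two partial results described above might be combined into one row before being written to column_statistics. Every class, field, and method name here is hypothetical and chosen for illustration; none of them is taken from the PR's code.

```java
// Sketch only: all names are hypothetical, not classes from this PR.
final class MergedColumnStats {
    /** Values the description says are collected by the meta query. */
    static final class MetaPart {
        final long rowCount;
        final String min;
        final String max;

        MetaPart(long rowCount, String min, String max) {
            this.rowCount = rowCount;
            this.min = min;
            this.max = max;
        }
    }

    /** Values the description says still come from the statistics query. */
    static final class QueryPart {
        final long ndv;
        final long nullCount;

        QueryPart(long ndv, long nullCount) {
            this.ndv = ndv;
            this.nullCount = nullCount;
        }
    }

    final String columnName;
    final long rowCount;
    final String min;
    final String max;
    final long ndv;
    final long nullCount;

    private MergedColumnStats(String columnName, MetaPart meta, QueryPart query) {
        this.columnName = columnName;
        this.rowCount = meta.rowCount;
        this.min = meta.min;
        this.max = meta.max;
        this.ndv = query.ndv;
        this.nullCount = query.nullCount;
    }

    /** Combines the two halves into a single statistics row for one column. */
    static MergedColumnStats merge(String columnName, MetaPart meta, QueryPart query) {
        return new MergedColumnStats(columnName, meta, query);
    }
}
```

Under this assumed design, the expensive scan only has to compute NDV and the null count, while the cheaper meta query supplies the remaining fields.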

Serious issues:
For high-cardinality columns (NDV > 0.1%), sample statistics produce a severely distorted NDV:

For an unpartitioned table, the HLL NDV can be at most min(row_count * 0.1%, 200,000).
For a partitioned table, the HLL NDV can be at most partition_nums * min(row_count * 0.1%, 200,000).

We will address this later, possibly by dropping HLL or switching to another algorithm.
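The two bounds above can be worked through with a small sketch. It assumes that "20w" in the description means 200,000 and that row_count is the value passed to both formulas exactly as written; the class and method names are hypothetical.

```java
// Sketch only: names are hypothetical and the constants are read off the PR description.
final class SampleNdvCap {
    /** 0.1% of the rows, expressed as an integer divisor to avoid floating-point rounding. */
    static final long RATIO_DIVISOR = 1_000L;
    /** Absolute cap, assuming "20w" means 200,000. */
    static final long MAX_SAMPLE_ROWS = 200_000L;

    /** Upper bound on the sampled HLL NDV for an unpartitioned table. */
    static long unpartitioned(long rowCount) {
        return Math.min(rowCount / RATIO_DIVISOR, MAX_SAMPLE_ROWS);
    }

    /** Upper bound for a partitioned table: partition count times the bound above. */
    static long partitioned(long rowCount, int partitionNums) {
        return (long) partitionNums * unpartitioned(rowCount);
    }

    public static void main(String[] args) {
        // One billion rows hit the absolute cap rather than the 0.1% term.
        System.out.println(unpartitioned(1_000_000_000L));
        System.out.println(partitioned(1_000_000_000L, 10));
    }
}
```

With these formulas, a column whose true NDV is far above the bound cannot be reported accurately from the sample alone, which is the distortion the description refers to.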

Code changes:

Refactor some code in preparation for merging the sample/full statistics code later.
Add HyperStatisticsJob to refactor the sample/full statistics job process.
Update the column statistics query process to query only the column_statistics table.

The HyperStatisticsJob process is the same as FullStatisticsJob:

Collect metrics by query and buffer the results in batches in FE (refactored into HyperQueryJob and FullQueryJob; SampleQueryJob will be added later).
Insert the batched values into column_statistics.
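The collect-then-insert flow above follows a common buffering pattern, sketched below. This is not the PR's implementation: `BatchBuffer`, its batch size, and the `sink` callback (standing in for the insert into column_statistics) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: a generic buffer, not the HyperStatisticsJob code.
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // hypothetical stand-in for the batch insert
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one collected row and flushes once the batch is full. */
    void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Hands any buffered rows to the sink; call once more after the last query finishes. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink owns its list
        pending.clear();
    }
}
```

A caller would add each query result as it arrives and call `flush()` at the end so that a final partial batch is not lost.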

RoadMap

Next steps:

Support running the ANALYZE statement on partitions.
Update the sample statistics NDV algorithm.
Remove the SampleStatisticsJob/FullStatisticsJob code.

Fixes #issue

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

murphyatwork · 2024-11-18T06:54:05Z

fe/fe-core/src/main/java/com/starrocks/common/Config.java

@@ -2098,6 +2098,9 @@ public class Config extends ConfigBase {
            "we would use sample statistics instead of full statistics")
    public static double statistic_sample_collect_ratio_threshold_of_first_load = 0.1;

+    @ConfField(mutable = true)
+    public static boolean statistic_use_meta_statistics = true;


Can you add a comment on it?

It's a temporary config on main; I will remove it in the next PR.

murphyatwork · 2024-11-18T06:56:43Z

fe/fe-core/src/main/java/com/starrocks/statistic/HyperStatisticsCollectJob.java

+import java.util.List;
+import java.util.Map;
+
+public class HyperStatisticsCollectJob extends StatisticsCollectJob {


Hyper?
I guess it's actually a regular sample-like collection job.

I think "hyper" fits, because both Full and Sample always use it.

murphyatwork · 2024-11-18T06:58:36Z

fe/fe-core/src/main/java/com/starrocks/statistic/base/ColumnClassifier.java

+
+import java.util.List;
+
+public class ColumnClassifier {


Refactored from com.starrocks.statistic.sample.ColumnSampleManager; only one of the two will be kept in the next PR. It classifies columns, because different column types need different collection methods.

murphyatwork · 2024-11-18T06:59:14Z

fe/fe-core/src/main/java/com/starrocks/statistic/base/ColumnStats.java

+
+public abstract class ColumnStats {
+
+    protected final String columnName;


Consider adding columnId? SR already supports renaming columns.

Refactored from com.starrocks.statistic.sample.ColumnStats; only one of the two will be kept in the next PR. I think adding columnId is complex work, so I won't do it in this PR.

murphyatwork · 2024-11-18T07:00:43Z

fe/fe-core/src/main/java/com/starrocks/statistic/base/SubFieldColumnStats.java

+import java.util.List;
+import java.util.stream.Collectors;
+
+public class SubFieldColumnStats extends PrimitiveTypeColumnStats {


How is it used? How are they stored in memory if there are thousands of fields in a struct?

Refactored from com.starrocks.statistic.sample.SubFieldColumnStats; only one of the two will be kept in the next PR. It's a utility class for generating statistics SQL; it isn't stored in memory.

murphyatwork · 2024-11-19T03:01:30Z

fe/fe-core/src/main/java/com/starrocks/qe/StmtExecutor.java

@@ -1457,6 +1457,7 @@ private void executeAnalyze(AnalyzeStmt analyzeStmt, AnalyzeStatus analyzeStatus
        statsConnectCtx.getSessionVariable().setStatisticCollectParallelism(
                context.getSessionVariable().getStatisticCollectParallelism());
        statsConnectCtx.setThreadLocalInfo();
+        statsConnectCtx.setStatisticsConnection(true);


It would be set in the StatisticExecutor; is it necessary to set it here?

Not all statistics SQL needs this variable set; only collect jobs need it.

murphyatwork · 2024-11-19T03:03:15Z

fe/fe-core/src/main/java/com/starrocks/statistic/HyperStatisticsCollectJob.java

+        if (table.isTemporaryTable()) {
+            context.setSessionId(((OlapTable) table).getSessionId());
+        }
+        context.getSessionVariable().setEnableAnalyzePhasePruneColumns(true);


Is it necessary to restore these variables? What if the connection is reused by other jobs?

I don't think so, because the statistics job creates its own ConnectContext.

murphyatwork · 2024-11-19T03:09:53Z

fe/fe-core/src/main/java/com/starrocks/statistic/base/ColumnClassifier.java

+                            int subStart = 0;
+                            int pos = 0;
+                            int subEnd;
+                            while ((subEnd = columnName.indexOf(".", pos)) > 0 && type.isStructType()) {


This piece of code is too complex to read.

It splits and validates a struct column path such as a.b.c.d.
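For readers following this thread, the intent described in the reply (walk a dotted path and check each segment against the struct type) can be sketched in a simpler form. This is not the code under review: it models a struct as a nested `Map`, assumes field names never contain a dot, and uses the hypothetical names `StructPath` and `resolve`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch only: a struct is modelled as Map<String, Object>; any other value is a leaf type.
final class StructPath {
    /**
     * Resolves a dotted path such as "a.b.c.d" against a nested schema.
     * Returns the path segments, or an empty Optional when a segment is missing
     * or the path tries to descend through a non-struct value.
     */
    @SuppressWarnings("unchecked")
    static Optional<List<String>> resolve(Map<String, Object> root, String path) {
        List<String> segments = new ArrayList<>();
        Object current = root;
        for (String name : path.split("\\.", -1)) {
            if (!(current instanceof Map)) {
                return Optional.empty(); // previous segment was a leaf, cannot go deeper
            }
            Map<String, Object> struct = (Map<String, Object>) current;
            if (!struct.containsKey(name)) {
                return Optional.empty(); // no such field in this struct
            }
            segments.add(name);
            current = struct.get(name);
        }
        return Optional.of(segments);
    }
}
```

The real loop in ColumnClassifier scans with `indexOf` and checks `type.isStructType()` on each step, so it may handle cases this sketch deliberately leaves out.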

sonarcloud · 2024-11-25T04:45:57Z

Quality Gate failed

Failed conditions
6.2% Duplication on New Code (required ≤ 3%)
B Reliability Rating on New Code (required ≥ A)

github-actions · 2024-11-25T05:36:06Z

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2024-11-25T05:36:07Z

[FE Incremental Coverage Report]

❌ fail : 13 / 22 (59.09%)

file detail

| | path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|---|
| 🔵 | com/starrocks/statistic/StatisticsCollectJobFactory.java | 3 | 7 | 42.86% | [114, 115, 117, 134] |
| 🔵 | com/starrocks/qe/SessionVariable.java | 1 | 2 | 50.00% | [1505] |
| 🔵 | com/starrocks/statistic/StatisticExecutor.java | 7 | 11 | 63.64% | [146, 157, 163, 164] |
| 🔵 | com/starrocks/common/Config.java | 1 | 1 | 100.00% | [] |
| 🔵 | com/starrocks/qe/StmtExecutor.java | 1 | 1 | 100.00% | [] |

github-actions · 2024-11-25T05:36:55Z

[BE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

Seaven requested a review from a team as a code owner November 7, 2024 07:48

mergify bot assigned Seaven Nov 7, 2024

Seaven force-pushed the stat branch from 6338d87 to daf0a3a Compare November 7, 2024 09:34

github-actions bot added behavior_changed and removed behavior_changed labels Nov 7, 2024

Seaven force-pushed the stat branch from de3c991 to 1f75bcf Compare November 11, 2024 12:01

Seaven changed the title ~~[Enhancement] support meta statistics~~ [Enhancement] merge full/sample statistics collect Nov 13, 2024

Seaven force-pushed the stat branch 2 times, most recently from c9963a4 to 243f61d Compare November 13, 2024 12:46

Seaven requested a review from a team as a code owner November 14, 2024 08:58

Seaven force-pushed the stat branch from b2bc329 to 910be34 Compare November 14, 2024 08:58

murphyatwork reviewed Nov 18, 2024

murphyatwork reviewed Nov 19, 2024

Seaven force-pushed the stat branch from 713429f to 08b652c Compare November 19, 2024 06:53

Seaven added 16 commits November 20, 2024 17:27

[Enhancement] support meta statistics

fcc432d

Signed-off-by: Seaven <[email protected]>

update

5c0aba0

Signed-off-by: Seaven <[email protected]>

--amend

af3b679

Signed-off-by: Seaven <[email protected]>

iiiii

4b61762

Signed-off-by: Seaven <[email protected]>

updateddd

a1b64ba

Signed-off-by: Seaven <[email protected]>

fix

2d93f4c

Signed-off-by: Seaven <[email protected]>

fix ut

97ace57

Signed-off-by: Seaven <[email protected]>

update sample statistic

a8bf0a1

Signed-off-by: Seaven <[email protected]>

fix

3571644

Signed-off-by: Seaven <[email protected]>

fixxxx

5a8a453

Signed-off-by: Seaven <[email protected]>

fix ut

6afd07f

Signed-off-by: Seaven <[email protected]>

fix

b0db8a5

Signed-off-by: Seaven <[email protected]>

update name

f304426

Signed-off-by: Seaven <[email protected]>

uppp

5ff8703

Signed-off-by: Seaven <[email protected]>

fixxxxx

8315df5

Signed-off-by: Seaven <[email protected]>

add ut

893454f

Signed-off-by: Seaven <[email protected]>

Seaven added 6 commits November 20, 2024 17:27

fix ut

f85e101

Signed-off-by: Seaven <[email protected]>

fixcomment

4d088ec

Signed-off-by: Seaven <[email protected]>

fixxxx

f5fd1f8

Signed-off-by: Seaven <[email protected]>

fix comment

436b2a5

Signed-off-by: Seaven <[email protected]>

add comment

fa089df

Signed-off-by: Seaven <[email protected]>

udpate

ee07ff6

Signed-off-by: Seaven <[email protected]>

Seaven force-pushed the stat branch from 591af59 to ee07ff6 Compare November 20, 2024 09:27

Seaven added 2 commits November 21, 2024 09:47

fix ut

91b143f

Signed-off-by: Seaven <[email protected]>

fix ut

a6e7bdb

Signed-off-by: Seaven <[email protected]>
