[Enhancement] merge full/sample statistics collect #52693
base: main
c9963a4 to 243f61d
@@ -2098,6 +2098,9 @@ public class Config extends ConfigBase {
            "we would use sample statistics instead of full statistics")
    public static double statistic_sample_collect_ratio_threshold_of_first_load = 0.1;

    @ConfField(mutable = true)
    public static boolean statistic_use_meta_statistics = true;
Can you add a comment on this config?
It's a temporary config on main; I will remove it in the next PR.
import java.util.List;
import java.util.Map;

public class HyperStatisticsCollectJob extends StatisticsCollectJob {
Hyper? I guess it's actually a regular sample-like collection job.
I think it is hyper, because both Full and Sample collection always use it.

import java.util.List;

public class ColumnClassifier {
Please add a comment.
Refactored from com.starrocks.statistic.sample.ColumnSampleManager; only one of the two will be kept in the next PR. It classifies columns, because different column types need different collection methods.
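To make the idea of classifying columns by type concrete, here is a minimal sketch. It is not the ColumnClassifier from this PR: the class name ColumnClassifierSketch, the Kind enum, and the two buckets are assumptions made up for illustration, and the real class may split columns along different lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every name here is illustrative, not the StarRocks API.
class ColumnClassifierSketch {
    // Simplified stand-in for a column's type category (assumption).
    enum Kind { PRIMITIVE, COMPLEX }

    // Columns assumed to support the full set of statistics (min, max, NDV, ...).
    final List<String> primitiveColumns = new ArrayList<>();
    // Columns (for example ARRAY, MAP, STRUCT) assumed to need a different collection path.
    final List<String> complexColumns = new ArrayList<>();

    // Sorts each column into a bucket according to its kind, preserving input order.
    static ColumnClassifierSketch of(Map<String, Kind> columns) {
        ColumnClassifierSketch classifier = new ColumnClassifierSketch();
        for (Map.Entry<String, Kind> entry : columns.entrySet()) {
            if (entry.getValue() == Kind.PRIMITIVE) {
                classifier.primitiveColumns.add(entry.getKey());
            } else {
                classifier.complexColumns.add(entry.getKey());
            }
        }
        return classifier;
    }

    public static void main(String[] args) {
        Map<String, Kind> columns = new LinkedHashMap<>();
        columns.put("id", Kind.PRIMITIVE);
        columns.put("tags", Kind.COMPLEX);
        columns.put("price", Kind.PRIMITIVE);
        ColumnClassifierSketch classifier = of(columns);
        System.out.println("primitive=" + classifier.primitiveColumns
                + " complex=" + classifier.complexColumns);
    }
}
```

Each bucket can then be handed to its own SQL generator, which is the point of classifying before collecting.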

public abstract class ColumnStats {

    protected final String columnName;
Consider adding columnId? SR already supports renaming columns.
Refactored from com.starrocks.statistic.sample.ColumnStats; only one of the two will be kept in the next PR. I think adding columnId is complex work, so I won't change it in this PR.
import java.util.List;
import java.util.stream.Collectors;

public class SubFieldColumnStats extends PrimitiveTypeColumnStats {
How is this used? How are they stored in memory if there are thousands of fields in a struct?
Refactored from com.starrocks.statistic.sample.SubFieldColumnStats; only one of the two will be kept in the next PR. It's a utility class for generating statistics SQL, and it isn't stored in memory.
@@ -1457,6 +1457,7 @@ private void executeAnalyze(AnalyzeStmt analyzeStmt, AnalyzeStatus analyzeStatus
        statsConnectCtx.getSessionVariable().setStatisticCollectParallelism(
                context.getSessionVariable().getStatisticCollectParallelism());
        statsConnectCtx.setThreadLocalInfo();
        statsConnectCtx.setStatisticsConnection(true);
It would be set in the StatisticExecutor; is it necessary to set it here?
Not all statistics SQL needs this variable set; only collection jobs need it.
        if (table.isTemporaryTable()) {
            context.setSessionId(((OlapTable) table).getSessionId());
        }
        context.getSessionVariable().setEnableAnalyzePhasePruneColumns(true);
Is it necessary to restore these variables? What if the connection is reused by other jobs?
I don't think so, because statistics collection creates its own ConnectContext.
        int subStart = 0;
        int pos = 0;
        int subEnd;
        while ((subEnd = columnName.indexOf(".", pos)) > 0 && type.isStructType()) {
This piece of code is too complex to read.
It splits and validates a struct column path, like a.b.c.d.
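As a rough illustration of splitting and validating a path such as a.b.c.d, here is a simplified sketch. It is not the code from this PR. It assumes field names never contain a dot, which the loop in the PR does not appear to assume, and it models a struct type as a nested Map purely for illustration. StructPathSketch and splitAndValidate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a struct is modeled as Map<fieldName, fieldType>,
// and any non-Map value (for example the string "INT") is a non-struct type.
class StructPathSketch {
    // Splits a dotted name into segments and checks each one against the schema.
    // Returns the segments, or null when a segment is missing or when the path
    // tries to descend into a non-struct type.
    static List<String> splitAndValidate(String columnName, Map<String, Object> columns) {
        List<String> path = new ArrayList<>();
        Object type = columns; // treat the table as a struct of top-level columns
        for (String part : columnName.split("\\.")) {
            if (!(type instanceof Map)) {
                return null; // current type is not a struct, cannot go deeper
            }
            Map<?, ?> struct = (Map<?, ?>) type;
            if (!struct.containsKey(part)) {
                return null; // no such field at this level
            }
            path.add(part);
            type = struct.get(part);
        }
        return path;
    }

    public static void main(String[] args) {
        Map<String, Object> columns =
                Map.of("a", Map.of("b", Map.of("c", Map.of("d", "INT"))));
        System.out.println(splitAndValidate("a.b.c.d", columns)); // valid path
        System.out.println(splitAndValidate("a.x", columns));     // unknown field
    }
}
```

Walking the schema one segment at a time keeps the validation and the split in a single pass, at the cost of not supporting dotted field names.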
Signed-off-by: Seaven <[email protected]>
Quality Gate failed (SonarQube Cloud).
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ❌ fail : 13 / 22 (59.09%)
[BE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
Why I'm doing:
What I'm doing:
This is the 1st PR.
Full/Sample statistics collection process
Serious issues:
For high-cardinality columns (NDV > 0.1%), sample statistics produce a severely distorted NDV.
We will handle this problem later, perhaps by not using HLL or by using another algorithm.
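The distortion can be reproduced with a small simulation. This is only a sketch under stated assumptions: the synthetic data, the 1% Bernoulli sample, and the naive linear scale-up are illustrative choices, not the estimator StarRocks uses, and SampleNdvSketch is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: compares the true NDV of a synthetic column with the
// NDV seen in a random sample and with a naive linear scale-up of that value.
class SampleNdvSketch {
    // Returns {trueNdv, sampleNdv, linearlyScaledNdv}.
    static long[] simulate(int rows, int distinct, double ratio, long seed) {
        Random random = new Random(seed);
        Set<Integer> all = new HashSet<>();
        Set<Integer> sampled = new HashSet<>();
        for (int i = 0; i < rows; i++) {
            int value = i % distinct; // each value repeats rows / distinct times
            all.add(value);
            if (random.nextDouble() < ratio) { // keep roughly `ratio` of the rows
                sampled.add(value);
            }
        }
        long scaled = Math.round(sampled.size() / ratio); // naive scale-up
        return new long[] {all.size(), sampled.size(), scaled};
    }

    public static void main(String[] args) {
        // 1,000,000 rows, 10,000 distinct values (1% of rows), 1% sample.
        long[] result = simulate(1_000_000, 10_000, 0.01, 42L);
        System.out.println("true NDV     = " + result[0]);
        System.out.println("sample NDV   = " + result[1]);
        System.out.println("scaled NDV   = " + result[2]);
    }
}
```

In this setup the raw sample NDV falls below the true value because some values are never sampled, while the linear scale-up overshoots it by a wide margin, so neither number can be used as-is.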
Code changes:
The HyperStatisticsJob process is the same as FullStatisticsJob.
RoadMap
next step: