Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Compare mv rowCount when both mv dimensions contains query dimensions #51511

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kaijianding
Copy link
Contributor

@kaijianding kaijianding commented Sep 27, 2024

Why I'm doing:

-- rows: 100 mv1: 
select sum(v1) from t1 group by a, b, c, f; 

-- rows: 10000 mv2: 
select sum(v1) from t1 group by a, b, d; 

query: select sum(v1) from t1 where b = 'a' group by a;

Before: prefer mv with less dimensions -> mv2, but mv2 has more rows which is not expected.
After: when many mvs satisfy query, prefer mv with less rows, so choose mv1.

What I'm doing:

  1. MaterializationContext.RewriteOrdering decides which MVs are retained in case too many MV candidates and candidate list should be truncated. MaterializationContext.RewriteOrdering should compare MV's maxPartitionRowCount rather than MV's total row count, in case a MV is with less partitions and total row count but maxPartitionRowCount is large.
  2. BestMvSelector is the actually place to decide which MV should be chose. The comparator in BestMvSelector should consider row count first if MV's outputRowCount is not zero in statistics.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@kaijianding kaijianding requested a review from a team as a code owner September 27, 2024 09:39
@github-actions github-actions bot added the 3.3 label Sep 27, 2024
return orderingRowCount(mv);
}
return (long) r;
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eg:

-- rows: 100
mv1:  select sum(v1) from t1 group by a, b, c;
-- rows: 10000
mv2: select sum(v1) from t1 group by a, b;

query: select sum(v1) from t1 where b = 'a' group by a;

Before: prefer mv with less dimensions -> mv2, but mv2 has more rows which is not expected.
After: when many mvs satisfy query, prefer mv with less rows, so choose mv1.

am I right? can you add more comments about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can you move this codes into orderingAggregation to make it more clear?

Copy link
Contributor Author

@kaijianding kaijianding Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,this is what I expect。but mv definition is a little different.

-- rows: 100 mv1: 
select sum(v1) from t1 group by a, b, c, f; 

-- rows: 10000 mv2: 
select sum(v1) from t1 group by a, b, d; 

query: select sum(v1) from t1 where b = 'a' group by a;

Copy link
Contributor Author

@kaijianding kaijianding Oct 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orderingAggregation is about single MV, while these codes are about 2 MVs, so I don't change logic in orderingAggregation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems solid.
Can you add this example into comments?

return orderingRowCount(mv);
}
return (long) r;
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems solid.
Can you add this example into comments?

@@ -370,8 +371,16 @@ public int compare(MaterializationContext o1, MaterializationContext o2) {
OperatorType o2Type = o2.getMvExpression().getOp().getOpType();

if (o1Type == o2Type && (o1Type == OperatorType.LOGICAL_AGGR)) {
boolean mvHasDifferentRows = orderingRowCount(o1) != 0 && orderingRowCount(o2) != 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this policy too sensitive? for an exmaple, MV0 and MV1 have N rows and N+1 rows respectively, MV1 may be a promising one for the certain query? so mvHasDifferentRows should introduce an epsilon(ε) so that |MV1-MV1|/max(|MV1|,|MV2|) > ε, then we consider that the larger one is superior to the smaller one significantly.

Copy link
Contributor Author

@kaijianding kaijianding Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't have enough clue to tell if MV1 is actually better than MV0 or not here.
if mvHasDifferentRows is set to false when MV0 and MV1 have similar row number, the later code still compares the dimension diff number and later the row number.

My point is that we can't exactlly tell which mv is better for a particular query, we can only believe that comparing row count is more promising than comparing dimension diff number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rowCount is promising only when there is a significant gap between two MVs' rowcounts. if the row counts are almost the same, we also can not tell out which MV is better. I show an exmaple:

  1. MV0: [d0,d1,d2, sum(m0)];
  2. MV1: [d1,d0,d3, sum(m0)];
    we suppose that |MV0| = N, |MV1|= M, N>M and N-M is small;

for query: select d1, sum(m0) from t where predicate(d0) group by d1;
if MV0 has short key on d0, this query fetches data from IO layer quickly.
so I think only two MV's rowcount are different drastically, the one of smaller rowcount is promissing. if they are close, we should not use the rowcounts.

Copy link
Contributor Author

@kaijianding kaijianding Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your scenario and the limitation of exactly comparing mv row count.

What I do in this PR is prefering comparing mv rount to comparing the deminsion diff number, I get that this is not the most perfect solution. But it is much better than current logic.

Current we don't have enough information(eg: the key columns used in query's predicate and how many rows can be reduced by this predicate, etc) to make sure that MV1 is better choice than MV0 when MV1 has larger row count, other than not choosing MV0 when MV0 has less row count.

My point is that we can't tell which mv is better when the two mv have similar row count according to the current information from the context at current stage, it's still better comparing row count than comparing dimension diff number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the logic in BestMvSeclector which is the actually class choosing mv according to your comment

Signed-off-by: kaijian.ding <[email protected]>
LiShuMing
LiShuMing previously approved these changes Oct 17, 2024
Copy link

sonarcloud bot commented Oct 18, 2024

Copy link

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link

[FE Incremental Coverage Report]

pass : 21 / 22 (95.45%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/sql/optimizer/rule/transformation/materialization/BestMvSelector.java 8 9 88.89% [155]
🔵 com/starrocks/catalog/MaterializedView.java 5 5 100.00% []
🔵 com/starrocks/sql/optimizer/MaterializationContext.java 8 8 100.00% []

Copy link

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants