fix: 1.9x improvement to median q-error by fixing multi-equality join selectivity #171

wangpatrick57 · 2024-04-28T00:03:08Z

Summary: Previously, we computed multi-equality join selectivity by building an MST of the join graph. However, the correct method is to take the N-1 nodes (columns) with the highest n-distinct values and multiply their 1/n-distinct selectivities.

Demo:
With this fix, we beat Postgres on both median q-error and p90 q-error for the first time. Overall, it improves our median q-error by 1.9x, p90 q-error by 3.4x, p95 q-error by 42.1x, and p99 q-error by 2.6x, and lets us beat Postgres on 9 queries where we previously did not.
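For readers unfamiliar with the metric, here is a minimal sketch of how q-error and its percentiles are commonly computed. The function names `q_error` and `percentile` are hypothetical, and both the clamping to one row and the nearest-rank percentile method are assumptions made for illustration; the PR does not say how the benchmark harness computes these numbers.

```rust
/// Q-error of a cardinality estimate: max(est / actual, actual / est).
/// It is always >= 1, and 1 means the estimate was exact.
/// Assumption: both values are clamped to at least 1 row to avoid division by zero.
fn q_error(estimated: f64, actual: f64) -> f64 {
    let est = estimated.max(1.0);
    let act = actual.max(1.0);
    (est / act).max(act / est)
}

/// Nearest-rank percentile of a list of q-errors (p in 0..=100).
/// Returns None for an empty list.
fn percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: the smallest value with at least p% of the data at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

Under these assumptions, `percentile(&errors, 50.0)` would give the median q-error and `percentile(&errors, 90.0)` the p90 q-error for a workload.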

Before (after changing DEFAULT_PRECISION and DEFAULT_K_TO_TRACK but before multi-equality fix):

After:

Details:

To see the problem, consider joining three tables with equality predicates, where T1 has 2 distinct values, T2 has 3, and T3 has 4. Assume each value occurs only once per table, and every value in a smaller table also appears in the larger tables. The Cartesian product of all three tables has 2 × 3 × 4 = 24 rows and the join result has 2 rows, so the overall selectivity should be 2/24 = 1/12. However, the old method would sometimes give an overall selectivity of 1/16, because the join graph has two edges (T1-T3 and T2-T3) that each have a selectivity of 1/4.
Added rigorous unit tests covering all possible permutations of three-table joins. Before the fix, some of these tests failed. After the fix, they pass.
Properly handled the cases where a newly added predicate either extends an existing connected component or merges two existing connected components. Added unit tests for both of these cases as well.
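The arithmetic in the example above can be sketched as follows. This is an illustration under stated assumptions, not the code in `join.rs`: the function names are hypothetical, each connected component is represented only by the n-distinct values of its columns, and every n-distinct is assumed to be positive.

```rust
/// Selectivity of one connected component of columns that must all be equal.
/// Sketch of the rule from the summary: multiply 1/n-distinct over the
/// N-1 columns with the highest n-distinct values.
/// Assumption: every n-distinct value is > 0.
fn multi_equality_selectivity(ndistincts: &[u64]) -> f64 {
    if ndistincts.len() < 2 {
        return 1.0; // a single column imposes no equality constraint
    }
    let mut sorted = ndistincts.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending order
    sorted[..sorted.len() - 1]
        .iter()
        .map(|&n| 1.0 / n as f64)
        .product()
}

/// Selectivity of a single equality edge between two columns,
/// matching the 1/4 figure quoted for T1-T3 and T2-T3.
fn edge_selectivity(a: u64, b: u64) -> f64 {
    1.0 / a.max(b) as f64
}

/// Multiplicative adjustment when a new predicate joins two components.
/// Extending a component is the special case where `right` has one column.
/// Hypothetical formulation: new selectivity divided by the old selectivities.
fn adjustment_on_merge(left: &[u64], right: &[u64]) -> f64 {
    let merged: Vec<u64> = left.iter().chain(right.iter()).copied().collect();
    multi_equality_selectivity(&merged)
        / (multi_equality_selectivity(left) * multi_equality_selectivity(right))
}
```

With n-distinct values 2, 3, and 4, `multi_equality_selectivity(&[2, 3, 4])` multiplies 1/4 by 1/3, which gives the expected 1/12, whereas multiplying the two edge selectivities `edge_selectivity(2, 4)` and `edge_selectivity(3, 4)` reproduces the 1/16 of the old method. In this sketch the result does not depend on the order in which predicates are added: starting from the T1-T3 component (1/4) and then adding T2 yields an adjustment of 1/3, while starting from T1-T2 (1/3) and adding T3 yields 1/4.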


Gun9niR · 2024-04-28T02:39:30Z

optd-datafusion-repr/src/cost/base_cost/join.rs

@@ -430,27 +393,43 @@ impl<
    /// NOTE: This function modifies `past_eq_columns` by adding `predicate` to it.
    fn get_join_selectivity_adjustment_from_redundant_predicates(


This function should be renamed to something like get_join_selectivity_from_col_eq_predicate, and the comment should emphasize the principle of inclusion.

Gun9niR

Nice catch!

wangpatrick57 added 10 commits April 27, 2024 14:53

added actual set of working job light queries

3c7aba5

fmt and clip

c2c99d8

changed precision of mg and hll, getting us from 35 -> 40 queries ahead

ec81ca6

test_inner_redundant_predicate -> test_add_edge_to_multi_equal_graph_which_maintains_mst

af00dd6

wrote test_three_table_join_for_join1_on_cond test. not passing yet

28a3d26

turned get_join_selectivity_from_most_selective_predicates into get_join_selectivity_from_most_selective_columns

9104a81

fixed bug of adding new predicate that touches one col of existing pred

91791f7

generalized three table join test to allow for arbitrary initial join conds

d9a8f29

added test_join_which_connects_two_components_together

42a9162

fmt and clip

0474fcd

wangpatrick57 marked this pull request as ready for review April 28, 2024 00:03

wangpatrick57 requested a review from Gun9niR April 28, 2024 00:05

Gun9niR reviewed Apr 28, 2024

Gun9niR reviewed Apr 28, 2024

Gun9niR approved these changes Apr 28, 2024

wangpatrick57 added 2 commits April 28, 2024 11:01

changed comment and name of what is now get_join_selectivity_adjustment_when_adding_to_multi_equality_graph()

4fc54c1

inclusion principle comment

e5b2adf

wangpatrick57 merged commit 5958b3d into main Apr 28, 2024
1 check passed

wangpatrick57 deleted the phw2/multi-equality-fix branch April 28, 2024 17:52
