fix: 1.9x improvement to median q-error by fixing multi-equality join selectivity (#171)

**Summary**: Previously, we computed multi-equality join selectivity by
building an MST of the join graph. However, the correct method is to
take the N-1 _nodes_ with the highest n-distinct values.

**Demo**:
This fix lets us **beat Postgres on median q-error** for the first
time, and on p90 q-error as well. Overall, it improves our median
q-error by 1.9x, p90 q-error by 3.4x, p95 q-error by 42.1x, and p99
q-error by 2.6x, and we now beat Postgres on 9 queries where we
previously did not.

Before (after changing `DEFAULT_PRECISION` and `DEFAULT_K_TO_TRACK` but
before multi-equality fix):
![Screenshot 2024-04-27 at 16 00 18](https://github.com/cmu-db/optd/assets/20631215/7f53fcbc-a755-42eb-8f23-3216790d5a50)

After:
![Screenshot 2024-04-27 at 20 02 00](https://github.com/cmu-db/optd/assets/20631215/5bae0945-9375-4b2d-990b-949a1c3f3317)

**Details**:
* To see the problem, consider joining three tables where T1 has 2
distinct values, T2 has 3, and T3 has 4. Assume all values only occur
once per table, and all values in the smaller tables appear in the
larger tables. The cartesian product of all three tables is 24 and the
join result is 2, so the overall selectivity should be 1/12. However,
the old method would sometimes give an overall selectivity of 1/16
because there are two edges in the join graph (T1-T3 and T2-T3) which
both have a selectivity of 1/4.
* Added rigorous unit tests covering all possible permutations of
three-table joins. Some of these tests failed before the fix; they pass
after it.
* Properly handle cases where an added predicate either _extends_ an
existing connected component or _merges_ two existing connected
components. Added unit tests for both of these cases as well.
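
The three-table example above can be checked with a minimal sketch. This is not the code from this commit: the function names `component_selectivity` and `spanning_tree_selectivity` are hypothetical, and the edge-based version assumes each edge contributes `1 / max(ndistinct_a, ndistinct_b)`, which matches the 1/4 figures quoted above. All n-distinct values are assumed to be nonzero.

```rust
/// Node-based estimate for one connected component of equality predicates:
/// multiply 1/ndistinct over the N-1 nodes with the highest n-distinct values.
/// (Hypothetical helper for illustration; assumes every n-distinct is nonzero.)
pub fn component_selectivity(ndistincts: &[u64]) -> f64 {
    let mut sorted = ndistincts.to_vec();
    // Sort in descending order so the largest values come first.
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    sorted
        .iter()
        .take(ndistincts.len().saturating_sub(1))
        .map(|&nd| 1.0 / nd as f64)
        .product()
}

/// Edge-based estimate for comparison: multiply the per-edge selectivity over
/// the edges of a given spanning tree. The per-edge formula 1/max(nd_a, nd_b)
/// is an assumption made for this sketch.
pub fn spanning_tree_selectivity(ndistincts: &[u64], edges: &[(usize, usize)]) -> f64 {
    edges
        .iter()
        .map(|&(a, b)| 1.0 / ndistincts[a].max(ndistincts[b]) as f64)
        .product()
}
```

With n-distinct values `[2, 3, 4]` for T1, T2, and T3, the node-based estimate is 1/(4 * 3) = 1/12. The edge-based estimate depends on which spanning tree is chosen: the edges T1-T3 and T2-T3 give 1/4 * 1/4 = 1/16, while T1-T2 and T2-T3 give 1/3 * 1/4 = 1/12.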
wangpatrick57 authored Apr 28, 2024
1 parent 2fa6c62 commit 5958b3d
Showing 10 changed files with 328 additions and 147 deletions.
1 change: 1 addition & 0 deletions Cargo.lock

2 changes: 1 addition & 1 deletion dev_scripts/which_queries_work.sh
@@ -1,6 +1,6 @@
#!/bin/bash
benchmark_name=$1
-USAGE="Usage: $0 [job|tpch]"
+USAGE="Usage: $0 [job|joblight|tpch]"

if [ $# -ne 1 ]; then
echo >&2 $USAGE
1 change: 1 addition & 0 deletions optd-datafusion-repr/Cargo.toml
@@ -26,3 +26,4 @@ serde = { version = "1.0", features = ["derive"] }
serde_with = {version = "3.7.0", features = ["json"]}
bincode = "1.3.3"
union-find = { git = "https://github.com/Gun9niR/union-find-rs.git", rev = "794821514f7daefcbb8d5f38ef04e62fc18b5665" }
test-case = "3.3"
44 changes: 44 additions & 0 deletions optd-datafusion-repr/src/cost/base_cost.rs
@@ -318,6 +318,7 @@ mod tests {
pub const TABLE1_NAME: &str = "table1";
pub const TABLE2_NAME: &str = "table2";
pub const TABLE3_NAME: &str = "table3";
pub const TABLE4_NAME: &str = "table4";

// one column is sufficient for all filter selectivity tests
pub fn create_one_column_cost_model(per_column_stats: TestPerColumnStats) -> TestOptCostModel {
@@ -379,6 +380,49 @@ mod tests {
)
}

/// Create a cost model with four columns, one for each table. Each column has 100 values.
pub fn create_four_table_cost_model(
tbl1_per_column_stats: TestPerColumnStats,
tbl2_per_column_stats: TestPerColumnStats,
tbl3_per_column_stats: TestPerColumnStats,
tbl4_per_column_stats: TestPerColumnStats,
) -> TestOptCostModel {
OptCostModel::new(
vec![
(
String::from(TABLE1_NAME),
TableStats::new(
100,
vec![(vec![0], tbl1_per_column_stats)].into_iter().collect(),
),
),
(
String::from(TABLE2_NAME),
TableStats::new(
100,
vec![(vec![0], tbl2_per_column_stats)].into_iter().collect(),
),
),
(
String::from(TABLE3_NAME),
TableStats::new(
100,
vec![(vec![0], tbl3_per_column_stats)].into_iter().collect(),
),
),
(
String::from(TABLE4_NAME),
TableStats::new(
100,
vec![(vec![0], tbl4_per_column_stats)].into_iter().collect(),
),
),
]
.into_iter()
.collect(),
)
}

/// We need custom row counts because some join algorithms rely on the row cnt
pub fn create_two_table_cost_model_custom_row_cnts(
tbl1_per_column_stats: TestPerColumnStats,