fix: 1.9x improvement to median q-error by fixing multi-equality join selectivity (#171)

**Summary**: Previously, we computed multi-equality join selectivity by building an MST of the join graph. However, the correct method is to take the N-1 _nodes_ with the highest n-distinct values.

**Demo**: This fix causes us to **beat Postgres on median q-error** for the first time ever. We also now beat Postgres on p90 q-error for the first time ever. Overall, it improves our median q-error by 1.9x, p90 q-error by 3.4x, p95 q-error by 42.1x, p99 q-error by 2.6x, and lets us beat Postgres on 9 queries we previously didn't beat them on.

Before (after changing `DEFAULT_PRECISION` and `DEFAULT_K_TO_TRACK` but before multi-equality fix):

![Screenshot 2024-04-27 at 16 00 18](https://github.com/cmu-db/optd/assets/20631215/7f53fcbc-a755-42eb-8f23-3216790d5a50)

After:

![Screenshot 2024-04-27 at 20 02 00](https://github.com/cmu-db/optd/assets/20631215/5bae0945-9375-4b2d-990b-949a1c3f3317)

**Details**:

* To see the problem, consider joining three tables where T1 has 2 distinct values, T2 has 3, and T3 has 4. Assume all values only occur once per table, and all values in the smaller tables appear in the larger tables. The cartesian product of all three tables is 24 and the join result is 2, so the overall selectivity should be 1/12. However, the old method would sometimes give an overall selectivity of 1/16 because there are two edges in the join graph (T1-T3 and T2-T3) which both have a selectivity of 1/4.
* Added rigorous unit tests to test all possible permutations of three-table joins. Before the fix, some of these tests were failing. After the fix, these tests passed.
* Properly handling cases where we are adding a predicate that either _extends_ an existing connected component or _merges_ two existing connected components. Added unit tests for both of these cases as well.
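
The three-table example can be checked with a small standalone sketch. It is only an illustration of the rule stated in the summary, not the code from this commit: the function names `multi_equality_selectivity` and `edge_product_selectivity` are hypothetical, the per-edge selectivity of `1 / max(ndistinct)` is inferred from the 1/4 figures quoted above, and clamping n-distinct to at least 1 is an assumption made here to avoid dividing by zero.

```rust
/// Selectivity of a multi-equality join over ONE connected component of
/// columns, using the rule from the commit message: multiply 1/ndistinct
/// for the N-1 columns with the highest n-distinct values.
/// (Hypothetical helper, not part of optd.)
fn multi_equality_selectivity(ndistincts: &[u64]) -> f64 {
    // A component with fewer than two columns has no equality to apply.
    if ndistincts.len() < 2 {
        return 1.0;
    }
    let mut sorted = ndistincts.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending order
    sorted
        .iter()
        .take(sorted.len() - 1) // keep the N-1 largest values
        .map(|&nd| 1.0 / nd.max(1) as f64) // assumption: clamp to >= 1
        .product()
}

/// The flawed alternative described above: multiply per-edge selectivities,
/// where each edge (a, b) contributes 1 / max(a, b).
/// (Hypothetical helper, shown only to reproduce the 1/16 figure.)
fn edge_product_selectivity(edges: &[(u64, u64)]) -> f64 {
    edges
        .iter()
        .map(|&(a, b)| 1.0 / a.max(b).max(1) as f64)
        .product()
}
```

Calling `multi_equality_selectivity(&[2, 3, 4])` multiplies 1/4 by 1/3 and yields 1/12, matching the expected selectivity, while `edge_product_selectivity(&[(2, 4), (3, 4)])` models the T1-T3 and T2-T3 edges and yields the 1/16 underestimate.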
cmu-db · Apr 28, 2024 · 5958b3d
1 parent 2fa6c62
commit 5958b3d
Showing 10 changed files with 328 additions and 147 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/dev_scripts/which_queries_work.sh b/dev_scripts/which_queries_work.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 benchmark_name=$1
-USAGE="Usage: $0 [job|tpch]"
+USAGE="Usage: $0 [job|joblight|tpch]"
 
 if [ $# -ne 1 ]; then
     echo >&2 $USAGE

diff --git a/optd-datafusion-repr/Cargo.toml b/optd-datafusion-repr/Cargo.toml
@@ -26,3 +26,4 @@ serde = { version = "1.0", features = ["derive"] }
 serde_with = {version = "3.7.0", features = ["json"]}
 bincode = "1.3.3"
 union-find = { git = "https://github.com/Gun9niR/union-find-rs.git", rev = "794821514f7daefcbb8d5f38ef04e62fc18b5665" }
+test-case = "3.3"
diff --git a/optd-datafusion-repr/src/cost/base_cost.rs b/optd-datafusion-repr/src/cost/base_cost.rs
@@ -318,6 +318,7 @@ mod tests {
     pub const TABLE1_NAME: &str = "table1";
     pub const TABLE2_NAME: &str = "table2";
     pub const TABLE3_NAME: &str = "table3";
+    pub const TABLE4_NAME: &str = "table4";
 
     // one column is sufficient for all filter selectivity tests
     pub fn create_one_column_cost_model(per_column_stats: TestPerColumnStats) -> TestOptCostModel {
@@ -379,6 +380,49 @@ mod tests {
         )
     }
 
+    /// Create a cost model with four columns, one for each table. Each column has 100 values.
+    pub fn create_four_table_cost_model(
+        tbl1_per_column_stats: TestPerColumnStats,
+        tbl2_per_column_stats: TestPerColumnStats,
+        tbl3_per_column_stats: TestPerColumnStats,
+        tbl4_per_column_stats: TestPerColumnStats,
+    ) -> TestOptCostModel {
+        OptCostModel::new(
+            vec![
+                (
+                    String::from(TABLE1_NAME),
+                    TableStats::new(
+                        100,
+                        vec![(vec![0], tbl1_per_column_stats)].into_iter().collect(),
+                    ),
+                ),
+                (
+                    String::from(TABLE2_NAME),
+                    TableStats::new(
+                        100,
+                        vec![(vec![0], tbl2_per_column_stats)].into_iter().collect(),
+                    ),
+                ),
+                (
+                    String::from(TABLE3_NAME),
+                    TableStats::new(
+                        100,
+                        vec![(vec![0], tbl3_per_column_stats)].into_iter().collect(),
+                    ),
+                ),
+                (
+                    String::from(TABLE4_NAME),
+                    TableStats::new(
+                        100,
+                        vec![(vec![0], tbl4_per_column_stats)].into_iter().collect(),
+                    ),
+                ),
+            ]
+            .into_iter()
+            .collect(),
+        )
+    }
+
     /// We need custom row counts because some join algorithms rely on the row cnt
     pub fn create_two_table_cost_model_custom_row_cnts(
         tbl1_per_column_stats: TestPerColumnStats,