-
Notifications
You must be signed in to change notification settings - Fork 3k
New Optimizer
(a place to jot down notes about the new plan representation & optimizer)
-
Query plan is modeled as a "program" using intermediate representation comprised by function calls and assignments. The logical type of each expression is some form of relation/collection/stream-of-rows.
-
For each relational expression, we can derive:
- Logical properties such as predicate, uniqueness, type (schema), functional dependencies between fields.
- Physical properties such as global partitioning, local ordering & grouping
-
Functions can be logical and/or physical (i.e., if they can be directly executed: join vs hash-join)
-
Possibly multiple optimizer implementations: heuristics/rewrites, cost-based, etc (TBD)
-
Cost-based optimizer
- Cascades-style
- Components:
- Rules
- Pattern + named arguments + required properties
- Can produce multiple expressions
- Types: logical transformation (e.g., push filter through project), implementation (join -> hash join), enforcement (sort before merge). The may not need to be explicitly identified as such.
- Memo
- Holds equivalence classes (name + list of expressions)
- Memoizes optimization goals (i.e., best expression for a given equivalence class and physical requirements)
- Cost
- Rules
- optimizer "state"
- tracks known equivalences
- memoizes optimization goals (expression + required properties)
- used to decide whether a rule can match a give expression tree shape
- support capturing variables: filter(x:project(<any>))
To allow matching based on attributes not expressible via nesting structure. For example:
a:filter(b:project(<any>))
where,
- isDeterministic(a.condition)
- b.projections.allMatch(p -> isDeterministic(p))
Some rules may need to match and be able to inspect arbitrary trees that cannot be expressed by a simple structural pattern.
Given pattern f1(x:<any>)
and the following equivalence structure:
a := {f1(b)}
b := {g1(c), g2(d)}
c := {k1, k2}
e := {j1, j2}
shallow iteration produces:
f1(x), x = g1(c), c is opaque (trying to resolve it causes an error)
f1(x), x = g2(d), d is opaque (trying to resolve it causes an error)
deep iteration produces:
f1(x), x = g1(c), c = k1
f1(x), x = g1(c), c = k2
f1(x), x = g2(e), e = j1
f1(x), x = g2(e), e = j2
start:
- break up expression into single-assignment expression
- add each assignment to the memo in a separate equivalence class
- optimize(root class, unbounded cost, no physical reqs)
optimize(equivalence class, cost bound, requirements):
- initialize exploration queue (rule + top operator in equivalence class)
- find potential match candidates and add them to queue
- while queue is not empty
- enumerate bindings for each named argument (by iterating over all expressions in each equivalence class that's part of the match)
- if binding + physical requirements can be handled by rule
- apply rule
- for each expression generated by rule
- add to memo
- if top function is physical
- determine cost bound for children
- for each input
- derive required physical properties & cost upper bound
- optimize corresponding equivalence class with required properties and upper bound
- update max bound for remaining children
- find additional potential matches and enqueue
- how to prioritize exploration candidates
- memoize rule application to prevent re-exploration in case of repeated optimization calls (with different physical requirements)
- we may need a way for a rule to short-circuit other exploration tasks for a given group (e.g., after constant folding)
- we may need a way for a rule to prevent application of the same rule on expressions produced by the first application (e.g., join commutativity)