New Optimizer

(a place to jot down notes about the new plan representation & optimizer)

Query plan is modeled as a "program" using intermediate representation comprised by function calls and assignments. The logical type of each expression is some form of relation/collection/stream-of-rows.
For each relational expression, we can derive:
- Logical properties such as predicate, uniqueness, type (schema), functional dependencies between fields.
- Physical properties such as global partitioning, local ordering & grouping
Functions can be logical and/or physical (i.e., if they can be directly executed: join vs hash-join)
Possibly multiple optimizer implementations: heuristics/rewrites, cost-based, etc (TBD)
Cost-based optimizer
- Cascades-style
- Components:
  - Rules
    - Pattern + named arguments + required properties
    - Can produce multiple expressions
    - Types: logical transformation (e.g., push filter through project), implementation (join -> hash join), enforcement (sort before merge). The may not need to be explicitly identified as such.
  - Memo
    - Holds equivalence classes (name + list of expressions)
    - Memoizes optimization goals (i.e., best expression for a given equivalence class and physical requirements)
  - Cost

Memo

optimizer "state"
tracks known equivalences
memoizes optimization goals (expression + required properties)

Pattern

used to decide whether a rule can match a give expression tree shape
support capturing variables: filter(x:project(<any>))

Programmatic constraints

To allow matching based on attributes not expressible via nesting structure. For example:

    a:filter(b:project(<any>))      
       
    where, 
    - isDeterministic(a.condition)
    - b.projections.allMatch(p -> isDeterministic(p))

Deep vs shallow iteration of pattern leaf bindings

Some rules may need to match and be able to inspect arbitrary trees that cannot be expressed by a simple structural pattern.

Given pattern f1(x:<any>) and the following equivalence structure:

a := {f1(b)}
b := {g1(c), g2(d)}
c := {k1, k2}
e := {j1, j2}

shallow iteration produces:

f1(x), x = g1(c), c is opaque (trying to resolve it causes an error)
f1(x), x = g2(d), d is opaque (trying to resolve it causes an error)

deep iteration produces:

f1(x), x = g1(c), c = k1
f1(x), x = g1(c), c = k2
f1(x), x = g2(e), e = j1
f1(x), x = g2(e), e = j2

Optimization loop pseudo-code

start:
- break up expression into single-assignment expression
- add each assignment to the memo in a separate equivalence class
- optimize(root class, unbounded cost, no physical reqs)

optimize(equivalence class, cost bound, requirements):
- initialize exploration queue (rule + top operator in equivalence class)
- find potential match candidates and add them to queue
- while queue is not empty
    - enumerate bindings for each named argument (by iterating over all expressions in each equivalence class that's part of the match)
    - if binding + physical requirements can be handled by rule
        - apply rule
        - for each expression generated by rule
            - add to memo
            - if top function is physical
                - determine cost bound for children
                - for each input
                    - derive required physical properties & cost upper bound
                    - optimize corresponding equivalence class with required properties and upper bound
                    - update max bound for remaining children 
            - find additional potential matches and enqueue

Open issues

how to prioritize exploration candidates
memoize rule application to prevent re-exploration in case of repeated optimization calls (with different physical requirements)
we may need a way for a rule to short-circuit other exploration tasks for a given group (e.g., after constant folding)
we may need a way for a rule to prevent application of the same rule on expressions produced by the first application (e.g., join commutativity)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly