-
Notifications
You must be signed in to change notification settings - Fork 13
Concepts Algebra
Although Mimir includes a SQL parser, internally all queries are represented through an abstract syntax tree (AST) based on classical relational algebra. This AST includes two components: Primitive-valued expressions (mimir.algebra.Expression), and Table-valued expressions (mimir.algebra.Operator). Note that parsers exist for both ASTs (See the mimir.parsers package), but use Scala's parser combinators, which are known to be incredibly slow.
A primitive value is any "simple" value: a string, integer, float, date, or one of a handful of other special-purpose types. Mimir includes a simple AST for defining expressions that can be evaluated down to a primitive value. As with most ASTs in Scala, Mimir uses case-class syntax for this purpose, making it easy to use the match primitive and to define tree structures inline.
For example, Mimir would express the operation 1+x
as:
Arithmetic(Arith.Add, IntPrimitive(1), Var("x"))
Similarly, the computation SQRT((1 + x) * 20.0)
would be:
Function("SQRT", List(
Arithmetic(Arith.Mult,
Arithmetic(Arith.Add,
IntPrimitive(1), Var("x")
), FloatPrimitive(20.0)
)
))
Most types of expression elements are defined in Expression.scala, including the root type Expression that all primitive-valued expressions inherit from.
A special-case of the Expression type is defined in Primitives.scala, the PrimitiveValue type defines values that can not be evaluated further (aka boxed types in compiler parlance). Subclasses of PrimitiveValue include:
- StringPrimitive: A string.
- IntPrimitive: A long integer.
- FloatPrimitive: A double-length floating point number.
- BoolPrimitive: True or False
-
DatePrimitive: A (Year, Month, Date) corresponding to the string
'YYYY-MM-DD'
-
TimestampPrimitive: A (Year, Month, Date, Hour, Minute, Second, MSec) corresponding to the string
'YYYY-MM-DD HH:MM:SS.MSEC'
Another special-case of the Expression type is Var, which defines a variable (e.g., x in x+1
).
The Eval class is used to evaluate expressions within Mimir. A canonical instance of this class may be found as db.interpreter
. The core method is Eval.eval()
will evaluate an expression, which returns a PrimitiveValue
. It optionally takes a Map[String,PrimitiveValue]
scope parameter that defines primitives to 'plug in' for variables. A runtime error occurs if the scope parameter is not given, or if the scope does not include a variable that is referenced.
Eval has helper methods for typed evaluation:
evalBool(e: Expression): Boolean
evalFloat(e: Expression): Double
evalInt(e: Expression): Long
evalString(e: Expression): String
It is sometimes necessary to determine what type an expression evaluates to. The Typechecker class is used to determine the types of an expression (or a query operator). Use Typechecker.typeOf(expression)
to derive the type. If the expression contains variables, you must also define the types of these variables. If you pass an operator instance as the second argument, the Typechecker will use the inferred types of tuples in the output. You may also pass a Map or function from String => Type
.
Functions are defined in a FunctionRegistry object passed to eval. A canonical instance of this class may be found as db.functions
. New functions may be defined by calling FunctionRegistry.register
, which requires:
- The (upper-case) name of the function.
- A function (Seq[PrimitiveValue] => PrimitiveValue) for evaluating the function.
- A function (Seq[Type] => Type) for inferring the function's type.
Alternatively, you may define functions as native expressions with FunctionRegistry.registerExpression
:
- The (upper-case) name of the function.
- A list of argument names.
- A
String
orExpression
defining the expression that defines the function.
A third option is to define functions as a fold expression with FunctionRegistry.registerFold
:
- The (upper-case) name of the function.
- A
String
orExpression
referencing variables namedCURR
andNEXT
. This will be expanded as scala's fold() function.
- Any new types should be added as subclasses of
mimir.algebra.Type
(mimir/algebra/Type.scala). If you add a new type, you will likely also need to add a new PrimitiveValue. Ensure that all methods in Types.scala are properly populated (Scala will warn you if this is not the case). - Any new primitive values should be added as subclasses of
mimir.algebra.PritiveValue
(mimir/algebra/Expression.scala). If you add a new primitive value class, you should make sure to add a corresponding Type as well. Ensure that there are no warnings. - If you are adding new Expression objects, you will need to add them to the following. (Scala should warn you if you miss one of these)
- mimir.algebra.Eval
- mimir.optimizer.SimplifyExpressions
- mimir.exec.EvalInlined
- Even if not adding new expression objects, you may still need to define rules for arithmetic and comparisons (e.g., +, -, >, ≤, etc...). Logic for both Eval and Simplify expressions lives in the class
Eval.applyArith
andEval.applyCmp
. Then, review thecompile
method andcompileFor*
methods inmimir.exec.EvalInlined
. - You will then need to provide corresponding extensions to the Typechecker. Review
Typechecker.typeOf
and make sure that all arithmetic, comparisons, etc... are properly defined. If you are simply adding new arithmetic operations, you may only need to modifyTypechecker.escalate
. - Finally, make sure that it is possible to access your functionality from SQL-land. This may require extending the SQL parser.
If you need to add new logic to Expressions or Eval, 90% of the time you should just add a new Function that does what you want.
Sometimes it will be necessary to an expression type that needs special treatment by the typechecker and/or Eval. If this is the case, you should add the functionality in correctly, by adding new Expression subclasses, and adding these subclasses to match
statements throughout the rest of the codebase.
In very rare cases, there will be something that needs special treatment, but doesn't need to be handled by anything in Mimir other than the Typechecker and Evaluator. Note that this almost never happens, but for these rare cases, we have Proc objects. A Proc subclass must include rules for both evaluation and type inference. A good example of a Proc node in action is the VGTerm node in mimir.ctables.
It is frequently useful to be able to traverse an AST, interacting only with certain specific nodes in the tree. Expression objects are designed to interact with Scala's collection operations (map
, flatten
, filter
, etc...) to make this easier. There are three functions of use here:
-
node.children
returns a list of all of the AST node's children. -
node.rebuild(x)
returns a new, identical copy ofnode
, except with the list of children replaced byx
. Note that the order of the children must be identical to the order returned bynode.children
. -
node.recur(fn)
is a shorthand fornode.rebuild(node.children.map(fn))
For example, let's say I wanted to find all of the Var(iables) used in an expression. I could write
def getVars(expr:Expression): Set[String] =
expr match {
case Var(v) => Set(v)
case _ => expr.children.toSet.flatMap( getVars(_) )
}
Let's say I wanted to add 1 to all integer primitives in an expression. I would write:
def incrementInts(expr: Expression): Expression =
expr match {
case IntPrimitive(x) => IntPrimitive(x+1)
case _ => expr.rebuild( incrementInts(_) )
}
For your convenience, a few common transformations and pattern constructors are defined in the ExpressionUtils object. A few of the useful operations are:
-
makeAnd
,makeOr
: A slightly enhanced version of the constructors for conjunction and disjunction that attempts to simplify its return value (e.g., if either argument is a BoolPrimitive). An overloaded variant of these also exists that constructs the Con/Disjunction of a list of expressions. -
makeNot
: As above, attempt to simplify the returned expression if possible (e.g., apply DeMorgan's law to push the Not down). -
getConjuncts
,getDisjuncts
: Flatten out a set of nested Con/Disjunctions into a list. -
makeSum
: List version of the binary constructor -
makeCaseExpression
: Mimir uses an IfThenElse-style construct instead of SQL's case types. This handy method allows for quick conversions to a sequence of nested If/Then/Else operations if needed. -
foldConditionalsToCase
: Converts from Mimir's IfThenElse operations to SQL's case formatting -
getColumns
: Get all of theVar
terms in an Expression.
The second type of AST in Mimir defines tables/relations, or equivalently, expresses queries. As such it is based on Relational Algebra with a few extensions. Nodes in this tree are subclasses of Operator, defined in Operator.scala. Unlike primitive-valued expressions, queries draw on a much simpler set of operators
- Project(List[ProjectArg], input) remaps its input. Each ProjectArg defines one column of output, obtained by evaluating the provided expression on the input table.
- Select(condition, input) filters the input. The output includes only rows of the input table that satisfy the provided condition.
- Join(lhs, rhs) combines two tables horizontally. A small misnomer here that stuck: Join is actually a cartesian product of its inputs. The output includes every row of lhs paired with every row of rhs.
- Union(lhs, rhs) combines two tables vertically. The output includes every row from lhs and every row from rhs. Note that this is a Bag union (i.e., UNION ALL).
- Aggregate(List[AggregateArg], groupByColumns, input) aggregates its input. Rows of the input are grouped by the groupByColumns, and for each group we compute summary statistics given by the AggregateArgs.
- Table(name, schema, metadata) identifies a table or view in the database. To simplify type-checking, the Table must be defined with its schema (a list of Name/Type pairs).
Tables may also include metadata columns that extract implicit attributes of the table. For example, most databases define an implicit ROWID
attribute for each table. Every metadata element includes
- The name of the attribute as far as Mimir is concerned.
- An expression defining how the attribute is to be extracted on the backend database.
- The type of the attribute.
MimirSQL is based on JSQLParser (or rather UB's fork of it). Mimir actually needs to fork the grammar file (JSqlParserCC.jj), because we add a few things to the grammar (e.g., CREATE VIEW
, CREATE LENS
). However, JSQLParser is still a dependency, because we re-use many of the AST node classes from the original. New node types are defined in src/main/java/mimir/sql.
Mimir defines two classes to perform conversions between JSQLParser SQL ASTs and Mimir's internal Relational Algebra AST:
-
SQLToRA (aka
db.sql
) includes methods for converting JSQLParser queries and expressions to Mimir's internal representation. Generally, the method to use is one of the overloaded variants ofdb.sql.convert
. -
RAToSQL (aka
db.ra
) includes methods for converting Mimir's internal representation to a JSQLParser query (which you can then stringify). Generally, the method to use isdb.ra.convert
.
Just like Expression, instances of Operator include utility methods for recursive traversal
-
node.children
: Get all of the child Operator nodes. -
node.rebuild(x)
: Reconstruct the node using a new list of children. -
node.recur(fn)
: Reconstruct the node by remapping all of the child Operator nodes. -
node.expressions
: Get all of the child Expression nodes (e.g., the ProjectArg expressions in a Project). -
node.rebuildExpressions(x)
: Reconstruct the node by using a new list of children. -
node.recurExpressions(fn)
: Reconstruct the node by remapping all of the child Expression nodes. An optional variant of this includes a node-specific typechecker as the second argument of the function.
In general, code that manipulates operator trees (intentionally) creates sub-optimal tree structures. For example, a sequence of data transformations might be represented as a set of nested Project operators. Similarly, SQLToRA creates a large number of redundant projections to simplify bookkeeping.
Code that generates and manipulates operators is already going to be super complex. Don't try to make it more complex by optimizing it. Instead, Mimir includes a reasonably robust optimization pipeline (mimir.optimizer, which is invoked by Compiler) to remove redundancies. For example, InlineProjections will collapse multiple nested Project operators. In our experience, it's preferable to use simpler code to create a more complex operator tree and then write an optimization rule to simplify it later. A good example of this policy being ignored (by Oliver) is Percolator's percolateLite
method, which could be made much simpler for what it does.
As with the primitive-valued expressions, we have a set of convenience methods for common use cases called OperatorUtils. Most of these are constructor wrappers. It should be noted that several of these methods (e.g., applyFilter
) exist primarily for interoperability with Java, as Java code doesn't play nicely with Scala case constructors.