
wip: insert store and load statements
Co-Authored-By: Eric Kidd <[email protected]>
hanakslr and emk committed Dec 6, 2024
1 parent 86d86a3 commit fda888d
Showing 10 changed files with 324 additions and 41 deletions.
21 changes: 11 additions & 10 deletions ARCHITECTURE.md
@@ -7,20 +7,21 @@ Here's a high level overview of how `joinery` works.
Compilation proceeds in several phases:

1. [Tokenize](./src/tokenizer.rs).
   - Split the source into identifiers, punctuation, literals, etc. All tokens contain the original source code, location information, and surrounding whitespace.
2. [Parse into AST](./src/ast.rs).
   - We use the [`peg` crate](https://docs.rs/peg/). This is a [Parsing Expression Grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar) (PEG) parser. This is a bit _ad hoc_ as grammars go, but `peg` is a very nice library.
   - We make heavy use of `#[derive]` macros to implement the AST types.
3. [Check types](./src/infer/mod.rs).
   - The internal type system is defined in [`src/types.rs`](./src/types.rs). This is distinct from the simplistic "source level" type system parsed by [`src/ast.rs`](./src/ast.rs), and better suited to doing inference.
   - Name lookup is handled in [`src/scopes.rs`](./src/scopes.rs). Note that SQL requires several different kinds of scopes.
   - Type checking also needs to know about "memory" types (like Trino's UUID) versus "storage" types (like Trino's VARCHAR when using Hive, which doesn't allow storing UUID). And it needs to make sure that all appropriate `LoadExpression` and `StoreExpression` values get inserted.
4. [Apply transforms](./src/transforms/mod.rs).
   - A list of transforms is supplied by each database driver.
   - Transforms use Rust pattern-matching to match parts of the AST, and build new AST nodes using `sql_quote!`. Note that `sql_quote!` outputs _tokens_, so we need to call back into the parser. This is closely patterned after Rust programmatic macros using [`syn`](https://docs.rs/syn/) and [`quote`](https://docs.rs/quote/).
   - After applying a transform, we _may_ need to check types again to support later transforms. This works a bit like an LLVM analysis pass, where specific transforms may indicate that they require types, and the harness ensures that valid types are available.
   - The output of a transform must be structurally valid BigQuery SQL, though after a certain point it may no longer type check.
5. [Emit SQL](./src/ast.rs).
   - This consumes AST nodes and emits them as database-specific strings. We prefer to do as much work as possible using AST transforms, but sometimes we can't represent database-specific features in the AST.
6. [Run](./src/drivers/mod.rs).
- This is a slightly dodgy layer that knows how to run SQL. Mostly it's intended for running our test suites, not for production use. Some of the Rust database drivers have problems reading complex data types back into Rust.
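Taken together, the six phases chain into a simple pipeline. Here is a schematic sketch with stubbed-out phase bodies — the function names and types are illustrative stand-ins, not joinery's actual API (the real code lives in the files linked above):

```rust
// Schematic of the six-phase pipeline with stub bodies. All names here are
// illustrative, not joinery's real API.
type Tokens = Vec<String>;
type Ast = Vec<String>; // stand-in for the real AST

fn tokenize(src: &str) -> Result<Tokens, String> {
    // 1. Tokenize: the real tokenizer also records spans and whitespace.
    Ok(src.split_whitespace().map(str::to_owned).collect())
}

fn parse(tokens: &Tokens) -> Result<Ast, String> {
    // 2. Parse into AST: the real parser is a `peg` grammar.
    if tokens.is_empty() {
        Err("empty input".to_owned())
    } else {
        Ok(tokens.clone())
    }
}

fn check_types(_ast: &mut Ast) -> Result<(), String> {
    // 3. Check types; this phase also inserts load/store expressions.
    Ok(())
}

fn apply_transforms(_ast: &mut Ast) -> Result<(), String> {
    // 4. Apply driver-specific AST transforms, re-checking types as needed.
    Ok(())
}

fn emit(ast: &Ast) -> Result<String, String> {
    // 5. Emit database-specific SQL strings.
    Ok(ast.join(" "))
}

fn compile(source: &str) -> Result<String, String> {
    let tokens = tokenize(source)?;
    let mut ast = parse(&tokens)?;
    check_types(&mut ast)?;
    apply_transforms(&mut ast)?;
    emit(&ast) // 6. Run: the driver layer then executes the emitted SQL.
}

fn main() {
    println!("{}", compile("SELECT 1").unwrap()); // prints "SELECT 1"
}
```

Each phase returns `Result`, so a failure at any stage aborts the pipeline via `?` — the same shape the `cmd_run` and `cmd_transpile` entry points in this commit follow.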

19 changes: 19 additions & 0 deletions LOAD.md
@@ -0,0 +1,19 @@
All SQL must be run through joinery.
Every time we create a table, we need to insert its BigQuery `CREATE TABLE name (col...)` statement into Postgres.
That table looks like this:

```sql
-- `table_definitions` and the column types are hypothetical; the commit
-- only specifies the column names and the primary key.
CREATE TABLE table_definitions (
    bq_project VARCHAR NOT NULL,      -- The BigQuery project, which gets mapped to a Trino catalog somehow
    bq_dataset VARCHAR NOT NULL,      -- The BigQuery dataset, which is a Trino schema with the same name
    bq_table_name VARCHAR NOT NULL,   -- The BigQuery table, which is a Trino table with the same name
    create_table_sql TEXT NOT NULL,   -- Always a raw, typed `CREATE TABLE (my_col data_type, ...);` statement in BigQuery SQL
    PRIMARY KEY (bq_project, bq_dataset, bq_table_name)
);
```

When we run another SQL query tomorrow, we need to make sure that we have access to the BigQuery table names and their `CREATE TABLE` SQL.
We can load those table definitions into our scope, then run type inference normally.

So when we try to access `prod_gke.my_dataset.my_table`, we find a `CREATE TABLE` for it, parse it, and inject it into the scope.
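A minimal in-memory sketch of this lookup, keyed the same way as the table's primary key (the type and method names are hypothetical; the real store is the Postgres table described above):

```rust
use std::collections::HashMap;

/// Key matching the `(bq_project, bq_dataset, bq_table_name)` primary key.
type TableKey = (String, String, String);

/// Hypothetical in-memory stand-in for the Postgres table of saved
/// `CREATE TABLE` statements.
#[derive(Default)]
struct TableDefinitions {
    defs: HashMap<TableKey, String>,
}

impl TableDefinitions {
    /// Record the BigQuery `CREATE TABLE` SQL whenever a table is created.
    fn insert(&mut self, project: &str, dataset: &str, table: &str, create_sql: &str) {
        let key = (project.to_owned(), dataset.to_owned(), table.to_owned());
        self.defs.insert(key, create_sql.to_owned());
    }

    /// Look up a definition so it can be parsed and injected into the scope.
    fn lookup(&self, project: &str, dataset: &str, table: &str) -> Option<&str> {
        let key = (project.to_owned(), dataset.to_owned(), table.to_owned());
        self.defs.get(&key).map(String::as_str)
    }
}

fn main() {
    let mut defs = TableDefinitions::default();
    defs.insert(
        "prod_gke",
        "my_dataset",
        "my_table",
        "CREATE TABLE my_table (id INT64)",
    );
    // Type inference can now resolve `prod_gke.my_dataset.my_table`.
    println!("{:?}", defs.lookup("prod_gke", "my_dataset", "my_table"));
}
```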
51 changes: 40 additions & 11 deletions src/ast.rs
@@ -34,12 +34,14 @@ use crate::{
trino::{TrinoString, KEYWORDS as TRINO_KEYWORDS},
},
errors::{format_err, Error, Result},
infer::{InferTypes, InsertStoreExpressions as _},
known_files::{FileId, KnownFiles},
scope::{Scope, ScopeHandle},
tokenizer::{
tokenize_sql, EmptyFile, Ident, Keyword, Literal, LiteralValue, PseudoKeyword, Punct,
RawToken, Span, Spanned, ToTokens, Token, TokenStream, TokenWriter,
},
types::{StructType, TableType, ValueType},
types::{SimpleType, StructType, TableType, ValueType},
util::{is_c_ident, AnsiIdent},
};

@@ -621,6 +623,16 @@ pub struct SqlProgram {
pub statements: NodeVec<Statement>,
}

impl SqlProgram {
/// Call `infer_types` for the first time, using the root scope, and doing
/// the one-time task of inserting [`StoreExpression`] where needed.
pub fn infer_types_for_first_time(&mut self) -> Result<(Option<TableType>, ScopeHandle)> {
self.insert_store_expressions()?;
let scope = Scope::root();
self.infer_types(&scope)
}
}

/// A statement in our abstract syntax tree.
#[derive(Clone, Debug, Drive, DriveMut, Emit, EmitDefault, Spanned, ToTokens)]
pub enum Statement {
@@ -1646,7 +1658,7 @@ pub struct LoadExpression {
#[emit(skip)]
#[to_tokens(skip)]
#[drive(skip)]
memory_type: Option<ValueType>,
pub memory_type: Option<ValueType>,

/// Our underlying expression.
pub expression: Box<Expression>,
@@ -1655,6 +1667,11 @@ impl Emit for LoadExpression {
impl Emit for LoadExpression {
fn emit(&self, t: Target, f: &mut TokenWriter<'_>) -> ::std::io::Result<()> {
match t {
Target::BigQuery => {
f.write_token_start("%LOAD(")?;
self.expression.emit(t, f)?;
f.write_token_start(")")
}
Target::Trino(connector_type) => {
let bq_memory_type = self
.memory_type
@@ -1692,7 +1709,7 @@ pub struct StoreExpression {
#[emit(skip)]
#[to_tokens(skip)]
#[drive(skip)]
memory_type: Option<ValueType>,
pub memory_type: Option<ValueType>,

/// Our underlying expression.
pub expression: Box<Expression>,
@@ -1701,21 +1718,33 @@ impl Emit for StoreExpression {
impl Emit for StoreExpression {
fn emit(&self, t: Target, f: &mut TokenWriter<'_>) -> ::std::io::Result<()> {
match t {
Target::BigQuery => {
f.write_token_start("%STORE(")?;
self.expression.emit(t, f)?;
f.write_token_start(")")
}
Target::Trino(connector_type) => {
let bq_memory_type = self
.memory_type
.as_ref()
.expect("memory_type should have been filled in by type inference");
let trino_memory_type =
TrinoDataType::try_from(bq_memory_type).map_err(io::Error::other)?;
let transform = connector_type.storage_transform_for(&trino_memory_type);
let (prefix, suffix) = transform.store_prefix_and_suffix();

f.write_token_start(&prefix)?;
self.expression.emit(t, f)?;
f.write_token_start(&suffix)
// If our bq_memory_type is NULL, we don't need to do any transforms because
// NULL is NULL in both storage and memory types and dbcrossbar_trino doesn't
// support NULL as a memory type.
if let ValueType::Simple(SimpleType::Null) = bq_memory_type {
self.expression.emit(t, f)
} else {
let trino_memory_type =
TrinoDataType::try_from(bq_memory_type).map_err(io::Error::other)?;
let transform = connector_type.storage_transform_for(&trino_memory_type);
let (prefix, suffix) = transform.store_prefix_and_suffix();

f.write_token_start(&prefix)?;
self.expression.emit(t, f)?;
f.write_token_start(&suffix)
}
}
_ => self.emit_default(t, f),
}
}
}
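The store path above reduces to "NULL passes through; everything else gets the connector's prefix and suffix wrapped around it." A simplified, self-contained illustration — `store_prefix_and_suffix` here is a toy stand-in for dbcrossbar_trino's storage transforms, and the UUID-to-VARCHAR cast is an assumed example, not a specific connector's behavior:

```rust
/// Toy memory types; the real code uses joinery's `ValueType` and
/// dbcrossbar_trino's `TrinoDataType`.
#[derive(Clone, Copy, PartialEq)]
enum MemoryType {
    Null,
    Uuid,
    Varchar,
}

/// Stand-in for `connector_type.storage_transform_for(...)`: pretend the
/// connector stores UUIDs as VARCHAR, so storing one requires a cast.
fn store_prefix_and_suffix(ty: MemoryType) -> (&'static str, &'static str) {
    match ty {
        MemoryType::Uuid => ("CAST(", " AS VARCHAR)"),
        _ => ("", ""),
    }
}

/// Emit a store expression: NULL is emitted untouched, since NULL is the
/// same in both memory and storage form; everything else is wrapped.
fn emit_store(expr: &str, ty: MemoryType) -> String {
    if ty == MemoryType::Null {
        expr.to_owned()
    } else {
        let (prefix, suffix) = store_prefix_and_suffix(ty);
        format!("{prefix}{expr}{suffix}")
    }
}

fn main() {
    println!("{}", emit_store("my_uuid", MemoryType::Uuid)); // prints "CAST(my_uuid AS VARCHAR)"
    println!("{}", emit_store("NULL", MemoryType::Null)); // prints "NULL"
}
```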
3 changes: 1 addition & 2 deletions src/cmd/run.rs
@@ -34,8 +34,7 @@ pub async fn cmd_run(files: &mut KnownFiles, opt: &RunOpt) -> Result<()> {
let mut ast = parse_sql(files, file_id)?;

// Run the type checker, but do not fail on errors.
let scope = Scope::root();
if let Err(err) = ast.infer_types(&scope) {
if let Err(err) = ast.infer_types_for_first_time() {
err.emit(files);
eprintln!("\nType checking failed. Manual fixes will probably be required!");
}
3 changes: 1 addition & 2 deletions src/cmd/sql_test.rs
@@ -148,8 +148,7 @@ async fn run_test(
let mut ast = parse_sql(files, file_id)?;

// Type check the AST.
let scope = Scope::root();
ast.infer_types(&scope)?;
ast.infer_types_for_first_time()?;

//eprintln!("SQLite3: {}", ast.emit_to_string(Target::SQLite3));
let output_tables = find_output_tables(&ast)?;
5 changes: 1 addition & 4 deletions src/cmd/transpile.rs
@@ -9,9 +9,7 @@ use crate::{
ast::{parse_sql, Emit},
drivers,
errors::Result,
infer::InferTypes,
known_files::KnownFiles,
scope::Scope,
};

/// Run SQL tests from a directory.
@@ -38,8 +36,7 @@ pub async fn cmd_transpile(files: &mut KnownFiles, opt: &TranspileOpt) -> Result
let mut ast = parse_sql(files, file_id)?;

// Run the type checker, but do not fail on errors.
let scope = Scope::root();
if let Err(err) = ast.infer_types(&scope) {
if let Err(err) = ast.infer_types_for_first_time() {
err.emit(files);
eprintln!("\nType checking failed. Manual fixes will probably be required!");
}
2 changes: 1 addition & 1 deletion src/drivers/mod.rs
@@ -9,7 +9,7 @@ use tracing::{debug, trace};
use crate::{
ast::{self, Emit, Target},
errors::{format_err, Error, Result},
infer::InferTypes,
infer::InferTypes as _,
scope::Scope,
transforms::{Transform, TransformExtra},
};
156 changes: 156 additions & 0 deletions src/infer/insert_store_expressions.rs
@@ -0,0 +1,156 @@
//! A preliminary, once-only type inference step where we patch up the AST
//! to include [`ast::StoreExpression`].
use crate::{
ast::{self},
errors::Result,
};

use super::nyi;

/// Walk an AST tree, inserting [`ast::StoreExpression`] everywhere we need it.
///
/// This is called only once, before the first time we run type inference.
pub trait InsertStoreExpressions {
/// Find all the places that need a [`ast::StoreExpression`] and insert them.
fn insert_store_expressions(&mut self) -> Result<()>;
}

impl InsertStoreExpressions for ast::SqlProgram {
fn insert_store_expressions(&mut self) -> Result<()> {
self.statements.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::Statement {
fn insert_store_expressions(&mut self) -> Result<()> {
match self {
ast::Statement::Query(stmt) => stmt.insert_store_expressions(),
ast::Statement::DeleteFrom(_) => Ok(()),
ast::Statement::InsertInto(stmt) => stmt.insert_store_expressions(),
ast::Statement::CreateTable(stmt) => stmt.insert_store_expressions(),
// This is a problem for another day and another poor developer. Do
// views output values in memory format or backend-specific storage
// format?
ast::Statement::CreateView(_) => Err(nyi(self, "CREATE VIEW storage expressions")),
ast::Statement::DropTable(_) => Ok(()),
ast::Statement::DropView(_) => Ok(()),
}
}
}
impl InsertStoreExpressions for ast::QueryStatement {
fn insert_store_expressions(&mut self) -> Result<()> {
self.query_expression.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::QueryExpression {
fn insert_store_expressions(&mut self) -> Result<()> {
self.query.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::QueryExpressionQuery {
fn insert_store_expressions(&mut self) -> Result<()> {
match self {
ast::QueryExpressionQuery::Select(expr) => expr.insert_store_expressions(),
ast::QueryExpressionQuery::Nested { query, .. } => query.insert_store_expressions(),
ast::QueryExpressionQuery::SetOperation { left, right, .. } => {
left.insert_store_expressions()?;
right.insert_store_expressions()
}
}
}
}

impl InsertStoreExpressions for ast::SelectExpression {
fn insert_store_expressions(&mut self) -> Result<()> {
self.select_list.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::SelectList {
fn insert_store_expressions(&mut self) -> Result<()> {
self.items.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::SelectListItem {
fn insert_store_expressions(&mut self) -> Result<()> {
match self {
ast::SelectListItem::Expression { expression, .. } => {
expression.insert_store_expressions()
}
ast::SelectListItem::Wildcard { .. } => {
Err(nyi(self, "InsertStoreExpressions(Wildcard)"))
}
ast::SelectListItem::TableNameWildcard { .. } => {
Err(nyi(self, "InsertStoreExpressions(TableNameWildcard)"))
}
ast::SelectListItem::ExpressionWildcard { .. } => {
Err(nyi(self, "InsertStoreExpressions(ExpressionWildcard)"))
}
}
}
}

impl InsertStoreExpressions for ast::Expression {
/// Wrap ourselves in a `StoreExpression`. Not recursive!
fn insert_store_expressions(&mut self) -> Result<()> {
let store_expr = ast::Expression::Store(ast::StoreExpression {
memory_type: None,
expression: Box::new(self.clone()),
});
*self = store_expr;
Ok(())
}
}

impl InsertStoreExpressions for ast::InsertIntoStatement {
fn insert_store_expressions(&mut self) -> Result<()> {
self.inserted_data.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::InsertedData {
fn insert_store_expressions(&mut self) -> Result<()> {
match self {
ast::InsertedData::Values { rows, .. } => rows.insert_store_expressions(),
ast::InsertedData::Select { query, .. } => query.insert_store_expressions(),
}
}
}

impl InsertStoreExpressions for ast::ValuesRow {
fn insert_store_expressions(&mut self) -> Result<()> {
self.expressions.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::CreateTableStatement {
fn insert_store_expressions(&mut self) -> Result<()> {
self.definition.insert_store_expressions()
}
}

impl InsertStoreExpressions for ast::CreateTableDefinition {
fn insert_store_expressions(&mut self) -> Result<()> {
match self {
// We don't need to do anything here because we aren't actually
// storing anything. It is a plain column definition.
ast::CreateTableDefinition::Columns { .. } => Ok(()),
ast::CreateTableDefinition::As {
query_statement, ..
} => query_statement.insert_store_expressions(),
}
}
}

impl<T: InsertStoreExpressions + ast::Node> InsertStoreExpressions for ast::NodeVec<T> {
fn insert_store_expressions(&mut self) -> Result<()> {
for item in self.node_iter_mut() {
item.insert_store_expressions()?;
}
Ok(())
}
}
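The non-recursive wrap performed by the `ast::Expression` impl above is a common Rust replace-in-place pattern. A self-contained version with a toy `Expr` type (not joinery's AST):

```rust
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    /// Marker saying this value must be converted to storage format.
    Store(Box<Expr>),
}

/// Replace `*expr` with `Store(old_expr)` in place. Like the commit's impl,
/// this clones the old node; `std::mem::replace` with a placeholder value
/// would avoid the clone at the cost of some noise.
fn insert_store(expr: &mut Expr) {
    let wrapped = Expr::Store(Box::new(expr.clone()));
    *expr = wrapped;
}

fn main() {
    let mut e = Expr::Column("a".to_owned());
    insert_store(&mut e);
    println!("{e:?}"); // prints `Store(Column("a"))`
}
```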
