DevGuide Gotchas

Oliver Kennedy edited this page Oct 11, 2023 · 3 revisions

Gotchas

If you take nothing else away from the developer guide, read this section. It highlights the oddities of this codebase. These are things that will cause you grief if you try to make changes without realizing what's going on.

ScalikeJDBC

ScalikeJDBC is a Scala-native wrapper around JDBC connections (analogous to Python's SQLAlchemy). It makes extensive use of Scala implicits, which can be a little confusing if you're just getting started with Scala, so let's start with a quick primer on the part of implicits that you need to know.

Global state is annoying, but it can be more programmer-friendly than threading the same state through every function call. For example, say you have a JDBC connection. You need to keep passing that connection into every function that might potentially need it. That makes code hard to read and write, because you always have that extra argument hanging off the end. It can be easier to just dump the connection into a global variable and use it when needed. Global state, however, drives functional people mad (for good reasons).

Scala (kind of?) avoids the pitfalls of global state through something it calls implicit parameters. Take the following function:

def foo(implicit x: Bar) = x.baz

To call this function, you don't need to pass the x parameter explicitly. You would just write

println(s"${foo}")

Of course, this example is a little too simple... if you actually try the following, the compiler will yell at you:

def myFunction() = 
{
  println(s"${foo}")
}

What's happening under the hood here is that instead of manually passing a Bar object around, the compiler tries to automatically pick and plug in a Bar object that you happen to have lying around. Since you don't have one, it gets confused. Let's try the following:

def myFunction() = 
{
  val x = Bar(baz = "stuff")
  println(s"${foo}")
}

That still won't work, because the compiler won't look at just any old variable. You have to explicitly mark a variable as implicit for it to be considered, like so:

def myFunction() = 
{
  implicit val x = Bar(baz = "stuff")
  println(s"${foo}")
}

Again, looking under the hood, when you call foo (without parameters), the compiler will see the implicit x parameter of type Bar, try to find a Bar object in scope (possibly with a different name), and automatically plug that object in to the function call.
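Putting the pieces above together, here is a complete, self-contained sketch of the mechanism. `Bar`, `foo`, and `myFunction` are the hypothetical names from the snippets above, not anything from the codebase:

```scala
// A minimal, runnable sketch of implicit parameter resolution.
case class Bar(baz: String)

// `foo` declares an implicit parameter: callers may omit it.
def foo(implicit x: Bar): String = x.baz

def myFunction(): Unit = {
  // Marking `x` as implicit makes it a candidate for the compiler to
  // plug in wherever an implicit Bar is needed.
  implicit val x: Bar = Bar(baz = "stuff")
  // The compiler rewrites `foo` to `foo(x)` behind the scenes.
  println(s"${foo}")   // prints "stuff"
}

myFunction()
```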


ScalikeJDBC makes extensive use of implicit variables to keep track of session state. Specifically, in this codebase you will see a lot of functions with (implicit session: DBSession). These are functions that DO assume that you have an open session. Conversely, functions without this parameter assume that you DO NOT have an implicit session active.

To create a session, use one of the following two constructs:

DB.readOnly { implicit s => /* your code goes here */ }

or

DB.autoCommit { implicit s => /* your code goes here */ }

Gotcha: Nested session creation is verboten. If you find code mysteriously hanging or getting SQL timeouts, you are creating a nested session.
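To make the convention concrete, here's a sketch of how the two kinds of functions fit together. `lookupName`, `handleRequest`, and the `artifact` table are hypothetical stand-ins; only the ScalikeJDBC calls (`DB.readOnly`, the `sql` interpolator) are real API:

```scala
import scalikejdbc._

// Catalog-style code: assumes an open session (implicit DBSession).
def lookupName(id: Long)(implicit session: DBSession): Option[String] =
  sql"SELECT name FROM artifact WHERE id = $id"
    .map(_.string("name")).single.apply()

// API-style code: assumes NO session is open, so it creates one.
def handleRequest(id: Long): Option[String] =
  DB.readOnly { implicit s =>   // `s` satisfies lookupName's implicit parameter
    lookupName(id)
  }

// The gotcha in action: calling handleRequest from inside another
// DB.readOnly / DB.autoCommit block would open a second, nested
// session, and hang or time out.
```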

The following guidelines apply with respect to session creation:

  • info.vizierdb.api: All classes/methods assume that you DO NOT have a session.
  • info.vizierdb.artifacts: Classes/methods do not care whether you have a session.
  • info.vizierdb.catalog: Classes/methods assume that you DO have a session.
  • info.vizierdb.commands: Classes/methods assume that you DO NOT have a session.
  • info.vizierdb.filestore: Classes/methods do not care whether you have a session.
  • info.vizierdb.viztrails.MutableProject: Assumes that you DO NOT have a session.
  • info.vizierdb.viztrails.Provenance: Assumes that you DO have a session.
  • info.vizierdb.viztrails.Scheduler: Assumes that you DO have a session.

Continuation-style Programming in Database Code

In addition to Apache Spark, Vizier uses a local SQLite database to store metadata (e.g., project structure, cells, historical traces, etc...). As noted above, due to limitations in SQLite's JDBC driver, Vizier can only maintain one 'connection' to SQLite at a time. This is known to be a significant bottleneck.

The problem is exacerbated by the fact that connections are implemented transactionally. For example, consider the following pseudocode.

API Endpoint: 
  1. Open database connection
  2. Call Artifact.get(pid, aid).dataframe
    3. Vizier retrieves the artifact
    4. Artifact retrieves dataframe metadata
    5. Artifact code assembles dataframe [EXPENSIVE]
  6. Compute a thing with the dataframe [EXPENSIVE]
  7. Close the database connection

Although steps 4 and 5 do not depend on SQLite, both are run while the database connection is open and thus block everything else that could be running. There's one simple optimization that fixes part of the problem:

API Endpoint: 
  1. Open database connection
  2. Call Artifact.get(pid, aid).dataframe
    3. Vizier retrieves the artifact
    4. Artifact retrieves dataframe metadata
    5. Artifact code assembles dataframe [EXPENSIVE]
  6. Close the database connection
  7. Compute a thing with the dataframe [EXPENSIVE]

Moving the computation to the end keeps one expensive step from blocking the connection. However, since steps 4 and 5 are nested inside the same function, we cannot run them after closing the database. Instead, what actually happens is:

API Endpoint: 
  1. Open database connection
  2. Call Artifact.get(pid, aid).dataframe
    3. Vizier retrieves the artifact
    4. Artifact retrieves dataframe metadata
    5. Package the metadata up into a dataframe constructor
  6. Close the database connection
  7. Invoke the dataframe constructor
    8. Artifact code assembles dataframe [EXPENSIVE]
  9. Compute a thing with the dataframe [EXPENSIVE]

In short, we wrap up all of the state needed to build the dataframe and return that (this is commonly called a 'continuation'). The caller can then invoke the continuation after the connection is closed.

Scala provides a very convenient way to 'wrap up' all of the state needed to construct the dataframe.

val foo = { () => /* do some stuff to build the dataframe */ }

foo is now a 'lambda function', with all of the state needed to build the dataframe packaged up along with it. When we run the foo function, we actually do the work of building the dataframe. i.e., the actual dataframe is constructed by calling

foo()
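Here is the whole pattern in one runnable sketch. All of the names (`Metadata`, `Dataframe`, `dataframeContinuation`) are hypothetical simplifications of the real artifact code, and the session plumbing is reduced to comments:

```scala
// Simplified stand-ins for the real metadata and dataframe types.
case class Metadata(tableName: String)
case class Dataframe(rows: Seq[String])

// Builds a continuation: a zero-argument lambda that captures the
// metadata and defers the expensive construction work.
def dataframeContinuation(metadata: Metadata): () => Dataframe =
  { () =>
    // EXPENSIVE: runs only when the continuation is invoked,
    // after the database connection is already closed.
    Dataframe(rows = Seq(s"rows of ${metadata.tableName}"))
  }

// Inside the connection: do only the cheap metadata lookup.
val continuation: () => Dataframe =
  dataframeContinuation(Metadata(tableName = "my_table"))
// ... close the database connection here ...

// Outside the connection: do the expensive work.
val df = continuation()
```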