Migrating from Jupyter
Vizier is similar to Jupyter, but makes several design choices that require you to think a little differently when writing code. This page outlines what you need to watch out for, describes workarounds, and details what we're doing to make the transition smoother in the future. If you find a pain point especially troubling, go and upvote/discuss the related issue!
Unlike Jupyter, where you manually execute cells, Vizier cells run in order. If you go back and edit an earlier cell, all cells that depend on it will be re-executed.
- When you update a cell, all cells that depend on it will be re-executed immediately.
- You won't be able to access datasets/artifacts generated by later cells.
Reproducibility is a huge problem for Jupyter, in large part because cells can be executed in arbitrary orders. This makes Jupyter hard to approach for newcomers, and encourages lots of bad habits (like making notebooks that have to be executed in "the right" order). The main reason Jupyter does this is that dependency tracking in Python is hard. If you edit one cell, Jupyter would have to re-execute all cells below it, since any of them might be affected. Vizier, on the other hand, knows which cells depend on which cells. When you edit a cell, only dependent cells will be re-executed (although you can still re-execute cells manually if you like).
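The idea of re-running only dependent cells can be sketched in a few lines. This is a toy model, not Vizier's actual scheduler: the function name `cells_to_rerun` and the representation of each cell as a pair of "reads" and "writes" sets of artifact names are assumptions made for illustration.

```python
def cells_to_rerun(reads, writes, edited):
    """Return the indices of cells that must re-run after cell `edited` changes.

    reads[i] / writes[i] are the sets of artifact names that cell i reads / writes.
    Cells are assumed to run top to bottom (hypothetical model).
    """
    stale = set(writes[edited])  # names whose contents may have changed
    rerun = []
    for i in range(edited + 1, len(reads)):
        if reads[i] & stale:
            # The cell consumes something stale, so its outputs become stale too.
            rerun.append(i)
            stale |= writes[i]
        else:
            # The cell is unaffected; anything it writes is known to be fresh.
            stale -= writes[i]
    return rerun


# A five-cell notebook: load -> clean -> (unrelated) -> report -> uses "other"
reads = [set(), {"raw"}, set(), {"clean"}, {"other"}]
writes = [{"raw"}, {"clean"}, {"other"}, {"report"}, set()]

print(cells_to_rerun(reads, writes, 0))  # cells affected by editing cell 0
print(cells_to_rerun(reads, writes, 2))  # cells affected by editing cell 2
```

Editing cell 0 only touches the cells downstream of `raw`; the unrelated cells 2 and 4 are left alone. Without read/write information, the only safe choice would be to re-run every cell after the edited one.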
Unlike Jupyter, where you treat the entire notebook as one big script run in a single interpreter, Vizier runs each Python script cell with a fresh interpreter.
- You'll need to repeat `import` directives in each Python cell (we're cleaning this up).
- To pass a variable or dataset from one cell to the next, you'll need to use the `vizierdb` module. For example, to get a dataset created by an earlier cell, use `vizierdb.get_dataset(ds_name)`. To allow later cells to use your dataset, use `vizierdb.update_dataset` or `vizierdb.create_dataset`.
First and foremost, it makes dependency tracking feasible. Every time you call one of the `vizierdb` dataset accessors, Vizier records the dataset you got, created, or updated. This means Vizier knows which cells your Python cell depends on, can figure out which cells depend on your cell, and can avoid re-executing your cell when possible.
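To see why routing everything through accessors makes dependencies visible, here is a minimal sketch. `RecordingStore` and its `get`/`put` methods are hypothetical stand-ins that play the role of the `vizierdb` accessors; they are not the real `vizierdb` API, and each cell is modelled as a plain function for simplicity.

```python
class RecordingStore:
    """Hypothetical stand-in for an accessor module (not the real vizierdb)."""

    def __init__(self):
        self._data = {}      # artifact name -> value
        self.reads = set()   # names read by the cell currently running
        self.writes = set()  # names written by the cell currently running

    def get(self, name):
        self.reads.add(name)   # record the dependency before returning data
        return self._data[name]

    def put(self, name, rows):
        self.writes.add(name)  # record that this cell produces `name`
        self._data[name] = rows

    def run_cell(self, cell):
        """Run one cell and return the (reads, writes) it performed."""
        self.reads, self.writes = set(), set()
        cell(self)
        return self.reads, self.writes


def load_cell(db):
    db.put("raw", [3, 1, 2])


def clean_cell(db):
    db.put("clean", sorted(db.get("raw")))


store = RecordingStore()
load_trace = store.run_cell(load_cell)
clean_trace = store.run_cell(clean_cell)
print(load_trace)   # what load_cell read and wrote
print(clean_trace)  # what clean_cell read and wrote
```

Because `clean_cell` can only reach `raw` through `get`, the store learns that it depends on whichever cell wrote `raw`, with no need to analyse the Python source.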
Another benefit of going through accessors is that it makes it easier to translate artifacts between different languages. This is what powers Vizier's polyglot features: a dataset created by SQL can be accessed by Python, and when Python creates a dataset, R cells can access it without any problem.
In contrast to state-based notebooks like Jupyter, artifacts produced by cells (e.g., datasets) in Vizier are immutable: Once created, they exist in perpetuity as-is. Vizier creates the illusion of mutability by allowing names to be updated to point to new versions of the artifact.
When you update a dataset (e.g., in a SQL or Python cell), what you're actually doing under the hood is creating a new immutable dataset and updating the name to point to the new dataset. This means a few things. First, there's no need to manually cache datasets, since datasets (and other artifacts) are automatically cached by default. Vizier does have a "Clone dataset" cell type, but this cell finishes almost immediately, because it doesn't need to copy the dataset. It just creates a new name for the same dataset!
Second, it means that you can always go back to an earlier version of the dataset. You can always review a dataset as it was earlier in the notebook and compare it against a later version of the dataset.
We see this as a strictly good thing.
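The "immutable artifacts, mutable names" model can be illustrated with a small sketch. `ArtifactStore` and its methods are hypothetical and say nothing about how Vizier stores data internally; the point is only that an update appends a new artifact and repoints a name, and that a clone is a new name for an existing artifact.

```python
class ArtifactStore:
    """Toy model of immutable artifacts behind updatable names (hypothetical)."""

    def __init__(self):
        self._artifacts = []  # append-only; entries are never modified
        self._names = {}      # name -> list of artifact ids, oldest first

    def update(self, name, rows):
        """Store a new immutable artifact and point `name` at it."""
        artifact_id = len(self._artifacts)
        self._artifacts.append(tuple(rows))  # tuple, so it cannot be mutated
        self._names.setdefault(name, []).append(artifact_id)
        return artifact_id

    def clone(self, source, target):
        """Give the current artifact behind `source` a second name; no copy."""
        artifact_id = self._names[source][-1]
        self._names.setdefault(target, []).append(artifact_id)
        return artifact_id

    def get(self, name, version=-1):
        """Return the artifact behind `name` (latest version by default)."""
        return self._artifacts[self._names[name][version]]


store = ArtifactStore()
store.update("sales", [1, 2, 3])
store.update("sales", [1, 2, 3, 4])   # "updating" creates a second artifact
store.clone("sales", "sales_backup")  # constant time, regardless of size

print(store.get("sales", 0))          # the dataset as it was earlier
print(store.get("sales"))             # the current dataset
print(store.get("sales_backup") is store.get("sales"))
```

In this model the earlier version stays reachable after the update, and the clone shares the same underlying object rather than duplicating it.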