# DevGuide Cells

Note: This page describes the v2.0 API.
A cell command defines the logic for one type of 'module' in a Vizier workflow. This includes everything from language-specific cells to data-visualization and file-loading cells. Every cell type has a corresponding command. See the info.vizierdb.commands package for examples.
Cells are grouped into a three-tiered hierarchy:
- category: A collection of packages that are grouped together for display purposes
- package: A collection of related commands.
- command: A specific operation in a cell.
A specific command is identified by a two-part identifier: `package.command`
Commands are managed by info.vizierdb.commands.Commands. Since Vizier does not yet support plugin modules, all commands are hard-coded into this package. To make a command visible in Vizier's new-cell interface, add your command object to one of the existing calls to `register` in Commands, or add a new call to `register` to create a new package. Entries take the form:

`"uniqueCommandId" -> info.vizierdb.commands.command.object.classpath.here`
A command is defined by an object that mixes in the info.vizierdb.commands.Command trait. This trait requires implementations of the following methods:
- `name`: The user-facing name of the command.
- `parameters`: A sequence of info.vizierdb.commands.Parameters that describes the arguments to the command. The parameter list is used to generate a default interface for the cell in the UI, and to serialize/deserialize the actual arguments.
- `format`: Generates a short string representation of the command, displayed in the notebook.
- `title`: Generates an even shorter string representation of the command, displayed in the table of contents.
- `process`: The 'main' method. This method is invoked when the cell is run.
- `predictProvenance`: Currently unused. Return None.
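To illustrate the overall shape, here is a minimal, self-contained sketch. The `Parameter`, `Arguments`, and `Command` definitions below are simplified stand-ins for the real `info.vizierdb.commands` types (whose actual signatures are richer — e.g., the real `process` also receives an ExecutionContext); `GreetCommand` and its parameter are hypothetical.

```scala
// Simplified stand-ins for the real info.vizierdb.commands types.
// The actual trait has richer signatures; this only illustrates the shape.
case class Parameter(id: String, name: String, required: Boolean = true)
case class Arguments(values: Map[String, Any]) {
  def get[T](id: String): T = values(id).asInstanceOf[T]
}

trait Command {
  def name: String
  def parameters: Seq[Parameter]
  def format(args: Arguments): String
  def title(args: Arguments): String
  def process(args: Arguments): Unit  // the real method also receives an ExecutionContext
}

// A hypothetical command: note the parameter ID defined as a constant.
object GreetCommand extends Command {
  val PARAM_WHO = "who"
  def name = "Greet"
  def parameters = Seq(Parameter(PARAM_WHO, "Who to greet"))
  def format(args: Arguments) = s"GREET ${args.get[String](PARAM_WHO)}"
  def title(args: Arguments) = "Greet"
  def process(args: Arguments): Unit =
    println(s"Hello, ${args.get[String](PARAM_WHO)}!")
}
```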
See info.vizierdb.commands.Parameter for specifics. Common parameter options include:
- `id`: A unique identifier for the parameter; this is how the parameter is retrieved. It is customary to define IDs as constants in the Command object.
- `name`: A human-readable name for the parameter, displayed to the user in the default-generated user interface.
- `required` (default `true`): True if the user should be required to enter a value for the parameter (note: this option is ignored if a default value is given).
- `hidden` (default `false`): True if the parameter should be stored in the background. This is primarily useful for caching state across invocations of a command.
- `default`: Specifies a default value for the parameter (note: not available on all parameter types).
Special parameter types include:
- BooleanParameter: Simple yes/no. Implemented by default as a checkbox.
- DecimalParameter: Any floating-point number. Implemented by default as a numeric Input box.
- IntParameter: Any integer. Implemented by default as a numeric Input box.
- StringParameter: Any (short) string (e.g., the name of an output dataset). Implemented by default as an Input box.
- DataTypeParameter: A data type. Implemented by default as a drop-down menu with standard options (e.g., String, Integer, etc.).
- CodeParameter: Source code. The `language` option must be one of `python`, `scala`, `sql`, or `markdown`. Implemented by default as a CodeMirror editor.
- EnumerableParameter: A list of possible options. Each option may be provided with human-readable text and a string 'value' for the backend. Implemented by default as a drop-down menu.
- DatasetParameter: A dataset. Implemented by default as a drop-down menu with a list of all Datasets available at that point in the notebook.
- ColIdParameter: A column in a dataset (note: requires an accompanying DatasetParameter). Implemented by default as a drop-down menu with a list of columns in the most recently specified dataset.
- ArtifactParameter: Any artifact. Implemented by default as a drop-down menu with a list of all Artifacts available at that point in the notebook. (The `artifactType` parameter may be used to specify a class of artifacts to be selected.)
- FileParameter: A user-uploaded file. Implemented by default as a file drop area.
- ListParameter: A table of parameter values. Each `Parameter` provided in the `components` option forms one column of the table. Users may define as many rows as desired. Implemented by default as a 2-D grid with [+] and [-] annotations to allow insertion and deletion of rows.
- RecordParameter: Like `ListParameter`, but limited to a single row. Useful for grouping related parameters together. Implemented by default as an option group.
- EnvironmentParameter: An execution environment. `language` must be one of `python`. Implemented by default as a drop-down menu.
- RowIdParameter: Deprecated.
- CachedStateParameter: A workaround allowing cells to preserve "cached" state between executions. Generally, you should not use this.
`format`, `title`, and `process` receive an info.vizierdb.commands.Arguments object that provides access to the values of the parameters specified by `parameters`. Retrieve the parameter value based on the Parameter type as follows:
- BooleanParameter: `args.get[Boolean](parameterId)` (returns true/false according to the parameter)
- DecimalParameter: `args.get[Double](parameterId)` (returns the double value of the parameter)
- IntParameter: `args.get[Int](parameterId)` (returns the integer value of the parameter)
- StringParameter: `args.get[String](parameterId)` (returns the string value of the parameter)
- DataTypeParameter: `args.get[DataType](parameterId)` (returns an Apache Spark DataType)
- CodeParameter: `args.get[String](parameterId)` (returns the code as a String -- note, this may be quite large)
- EnumerableParameter: `args.get[String](parameterId)` (returns the `value` field of the selected option)
- DatasetParameter: `args.get[String](parameterId)` (returns the name of the dataset artifact; use the ExecutionContext to get the artifact)
- ColIdParameter: `args.get[Int](parameterId)` (returns the integer index of the column)
- ArtifactParameter: `args.get[String](parameterId)` (returns the name of the artifact; use the ExecutionContext to get the artifact)
- FileParameter: `args.get[FileArgument](parameterId)` (returns a FileArgument for the uploaded/provided file)
- ListParameter: `args.getList(parameterId)` (returns a Sequence of Arguments, one per record)
- RecordParameter: `args.getRecord(parameterId)` (returns an Arguments)

An `Opt` version of each of the above methods (e.g., `args.getOpt`) exists; it returns an `Option` that is None if the argument was not provided (e.g., if `required = false` for the corresponding Parameter).
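The access pattern can be sketched with a simplified, self-contained stand-in for the real Arguments class (the real class lives in info.vizierdb.commands; the parameter IDs below are hypothetical):

```scala
// Simplified stand-in for info.vizierdb.commands.Arguments;
// the real class is richer, but the access pattern is the same.
case class Arguments(values: Map[String, Any]) {
  def get[T](id: String): T = values(id).asInstanceOf[T]
  def getOpt[T](id: String): Option[T] = values.get(id).map(_.asInstanceOf[T])
}

val args = Arguments(Map("dataset" -> "my_table", "sample_rate" -> 0.1))

val dataset: String   = args.get[String]("dataset")      // required parameter
val rate: Double      = args.get[Double]("sample_rate")
val seed: Option[Int] = args.getOpt[Int]("seed")         // required = false: may be absent
```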
Artifacts are used to pass state between cells. Anything that needs to go from one cell to the next must be wrapped in an Artifact.
An artifact is, in general, defined by three items:
- `(id, projectId)`: A unique artifact identifier, coupled with the id of the project that created it.
- `t`: An Artifact Type; see below.
- `data`: An opaque byte array.
An artifact is immutable: once created, it cannot be modified or destroyed (though see the discussion of properties below). Artifact versions are identified by the identifier of the project that created them and a globally unique artifact ID. Additionally, the execution context (see below) maintains a mapping from friendly 'names' to artifact identifiers.
Vizier assigns artifact types to streamline interoperability between languages and to make it possible to display artifacts inline. Supported types include:
- `DATASET`: An Apache Spark dataframe. The `data` parameter must be a json-encoded Dataset object; see ExecutionContext below for helper functions to create these.
- `FUNCTION`: A snippet of code defining a function. The `mimeType` parameter defines the type of function, and must be one of `application/python`.
- `BLOB`: An opaque blob of medium-sized data. The `mimeType` parameter may be used to distinguish between different datatypes, and may be anything.
- `FILE`: A file stored in the filesystem (preferred for large data). The `mimeType` parameter is used to store the type of the file.
- `PARAMETER`: A small, configurable value (currently used to pass Strings, Integers, etc. between python cells). The `data` parameter must be a json-encoded ParameterArtifact.
- `VEGALITE`: A Vega-Lite chart. The `data` parameter must be a json object conforming to the Vega-Lite spec (e.g., see vizier-vega).
The Artifact class defines several useful helper methods (see the ScalaDoc for the full description):
- `artifact.file`: A java File object holding the path to the file for this artifact. This method is usually only helpful for `FILE`-typed artifacts; however, any artifact may be defined with an associated file if on-disk storage is required.
- `artifact.parameter` (`PARAMETER` only): The ParameterArtifact value of the artifact.
- `artifact.data`: The raw data bytes of the artifact (note: if the artifact stores data in a file, you must use `file` instead).
- `artifact.string`: The raw data of the artifact as a string (shortcut for `new String(artifact.data)`).
- `artifact.json`: The json value of the artifact (shortcut for `Json.parse(artifact.string)`). If the `data` field is empty, an empty object is returned.
- `artifact.dataframe` (`DATASET` only): Obtain the Spark dataframe for the artifact. Note: you must have an active database connection to call this method (see DevGuide-Gotchas).
- `artifact.datasetSchema` (`DATASET` only): Obtain the schema of the specified dataset (as a sequence of Spark StructFields).
- `artifact.datasetPropertyOpt(name)` (`DATASET` only): Obtain the specified dataset property, or None if the property is not set.
- `artifact.datasetProperty(name)` (`DATASET` only): Obtain the specified dataset property.
- `artifact.updateDatasetProperty(name, value)` (`DATASET` only): Update a dataset property.
- `artifact.filePropertyOpt(name)` (`FILE` only): Obtain the specified file property, or None if the property is not set.
- `artifact.fileProperty(name)` (`FILE` only): Obtain the specified file property.
- `artifact.fileDatasetProperty(name, value)` (`FILE` only): Update a file property.
DATASET and FILE artifacts may be associated with properties, allowing assorted metadata to be attached to the dataset or file. Although these property sets are mutable, they are intended as a way to enact lazy computation: expensive computations over the data are delayed until their results are actually needed. For example, a common use is to store profiler metadata, like the number of rows in a dataset. In short, property fields should be treated as append-only.
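The lazy-computation pattern described above can be sketched with a plain mutable map standing in for `artifact.datasetPropertyOpt` / `artifact.updateDatasetProperty` (the property name `"count"` is hypothetical):

```scala
import scala.collection.mutable

// Stand-in for a dataset's property set: compute an expensive value once,
// then serve the cached copy on subsequent requests.
val properties = mutable.Map[String, Long]()

def rowCount(computeExpensively: () => Long): Long =
  properties.getOrElseUpdate("count", computeExpensively())

val first  = rowCount(() => 1000000L)                            // computed once...
val second = rowCount(() => sys.error("should not recompute"))   // ...then cached
```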
`process` also receives an info.vizierdb.commands.ExecutionContext that describes the notebook state at the point of the cell. The context can be used to retrieve artifacts, create artifacts, or output messages.
With respect to artifacts, an ExecutionContext stores a mapping from user-friendly names to specific artifact versions (as noted above, an artifact version is an immutable object identified by a project and artifact id pair). When a cell is run, the execution context it receives is the accumulation of all artifacts created by preceding cells. To emphasize this point: unlike Jupyter, the state a cell sees is based on the order in which cells appear in the notebook and not the order in which cells are executed.
In addition to artifacts, an ExecutionContext may also be used to send messages to the user. These are displayed below the cell in the notebook, but are not visible to any subsequent cells.
For full documentation, see the ScalaDoc for the class.
- `context.artifact(name)`: Obtain the Artifact with the specified `name`.
- `context.dataframe(name)`: Obtain a Spark DataFrame for the artifact with the specified `name`; triggers an error if the artifact does not exist or is not a DATASET (equivalent to calling `context.artifact(name).dataframe`, but also creates a database session).
- `context.parameter[T](name)`: Obtain the value of a parameter artifact, assuming that the parameter has a type that decodes to `T`. Throws an error if `name` does not exist, is not a parameter artifact, or decodes to a type other than `T`.
- `context.file(name){ source => ... }`: Read the contents of the FILE or BLOB artifact with the specified `name`. The provided block takes a scala Source object; the method returns the value returned by the block. (e.g., `context.file("foo"){ source => Json.parse(source.mkString) }` would return the json contents of the file.)
- `context.message(message)`: Display the provided message formatted in a fixed-width font.
- `context.error(message)`: Display the provided message and flag the cell execution as having triggered an error (the cell will be highlighted, and subsequent cells that depend on it will not be executed).
- `context.displayHTML(html[, javascript[, javascriptDependencies]][, cssDependencies])`: Display the provided `html`, rendered as HTML. See below for a discussion of the remaining parameters.
- `context.vega(chart, identifier)`: Output a Vega chart with the specified `identifier` (TODO: this parameter should be called `name` for consistency) as both a message and an artifact. The optional `withMessage` or `withArtifact` parameters can be set to false to hide the chart from either.
- `context.vegalite(chart, identifier)`: (Deprecated) Output a Vega-Lite chart with the specified `identifier` (TODO: this parameter should be called `name` for consistency) as both a message and an artifact. The optional `withMessage` or `withArtifact` parameters can be set to false to hide the chart from either.
- `context.displayDataset(name)`: Display the dataset with the provided artifact `name`.
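The block-based `context.file` pattern can be sketched with a self-contained stand-in that reads from an in-memory string instead of looking up a FILE/BLOB artifact by name:

```scala
import scala.io.Source

// Stand-in mirroring the shape of context.file(name){ source => ... }:
// the block receives a Source, and the method returns whatever the block returns.
def file[T](contents: String)(block: Source => T): T = {
  val source = Source.fromString(contents)
  try { block(source) } finally { source.close() }
}

// Usage: count the lines of a (hypothetical) artifact's contents.
val lineCount = file("alpha\nbeta\ngamma") { source => source.getLines().size }
```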
- `context.output(name, t, data)`: Allocate and output a new generic artifact of the specified type. This method is only encouraged for BLOB artifacts; use one of the helper methods below if one exists.
- `context.setParameter(name, value, dataType)`: Output a new PARAMETER artifact with the specified `name` and `value`; `value` is not type-checked, but must be of a type that Spark will accept for `dataType`.
- `context.outputDataset(name, constructor)`: Output a new DATASET artifact with the specified `name`. `constructor` must be a subclass of DataFrameConstructor; see the info.vizierdb.spark package for existing instances like the SQL ViewConstructor.
- `context.outputFile(name, mimeType) { stream => ... }`: Output a new FILE artifact with the specified `name` and `mimeType`. The provided block should write the file's contents to the provided java OutputStream.
- `context.outputFilePlaceholder(name, mimeType)`: Output a new FILE artifact with the specified `name` and `mimeType`. This method allocates an artifact placeholder, but does not actually create a file; the caller is responsible for creating the file by writing to the path identified by the artifact's `file` method. (`outputFile` is preferred, as it automatically closes the file.)
- `context.outputDatasetWithFile(name, gen)`: Like `outputDataset`, but `gen` is a function that takes an artifact and returns a DataFrameConstructor. This is helpful when the dataframe needs to read from a file, since otherwise there is a chicken-and-egg problem: the DataFrameConstructor needs to know the ID of the artifact that stores it.
- `context.createPipeline(input, [output])(stage, [stage, [...]])`: Output a new DATASET artifact from the result of applying a Spark Pipeline to an input dataset. The pipeline is trained on the specified `input` dataset, and the `output` dataset is defined by applying the pipeline to the `input` dataset. If `output` is omitted, the `input` dataset is replaced.
- `context.outputDataframe(name, dataframe)`: Output a new DATASET artifact consisting of the contents of the specified dataframe. Note that any provenance for the output data will be lost, so `outputDataset` is generally preferred.
- `context.delete(name)`: Delete the specified artifact from the context. The artifact is not actually deleted; rather, this makes later cells unable to access it.
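The shape of the `outputFile` call can be sketched with a self-contained stand-in: the real method allocates a FILE artifact and hands your block the artifact's OutputStream, while this version just captures the bytes in memory (the `name` and `mimeType` values below are hypothetical):

```scala
import java.io.{ByteArrayOutputStream, OutputStream}

// Stand-in mirroring context.outputFile(name, mimeType){ stream => ... }:
// the block writes the file's contents; the method handles closing.
def outputFile(name: String, mimeType: String)(gen: OutputStream => Unit): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  gen(buffer)
  buffer.toByteArray
}

val bytes = outputFile("report", "text/csv") { stream =>
  stream.write("a,b\n1,2\n".getBytes("UTF-8"))
}
```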
The `context.displayHTML` method allows cells to display messages with arbitrary formatting and interactivity. In addition to the `html` itself, the cell may provide `javascript`. The provided javascript runs after the `html` is mounted in the DOM.

When it is necessary to refer to DOM nodes in the javascript, do not use hard-coded node `id`s. If the same workflow module appears twice in a notebook, the same node `id` will appear twice; even if your javascript replaces the node `id`, there may be race conditions when the notebook is re-opened. Instead, assign nodes an `id` based on the return value of `context.executionIdentifier`. DOM node `id`s prefixed with this value are guaranteed to be reserved for use by the `context`'s module.

The `javascriptDependencies` and `cssDependencies` parameters allow external dependencies to be dynamically loaded --- for example, javascript and css tied to a specific version of Bokeh. These should be provided as references to a CDN or similar (if Vizier has a local copy of the reference, it will be used). Dependencies are only loaded once: the first time they appear in a notebook. The provided `javascript` will not be executed until all javascript dependencies have loaded.
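The unique-id convention can be sketched as follows. Here `executionId` stands in for the value returned by `context.executionIdentifier`, and `renderChart` is a hypothetical helper that builds the `html` and `javascript` arguments for `context.displayHTML`:

```scala
// Sketch: build HTML/JS for displayHTML without hard-coded DOM ids.
// Prefixing the id with the execution identifier guarantees it is
// reserved for this module, even if the module appears twice.
def renderChart(executionId: String): (String, String) = {
  val divId      = s"${executionId}_chart"
  val html       = s"""<div id="$divId"></div>"""
  val javascript = s"""document.getElementById("$divId").textContent = "loaded";"""
  (html, javascript)
}

val (html, javascript) = renderChart("cell42")
// pass html and javascript to context.displayHTML(html, javascript)
```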