Skip to content

Cell SQL

Oliver Kennedy edited this page Dec 31, 2023 · 2 revisions

SQL

The SQL Cell executes a single SQL SELECT query, optionally saving the result as a new dataset.

Any existing dataset available in Vizier can be referenced in a FROM clause.

Function Support

  • All SparkSQL functions, scalar and aggregate, are supported.
  • Apache Sedona operations are supported.
  • Any function exported by a Python cell can be invoked as a Spark Python UDF.

Controlling Caveats

Launch Vizier with -X ENABLE-MIMIR to use Caveats.

In the SQL cell you can create caveats on the fly to annotate attribute values as incorrect, suspicious, or noteworthy. Caveats generated by SQL cell propagate just as any Caveat generated by Lenses. A typical use of caveats is to mark errors or problems identified in the data or mark the result of cleaning data as potentially untrustworthy. Consider the a persons dataset recording a persons /name/, /age/, and /job/:

Note that for some persons we do not know their job and some age values are obviously wrong (negative or unrealistically high). In Vizier, you can indicate as part of a SQL queries, which values should be caveated. For sake of the example, let's assume we want to get the names of persons, but know that names where manually entered by Bob who is unreliable. In SQL, we can retrieve the names of persons like this:

SELECT name
FROM persons;

To indicate the that values of the name attribute are untrustworthy, we can use the Caveat function to wrap the name attribute. The Caveat function returns its input unmodified, but marks it as Caveat. As shown below, Vizier indicates in the output of the queries that all names are caveated by showing them in red.

SELECT Caveat(NAME) AS NAME
FROM persons;

If your query result will be used by somebody else (or by ourself in the future), it may be a good idea to record why we have caveated the names. In Vizier's SQL cell we can associated a message with each caveated value to document it.

SELECT Caveat(NAME,'The names where manually inserted by Bob who is unreliable') AS NAME
FROM persons

Note that caveats persist through operations. So if we store the caveated names as a new dataset and then run another query over this dataset, then the result will carry the caveats of all input values they depend on.

Often, only a subset of all values from a column are erroneous and should be caveated, e.g., all NULL values. The Caveat function is sufficient for achieving that. However, since conditional caveating is a common use case, we provide specialized functions for that:

  • CaveatIf(value,cond) caveats input value if cond evaluate to `true
  • CaveatIfNull(value) caveats input value if it is NULL
  • CaveatReplaceIf(value,replacement,cond) replaces value with replacement if cond evaluates true and caveats the result (if it got replaced).

Consider the application of these functions to mark errors in the AGE and JOB columns of our running example:

SELECT NAME,
	CaveatIf(AGE,(AGE < 1 OR AGE > 120)) AS AGE,
    CaveatIfNull(JOB) AS JOB
FROM persons

Marking errors is an important, but only the first step in curating data and preparing it for analysis. Once errors have been detected and documented, e.g,. through Caveats in Vizier, the data needs to be repaired. Typically, insufficient information is available to determine what the correct repair is. For instance, in our example above we know that nobody can have a negative age and, thus, Alice's age can not be correct. However, we do not know her correct age. Thus, when repairing data one typically relies on reasonable, but heuristic, methods. For example, a common methods to repair outliers is to replace them with the median value of the column. For such heuristic methods it is important to document the choices made by the heuristic, because they may not be correct and may affect the results of analysis. In Vizier this can be achieved by caveating value to record the fact that is was changed because of a particular cleaning heuristic. For instance, we may replace AGE values that are below zero with 1 and age values larger than 120 with 120 and caveat the replaced values like this:

SELECT NAME,
       CaveatReplaceIf(AGE,
                       CASE WHEN AGE < 1 THEN 1 ELSE 120 END,
                       (AGE < 1) OR (AGE > 120)
        ) AS AGE,
    CaveatIfNull(JOB) AS JOB
FROM persons

Note that as mentioned above, Vizier propagates such caveats automatically through operations. Thus, indicating to you when result depends on values that are caveated. Continuing with our example, let us count the number of people per age at the granularity of decades and plot the result. As shown below, Vizier has inferred that Alice and Bob could belong to any age group and, thus the counts for all age groups are uncertain. In plots, Vizier marks results that depend on caveated values by putting a small red dot besides the bar (point, line segment).