Skip to content

Commit

Permalink
Add SQL Notebooks post (#17)
Browse files Browse the repository at this point in the history
* disable link

* update theme

* upload sql post

* turn on umami again

* update date
  • Loading branch information
danielroelfs authored Jun 19, 2024
1 parent 4cb145c commit a99b424
Show file tree
Hide file tree
Showing 3 changed files with 482 additions and 1 deletion.
253 changes: 253 additions & 0 deletions content/blog/2024-sql-notebooks-with-quarto/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
---
title: SQL Notebooks with Quarto
date: 2024-06-19T00:00:00.000Z
description: SQL Notebooks with Quarto
slug: sql-notebooks-with-quarto
categories:
- miscellaneous
tags:
- miscellaneous
- sql
- database
- notebook
editor_options:
chunk_output_type: console
---


Notebooks are a cornerstone of exploratory data analysis in data science (and sometimes also [in production](https://ploomber.io/blog/nbs-production/) for some godforgiven reason?). Notebooks get a fair share of (justified) [criticism](https://yobibyte.github.io/notebooks.html), but I believe they are useful if used right, particularly in the education sphere where you can get immediate feedback on the code you wrote. The best-known notebook platforms are [Jupyter](https://jupyter.org) for Python (although it can actually [R](https://docs.anaconda.com/navigator/tutorials/r-lang/https://docs.anaconda.com/navigator/tutorials/r-lang/) and [Julia](https://www.jousefmurad.com/coding/install-julia-jupyter-notebook/) too) and [R Markdown](https://rmarkdown.rstudio.com). However, since 2022 a natural successor to R Markdown was released called [Quarto](https://quarto.org) that natively supports Python, R, and Julia. I've become quite an advocate for Quarto since it was released since it works well for both Python and R out-of-the-box without any extensions necessary. It's also Git compatible, which is one of the other major drawbacks from Jupyter in my opinion. One of the main languages that currently does not have a widely used notebook platform is SQL. Although some languages and companies have made attempts. For example Snowflake and SQL Server have a (pseudo)notebook support in their IDEs and Meta uses an [internal tool](https://engineering.fb.com/2022/04/26/developer-tools/sql-notebooks/) that supports analyses in SQL notebooks. That's why I want to show you today how you can turn Quarto into a notebook for SQL analyses that can be rendered into webpages, PDFs, markdown documents, or online books and documentation.

## Background

{{< sidenote br=\"6em\" >}}
I use Quarto documents for all of my blogposts to render Markdown files that [Hugo](https://gohugo.io) builds into webpages
{{< /sidenote >}}

As mentioned, Quarto supports both R and Python, although it is [developed by Posit](https://posit.co/blog/announcing-quarto-a-new-scientific-and-technical-publishing-system/) (formerly RStudio) so it's more popularl with R users since Jupyter remains the gold standard for Python. However, Posit aims to branch out into other languages and has massively improved support in their IDE for Python and Julia. I've used Quarto for Python a fair bit and I think it ticks some boxes that Jupyter does not, while also having some quirks that Jupyter avoids. So while I was learning about how Quarto supports both the R and Python kernels, I learned that the list of officially supported engines (R, Python, Julia, and Javascript) and unofficially supported engines are different.

{{< sidenote br=\"1em\" >}}
[Yihui Xie](https://yihui.org), the main developer of `knitr` deserves to get more praise than he does for his contributions
{{< /sidenote >}}

Quarto inherits the supported language engines from it's predecessor R Markdown, which relies on the [`knitr`](https://yihui.org/knitr/) package to handle different languages. `knitr` supports a [host of engines](https://bookdown.org/yihui/rmarkdown/language-engines.html) including Octave, Fortran, C, Stan, and SQL. So since the `knitr` engine is also used to do the rendering of R code in Quarto, these other languages kinda come for free even though the Quarto development team currently has not built in support for these other languages directly (although [people have asked on GitHub about this](https://github.com/quarto-dev/quarto-cli/discussions/1737)). But SQL code chunks will render nicely in Quarto documents if you set it up correctly. That's what I will show here now: how to use Quarto for creating SQL notebooks.

## Setup

{{< sidenote br=\"1em\" >}}
I found that the version on Homebrew might not be the latest version from the [Quarto website](https://quarto.org/docs/get-started/), so you might want to install using the GUI
{{< /sidenote >}}

So how do you get this up. So obviously, you first need to [install Quarto](https://quarto.org/docs/get-started/). If you have a Mac (with Homebrew installed), you can install it directly [from there](https://formulae.brew.sh/cask/quarto) with the following command:

``` bash
brew install quarto
```

You'll also need to install the R package [`DBI`](https://dbi.r-dbi.org) which will handle the database interfacing and the driver for whatever database you're using. In my case this is [duckdb](http://duckdb.org) (more about duckdb below). The driver for duckdb is easily accessed through the [`duckdb`](https://r.duckdb.org) package. Drivers for other database management systems are of course [available in their respective packages](https://dbi.r-dbi.org). Creating these SQL notebooks with just Python directly is not possibly as of now as far as I know since Quarto uses the Jupyter engine to handle Python code chunks and as mentioned we need the `knitr` engine for SQL support.

{{< sidenote br=\"5em\" >}}
[*RDBMS*](https://en.wikipedia.org/wiki/Relational_database): relational database management system
{{< /sidenote >}}

We also need a database to connect to for demonstration purposes. I've become a fervent advocate for [duckdb](https://duckdb.org) for smaller projects (although support for collaborative projects [seem to be coming](https://motherduck.com)). Duckdb is an open-source RDBMS that is incredibly fast and memory efficient. I've been using it for a while, but I hope it'll become more widely popular now that the [v1.0.0 version has been released](https://duckdb.org/2024/06/03/announcing-duckdb-100.html) in June 2024. It is lightweight and fairly easy to [install](https://duckdb.org/docs/installation/) and works on all major systems. Again with Homebrew you can install it like this:

``` bash
brew install duckdb
```

To use as examples in this post we'll need some data. In this case I used a dataset I found on [GitHub](https://github.com/bbrumm/databasestar/blob/main/sample_databases/sample_db_superheroes/sqlite/01_reference_data.sql) that contains data on fictional superheroes from a variety of publishers (see the [metadata here](https://www.databasestar.com/sample-database-superheroes/)). I imported this data to duckdb using its [import functionality](https://duckdb.org/docs/guides/import/query_sqlite). For the second example I'll the example dataset included in the [`dbplyr`](https://dbplyr.tidyverse.org) R package that contains data from the [`nycflights13`](https://cran.r-project.org/web/packages/nycflights13/index.html) package on all flights that departed from New York City in 2013 that I also imported to duckdb.

## Creating the SQL notebook

Let me first show how to set up this SQL notebook using Quarto and then I'll show some examples of what a SQL notebook would look like when rendered. First, in order to create a SQL code chunk we need to create a connection the way we would normally in R through the `dbConnect()` function from the `DBI` package. This connection object we will then provide as an argument to the code chunk. It's possible to use multiple connections at the same time, but (as far as I know), it's only possible to run queries on a single connection at the time without [connecting to the table from R directly](https://stackoverflow.com/questions/60605192/how-to-use-dbidbconnect-to-read-and-write-tables-from-multiple-databases).

So here's how to connect to the different databases using the relevent driver and the connection URL to the database, which could be either local or hosted elsewhere (e.g. AWS RDS, PlanetScale, Supabase, or Motherduck). In this case it's two local files for simplicity. I'll also connect to to them as read-only to avoid accidentally overwriting anything.

``` r
con_superheroes <- DBI::dbConnect(
drv = duckdb::duckdb(),
dbdir = "./data/superheroes.db",
read_only = TRUE
)

con_flights <- DBI::dbConnect(
drv = duckdb::duckdb(),
dbdir = "./data/flights.db",
read_only = TRUE
)
```

You can then provide this connection object directly to the `connection` argument in the SQL code chunk. You can also provide other arguments (like `label` and `echo` and [some others](https://quarto.org/docs/reference/cells/cells-knitr.html)) to configure the functionality and output of each chunk, but note that not all options work in SQL chunks. You'll have experiment a bit to see which ones work and which ones don't. A typical SQL code chunk in Quarto might look something like this:

{{< sidenote >}}
Note that the semicolon is not strictly necessary for functionality, but it is good practice to include it
{{< /sidenote >}}

```` markdown
```{sql}
#| label: get-carriers
#| connection: con_flights

SELECT name, carrier FROM airlines LIMIT 10;
```
````

Which will provide the following output:

``` sql
SELECT name, carrier FROM airlines LIMIT 10;
```

| name | carrier |
|:----------------------------|:--------|
| Endeavor Air Inc. | 9E |
| American Airlines Inc. | AA |
| Alaska Airlines Inc. | AS |
| JetBlue Airways | B6 |
| Delta Air Lines Inc. | DL |
| ExpressJet Airlines Inc. | EV |
| Frontier Airlines Inc. | F9 |
| AirTran Airways Corporation | FL |
| Hawaiian Airlines Inc. | HA |
| Envoy Air | MQ |

Displaying records 1 - 10

{{< sidenote >}}
Unfortunately [dot commands](https://duckdb.org/docs/api/cli/dot_commands.html) are not supported at the moment
{{< /sidenote >}}

Obviously, you can run any SQL commands since the code chunk is performing the command as if it were done in the SQL prompt directly. The output is generated as set up. In this example the code is presented in a Markdown table which is then styled in the same format as the rest of the tables on this website using the CSS style everything else uses. If I were to render to HTML, it would come out as an HTML table, the same for rendering to PDF and outputting LaTeX tables. Also, if you write multiple queries inside a single chunk (separated by a semicolon), only the last one will be rendered. For example, if you wanted to show all tables in the superheroes database the code chunk would look like this:

{{< sidenote >}}
Let's also hide the code chunk itself and only show the output with the `echo` option
{{< /sidenote >}}

```` markdown
```{sql}
#| label: get-tables-in-superheroes-db
#| echo: false
#| connection: con_flights

SHOW TABLES;
```
````

And the output looks like this:

| name |
|:---------|
| airlines |
| airports |
| flights |
| planes |
| weather |

5 records

## Exploratory data analysis example

From here, you're only limited by your imaginations. So go wild to create queries and output as simple or complex as your analysis demands. I'll include some examples with the output below to show the duckdb functionality and how different queries might look when rendered, starting with a query in the flights database.

``` sql
SELECT
airlines.name,
ROUND(MEAN(flights.dep_delay), 2) AS mean_delay,
ROUND(STDDEV(flights.dep_delay) / SQRT(COUNT(1)), 2) AS sem_delay
FROM flights
INNER JOIN airlines ON flights.carrier = airlines.carrier
GROUP BY airlines.name
ORDER BY mean_delay DESC;
```

| name | mean_delay | sem_delay |
|:----------------------------|-----------:|----------:|
| Frontier Airlines Inc. | 20.22 | 2.23 |
| ExpressJet Airlines Inc. | 19.96 | 0.20 |
| Mesa Airlines Inc. | 19.00 | 2.01 |
| AirTran Airways Corporation | 18.73 | 0.92 |
| Southwest Airlines Co. | 17.71 | 0.39 |
| Endeavor Air Inc. | 16.73 | 0.34 |
| JetBlue Airways | 13.02 | 0.16 |
| Virgin America | 12.87 | 0.62 |
| SkyWest Airlines Inc. | 12.59 | 7.61 |
| United Air Lines Inc. | 12.11 | 0.15 |

Displaying records 1 - 10

And then some CTEs using the superheroes database. For example looking at the gender of the superheroes and the proportion for each publisher.

``` sql
WITH pub_count AS (
SELECT
publisher_id,
COUNT(publisher_id) AS total_sh
FROM superhero
GROUP BY publisher_id
),

gender_count AS (
SELECT
publisher_id,
gender_id,
COUNT(gender_id) AS number
FROM superhero
GROUP BY publisher_id, gender_id
)

SELECT
pub.publisher_name AS publisher,
gen.gender AS gender,
gc.number,
ROUND(100 * gc.number / total_sh, 2) AS proportion
FROM gender_count AS gc
INNER JOIN pub_count AS pc ON gc.publisher_id = pc.publisher_id
INNER JOIN publisher AS pub ON gc.publisher_id = pub.id
INNER JOIN gender AS gen ON gc.gender_id = gen.id
WHERE publisher IN ('Marvel Comics', 'DC Comics')
ORDER BY publisher, gender;
```

| publisher | gender | number | proportion |
|:--------------|:-------|-------:|-----------:|
| DC Comics | Female | 63 | 28.13 |
| DC Comics | Male | 160 | 71.43 |
| DC Comics | N/A | 1 | 0.45 |
| Marvel Comics | Female | 111 | 28.68 |
| Marvel Comics | Male | 252 | 65.12 |
| Marvel Comics | N/A | 24 | 6.20 |

6 records

Which made me curious which superheroes don't have a gender or are non-binary.

``` sql
SELECT
sh.superhero_name AS 'superhero name',
pub.publisher_name AS 'publisher'
FROM
superhero AS sh
INNER JOIN gender AS gen ON sh.gender_id = gen.id
INNER JOIN publisher AS pub ON sh.publisher_id = pub.id
WHERE gen.gender = 'N/A'
ORDER BY pub.publisher_name;
```

| superhero name | publisher |
|:-----------------|:--------------|
| Darkside | |
| Godzilla | |
| Parademon | DC Comics |
| Man of Miracles | Image Comics |
| Bird-Brain | Marvel Comics |
| Captain Universe | Marvel Comics |
| Cecilia Reyes | Marvel Comics |
| Clea | Marvel Comics |
| Cypher | Marvel Comics |
| Ego | Marvel Comics |

Displaying records 1 - 10

## Concluding remarks

I believe the ability to create SQL notebooks this way offers some serious advantages by allowing analysts to annotate their queries using Markdown and comment on the results and describe their methodology and interpretations. This would introduce the same functionality to SQL that already exists in Jupyter Notebooks and R Markdown/Quarto notebooks that have aided their popularity. I really do hope RStudio has plans to further develop this SQL integration into Quarto, because if they do I believe they may have a good shot at creating the ultimate dialect-agnostic SQL notebook available open source. If we were allowed to dream I'd love for them to support linting with for example [sqlfluff](https://sqlfluff.com) for each chunk as well same as it does with `lintr` for R code. In addition, it would make Quarto perhaps the ultimate notebook software since it would then fully support R, Python, SQL, and Julia which would make it incredibly powerful and versatile, offering the ability to switch between languages within a single notebook. I hope to have inspired you have instilled some enthusiasm for Quarto and their concept of SQL notebooks and I hope you give this implementation of a SQL notebook a try!
Loading

0 comments on commit a99b424

Please sign in to comment.