Skip to content

Commit

Permalink
Re-write documentation on data loading (#202)
Browse files Browse the repository at this point in the history
  • Loading branch information
coatless authored May 14, 2024
1 parent 3a709d6 commit fc18875
Show file tree
Hide file tree
Showing 2 changed files with 162 additions and 10 deletions.
170 changes: 160 additions & 10 deletions docs/qwebr-loading-data.qmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
---
title: "Accessing and Using Data in webR"
subtitle: "Possible avenues for Data Retrieval"
author: "James Joseph Balamuta"
date: "01-14-2024"
date-modified: last-modified
Expand All @@ -14,36 +13,114 @@ filters:

# Overview

When working with data in [webR](https://docs.r-wasm.org/webr/latest/), there are some tricks to using data in the web environment. This documentation entry guides you through a few changes related to accessing data.
When working with webR in a web environment, there are some modifications and considerations required for using data. This documentation entry guides you through a few changes related to accessing data.

## Background: Virtual File System

Given the browser-based nature of webR, accessing local files is restricted. To overcome this limitation, webR establishes a [**virtual** file system](https://docs.r-wasm.org/webr/latest/communication.html#emscripten-virtual-filesystem) inside of your browser that is separate from your local file system. Consequently, webR does **not** have awareness of local file system and its paths. Thus, to use data we need to download it into the virtual file system either through: an R package, a URL using HTTPS, or a Web API.

By default, the webR virtual file system's home directory and initial working directory is `/home/web_user`. This can be changed using a built-in extension [document-level option `home-dir`](https://quarto-webr.thecoatlessprofessor.com/qwebr-meta-options.html#home-dir).

### Aside

While there are methods for mounting pre-built images using the webR's [Mounting Filesystem](https://docs.r-wasm.org/webr/latest/mounting.html#the-virtual-filesystem) mechanic, the `quarto-webr` Extension does not support this option at the moment.

## Accessing Data through R Data Packages

One approach to accessing data is to store the data inside of an [R data package](https://thecoatlessprofessor.com/programming/r/creating-an-r-data-package/). This kind of R package consists solely of data in an R ready format with the added benefit of help documentation. If the data package is on CRAN, chances are there is a version available for webR on the [main repository (warning not a mobile data friendly link)](https://repo.r-wasm.org/) and, thus, can be accessed using `install.packages("pkg")` or added to the documents `packages` key. Otherwise, the R package will need to be converted, deployed, and accessed through GitHub Pages by following the advice to [setup a custom package repository](qwebr-using-r-packages.qmd#custom-repositories).
The quickest approach for accessing data is to store it inside of an [R data package](https://thecoatlessprofessor.com/programming/r/creating-an-r-data-package/). This kind of R package consists solely of data in an R ready format with the added benefit of help documentation. If the data package is available on CRAN, there's a good chance a version exists for webR on the [main webR package repository (warning not a mobile data friendly link)](https://repo.r-wasm.org/) and, thus, can be accessed using `install.packages("pkg")` or added to the documents `packages` key.

If the R package is not available on CRAN, then it will need to be compiled for webR, deployed, and accessed through GitHub Pages or [r-universe.dev](https://ropensci.org/blog/2023/11/17/runiverse-wasm/) by following the advice on [creating a custom webR/R WASM package repository](qwebr-using-r-packages.qmd#custom-repositories).

## Retrieving Data from the Web

::: {.callout-note}
::: {.callout-important}
Before proceeding, take note of the following considerations when working with remote data:

1. **Security Protocol:** webR necessitates data retrieval via the [HyperText Transfer Protocol Secure (HTTPS)](https://developer.mozilla.org/en-US/docs/Glossary/HTTPS) protocol to ensure secure connections and the [Cross-Origin Resource Sharing (CORS)](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) being enabled.
1. **Security Protocol:** webR necessitates data retrieval via the [HyperText Transfer Protocol Secure (HTTPS)](https://developer.mozilla.org/en-US/docs/Glossary/HTTPS) protocol to ensure secure connections and the [Cross-Origin Resource Sharing (CORS)](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) being enabled on the server where the data is being served.
2. **Package Compatibility:** In the absence of websockets within webR, packages reliant on `{curl}` methods may require adaptation or alternative solutions.
:::

When retrieving data in webR, it is imperative to always use links that use [HyperText Transfer Protocol Secure (HTTPS)](https://developer.mozilla.org/en-US/docs/Glossary/HTTPS). From there, the data at the HTTPS URL can be downloaded using the `download.file()` function and subsequently reading it into R utilizing a relative path. Follow this example:
### Hosting Data

#### Standalone Repository

We suggest creating a GitHub repository that uses GitHub Pages to host the data. By default, GitHub Pages serves data files using the CORS protocol and can quickly be setup to enforce HTTPS URLs [by checking a box](https://docs.github.com/en/pages/getting-started-with-github-pages/securing-your-github-pages-site-with-https#enforcing-https-for-your-github-pages-site).

You can see an example raw data repository here:

<https://github.com/coatless/raw-data>

The corresponding site deployment of the `main` branch can be seen here:

<https://coatless.github.io/raw-data/>

#### Alongside the document

There may be times when it is not feasible to create a standalone repository to host the data. In cases like this, you may wish to host the data alongside of the document through Quarto's publishing system. In this case, please add the `resources` key to the top of the [HTML document](https://quarto.org/docs/reference/formats/html.html#includes) or inside the [project's `_quarto.yml`](https://quarto.org/docs/reference/projects/websites.html#project).

For `my-document.qmd`, this would be:

```yaml
---
title: "quarto-webr document with data"
format:
html:
resources:
- my-data.csv # Include just the CSV
- my-data-directory/* # Include all files
engine: knitr
filters:
- webr
---
```

For `_quarto.yml`, this would be:

```yaml
---
project:
type: website
resources:
- my-data.csv # Include just the CSV
- my-data-directory/* # Include all files
---
```

Subsequently, reference the data using the URL to where the document is located. For example, if the document is at:

```default
https://example.com/folder/my-document.html
```

Then, the data should be accessed using:

```default
https://example.com/folder/my-data.csv
```

:::{.callout-note}
You may need to publish the document before using the URL. Also, be mindful of data version mismatches, as the data will be fetched from the HTTPS URL instead of being available locally.
:::

### Obtain Data

We can retrieve data at a URL with HTTPS through using the `download.file()` function and, subsequently, reading it into R using a relative path. The later can be done using either Base R or Tidyverse functions.

For example, if we wanted to work with [flights.csv](https://coatless.github.io/raw-data/flights.csv) from the [`nycflights13`](https://cran.r-project.org/package=nycflights13) _R_ package ([Details](https://nycflights13.tidyverse.org/reference/flights.html)), we would specify:

```r
url <- "https://example.com/data.csv"
download.file(url, "data.csv")
url <- "https://coatless.github.io/raw-data/flights.csv"
download.file(url, "flights.csv")
```

This action saves the file into webR's virtual file system to be read into R's analysis environment. Replace `"https://example.com/data.csv"` with the actual URL of your desired data source.
This action saves the file into webR's virtual file system to be read into R's analysis environment. Replace `"https://coatless.github.io/raw-data/flights.csv"` and `"flights.csv"` with the actual URL of your desired data source and desired local file name.

### Base R

For optimized performance, leverage base R's `read.*()` functions, as they do not necessitate additional package dependencies.

```r
data <- read.csv("data.csv")
data <- read.csv("flights.csv")
```

### Tidyverse
Expand All @@ -58,3 +135,76 @@ Note that employing `tidyverse` or `readr` functions entails additional package
install.packages("readr")
data <- readr::read_csv("data.csv")
```

### Try it!

We've setup the above example inside of an interactive cell for your to explore below.

:::{.panel-tabset}

### {quarto-webr}

```{webr-r}
#| autorun: true
# See where we are in the file system:
cat("We're currently at:\n")
getwd()
# View a list of files for the working directory.
cat("We have the following files present:\n")
list.files()
# Specify the data URL using HTTPS
url <- "https://coatless.github.io/raw-data/flights.csv"
# Download the data file from the HTTPS URL and save it as
# flights.csv
cat("Download the data ...\n")
download.file(url, "flights.csv")
# Check for the data.
cat("After downloading the data, we now have:\n")
list.files()
# Read the flights data into R
flights_from_csv <- read.csv("flights.csv")
# See the first few rows of the flights_from_csv data frame.
cat("Let's view the first 6 observations of data:\n")
head(flights_from_csv)
```

### Code cell

```{{webr-r}}
#| autorun: true
# See where we are in the file system:
cat("We're currently at:\n")
getwd()
# View a list of files for the working directory.
cat("We have the following files present:\n")
list.files()
# Specify the data URL using HTTPS
url <- "https://coatless.github.io/raw-data/flights.csv"
# Download the data file from the HTTPS URL and save it as
# flights.csv
cat("Download the data ...\n")
download.file(url, "flights.csv")
# Check for the data.
cat("After downloading the data, we now have:\n")
list.files()
# Read the flights data into R
flights_from_csv <- read.csv("flights.csv")
# See the first few rows of the flights_from_csv data frame.
cat("Let's view the first 6 observations of data:\n")
head(flights_from_csv)
```

:::

2 changes: 2 additions & 0 deletions docs/qwebr-release-notes.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ Features listed under the `-dev` version have not yet been solidified and may ch

- Updated community examples covering `quarto-webr` uses in 2024 Q1 ([#193](https://github.com/coatless/quarto-webr/issues/193)).

- Improve the data loading documentation page by clarifying the virtual file system usage ([#201](https://github.com/coatless/quarto-webr/issues/201)).

# 0.4.1: Vivid Montage (03-25-2024)

## Features
Expand Down

0 comments on commit fc18875

Please sign in to comment.