---
output: github_document
---
<!-- DO NOT EDIT. CREATED BY README.RMD. Knit that. -->
# <a name="top"></a>NMFS Open Science Docker Stack
### Beta release June 1, 2024.
This is a collection of container images that provide standardized environments for Python and R with Jupyter Lab, RStudio and VS Code IDEs. The images are built off the [Rocker](https://rocker-project.org/images/devcontainer/images.html), [Pangeo](https://github.com/pangeo-data/pangeo-docker-images) and [Jupyter](https://jupyter-docker-stacks.readthedocs.io/en/latest/) base images. This repo holds the stable Docker stack for specific pipelines used in Fisheries. The images are designed to work out of the box and identically in JupyterHubs, Codespaces, Binder, etc. Read the Design principles section below to learn what the NMFS Open Sci Docker Stack does. For use, see [Instructions](#instructions) and [Link to files](#files). This Docker Stack was the joint effort of a number of people. See [Acknowledgements](#thanks).
## Stable set of images
There are many other images in the `images` folder that are experimental in nature. *If you are looking for standard Python or R Docker images, go to the base Docker stacks linked above.*
```{r echo=FALSE}
source("parse_dockerfile.R")
# Write one markdown table row for image `i`.
table_line <- function(i){
  # The description is parsed from the image's Dockerfile, if it exists.
  # fil <- paste0("images/",i,"/DESCRIPTION")
  fil <- paste0("images/",i,"/Dockerfile")
  desc <- ""
  if(file.exists(fil)){
    # desc <- readLines(fil, warn=FALSE)
    desc <- parse_dockerfile(fil)
  }
  # Exit status is 0 when a local branch named `i` exists; only then show a Binder button.
  branch <- system(paste0("git show-ref refs/heads/", i), ignore.stdout = TRUE)
  binder_button <- ""
  if(branch == 0) binder_button <- paste0("[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nmfs-opensci/container-images/", i, ")")
  # Show a Codespaces button only when a devcontainer.json exists for the image.
  devfil <- paste0(".devcontainer/",i,"/devcontainer.json")
  devcon_button <- ""
  if(file.exists(devfil)) devcon_button <- paste0("[![Button GCS]](https://codespaces.new/nmfs-opensci/container-images?devcontainer_path=.devcontainer/", i, "/devcontainer.json)")
  cat("| [", i, "](https://github.com/nmfs-opensci/container-images/pkgs/container/container-images%2F", i, ") <br/> ![](https://ghcr-badge.egpl.dev/nmfs-opensci/container-images%2F", i,"/size?color=%2344cc11&tag=latest&label=image+size&trim=) <br/> ![](https://ghcr-badge.egpl.dev/nmfs-opensci%2Fcontainer-images/", i, "/latest_tag?color=%2344cc11&ignore=latest&label=version&trim=) | ", desc, "<br/> <small>ghcr.io/nmfs-opensci/container-images/", i, ":latest</small> | ", devcon_button, " <br/> ", binder_button, " | [dockerfile](https://github.com/nmfs-opensci/container-images/tree/main/images/", i, "/Dockerfile) <br/> [directory](https://github.com/nmfs-opensci/container-images/tree/main/images/", i, ") |\n", sep="")
}
```
| Image | Description | Open | info |
|-----------|---------------------------|----------------|-----------|
| **Base** | Use as the base image when possible | | |
```{r results='asis', echo=FALSE}
imgs <- c("py-rocket-base", "py-rocket-geospatial", "py-rocket-geospatial-2")
for(i in imgs) table_line(i)
```
| | | | |
| **Specialized** | Images for specific analyses | | |
```{r results='asis', echo=FALSE}
imgs <- c("py-geospatial", "r-geospatial-sdm", "arcgis", "coastwatch", "echopype", "vast", "aomlomics-jh")
for(i in imgs) table_line(i)
```
| **Community** | Images from other Docker Stacks | | |
| [pangeo-notebook](https://github.com/pangeo-data/pangeo-docker-images/tree/master/pangeo-notebook) <br/> ![](https://img.shields.io/docker/image-size/pangeo/pangeo-notebook?sort=date) <br/> | Pangeo Notebook | [![Button GCS](https://img.shields.io/badge/launch-codespace-brightgreen?logo=github)](https://codespaces.new/nmfs-opensci/container-images?devcontainer_path=.devcontainer/pangeo-notebook/devcontainer.json) | [directory](https://github.com/pangeo-data/pangeo-docker-images/tree/master/pangeo-notebook) |
[Button GCS]: https://img.shields.io/badge/launch-codespace-brightgreen?logo=github
*Click on the image name in the table above for a current list of installed packages and versions*
## Design principles
The images are designed to be deployable "out of the box" from JupyterHubs, Codespaces, GitPod, Colab, Binder, and on your computer via Docker or Podman with no modification. See instructions below. Each will spin up a Jupyter server with Jupyter Lab (and Notebook), RStudio and VS Code, all using the image's development environment.
- The Python environment follows the Pangeo images: micromamba is installed as the solver, with base and notebook environments. The Jupyter modules are installed in the notebook conda environment, and images launch with the notebook environment activated, again following the Pangeo design. Images that use Pangeo as a base have user `jovyan` and home directory `/home/jovyan`.
- When an image contains both R and Python, the base image is a Rocker image and adheres to the Rocker norms for R and RStudio environment design. For the Python side of these images, micromamba is installed and the Pangeo conda environment structure is applied as in the Python-only images. When Python is used from within RStudio, RStudio uses the Python in the conda notebook environment. The user is `rstudio` but the home directory is `/home/jovyan`, so the images play nicely with standard JupyterHub deployments with persistent storage.
- These images are large, not lightweight. Use the original Jupyter, Pangeo or Rocker images if you are looking for lightweight data science images.
## Why use a container?
The main reason is that geospatial, bioinformatics, and TMB/INLA environments can be hard to get working right. Using a Docker image means you use a stable environment. Watch this [video](https://www.youtube.com/watch?v=qgLPpULvBbQ) from Yuvi Panda (Jupyter Project) and read this [article](https://journal.r-project.org/archive/2017/RJ-2017-065/RJ-2017-065.pdf) about the Rocker Project in The R Journal by Carl Boettiger and Dirk Eddelbuettel.
**Related Docker Stacks**
* [NASA Openscapes corn](https://github.com/nasa-openscapes/corn) and [NASA Openscapes py-rocket](https://github.com/nasa-openscapes/py-rocket)
* [Rocker](https://rocker-project.org/images/devcontainer/images.html)
* [Pangeo](https://github.com/pangeo-data/pangeo-docker-images)
* [Jupyter](https://jupyter-docker-stacks.readthedocs.io/en/latest/)
* [geocompx](https://github.com/geocompx/docker)
* [GPU accelerated docker images and devcontainers](https://github.com/b-data)
### <a name="thanks"></a>Acknowledgements
The motivation for the Docker Stack was the success of the NASA Openscapes "corn" image developed by Luis Lopez (NASA) and used in countless workshops on cloud computing with NASA Earth Data. Subsequently the NASA Openscapes mentor cloud-infrastructure Slack group met during weekly co-work sessions and plugged away at the problem of helping users 'fledge' off the Openscapes JupyterHub, which involved creating images that could be used outside of JupyterHubs, and updating the original "py-rocket" R image created by Luis. Carl Boettiger (UC Berkeley & Rocker Project) and Eli Holmes (NOAA Fisheries) took on different aspects of this work. The GitHub Action tooling is courtesy of Carl. "py-rocket-base" is derived from Carl's "version 2.conda" of py-rocket. Eli further developed py-rocket into the form in this repo to bring it closer to the "corn" and Pangeo designs. Yuvi Panda (Jupyter, 2i2c) was instrumental in helping sort through so many mystery bugs. The Codespaces and devcontainer code is based on Michael Akridge's [Open Science Codespaces](https://github.com/MichaelAkridge-NOAA/Open-Science-Codespaces) work. Individual images have different core developers: Tim Haverland (arcgis), Sunny Hospital (coastwatch), Luke Thompson (aomlomics-jh), Eli Holmes (the various py-rocket versions).
## License information
All code used in the images is under open licenses. Some is copyleft, which means that if you modify their code (we don't), you need to also provide your source code. The Dockerfile code is released under Apache 2.0, a very permissive open source license which does not require that you make your own modifications open. See the README.md files for the licenses for specific code used in the Dockerfiles.
- The Dockerfiles are released under Apache 2.0.
- [jupyterhub](https://github.com/jupyterhub/jupyterhub?tab=License-1-ov-file#readme) : Modified BSD License
- [jupyterlab](https://github.com/jupyterlab/jupyterlab?tab=License-1-ov-file#readme): Modified BSD License
- [Openscapes base Python image](https://github.com/nasa-openscapes/corn): MIT
- [Pangeo Docker Stack](https://github.com/pangeo-data/pangeo-docker-images): MIT
- [Python](https://docs.python.org/3/license.html): PSF License (examples in the documentation are also available under Zero-Clause BSD)
- [Openscapes base rocker image](https://github.com/nasa-openscapes/py-rocket): MIT
- [Rocker Docker Stack](https://github.com/rocker-org/rocker-versioned2?tab=GPL-2.0-1-ov-file#readme): GPL-2
- [R](https://www.r-project.org/Licenses/): GPL-2, GPL-3
- RStudio Server: AGPL-3
- conda and mamba solvers: open source projects under the 3-clause BSD license. Anaconda is not used in these images, nor are the Anaconda repositories.
<hr>
## Disclaimer
This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project content is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
# <a name="instructions"></a>Instructions for using the images
[back to readme](#top)
There are many ways to use Docker images. Here are common ones. Scroll to the bottom for instructions on linking your container to file systems (so you can get and store files).
## To run images in a JupyterHub with 'bring your own image'
If your JupyterHub has this option:
- Click on the 'Bring your own image' radio button at bottom
- Paste in the URL of your image (or any other image)
- You will find the URLs in the right nav bar under 'Packages'
- Example `ghcr.io/nmfs-opensci/jupyter-base-notebook:latest`
## Run with a JupyterHub
This should work out of the box. Put the URL of the image wherever you would use images.
## Run with docker
You can run the images on a Virtual Machine or your computer if you have Docker or Podman installed.
```
docker run -p 8888:8888 ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
```
On a Mac M2+, turn on Rosetta emulation in the Docker Desktop settings and add `--platform linux/amd64`:
```
docker run --platform linux/amd64 -p 8888:8888 ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
```
In the terminal output, look for a URL like the following and open it in a browser.
```
http://127.0.0.1:8888/lab?token=6d45c7d88aba92a815647c
```
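If you start the container from a script, the URL can be picked out of the captured log text. This is a minimal sketch, assuming the log contains a line shaped like the example above; the function name `find_jupyter_url` and the pattern are illustrative and not part of the images.

```python
import re

# Assumption: the URL has the same shape as the example above
# (local address, a port, the /lab path and a hexadecimal token).
TOKEN_URL = re.compile(
    r"https?://(?:127\.0\.0\.1|localhost):\d+/lab\?token=[0-9a-f]+"
)


def find_jupyter_url(log_text):
    """Return the first Jupyter Lab URL with a token in log_text, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None
```

For example, you could pass it the text returned by `docker logs <container>` and open the result in a browser.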
**Running geospatial R Docker images and working with netCDF files**
The GDAL netCDF driver needs some extra flags added to the `docker run` call for GDAL to work correctly inside a Docker container. This doesn't affect Python as much since `xarray` works with netCDF via different drivers, but the `terra` netCDF functions use GDAL drivers under the hood to open netCDF files. You'll get an error saying it can't find files. We ran into trouble when accessing cloud-hosted netCDF files. Perhaps it works OK if you download the files.
Add this to the call:
```
--cap-add SYS_PTRACE --security-opt seccomp=unconfined
```
so your call will look like:
```
docker run -p 8888:8888 --cap-add SYS_PTRACE --security-opt seccomp=unconfined ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial:latest
```
Note that we had trouble getting this to work on a Mac with Apple chips. You can test whether it is going to work by running this Python code and checking whether `DCAP_VIRTUALIO` is listed:
```
from osgeo import gdal
nc = gdal.GetDriverByName("netCDF")
nc.GetMetadata().keys()
```
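The same check can return a yes/no answer instead of a list of keys to read through. This is a sketch only: the helper name `has_virtual_io` is hypothetical, and the guarded import is there so the snippet also loads on a machine without GDAL installed.

```python
def has_virtual_io(metadata):
    """Return True if DCAP_VIRTUALIO is listed in a driver's metadata."""
    return "DCAP_VIRTUALIO" in metadata


try:
    from osgeo import gdal
except ImportError:
    gdal = None  # GDAL Python bindings are not installed here

if gdal is not None:
    nc = gdal.GetDriverByName("netCDF")
    if nc is None:
        print("The netCDF driver is not available in this GDAL build.")
    else:
        print("DCAP_VIRTUALIO listed:", has_virtual_io(nc.GetMetadata()))
```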
## Run with Binder
Create a file called `Dockerfile` and put it in the base of your GitHub repository or in a folder called `binder` or `.binder`. Into that file put the following line (replacing the image URL to match your desired image).
```
FROM ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial:latest
```
Then go to <https://mybinder.org> and paste in the url to your GitHub repo or alternatively go to the following url directly:
```
https://mybinder.org/v2/gh/<username or org>/<reponame>
```
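If you generate Binder links for several repositories or branches, a small helper can build them. This sketch follows the URL form shown above; the optional trailing branch segment is the form used for the Binder buttons in the image table, and the function name `binder_url` is illustrative.

```python
from urllib.parse import quote


def binder_url(owner, repo, ref=None):
    """Build a mybinder.org launch URL for a GitHub repository.

    owner is a username or organization; ref is an optional branch name.
    """
    parts = [quote(owner, safe=""), quote(repo, safe="")]
    if ref is not None:
        # A slash inside a branch name is percent-encoded so it stays one segment.
        parts.append(quote(ref, safe=""))
    return "https://mybinder.org/v2/gh/" + "/".join(parts)
```

For example, `binder_url("nmfs-opensci", "container-images", "py-rocket-base")` gives the link used for the `py-rocket-base` Binder button.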
## With Codespaces
See the folders in the `.devcontainer` folder and create a `.devcontainer/devcontainer.json` file in your own repo by copying one of the `devcontainer.json` files. They all use the same template with just the top lines changed. Note that the folder `.devcontainer/codespace` is also required. If you change the line that starts up Jupyter Lab (at the bottom of the `devcontainer.json` file), do not use port 8888 or else RStudio will not launch.
The Codespaces code is based on: <https://github.com/MichaelAkridge-NOAA/Open-Science-Codespaces>
## GitPod -- like Codespaces
Work in progress. Approach is similar to Codespaces.
## Run on Google Colab
TBD. This seems harder. See this [issue](https://github.com/nmfs-opensci/container-images/issues/14)
# <a name="files"></a>Getting access to files
[back to readme](#top)
The container gives you a computing environment but, by design, it is isolated from the file system of whatever machine is running it. So you will need to get your files in and out of the container and have a way to save your work.
## Upload/Download files
Under the Files menu in Jupyter Lab or the Files tab in RStudio, you can upload and download files.
## Use a Git repository
Jupyter Lab and RStudio have Git GUIs. Use those or the command line to clone repos and push changes back to the repos.
```
cd ~
git clone <url to the repo>
```
## Connect to a bucket
If you are working with large data sets, you do not want to move these into your container (slow, slow). You will want to create a bucket (like an S3 bucket) and connect to that. This is like having an external drive in the cloud.
Instructions to come.
## Mount a file system
You can mount a local file system and read/write directly from that. Here "local" means the machine that is running the container. "local" might be a virtual machine, a server or your computer.
**On a JupyterHub**: The managers of the hub most likely have created persistent storage for you. If not, use Git, upload/download, or use buckets.
**On your computer**: you'll add a flag to the `docker run` command to mount your local file system to the Docker container.
When you use `--volume` to bind-mount a file or directory, make sure the target does not already exist in the Docker container. So do not bind to a directory like `/usr`, which would break the container (nothing bad happens; it just won't work). Use something like `/home/jovyan/mydir`. `--volume` creates the endpoint for you and it is always created as a directory.
In this example, `myproject_files` needs to exist in the directory where you are running `docker run`. If you get errors, try `ls` to make sure the directory is there.
```
docker run -p 8888:8888 --volume ./myproject_files:/home/jovyan/mydir ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
```
As you work in `mydir` in the container, those changes will appear in your computer's `myproject_files` directory. It is as if you are working on your own computer, but you are using the development environment of the Docker image.
Mac users with Apple chips, add `--platform linux/amd64`:
```
docker run --platform linux/amd64 -p 8888:8888 --volume ./myproject_files:/home/jovyan/mydir ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
```
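If you launch the container from a script, you can check that the local directory exists before mounting it. This is a minimal sketch, assuming Docker is installed and on your PATH; the function name `docker_run_command` and its defaults are illustrative and not part of the images.

```python
from pathlib import Path


def docker_run_command(image, host_dir, container_dir="/home/jovyan/mydir",
                       port=8888, platform=None):
    """Return the `docker run` argument list for a bind-mounted directory."""
    host_path = Path(host_dir).resolve()
    if not host_path.is_dir():
        # Fail early rather than mounting a directory that is not there.
        raise FileNotFoundError(f"{host_path} is not an existing directory")
    cmd = ["docker", "run"]
    if platform:
        # For example "linux/amd64" on a Mac with an Apple chip.
        cmd += ["--platform", platform]
    cmd += ["-p", f"{port}:{port}",
            "--volume", f"{host_path}:{container_dir}",
            image]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids quoting problems when the local path contains spaces.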
## Use py-rocket-base as a base image
Create a file called `Dockerfile`
```
FROM ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
```
Then add more code to that file. See the `images` and `draft_images` folders for examples.
Use a GitHub Action to automatically build and push the image to ghcr.io (GitHub Packages, associated with every repo). This action is triggered whenever your Dockerfile changes.
```
name: Docker Image CI
on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - 'Dockerfile'
jobs:
  build-and-push:
    uses: nmfs-opensci/container-images@main
```