Beta release June 1, 2024.

This is a collection of container images that provide standardized environments for Python and R with Jupyter Lab, RStudio and VS Code IDEs. The images are built off the Rocker, Pangeo and Jupyter base images. This repo holds the stable Docker stack for specific pipelines used in Fisheries. The images are designed to work out of the box and identically in JupyterHubs, Codespaces, Binder, etc. Read the Design section below on what the NMFS Open Sci Docker Stack does. For use, see Instructions and Link to files. This Docker Stack was the joint effort of a number of people. See Acknowledgements.

Stable set of images

There are many other images in the images folder that are experimental in nature. If you are looking for standard Python or R Docker images, go to the base Docker stacks linked above.

| Image | Description | Open | Info |
| --- | --- | --- | --- |
| Base | Use as the base image when possible | | |
| py-rocket-base | Tidyverse based R image with Python. ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest | Codespaces, Binder | dockerfile, directory |
| py-rocket-geospatial | Geospatial Python (3.10) and R (4.4) image. ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial:latest | Codespaces, Binder | dockerfile, directory |
| py-rocket-geospatial-2 | Geospatial Python (3.12) and R (4.4.1) image with Desktop (QGIS, Panoply, CWUtils, Zotero). ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial-2:latest | Codespaces | dockerfile, directory |
| Specialized | Images for specific analyses | | |
| py-geospatial | NASA Openscapes Python image used in workshops. ghcr.io/nmfs-opensci/container-images/py-geospatial:latest | Codespaces, Binder | dockerfile, directory |
| r-geospatial-sdm | Geospatial R (version 4.4) image with Species Distribution Modeling packages including sdmTMB. ghcr.io/nmfs-opensci/container-images/r-geospatial-sdm:latest | Codespaces, Binder | dockerfile, directory |
| arcgis | ArcGIS Python module image that will run in a JupyterHub. ghcr.io/nmfs-opensci/container-images/arcgis:latest | Codespaces, Binder | dockerfile, directory |
| coastwatch | CoastWatch image for satellite training courses. ghcr.io/nmfs-opensci/container-images/coastwatch:latest | Codespaces, Binder | dockerfile, directory |
| echopype | echopype tooling for ocean sonar data processing in Python. Author: Wu-Jung Lee + echopype team. ghcr.io/nmfs-opensci/container-images/echopype:latest | Codespaces, Binder | dockerfile, directory |
| vast | VAST with R 4.3.3. ghcr.io/nmfs-opensci/container-images/vast:latest | Codespaces | dockerfile, directory |
| aomlomics-jh | Tourmaline is an amplicon sequence processing workflow for Illumina sequence data that uses QIIME 2 and the software packages it wraps. ghcr.io/nmfs-opensci/container-images/aomlomics-jh:latest | Codespaces | dockerfile, directory |
| Community | Images from other Docker Stacks | | |
| pangeo-notebook | Pangeo Notebook | Codespaces | directory |

Click on the image name in the table above for a current list of installed packages and versions

Design principles

The images are designed to be deployable “out of the box” from JupyterHubs, Codespaces, GitPod, Colab, Binder, and on your computer via Docker or Podman with no modification. See instructions below. Each will spin up Jupyter Lab (and Notebook), RStudio and VS Code with the specific development environment.

  • The Python environment follows the Pangeo images, with micromamba installed as the solver and with base and notebook environments. The Jupyter modules are installed in the notebook conda environment and images launch with the notebook environment activated, again following the Pangeo design structure. Images that use Pangeo as the base have user jovyan and user home directory /home/jovyan.
  • When an image contains both R and Python, the base image is a rocker image and adheres to the rocker norms for R and RStudio environment design. For the Python side of these images, micromamba is installed and the Pangeo conda environment structure is applied as in the Python-only images. RStudio uses the Python environment in the conda notebook environment when Python is used from within RStudio. The user is rstudio but the home directory is /home/jovyan, so the images play nicely with standard JupyterHub deployments with persistent storage.
  • These images are large. Use the original Jupyter, Pangeo or Rocker images if you are looking for lightweight data science images.

Why use a container?

The main reason is that geospatial, bioinformatics, and TMB/INLA environments can be hard to get working right. Using a Docker image means you use a stable environment. Watch this video from Yuvi Panda (Jupyter Project) and read about the Rocker Project in the R Project Journal article by Carl Boettiger and Dirk Eddelbuettel.

Related Docker Stacks

The motivation for the Docker Stack was the success of the NASA Openscapes “corn” image developed by Luis Lopez (NASA) and used in countless workshops on cloud computing with NASA Earth Data. Subsequently the NASA Openscapes mentor cloud-infrastructure Slack group met during weekly co-work sessions and plugged away at the problem of helping users ‘fledge’ off the Openscapes JupyterHub, which involved creating images that could be used outside of JupyterHubs, and updating the original “py-rocket” R image created by Luis. Carl Boettiger (UC Berkeley & Rocker Project) and Eli Holmes (NOAA Fisheries) took on different aspects of this work. The GitHub Action tooling is courtesy of Carl. “py-rocket-base” is derived from Carl’s “version 2.conda” of py-rocket. Eli further developed py-rocket into the form in this repo to bring it closer to the “corn” and Pangeo designs. Yuvi Panda (Jupyter, 2i2c) was instrumental in helping sort through so many mystery bugs. The Codespaces and devcontainer code is based on Michael Akridge’s Open Science Codespaces work. Individual images have different core developers: Tim Haverland (arcgis), Sunny Hospital (coastwatch), Luke Thompson (aomlomics-jh), Eli Holmes (the various py-rocket versions).

License information

All code used in the images is under open licenses. Some is copyleft, which means that if you modify their code (we don’t), you also need to provide your source code. The Dockerfile code is released under Apache 2.0, a very permissive open source license which does not require that you make your own modifications open. See the README.md files for the licenses for specific code used in the Docker files.


Disclaimer

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project content is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.


There are many ways to use Docker images. Here are common ones. Scroll to the bottom for instructions on linking your container to file systems (so you can get and store files).

To run images in a JupyterHub with ‘bring your own image’

If your JupyterHub has this option:

  • Click on the ‘Bring your own image’ radio button at the bottom
  • Paste in the URL to your image (or any other image)
  • You will find the URLs in the right nav bar under ‘Packages’
  • Example: ghcr.io/nmfs-opensci/jupyter-base-notebook:latest

Run with a JupyterHub

The images should work out of the box. Put the URL to the image wherever you would use images.

Run with docker

You can run the images on a Virtual Machine or your computer if you have Docker or Podman installed.

docker run -p 8888:8888 ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest

On a Mac with an Apple chip (M2+), turn on Rosetta emulation in the Docker Desktop settings and add --platform linux/amd64:

docker run --platform linux/amd64 -p 8888:8888 ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest

In the terminal, look for a URL like the following and open it in a browser.

http://127.0.0.1:8888/lab?token=6d45c7d88aba92a815647c
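If you launch the container from a script, the choice between the two docker run calls above can be automated. The following is a minimal Python sketch, not part of this repo: the helper name docker_run_command is hypothetical, and it assumes that platform.machine() reports arm64 (or aarch64) on Apple-chip machines, which may not hold if Python itself runs under emulation.

```python
import platform
import shlex

IMAGE = "ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest"


def docker_run_command(image=IMAGE, port=8888, machine=None):
    """Build the `docker run` argument list, adding --platform on ARM hosts.

    Hypothetical helper for illustration. `machine` defaults to the value
    reported by platform.machine() for the current host.
    """
    machine = (machine or platform.machine()).lower()
    args = ["docker", "run"]
    if machine in ("arm64", "aarch64"):
        # Assumption: ARM hosts need the amd64 platform flag shown above
        args += ["--platform", "linux/amd64"]
    args += ["-p", f"{port}:{port}", image]
    return args


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it
    print(shlex.join(docker_run_command()))
```

The function only assembles the command; pass the list to subprocess.run if you want the script to start the container itself.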

Running geospatial R Docker images and working with netCDF files

The GDAL netCDF driver needs some extra flags added to the docker run call to work correctly inside a Docker container. This doesn’t affect Python as much, since xarray works with netCDF via different drivers, but the terra netCDF functions use GDAL drivers under the hood to open netCDF files. Without the flags, you’ll get an error saying it can’t find files. We ran into trouble when accessing cloud-hosted netCDF files. Perhaps it works OK if you download the files.

Add this to the call:

--cap-add SYS_PTRACE --security-opt seccomp=unconfined

so your call will look like:

docker run -p 8888:8888 --cap-add SYS_PTRACE --security-opt seccomp=unconfined ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial:latest

Note that we had trouble getting this to work on a Mac with Apple chips. You can test whether it is going to work by running this Python code and checking whether DCAP_VIRTUALIO is listed:

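from osgeo import gdal

# Get the GDAL netCDF driver and print its capability metadata keys;
# look for DCAP_VIRTUALIO in the printed list
nc = gdal.GetDriverByName("netCDF")
print(sorted(nc.GetMetadata().keys()))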

Run with Binder

Create a file called Dockerfile and put it in the base of your GitHub repository or in a folder called binder or .binder. Into that file put the following line (replacing the image URL to match your desired image).

FROM ghcr.io/nmfs-opensci/container-images/py-rocket-geospatial:latest

Then go to https://mybinder.org and paste in the url to your GitHub repo or alternatively go to the following url directly:

https://mybinder.org/v2/gh/<username or org>/<reponame>
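If you generate Binder links for several repositories, the URL pattern above is easy to build in code. This is a small Python sketch with a hypothetical helper name, binder_url. The optional ref argument (a branch, tag or commit appended as a third path segment) is an assumption based on the common mybinder.org URL form and is not something this document specifies.

```python
from urllib.parse import quote


def binder_url(owner, repo, ref=None):
    """Return a mybinder.org launch URL for a GitHub repository.

    Hypothetical helper for illustration. `ref` is optional and, if given,
    is appended as an extra path segment (assumed URL form).
    """
    url = f"https://mybinder.org/v2/gh/{quote(owner)}/{quote(repo)}"
    if ref:
        url += f"/{quote(ref)}"
    return url


if __name__ == "__main__":
    print(binder_url("nmfs-opensci", "container-images"))
```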

With Codespaces

See the folders in the .devcontainer folder and create a .devcontainer/devcontainer.json file in your own repo by copying one of the devcontainer.json files. They all use the same template with just the top lines changed. Note that the folder .devcontainer/codespace is also required. If you change the line that starts up Jupyter Lab (at the bottom of the devcontainer.json file), do not use port 8888 or else RStudio will not launch.

The Codespaces code is based on: https://github.com/MichaelAkridge-NOAA/Open-Science-Codespaces

GitPod – like Codespaces

Work in progress. Approach is similar to Codespaces.

Run on Google Colab

TBD. This seems harder. See this issue


The container gives you a computing environment but, by design, it is isolated from the file system of the machine that is running it. So you will need a way to get your files in and out of the container and to save your work.

Upload/Download files

Under the Files menu in Jupyter Lab or the Files tab in RStudio, you can upload and download files.

Use a Git repository

Jupyter Lab and RStudio have Git GUIs. Use those or the command line to clone repos and push changes back to the repos.

cd ~
git clone <url to the repo>

Connect to a bucket

If you are working with large data sets, you do not want to move these into your container (slow, slow). You will want to create a bucket (like an S3 bucket) and connect to that. This is like having an external drive in the cloud.

Instructions to come.

Mount a file system

You can mount a local file system and read/write directly from that. Here “local” means the machine that is running the container. “local” might be a virtual machine, a server or your computer.

On a JupyterHub: The managers of the hub have most likely created persistent storage for you. If not, use Git, upload/download, or use buckets.

On your computer: you’ll add a flag to the docker run command to mount your local file system to the Docker container.

When you use --volume to bind-mount a file or directory, make sure the target path does not already exist in the Docker container. So do not bind to a directory like /usr, which would break the container (nothing bad happens; it just won’t work). Use something like /home/jovyan/mydir. --volume creates the endpoint for you, and it is always created as a directory.

In this example, myproject_files needs to exist in the directory where you are running docker run. If you get errors, try ls to make sure the directory is there.

docker run -p 8888:8888 --volume ./myproject_files:/home/jovyan/mydir ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest

As you work in mydir in the container, those changes will appear in your computer’s myproject_files directory. It is as if you are working on your own computer, but you are using the development environment of the Docker image.

Mac users with Apple chips, add --platform linux/amd64:

docker run --platform linux/amd64 -p 8888:8888 --volume ./myproject_files:/home/jovyan/mydir ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest
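A common failure with bind mounts is a missing host directory or a relative container path. The following Python sketch checks both before assembling the docker run call shown above. The helper name docker_run_with_mount is hypothetical, and the script is an illustration under these assumptions rather than tooling provided by this repo.

```python
import shlex
from pathlib import Path

IMAGE = "ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest"


def docker_run_with_mount(host_dir, container_dir="/home/jovyan/mydir",
                          image=IMAGE, port=8888):
    """Build a `docker run` argument list that bind-mounts host_dir.

    Hypothetical helper for illustration. Raises if the host directory is
    missing or the container path is not absolute.
    """
    host = Path(host_dir).resolve()
    if not host.is_dir():
        raise FileNotFoundError(f"Host directory not found: {host}")
    if not container_dir.startswith("/"):
        raise ValueError("container_dir must be an absolute path")
    return [
        "docker", "run",
        "-p", f"{port}:{port}",
        # host path on the left, mount point inside the container on the right
        "--volume", f"{host}:{container_dir}",
        image,
    ]


if __name__ == "__main__":
    try:
        print(shlex.join(docker_run_with_mount("./myproject_files")))
    except FileNotFoundError as err:
        print(err)
```

Resolving the host path to an absolute path means the command does not depend on the directory you run it from.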

Use py-rocket-base as a base image

Create a file called Dockerfile that starts with:

FROM ghcr.io/nmfs-opensci/container-images/py-rocket-base:latest

Then add more code to that file. See the images and draft_images folders for examples.

Use a GitHub Action to automatically build and push the image to ghcr.io (GitHub Packages, associated with every repo). This action is triggered whenever your Dockerfile changes on the main branch, or when you run it manually.

name: Docker Image CI

on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - 'Dockerfile'

jobs:
  build-and-push:
    uses: nmfs-opensci/container-images@main