Skip to content

Commit

Permalink
Merge pull request #12 from NoaMillsUSDA-ARS/pygcdl
Browse files Browse the repository at this point in the history
Added pygcdl tutorial
  • Loading branch information
aedawid authored Aug 14, 2024
2 parents 9132430 + 0b47739 commit aa16efe
Show file tree
Hide file tree
Showing 3 changed files with 999 additions and 1 deletion.
2 changes: 1 addition & 1 deletion ExampleGeoWorkflows/GRWGWorkshop.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ Materials modified from the [USDA-ARS SCINet Geospatial Research Working Group W

| Tutorial | Python | R |
|:--|:--|
| SCINet Geospatial Common Data Library (GeoCDL) | | [view](GRWG22_GeoCDL_R) [download](../tutorials/GRWG22_GeoCDL.Rmd) |
| SCINet Geospatial Common Data Library (GeoCDL) | [view](pygcdl_tutorial) [download](../tutorials/pygcdl_tutorial.ipynb) | [view](GRWG22_GeoCDL_R) [download](../tutorials/GRWG22_GeoCDL.Rmd) |
| Handling Vector Data | [view](GRWG22_VectorData_python) [download](../tutorials/GRWG22_VectorData.ipynb) | [view](GRWG22_VectorData_R) [download](../tutorials/GRWG22_VectorData.Rmd) |
| Raster Calculations with Tiles | [view](GRWG22_RasterTiles_python) [download](../tutorials/GRWG22_RasterTiles.ipynb) | [view](GRWG22_RasterTiles_R) [download](../tutorials/GRWG22_RasterTiles.Rmd) |
| SLURM Job Arrays for Many Data Input Files | [view](GRWG22_ZonalStats_wSLURM_python) [download](../tutorials/GRWG22_ZonalStats_wSLURM.ipynb) | [view](GRWG22_ZonalStats_wSLURM_R) [download](../tutorials/GRWG22_ZonalStats_wSLURM.Rmd) |
Expand Down
358 changes: 358 additions & 0 deletions ExampleGeoWorkflows/pygcdl_tutorial.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,358 @@
---
title: Geospatial Common Data Library (GeoCDL)
layout: single
author: Noa Mills
author_profile: false
header:
overlay_color: "444444"
overlay_image: /assets/images/margaret-weir-GZyjbLNOaFg-unsplash_dark.jpg
---

**Last Update:** 29 May 2024 <br />
**Download Jupyter Notebook**: [GRWG22_GeoCDL.Rmd](https://geospatial.101workbook.org/tutorials/pygcdl_tutorial.ipynb)

# Python Package for USDA-ARS SCINet GeoCDL

## Overview
This tutorial covers the python package `pygcdl` for the SCINet Geospatial Common
Data Library (GeoCDL), a community project from the
[Geospatial Research Working Group](https://scinet.usda.gov/working-groups/geospatial)
to reduce the time and effort to access commonly used geospatial datasets. This
tutorial is based on the vignette for the R counterpart of pygcdl, rgeocdl. We have
collected several large gridded data products to store on Ceres and created a
REST API for SCINet users to request the spatiotemporal subsets of those data
that they need. The geospatial processing to create those subsets executes on
Ceres and a service node has been setup to serve the API.

This python package is a user-friendly interface to pass along user requests to the
core GeoCDL API from the compute nodes. That is, the python package does not perform
the geospatial processing itself. A major benefit of using this python package is that
it was designed to integrate into a user's geospatial data processing workflow in
Python. For example, a user storing their study area boundary definition as a
`geopandas.GeoDataFrame` object can pass along that object to this package's
functions and the package will do the necessary formatting of the data to make it
compatible with GeoCDL.

The workflows we will cover include uploading a GeoJSON of an LTAR site,
requesting data from two datasets clipped to the boundary of the LTAR site
with equivalent resolutions and CRSs, and visualizing the resulting
maps. We also show how to extract point level information from a gridded layer.

This tutorial assumes you are running this notebook in JupyterLab on
Ceres. The easiest way to do that is with
[Open OnDemand](http://ceres-ood.scinet.usda.gov/). As of this writing, the GeoCDL
is only available on Ceres and not Atlas.

If you have any questions, problems, or requests related to the python interface, please
use the issue tracker on our GitHub repository:
[https://github.com/USDA-SCINet/pygcdl](https://github.com/USDA-SCINet/pygcdl).

## Nomenclature

* <b>GeoCDL:</b> Geospatial Common Data Library, a collection of commonly used raster
datasets accessible from an API running on SCINet's Ceres cluster
* <b>Raster:</b> A geospatial datatype where data is stored as a grid of regularly sized pixels. Geospatial rasters contain geospatial metadata, which maps each pixel of the raster to a geospatial location on the Earth's surface. Examples of geospatial raster file types include: geotiff (.tif), and netCDF (.nc).
* <b>Vector:</b> A geospatial datatype where data is stored as a collection of points, lines, or polygons. Each coordinate maps to a location on Earth's surface. Examples of geospatial vector file types include: geojson (.geojson), and shapefiles (.shp).
* <b>CRS:</b> Coordinate Reference System, also known as a spatial reference system. A
system for defining geospatial coordinates.

## Data Details

###### Dataset: MODIS NDVI
* Link: [https://doi.org/10.3334/ORNLDAAC/1299](https://doi.org/10.3334/ORNLDAAC/1299)
* Details: This data set provides Moderate Resolution Imaging Spectroradiometer
(MODIS) normalized difference vegetation index (NDVI) data, smoothed and gap-filled,
for the conterminous US for the period 2000-01-01 through 2015-12-31. The data
were generated using the NASA Stennis Time Series Product Tool (TSPT) to generate
NDVI data streams from the Terra satellite (MODIS MOD13Q1 product) and Aqua
satellite (MODIS MYD13Q1 product) instruments. TSPT produces NDVI data that
are less affected by clouds and bad pixels.

###### Dataset: PRISM
* Link: [https://prism.oregonstate.edu/](https://prism.oregonstate.edu/)
* Details: The PRISM Climate Group gathers climate observations from a
wide range of monitoring networks, applies sophisticated quality control
measures, and develops spatial climate datasets to reveal short- and long-term
climate patterns. The resulting datasets incorporate a variety of modeling
techniques and are available at multiple spatial/temporal resolutions, covering
the period from 1895 to the present.


## Primary Libraries

| Name | Description | Link |
|:--|:--|:--|
| pygcdl | Python interface for SCINet GeoCDL API | https://github.com/USDA-SCINet/pygcdl |
| geopandas | Geospatial vector data for python | https://geopandas.org/en/stable/ |
| rioxarray | Geospatial raster data for python | https://corteva.github.io/rioxarray/stable/ |


### Tutorial Steps:
0. Import Libraries
1. Specify area and dates of interest
2. Select datasets and their variables
3. Download the data
4. Visualize the results


## Part 0: Import Libraries


```python
# Import the necessary packages.
import pygcdl
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import requests as r
import rioxarray
import xarray as xr
import matplotlib.pyplot as plt
import math
```


```python
# Create the pygcdl object in order to interact with the pygcdl package.
pygcdl_obj = pygcdl.PyGeoCDL()
```

## Part 1: Specify area and dates of interest

Here, we specify the spatial extent of our requests. We can request either polygon-based or point-based subsets. When we request a subset, we can specify the spatial extent in one of these three ways:

- <b>GUID: </b> Users can use the `upload_geometry()` function to upload a file or `geopandas` GeoDataFrame object, receive a Geometry Upload Identifier (GUID), and use that GUID for subsequent data requests.
- <b>Clip: </b> Users can specify the coordinates of a bounding box (polygon data only).
- <b>GeoDataFrame: </b> Users can build a `geopandas.GeoDataFrame` object and use it in requests directly, without uploading it in advance.

For this tutorial, we will use the `upload_geometry()` function to upload a shapefile containing polygon data that represents the Jornada Experimental Range in southern New Mexico, and use the GUID generated.


```python
# First, download the GeoJSON from AgCROS
url = "https://services1.arcgis.com/SyUSN23vOoYdfLC8/arcgis/rest/services/LTAR_Base/FeatureServer/1/query?where=acronym='JER'&f=pgeojson"
response = r.get(url)
filename = "jer_bounds_sf.geojson"

# Save the file locally.
with open(filename, mode="wb") as file:
file.write(response.content)
```

The file `jer_bounds_sf.geojson` that we just downloaded is in EPSG:4326, which is the default for geojson files. Say we wish to download raster data through pygcdl in the CRS EPSG:32613 because EPSG:32613 is a projected CRS that is applicable to our area of interest. (You can learn more about EPSG:32613 [here](https://epsg.io/32613)). In this next block, we read our polygon data into a `geopandas.GeoDataFrame` object, and then transform our `geopandas.GeoDataFrame` object into our desired CRS.


```python
# Load file into geopandas.
bounds = gpd.read_file(filename)
# Transform geopandas dataframe to the desired CRS.
bounds = bounds.to_crs("EPSG:32613")
```


```python
# Visualize the downloaded boundary.
bounds.boundary.plot()
```

We can see from the plotted map that the site is an irregular shape. For cases like this where the geometry is defined by many points, it is easiest to provide GeoCDL with a file containing the geometry definition, instead of uploading clipping coordinates. We can upload this geodataframe to GeoCDL using the `upload_geometry` function which returns a unique geometry upload identifier (GUID) that we will use later in our subset request. This stand-alone upload step is optional, but recommended if you are likely to submit multiple subset requests with the same geometry so that it is uploaded just once. You could alternatively use the `geopandas` GeoDataFrame object directly in requests instead of using a GUID, but if we upload the geometry file, then we can re-use the same GUID in subsequent requests.

Here, we upload our file by calling `upload_geometry()` on our `pygcdl_obj` object.


```python
guid = pygcdl_obj.upload_geometry(bounds)
print(guid)
```

To finish the spatial component of our subset request, we will define our target spatial resolution and a resampling method. By indicating a target spatial resolution along with our geometry, we are telling GeoCDL that we want a spatially-harmonized 'datacube', which means that each requested data layer has the same CRS, spatial resolution, and spatial extent.

Unless we specify otherwise, the target CRS in this case is that of our geometry. Resampling is the process by which the GeoCDL calculates pixel values when the cell grid changes, like when we change the resolution or CRS. This calculation is performed by `rasterio`, and you can find the full list of reprojection methods [here](https://rasterio.readthedocs.io/en/stable/api/rasterio.enums.html#rasterio.enums.Resampling). The default resampling method is to take the nearest pixel's value. Here, we choose the "bilinear" method, which you can learn more about [here](https://gisgeography.com/bilinear-interpolation-resampling/).

Our CRS, EPSG:32613, is in units of meters. Therefore, our spatial resolution is in units of meters. Here, we specify that we want each pixel to represent an area of 1000 by 1000 meters.


```python
spat_res = 1000 # in units of meters
resample_method = "bilinear"
```

Next, we specify our temporal data. The GeoCDL accepts multiple temporal range formats so that many different user
needs can be met. In this example, we are interested in July-August 2008. One way to specify that is with the years and months together as `dates='2008-07:2008-08'` or separately as below. By specifying years and months, we are letting GeoCDL know that we are interested in monthly data. If we only specify years, then it will infer we want annual data and if we also specifies days, then it will infer we want daily data. If the inferred date grain is supported by a dataset, then we use that grain for that dataset. If that date grain is not supported by a dataset, then GCDL uses the "grain method" variable, if set, to determine what other grains the user is willing accept. For example, here we specify "finer" to indicate that if monthly data is not available, we are also willing to accept daily data. Other options for grain methods include "strict", "skip", "any", and "coarser".



```python
years = "2008"
months = "7:8"
grain_method = "finer"
```

## Part 2: Select datasets and their variables


We can use the `list_datasets()` function to list all of the datasets that are available in the Geospatial Common Data Library.


```python
pygcdl_obj.list_datasets()
```

We can use the `get_dataset_info()` function to learn more about one specific dataset.


```python
# Get information about the MODIS_NDVI dataset.
pygcdl_obj.get_dataset_info("MODIS_NDVI")
```


```python
# Get information about the PRISM dataset.
pygcdl_obj.get_dataset_info("PRISM")
```

Next, we will specify the datasets and variables we wish to use with a `pandas` dataframe. Here we specify that we want the 'ppt' variable from the PRISM dataset, and the 'NDVI' variable from the MODIS_NDVI dataset. We can format our dataset and variable specifications as a `pandas` dataframe, a numpy array, a matrix, or a dictionary. Here, we format our specifications as a `pandas` dataframe.


```python
dsvars = pd.DataFrame(
[["PRISM", "ppt"], ["MODIS_NDVI", "NDVI"]],
columns=["dataset", "variable"])
print(dsvars)
```

## Part 3: Download the data

First, we create a directory where we would like our data to download to.


```python
output_path = Path("output")
if not output_path.is_dir():
output_path.mkdir()
```

Up until now, we have been primarily saving our request specifications as variables. We will now pass each of those variables to the GeoCDL and download our subset using the `download_polygon_subset` function. GeoCDL returns a zipped folder of results and `pygcdl` unzips it for you. `download_polygon_subset` returns the filenames in that folder. We have here two monthly PRISM layers, seven daily MODIS NDVI layers (every 8 days), plus a metadata file that lists metadata related to the geospatial datasets as well as the GeoCDL request itself. The raster layer files are GeoTIFFs with the dataset, variable, and date indicated in the filename.


```python
subset_files = pygcdl_obj.download_polygon_subset(
dsvars=dsvars,
years=years,
months=months,
grain_method=grain_method,
t_geom=guid,
resolution=spat_res,
dsn=output_path,
)
```


```python
subset_files
```

## Part 4: Visualize the results

Our use of `pygcdl` for this example is complete, but we can visualize the data that we downloaded. We can see that in our site, NDVI increased over July-August in 2008 but by different degrees within the site.

#### MODIS_NDVI Visualization


```python
# Create a list of the NDVI GeoTIFFs.
modis_tifs = [k for k in subset_files if k.endswith('.tif') and "NDVI" in k]
print(modis_tifs)
```


```python
# Create a list of rioxarray datasets and count the images. Use the
# 'mask_and_scale=True' option to read nodata values as nan and to apply the
# scale factor.
modis_image_stack = [
rioxarray.open_rasterio(k, mask_and_scale=True) for k in modis_tifs
]
num_images = len(modis_image_stack)
print(num_images)
```


```python
# Visualize the range of values of NDVI data in the first image.
modis_image_stack[0].plot.hist()
```


```python
# Extract the dates from the filenames and use them to title the subplots.
ndvi_layer_names = [
string.split("NDVI_NDVI_")[1].split(".tif")[0] for string in modis_tifs
]

# Plot the data.
fig, axs = plt.subplots(nrows=2, ncols=4, figsize = (18, 10), sharey=True)
fig.suptitle("NDVI Values in Jornada Experimental Range, July-August 2028", fontsize=25)
plt.subplots_adjust(hspace=0.4)
for i in range(num_images):
im = modis_image_stack[i].plot(ax=axs[math.floor(i/4), i%4], add_colorbar=False)
axs[math.floor(i/4), i%4].set_title(ndvi_layer_names[i], fontsize=20)
axs[math.floor(i/4), i%4].tick_params(labelrotation=45)
axs[1,3].axis("off")
label_axis = fig.add_axes([0.79, 0.12, 0.05, 0.3])
cbar = fig.colorbar(im, cax=label_axis)
cbar.set_label(label="NDVI", fontsize=20)
```

#### PRISM Visualizaiton


```python
# Create a list of PRISM GeoTIFFs.
prism_tifs = [k for k in subset_files if ".tif" in k and "PRISM" in k]
print(prism_tifs)
```


```python
# Create a list of rioxarray datasets and count the images. Use the
# 'masked=True' option to read nodata values as nan. Otherwise, nodata values
# for this data are read in as -9999
prism_image_stack = [
rioxarray.open_rasterio(k, masked=True) for k in prism_tifs
]
num_images = len(prism_image_stack)
print(num_images)
```


```python
print("Min", float(prism_image_stack[0].min()))
print("Max", float(prism_image_stack[0].max()))
print("Mean", float(prism_image_stack[0].mean()))
```


```python
prism_image_stack[0].plot.hist()
```


```python
# Extract the dates from the filenames and use them to title the subplots.
prism_layer_names = [
string.split("PRISM_ppt_")[1].split(".tif")[0] for string in prism_tifs
]

# Plot the data.
fig, axs = plt.subplots(ncols=2, figsize = (14, 8), sharey=True)
fig.suptitle("PRISM Values in Jornada Experimental Range, July-August 2028", fontsize=25)
plt.subplots_adjust(hspace=0.4)
for i in range(num_images):
im = prism_image_stack[i].plot(ax=axs[i], add_colorbar=False)
axs[i].set_title(ndvi_layer_names[i], fontsize=20)
axs[i].tick_params(labelrotation=45)
fig.colorbar(im)
```
Loading

0 comments on commit aa16efe

Please sign in to comment.