SantanderMetGroup/datatools

smgdatatools stands for data tools from SantanderMetGroup.

etl.py

Generate virtual datasets for climate data. Supports Kerchunk, HDF5 VDS and NcML.

Kerchunk

ERA5 from Amazon S3

See a description of the dataset here.

echo 'https://s3.amazonaws.com/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc
https://s3.amazonaws.com/era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc
https://s3.amazonaws.com/era5-pds/2020/01/data/sea_surface_temperature.nc
https://s3.amazonaws.com/era5-pds/2020/02/data/sea_surface_temperature.nc' | \
etl.py --db test.sqlite --collector hdf5chunk --hdf5-driver ros3 --aggregations air_pressure_at_mean_sea_level sea_surface_temperature --etl jinja -t era5-s3.json.j2 --dest test.json

Before using test.json, remove its last comma; otherwise the file is not valid JSON.
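The manual comma fix can also be scripted. The following is a minimal sketch, not part of etl.py: the helper name `strip_last_trailing_comma` is hypothetical, and it assumes the only problem in the file is a single comma followed by nothing but whitespace and closing brackets or braces.

```python
import json
import re


def strip_last_trailing_comma(text: str) -> str:
    """Remove a comma that is followed only by whitespace and closing
    brackets/braces up to the end of the text (hypothetical helper)."""
    # The lookahead only matches when nothing but whitespace, "]" or "}"
    # remains after the comma, so commas between items are left alone.
    return re.sub(r",(?=[\s\]}]*\Z)", "", text, count=1)


# Usage sketch (assumes test.json was produced by the command above):
#
#   with open("test.json") as f:
#       fixed = strip_last_trailing_comma(f.read())
#   json.loads(fixed)  # raises ValueError if the result is still invalid
#   with open("test.json", "w") as f:
#       f.write(fixed)
```

Validating with `json.loads` before overwriting the file guards against writing back something that is still broken.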

import xarray

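# Open the Kerchunk references through fsspec's "reference://" filesystem;
# the referenced chunks are read anonymously from S3.
storage_options = {"fo": "test.json", "remote_protocol": "s3", "remote_options": {"anon": True}}
ds = xarray.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": storage_options},
)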
print(ds)

CMIP6 from Pangeo and Google Cloud

See a description of the dataset here.

echo 'gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r2i1p1f1/Amon/tas/gn/v20200226
gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/tas/gn/v20191120
gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r2i1p1f1/Amon/pr/gn/v20200226
gs://cmip6/CMIP6/CMIP/NCAR/CESM2-FV2/historical/r1i1p1f1/Amon/pr/gn/v20191120' | \
etl.py --db test.sqlite --collector zarr --aggregations tas pr --etl jinja -t gcs-cmip6.json.j2 --dest test.json

As with the ERA5 example, remove the last comma from test.json before opening it.

import xarray

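# Open the Kerchunk references through fsspec's "reference://" filesystem;
# the referenced chunks are read anonymously from Google Cloud Storage.
storage_options = {"fo": "test.json", "remote_protocol": "gs", "remote_options": {"anon": True}}
ds = xarray.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": storage_options},
)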
print(ds)

Be careful with the following:

The number of chunks may differ between ensemble members for the same variable. Check this against the SQLite database, replacing VARIABLE_NAME with a quoted name such as 'tas' (e.g. select count(*) from variable inner join chunk on variable.id = chunk.variable_id where variable.name = 'VARIABLE_NAME' group by variable.id).
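The same check can be run from Python with the standard sqlite3 module. This is a sketch only: the table and column names (variable, chunk, variable.id, chunk.variable_id, variable.name) come from the query above, while the function name `chunk_counts` and the added `order by` clause are assumptions made for illustration.

```python
import sqlite3


def chunk_counts(conn: sqlite3.Connection, variable_name: str):
    """Return (variable_id, number_of_chunks) for every variable row
    with the given name (hypothetical helper)."""
    query = """
        select variable.id, count(*)
        from variable
        inner join chunk on variable.id = chunk.variable_id
        where variable.name = ?
        group by variable.id
        order by variable.id
    """
    # The "?" placeholder lets sqlite3 quote the variable name safely.
    return conn.execute(query, (variable_name,)).fetchall()


# Usage sketch (assumes test.sqlite was written by etl.py --db test.sqlite):
#
#   conn = sqlite3.connect("test.sqlite")
#   counts = chunk_counts(conn, "tas")
#   if len({n for _, n in counts}) > 1:
#       print("chunk counts differ between members:", counts)
```

If the set of counts has more than one element, the ensemble members do not share the same number of chunks for that variable.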

HDF5 Virtual Dataset

find test/data -maxdepth 1 -type f -name '*.nc' | grep -v 'fx' | etl.py --db test.sqlite --collector nc --aggregations tas pr --etl new-common --dest test.h5 --coord-name variant_label --coord-values-attr variant_label

Open the virtual dataset with xarray:

import xarray

ds = xarray.open_dataset("test.h5")
ds[["tas", "pr"]].mean()

NcML

find test/data -maxdepth 1 -type f -name '*.nc' | grep -v 'fx' | etl.py --db test.sqlite --collector nc --aggregations tas pr --etl jinja -t time-ensemble.ncml.j2 --dest test.ncml

Open the generated XML file with your favourite editor. You may also use ToolsUI or climate4R.
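For a quick look without an editor, the generated file can also be inspected with Python's standard XML parser. The sketch below assumes that the template emits ordinary NcML `netcdf` elements carrying a `location` attribute; the actual output of time-ensemble.ncml.j2 may differ, and the function name `ncml_locations` is hypothetical.

```python
import xml.etree.ElementTree as ET


def ncml_locations(xml_text):
    """Return the `location` attribute of every <netcdf> element,
    in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    locations = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the check works with or
        # without the NcML namespace declaration.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "netcdf" and "location" in element.attrib:
            locations.append(element.attrib["location"])
    return locations


# Usage sketch (assumes test.ncml was produced by the command above):
#
#   from pathlib import Path
#   for location in ncml_locations(Path("test.ncml").read_text()):
#       print(location)
```

Comparing the printed locations with the output of the find command is a simple way to confirm that every input file ended up in the aggregation.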
