Preprocessing module for s2spy #127
-
Thanks for the writeup! For whatever solution we eventually end up with, I think it will be important that:
In an esmvaltool recipe (but it could work just as well with another workflow package, e.g. snakemake?), this could look something like:

```yaml
preprocessors:
  sm1:
    selbox: &region
      lon_min: 10
      lon_max: 20
      lat_min: 30
      lat_max: 40
    calculate_anomalies:
      apply_fft: True
      detrend: True
  sm2:
    selbox: *region  # & and * are yaml anchors, so we don't have to repeat the selbox args
    detrend:
      method: Loess
```

Would it be possible to write all the preprocessing steps (variables, functions, arguments) that were needed for the Lorentz workshop in this format? That would be super helpful, as it shows exactly what we need to do, without clutter about the implementation.
-
Metaflow might be worth checking out: https://outerbounds.com/docs/intro-tutorial-season-2-overview/
-
Thanks for the overview @jannesvaningen. I have some comments regarding your discussion questions:
In our package, we already assume/require that the data should be given as xarray data. I think there is no package that could handle all the operations, but some of them can handle some of these preprocessing steps. As a starting point, we can begin with simple operations, like linear detrending, and try to reuse existing tools (e.g. scipy, esmvaltool). Only when an operation with a solid use case cannot be handled should we write code for it ourselves, just like the dimensionality module and RGDR. I am still exploring good candidates for these steps. I think later we can have a meeting and try to shortlist the preprocessing operations we would like to support initially and the tools that are potentially reusable.
That really depends on your purpose. Ideally, we want to make these preprocessing operations independent and let the user decide what to do next. The function should just work with time series, no matter when it is called.
I am not sure if we would like to support it at the beginning. I would prefer to let the user prepare the data in terms of the dimensions; we only care about the time dimension. Then the module doesn't need to know whether it is a precursor or a target time series, as Peter suggested.
The module should be flexible. It doesn't need to know when it is called. With a proper time axis, it should do the job, in my opinion.
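As a concrete illustration of that design, a preprocessing step could reuse scipy and assume nothing about the input except a "time" dimension (a minimal sketch; the function name and the use of xarray here are my assumptions, not the package's API):

```python
import scipy.signal
import xarray as xr


def detrend_linear(data: xr.DataArray) -> xr.DataArray:
    """Linearly detrend along "time", reusing scipy.

    Works for any input with a "time" dimension, so it does not need to
    know whether it is a precursor or a target time series.
    """
    axis = data.get_axis_num("time")
    detrended = scipy.signal.detrend(data.values, axis=axis)
    return data.copy(data=detrended)
```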
-
Besides, I couldn't agree more with the three bullet points that @Peter9192 made about the code design. I will continue exploring the packages and will include a few minimal examples in notebooks. I'm still a bit confused about the preprocessing we did during the Lorentz workshop. If I understand correctly, we linearly detrend the signals, calculate anomalies, then take the running mean, and the data is ready for ML and cross-validation. For the anomalies, I thought we subtract the mean of a certain time of year across years. Am I wrong?
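For reference, that interpretation of anomalies can be written in xarray as subtracting the day-of-year mean across years (a sketch assuming daily data in an xr.DataArray with a datetime "time" coordinate):

```python
import xarray as xr


def daily_anomalies(data: xr.DataArray) -> xr.DataArray:
    # Climatology: for each day of the year, the mean over all years.
    climatology = data.groupby("time.dayofyear").mean("time")
    # Anomalies: subtract that day-of-year mean from every time step.
    return data.groupby("time.dayofyear") - climatology
```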
-
Thanks for the comments @geek-yang! I think we can build on that. I'll write up a recipe like Peter suggested to see what we need for the Lorentz preprocessing.
CDS toolbox has a nice example: https://cds.climate.copernicus.eu/toolbox/doc/how-to/13_how_to_calculate_climatologies_and_anomalies/13_how_to_calculate_climatologies_and_anomalies.html
-
As @Peter9192 suggested, I have made a recipe of the preprocessing steps that were taken during the Lorentz workshop for the target and the precursor. Notes are included as inline comments:

```yaml
target:
  precipitation:
    selbox:
      format_lon: 'west_east'  # -180 to 180 longitude
      lon_min: 38
      lon_max: 52
      lat_min: 0
      lat_max: 11
    mask: True  # for instance a land/sea mask; this wasn't done during the
                # workshop, but we only want rainfall over land; takes input_data
    spatial_aggregation: True  # there should be an option to aggregate over one
                               # area or over multiple (predefined) areas; takes input_data
    time: &time
      start_year: 1980
      end_year: 2020
    smoothening:
      method: rolling_mean
      window: 25 days
    anomalies: True
    detrend:
      method: Linear
    time_aggregation: &time_aggregation
      frequency: 28
      type: days
    target_period:
      start_date: '01-10'  # dd-mm
      end_date: '31-12'  # dd-mm
      aggregate: True

preprocessors:
  sst:
    selbox:
      lon_min: -180
      lon_max: 180
      lat_min: -80
      lat_max: 80
    time: *time
    smoothening:
      method: rolling_mean
      window: 25 days
    anomalies: True
    detrend:
      method: Linear
    time_aggregation: *time_aggregation
```
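Such a recipe could eventually be consumed by a very small dispatcher that maps step names to functions. A minimal sketch of that idea (none of these names exist in s2spy, and the step implementations are stand-ins):

```python
import xarray as xr
import yaml  # pyyaml


# Stand-in steps; real versions would live in the preprocessing module.
def smoothening(data: xr.DataArray, method: str = "rolling_mean",
                window: str = "25 days") -> xr.DataArray:
    # Only rolling_mean is sketched; the window is read as a number of time steps.
    n = int(str(window).split()[0])
    return data.rolling(time=n, center=True).mean()


def anomalies(data: xr.DataArray) -> xr.DataArray:
    # Subtract the day-of-year climatology.
    return data.groupby("time.dayofyear") - data.groupby("time.dayofyear").mean("time")


STEPS = {"smoothening": smoothening, "anomalies": anomalies}


def apply_recipe(data: xr.DataArray, steps: dict) -> xr.DataArray:
    # Apply recipe entries in order: True means "run with defaults",
    # a mapping supplies keyword arguments for that step.
    for name, options in steps.items():
        kwargs = options if isinstance(options, dict) else {}
        data = STEPS[name](data, **kwargs)
    return data


# A fragment of the recipe above, applied to a DataArray `da`:
recipe = yaml.safe_load("""
smoothening:
  method: rolling_mean
  window: 25 days
anomalies: True
""")
# processed = apply_recipe(da, recipe)
```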
-
So far I have explored many packages. There is no tool that can perform all the operations we need. However, I found that these operations can be handled with xarray and scipy.
-
After playing with ESMValTool and xarray, and exploring some workflow tools, I have more thoughts about the preprocessing module. I think we need both ESMValTool and a built-in module in s2spy to deal with preprocessing in our workflow. For users who need to scale their workflow up to CMIP level, we would point them to ESMValTool for data handling and preprocessing before calling s2spy. There are some advantages to depending on ESMValTool:
The downside of relying on ESMValTool is that it is a bit heavy. But users who would like to work with CMIP models should be prepared for heavy jobs, so that should be ok. The only question is how we include it in our workflow. I would prefer to have some example notebooks which show the whole lifespan of an experiment with ESMValTool and s2spy, from data preparation to ML. For those who come with their own small datasets, we also need to facilitate them with some lightweight preprocessing modules. I explored workflow tools in Python, but to me they seem a bit over-engineered for our task. Most of these tools are designed to deal with asynchronous multi-task execution on distributed systems (similar to k8s). I think what we need here is just something super lightweight that helps users with basic preprocessing operations in an organized way. I tried to come up with something simpler using xarray and scipy; I think by creating a preprocessor class we could easily achieve this. Here is some code below for discussion. The advantages of having such a preprocessing module are:
The downside is that users get a bit constrained. I'm not so sure about the design of this module, but I think we all agree that it should not make s2spy heavier.
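A minimal sketch of what such a preprocessor class could look like, assuming xr.DataArray input with a "time" dimension (all names here are illustrative, not a final design):

```python
import scipy.signal
import xarray as xr


class Preprocessor:
    """Lightweight, chainable preprocessing for time series data."""

    def __init__(self, data: xr.DataArray):
        self.data = data

    def detrend(self) -> "Preprocessor":
        # Linear scipy detrending along the time axis (fast).
        axis = self.data.get_axis_num("time")
        detrended = scipy.signal.detrend(self.data.values, axis=axis)
        self.data = self.data.copy(data=detrended)
        return self

    def anomalies(self) -> "Preprocessor":
        # Subtract the day-of-year climatology (mean across years).
        climatology = self.data.groupby("time.dayofyear").mean("time")
        self.data = self.data.groupby("time.dayofyear") - climatology
        return self

    def rolling_mean(self, window: int = 25) -> "Preprocessor":
        # Smooth with a centered rolling mean over `window` time steps.
        self.data = self.data.rolling(time=window, center=True).mean()
        return self


# Steps stay independent and can be chained in any order, e.g.:
# processed = Preprocessor(data).detrend().anomalies().rolling_mean(25).data
```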
-
Intro
To proceed with the preprocessing module for s2spy, we need to know what functionality is inside the legacy code, so we can determine what we need. I will give an overview here of an example used at the Lorentz workshop, explaining the steps taken in the legacy code and the arguments that you can pass to the functions.
Notebook example
link
Preprocessing
Preprocessing is done separately for the precursors and for the target variable.
Precursors
First, we preprocess the precursors. In this case, we want detrending and anomalies. In this example we only have one precursor, but usually you have more than one. Not all variables require the same preprocessing. In the legacy code you can change this like so:
The structure of the calls is like this:
Possible arguments of rg.pp_precursors (see the example call after this list):
- Subselect data. Inadvisable for daily data, due to the rolling mean needed for robust calculation of the climatological mean. The default is None.
- selbox has the format (lon_min, lon_max, lat_min, lat_max). The default is None.
- String referring to the format of longitude. If 'only_east', longitude ranges from 0 to 360; if 'west_east', it ranges from -180 to 180. The default is 'only_east'.
- If True: auto-detect a mask if a field has a lot of the exact same value (e.g. -9999). The default is False.
- If True: linear scipy detrending (fast), see sp.signal.detrend docs. With a dict, {'method': 'loess'}, loess detrending can be called (slow). Extra loess arguments can be passed as well; see core_pp.detrend_wrapper?. The default is True.
- Remove climatology. For daily data, the climatology is calculated by first applying a 25-day rolling mean (if apply_fft==True) and subsequently fitting the first 6 harmonics to the rolling-mean climatology. For monthly data, the climatology is calculated on the raw data. The default is True.
- Apply a Fast Fourier Transform to fit the first 6 harmonics to the rolling-mean climatology. See the anomaly argument above.
- Encoding for writing the post-processed netcdf; could save memory, e.g. {"dtype": "int16", "scale_factor": 1E-4}. The default is {}.
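Putting the list together, a call in the legacy code might look something like the sketch below; the keyword names are taken from the descriptions above and from the recipe, but the exact signature should be checked against the legacy docs:

```python
# Hedged sketch of a legacy call: `rg` is the analysis object from the
# notebook example; all argument values here are illustrative.
rg.pp_precursors(
    selbox=(-180, 180, -80, 80),  # (lon_min, lon_max, lat_min, lat_max)
    format_lon="west_east",       # longitude from -180 to 180
    detrend=True,                 # linear scipy detrending (fast)
    anomaly=True,                 # remove climatology
    apply_fft=True,               # fit first 6 harmonics to the rolling-mean climatology
)
```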
Target
Then, we preprocess the target variable. The procedure is roughly the same as for the precursors, although here the full pre-processed time series is stored on the class, not as a separate file. There are two other keyword arguments:
- If tfreq is None and the target variable contains one value per year, the target is extended to match the precursor time axis. If the precursors are monthly means, then the target is also extended to monthly values, else daily values. Both are then aggregated to {tfreq}-day/monthly means. The default is True.
- If True, set rg.tfreq to None. The target variable will be aggregated to a single-value-per-year "period mean". start_end_TVdate defines the period to aggregate over.
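To illustrate the first keyword argument, extending a one-value-per-year target to a precursor time axis can be done with a forward-filling reindex in xarray (a self-contained sketch, not the legacy implementation):

```python
import numpy as np
import pandas as pd
import xarray as xr

# One value per year (e.g. a seasonal "period mean" target).
annual = xr.DataArray(
    np.arange(3.0),
    coords={"time": pd.to_datetime(["2000-01-01", "2001-01-01", "2002-01-01"])},
    dims="time",
)

# Monthly precursor time axis to match.
monthly_time = pd.date_range("2000-01-01", "2002-12-01", freq="MS")

# Repeat each annual value across all time steps of that year.
extended = annual.reindex(time=monthly_time, method="ffill")
```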
Order of function calls:
Discussion
Some discussion questions: