Preprocessing module for s2spy #127
-
Thanks for the writeup! For whatever solution we eventually end up with, I think it will be important that:
In an esmvaltool recipe (but it could work just as well with another workflow package, e.g. snakemake?), this could look something like:

```yaml
preprocessors:
  sm1:
    selbox: &region
      lon_min: 10
      lon_max: 20
      lat_min: 30
      lat_max: 40
    calculate_anomalies:
      apply_fft: True
      detrend: True
  sm2:
    selbox: *region  # & and * are yaml anchors, so we don't have to repeat the selbox args
    detrend:
      method: Loess
```

Would it be possible to write all the preprocessing steps (variables, functions, arguments) that were needed for the Lorentz workshop in this format? That would be super helpful, as it shows exactly what we need to do, without clutter about the implementation.
-
Metaflow might be worth checking out: https://outerbounds.com/docs/intro-tutorial-season-2-overview/
-
Thanks for the overview @jannesvaningen. I have some comments regarding your discussion questions:
In our package, we already assume/require that the data should be given as xarray data. I think there is no package that could handle all the operations, but some of them can handle some of these preprocessing steps. As a starting point, we can begin with simple operations, like linear detrending, and try to reuse existing tools (e.g. scipy, esmvaltool). Only when an operation with a solid use case cannot be handled should we write code for it ourselves, just like the dimensionality module and RGDR. I am still exploring good candidates for these steps. I think later we can have a meeting and try to shortlist the preprocessing operations we would like to support initially and the tools that are potentially reusable.
That really depends on your purpose. Ideally, we want to make these preprocessing operations independent and let the user decide what to do next. The function should just work with time series, no matter when it is called.
I am not sure if we would like to support it at the beginning. I would prefer to let the user prepare the data in terms of the dimensions; we only care about the time dimension. Then the module doesn't need to know whether it is a precursor or a target time series, as Peter suggested.
The module should be flexible. It doesn't need to know when it is called. With a proper time axis, it should do the job, in my opinion.
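As a concrete illustration of that design, a preprocessing step could reuse scipy and assume nothing about the input except a "time" dimension (a minimal sketch; the function name and the use of xarray here are my assumptions, not the package's API):

```python
import scipy.signal
import xarray as xr


def detrend_linear(data: xr.DataArray) -> xr.DataArray:
    """Linearly detrend along "time", reusing scipy.

    Works for any input with a "time" dimension, so it does not need to
    know whether it is a precursor or a target time series.
    """
    axis = data.get_axis_num("time")
    detrended = scipy.signal.detrend(data.values, axis=axis)
    return data.copy(data=detrended)
```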
-
Besides, I couldn't agree more with the three bullet points that @Peter9192 made about the code design. I will continue exploring the packages and will include a few minimal examples in notebooks. I'm still a bit confused about the preprocessing we did during the Lorentz workshop. If I understand correctly, we linearly detrend the signals, calculate anomalies, then take the running mean, and the data is ready for ML and cross-validation. For the anomalies, I thought we subtract the mean of a certain time of year across years. Am I wrong?
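For reference, that interpretation of anomalies can be written in xarray as subtracting the day-of-year mean across years (a sketch assuming daily data in an xr.DataArray with a datetime "time" coordinate):

```python
import xarray as xr


def daily_anomalies(data: xr.DataArray) -> xr.DataArray:
    # Climatology: for each day of the year, the mean over all years.
    climatology = data.groupby("time.dayofyear").mean("time")
    # Anomalies: subtract that day-of-year mean from every time step.
    return data.groupby("time.dayofyear") - climatology
```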
-
Thanks for the comments @geek-yang! I think we can build on that. I'll write up a recipe like Peter suggested to see what we need for the Lorentz preprocessing.
CDS toolbox has a nice example: https://cds.climate.copernicus.eu/toolbox/doc/how-to/13_how_to_calculate_climatologies_and_anomalies/13_how_to_calculate_climatologies_and_anomalies.html
-
As @Peter9192 suggested, I have made a recipe of the preprocessing steps that were taken during the Lorentz workshop for the target and the precursor. Notes are included as inline comments:

```yaml
target:
  precipitation:
    selbox:
      format_lon: 'west_east'  # -180 to 180 longitude
      lon_min: 38
      lon_max: 52
      lat_min: 0
      lat_max: 11
    mask: True  # for instance a land/sea mask; this wasn't done during the
                # workshop, but we only want rainfall over land; takes input_data
    spatial_aggregation: True  # there should be an option to aggregate over one
                               # area or over multiple (predefined) areas; takes input_data
    time: &time
      start_year: 1980
      end_year: 2020
    smoothening:
      method: rolling_mean
      window: 25 days
    anomalies: True
    detrend:
      method: Linear
    time_aggregation: &time_aggregation
      frequency: 28
      type: days
    target_period:
      start_date: '01-10'  # dd-mm
      end_date: '31-12'  # dd-mm
      aggregate: True

preprocessors:
  sst:
    selbox:
      lon_min: -180
      lon_max: 180
      lat_min: -80
      lat_max: 80
    time: *time
    smoothening:
      method: rolling_mean
      window: 25 days
    anomalies: True
    detrend:
      method: Linear
    time_aggregation: *time_aggregation
```
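Such a recipe could eventually be consumed by a very small dispatcher that maps step names to functions. A minimal sketch of that idea (none of these names exist in s2spy, and the step implementations are stand-ins):

```python
import xarray as xr
import yaml  # pyyaml


# Stand-in steps; real versions would live in the preprocessing module.
def smoothening(data: xr.DataArray, method: str = "rolling_mean",
                window: str = "25 days") -> xr.DataArray:
    # Only rolling_mean is sketched; the window is read as a number of time steps.
    n = int(str(window).split()[0])
    return data.rolling(time=n, center=True).mean()


def anomalies(data: xr.DataArray) -> xr.DataArray:
    # Subtract the day-of-year climatology.
    return data.groupby("time.dayofyear") - data.groupby("time.dayofyear").mean("time")


STEPS = {"smoothening": smoothening, "anomalies": anomalies}


def apply_recipe(data: xr.DataArray, steps: dict) -> xr.DataArray:
    # Apply recipe entries in order: True means "run with defaults",
    # a mapping supplies keyword arguments for that step.
    for name, options in steps.items():
        kwargs = options if isinstance(options, dict) else {}
        data = STEPS[name](data, **kwargs)
    return data


# A fragment of the recipe above, applied to a DataArray `da`:
recipe = yaml.safe_load("""
smoothening:
  method: rolling_mean
  window: 25 days
anomalies: True
""")
# processed = apply_recipe(da, recipe)
```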
-
So far I have explored many packages. There is no tool that can perform all the operations we need. However, I found that these operations can be handled with xarray and scipy.
-
After playing with ESMValTool and xarray, and exploring some workflow tools, I have more thoughts about the preprocessing module. I think we need both ESMValTool and a built-in module in s2spy to deal with preprocessing in our workflow. For users who need to scale their workflow up to CMIP level, we would point them to ESMValTool for data handling and preprocessing before calling s2spy. There are some advantages to depending on ESMValTool:
The downside of relying on ESMValTool is that it is a bit heavy. But users who would like to work with CMIP models should be prepared for heavy jobs, so that should be ok. The only question is how we include it in our workflow. I would prefer to have some example notebooks which show the whole lifespan of an experiment with ESMValTool and s2spy, from data preparation to ML. For those who come with their own small datasets, we also need to facilitate them with some lightweight preprocessing modules. I explored workflow tools in Python, but to me they seem a bit over-engineered for our task. Most of these tools are designed to deal with asynchronous multi-task execution on distributed systems (similar to k8s). I think what we need here is just something super lightweight that helps users with basic preprocessing operations in an organized way. I tried to come up with something simpler using xarray and scipy; I think by creating a preprocessor class we could easily achieve this. Here is some code below for discussion. The advantages of having such a preprocessing module are:
The downside is that users get a bit constrained. I'm not so sure about the design of this module, but I think we all agree that it should not make s2spy heavier.
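A minimal sketch of what such a preprocessor class could look like, assuming xr.DataArray input with a "time" dimension (all names here are illustrative, not a final design):

```python
import scipy.signal
import xarray as xr


class Preprocessor:
    """Lightweight, chainable preprocessing for time series data."""

    def __init__(self, data: xr.DataArray):
        self.data = data

    def detrend(self) -> "Preprocessor":
        # Linear scipy detrending along the time axis (fast).
        axis = self.data.get_axis_num("time")
        detrended = scipy.signal.detrend(self.data.values, axis=axis)
        self.data = self.data.copy(data=detrended)
        return self

    def anomalies(self) -> "Preprocessor":
        # Subtract the day-of-year climatology (mean across years).
        climatology = self.data.groupby("time.dayofyear").mean("time")
        self.data = self.data.groupby("time.dayofyear") - climatology
        return self

    def rolling_mean(self, window: int = 25) -> "Preprocessor":
        # Smooth with a centered rolling mean over `window` time steps.
        self.data = self.data.rolling(time=window, center=True).mean()
        return self


# Steps stay independent and can be chained in any order, e.g.:
# processed = Preprocessor(data).detrend().anomalies().rolling_mean(25).data
```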
-
Intro
To proceed with the preprocessing module for s2spy, we need to know what functionality is inside the legacy code, so we can determine what we need. I will give an overview here of an example used at the Lorentz workshop, explaining the steps taken in the legacy code and the arguments that you can pass to the functions.
Notebook example
link
Preprocessing
Preprocessing is done separately for the precursors and for the target variable.
Precursors
First, we preprocess the precursors. In this case, we want detrending and anomalies. In this example we only have one precursor, but usually you have more than one. Not all variables require the same preprocessing. In the legacy code you can change this like so:
The structure of the calls is like this:
Possible arguments of rg.pp_precursors (see the example call after this list):
- Subselect data. Inadvisable for daily data, due to the rolling mean needed for robust calculation of the climatological mean. The default is None.
- selbox has the format (lon_min, lon_max, lat_min, lat_max). The default is None.
- String referring to the format of longitude. If 'only_east', longitude ranges from 0 to 360; if 'west_east', it ranges from -180 to 180. The default is 'only_east'.
- If True: auto-detect a mask if a field has a lot of the exact same value (e.g. -9999). The default is False.
- If True: linear scipy detrending (fast), see sp.signal.detrend docs. With a dict, {'method': 'loess'}, loess detrending can be called (slow). Extra loess arguments can be passed as well; see core_pp.detrend_wrapper?. The default is True.
- Remove climatology. For daily data, the climatology is calculated by first applying a 25-day rolling mean (if apply_fft==True) and subsequently fitting the first 6 harmonics to the rolling-mean climatology. For monthly data, the climatology is calculated on the raw data. The default is True.
- Apply a Fast Fourier Transform to fit the first 6 harmonics to the rolling-mean climatology. See the anomaly argument above.
- Encoding for writing the post-processed netcdf; could save memory, e.g. {"dtype": "int16", "scale_factor": 1E-4}. The default is {}.
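Putting the list together, a call in the legacy code might look something like the sketch below; the keyword names are taken from the descriptions above and from the recipe, but the exact signature should be checked against the legacy docs:

```python
# Hedged sketch of a legacy call: `rg` is the analysis object from the
# notebook example; all argument values here are illustrative.
rg.pp_precursors(
    selbox=(-180, 180, -80, 80),  # (lon_min, lon_max, lat_min, lat_max)
    format_lon="west_east",       # longitude from -180 to 180
    detrend=True,                 # linear scipy detrending (fast)
    anomaly=True,                 # remove climatology
    apply_fft=True,               # fit first 6 harmonics to the rolling-mean climatology
)
```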
Target
Then, we preprocess the target variable. The procedure is roughly the same as for the precursors, although here the full pre-processed time series is stored on the class, not as a separate file. There are two other keyword arguments:
- If tfreq is None and the target variable contains one value per year, the target is extended to match the precursor time axis. If the precursors are monthly means, then the target is also extended to monthly values, else daily values. Both are then aggregated to {tfreq}-day/monthly means. The default is True.
- If True, set rg.tfreq to None. The target variable will be aggregated to a single-value-per-year "period mean". start_end_TVdate defines the period to aggregate over.
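To illustrate the first keyword argument, extending a one-value-per-year target to a precursor time axis can be done with a forward-filling reindex in xarray (a self-contained sketch, not the legacy implementation):

```python
import numpy as np
import pandas as pd
import xarray as xr

# One value per year (e.g. a seasonal "period mean" target).
annual = xr.DataArray(
    np.arange(3.0),
    coords={"time": pd.to_datetime(["2000-01-01", "2001-01-01", "2002-01-01"])},
    dims="time",
)

# Monthly precursor time axis to match.
monthly_time = pd.date_range("2000-01-01", "2002-12-01", freq="MS")

# Repeat each annual value across all time steps of that year.
extended = annual.reindex(time=monthly_time, method="ffill")
```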
Order of function calls:
Discussion
Some discussion questions: