First basic preprocessing module #152

BSchilperoort · 2022-12-19T16:16:26Z

@jannesvaningen and @semvijverberg have a look! The workings of the module are demonstrated in the tutorial_preprocessing notebook

Currently implemented are:

A preprocessor class, which (on .fit):

Calculates the rolling mean of the data
Computes and stores the (linear) trend of the rolling-mean data.
Computes and stores the climatology of detrended rolling-mean data

When calling .transform

The input data is detrended using the stored trend
The climatology is subtracted from the detrended data

Example:

from s2spy import preprocess
preprocessor = preprocess.Preprocessor(
    rolling_window_size=25,
    detrend="linear",
    remove_climatology=True,
)

processed_data = preprocessor.fit_transform(input_data)

review-notebook-app · 2022-12-19T16:16:31Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Peter9192

Hey @BSchilperoort, nice that you made a start with this. Overall I like where this is going, but in terms of design, I was hoping for something more composable.

My main issue is that the preprocessor class is now a "catch-all" object. You already see that it requires a range of input arguments related to different pre-processing steps. If we extend this later on with different operations it will be even more. I would much prefer to have a separate class for each pre-processor task, and then for the overall pre-processor to tie the tasks together. E.g.

class PreprocessorTask(ABC):
    """Interface for the preprocessor tasks.

    Should have fit, transform, and fit-transform.
    """

class Detrend(PreprocessorTask):
    """Implementation of detrend preprocessor."""

class Preprocessor():
    """Pipeline or workflow object that can execute the tasks"""

    def __init__(self, *tasks: PreprocessorTask):
        self.tasks = [task for task in tasks]

    def fit(self, data):
        """Call fit on the each of the subtasks, passing the data on from one task to the next."""

et cetera

BSchilperoort · 2022-12-21T14:10:16Z

Overall I like where this is going, but in terms of design, I was hoping for something more composable.

Thanks for your comments @Peter9192 . Yang did have something akin to your example in mind, and it would be quite easy to implement.

However, I have not seen the need for this yet. Currently there are 4 input arguments for the entire preprocessor, which is sufficient to cover all we need. The rolling mean is always applied first by default, as @semvijverberg said this is best practice that users should do anyway. Note that the rolling mean is not applied when calling .transform.

I have not heard of any additional preprocessors that would need to be implemented (ones that need a .fit method at least), and preprocessing such as decadal/monthly sums can't have a fit method, just transform. At that point we would just be making a method.transform wrapper for xarray built-in methods.

Peter9192 · 2022-12-21T14:38:45Z

I have not heard of any additional preprocessors that would need to be implemented

What about dimensionality reduction (RGDR or something different) or feature extraction

At that point we would just be making a method.transform wrapper for xarray built-in methods.

I think this is exactly what we want. It will give us the possibility to control the things that can or cannot be set, and how they are implemented. With xarray you can do almost anything, and you can easily do it "wrong". I thought the idea for us was to provide a consensus/best-practice implementation, with a higher-level (more constrained) interface.

Also, we want things to be modular, right? Putting everything together sounds to me like you're addressing a single use case.

semvijverberg · 2023-01-05T11:20:04Z

Thanks for setting this up.

*I may not completely follow the reasoning of every detail in the conversation, but I wanted to share some thoughts I have.

First some jargon. With pre-processing, I and many climate scientists usually refer to only detrending and deseasonalizing, and not e.g. feature extraction methods. We often call those dimensionality reduction methods or clustering methods and I suggest that we want to stick to that convention. Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

The rolling mean is the most computational heavy job for the pre-processing. You want to avoid doing that twice, which I believed was the motivation to put everything under one class.
Like @BSchilperoort mentioned, (in principle) the deseasonalizing and detrending should be done on rolling mean (smoothened) data (if data is daily/weekly). I thought I also did my detrending on the smoothened data, but I recently discovered I did not do this, oeps. Anyway, I still think we should detrend on smoothened data, but there should be some visual verification on the implications (raw vs smoothened) and checking if it works as desired.
I think you know this but to be sure, an awesome new feature is that we allow the detrending/deseasonalizing to be done using only the training data.

BSchilperoort · 2023-01-05T12:43:32Z

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

semvijverberg · 2023-01-05T13:27:37Z

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

Peter9192 · 2023-01-09T12:28:23Z

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

jannesvaningen · 2023-01-11T14:21:14Z

I am currently struggling with the same type of issues while adding functionality to the legacy code for out-of-sample preprocessing... I think you want to give the user the option to do one, two or all of the three methods that we mentioned, but only in a specific order:

rolling mean
deseasonalize
detrend

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

For the naming: in econometrics, estimating a trend, cycle or seasonality in a timeseries is called 'timeseries decomposition'. But this only concerns the 'fit' part of this exercise, not the 'transform' part. As far as I know, there is no good general word for all of the transform steps (detrend, deseasonalize) since they are considered separately (if there is no seasonality, why deseasonalize?). What @Peter9192 suggests comes close, but standardization or normalization can also be confusing because these are used in a context where the ts is scaled compared to its standard deviation.

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

Peter9192 · 2023-01-13T08:26:20Z

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

I think this should not be too difficult. The interface (what it looks like to the user) could be very different, but here's an example implementation:

class Workflow:
    def __init__(self):
        self._tasks = {}

    def add_task(self, task):
        self._tasks[task.name] = task

    def execute(self):
        order = ['A', 'B', 'C']
        for task in order:
            if task in self._tasks:
                self._tasks[task].execute()

class Task:
    def __init__(self, name):
        self.name = name

    def execute(self):
        print(f"Executing {self.name}")

workflow = Workflow()
workflow.add_task(Task('C'))
workflow.add_task(Task('B'))
workflow.execute()
> Executing B
> Executing C

semvijverberg · 2023-01-13T10:00:31Z

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

I agree with @jannesvaningen, words like "normalize" and "standardize" refers to scaling timeseries. Removing the seasonal cycle is definitely different.

semvijverberg · 2023-01-13T10:25:06Z

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

The word 'Decompose' is also used by statsmodels. They use the word 'decompose' referring to both quantify the trend + seasonal cycle within one function. Btw, I think the statsmodels implementation is not well documented. I'm not sure what happens under the hood.

TSDifferencing is a well-known approach to create a stationary timeseries (no trend, stationary variability), but it is something very different from what we are doing. I find the method always a bit funky and not physically logical.

s2spy/preprocess.py

geek-yang · 2023-02-13T15:09:12Z

Just performed a test, it shows that by making the rolling window to be 1, it still performs rolling mean calculation, which is not computationally efficient. We should skip the rolling mean operation when the window is set to be None. Thanks for your notice about this issue @BSchilperoort 👍.

This reverts commit fc604c7.

s2spy/preprocess.py

jannesvaningen · 2023-02-15T13:48:58Z

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)
regridding
apply land/sea mask

BSchilperoort · 2023-02-15T14:02:38Z

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for the review! Don't worry about the single comments. It's only a little bit less compact than a review with multiple comments.

geek-yang · 2023-02-20T11:20:22Z

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for your review @jannesvaningen! I think it helps to identify a few issues that could be addressed to improve the preprocessor.

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)

regridding

apply land/sea mask

Thanks for your suggestions. We've thought about this when designing the preprocessor. But given that the users are asked to prepare their data as data arrays, these steps could be easily addressed by calling the xarray (or xesmf) methods, for instance:

spatial selection da.sel(lat=slice(20, 30), lon=slice(110, 140))
for regridding, we can do simple interpolation with da.interp() or da.interp_like(), and for complex ones with xesmf (e.g. regridder = xe.Regridder(ds, ds_out, "bilinear") and then ds_out = regridder(ds))
apply land/sea mask da.where(np.isnan(mask))

I think we agree to keep this preprocessor light weight, so it is not very attractive to me to re-implement all these steps rather than adding a notebook to showcase everything. But it maybe useful to add all these steps in one place (e.g. a function to receive a dict incl. all the preprocessing steps, just like a workflow builder), which makes it easier for the user. I would like to support it in later iterations if necessary, depending on real usecases.

geek-yang · 2023-02-22T10:53:41Z

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)

regridding

apply land/sea mask

A new issue #161 has been created to document these suggestions.

Co-authored-by: Bart Schilperoort <[email protected]>

BSchilperoort

Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄

I also thought it would be nice (and more complete) to expose the stored parameters as @propertys. E.g.:

@property
def detrend(self):
    return self._detrend

etc. for the other parameters. This makes it easier to inspect this item later, for users who are for example working in a notebook.

We could also open an issue for this, and then also add a nicer repr which tells you the settings (and if the preprocessor is fit already, with an is_fit property).

tests/test_preprocess.py

docs/notebooks/tutorial_preprocessing.ipynb

Co-authored-by: Bart Schilperoort <[email protected]>

sonarcloud · 2023-02-23T16:27:19Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

98.0% Coverage
0.0% Duplication

geek-yang · 2023-02-23T16:31:26Z

Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄

I also thought it would be nice (and more complete) to expose the stored parameters as @propertys. E.g.:
@property
def detrend(self):
    return self._detrend
etc. for the other parameters. This makes it easier to inspect this item later, for users who are for example working in a notebook.

We could also open an issue for this, and then also add a nicer repr which tells you the settings (and if the preprocessor is fit already, with an is_fit property).

Thanks for your review @BSchilperoort . Yes, I agree that we need to have a repr to inform the user what they have created. I just create an issue #162 for it. Let's do it in another PR.

semvijverberg · 2023-02-24T14:40:17Z

Cool! Thanks @BSchilperoort and @geek-yang!

First basic preprocessing module

382ed2f

Made trend/climate properties. Added checks.

6ea04bb

Peter9192 reviewed Dec 21, 2022

View reviewed changes

BSchilperoort and others added 7 commits February 8, 2023 15:05

Add docstrings to preprocess.py

b646f14

add tests for detrending and input checker

7e76476

test _get_climatology

0cc576e

fix bugs related to rolling mean

fc604c7

add tests for fit and transform

73aebc7

add tests for invalid trend and climatology query

2c22f11

make linter happy

0783806

geek-yang marked this pull request as ready for review February 10, 2023 10:12

geek-yang requested a review from semvijverberg February 10, 2023 10:12

BSchilperoort commented Feb 10, 2023

View reviewed changes

s2spy/preprocess.py Outdated Show resolved Hide resolved

Yang added 2 commits February 13, 2023 16:48

Revert "fix bugs related to rolling mean"

8cf5915

This reverts commit fc604c7.

allow user to skip rolling mean

f5288cb

BSchilperoort commented Feb 14, 2023

View reviewed changes

s2spy/preprocess.py Outdated Show resolved Hide resolved

BSchilperoort commented Feb 14, 2023

View reviewed changes

s2spy/preprocess.py Outdated Show resolved Hide resolved