Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First basic preprocessing module #152

Merged
merged 19 commits into from
Feb 24, 2023
Merged

Conversation

BSchilperoort
Copy link
Contributor

@BSchilperoort BSchilperoort commented Dec 19, 2022

@jannesvaningen and @semvijverberg have a look! The workings of the module are demonstrated in the tutorial_preprocessing notebook

Currently implemented are:

A preprocessor class, which (on .fit):

  1. Calculates the rolling mean of the data
  2. Computes and stores the (linear) trend of the rolling-mean data.
  3. Computes and stores the climatology of detrended rolling-mean data

When calling .transform

  1. The input data is detrended using the stored trend
  2. The climatology is subtracted from the detrended data

Example:

from s2spy import preprocess
preprocessor = preprocess.Preprocessor(
    rolling_window_size=25,
    detrend="linear",
    remove_climatology=True,
)

processed_data = preprocessor.fit_transform(input_data)

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Contributor

@Peter9192 Peter9192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @BSchilperoort, nice that you made a start with this. Overall I like where this is going, but in terms of design, I was hoping for something more composable.

My main issue is that the preprocessor class is now a "catch-all" object. You already see that it requires a range of input arguments related to different pre-processing steps. If we extend this later on with different operations it will be even more. I would much prefer to have a separate class for each pre-processor task, and then for the overall pre-processor to tie the tasks together. E.g.

class PreprocessorTask(ABC):
    """Interface for the preprocessor tasks.

    Should have fit, transform, and fit-transform.
    """

class Detrend(PreprocessorTask):
    """Implementation of detrend preprocessor."""

class Preprocessor():
    """Pipeline or workflow object that can execute the tasks"""

    def __init__(self, *tasks: PreprocessorTask):
        self.tasks = [task for task in tasks]

    def fit(self, data):
        """Call fit on the each of the subtasks, passing the data on from one task to the next."""

et cetera

@BSchilperoort
Copy link
Contributor Author

BSchilperoort commented Dec 21, 2022

Overall I like where this is going, but in terms of design, I was hoping for something more composable.

Thanks for your comments @Peter9192 . Yang did have something akin to your example in mind, and it would be quite easy to implement.

However, I have not seen the need for this yet. Currently there are 4 input arguments for the entire preprocessor, which is sufficient to cover all we need. The rolling mean is always applied first by default, as @semvijverberg said this is best practice that users should do anyway. Note that the rolling mean is not applied when calling .transform.

I have not heard of any additional preprocessors that would need to be implemented (ones that need a .fit method at least), and preprocessing such as decadal/monthly sums can't have a fit method, just transform. At that point we would just be making a method.transform wrapper for xarray built-in methods.

@Peter9192
Copy link
Contributor

I have not heard of any additional preprocessors that would need to be implemented

What about dimensionality reduction (RGDR or something different) or feature extraction

At that point we would just be making a method.transform wrapper for xarray built-in methods.

I think this is exactly what we want. It will give us the possibility to control the things that can or cannot be set, and how they are implemented. With xarray you can do almost anything, and you can easily do it "wrong". I thought the idea for us was to provide a consensus/best-practice implementation, with a higher-level (more constrained) interface.

Also, we want things to be modular, right? Putting everything together sounds to me like you're addressing a single use case.

@semvijverberg
Copy link
Member

Thanks for setting this up.

*I may not completely follow the reasoning of every detail in the conversation, but I wanted to share some thoughts I have.

First some jargon. With pre-processing, I and many climate scientists usually refer to only detrending and deseasonalizing, and not e.g. feature extraction methods. We often call those dimensionality reduction methods or clustering methods and I suggest that we want to stick to that convention. Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

  • The rolling mean is the most computational heavy job for the pre-processing. You want to avoid doing that twice, which I believed was the motivation to put everything under one class.
  • Like @BSchilperoort mentioned, (in principle) the deseasonalizing and detrending should be done on rolling mean (smoothened) data (if data is daily/weekly). I thought I also did my detrending on the smoothened data, but I recently discovered I did not do this, oeps. Anyway, I still think we should detrend on smoothened data, but there should be some visual verification on the implications (raw vs smoothened) and checking if it works as desired.
  • I think you know this but to be sure, an awesome new feature is that we allow the detrending/deseasonalizing to be done using only the training data.

@BSchilperoort
Copy link
Contributor Author

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

@semvijverberg
Copy link
Member

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

@Peter9192
Copy link
Contributor

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

@jannesvaningen
Copy link
Contributor

I am currently struggling with the same type of issues while adding functionality to the legacy code for out-of-sample preprocessing... I think you want to give the user the option to do one, two or all of the three methods that we mentioned, but only in a specific order:

  1. rolling mean
  2. deseasonalize
  3. detrend

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

For the naming: in econometrics, estimating a trend, cycle or seasonality in a timeseries is called 'timeseries decomposition'. But this only concerns the 'fit' part of this exercise, not the 'transform' part. As far as I know, there is no good general word for all of the transform steps (detrend, deseasonalize) since they are considered separately (if there is no seasonality, why deseasonalize?). What @Peter9192 suggests comes close, but standardization or normalization can also be confusing because these are used in a context where the ts is scaled compared to its standard deviation.

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

@Peter9192
Copy link
Contributor

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

I think this should not be too difficult. The interface (what it looks like to the user) could be very different, but here's an example implementation:

class Workflow:
    def __init__(self):
        self._tasks = {}

    def add_task(self, task):
        self._tasks[task.name] = task

    def execute(self):
        order = ['A', 'B', 'C']
        for task in order:
            if task in self._tasks:
                self._tasks[task].execute()

class Task:
    def __init__(self, name):
        self.name = name

    def execute(self):
        print(f"Executing {self.name}")

workflow = Workflow()
workflow.add_task(Task('C'))
workflow.add_task(Task('B'))
workflow.execute()
> Executing B
> Executing C

@semvijverberg
Copy link
Member

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

I agree with @jannesvaningen, words like "normalize" and "standardize" refers to scaling timeseries. Removing the seasonal cycle is definitely different.

@semvijverberg
Copy link
Member

semvijverberg commented Jan 13, 2023

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

The word 'Decompose' is also used by statsmodels. They use the word 'decompose' referring to both quantify the trend + seasonal cycle within one function. Btw, I think the statsmodels implementation is not well documented. I'm not sure what happens under the hood.

TSDifferencing is a well-known approach to create a stationary timeseries (no trend, stationary variability), but it is something very different from what we are doing. I find the method always a bit funky and not physically logical.

@geek-yang geek-yang marked this pull request as ready for review February 10, 2023 10:12
s2spy/preprocess.py Outdated Show resolved Hide resolved
@geek-yang
Copy link
Member

geek-yang commented Feb 13, 2023

Just performed a test, it shows that by making the rolling window to be 1, it still performs rolling mean calculation, which is not computationally efficient. We should skip the rolling mean operation when the window is set to be None. Thanks for your notice about this issue @BSchilperoort 👍.

s2spy/preprocess.py Outdated Show resolved Hide resolved
s2spy/preprocess.py Outdated Show resolved Hide resolved
s2spy/preprocess.py Outdated Show resolved Hide resolved
s2spy/preprocess.py Outdated Show resolved Hide resolved
@jannesvaningen
Copy link
Contributor

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

  • spatial selection (so selecting a certain spatial extent with lat and lon)
  • regridding
  • apply land/sea mask

@BSchilperoort
Copy link
Contributor Author

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for the review! Don't worry about the single comments. It's only a little bit less compact than a review with multiple comments.

@geek-yang
Copy link
Member

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for your review @jannesvaningen! I think it helps to identify a few issues that could be addressed to improve the preprocessor.

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

  • spatial selection (so selecting a certain spatial extent with lat and lon)
  • regridding
  • apply land/sea mask

Thanks for your suggestions. We've thought about this when designing the preprocessor. But given that the users are asked to prepare their data as data arrays, these steps could be easily addressed by calling the xarray (or xesmf) methods, for instance:

  • spatial selection da.sel(lat=slice(20, 30), lon=slice(110, 140))
  • for regridding, we can do simple interpolation with da.interp() or da.interp_like(), and for complex ones with xesmf (e.g. regridder = xe.Regridder(ds, ds_out, "bilinear") and then ds_out = regridder(ds))
  • apply land/sea mask da.where(np.isnan(mask))

I think we agree to keep this preprocessor light weight, so it is not very attractive to me to re-implement all these steps rather than adding a notebook to showcase everything. But it maybe useful to add all these steps in one place (e.g. a function to receive a dict incl. all the preprocessing steps, just like a workflow builder), which makes it easier for the user. I would like to support it in later iterations if necessary, depending on real usecases.

@geek-yang
Copy link
Member

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

  • spatial selection (so selecting a certain spatial extent with lat and lon)
  • regridding
  • apply land/sea mask

A new issue #161 has been created to document these suggestions.

Copy link
Contributor Author

@BSchilperoort BSchilperoort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄

I also thought it would be nice (and more complete) to expose the stored parameters as @propertys. E.g.:

@property
def detrend(self):
    return self._detrend

etc. for the other parameters. This makes it easier to inspect this item later, for users who are for example working in a notebook.

We could also open an issue for this, and then also add a nicer repr which tells you the settings (and if the preprocessor is fit already, with an is_fit property).

tests/test_preprocess.py Outdated Show resolved Hide resolved
docs/notebooks/tutorial_preprocessing.ipynb Outdated Show resolved Hide resolved
docs/notebooks/tutorial_preprocessing.ipynb Outdated Show resolved Hide resolved
@sonarcloud
Copy link

sonarcloud bot commented Feb 23, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

98.0% 98.0% Coverage
0.0% 0.0% Duplication

@geek-yang
Copy link
Member

Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄

I also thought it would be nice (and more complete) to expose the stored parameters as @propertys. E.g.:

@property
def detrend(self):
    return self._detrend

etc. for the other parameters. This makes it easier to inspect this item later, for users who are for example working in a notebook.

We could also open an issue for this, and then also add a nicer repr which tells you the settings (and if the preprocessor is fit already, with an is_fit property).

Thanks for your review @BSchilperoort . Yes, I agree that we need to have a repr to inform the user what they have created. I just create an issue #162 for it. Let's do it in another PR.

@BSchilperoort BSchilperoort merged commit dd267fb into main Feb 24, 2023
@BSchilperoort BSchilperoort deleted the implement_first_preprocessing branch February 24, 2023 13:48
@semvijverberg
Copy link
Member

Cool! Thanks @BSchilperoort and @geek-yang!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants