First basic preprocessing module #152
Conversation
Hey @BSchilperoort, nice that you made a start with this. Overall I like where this is going, but in terms of design, I was hoping for something more composable.
My main issue is that the preprocessor class is now a "catch-all" object. You can already see that it requires a range of input arguments related to different pre-processing steps; if we extend this later with additional operations, that list will only grow. I would much prefer a separate class for each pre-processor task, with the overall pre-processor tying the tasks together. E.g.:
```python
from abc import ABC


class PreprocessorTask(ABC):
    """Interface for the preprocessor tasks.

    Should have fit, transform, and fit_transform.
    """


class Detrend(PreprocessorTask):
    """Implementation of the detrend preprocessor."""


class Preprocessor:
    """Pipeline or workflow object that can execute the tasks."""

    def __init__(self, *tasks: PreprocessorTask):
        self.tasks = list(tasks)

    def fit(self, data):
        """Call fit on each of the subtasks, passing the data on from one task to the next."""
```

et cetera.
Thanks for your comments @Peter9192. Yang did have something akin to your example in mind, and it would be quite easy to implement. However, I have not seen the need for this yet. Currently there are four input arguments for the entire preprocessor, which is sufficient to cover all we need. The rolling mean is always applied first by default; as @semvijverberg said, this is best practice that users should do anyway. Note that the rolling mean is not applied when calling `.transform`. I have not heard of any additional preprocessors that would need to be implemented (ones that need a `.fit` method at least), and preprocessing such as decadal/monthly sums can't have a fit method, just transform. At that point we would just be making a …
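To make the fit/transform split concrete, here is a minimal sketch (not the actual implementation in this PR; the class layout, argument names, and the use of xarray polynomial fitting are assumptions for illustration):

```python
import xarray as xr


class Preprocessor:
    """Sketch of the behaviour described above (illustrative names only)."""

    def __init__(self, rolling_window: int = 25):
        self.rolling_window = rolling_window
        self._trend = None
        self._climatology = None

    def fit(self, data: xr.DataArray) -> None:
        # The rolling mean is always applied first, but only during fitting.
        smoothed = data.rolling(time=self.rolling_window, center=True).mean()
        # Fit a linear trend and the mean seasonal cycle on the smoothed data.
        self._trend = smoothed.polyfit(dim="time", deg=1)
        self._climatology = smoothed.groupby("time.dayofyear").mean()

    def transform(self, data: xr.DataArray) -> xr.DataArray:
        # No rolling mean here: only remove the fitted trend and seasonal cycle.
        trend = xr.polyval(data["time"], self._trend.polyfit_coefficients)
        detrended = data - trend
        return detrended.groupby("time.dayofyear") - self._climatology
```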
What about dimensionality reduction (RGDR or something different) or feature extraction?
I think this is exactly what we want. It will give us the possibility to control the things that can or cannot be set, and how they are implemented. With xarray you can do almost anything, and you can easily do it "wrong". I thought the idea for us was to provide a consensus/best-practice implementation, with a higher-level (more constrained) interface. Also, we want things to be modular, right? Putting everything together sounds to me like you're addressing a single use case.
Thanks for setting this up. I may not completely follow the reasoning of every detail in the conversation, but I wanted to share some thoughts. First, some jargon: with pre-processing, I and many climate scientists usually refer only to detrending and deseasonalizing, and not to e.g. feature extraction methods. We often call those dimensionality reduction or clustering methods, and I suggest we stick to that convention. Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.
Thanks for your input, @semvijverberg. I did have a talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now. One point Peter did raise, which I do agree with, is that the current name is quite generic.
In the same sense, resampling to the Calendar system is also a form of preprocessing. Would you (or @jannesvaningen) know of a good name to give this detrending + deseasonalizing step (instead of the current generic `Preprocessor`)?
Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply `class detrend_deseasonalize`?
What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be |
I am currently struggling with the same type of issues while adding functionality to the legacy code for out-of-sample preprocessing. I think you want to give the user the option to do one, two, or all three of the methods that we mentioned, but only in a specific order:
The user would have the freedom to do only detrending, but not to do deseasonalizing and then detrending. The aim here is to maintain the best practices. I have no idea how you could best accommodate this in code. Do you agree? And if so, any ideas?

For the naming: in econometrics, estimating a trend, cycle, or seasonality in a timeseries is called 'timeseries decomposition'. But this only concerns the 'fit' part of this exercise, not the 'transform' part. As far as I know, there is no good general word for all of the transform steps (detrend, deseasonalize), since they are considered separately (if there is no seasonality, why deseasonalize?). What @Peter9192 suggests comes close, but standardization or normalization can also be confusing, because those terms are used where the timeseries is scaled by its standard deviation. If you follow all the steps, you are left with a timeseries that should be stationary, or integrated of order 0. The procedure to get such a timeseries is referred to as differencing. So maybe `TSDifferencing`?
I think this should not be too difficult. The interface (what it looks like to the user) could be very different, but here's an example implementation:

```python
class Workflow:
    def __init__(self):
        self._tasks = {}

    def add_task(self, task):
        self._tasks[task.name] = task

    def execute(self):
        # The hard-coded order enforces the best-practice sequence,
        # regardless of the order in which tasks were added.
        order = ["A", "B", "C"]
        for task in order:
            if task in self._tasks:
                self._tasks[task].execute()


class Task:
    def __init__(self, name):
        self.name = name

    def execute(self):
        print(f"Executing {self.name}")


workflow = Workflow()
workflow.add_task(Task("C"))
workflow.add_task(Task("B"))
workflow.execute()
```

Output:

```
Executing B
Executing C
```
I agree with @jannesvaningen; words like "normalize" and "standardize" refer to scaling timeseries. Removing the seasonal cycle is definitely different.
The word 'Decompose' is also used by statsmodels: they use 'decompose' to refer to quantifying both the trend and the seasonal cycle within one function (see the sketch after this comment). Btw, I think the statsmodels implementation is not well documented; I'm not sure what happens under the hood. TSDifferencing is a well-known approach to create a stationary timeseries (no trend, stationary variability), but it is something very different from what we are doing. I find the method always a bit funky and not physically logical.
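For reference, this is the statsmodels function in question; a minimal sketch (the synthetic monthly series and `period=12` are assumed examples):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series, purely for illustration.
index = pd.date_range("2000-01-01", periods=240, freq="MS")
series = pd.Series(range(240), index=index, dtype=float)

# One call quantifies both the trend and the seasonal cycle.
result = seasonal_decompose(series, model="additive", period=12)
trend, seasonal, residual = result.trend, result.seasonal, result.resid
```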
Just performed a test: it shows that when the rolling window is set to 1, the rolling mean calculation is still performed, which is not computationally efficient. We should skip the rolling mean operation when the window is set to 1.
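A minimal sketch of such a guard (assuming an xarray DataArray with a `time` dimension; the function name is illustrative):

```python
import xarray as xr


def apply_rolling_mean(data: xr.DataArray, window: int) -> xr.DataArray:
    """Apply a centered rolling mean, skipping the no-op case."""
    if window <= 1:
        # A window of 1 leaves the data unchanged, so skip the
        # (relatively expensive) rolling computation entirely.
        return data
    return data.rolling(time=window, center=True).mean()
```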
This reverts commit fc604c7.
Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly! I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:
Thanks for the review! Don't worry about the single comments. It's only a little bit less compact than a review with multiple comments.
Thanks for your review @jannesvaningen! I think it helps to identify a few issues that could be addressed to improve the preprocessor.
Thanks for your suggestions. We thought about this when designing the preprocessor. But given that the users are asked to prepare their data as data arrays, these steps could easily be addressed by calling the existing xarray functionality directly.
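For example (a sketch only, assuming the steps in question include the NaN handling and interpolation mentioned earlier in this thread), plain xarray already covers such preparation:

```python
import xarray as xr

# Hypothetical input: a DataArray with a "time" dimension.
data = xr.open_dataarray("input.nc")  # placeholder file name

# Fill NaNs along time by linear interpolation.
data = data.interpolate_na(dim="time", method="linear")

# Drop time steps that are still entirely NaN.
data = data.dropna(dim="time", how="all")
```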
I think we agree to keep this preprocessor lightweight, so it is not very attractive to me to re-implement all these steps rather than adding a notebook to showcase everything. But it may be useful to have all these steps in one place (e.g. a function that receives a dict with all the preprocessing steps, much like a workflow builder), which would make things easier for the user. I would like to support that in later iterations if necessary, depending on real use cases.
A new issue #161 has been created to document these suggestions.
Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄
I also thought it would be nice (and more complete) to expose the stored parameters as `@property`s. E.g.:

```python
@property
def detrend(self):
    return self._detrend
```

etc. for the other parameters. This makes it easier to inspect these items later, for users who are, for example, working in a notebook.

We could also open an issue for this, and then also add a nicer `repr` that tells you the settings (and whether the preprocessor is already fit, via an `is_fit` property).
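A sketch of what that could look like (attribute and parameter names are assumptions, not the PR's actual API):

```python
class Preprocessor:
    def __init__(self, rolling_window: int = 25, detrend: bool = True):
        self._rolling_window = rolling_window
        self._detrend = detrend
        self._trend = None  # set by .fit()

    @property
    def is_fit(self) -> bool:
        """True once .fit() has been called."""
        return self._trend is not None

    def __repr__(self) -> str:
        # A nicer repr that reports the settings and the fit state.
        return (
            f"Preprocessor(rolling_window={self._rolling_window}, "
            f"detrend={self._detrend}, is_fit={self.is_fit})"
        )
```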
Thanks for your review @BSchilperoort. Yes, I agree that we need to have a nicer `repr`.
Cool! Thanks @BSchilperoort and @geek-yang!
@jannesvaningen and @semvijverberg have a look! The workings of the module are demonstrated in the `tutorial_preprocessing` notebook.

Currently implemented are:

A preprocessor class, which (on `.fit`):
- applies the rolling-mean smoothing first (by default, and only during fitting),
- fits the trend and the seasonal cycle.

When calling `.transform`:
- the fitted trend and seasonal cycle are removed from the data (the rolling mean is not applied here).
Example:
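A minimal usage sketch (the import path, argument names, and file name are assumptions based on the discussion above, not the exact API):

```python
import xarray as xr
from s2spy.preprocess import Preprocessor  # hypothetical import path

data = xr.open_dataarray("sst.nc")  # placeholder input file

# Fit on a training period: rolling mean first, then trend + seasonal cycle.
preprocessor = Preprocessor(rolling_window=25)
preprocessor.fit(data.sel(time=slice("1980", "2010")))

# Transform removes the fitted trend and seasonal cycle;
# the rolling mean is not applied here.
processed = preprocessor.transform(data)
```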