Refactor cross-validator #3
Comments
Let's try to make this more concrete. What would using this package look like? Something like this?
from s2scv import CrossValidator
cv = CrossValidator(start_date, end_date, freq='M')
# returns instance of CrossValidator
cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns a pandas dataframe with a column that labels whether an entry is part of the train/test group
# or e.g. with an alternative interface:
cv.traintest.leave_n_out(n=5)
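To make the alternative interface a bit more tangible, here is a minimal sketch of what a leave-n-out split over years could produce. The helper name `leave_n_out_folds` and the choice to hold out consecutive blocks of years are assumptions for illustration, not the s2scv API; plain dicts stand in for the pandas dataframe mentioned above to keep the sketch dependency-free.

```python
from typing import Dict, List


def leave_n_out_folds(years: List[int], n: int) -> List[Dict[int, str]]:
    """Hypothetical helper: hold out consecutive blocks of n years as test data.

    Returns one dict per fold, mapping each year to 'train' or 'test'.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    folds = []
    for start in range(0, len(years), n):
        test_years = set(years[start:start + n])
        folds.append({y: ("test" if y in test_years else "train") for y in years})
    return folds


folds = leave_n_out_folds(list(range(2000, 2010)), n=5)
print(len(folds))      # number of folds
print(folds[0][2000])  # label of the year 2000 in the first fold
```

Splitting on whole years rather than on individual dates is one way to keep neighbouring (autocorrelated) dates in the same group.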
Based on @Peter9192's suggestion, I think we can come up with a roadmap simply by coding. We can act as real users and write code the way we typically would in a Python script.
from s2scv import CrossValidator
# input variables
# start_date: str='mm-dd', end_date: str='mm-dd', freq: str='M'
cv = CrossValidator(start_date, end_date, freq='M')
# returns object (class CrossValidator)
And link the legacy code if there is any, for instance: Lines 1382 to 1383 in 0dd397f
# input variables
# method: str='leave_n_out', options: dict={'n': 5}
cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns numpy.array
Legacy code: Line 314 in 0dd397f
We will continue until we have listed every method and piece of functionality we can think of. This way, we will have a complete picture of each building block we want to include in our package. This forms the skeleton, and we can then start coding and filling in these functions (much like completing unit tests one by one). It also helps to write things down based on the schematic @semvijverberg made in #2 and makes our communication easier. @jannesvaningen and @semvijverberg, could you please brainstorm a bit and finish this recipe?
Okay, hereby my attempt to make this more concrete! There may be some confusion about the naming, as the package/class should become more than just a CrossValidator. It also holds information on the target dates, as described above. Note that this information is also needed to verify that the cross-validation is done properly (avoiding train-test leakage due to autocorrelation or lag shifting). I therefore suggest calling it S2Sinit.
recipe attempt
from S2Sinit import S2Sinit
set = S2Sinit(start_target_date='01-01', end_target_date='02-01', freq='M',
              cross_year=False,
              start_load_date=None, end_load_date=None,
              start_year=start_year, end_year=end_year,
              tfreq=7)
set.traintest.leave_n_out(n=5, max_lag=None)
# -> adds a pd.DataFrame called df_splits to the instance 'set'
df_splits contains two columns:
The first column ('CV') contains integers that indicate whether a date is part of training (1), testing (0), or not used (-1).
The second column ('TargetPeriod') contains integers that indicate whether the date is part of the target period (1) or not (0).
S2Sinit class
The init class would look like this:
from typing import Union
class S2Sinit:
    # Arguments are keyword-only (*), so the required start_year and end_year
    # may follow arguments that have defaults.
    def __init__(self, *,
                 start_target_date: str = '11-01', end_target_date: str = '12-01', freq: str = 'M',
                 cross_year: int = None,
                 start_year: int, end_year: int,
                 tfreq: Union[int, str] = 7,
                 start_load_date: str = None):
        """
        Class to initialize parameters that define the dates to load and the target dates.

        start_target_date: str, optional
            Defines the start of your target period, default = '11-01'.
        end_target_date: str, optional
            Defines the end of your target period, default = '12-01'.
        freq: str
            Frequency string, either 'M' for months or 'd' for days (in accordance with [pd.date_range](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html)).
        cross_year: int, optional
            Allows the user to select a target period that crosses a year boundary. For example, for DJF the parameters would be:
            start_target_date: str='12-01', end_target_date: str='02-01', cross_year=1
            Also enables users to select a target period spanning multiple years. If 1, the target period will be
            -{cross_year}-11-01 until 00-12-01. In total 13 months.
        start_year: int
            Select start year.
        end_year: int
            Select end year.
        tfreq: Union[int, str]
            If an integer, aggregate to n-day or n-month means, depending on parameter freq. If tfreq is 'mean', then a period
            mean is calculated over the target period.
        start_load_date: str, optional
            Defines the start of the data that you will load and aggregate to n-day or n-month bins, default = None.
            If None, all data that is available in the dataset will be loaded; otherwise a subset of the year might be loaded for
            memory efficiency.
        """
temporal manipulations and functions in proto
This functionality is currently a bit scattered throughout the proto code. The function that aggregates the daily or monthly dataset to {tfreq}-day or {tfreq}-month means can be found here: Line 501 in 7773493
And if tfreq == 'mean', a period mean will be calculated, with the period defined by start_target_date and end_target_date: Line 911 in 7773493
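As a rough illustration of the two tfreq modes described above, the following sketch uses plain pandas on a synthetic daily series. It is not the proto implementation. The data, the bin alignment (bins start at the first date) and the assumption that both target-period end dates are inclusive are all choices made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for a loaded dataset (illustrative data only).
index = pd.date_range("2000-11-01", "2000-12-12", freq="D")  # 42 days
daily = pd.Series(np.arange(len(index), dtype=float), index=index)

# tfreq as an integer: aggregate to 7-day means, with bins starting at the first date.
tfreq = 7
binned = daily.resample(f"{tfreq}D").mean()

# tfreq == 'mean': a single mean over the target period (both end dates inclusive here).
period_mean = daily.loc["2000-11-01":"2000-12-01"].mean()

print(binned.head())
print(period_mean)
```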
The highest-level function, which currently contains (only) datetime-index functionality: Lines 175 to 196 in 7773493
A similar function should be made for S2Sinit, but with the input (ds) being a pd.DatetimeIndex instead of an xarray dataset.
cross validation functions in proto
@jannesvaningen wrote this; @semvijverberg, I added it here for the sake of completeness, hope that's okay. Functionality for cross-validation is both in class RGCPD and 'deeper' in functions_pp. As Yang mentioned, the cross-validation starts with def traintest. This function requires as main inputs the RGCPD instance and the CV method (leave-one-out, k-fold, etc.). Line 312 in 7773493
Then, def RV_and_traintest is called within def traintest from here: Lines 368 to 372 in 7773493
The function above takes as input self.df_fullts, self.df_RV_ts and self.traintestgroups from the preprocessing function pp_TV. I believe RV (response variable) and TV (target variable) are used interchangeably; @semvijverberg, am I correct, and is this maybe something to align? TV stands for Target Variable, referring to your 'y' variable, in our case usually air temperature. f refers to the Python file functions_pp. I think this shows that we do some preprocessing steps before the cross-validation is determined.
S2Spp Lines 299 to 300 in 7773493
The function process_TV does preprocessing steps for the target variable, such as aggregations, within functions_pp: Lines 151 to 158 in 7773493
To trim it down further, the input 'fulltso' is an output of def load_tv, also from functions_pp. It takes as input the list of path names of the input variables and the name of the ds, which has 'ts' as its standard name. It gives as output fulltso: xr.DataArray(), the 1-d full timeseries. Lines 290 to 292 in 7773493
Lines 86 to 89 in 7773493
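To tie the recipe in this comment to something runnable, here is a minimal sketch of the df_splits layout with its 'CV' and 'TargetPeriod' columns. It only illustrates the coding of the two columns. The monthly dates, the year-based fold assignment and the names `test_years`, `unused_years` and `target_months` are assumptions for this example, not part of the proposed package.

```python
import pandas as pd

# Monthly dates for three years (illustrative).
dates = pd.date_range("2000-01-01", "2002-12-01", freq="MS")

test_years = {2001}       # assumed: years held out for testing in this fold
unused_years = {2002}     # assumed: years excluded from this fold
target_months = {11, 12}  # stands in for the default target period '11-01' to '12-01'


def cv_label(year: int) -> int:
    """1 = training, 0 = testing, -1 = not used (coding from the recipe)."""
    if year in unused_years:
        return -1
    return 0 if year in test_years else 1


df_splits = pd.DataFrame(
    {
        "CV": [cv_label(d.year) for d in dates],
        "TargetPeriod": [int(d.month in target_months) for d in dates],
    },
    index=dates,
)

# Training rows that fall inside the target period.
train_target = df_splits[(df_splits["CV"] == 1) & (df_splits["TargetPeriod"] == 1)]
print(train_target)
```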
Thanks a lot for all the input! I think it's getting more concrete, but the API design can still be improved in a couple of ways. For instance:
This is not a critique but an attempt to converge on a common understanding of what makes a good API for our package. Perhaps @geek-yang and I can come up with a proposal building on what you wrote above.
Hey Peter, thanks for the feedback!! My idea was that the S2Sinit class can create instances that hold all information related to:
This instance can either be passed to other packages or be used to return relevant manipulations, such as subselecting the training data of the target period of the first fold. Selecting the time period you want to predict, the temporal frequency of your experiment, and the cross-validation should always be the starting point of your analysis. This information needs to be homogenized across all data that will be loaded later in the pipeline (so it would be inefficient to make this part of a data loader and have to retype the datetime-related parameters (start_date, end_date, tfreq, cross_year)). I believe separating the data loading also makes this class very handy for other people who only want to use the datetime functionality that will be built onto S2Sinit:
These can be annoying/time-consuming things to code properly, so I imagine some people might want to use only the S2Sinit functionality. Having all datetime and CV information in one object makes life easier for the user (less typing). A simple example: I want the 10-day means that fall in DJF at lag 2 of training fold 1 (this could be a real user need, as maybe training crashed on this fold with this lag and you want to check for NaNs). As a stand-alone function, this would always need 6 parameters (start_target_date, end_target_date, tfreq, cross_year, df_splits and lag). As a method, it needs only 1 parameter, e.g. set.get_lag_shifted(lag=2).
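The argument for keeping state in one object can be sketched as follows. This is a toy version under stated assumptions: the class name `S2SDates`, its fields, and the interpretation of a lag as a shift of `lag * tfreq` days are hypothetical, and only `get_lag_shifted(lag=2)` is taken from the example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class S2SDates:
    """Hypothetical container: target dates, bin width in days, per-fold labels."""
    target_dates: List[date]
    tfreq: int                          # bin width in days
    splits: Dict[int, Dict[date, int]]  # fold -> {target date: 1 train / 0 test}

    def get_lag_shifted(self, lag: int, fold: int = 0, train: bool = True) -> List[date]:
        """Dates `lag` bins before the target dates in one fold's train (or test) part."""
        wanted = 1 if train else 0
        shift = timedelta(days=lag * self.tfreq)
        return [d - shift for d in self.target_dates if self.splits[fold][d] == wanted]


targets = [date(2000, 12, 1), date(2001, 12, 1)]
dates = S2SDates(
    target_dates=targets,
    tfreq=10,
    splits={0: {targets[0]: 1, targets[1]: 0}},
)

# One argument at the call site; everything else is already stored on the object.
print(dates.get_lag_shifted(lag=2))
```

The equivalent stand-alone function would need the target dates, the bin width and the splits passed in on every call, which is the extra typing the comment refers to.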
@semvijverberg and @jannesvaningen, thanks a lot for your explanation. This indeed helps @Peter9192 and me understand your requirements and vision for this package. If I'm not missing anything, then according to your description, I think two core components (let's not worry about terms like "class" or "methods" for now) must be included in this package:
In the initializer, you want to specify input/target datetime and "preprocess" the data into chunks that are based on the [...] Then you allow the user to use the preprocessed data to perform cross-validation in the same package. The "data chunks" and other key parameters are also ready to be passed to a following module. I think it makes sense to put these two components in one package, but we may want to encapsulate them in two modules (or, in Python terms, classes), since they are a bit different and each will have its own functions (methods) that are not used by the other. So we might also want a more explicit name for the whole package, but this can be decided later. @Peter9192 and I could look into it together and we will propose a draft structure of the package. Then we can have a discussion about it. Thanks again for the detailed info. That really helps a lot 😄 👍 😆 ! @semvijverberg @jannesvaningen
Ooop, you are a bit faster than me 😹. But it seems my impression is mostly correct. Thanks for the clarification. @Peter9192 and I will have a design session first and see how far we can get. We will keep you guys updated.
I completely agree with that. An alternative could be to use a dataclass to just pass that information, like so:
from dataclasses import dataclass
from typing import Union

@dataclass
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'  # annotated so that it becomes a dataclass field
    tfreq: Union[int, str] = 7

e = Experiment(start_year=1979, end_year=2021)
>> Experiment(start_year=1979, end_year=2021, target_start='11-01', target_end='12-01', freq='M', tfreq=7)
And from there I can see how you want to add methods to the class. However, it gets a bit confusing when you then start changing the attributes on the object. To be continued...
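One possible way to sidestep the confusion about changing attributes, offered here only as a sketch and not as a decision anyone in this thread has made, is to freeze the dataclass and derive modified copies with `dataclasses.replace`. The field names follow the example above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'
    tfreq: Union[int, str] = 7


e = Experiment(start_year=1979, end_year=2021)

try:
    e.tfreq = 10  # attribute assignment is blocked on a frozen dataclass
    mutated = True
except FrozenInstanceError:
    mutated = False

# Changing a setting means deriving a new, explicit object instead.
e10 = replace(e, tfreq=10)
print(mutated, e.tfreq, e10.tfreq)
```

With this design, any results computed from `e` stay valid for `e`, and a different tfreq is visibly a different experiment.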
The cross-validator should be date-time aware to avoid train/test information leakage. Currently, the cross-validator is separated from the date-time target information and lags.
Currently, date-time target information is decided here:
proto/RGCPD/class_RGCPD.py
Lines 58 to 61 in 0dd397f
and train-test information is decided here:
proto/RGCPD/class_RGCPD.py
Line 314 in 0dd397f
Once both pieces of information are known, this function is called:
proto/RGCPD/functions_pp.py
Lines 1382 to 1384 in 0dd397f
These functions are used for lag shifting:
proto/forecasting/func_models.py
Line 153 in 0dd397f
proto/forecasting/func_models.py
Line 122 in 0dd397f
I'm not 100% sure whether the following is still used in the code:
proto/RGCPD/functions_pp.py
Line 1730 in 0dd397f
Alternatively, we could create a new package/module that combines the datetime target information, lags and train/test information.
I propose to call it: S2SCV
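To illustrate the leakage concern that motivates this issue, the sketch below flags training dates that lie within a maximum lag of any test date. With lag shifting and autocorrelated data, such dates are the ones where test information could leak into training. The function name `leaky_train_dates` and the day-based window are assumptions for this example and do not come from the proto code.

```python
from datetime import date, timedelta
from typing import List


def leaky_train_dates(train: List[date], test: List[date], max_lag_days: int) -> List[date]:
    """Hypothetical check: training dates within max_lag_days of any test date."""
    window = timedelta(days=max_lag_days)
    return [t for t in train if any(abs(t - s) <= window for s in test)]


# A naive split on daily data: November in training, December in testing.
train = [date(2000, 11, 1) + timedelta(days=i) for i in range(30)]
test = [date(2000, 12, 1) + timedelta(days=i) for i in range(31)]

flagged = leaky_train_dates(train, test, max_lag_days=14)
print(len(flagged), flagged[0])
```

A cross-validator that knows the target dates and the maximum lag could run a check like this itself, or leave a gap between train and test groups, instead of relying on the user to notice.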