
Refactor cross-validator #3

Open
geek-yang opened this issue Mar 28, 2022 · 8 comments
Comments

@geek-yang
Member

geek-yang commented Mar 28, 2022

Cross-validator should be date-time aware to avoid train/test information leakage. Currently the cross-validator is separated from date-time target information and lags.

Currently, date-time target information is decided here:

start_end_TVdate=None,
tfreq: int=10,
start_end_date: Tuple[str, str]=None,
start_end_year: Tuple[int, int]=None,

and train-test information is decided here:

def traintest(self, method: Union[str, bool]=None, seed=1,

When all of this information is known, this function is called:

proto/RGCPD/functions_pp.py

Lines 1382 to 1384 in 0dd397f

def cross_validation(RV_ts, traintestgroups=None, test_yrs=None, method=str,
                     seed=None, gap_prior: int=None, gap_after: int=None):
    #%%

These functions are used for lag shifting:

def apply_shift_lag(fit_masks, lag_i):

def _check_y_fitmask(fit_masks, lag_i, base_lag):

I'm not 100% sure whether the following is still used in the code:

def func_dates_min_lag(dates, lag):

Alternatively, we could create a new package/module that would be able to combine all information on datetime target information, lags and train/test information.

I propose to call it: S2SCV

@Peter9192
Contributor

Let's try to make this more concrete. What would using this package look like? Something like this?

from s2scv import CrossValidator

cv = CrossValidator(start_date, end_date, freq='M')
# returns instance of CrossValidator

cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns a pandas dataframe with a column that labels whether an entry is part of the train/test group

# or e.g. with an alternative interface:
cv.traintest.leave_n_out(n=5)
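To make the discussion concrete, here is a minimal, hypothetical sketch of what such a `CrossValidator` could look like. All names are assumptions, and the single train/test split is a simplification, not the proposed leave-n-out algorithm:

```python
import pandas as pd

class CrossValidator:
    """Hypothetical sketch of the interface discussed above."""

    def __init__(self, start_date: str, end_date: str, freq: str = "MS"):
        # build the datetime index the splits will be defined on
        self.index = pd.date_range(start_date, end_date, freq=freq)

    def get_train_test(self, method: str = "leave_n_out",
                       options: dict = None) -> pd.DataFrame:
        if method != "leave_n_out":
            raise NotImplementedError(method)
        n = (options or {}).get("n", 1)
        df = pd.DataFrame(index=self.index)
        # label every entry as train, then mark the last n entries as test
        df["traintest"] = "train"
        df.iloc[-n:, df.columns.get_loc("traintest")] = "test"
        return df

cv = CrossValidator("2000-01-01", "2000-12-31", freq="MS")
splits = cv.get_train_test(method="leave_n_out", options={"n": 5})
```

A real leave-n-out implementation would of course produce one labelling per fold rather than a single split; the sketch only illustrates the shape of the returned DataFrame.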

@geek-yang
Member Author

Based on @Peter9192's suggestion, I think we can come up with a roadmap simply by coding. We can act as real users and write code the way we typically would in our Python scripts.

Import library and create an instance of cross validator

from s2scv import CrossValidator

# input variable
# start_date: str='mm-dd', end_date: str="mm-dd", freq: str='M'
cv = CrossValidator(start_date, end_date, freq='M')
# return object (class CrossValidator)

And link the legacy code, if there is any. For instance:

proto/RGCPD/functions_pp.py

Lines 1382 to 1383 in 0dd397f

def cross_validation(RV_ts, traintestgroups=None, test_yrs=None, method=str,
                     seed=None, gap_prior: int=None, gap_after: int=None):

Split dataset for training and testing

# input variables
# method: str='leave_n_out', options: dict={'n': 5}
cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns a numpy.array

Legacy code

def traintest(self, method: Union[str, bool]=None, seed=1,

We will continue until we have listed every method/functionality we can think of.

This way, we will have a complete picture of each building block we want to include in our package. This forms the skeleton, and we can start coding and filling in these functions (just like completing unit tests one by one). It also helps to put things down based on the schematic @semvijverberg made in #2 and makes our communication easier.

@jannesvaningen and @semvijverberg, could you please brainstorm a bit and finish this recipe?

@semvijverberg
Member

semvijverberg commented Apr 11, 2022

Okay, hereby my attempt to make this more concrete! Maybe there will be some confusion with the naming, as the package/class should become more than just a CrossValidator. It also holds information on the target dates, as described above. Note that this information is also needed to verify that the cross-validation is done properly (avoiding train/test leakage due to auto-correlation or lag-shifting). I therefore suggest calling it S2Sinit.

recipe attempt

import S2Sinit

set = S2Sinit(start_target_date='01-01', end_target_date='02-01', freq='M',
              cross_year=False,
              start_load_date=None, end_load_date=None,
              start_year=..., end_year=...,
              tfreq=7)

set.traintest.leave_n_out(n=5, max_lag=None)

-> adds a pd.DataFrame called df_splits to the instance 'set'
df_splits contains two columns:
The first column ('CV') contains integers that indicate whether a date is part of training (1), testing (0), or not used (-1).
The second column ('TargetPeriod') contains integers that indicate whether the date is part of the target period (1) or not (0).
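As an illustration only (the dates and the split boundary are chosen arbitrarily, and string labels replace the 1/0/-1 bookkeeping for brevity where noted), the df_splits layout described above could be constructed like this:

```python
import pandas as pd

# daily index covering one NDJF season (arbitrary example dates)
dates = pd.date_range("2000-11-01", "2001-02-28", freq="D")
df_splits = pd.DataFrame(index=dates)

# 'CV': 1 = training, 0 = testing, -1 = not used
df_splits["CV"] = 1
df_splits.loc["2001-01-01":, "CV"] = 0

# 'TargetPeriod': 1 if the date falls inside the target period (here December), else 0
in_target = (dates >= "2000-12-01") & (dates <= "2000-12-31")
df_splits["TargetPeriod"] = in_target.astype(int)
```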

S2Sinit class

The Init class would look like this:

class S2Sinit:
    def __init__(self,
                 start_target_period: str='11-01', end_target_period: str='12-01', freq: str='M',
                 cross_year: int=None,
                 start_year: int=None, end_year: int=None,
                 tfreq: Union[int, str]=7,
                 start_load_date: str=None):

              """
              Class to initialize parameters that define the dates to load and the target dates.
              
              start_target_date: str, optional
              _defines the start of your target period, default = '11-01'_
              
              end_target_period: str, optional
              _defines the end of your target period, default = '12-01'_
              
              freq: str
              Frequency in string, either 'M' for months or 'd' for days (in accordance with [pd.daterange](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html)). 
               
              cross_year: int, optional
              Allows user to select a target period cross year, for example for DJF, the parameters would be:
              start_target_period: str='12-01', end_target_period: str='02-01', cross_year = 1
              Also enables users to select a target period over multiple years. If 1, the target period will be 
              -{cross_year}-11-01 until 00-12-01. In total 13 months.
              
              start_year: int
              Select start year
              
              end_year: int
              Select end year
              
              tfreq: Union[int, str]
              If an integer, aggregate to n-day or n-months means, depending on parameter freq. If freq is 'mean' than a period 
              mean is calculated over the target period.
              
              start_load_date: str, optional
              _defines the start of the data that you will load and aggregated to n-day or n-month bins. default = None.
              If None, all data that is available in the dataset will be loaded, otherwise a subset of the year might be loaded for               
              memory efficiency_     
"""

temporal manipulations and functions in proto

This functionality is currently a bit scattered throughout the proto code.

The function to aggregate the daily or monthly dataset to {tfreq}-day or {tfreq}-month means can be found here:

def time_mean_bins(xr_or_df, tfreq=int, start_end_date=None, start_end_year=None,

And if tfreq == 'mean', a period mean will be calculated, with the period defined by start_target_date and end_target_date:

def time_mean_periods(xr_or_df, start_end_periods=np.ndarray,

The most high-level function that currently contains (only) datetime-index-associated functionality:

proto/RGCPD/core_pp.py

Lines 175 to 196 in 7773493

def xr_core_pp_time(ds, seldates: Union[tuple, pd.DatetimeIndex]=None,
                    start_end_year: tuple=None, loadleap: bool=False,
                    dailytomonths: bool=False):
    ''' Wrapper for some essentials for basic timeslicing and dailytomonthly
    aggregation

    ds : xr.DataArray or xr.Dataset
        input xarray with 'time' dimension
    seldates: tuple, pd.DatetimeIndex, optional
        default is None.
        if type is tuple: selecting data that fits within start- and enddate,
        format ('mm-dd', 'mm-dd'). default is ('01-01' - '12-31')
        if type is pd.DatetimeIndex: select that exact timeindex with
        xarray.sel(time=seldates)
    start_end_year : tuple, optional
        default is to load all years
    loadleap : TYPE, optional
        If True also loads the 29-02 leapdays. The default is False.
    dailytomonths:
        When True, the daily input data will be aggregated to monthly data.
    '''

A similar function should be made for S2Sinit, but with the input (ds) being a pd.DatetimeIndex instead of an xarray dataset.
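A hedged sketch of what that pd.DatetimeIndex analogue could look like. The function name and exact behaviour are assumptions; it mirrors the seldates/start_end_year/loadleap options of xr_core_pp_time quoted above, and assumes the seldates window does not cross the year boundary:

```python
import pandas as pd

def core_pp_time(index: pd.DatetimeIndex, seldates: tuple = None,
                 start_end_year: tuple = None,
                 loadleap: bool = False) -> pd.DatetimeIndex:
    """Basic time slicing on a DatetimeIndex instead of an xarray dataset."""
    if start_end_year is not None:
        start, end = start_end_year
        index = index[(index.year >= start) & (index.year <= end)]
    if seldates is not None:
        # tuple ('mm-dd', 'mm-dd'): keep dates within that window in every year
        start_md, end_md = seldates
        monthday = index.strftime("%m-%d")
        index = index[(monthday >= start_md) & (monthday <= end_md)]
    if not loadleap:
        # drop 29 February, matching loadleap=False in the xarray version
        index = index[~((index.month == 2) & (index.day == 29))]
    return index
```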

cross validation functions in proto

@jannesvaningen wrote this; @semvijverberg, I added it here for the sake of completeness, hope that's okay.

Functionality for cross-validation lives both in the RGCPD class and 'deeper' in functions_pp.

As Yang mentioned, the cross-validation starts with def traintest. Its main inputs are the RGCPD instance and the CV method (leave-one-out, k-fold, etc.).

def traintest(self, method: Union[str, bool]=None, seed=1,

Then, def RV_and_traintest is called within def traintest from here:

proto/RGCPD/class_RGCPD.py

Lines 368 to 372 in 7773493

self.TV, self.df_splits = RV_and_traintest(self.df_fullts,
                                           self.df_RV_ts,
                                           self.traintestgroups,
                                           verbosity=self.verbosity,
                                           **self.kwrgs_traintest)

The function above takes as input self.df_fullts, self.df_RV_ts and self.traintestgroups from the preprocessing function pp_TV. I believe RV (response variable) and TV (target variable) are used interchangeably; @semvijverberg, am I correct, and is this maybe something to align? TV stands for Target Variable, referring to your 'y' variable, in our case usually air temperature. f refers to the Python module functions_pp.

I think this shows that we do some preprocessing steps before determining the cross-validation splits.
So my proposal is to turn it around: adding to the recipe above, we follow:

  1. S2Sinit - initialization
    a. initialization of start end date etc.
    b. cross-validation --> as Sem suggested, this function adds the columns for train/test and target/non-target
  2. S2Spp (or any other name such as climpp) - where we do the preprocessing steps described below with the cv settings given
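A rough, hypothetical sketch of that two-step ordering (all names are placeholders, and leave_n_out is simplified to a single trailing split for illustration):

```python
# Step 1: initialization holds the datetime and CV settings
class S2Sinit:
    def __init__(self, start_year: int, end_year: int, tfreq: int = 7):
        self.start_year, self.end_year, self.tfreq = start_year, end_year, tfreq
        self.splits = None

    def leave_n_out(self, n: int):
        # label the last n years as test, the rest as train
        years = list(range(self.start_year, self.end_year + 1))
        self.splits = [(y, "test" if i >= len(years) - n else "train")
                       for i, y in enumerate(years)]
        return self.splits

# Step 2: preprocessing receives the settings instead of redefining them
class S2Spp:
    def __init__(self, settings: S2Sinit):
        self.settings = settings

settings = S2Sinit(start_year=1979, end_year=2021)
settings.leave_n_out(n=5)
pp = S2Spp(settings)
```

The point of the sketch is only the ordering: the CV settings are created once and then handed to the preprocessing step, rather than the other way around.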

S2Spp
The preprocessing is maybe more for another issue, but here is an explanation of the preprocessing steps for S2Spp. These functions would then be called with the S2Sinit instance.

proto/RGCPD/class_RGCPD.py

Lines 299 to 300 in 7773493

out = f.process_TV(fulltso, **self.kwrgs_pp_TV)
self.df_fullts, self.df_RV_ts, inf, self.traintestgroups = out

The function process_TV does pre-processing steps for the target variable like aggregations, within functions_pp:

proto/RGCPD/functions_pp.py

Lines 151 to 158 in 7773493

def process_TV(fullts, tfreq, start_end_TVdate, start_end_date=None,
               start_end_year=None, RV_detrend=False, RV_anomaly=False,
               ext_annual_to_mon=True, TVdates_aggr: bool=False,
               dailytomonths: bool=False, verbosity=1):
    # fullts=load_TV(list_of_name_path,name_ds=rg.name_TVds)[0]
    # RV_detrend=False;RV_anomaly=False;verbosity=1;
    # ext_annual_to_mon=True;TVdates_aggr=False; start_end_date=None; start_end_year=None
    # dailytomonths=False

To trim it down further, the input 'fulltso' is an output of def load_TV, also from functions_pp. It takes as input the list of path names of the input variables and the name of the dataset variable (standard name 'ts'). It outputs fulltso: an xr.DataArray 1-d full timeseries.

proto/RGCPD/class_RGCPD.py

Lines 290 to 292 in 7773493

f = functions_pp
fulltso, self.hash = f.load_TV(self.list_of_name_path,
                               name_ds=self.name_TVds)

def load_TV(list_of_name_path, name_ds='ts'):
    '''
    function will load first item of list_of_name_path
    list_of_name_path = [('TVname', 'TVpath'), ('prec_name', 'prec_path')]

@Peter9192
Contributor

Thanks a lot for all the input! I think it's getting more concrete, but the API design can still be improved in a couple of ways. For instance:

  • I think it's probably easier if you don't think of everything as a class.
  • S2Sinit doesn't sound like a concrete object to me (though CrossValidator does). What does it do? To me it sounds like time bookkeeping.
  • "adds pd.DataFrame called df_splits to the instance 'set'": why not just return that dataframe instead of mutating the state?
  • start_load_date (defines the start of the data that you will load ...): this concerns data loading. As far as I understand, S2Sinit is concerned with bookkeeping of datetimes. It doesn't need to know about DataLoader. When you want to load the data, you just pass in the start_load_date (or the CV instance). Or, if S2Sinit does indeed include functionality for data loading, you can just pass it straight into the method: set.load_data(start_date=..., end_date=...).

This is not a critique but an attempt to converge to a common understanding of what makes a good API for our package. Perhaps @geek-yang and I can come up with a proposal building on what you wrote above.

@semvijverberg
Member

semvijverberg commented Apr 13, 2022

Hey Peter,

Thanks for the feedback!!

My idea was that the S2Sinit class can create instances that hold all information related to:

  • bookkeeping of datetimes
  • cross-validation

This instance can either be passed to other packages or be used to return relevant manipulations, such as sub-selecting the training data of the target period of the first fold.

Selecting the time period you want to predict, the temporal frequency of your experiment, and the cross-validation should always be the starting point of your analysis. This information needs to be homogenized across all data that will be loaded later in the pipeline, so it would be inefficient to make this part of a data loader and have to retype the datetime-related parameters (start_date, end_date, tfreq, cross_year).

I believe separating the data loading also makes this class very handy for other people who only want to use the datetime functionality that will be built on top of S2Sinit:

  • lag shifting
  • creating n-day, n-month or period means
  • creating a cross-validator that takes into account 'train_test_groups'
  • retrieving only test data or train data.

These can be annoying/time-consuming things to code properly, so I imagine some people might want to use only the S2Sinit functionality.

Having all datetime and CV information in one object makes life easier for the user (less typing). Simple example: I want 10-day means that fall in DJF at lag 2 of training fold 1 (a real user need, e.g. when training crashed on this fold with this lag and you want to check for NaNs). As a stand-alone function this would always need 6 parameters (start_target_date, end_target_date, tfreq, cross_year, df_splits and lag). With the instance it needs only 1 parameter and you can say e.g. set.get_lag_shifted(lag=2).
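The 1-parameter convenience argued for above could be sketched like this. It is a toy illustration under assumed names (the class and method are hypothetical; real lag shifting in the proto code is handled by apply_shift_lag), with the lag expressed as a plain calendar shift of tfreq-day periods:

```python
import pandas as pd

class S2SinitSketch:
    """Toy object bundling datetime/CV bookkeeping, as argued above."""

    def __init__(self, tfreq: int, df_splits: pd.DataFrame):
        self.tfreq = tfreq          # aggregation period length in days
        self.df_splits = df_splits  # train/test and target labelling

    def get_lag_shifted(self, lag: int = 0) -> pd.DataFrame:
        # shift the labelled dates back by `lag` periods of `tfreq` days,
        # so only `lag` remains to be typed by the user
        shifted = self.df_splits.copy()
        shifted.index = shifted.index - pd.Timedelta(days=lag * self.tfreq)
        return shifted

dates = pd.date_range("2000-12-01", "2000-12-31", freq="D")
df_splits = pd.DataFrame({"TargetPeriod": 1}, index=dates)
settings = S2SinitSketch(tfreq=10, df_splits=df_splits)
shifted = settings.get_lag_shifted(lag=2)  # dates shifted back by 20 days
```

The other five parameters (start_target_date, end_target_date, tfreq, cross_year, df_splits) live on the instance, which is exactly the ergonomic point being made.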

@geek-yang
Member Author

geek-yang commented Apr 13, 2022

@semvijverberg and @jannesvaningen, thanks a lot for your explanation. This indeed helps @Peter9192 and me understand your needs and vision for this package. If I'm not missing anything, then according to your description, I think two core components (let's not be bothered by terms like "class" or "methods" for now) must be included in this package:

  • An initializer which holds the key parameters that need to be defined at the very beginning (and will be used for the whole workflow, even later by other postprocessing/analysis packages in AI4S2S)
  • A cross-validation module

In the initializer, you want to specify the input/target datetimes and "preprocess" the data into chunks based on freq. You therefore need functions that are "datetime aware" (where pd.DatetimeIndex plays a role) and can take means at the given freq. In this step we will produce some output data, and the data will be carried along (in memory, or written out; that's another TBD).

Then you allow the user to use the pre-processed data to perform cross-validation in the same package. The "data chunks" and other key parameters are also ready to be passed to a following module.

I think it makes sense to put these two components in one package, but we may want to encapsulate them into two modules (or, to speak Python, classes), since they are a bit different and will each have their own functions (methods) not used by the other.

So we might also want a more explicit name for the whole package, but this can be decided later.

@Peter9192 and I could look into it together and we will propose a draft structure of the package. Then we can have a discussion about it.

Thanks again for the detailed info. That really helps a lot 😄 👍 😆 ! @semvijverberg @jannesvaningen

@geek-yang
Member Author

> (quoting @semvijverberg's comment above in full)
Oops, you are a bit faster than me 😹. But it seems my impression is mostly correct. Thanks for the clarification. @Peter9192 and I will have a design session first and see how far we can go. We will keep you guys updated.

@Peter9192
Contributor

Having all datetime and CV information in one object makes life easier for the user.

I completely agree with that. An alternative could be to use a dataclass to just pass that information, like so:

from dataclasses import dataclass
from typing import Union

@dataclass
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'
    tfreq: Union[int, str] = 7

e = Experiment(start_year=1979, end_year=2021)

>> Experiment(start_year=1979, end_year=2021, target_start='11-01', target_end='12-01', freq='M', tfreq=7)

And from there I can see how you want to add methods to the class. However, it gets a bit confusing when you then start changing the attributes on the object. To be continued...
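One way to avoid that confusion, purely as a sketch, is to have methods return new data rather than set attributes on the instance. The leave_n_out behaviour here is a simplified single trailing split, not the proposed algorithm, and the method name is an assumption:

```python
from dataclasses import dataclass
from typing import Union
import pandas as pd

@dataclass
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'
    tfreq: Union[int, str] = 7

    def leave_n_out(self, n: int) -> pd.DataFrame:
        """Return a train/test labelling per year instead of storing it."""
        years = list(range(self.start_year, self.end_year + 1))
        df = pd.DataFrame({"year": years, "traintest": "train"})
        # mark the last n years as the test group
        df.loc[df.index[-n:], "traintest"] = "test"
        return df

e = Experiment(start_year=1979, end_year=2021)
splits = e.leave_n_out(n=5)
```

The dataclass stays a plain, immutable-in-spirit bag of settings, and the caller decides what to do with the returned splits.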
