
Refactor cross-validator #3

Open
geek-yang opened this issue Mar 28, 2022 · 8 comments
Comments

@geek-yang
Member

geek-yang commented Mar 28, 2022

Cross-validator should be date-time aware to avoid train/test information leakage. Currently the cross-validator is separated from date-time target information and lags.

Currently, date-time target information is decided here:

start_end_TVdate=None,
tfreq: int=10,
start_end_date: Tuple[str, str]=None,
start_end_year: Tuple[int, int]=None,

and train-test information is decided here:

def traintest(self, method: Union[str, bool]=None, seed=1,

When all of this information is known, this function is called:

proto/RGCPD/functions_pp.py

Lines 1382 to 1384 in 0dd397f

def cross_validation(RV_ts, traintestgroups=None, test_yrs=None, method=str,
                     seed=None, gap_prior: int=None, gap_after: int=None):
    #%%

These functions are used for lag shifting:

def apply_shift_lag(fit_masks, lag_i):

def _check_y_fitmask(fit_masks, lag_i, base_lag):

I'm not 100% sure whether the following is still used in the code:

def func_dates_min_lag(dates, lag):

Alternatively, we could create a new package/module that would be able to combine all information on datetime target information, lags and train/test information.

I propose to call it: S2SCV

@Peter9192
Contributor

Let's try to make this more concrete. What would using this package look like? Something like this?

from s2scv import CrossValidator

cv = CrossValidator(start_date, end_date, freq='M')
# returns instance of CrossValidator

cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns a pandas dataframe with a column that labels whether an entry is part of the train/test group

# or e.g. with an alternative interface:
cv.traintest.leave_n_out(n=5)
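To make the discussion concrete, here is a minimal, hypothetical sketch of what such a `CrossValidator` could look like. All names are assumptions, and the single train/test split is a simplification, not the proposed leave-n-out algorithm:

```python
import pandas as pd

class CrossValidator:
    """Hypothetical sketch of the interface discussed above."""

    def __init__(self, start_date: str, end_date: str, freq: str = "MS"):
        # build the datetime index the splits will be defined on
        self.index = pd.date_range(start_date, end_date, freq=freq)

    def get_train_test(self, method: str = "leave_n_out",
                       options: dict = None) -> pd.DataFrame:
        if method != "leave_n_out":
            raise NotImplementedError(method)
        n = (options or {}).get("n", 1)
        df = pd.DataFrame(index=self.index)
        # label every entry as train, then mark the last n entries as test
        df["traintest"] = "train"
        df.iloc[-n:, df.columns.get_loc("traintest")] = "test"
        return df

cv = CrossValidator("2000-01-01", "2000-12-31", freq="MS")
splits = cv.get_train_test(method="leave_n_out", options={"n": 5})
```

A real leave-n-out implementation would of course produce one labelling per fold rather than a single split; the sketch only illustrates the shape of the returned DataFrame.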

@geek-yang
Member Author

Based on @Peter9192's suggestion, I think we can come up with a roadmap simply by coding. We can act as real users and write code the way we typically would in our Python scripts.

Import library and create an instance of cross validator

from s2scv import CrossValidator

# input variable
# start_date: str='mm-dd', end_date: str="mm-dd", freq: str='M'
cv = CrossValidator(start_date, end_date, freq='M')
# return object (class CrossValidator)

And link the legacy code, if there is any. For instance:

proto/RGCPD/functions_pp.py

Lines 1382 to 1383 in 0dd397f

def cross_validation(RV_ts, traintestgroups=None, test_yrs=None, method=str,
                     seed=None, gap_prior: int=None, gap_after: int=None):

Split dataset for training and testing

# input variables
# method: str='leave_n_out', options: dict={'n': 5}
cv.get_train_test(method='leave_n_out', options={'n': 5})
# returns a numpy.array

Legacy code

def traintest(self, method: Union[str, bool]=None, seed=1,

We will continue until we have listed every method/functionality we can think of.

This way, we will have a complete picture of each building block we want to include in our package. This forms the skeleton, and we can start coding and filling in these functions (just like completing unit tests one by one). It also helps to put things down based on the schematic @semvijverberg made in #2 and makes our communication easier.

@jannesvaningen and @semvijverberg, could you please brainstorm a bit and finish this recipe?

@semvijverberg
Member

semvijverberg commented Apr 11, 2022

Okay, hereby my attempt to make this more concrete! Maybe there will be some confusion with the naming, as the package/class should become more than just a CrossValidator. It also holds information on the target dates, as described above. Note that this information is also needed to verify that the cross-validation is done properly (avoiding train/test leakage due to auto-correlation or lag-shifting). I therefore suggest calling it S2Sinit.

recipe attempt

import S2Sinit

set = S2Sinit(start_target_date='01-01', end_target_date='02-01', freq='M',
              cross_year=False,
              start_load_date=None, end_load_date=None,
              start_year=..., end_year=...,
              tfreq=7)

set.traintest.leave_n_out(n=5, max_lag=None)

-> adds a pd.DataFrame called df_splits to the instance 'set'
df_splits contains two columns:
The first column ('CV') contains integers that indicate whether a date is part of training (1), testing (0), or not used (-1).
The second column ('TargetPeriod') contains integers that indicate whether the date is part of the target period (1) or not (0).
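As an illustration only (the dates and the split boundary are chosen arbitrarily, and string labels replace the 1/0/-1 bookkeeping for brevity where noted), the df_splits layout described above could be constructed like this:

```python
import pandas as pd

# daily index covering one NDJF season (arbitrary example dates)
dates = pd.date_range("2000-11-01", "2001-02-28", freq="D")
df_splits = pd.DataFrame(index=dates)

# 'CV': 1 = training, 0 = testing, -1 = not used
df_splits["CV"] = 1
df_splits.loc["2001-01-01":, "CV"] = 0

# 'TargetPeriod': 1 if the date falls inside the target period (here December), else 0
in_target = (dates >= "2000-12-01") & (dates <= "2000-12-31")
df_splits["TargetPeriod"] = in_target.astype(int)
```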

S2Sinit class

The Init class would look like this:

class S2Sinit:
    def __init__(self,
                 start_target_period: str='11-01', end_target_period: str='12-01', freq: str='M',
                 cross_year: int=None,
                 start_year: int=None, end_year: int=None,
                 tfreq: Union[int, str]=7,
                 start_load_date: str=None):

              """
              Class to initialize parameters that define the dates to load and the target dates.
              
              start_target_date: str, optional
              _defines the start of your target period, default = '11-01'_
              
              end_target_period: str, optional
              _defines the end of your target period, default = '12-01'_
              
              freq: str
              Frequency in string, either 'M' for months or 'd' for days (in accordance with [pd.daterange](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html)). 
               
              cross_year: int, optional
              Allows user to select a target period cross year, for example for DJF, the parameters would be:
              start_target_period: str='12-01', end_target_period: str='02-01', cross_year = 1
              Also enables users to select a target period over multiple years. If 1, the target period will be 
              -{cross_year}-11-01 until 00-12-01. In total 13 months.
              
              start_year: int
              Select start year
              
              end_year: int
              Select end year
              
              tfreq: Union[int, str]
              If an integer, aggregate to n-day or n-months means, depending on parameter freq. If freq is 'mean' than a period 
              mean is calculated over the target period.
              
              start_load_date: str, optional
              _defines the start of the data that you will load and aggregated to n-day or n-month bins. default = None.
              If None, all data that is available in the dataset will be loaded, otherwise a subset of the year might be loaded for               
              memory efficiency_     
"""

temporal manipulations and functions in proto

This functionality is currently a bit scattered throughout the proto code.

The function to aggregate the daily or monthly dataset to {tfreq}-day or {tfreq}-month means can be found here:

def time_mean_bins(xr_or_df, tfreq=int, start_end_date=None, start_end_year=None,

And if tfreq == 'mean', a period mean will be calculated, with the period defined by start_target_date and end_target_date:

def time_mean_periods(xr_or_df, start_end_periods=np.ndarray,

The most high-level function that currently contains (only) datetime-index-associated functionality:

proto/RGCPD/core_pp.py

Lines 175 to 196 in 7773493

def xr_core_pp_time(ds, seldates: Union[tuple, pd.DatetimeIndex]=None,
                    start_end_year: tuple=None, loadleap: bool=False,
                    dailytomonths: bool=False):
    ''' Wrapper for some essentials for basic timeslicing and dailytomonthly
    aggregation

    ds : xr.DataArray or xr.Dataset
        input xarray with 'time' dimension
    seldates: tuple, pd.DatetimeIndex, optional
        default is None.
        if type is tuple: selecting data that fits within start- and enddate,
        format ('mm-dd', 'mm-dd'). default is ('01-01' - '12-31')
        if type is pd.DatetimeIndex: select that exact timeindex with
        xarray.sel(time=seldates)
    start_end_year : tuple, optional
        default is to load all years
    loadleap : TYPE, optional
        If True also loads the 29-02 leapdays. The default is False.
    dailytomonths:
        When True, the daily input data will be aggregated to monthly data.
    '''

A similar function should be made for S2Sinit, but with the input (ds) being a pd.DatetimeIndex instead of an xarray dataset.
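A hedged sketch of what that pd.DatetimeIndex analogue could look like. The function name and exact behaviour are assumptions; it mirrors the seldates/start_end_year/loadleap options of xr_core_pp_time quoted above, and assumes the seldates window does not cross the year boundary:

```python
import pandas as pd

def core_pp_time(index: pd.DatetimeIndex, seldates: tuple = None,
                 start_end_year: tuple = None,
                 loadleap: bool = False) -> pd.DatetimeIndex:
    """Basic time slicing on a DatetimeIndex instead of an xarray dataset."""
    if start_end_year is not None:
        start, end = start_end_year
        index = index[(index.year >= start) & (index.year <= end)]
    if seldates is not None:
        # tuple ('mm-dd', 'mm-dd'): keep dates within that window in every year
        start_md, end_md = seldates
        monthday = index.strftime("%m-%d")
        index = index[(monthday >= start_md) & (monthday <= end_md)]
    if not loadleap:
        # drop 29 February, matching loadleap=False in the xarray version
        index = index[~((index.month == 2) & (index.day == 29))]
    return index
```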

cross validation functions in proto

@jannesvaningen wrote this; @semvijverberg, I added it here for the sake of completeness, hope that's okay.

Functionality for cross-validation lives both in the RGCPD class and 'deeper' in functions_pp.

As Yang mentioned, the cross-validation starts with def traintest. Its main inputs are the RGCPD instance and the CV method (leave-one-out, k-fold, etc.).

def traintest(self, method: Union[str, bool]=None, seed=1,

Then, def RV_and_traintest is called within def traintest from here:

proto/RGCPD/class_RGCPD.py

Lines 368 to 372 in 7773493

self.TV, self.df_splits = RV_and_traintest(self.df_fullts,
                                           self.df_RV_ts,
                                           self.traintestgroups,
                                           verbosity=self.verbosity,
                                           **self.kwrgs_traintest)

The function above takes as input self.df_fullts, self.df_RV_ts and self.traintestgroups from the preprocessing function pp_TV. I believe RV (response variable) and TV (target variable) are used interchangeably; @semvijverberg, am I correct, and is this maybe something to align? TV stands for Target Variable, referring to your 'y' variable, in our case usually air temperature. f refers to the Python module functions_pp.

I think this shows that we do some preprocessing steps before determining the cross-validation splits.
So my proposal is to turn it around: adding to the recipe above, we follow:

  1. S2Sinit - initialization
    a. initialization of start end date etc.
    b. cross-validation --> as Sem suggested, this function adds the columns for train/test and target/non-target
  2. S2Spp (or any other name such as climpp) - where we do the preprocessing steps described below with the cv settings given
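A rough, hypothetical sketch of that two-step ordering (all names are placeholders, and leave_n_out is simplified to a single trailing split for illustration):

```python
# Step 1: initialization holds the datetime and CV settings
class S2Sinit:
    def __init__(self, start_year: int, end_year: int, tfreq: int = 7):
        self.start_year, self.end_year, self.tfreq = start_year, end_year, tfreq
        self.splits = None

    def leave_n_out(self, n: int):
        # label the last n years as test, the rest as train
        years = list(range(self.start_year, self.end_year + 1))
        self.splits = [(y, "test" if i >= len(years) - n else "train")
                       for i, y in enumerate(years)]
        return self.splits

# Step 2: preprocessing receives the settings instead of redefining them
class S2Spp:
    def __init__(self, settings: S2Sinit):
        self.settings = settings

settings = S2Sinit(start_year=1979, end_year=2021)
settings.leave_n_out(n=5)
pp = S2Spp(settings)
```

The point of the sketch is only the ordering: the CV settings are created once and then handed to the preprocessing step, rather than the other way around.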

S2Spp
The preprocessing is maybe more for another issue, but here is an explanation of the preprocessing steps for S2Spp. These functions would then be called with the S2Sinit instance.

proto/RGCPD/class_RGCPD.py

Lines 299 to 300 in 7773493

out = f.process_TV(fulltso, **self.kwrgs_pp_TV)
self.df_fullts, self.df_RV_ts, inf, self.traintestgroups = out

The function process_TV does pre-processing steps for the target variable like aggregations, within functions_pp:

proto/RGCPD/functions_pp.py

Lines 151 to 158 in 7773493

def process_TV(fullts, tfreq, start_end_TVdate, start_end_date=None,
               start_end_year=None, RV_detrend=False, RV_anomaly=False,
               ext_annual_to_mon=True, TVdates_aggr: bool=False,
               dailytomonths: bool=False, verbosity=1):
    # fullts=load_TV(list_of_name_path,name_ds=rg.name_TVds)[0]
    # RV_detrend=False;RV_anomaly=False;verbosity=1;
    # ext_annual_to_mon=True;TVdates_aggr=False; start_end_date=None; start_end_year=None
    # dailytomonths=False

To trim it down further, the input 'fulltso' is an output of def load_TV, also from functions_pp. It takes as input the list of path names of the input variables and the name of the dataset variable (standard name 'ts'). It outputs fulltso: an xr.DataArray 1-d full timeseries.

proto/RGCPD/class_RGCPD.py

Lines 290 to 292 in 7773493

f = functions_pp
fulltso, self.hash = f.load_TV(self.list_of_name_path,
                               name_ds=self.name_TVds)

def load_TV(list_of_name_path, name_ds='ts'):
    '''
    function will load first item of list_of_name_path
    list_of_name_path = [('TVname', 'TVpath'), ('prec_name', 'prec_path')]

@Peter9192
Contributor

Thanks a lot for all the input! I think it's getting more concrete, but the API design can still be improved in a couple of ways. For instance:

  • I think it's probably easier if you don't think of everything as a class.
  • S2Sinit doesn't sound like a concrete object to me (though CrossValidator does). What does it do? To me it sounds like time bookkeeping.
  • "adds pd.DataFrame called df_splits to the instance 'set'": why not just return that dataframe instead of mutating the state?
  • start_load_date (defines the start of the data that you will load ...): this concerns data loading. As far as I understand, S2Sinit is concerned with bookkeeping of datetimes. It doesn't need to know about DataLoader. When you want to load the data, you just pass in the start_load_date (or the CV instance). Or, if S2Sinit does indeed include functionality for data loading, you can just pass it straight into the method: set.load_data(start_date=..., end_date=...).

This is not a critique but an attempt to converge to a common understanding of what makes a good API for our package. Perhaps @geek-yang and I can come up with a proposal building on what you wrote above.

@semvijverberg
Member

semvijverberg commented Apr 13, 2022

Hey Peter,

Thanks for the feedback!!

My idea was that the S2Sinit class can create instances that hold all information related to:

  • bookkeeping of datetimes
  • cross-validation

This instance can either be passed to other packages or be used to return relevant manipulations, such as sub-selecting the training data of the target period of the first fold.

Selecting the time period you want to predict, the temporal frequency of your experiment, and the cross-validation should always be the starting point of your analysis. This information needs to be homogenized across all data that will be loaded later in the pipeline, so it would be inefficient to make this part of a data loader and have to retype the datetime-related parameters (start_date, end_date, tfreq, cross_year).

I believe separating the data loading also makes this class very handy for other people who only want to use the datetime functionality that will be built on top of S2Sinit:

  • lag shifting
  • creating n-day, n-month or period means
  • creating a cross-validator that takes into account 'train_test_groups'
  • retrieving only test data or train data.

These can be annoying/time-consuming things to code properly, so I imagine some people might want to use only the S2Sinit functionality.

Having all datetime and CV information in one object makes life easier for the user (less typing). Simple example: I want 10-day means that fall in DJF at lag 2 of training fold 1 (a real user need, e.g. when training crashed on this fold with this lag and you want to check for NaNs). As a stand-alone function this would always need 6 parameters (start_target_date, end_target_date, tfreq, cross_year, df_splits and lag). With the instance it needs only 1 parameter and you can say e.g. set.get_lag_shifted(lag=2).
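The 1-parameter convenience argued for above could be sketched like this. It is a toy illustration under assumed names (the class and method are hypothetical; real lag shifting in the proto code is handled by apply_shift_lag), with the lag expressed as a plain calendar shift of tfreq-day periods:

```python
import pandas as pd

class S2SinitSketch:
    """Toy object bundling datetime/CV bookkeeping, as argued above."""

    def __init__(self, tfreq: int, df_splits: pd.DataFrame):
        self.tfreq = tfreq          # aggregation period length in days
        self.df_splits = df_splits  # train/test and target labelling

    def get_lag_shifted(self, lag: int = 0) -> pd.DataFrame:
        # shift the labelled dates back by `lag` periods of `tfreq` days,
        # so only `lag` remains to be typed by the user
        shifted = self.df_splits.copy()
        shifted.index = shifted.index - pd.Timedelta(days=lag * self.tfreq)
        return shifted

dates = pd.date_range("2000-12-01", "2000-12-31", freq="D")
df_splits = pd.DataFrame({"TargetPeriod": 1}, index=dates)
settings = S2SinitSketch(tfreq=10, df_splits=df_splits)
shifted = settings.get_lag_shifted(lag=2)  # dates shifted back by 20 days
```

The other five parameters (start_target_date, end_target_date, tfreq, cross_year, df_splits) live on the instance, which is exactly the ergonomic point being made.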

@geek-yang
Member Author

geek-yang commented Apr 13, 2022

@semvijverberg and @jannesvaningen, thanks a lot for your explanation. This indeed helps @Peter9192 and me understand your needs and vision for this package. If I'm not missing anything, then according to your description, I think two core components (let's not be bothered by terms like "class" or "methods" for now) must be included in this package:

  • An initializer which holds the key parameters that need to be defined at the very beginning (and will be used for the whole workflow, even later by other postprocessing/analysis packages in AI4S2S)
  • A cross-validation module

In the initializer, you want to specify the input/target datetimes and "preprocess" the data into chunks based on freq. You therefore need functions that are "datetime aware" (where pd.DatetimeIndex plays a role) and can take means at the given freq. In this step we will produce some output data, and the data will be carried along (in memory, or written out; that's another TBD).

Then you allow the user to use the pre-processed data to perform cross-validation in the same package. The "data chunks" and other key parameters are also ready to be passed to a following module.

I think it makes sense to put these two components in one package, but we may want to encapsulate them into two modules (or, to speak Python, classes), since they are a bit different and will each have their own functions (methods) not used by the other.

So we might also want a more explicit name for the whole package, but this can be decided later.

@Peter9192 and I could look into it together and we will propose a draft structure of the package. Then we can have a discussion about it.

Thanks again for the detailed info. That really helps a lot 😄 👍 😆 ! @semvijverberg @jannesvaningen

@geek-yang
Member Author

> (quoting @semvijverberg's comment above in full)
Oops, you are a bit faster than me 😹. But it seems my impression is mostly correct. Thanks for the clarification. @Peter9192 and I will have a design session first and see how far we can go. We will keep you guys updated.

@Peter9192
Contributor

Having all datetime and CV information in one object makes life easier for the user.

I completely agree with that. An alternative could be to use a dataclass to just pass that information, like so:

from dataclasses import dataclass
from typing import Union

@dataclass
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'
    tfreq: Union[int, str] = 7

e = Experiment(start_year=1979, end_year=2021)

>> Experiment(start_year=1979, end_year=2021, target_start='11-01', target_end='12-01', freq='M', tfreq=7)

And from there I can see how you want to add methods to the class. However, it gets a bit confusing when you then start changing the attributes on the object. To be continued...
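One way to avoid that confusion, purely as a sketch, is to have methods return new data rather than set attributes on the instance. The leave_n_out behaviour here is a simplified single trailing split, not the proposed algorithm, and the method name is an assumption:

```python
from dataclasses import dataclass
from typing import Union
import pandas as pd

@dataclass
class Experiment:
    start_year: int
    end_year: int
    target_start: str = '11-01'
    target_end: str = '12-01'
    freq: str = 'M'
    tfreq: Union[int, str] = 7

    def leave_n_out(self, n: int) -> pd.DataFrame:
        """Return a train/test labelling per year instead of storing it."""
        years = list(range(self.start_year, self.end_year + 1))
        df = pd.DataFrame({"year": years, "traintest": "train"})
        # mark the last n years as the test group
        df.loc[df.index[-n:], "traintest"] = "test"
        return df

e = Experiment(start_year=1979, end_year=2021)
splits = e.leave_n_out(n=5)
```

The dataclass stays a plain, immutable-in-spirit bag of settings, and the caller decides what to do with the returned splits.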
