feat: use `dateutils.relativedelta` instead of `timedelta` #64

FBruzzesi · 2024-11-09T11:40:46Z

Description

This makes it possible to work with month and year time frequencies.

Closes #63 and #8
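
For context, a minimal sketch of why the switch matters. It is illustrative only and not taken from the PR diff: `timedelta` only accepts fixed-length units, while `relativedelta` (from the `dateutil` package) does calendar arithmetic. The variable names are made up for the example.

```python
from datetime import datetime, timedelta

from dateutil.relativedelta import relativedelta

start = datetime(2011, 1, 31)

# timedelta has no notion of calendar months: the keyword is rejected.
try:
    timedelta(months=1)
except TypeError as exc:
    months_error = str(exc)

# relativedelta shifts by calendar units and clamps to the last valid day.
one_month_later = start + relativedelta(months=1)
one_year_later = start + relativedelta(years=1)

print(one_month_later)  # 2011-02-28 00:00:00
print(one_year_later)   # 2012-01-31 00:00:00
```

Because a month is not a fixed number of days, this kind of offset cannot be expressed as a single `timedelta`, which is presumably why the earlier frequency list stopped at weeks.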

FBruzzesi · 2024-11-09T11:48:46Z

@mdancho84 fancy taking a look whenever you have the time?

mdancho84 · 2024-11-12T15:02:59Z

Sorry must have missed this message. I'll install and test out with pytimetk.

mdancho84 · 2024-11-12T15:15:06Z

I just updated to feat/relativedelta and saw no breaking changes.

I'll test the month in a minute and report back.

FBruzzesi · 2024-11-12T15:18:51Z

Hey @mdancho84

> Sorry must have missed this message

I should avoid pinging during weekends 🙈

> I'll test the month in a minute and report back.

Sure thanks! There is no rush 😁 I could also follow a similar approach to what we do in Narwhals and run the pytimetk test suite in CI as a downstream library. I will open an issue to keep track of that

mdancho84 · 2024-11-12T15:20:53Z

That would be great just to make sure my examples don't fail.

No worries about weekends. I get slammed by emails regardless. It was my bad.

mdancho84 · 2024-11-12T15:23:03Z

I'm seeing a failure with 'months':
ValueError: frequency must be one of ('days', 'seconds', 'microseconds', 'milliseconds', 'minutes', 'hours', 'weeks'). Found months

# imports
import numpy as np
import pandas as pd
import pytimetk as tk

# Get data
df = tk.datasets.load_dataset('bike_sales_sample')

df['order_date'] = pd.to_datetime(df['order_date'])

df.glimpse()

# aggregate sales by month
sales_by_month = df \
    .groupby('category_2') \
    .summarize_by_time(
        date_column = 'order_date',
        value_column = 'total_price',
        agg_func = ['sum'],
        freq = 'MS'
    )

sales_by_month

# make cross validation sets

from pytimetk import TimeSeriesCV

tscv = TimeSeriesCV(
    frequency="months",
    train_size=24,
    forecast_horizon=12,
    gap=12,
)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File Untitled-1:1
----> 1 tscv = TimeSeriesCV(
      2     frequency="months",
      3     train_size=24,
      4     forecast_horizon=12,
      5     gap=12,
      6 )

File ~/Desktop/software/pytimetk/src/pytimetk/crossvalidation/time_series_cv.py:153, in TimeSeriesCV.__init__(self, frequency, train_size, forecast_horizon, gap, stride, window, mode, split_limit, **kwargs)
    140 def __init__(
    141     self,
    142     frequency: str,
   (...)
    151 ):
    152     # Initialize the parent class
--> 153     super().__init__(
    154         frequency = frequency,
    155         train_size = train_size,
    156         forecast_horizon = forecast_horizon,
    157         gap = gap,
    158         stride = stride,
    159         window = window,
    160         mode=mode,
    161         **kwargs
    162     )
    164     self.split_limit = split_limit

File ~/opt/anaconda3/envs/pytimetk/lib/python3.9/site-packages/timebasedcv/core.py:111, in _CoreTimeBasedSplit.__init__(self, frequency, train_size, forecast_horizon, gap, stride, window, mode)
    108 self.window_ = window
    109 self.mode_ = mode
--> 111 self._validate_arguments()

File ~/opt/anaconda3/envs/pytimetk/lib/python3.9/site-packages/timebasedcv/core.py:118, in _CoreTimeBasedSplit._validate_arguments(self)
    116 if self.frequency_ not in _frequency_values:
    117     msg = f"`frequency` must be one of {_frequency_values}. Found {self.frequency_}"
--> 118     raise ValueError(msg)
    120 # Validate window
    121 if self.window_ not in _window_values:

ValueError: `frequency` must be one of ('days', 'seconds', 'microseconds', 'milliseconds', 'minutes', 'hours', 'weeks'). Found months

mdancho84 · 2024-11-12T15:27:33Z

Let me double-check that the branch was installed.

mdancho84 · 2024-11-12T15:35:31Z

That was the problem. The package hadn't been upgraded. The solution was to uninstall and reinstall from your last commit on the branch.

mdancho84 · 2024-11-12T15:59:18Z

I am running into issues with the 'months' test I've put together inside of pytimetk. I have 1 year of data (12 months), and with the specification below (tscv) I'd expect 2 splits.

Note - My pytimetk tests for daily still work fine.

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

Test:

# imports
import numpy as np
import pandas as pd
import pytimetk as tk

# Get data
df = tk.datasets.load_dataset('bike_sales_sample')

df['order_date'] = pd.to_datetime(df['order_date'])

df.glimpse()

# aggregate sales by month
sales_by_month = df \
    .groupby('category_2') \
    .summarize_by_time(
        date_column = 'order_date',
        value_column = 'total_price',
        agg_func = ['sum'],
        freq = 'MS'
    )

sales_by_month \
    .groupby('category_2') \
    .plot_timeseries("order_date", "total_price_sum", smooth=False, plotly_dropdown = True)

# Set index
df = sales_by_month.copy()

df.set_index("order_date", inplace=True)

# Create an X dataframe and y series
X, y = df.loc[:, ["category_2"]], df["total_price_sum"]
X
y

# make cross validation sets

from pytimetk import TimeSeriesCV

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

splits = tscv.split(X, y)

for i, (X_train, X_forecast, y_train, y_forecast) in enumerate(list(splits)):
    
    print(f"Split {i+1}")
    print(X_train)
    print(X_forecast)

tscv.glimpse(y)

tscv.plot(X,y)

Output

The output from printing the splits suggests it's only making 1 split:

Split 1
                    category_2
order_date                    
2011-03-01  Cross Country Race
2011-04-01  Cross Country Race
2011-05-01  Cross Country Race
2011-06-01  Cross Country Race
2011-07-01  Cross Country Race
2011-08-01  Cross Country Race
2011-03-01          Cyclocross
2011-04-01          Cyclocross
2011-05-01          Cyclocross
2011-06-01          Cyclocross
2011-07-01          Cyclocross
2011-08-01          Cyclocross
2011-03-01          Elite Road
2011-04-01          Elite Road
2011-05-01          Elite Road
2011-06-01          Elite Road
2011-07-01          Elite Road
2011-08-01          Elite Road
2011-03-01      Endurance Road
2011-04-01      Endurance Road
2011-05-01      Endurance Road
2011-06-01      Endurance Road
2011-07-01      Endurance Road
2011-08-01      Endurance Road
2011-03-01            Fat Bike
2011-04-01            Fat Bike
2011-05-01            Fat Bike
2011-06-01            Fat Bike
2011-07-01            Fat Bike
2011-08-01            Fat Bike
2011-03-01       Over Mountain
2011-04-01       Over Mountain
2011-05-01       Over Mountain
2011-06-01       Over Mountain
2011-07-01       Over Mountain
2011-08-01       Over Mountain
2011-03-01               Sport
2011-04-01               Sport
2011-05-01               Sport
2011-06-01               Sport
2011-07-01               Sport
2011-08-01               Sport
2011-03-01               Trail
2011-04-01               Trail
2011-05-01               Trail
2011-06-01               Trail
2011-07-01               Trail
2011-08-01               Trail
2011-03-01          Triathalon
2011-04-01          Triathalon
2011-05-01          Triathalon
2011-06-01          Triathalon
2011-07-01          Triathalon
2011-08-01          Triathalon
                    category_2
order_date                    
2011-09-01  Cross Country Race
2011-10-01  Cross Country Race
2011-11-01  Cross Country Race
2011-09-01          Cyclocross
2011-10-01          Cyclocross
2011-11-01          Cyclocross
2011-09-01          Elite Road
2011-10-01          Elite Road
2011-11-01          Elite Road
2011-09-01      Endurance Road
2011-10-01      Endurance Road
2011-11-01      Endurance Road
2011-09-01            Fat Bike
2011-10-01            Fat Bike
2011-11-01            Fat Bike
2011-09-01       Over Mountain
2011-10-01       Over Mountain
2011-11-01       Over Mountain
2011-09-01               Sport
2011-10-01               Sport
2011-11-01               Sport
2011-09-01               Trail
2011-10-01               Trail
2011-11-01               Trail
2011-09-01          Triathalon
2011-10-01          Triathalon
2011-11-01          Triathalon

FBruzzesi · 2024-11-12T16:26:56Z

It seems that the data has one year only:

df.index.min(), df.index.max()
(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-12-01 00:00:00'))

mdancho84 · 2024-11-12T16:38:05Z

Yes, is that a problem?

FBruzzesi · 2024-11-12T16:47:56Z

Whoops, sorry, I thought it was a 1-year frequency. Ignore me.

FBruzzesi · 2024-11-12T19:32:53Z

So the reason is that there are 11 months between the min and max dates, while each split needs a 6-month training window and a 3-month forecast horizon (and the stride is 3 months as well). The second split would therefore have to start on 2010-12-01, which is before the min date, so the loop exits.

For mode="backward" I want the full train size to always be guaranteed.

If you were to compute this in forward mode, you would get 2 splits, the second of which has a test size of 2 months (from 2011-10-01 to 2011-12-01).

One way to get 2 splits in backward mode is to specify the end date in the .split method:

from datetime import datetime

...

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

splits = tscv.split(X, y, end_dt=datetime(2012, 1, 1))

for i, _ in enumerate(list(splits)):
    print(f"Split {i+1}")

Split 1
Split 2

mdancho84 · 2024-11-12T19:58:28Z

Ok that's interesting. Thanks for looking into it.

All of the original examples I put together are working.

I'll play around with it and see if there's anything else.

But this looks great. Thanks so much for adding the new frequencies.

FBruzzesi added 2 commits November 9, 2024 12:35

feat: use relativedelta instead of timedelta

f381419

docstring update

3467f3d

Merge branch 'main' into feat/relativedelta

2d774a6
