feat: use dateutil.relativedelta instead of timedelta #64

Open
wants to merge 3 commits into
base: main

FBruzzesi
Owner

Description

This enables working with month and year time frequencies.

Closes #63 and #8
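
As a rough illustration of why the switch matters (a minimal sketch, not code from this PR): `datetime.timedelta` only accepts fixed-length units, while `dateutil.relativedelta.relativedelta` understands calendar units such as months and years. The start date below is made up for the example.

```python
from datetime import datetime, timedelta

from dateutil.relativedelta import relativedelta

# Hypothetical start date, chosen to show end-of-month handling.
start = datetime(2011, 1, 31)

# timedelta has no notion of months, so the keyword is rejected.
try:
    timedelta(months=1)
except TypeError as exc:
    print(f"timedelta cannot express months: {exc}")

# relativedelta shifts by calendar units and clamps to the last valid day.
one_month_later = start + relativedelta(months=1)
one_year_later = start + relativedelta(years=1)

print(one_month_later)  # 2011-02-28 00:00:00
print(one_year_later)   # 2012-01-31 00:00:00
```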

@FBruzzesi
Owner Author

@mdancho84 fancy taking a look whenever you have the time?

@mdancho84

Sorry must have missed this message. I'll install and test out with pytimetk.

@mdancho84

I just updated to feat/relativedelta and saw no breaking changes.

I'll test the month in a minute and report back.

@FBruzzesi
Owner Author

Hey @mdancho84

> Sorry must have missed this message

I should avoid pinging during weekends 🙈

> I'll test the month in a minute and report back.

Sure thanks! There is no rush 😁 I could also follow a similar approach to what we do in Narwhals and run the pytimetk test suite in CI as a downstream library. I will open an issue to keep track of that

@mdancho84

That would be great just to make sure my examples don't fail.

No worries about weekends. I get slammed by emails regardless. It was my bad.

@mdancho84

mdancho84 commented Nov 12, 2024

I'm seeing a failure with 'months':
ValueError: `frequency` must be one of ('days', 'seconds', 'microseconds', 'milliseconds', 'minutes', 'hours', 'weeks'). Found months

# imports
import numpy as np
import pandas as pd
import pytimetk as tk

# Get data
df = tk.datasets.load_dataset('bike_sales_sample')

df['order_date'] = pd.to_datetime(df['order_date'])

df.glimpse()

# aggregate sales by month
sales_by_month = df \
    .groupby('category_2') \
    .summarize_by_time(
        date_column = 'order_date',
        value_column = 'total_price',
        agg_func = ['sum'],
        freq = 'MS'
    )

sales_by_month

# make cross validation sets

from pytimetk import TimeSeriesCV

tscv = TimeSeriesCV(
    frequency="months",
    train_size=24,
    forecast_horizon=12,
    gap=12,
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File Untitled-1:1
----> 1 tscv = TimeSeriesCV(
      2     frequency="months",
      3     train_size=24,
      4     forecast_horizon=12,
      5     gap=12,
      6 )

File ~/Desktop/software/pytimetk/src/pytimetk/crossvalidation/time_series_cv.py:153, in TimeSeriesCV.__init__(self, frequency, train_size, forecast_horizon, gap, stride, window, mode, split_limit, **kwargs)
    140 def __init__(
    141     self,
    142     frequency: str,
   (...)
    151 ):
    152     # Initialize the parent class
--> 153     super().__init__(
    154         frequency = frequency,
    155         train_size = train_size,
    156         forecast_horizon = forecast_horizon,
    157         gap = gap,
    158         stride = stride,
    159         window = window,
    160         mode=mode,
    161         **kwargs
    162     )
    164     self.split_limit = split_limit

File ~/opt/anaconda3/envs/pytimetk/lib/python3.9/site-packages/timebasedcv/core.py:111, in _CoreTimeBasedSplit.__init__(self, frequency, train_size, forecast_horizon, gap, stride, window, mode)
    108 self.window_ = window
    109 self.mode_ = mode
--> 111 self._validate_arguments()

File ~/opt/anaconda3/envs/pytimetk/lib/python3.9/site-packages/timebasedcv/core.py:118, in _CoreTimeBasedSplit._validate_arguments(self)
    116 if self.frequency_ not in _frequency_values:
    117     msg = f"`frequency` must be one of {_frequency_values}. Found {self.frequency_}"
--> 118     raise ValueError(msg)
    120 # Validate window
    121 if self.window_ not in _window_values:

ValueError: `frequency` must be one of ('days', 'seconds', 'microseconds', 'milliseconds', 'minutes', 'hours', 'weeks'). Found months

@mdancho84

Let me double-check that the branch was installed.

@mdancho84

That was the problem: the package hadn't actually been upgraded. The solution was to uninstall and reinstall from your latest commit/branch.
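
A quick way to see what is actually installed in the active environment is the standard library's `importlib.metadata`. This is only a sketch: `installed_version` is a hypothetical helper, and the distribution name `timebasedcv` is taken from the `site-packages` path in the traceback. If the branch does not bump the version number, the version string alone may not distinguish it from the released package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version("timebasedcv"))
```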

@mdancho84

mdancho84 commented Nov 12, 2024

I am running into issues with the 'month' test I've put together inside of pytimetk. I have 1 year of data (12 months), and with the specification below (tscv) I'd expect 2 splits.

Note - My pytimetk tests for daily still work fine.

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

Test:

# imports
import numpy as np
import pandas as pd
import pytimetk as tk

# Get data
df = tk.datasets.load_dataset('bike_sales_sample')

df['order_date'] = pd.to_datetime(df['order_date'])

df.glimpse()

# aggregate sales by month
sales_by_month = df \
    .groupby('category_2') \
    .summarize_by_time(
        date_column = 'order_date',
        value_column = 'total_price',
        agg_func = ['sum'],
        freq = 'MS'
    )

sales_by_month \
    .groupby('category_2') \
    .plot_timeseries("order_date", "total_price_sum", smooth=False, plotly_dropdown = True)

# Set index
df = sales_by_month.copy()

df.set_index("order_date", inplace=True)

# Create an X dataframe and y series
X, y = df.loc[:, ["category_2"]], df["total_price_sum"]
X
y

# make cross validation sets

from pytimetk import TimeSeriesCV

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

splits = tscv.split(X, y)

for i, (X_train, X_forecast, y_train, y_forecast) in enumerate(list(splits)):
    
    print(f"Split {i+1}")
    print(X_train)
    print(X_forecast)

tscv.glimpse(y)

tscv.plot(X,y)

Output

The output from printing the splits suggests it's only making 1 split:

Split 1
                    category_2
order_date                    
2011-03-01  Cross Country Race
2011-04-01  Cross Country Race
2011-05-01  Cross Country Race
2011-06-01  Cross Country Race
2011-07-01  Cross Country Race
2011-08-01  Cross Country Race
2011-03-01          Cyclocross
2011-04-01          Cyclocross
2011-05-01          Cyclocross
2011-06-01          Cyclocross
2011-07-01          Cyclocross
2011-08-01          Cyclocross
2011-03-01          Elite Road
2011-04-01          Elite Road
2011-05-01          Elite Road
2011-06-01          Elite Road
2011-07-01          Elite Road
2011-08-01          Elite Road
2011-03-01      Endurance Road
2011-04-01      Endurance Road
2011-05-01      Endurance Road
2011-06-01      Endurance Road
2011-07-01      Endurance Road
2011-08-01      Endurance Road
2011-03-01            Fat Bike
2011-04-01            Fat Bike
2011-05-01            Fat Bike
2011-06-01            Fat Bike
2011-07-01            Fat Bike
2011-08-01            Fat Bike
2011-03-01       Over Mountain
2011-04-01       Over Mountain
2011-05-01       Over Mountain
2011-06-01       Over Mountain
2011-07-01       Over Mountain
2011-08-01       Over Mountain
2011-03-01               Sport
2011-04-01               Sport
2011-05-01               Sport
2011-06-01               Sport
2011-07-01               Sport
2011-08-01               Sport
2011-03-01               Trail
2011-04-01               Trail
2011-05-01               Trail
2011-06-01               Trail
2011-07-01               Trail
2011-08-01               Trail
2011-03-01          Triathalon
2011-04-01          Triathalon
2011-05-01          Triathalon
2011-06-01          Triathalon
2011-07-01          Triathalon
2011-08-01          Triathalon
                    category_2
order_date                    
2011-09-01  Cross Country Race
2011-10-01  Cross Country Race
2011-11-01  Cross Country Race
2011-09-01          Cyclocross
2011-10-01          Cyclocross
2011-11-01          Cyclocross
2011-09-01          Elite Road
2011-10-01          Elite Road
2011-11-01          Elite Road
2011-09-01      Endurance Road
2011-10-01      Endurance Road
2011-11-01      Endurance Road
2011-09-01            Fat Bike
2011-10-01            Fat Bike
2011-11-01            Fat Bike
2011-09-01       Over Mountain
2011-10-01       Over Mountain
2011-11-01       Over Mountain
2011-09-01               Sport
2011-10-01               Sport
2011-11-01               Sport
2011-09-01               Trail
2011-10-01               Trail
2011-11-01               Trail
2011-09-01          Triathalon
2011-10-01          Triathalon
2011-11-01          Triathalon

@FBruzzesi
Owner Author

It seems that the data has one year only:

df.index.min(), df.index.max()
(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-12-01 00:00:00'))

@mdancho84

Yes, is that a problem?

@FBruzzesi
Owner Author

Whoops, sorry, I thought it was a 1-year frequency. Ignore me.

@FBruzzesi
Owner Author

FBruzzesi commented Nov 12, 2024

The reason is that there are 11 months between the min and max dates, with a 6-month training window and a 3-month forecast horizon (and a 3-month stride as well). The second split would therefore have to start its training window on 2010-12-01, which is before the min date, so the loop exits after the first split.

In mode="backward" I want to guarantee that every split has the full train size.

If you were to compute this in forward mode, you would get 2 splits, the second of which has a test size of 2 months (from 2011-10-01 to 2011-12-01).

One way to get 2 splits in backward mode is to specify the end date in the .split method:

from datetime import datetime

...

tscv = TimeSeriesCV(
    frequency="months",
    train_size=6,
    forecast_horizon=3,
    gap=0,
)

splits = tscv.split(X, y, end_dt=datetime(2012, 1, 1))

for i, _ in enumerate(list(splits)):
    print(f"Split {i+1}")

Split 1
Split 2
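
The backward-mode arithmetic described above can be reproduced by hand. The sketch below is an illustrative reimplementation, not timebasedcv's actual code: `shift_months` and `backward_windows` are hypothetical helpers, the stride is assumed equal to the forecast horizon, the gap is 0, and the forecast end is treated as exclusive. It sticks to the standard library so it runs anywhere; `relativedelta(months=n)` would perform the same shift.

```python
from datetime import datetime
from typing import List, Tuple


def shift_months(dt: datetime, months: int) -> datetime:
    """Shift a first-of-month datetime by `months` (may be negative)."""
    total = dt.year * 12 + (dt.month - 1) + months
    return dt.replace(year=total // 12, month=total % 12 + 1)


def backward_windows(
    start: datetime, end: datetime, train: int, horizon: int
) -> List[Tuple[datetime, datetime, datetime]]:
    """Return (train_start, forecast_start, forecast_end) walking back from `end`."""
    windows = []
    forecast_end = end
    while True:
        forecast_start = shift_months(forecast_end, -horizon)
        train_start = shift_months(forecast_start, -train)
        if train_start < start:  # a full training window no longer fits
            break
        windows.append((train_start, forecast_start, forecast_end))
        forecast_end = shift_months(forecast_end, -horizon)  # stride == horizon
    return windows


data_min, data_max = datetime(2011, 1, 1), datetime(2011, 12, 1)

# Ending at the max date: the second window would need training data from 2010-12-01.
default_end = backward_windows(data_min, data_max, train=6, horizon=3)
# Ending at 2012-01-01: both training windows fit inside the data.
explicit_end = backward_windows(data_min, datetime(2012, 1, 1), train=6, horizon=3)

print(len(default_end))   # 1
print(len(explicit_end))  # 2
```

With the end moved to 2012-01-01, the earliest training window starts exactly on 2011-01-01, which is why the second split becomes possible.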

@mdancho84

Ok that's interesting. Thanks for looking into it.

All of the original examples I put together are working.

I'll play around with it and see if there's anything else.

But this looks great. Thanks so much for adding the new frequencies.

Successfully merging this pull request may close these issues.

Support for Month, Quarter and Year Frequencies
2 participants