breaking: raise error for gaps in series #504
Merged
Changes shown from 8 of 19 commits:

- `c89a770` enh: raise error for gaps in series (jmoralez)
- `212b890` merge main (jmoralez)
- `0d0a7f8` refactor (jmoralez)
- `3fec31e` refactor2 (jmoralez)
- `dfde561` replace peyton dataset (jmoralez)
- `2b85923` replace ptlr dataset (jmoralez)
- `c566e6f` more tols (jmoralez)
- `080e7fa` bump utils (jmoralez)
- `ccac70e` Merge branch 'main' into gaps-err (elephaint)
- `e1b5802` Merge branch 'main' into gaps-err (jmoralez)
- `a783aa0` remove irregular timestamps tutorial and update capabilities (jmoralez)
- `7dd1622` change references from tutorial to capabilities (jmoralez)
- `313249d` pin nbdev (jmoralez)
- `b610a1a` show diff (jmoralez)
- `d151fbd` manually run export (jmoralez)
- `0303061` Merge branch 'main' into gaps-err (jmoralez)
- `5041a76` increase cell (jmoralez)
- `acc47b1` [skip ci] move test to last cell (jmoralez)
- `2426a34` [skip ci] update output (jmoralez)
Changed notebooks (large diffs not rendered):

- `nbs/docs/capabilities/anomaly-detection/01_quickstart.ipynb` (27 changes: 6 additions, 21 deletions)
- `nbs/docs/capabilities/anomaly-detection/03_anomaly_detection_date_features.ipynb` (14 changes: 7 additions, 7 deletions)
- `nbs/docs/capabilities/anomaly-detection/04_confidence_levels.ipynb` (23 changes: 6 additions, 17 deletions)
Diff of the client notebook source:

```diff
@@ -70,6 +70,7 @@
 ")\n",
 "from utilsforecast.compat import DFType, DataFrame, pl_DataFrame\n",
 "from utilsforecast.feature_engineering import _add_time_features, time_features\n",
+"from utilsforecast.preprocessing import id_time_grid\n",
 "from utilsforecast.validation import ensure_time_dtype, validate_format\n",
 "if TYPE_CHECKING:\n",
 "    try:\n",
@@ -871,6 +872,7 @@
 "        target_col: str,\n",
 "        model: str,\n",
 "        validate_api_key: bool,\n",
+"        freq: Optional[str],\n",
 "    ) -> Tuple[DFType, Optional[DFType], bool]:\n",
 "        if validate_api_key and not self.validate_api_key(log=False):\n",
 "            raise Exception('API Key not valid, please email [email protected]')\n",
@@ -896,7 +898,25 @@
 "        validate_format(df=df, id_col=id_col, time_col=time_col, target_col=target_col)\n",
 "        if ufp.is_nan_or_none(df[target_col]).any():\n",
 "            raise ValueError(f'Target column ({target_col}) cannot contain missing values.')\n",
-"        return df, X_df, drop_id\n",
+"        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)\n",
+"        expected_ids_times = id_time_grid(\n",
+"            df,\n",
+"            freq=freq,\n",
+"            start=\"per_serie\",\n",
+"            end=\"per_serie\",\n",
+"            id_col=id_col,\n",
+"            time_col=time_col,\n",
+"        )\n",
+"        if len(df) != len(expected_ids_times):\n",
+"            raise ValueError(\n",
+"                \"Series contain missing or duplicate timestamps, or the timestamps \"\n",
+"                \"do not match the provided frequency.\\n\"\n",
+"                \"Please make sure that all series have a single observation from the first \"\n",
+"                \"to the last timestamp and that the provided frequency matches the timestamps'.\\n\"\n",
+"                \"You can refer to https://docs.nixtla.io/docs/tutorials-missing_values \"\n",
+"                \"for an end to end example.\"\n",
+"            )\n",
+"        return df, X_df, drop_id, freq\n",
 "\n",
 "    def validate_api_key(self, log: bool = True) -> bool:\n",
 "        \"\"\"Returns True if your api_key is valid.\"\"\"\n",
@@ -1045,20 +1065,20 @@
 "        self.__dict__.pop('feature_contributions', None)\n",
 "        model = self._maybe_override_model(model)\n",
 "        logger.info('Validating inputs...')\n",
-"        df, X_df, drop_id = self._run_validations(\n",
+"        df, X_df, drop_id, freq = self._run_validations(\n",
 "            df=df,\n",
 "            X_df=X_df,\n",
 "            id_col=id_col,\n",
 "            time_col=time_col,\n",
 "            target_col=target_col,\n",
 "            validate_api_key=validate_api_key,\n",
 "            model=model,\n",
+"            freq=freq,\n",
 "        )\n",
 "        df, X_df = _validate_exog(\n",
 "            df, X_df, id_col=id_col, time_col=time_col, target_col=target_col\n",
 "        )\n",
 "        level, quantiles = _prepare_level_and_quantiles(level, quantiles)\n",
-"        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)\n",
 "        standard_freq = _standardize_freq(freq)\n",
 "        model_input_size, model_horizon = self._get_model_params(model, standard_freq)\n",
 "        if finetune_steps > 0 or level is not None or add_history:\n",
@@ -1279,16 +1299,16 @@
 "        self.__dict__.pop('weights_x', None)\n",
 "        model = self._maybe_override_model(model)\n",
 "        logger.info('Validating inputs...')\n",
-"        df, _, drop_id = self._run_validations(\n",
+"        df, _, drop_id, freq = self._run_validations(\n",
 "            df=df,\n",
 "            X_df=None,\n",
 "            id_col=id_col,\n",
 "            time_col=time_col,\n",
 "            target_col=target_col,\n",
 "            validate_api_key=validate_api_key,\n",
 "            model=model,\n",
+"            freq=freq,\n",
 "        )\n",
-"        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)\n",
 "        standard_freq = _standardize_freq(freq)\n",
 "        model_input_size, model_horizon = self._get_model_params(model, standard_freq)\n",
 "\n",
@@ -1468,16 +1488,16 @@
 "        )\n",
 "        model = self._maybe_override_model(model)\n",
 "        logger.info('Validating inputs...')\n",
-"        df, _, drop_id = self._run_validations(\n",
+"        df, _, drop_id, freq = self._run_validations(\n",
 "            df=df,\n",
 "            X_df=None,\n",
 "            id_col=id_col,\n",
 "            time_col=time_col,\n",
 "            target_col=target_col,\n",
 "            validate_api_key=validate_api_key,\n",
 "            model=model,\n",
+"            freq=freq,\n",
 "        )\n",
-"        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)\n",
 "        standard_freq = _standardize_freq(freq)\n",
 "        level, quantiles = _prepare_level_and_quantiles(level, quantiles)\n",
 "        model_input_size, model_horizon = self._get_model_params(model, standard_freq)\n",
@@ -1734,6 +1754,34 @@
 "nixtla_client.validate_api_key()"
 ]
 },
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+  "#| hide\n",
+  "# missing times\n",
+  "series = generate_series(2, min_length=100, freq='5min')\n",
+  "with_gaps = series.sample(frac=0.5, random_state=0)\n",
+  "expected_msg = 'missing or duplicate timestamps, or the timestamps do not match'\n",
+  "# gaps\n",
+  "test_fail(\n",
+  "    lambda: nixtla_client.forecast(df=with_gaps, h=1, freq='5min'),\n",
+  "    contains=expected_msg,\n",
+  ")\n",
+  "# duplicates\n",
+  "test_fail(\n",
+  "    lambda: nixtla_client.forecast(df=pd.concat([series, series]), h=1, freq='5min'),\n",
+  "    contains=expected_msg,\n",
+  ")\n",
+  "# wrong freq\n",
+  "test_fail(\n",
+  "    lambda: nixtla_client.forecast(df=series, h=1, freq='1min'),\n",
+  "    contains=expected_msg,\n",
+  ")"
+ ]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -2394,8 +2442,8 @@
 "anom_inferred_df_index = nixtla_client.detect_anomalies(df_ds_index)\n",
 "fcst_inferred_df = nixtla_client.forecast(df_[['ds', 'unique_id', 'y']], h=10)\n",
 "anom_inferred_df = nixtla_client.detect_anomalies(df_[['ds', 'unique_id', 'y']])\n",
-"pd.testing.assert_frame_equal(fcst_inferred_df_index, fcst_inferred_df, atol=1e-3)\n",
-"pd.testing.assert_frame_equal(anom_inferred_df_index, anom_inferred_df, atol=1e-3)\n",
+"pd.testing.assert_frame_equal(fcst_inferred_df_index, fcst_inferred_df, atol=1e-4, rtol=1e-3)\n",
+"pd.testing.assert_frame_equal(anom_inferred_df_index, anom_inferred_df, atol=1e-4, rtol=1e-3)\n",
 "df_ds_index = df_ds_index.groupby('unique_id').tail(80)\n",
 "for freq in ['Y', 'W-MON', 'Q-DEC', 'H']:\n",
 "    df_ds_index.index = np.concatenate(\n",
@@ -2405,7 +2453,7 @@
 "    fcst_inferred_df_index = nixtla_client.forecast(df_ds_index, h=10)\n",
 "    df_test = df_ds_index.reset_index()\n",
 "    fcst_inferred_df = nixtla_client.forecast(df_test, h=10)\n",
-"    pd.testing.assert_frame_equal(fcst_inferred_df_index, fcst_inferred_df, atol=1e-3)"
+"    pd.testing.assert_frame_equal(fcst_inferred_df_index, fcst_inferred_df, atol=1e-4, rtol=1e-3)"
 ]
 },
 {
@@ -2547,7 +2595,9 @@
 "\n",
 "pd.testing.assert_frame_equal(\n",
 "    timegpt_anomalies_df_1,\n",
-"    timegpt_anomalies_df_2 \n",
+"    timegpt_anomalies_df_2,\n",
+"    atol=1e-4,\n",
+"    rtol=1e-3,\n",
 ")"
 ]
 },
```
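For readers skimming the diff, the core idea of the new check can be sketched in plain pandas. This is only an approximation of what `utilsforecast.preprocessing.id_time_grid` does; the helper name `expected_grid_size` is hypothetical and not part of any library:

```python
# Sketch of the gap check this PR adds, using plain pandas instead of
# utilsforecast.preprocessing.id_time_grid. `expected_grid_size` is a
# hypothetical helper for illustration only.
import pandas as pd

def expected_grid_size(df, freq, id_col="unique_id", time_col="ds"):
    """Rows a gap-free dataset would have: for each series, one row per
    period between its first and last timestamp."""
    return sum(
        len(pd.date_range(g[time_col].min(), g[time_col].max(), freq=freq))
        for _, g in df.groupby(id_col)
    )

# two complete 5-minute series of 10 observations each
times = pd.date_range("2024-01-01", periods=10, freq="5min")
df = pd.concat(
    pd.DataFrame({"unique_id": i, "ds": times, "y": range(10)}) for i in (0, 1)
)
assert len(df) == expected_grid_size(df, "5min")  # complete grid: check passes

# drop one interior observation from each series: the lengths no longer match,
# which is the condition that now raises ValueError in _run_validations
with_gaps = df[df["y"] != 3]
assert len(with_gaps) != expected_grid_size(with_gaps, "5min")
```

The same length mismatch also catches duplicated timestamps (too many rows) and a wrong `freq` (grid size disagrees with the data), which is why one comparison covers all three failure modes tested in the new notebook cell.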
Diff of the exported Python module:

```diff
@@ -40,6 +40,7 @@
 )
 from utilsforecast.compat import DFType, DataFrame, pl_DataFrame
 from utilsforecast.feature_engineering import _add_time_features, time_features
+from utilsforecast.preprocessing import id_time_grid
 from utilsforecast.validation import ensure_time_dtype, validate_format
 
 if TYPE_CHECKING:
@@ -800,6 +801,7 @@ def _run_validations(
         target_col: str,
         model: str,
         validate_api_key: bool,
+        freq: Optional[str],
     ) -> Tuple[DFType, Optional[DFType], bool]:
         if validate_api_key and not self.validate_api_key(log=False):
             raise Exception("API Key not valid, please email [email protected]")
@@ -827,7 +829,25 @@ def _run_validations(
             raise ValueError(
                 f"Target column ({target_col}) cannot contain missing values."
             )
-        return df, X_df, drop_id
+        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)
+        expected_ids_times = id_time_grid(
+            df,
+            freq=freq,
+            start="per_serie",
+            end="per_serie",
+            id_col=id_col,
+            time_col=time_col,
+        )
+        if len(df) != len(expected_ids_times):
+            raise ValueError(
+                "Series contain missing or duplicate timestamps, or the timestamps "
+                "do not match the provided frequency.\n"
+                "Please make sure that all series have a single observation from the first "
+                "to the last timestamp and that the provided frequency matches the timestamps'.\n"
+                "You can refer to https://docs.nixtla.io/docs/tutorials-missing_values "
+                "for an end to end example."
+            )
+        return df, X_df, drop_id, freq
 
     def validate_api_key(self, log: bool = True) -> bool:
         """Returns True if your api_key is valid."""
@@ -975,20 +995,20 @@ def forecast(
         self.__dict__.pop("feature_contributions", None)
         model = self._maybe_override_model(model)
         logger.info("Validating inputs...")
-        df, X_df, drop_id = self._run_validations(
+        df, X_df, drop_id, freq = self._run_validations(
             df=df,
             X_df=X_df,
             id_col=id_col,
             time_col=time_col,
             target_col=target_col,
             validate_api_key=validate_api_key,
             model=model,
+            freq=freq,
         )
         df, X_df = _validate_exog(
             df, X_df, id_col=id_col, time_col=time_col, target_col=target_col
         )
         level, quantiles = _prepare_level_and_quantiles(level, quantiles)
-        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)
         standard_freq = _standardize_freq(freq)
         model_input_size, model_horizon = self._get_model_params(model, standard_freq)
         if finetune_steps > 0 or level is not None or add_history:
@@ -1215,16 +1235,16 @@ def detect_anomalies(
         self.__dict__.pop("weights_x", None)
         model = self._maybe_override_model(model)
         logger.info("Validating inputs...")
-        df, _, drop_id = self._run_validations(
+        df, _, drop_id, freq = self._run_validations(
             df=df,
             X_df=None,
             id_col=id_col,
             time_col=time_col,
             target_col=target_col,
             validate_api_key=validate_api_key,
             model=model,
+            freq=freq,
         )
-        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)
         standard_freq = _standardize_freq(freq)
         model_input_size, model_horizon = self._get_model_params(model, standard_freq)
 
@@ -1406,16 +1426,16 @@ def cross_validation(
         )
         model = self._maybe_override_model(model)
         logger.info("Validating inputs...")
-        df, _, drop_id = self._run_validations(
+        df, _, drop_id, freq = self._run_validations(
             df=df,
             X_df=None,
             id_col=id_col,
             time_col=time_col,
             target_col=target_col,
             validate_api_key=validate_api_key,
             model=model,
+            freq=freq,
        )
-        freq = _maybe_infer_freq(df, freq=freq, id_col=id_col, time_col=time_col)
         standard_freq = _standardize_freq(freq)
         level, quantiles = _prepare_level_and_quantiles(level, quantiles)
         model_input_size, model_horizon = self._get_model_params(model, standard_freq)
@@ -1628,7 +1648,7 @@ def plot(
             ax=ax,
         )
 
-# %% ../nbs/src/nixtla_client.ipynb 50
+# %% ../nbs/src/nixtla_client.ipynb 51
 def _forecast_wrapper(
     df: pd.DataFrame,
     client: NixtlaClient,
```

(elephaint marked a review conversation on the `_maybe_infer_freq` line of this diff as resolved.)
Maybe we should add a callout to this capabilities notebook at the bottom stating that TimeGPT doesn't allow gaps in the timestamps? E.g.:

"Make sure there are no gaps in your time series data. This means that even if the chosen frequency is irregular, you should still make sure you provide a value for every irregular timestamp in the data. For example, if your frequency is "B" (business day), there can't be a gap (missing datapoint) between two consecutive business days."

Edit: perhaps add a similar comment also to the beginning or end of the tutorial notebook 12_irregular.
We already have that in the data requirements notebook
Sure, but repetition is the key to education? 😆
Yeah, that may help, but I'd prefer to add a link to that section instead; otherwise we'll have to remember to update it in every place we state it and will most likely miss some.
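Since this PR removes the irregular-timestamps tutorial and makes gaps a hard error, users with genuinely irregular data now have to regularize their series themselves before calling `forecast`. A minimal sketch of one way to do that; the `fill_gaps` helper and the interpolation strategy are illustrative only, not part of the client:

```python
# Hypothetical user-side fix: reindex every series onto a complete grid
# and fill the target before handing the data to nixtla_client.forecast.
# Interpolation is just one possible fill strategy.
import pandas as pd

def fill_gaps(df, freq, id_col="unique_id", time_col="ds", target_col="y"):
    """Reindex each series onto a gap-free grid and interpolate the target."""
    parts = []
    for uid, g in df.groupby(id_col):
        full = pd.date_range(g[time_col].min(), g[time_col].max(), freq=freq)
        g = g.set_index(time_col).reindex(full)   # inserts NaN rows at the gaps
        g[id_col] = uid
        g[target_col] = g[target_col].interpolate()
        parts.append(g.rename_axis(time_col).reset_index())
    return pd.concat(parts, ignore_index=True)

# a daily series with two interior observations missing
times = pd.date_range("2024-01-01", periods=8, freq="D").delete([2, 5])
df = pd.DataFrame({"unique_id": 0, "ds": times, "y": range(len(times))})
filled = fill_gaps(df, "D")
assert len(filled) == 8              # grid is complete again
assert filled["y"].notna().all()     # target values were interpolated
```

With the gaps filled, the grid-length check in `_run_validations` passes and the request proceeds as before.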