Question: Feature extraction on time series batch #67
Comments
Hi @mbignotti,
Glad to hear that you like our package! 😄 Thank you for identifying this bug and providing a clear explanation with reproducible code! I guess this bug relates to another bug I identified a couple of weeks ago in #62.
This bug is a base case of the bugfix in that PR (i.e., only 1 possible segment). I'll look into the code and try to fix the bug + add some tests for this base case.
Cheers, Jeroen

Hi @jvdd,

Hi @mbignotti,
No problem, never hesitate to ping us! Sharing feature requests / issues is a crucial part of open-source development! I'll discuss both your comments later today with @jonasvdd & @emield12 (& update you with our opinion / steps forward).
Cheers, Jeroen

Hi @mbignotti, can you confirm that everything works as expected in the latest release of tsflex (v0.3)? :)

Hi @jvdd,

Hi @mbignotti, You are right, it is now somewhat confusing to apply this functionality. This afternoon, I had a fruitful discussion with @jvdd, and we decided this:
fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
        window=len(df),  # this depends on the size of your dataframe
        stride=1,
    )
)
As such, we have decided to add a new method for this case:
# NOTE: the window and stride parameters are optional.
fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
    )
)
# uses the whole (unsegmented) series of `data` to calculate
# the features upon
fc.calculate_unsegmented(data=df, return_df=True)

Using two different methods could be a little bit confusing, in my opinion. After all, as you mentioned in a previous comment, computing the feature on the entire time series is a special case of the more general one where you specify a window.
# NOTE: window and stride parameters are omitted.
fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
    )
)
# Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)

fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
        window=-1,  # Signals that we want to compute on the entire batch. Stride cannot be passed or is ignored in this case.
    )
)
# Uses the whole (unsegmented) series of `data` to calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)
The problem with having two different methods is that, in a real application (not just a notebook), you typically have many possible configurations, and you usually want to keep complexity to a minimum. However, I do not know the internals of tsflex, so I cannot really say which option is best or how difficult it would be to implement. I can only judge the API from a user's point of view, which is of course limited. In any case, big thanks for your work and effort! I wish I could give more concrete contributions, but unfortunately I don't have enough time :)
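
To make the two proposals above concrete, here is a minimal, library-independent sketch of a single entry point that treats a missing window (None) or the sentinel -1 as "use the whole series". The names compute_feature and WHOLE_SERIES are hypothetical and only illustrate the idea from a user's perspective; they say nothing about how tsflex implements, or would implement, this.

```python
from statistics import mean
from typing import Callable, List, Optional, Sequence

WHOLE_SERIES = -1  # hypothetical sentinel, mirrors the `window=-1` proposal


def compute_feature(
    func: Callable[[Sequence[float]], float],
    values: Sequence[float],
    window: Optional[int] = None,
    stride: int = 1,
) -> List[float]:
    """Apply `func` to strided windows, or to the whole series if no window is given."""
    if window is None or window == WHOLE_SERIES:
        # Special case: a single segment spanning the entire series; stride is ignored.
        return [func(values)]
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # General case: one output per full window; the last start index is len(values) - window.
    return [
        func(values[start:start + window])
        for start in range(0, len(values) - window + 1, stride)
    ]


data = [1.0, 2.0, 3.0, 4.0]
whole = compute_feature(mean, data)                  # window omitted: whole series
also_whole = compute_feature(mean, data, window=-1)  # sentinel: whole series
rolling = compute_feature(mean, data, window=2)      # rolling windows of 2 samples
```

With this shape, the calling code stays the same in both cases; only the configuration of the window changes.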

Hi @mbignotti,
Thank you for putting so much effort into giving your end-user API perspective, really appreciated! 🤗 The main reason @jvdd and I wanted to introduce a new method is to make things more explicit (and move some special cases away from the already lengthy
I am rather intrigued by this sentence; could you elaborate on it (maybe provide a use case), so I can understand it better?
Regarding your proposed alternatives: I rather like them, and they seem rather intuitive / user-friendly. So, I will give them some thought later on! As for now, I will create a new branch on which I expose the current, non-final, |
Hi @mbignotti! 👋🏼 Have you by any chance found the time to look at the above issue (and mentioned PR)? Would love to hear your opinion about this before we take future concrete implementation steps! 😃 Kind regards, |
Hi @jonasvdd,
FeatureDescriptor:
  - function: "np.mean"  # map somehow the string to the actual function
    series_name: "Value"
    window: null  # or -1
  - function: "np.std"
    series_name: "Value"
    window: null  # or -1
Then, the source code will look something like this:
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)
fc = FeatureCollection(
    FeatureDescriptor(
        **settings
    )
    for settings in config["FeatureDescriptor"]
)
fc.calculate(data=df, return_df=True)
Having two different
if config["FeatureDescriptor"]["window"] is None:  # or config["FeatureDescriptor"]["window"] == -1
    fc.calculate_unsegmented(data=df, return_df=True)
else:
    fc.calculate(data=df, return_df=True)
Maybe, in this case, it's not a big problem, but in my opinion it's cleaner to define everything about how to perform the calculation in the |
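
One way to handle the "map somehow the string to the actual function" step from the config above is a small lookup table. The sketch below is only an illustration under stated assumptions: FUNCTIONS and resolve are hypothetical names, the standard-library statistics module stands in for NumPy (pstdev is the population standard deviation, which matches the default of np.std), and a plain dict stands in for what yaml.safe_load would return for a similar file.

```python
import statistics
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical registry: maps the strings used in a config file to callables.
FUNCTIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": statistics.mean,
    "std": statistics.pstdev,
}


def resolve(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `settings` with the function name replaced by the callable."""
    name = settings["function"]
    if name not in FUNCTIONS:
        raise KeyError(f"unknown feature function: {name!r}")
    return {**settings, "function": FUNCTIONS[name]}


# Stand-in for the parsed YAML config (a list of per-feature settings).
config = {
    "FeatureDescriptor": [
        {"function": "mean", "series_name": "Value", "window": None},
        {"function": "std", "series_name": "Value", "window": None},
    ]
}

resolved: List[Dict[str, Any]] = [resolve(s) for s in config["FeatureDescriptor"]]
```

Each resolved dictionary then carries a real callable and could be unpacked into a descriptor, as in the snippet above. An explicit table also avoids evaluating arbitrary strings from a config file.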
Related issue #63 |
Hello,
First of all, I would like to thank you for the really nice library. I think it is much more straightforward and at the same time flexible, compared to similar libraries.
I have a use case where sometimes I need to compute features in a rolling fashion, for which the window parameter of the FeatureDescriptor object is really helpful, and some other times I need to compute features on time series batches. That is, the window parameter equals the length of the entire time series. However, I'm having a few issues with the latter case.
Here is an example:
If I run the code above, I get the following error (personal info is hidden):
If I specify window=len(df) - 1, it works, but then, of course, it does not use the last data point in the calculation. Am I doing something wrong? Is there a way to achieve the required behaviour?
Thanks a lot!
Environment:
python==3.8.13
numpy==1.22.4
pandas==1.4.2
tsflex==0.2.3.7.7
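
For reference, the expected behaviour can be stated with a generic, index-based rolling scheme: a window equal to the series length should give exactly one segment that covers every sample, while a window of one sample less gives segments that each miss one sample. The sketch below is not tsflex's segmentation code and does not reproduce the error reported above; segment_bounds is a hypothetical helper used only to illustrate the base case.

```python
from typing import List, Tuple


def segment_bounds(n: int, window: int, stride: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) index ranges of all full windows over n samples."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(start, start + window) for start in range(0, n - window + 1, stride)]


n = 10
full = segment_bounds(n, window=n)         # exactly one segment covering every sample
shorter = segment_bounds(n, window=n - 1)  # two segments, each missing one sample
```

Here the first segment of shorter ends before the last sample, which corresponds to the workaround described in the post.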