-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/calculate fft energy blocks #1024
base: main
Are you sure you want to change the base?
Feature/calculate fft energy blocks #1024
Conversation
… calculated. Docstring not available yet, tests not available yet
…t to proper behavior
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1024 +/- ##
==========================================
- Coverage 93.42% 93.39% -0.03%
==========================================
Files 20 20
Lines 1886 1909 +23
Branches 371 375 +4
==========================================
+ Hits 1762 1783 +21
Misses 85 85
- Partials 39 41 +2
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the contribution! We love seeing more feature calculators!
I have added a few comments on a high level. There is one conceptional problem I see with the current implementation - but this can be fixed (I think), so nothing which prevents the feature in general!
Thanks again for your work :)
Once you finished implementing the comments, I will have another look.
@set_property("input", "pd.Series") | ||
@set_property("index_type", pd.DatetimeIndex) | ||
def energy_content_frequency_brackets( | ||
signal: pd.Series, params: dict = {"number_of_bins": 10, "with_normalization": True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not set the default parameter to a dictionary (https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python). This is not only related to tsfresh
but for general python: this can lead to very unexpected behaviour (see the linked stackoverflow if you want to know more).
(you even disabled the "dangerous default parameter" pylint check for this ;) )
The good thing is, you do not even need a default argument. tsfresh will always fill this argument, so you can just remove the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding type information. But params
is actually a list of dictionaries.
As you chose a combiner calculator (https://tsfresh.readthedocs.io/en/latest/text/how_to_add_custom_feature.html#step-1-decide-which-type-of-feature-you-want-to-implement) you will get all parameters for this calculator as a list of dictionaries.
This will also need some changes to your code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From that point of view, I'm a bit unsure if I should leave the calculator to be a combiner or convert it to a "simple" calculator. Converting it into a "simple" calculator would allow me to simplify the code, and speed up calculation. For the time being I therefore decided to convert it, and would appreciate feedback if that decision is appropriate.
Thanks!
Args: | ||
signal (pd.Series): The distribution for which the frequency energy buckets \ | ||
should be calculated | ||
params (_type_, optional): Parameters used for calculation of the frequency \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above, the parameter is not optional (and the type information is not __type__
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
number_of_bins: int = 10 | ||
if "number_of_bins" in params.keys(): | ||
number_of_bins: int = ( | ||
params["number_of_bins"] | ||
if ( | ||
params["number_of_bins"] <= len(time_vec) | ||
and params["number_of_bins"] >= 0 | ||
) | ||
else len(time_vec) - 1 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need to use a default value for number_of_bins
here. You can always assume that the key will be set. The reason is that we define the settings ourself :)
This allows you to simplify the code to (basically)
number_of_bins = min(params["number_of_bins"], len(time_vec) - 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if a user sets number_of_bins
to 0 or below this should be a bug and you can just raise an exception here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to raise a warning to inform the user, and to continue with a default value of 1.
) | ||
else len(time_vec) - 1 | ||
) | ||
with_normalization: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same - you can assume the value is set in the param config
split_freq_vec: list[np.ndarray] = np.array_split(freq_vec, number_of_bins) | ||
frequency_energy_vec: list[float] = [] | ||
frequency_range_vec: list[tuple[float, float]] = [] | ||
for _, freq_entry in enumerate(split_freq_vec): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need to use enumerate
if you do not need the index (and it seems you do not need it as you use _
)
lower_frequency_index: int = np.where(freq_vec == freq_entry[0])[0][0] | ||
upper_frequency_index: int = np.where(freq_vec == freq_entry[-1])[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are those two lines doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind those two lines was to find the index of the frequency in freq_vec which corresponds to the frequency limits given for each entry in split_freq_vec. Anyway, as this code was updated on my side, I will re-write that part.
|
||
return_val: list[tuple[str, float]] = [ | ||
( | ||
f"bin_{entry_num}_max_{number_of_bins}_low_{int(frequency_range_vec[entry_num][0])}_high_{int(frequency_range_vec[entry_num][1])}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we have a conceptional problem.
Each feature calculator in tsfresh is allowed to only return a single number per parameter input. This is important because our feature selection only works for numbers, not lists of numbers.
Additionally, the feature extractor names (the string you return) need to be always the same (independent on the data) and should only depend on the parameters.
How can we solve this? We can use the same idea that we e.g. used in the fft_aggregated
calculator. Just make the entry_num
and additional parameter :) With this, you can return a single result per parameter and just have a lot more parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hei,
I'm not sure how to implement that, or if I fully understood your comment.
I.e. if I use one parameter as input (f.eks. setting params = {"number_of_bins": 10})
, then I can return only a single tuple consisting of (str, float), but not multiple tuples (as I would expect, when setting number_of_bins
to 10). The solution for that would be (as stated above) to rather set params
to [{"number_of_bins": 1}] * 10
, which would increase my number of inputs to 10, and thereby allow me to return 10 tuples, instead of one. Internally, of course, I would have to replace number_of_bins
with entry_num
to define the number of bins for the calculation.
Is that interpretation correct? This would increase complexity for the user, though.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me explain a bit why we need this to give you some context.
Some users use tsfresh just for feature extraction - in which case your approach to return a dynamic number of features would be perfectly well. However, tsfresh is also used for feature selection. If the feature names and number however depends on the data - how can I use the selected features on my training dataset on my validation dataset? Therefore it is crucial that the generated features only depend on the input parameter configuration. This means, if a user sets paramA = 5, it should always give the exact same features.
The way we have implemented in other features, is similar like you described. To make a simple example, assuming our feature calculator wants to generate 10 numbers [0, 1, 2, ....]
. As this is not possible, we give our feature calculator not the number of bins, but the actual bin index. So if the feature calculator is called with the parameter `bin_index=0' it would return the 0, etc. In the default configuration, you would then have a default set of bin_indices. This might not be as many bins as needed for your data (but it can be configured by the user if they want). For example, we always return the first 100 FFT coefficients (no matter how long the data is).
Another approach would be, instead of returning the raw bin numbers, returning statistics of them. Like create one feature out of the max value of all your bins, one out of the min etc. We do something similar to the fft_aggregated
feature calculator.
…ed additional option for user-defined bins instead of automatically calculated bins. Updated test accordingly
.gitignore
Outdated
@@ -48,6 +48,7 @@ docs/text/_generated/ | |||
# Virtual environment | |||
venv* | |||
activate | |||
tsfresh-env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this? It seems to be very special to your own environment (and not a general name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the name of my virtual environment (not venv), but I'll remove it.
.gitignore
Outdated
@@ -61,3 +62,4 @@ tsfresh/notebooks/data/ | |||
# dask | |||
dask-worker-space | |||
dask-worker-space/ | |||
*.pyc.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this is needed? Which processes produces these files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are bytecode files, generated when I run the code for the first time. As I don't want to upload them in git, I ignored them.
@anates - the combiner/simple calculator and your bin index issue makes it probably quite complex, so let me try to simplify the problem a bit: def energy_content_frequency_brackets(signal, with_normalization, bin_index): Would that be possible? I think the function would look something like
When I call this function for different If this works, we can go one step further and make the implementation efficient: instead of doing the FFT for every bin independently we do it for all bins once. This is how a "combiner" works (it is just for making the calculation more efficient). But first, lets make sure the implementation with a "simple" type works! Hope this helps :) Sorry for all the delayed-ping-pong! Note: in your current implementation, you also return the "low" and "high" in the feature name. As described above, this is not possible. However we can think of adding yet another feature calculator which just returns the high and one which just returns the low. But this can be a separate PR |
Hei,
Thanks! |
Hi! To your second point: |
Hei, This would also reduce the amount of code duplication, else I would have to re-use the same code in two different extractors. Concerning the second point: Assumed I understood your point correctly, the current implementation of returning Thanks! |
Very close to what you described is possible (and preferred as you said!). It would be ok to write a function which gets an additional parameter, but if it is a def energy_content_frequency_brackets(signal, with_normalization, bin_index, return_mode) -> float:
if return_mode == "value":
return fft_sum # single float!
elif return_mode == "high":
return 42 # replace with actual number
elif return_mode == "low":
return 47 # replace with actual number Feel free to use different names, this was just an example!
Maybe I missed your point, but the current solution as it is implemented right now in the code does not work as expected for tsfresh (again: it is not wrong, it just does not work with the way tsfresh expects feature calculators to work). Let me rephrase what I tried to explain in #1024 (comment). def energy_content_frequency_brackets(signal, with_normalization, bin_index) -> float: (plus the return mode from the question above maybe) Let me try to give another (maybe simpler) example: assuming you write a simple feature calculator which always returns the three largest number of a time series. You can not write the feature calculator like this def my_calculator(data) -> List[float]:
data = list(reversed(sorted(data)))
return [data[0], data[1], data[2]] and also not like this def my_calculator(data) -> List[Tuple[str, float]]:
data = list(reversed(sorted(data)))
return [("largest_1", data[0]), ...] but you need to write it like this def my_calculator(data, which_largest) -> float:
data = list(reversed(sorted(data)))
return data[which_largest] |
Work on this feature is on hold for the next weeks, but I'll take it up again after my break, if that's ok? |
…nput parameter, for initial test/proof of concept
Hei, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great! The changes you did are very good - now it is in a shape to fit with the rest of the feature extractors.
Thanks for taking the time and sorry again for all the delays. I really appreciate your contribution.
warnings.warn(warnings_message, RuntimeWarning) | ||
return np.NaN | ||
|
||
number_of_bins: int = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also have this as an input parameter. Maybe better use that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hei,
yes, I'll implement it, but will work on it step-by-step, to ensure that every step still fits within the framework. Thanks!
freq_vec: np.ndarray = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0])) | ||
fft_vec: np.ndarray = np.abs(np.fft.rfft(signal.to_numpy())) | ||
|
||
if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you want the bracket after the < 0?
if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]): | |
if (bin_ranges[0] < 0 and bin_ranges[1] < 0) or (bin_ranges[0] >= bin_ranges[1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a typo, thanks for noticing it!
This feature allows the computation of the frequency energy content of the signal, distributed over n frequency bins. This simplifies the detection of dominant frequencies for different process stages.
This pull request is meant as initial step, to receive feedback (both wrt. code, implementation and intention of the feature) and to improve the code, before it can be merged.