Feature/calculate fft energy blocks #1024

anates · 2023-05-30T11:44:10Z

This feature allows the computation of the frequency energy content of the signal, distributed over n frequency bins. This simplifies the detection of dominant frequencies for different process stages.

This pull request is meant as initial step, to receive feedback (both wrt. code, implementation and intention of the feature) and to improve the code, before it can be merged.

… calculated. Docstring not available yet, tests not available yet

…t to proper behavior

codecov-commenter · 2023-06-01T21:29:00Z

Codecov Report

Patch coverage: 87.50% and project coverage change: -0.03 ⚠️

Comparison is base (e91e2a3) 93.42% compared to head (55696eb) 93.39%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1024      +/-   ##
==========================================
- Coverage   93.42%   93.39%   -0.03%     
==========================================
  Files          20       20              
  Lines        1886     1909      +23     
  Branches      371      375       +4     
==========================================
+ Hits         1762     1783      +21     
  Misses         85       85              
- Partials       39       41       +2

Impacted Files	Coverage Δ
tsfresh/feature_extraction/settings.py	`100.00% <ø> (ø)`
tsfresh/feature_extraction/feature_calculators.py	`92.43% <87.50%> (-0.04%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

nils-braun

Thank you very much for the contribution! We love seeing more feature calculators!
I have added a few comments on a high level. There is one conceptional problem I see with the current implementation - but this can be fixed (I think), so nothing which prevents the feature in general!

Thanks again for your work :)
Once you finished implementing the comments, I will have another look.

nils-braun · 2023-06-08T19:02:42Z

tsfresh/feature_extraction/feature_calculators.py

+@set_property("input", "pd.Series")
+@set_property("index_type", pd.DatetimeIndex)
+def energy_content_frequency_brackets(
+    signal: pd.Series, params: dict = {"number_of_bins": 10, "with_normalization": True}


Please do not set the default parameter to a dictionary (https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python). This is not only related to tsfresh but for general python: this can lead to very unexpected behaviour (see the linked stackoverflow if you want to know more).
(you even disabled the "dangerous default parameter" pylint check for this ;) )

The good thing is, you do not even need a default argument. tsfresh will always fill this argument, so you can just remove the default.

Thanks for adding type information. But params is actually a list of dictionaries.
As you chose a combiner calculator (https://tsfresh.readthedocs.io/en/latest/text/how_to_add_custom_feature.html#step-1-decide-which-type-of-feature-you-want-to-implement) you will get all parameters for this calculator as a list of dictionaries.

This will also need some changes to your code.

From that point of view, I'm a bit unsure if I should leave the calculator to be a combiner or convert it to a "simple" calculator. Converting it into a "simple" calculator would allow me to simplify the code, and speed up calculation. For the time being I therefore decided to convert it, and would appreciate feedback if that decision is appropriate.
Thanks!

nils-braun · 2023-06-08T19:10:00Z

tsfresh/feature_extraction/feature_calculators.py

+    Args:
+        signal (pd.Series): The distribution for which the frequency energy buckets \
+            should be calculated
+        params (_type_, optional): Parameters used for calculation of the frequency \


See above, the parameter is not optional (and the type information is not __type__)

Fixed, thanks!

nils-braun · 2023-06-08T19:18:25Z

tsfresh/feature_extraction/feature_calculators.py

+    number_of_bins: int = 10
+    if "number_of_bins" in params.keys():
+        number_of_bins: int = (
+            params["number_of_bins"]
+            if (
+                params["number_of_bins"] <= len(time_vec)
+                and params["number_of_bins"] >= 0
+            )
+            else len(time_vec) - 1
+        )


You do not need to use a default value for number_of_bins here. You can always assume that the key will be set. The reason is that we define the settings ourself :)
This allows you to simplify the code to (basically)

number_of_bins = min(params["number_of_bins"], len(time_vec) - 1)

I think if a user sets number_of_bins to 0 or below this should be a bug and you can just raise an exception here.

I chose to raise a warning to inform the user, and to continue with a default value of 1.

nils-braun · 2023-06-08T19:19:10Z

tsfresh/feature_extraction/feature_calculators.py

+            )
+            else len(time_vec) - 1
+        )
+    with_normalization: bool = True


Same - you can assume the value is set in the param config

tsfresh/feature_extraction/feature_calculators.py

nils-braun · 2023-06-08T19:28:42Z

tsfresh/feature_extraction/feature_calculators.py

+    split_freq_vec: list[np.ndarray] = np.array_split(freq_vec, number_of_bins)
+    frequency_energy_vec: list[float] = []
+    frequency_range_vec: list[tuple[float, float]] = []
+    for _, freq_entry in enumerate(split_freq_vec):


You do not need to use enumerate if you do not need the index (and it seems you do not need it as you use _)

nils-braun · 2023-06-08T19:31:45Z

tsfresh/feature_extraction/feature_calculators.py

+        lower_frequency_index: int = np.where(freq_vec == freq_entry[0])[0][0]
+        upper_frequency_index: int = np.where(freq_vec == freq_entry[-1])[0][0]


What are those two lines doing?

The idea behind those two lines was to find the index of the frequency in freq_vec which corresponds to the frequency limits given for each entry in split_freq_vec. Anyway, as this code was updated on my side, I will re-write that part.

tsfresh/feature_extraction/feature_calculators.py

nils-braun · 2023-06-08T19:50:46Z

tsfresh/feature_extraction/feature_calculators.py

+
+    return_val: list[tuple[str, float]] = [
+        (
+            f"bin_{entry_num}_max_{number_of_bins}_low_{int(frequency_range_vec[entry_num][0])}_high_{int(frequency_range_vec[entry_num][1])}",


Here, we have a conceptional problem.
Each feature calculator in tsfresh is allowed to only return a single number per parameter input. This is important because our feature selection only works for numbers, not lists of numbers.
Additionally, the feature extractor names (the string you return) need to be always the same (independent on the data) and should only depend on the parameters.

How can we solve this? We can use the same idea that we e.g. used in the fft_aggregated calculator. Just make the entry_num and additional parameter :) With this, you can return a single result per parameter and just have a lot more parameters.

Hei,
I'm not sure how to implement that, or if I fully understood your comment.

I.e. if I use one parameter as input (f.eks. setting params = {"number_of_bins": 10}), then I can return only a single tuple consisting of (str, float), but not multiple tuples (as I would expect, when setting number_of_bins to 10). The solution for that would be (as stated above) to rather set params to [{"number_of_bins": 1}] * 10, which would increase my number of inputs to 10, and thereby allow me to return 10 tuples, instead of one. Internally, of course, I would have to replace number_of_bins with entry_num to define the number of bins for the calculation.

Is that interpretation correct? This would increase complexity for the user, though.

Thanks!

Let me explain a bit why we need this to give you some context.
Some users use tsfresh just for feature extraction - in which case your approach to return a dynamic number of features would be perfectly well. However, tsfresh is also used for feature selection. If the feature names and number however depends on the data - how can I use the selected features on my training dataset on my validation dataset? Therefore it is crucial that the generated features only depend on the input parameter configuration. This means, if a user sets paramA = 5, it should always give the exact same features.

The way we have implemented in other features, is similar like you described. To make a simple example, assuming our feature calculator wants to generate 10 numbers [0, 1, 2, ....]. As this is not possible, we give our feature calculator not the number of bins, but the actual bin index. So if the feature calculator is called with the parameter `bin_index=0' it would return the 0, etc. In the default configuration, you would then have a default set of bin_indices. This might not be as many bins as needed for your data (but it can be configured by the user if they want). For example, we always return the first 100 FFT coefficients (no matter how long the data is).

Another approach would be, instead of returning the raw bin numbers, returning statistics of them. Like create one feature out of the max value of all your bins, one out of the min etc. We do something similar to the fft_aggregated feature calculator.

…ed additional option for user-defined bins instead of automatically calculated bins. Updated test accordingly

nils-braun · 2023-06-14T19:57:37Z

.gitignore

@@ -48,6 +48,7 @@ docs/text/_generated/
 # Virtual environment
 venv*
 activate
+tsfresh-env


Can you remove this? It seems to be very special to your own environment (and not a general name)

That's the name of my virtual environment (not venv), but I'll remove it.

nils-braun · 2023-06-14T19:58:05Z

.gitignore

@@ -61,3 +62,4 @@ tsfresh/notebooks/data/
 # dask
 dask-worker-space
 dask-worker-space/
+*.pyc.*


I wonder why this is needed? Which processes produces these files?

Those are bytecode files, generated when I run the code for the first time. As I don't want to upload them in git, I ignored them.

tsfresh/feature_extraction/feature_calculators.py

nils-braun · 2023-06-14T20:24:35Z

@anates - the combiner/simple calculator and your bin index issue makes it probably quite complex, so let me try to simplify the problem a bit:
as a start, can you implement a function energy_content_frequency_brackets which gets all parmeters as keyword arguments and just produces a single return value (float)? Your function would look like

def energy_content_frequency_brackets(signal, with_normalization, bin_index):

Would that be possible? I think the function would look something like


def energy_content_frequency_brackets(signal, with_normalization, bin_index):
    freq_vec = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
    fft_vec = np.abs(np.fft.rfft(signal.to_numpy()))
    split_freq_vec = np.array_split(freq_vec, number_of_bins)

    freq_entry = split_freq_vec[bin_index]

    lower_frequency_index = find_nearest(array=freq_vec, value=freq_entry[0])[0]
    upper_frequency_index: int = find_nearest(array=freq_vec, value=freq_entry[-1])[0]
    fft_sum: float = np.sum(fft_vec[lower_frequency_index:upper_frequency_index])
    if with_normalization:
        fft_sum /= len(fft_vec)

    return fft_sum

When I call this function for different bin_index (0, ...) I should basically get the list of values you originally had.

If this works, we can go one step further and make the implementation efficient: instead of doing the FFT for every bin independently we do it for all bins once. This is how a "combiner" works (it is just for making the calculation more efficient). But first, lets make sure the implementation with a "simple" type works!

Hope this helps :) Sorry for all the delayed-ping-pong!

Note: in your current implementation, you also return the "low" and "high" in the feature name. As described above, this is not possible. However we can think of adding yet another feature calculator which just returns the high and one which just returns the low. But this can be a separate PR

…ary files

… in this extractor

anates · 2023-06-15T06:56:36Z

Hei,
I addressed the other points, but have some comments regarding the main implementation (dvs. your last comment):

The "low" and "high"-part of my return string intend to inform the user where the boundaries of that bin are. At least for my understanding those values are quite important, to understand my signal better. Are there other ways of returning those boundaries to the user, or is that on their side to identify them (outside the feature-extractor)?
I agree with you that having a variable feature name might not be the best approach, but the number of bins will always be the same, independently of the data length, as it is based on a user-defined input. This corresponds to (for example) the coefficient-approach which always returns the first 100 coefficients, independently of the data itself. Would that still be ok?

Thanks!

nils-braun · 2023-06-20T20:23:48Z

Hi!
To your first point:
Would it work to add them as a second feature calculator? One that returns the values and one that returns the high (and one the low)?

To your second point:
Yes indeed, that is the only option in the current way tsfresh is implemented. Except returning a statistics over the values as described above. Not saying this is the best way, but it is just the way tsfresh is currently working...

anates · 2023-06-22T08:02:29Z

Hei,
thanks for the feedback! Based on that, would it be a solution to add an additional input parameter to this extraction function, to either return the bin value (default) for bin n, labelled with _bin_n, or return the high-/low-values for bin n? For typical operation (i.e. feature selection) it would not make a difference, but if the user is interested in the exact boundaries, they can still call this function separately with the same inputs as before, and obtain the bin limits.

This would also reduce the amount of code duplication, else I would have to re-use the same code in two different extractors.

Concerning the second point: Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended? For it being a combiner I then would just have to re-write handling of the input params, to allow it being a list of dicts.

Thanks!

nils-braun · 2023-06-25T08:44:28Z

Based on that, would it be a solution to add an additional input parameter to this extraction function
Good idea!

Very close to what you described is possible (and preferred as you said!). It would be ok to write a function which gets an additional parameter, but if it is a simple function it should only return a single number (the feature name will be set automatically). Something like

def energy_content_frequency_brackets(signal, with_normalization, bin_index, return_mode) -> float:
   if return_mode == "value":
      return fft_sum # single float!
   elif return_mode == "high":
      return 42 # replace with actual number
   elif return_mode == "low":
      return 47 # replace with actual number

Feel free to use different names, this was just an example!

Assumed I understood your point correctly, the current implementation of returning n bins is already working as intended?

Maybe I missed your point, but the current solution as it is implemented right now in the code does not work as expected for tsfresh (again: it is not wrong, it just does not work with the way tsfresh expects feature calculators to work). Let me rephrase what I tried to explain in #1024 (comment).
If you implement a simple calculator, which I would strongly encourage as the first implementation (as a combiner is harder to implement and only is for speeding up the implementation which can be the second step), your calculator is only allowed to return a single number. Not a list and not a feature name - but really just a single number.
In the example I gave in the thread above, I only return a single of the bins, while the bin_index is an additional parameter.

def energy_content_frequency_brackets(signal, with_normalization, bin_index) -> float:

(plus the return mode from the question above maybe)

Let me try to give another (maybe simpler) example: assuming you write a simple feature calculator which always returns the three largest number of a time series. You can not write the feature calculator like this

def my_calculator(data) -> List[float]:
   data = list(reversed(sorted(data)))
   return [data[0], data[1], data[2]]

and also not like this

def my_calculator(data) -> List[Tuple[str, float]]:
   data = list(reversed(sorted(data)))
   return [("largest_1", data[0]), ...]

but you need to write it like this

def my_calculator(data, which_largest) -> float:
   data = list(reversed(sorted(data)))
   return data[which_largest]

anates · 2023-07-07T12:31:05Z

Work on this feature is on hold for the next weeks, but I'll take it up again after my break, if that's ok?
Thanks!

…nput parameter, for initial test/proof of concept

anates · 2023-09-05T14:40:29Z

Hei,
I reverted the feature calculator to being a "simple" calculator as an initial proof of concept. I have not adjusted the testing-functions yet, but will do that as soon as the implementation gets closer to being final.
However, is the current implementation closer to being useful for tsfresh?
Thanks!

nils-braun

Yes, great! The changes you did are very good - now it is in a shape to fit with the rest of the feature extractors.
Thanks for taking the time and sorry again for all the delays. I really appreciate your contribution.

nils-braun · 2023-09-27T19:50:44Z

tsfresh/feature_extraction/feature_calculators.py

+        warnings.warn(warnings_message, RuntimeWarning)
+        return np.NaN
+
+    number_of_bins: int = 10


You also have this as an input parameter. Maybe better use that!

Hei,
yes, I'll implement it, but will work on it step-by-step, to ensure that every step still fits within the framework. Thanks!

nils-braun · 2023-09-27T19:51:54Z

tsfresh/feature_extraction/feature_calculators.py

+    freq_vec: np.ndarray = np.fft.rfftfreq(n=len(signal), d=(time_vec[1] - time_vec[0]))
+    fft_vec: np.ndarray = np.abs(np.fft.rfft(signal.to_numpy()))
+
+    if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]):


I guess you want the bracket after the < 0?

Suggested change

if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]):

if (bin_ranges[0] < 0 and bin_ranges[1] < 0) or (bin_ranges[0] >= bin_ranges[1]):

Yes, that's a typo, thanks for noticing it!

anates added 7 commits May 30, 2023 11:19

Removed virtual environment from git

1ba3018

Added new function energy_content_frequency_brackets as feature to be…

a896955

… calculated. Docstring not available yet, tests not available yet

Ignored pyc-files resulting from a test run

90b58d4

Added default settings for energy_content_frequency_brackets

2db2af7

Added initial test for testing energy_content_frequency_brackets

9515760

Added docstring to energy_content_frequency_brackets

8884742

Removed if __name__ == __main__ from earlier testing stages, to rever…

55696eb

…t to proper behavior

nils-braun reviewed Jun 8, 2023

View reviewed changes

anates and others added 2 commits June 9, 2023 14:28

Updated extraction function, to reflect feedback in pull request. Add…

dcdc1d2

…ed additional option for user-defined bins instead of automatically calculated bins. Updated test accordingly

Merge branch 'main' into feature/calculate_fft_energy_blocks

2d0b1fb

nils-braun reviewed Jun 14, 2023

View reviewed changes

anates and others added 3 commits June 15, 2023 08:42

Removed entry for local python environment, moved pyc-entry to tempor…

78268c6

…ary files

Returns NaN-values/warning if the user wants to use uneven time steps…

462bf20

… in this extractor

Merge branch 'main' into feature/calculate_fft_energy_blocks

f844daa

Reverted calculator to 'simple', to extract only a single value per i…

2a78135

…nput parameter, for initial test/proof of concept

Merge branch 'main' into feature/calculate_fft_energy_blocks

be3d251

nils-braun reviewed Sep 27, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/calculate fft energy blocks #1024

Feature/calculate fft energy blocks #1024

anates commented May 30, 2023

codecov-commenter commented Jun 1, 2023

nils-braun left a comment

nils-braun Jun 8, 2023

nils-braun Jun 8, 2023

anates Jun 9, 2023

nils-braun Jun 8, 2023

anates Jun 9, 2023

nils-braun Jun 8, 2023

nils-braun Jun 8, 2023

anates Jun 9, 2023

nils-braun Jun 8, 2023

nils-braun Jun 8, 2023

nils-braun Jun 8, 2023

anates Jun 9, 2023

nils-braun Jun 8, 2023

anates Jun 9, 2023

nils-braun Jun 14, 2023

nils-braun Jun 14, 2023

anates Jun 15, 2023

nils-braun Jun 14, 2023

anates Jun 15, 2023

nils-braun commented Jun 14, 2023

anates commented Jun 15, 2023

nils-braun commented Jun 20, 2023

anates commented Jun 22, 2023

nils-braun commented Jun 25, 2023 •

edited

Loading

anates commented Jul 7, 2023

anates commented Sep 5, 2023

nils-braun left a comment

nils-braun Sep 27, 2023

anates Oct 4, 2023

nils-braun Sep 27, 2023

anates Oct 4, 2023

		lower_frequency_index: int = np.where(freq_vec == freq_entry[0])[0][0]
		upper_frequency_index: int = np.where(freq_vec == freq_entry[-1])[0][0]

	if (bin_ranges[0] < 0 and bin_ranges[1]) < 0 or (bin_ranges[0] >= bin_ranges[1]):
	if (bin_ranges[0] < 0 and bin_ranges[1] < 0) or (bin_ranges[0] >= bin_ranges[1]):

Feature/calculate fft energy blocks #1024

Are you sure you want to change the base?

Feature/calculate fft energy blocks #1024

Conversation

anates commented May 30, 2023

codecov-commenter commented Jun 1, 2023

Codecov Report

nils-braun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nils-braun commented Jun 14, 2023

anates commented Jun 15, 2023

nils-braun commented Jun 20, 2023

anates commented Jun 22, 2023

nils-braun commented Jun 25, 2023 • edited Loading

anates commented Jul 7, 2023

anates commented Sep 5, 2023

nils-braun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nils-braun commented Jun 25, 2023 •

edited

Loading