Test Polars compatibility and performance #368

toni-neurosc · 2024-05-03T07:28:22Z

toni-neurosc
May 3, 2024
Collaborator

So Polars is a replacement for Pandas written in Rust (https://pola.rs/) which can be 10-100x faster than Pandas depending on the operations.
However, it's still not fully compatible with everything. For example, I have read that it can have problems working directly with scikit-learn.

PyNM is using Pandas dataframes to store analysis results, so I think at some point we should at least give Polars a go and see if it would fit the project.

Demo of Plotly Dash with Polars https://www.youtube.com/watch?v=_iebrqafOuM

timonmerk · 2024-05-03T15:42:00Z

timonmerk
May 3, 2024
Maintainer

Thanks @toni-neurosc for mentioning that! Nice video also with impressive speed improvements over pandas. I guess our main concerns would be the time it takes to store data in an existing data frame / array (using append/concat after feature computation) and then the IO of saving the data frame / array.
Thinking about it, the dataframe columns also stay the same throughout the computation. So we could think about saving only the numpy array of features to disk in real time with np.save? It might not be a super elegant solution, but after the recording is finished those files could still be merged into a single csv / parquet dataframe. I guess it's also less overhead than a database write.
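A rough sketch of what that could look like, just to make the idea concrete. The column names, file naming scheme and random "features" are made up for illustration, and the merge step writes a csv with the standard library only (a parquet export would go through pandas or polars instead):

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical column names; these stay fixed for the whole recording.
columns = ["ch1_theta", "ch1_beta", "ch2_theta", "ch2_beta"]

out_dir = Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)

# During the recording: write one small .npy file per iteration.
n_batches = 5
for i in range(n_batches):
    features = rng.random(len(columns))  # stand-in for the computed features
    # Zero-padded index so that sorting by file name preserves time order.
    np.save(out_dir / f"features_{i:06d}.npy", features)

# After the recording: stack all per-iteration files into one 2D array.
files = sorted(out_dir.glob("features_*.npy"))
merged = np.vstack([np.load(f) for f in files])

# Write a single csv with the (unchanged) column names as header.
csv_path = out_dir / "features.csv"
with open(csv_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(columns)
    writer.writerows(merged.tolist())
```

One file per iteration keeps each write tiny, at the cost of many small files that have to be merged and cleaned up afterwards.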

toni-neurosc · 2024-05-05T11:29:48Z

toni-neurosc
May 5, 2024
Collaborator Author

Hi @timonmerk, I opened a discussion about this in #322. I did not consider numpy's .npy format but it's actually not that crazy, since pretty much anyone who wants to use PyNM is going to be doing the data processing in Python for sure.

In fact, I had already thought about the problem of the intermediate representation of the feature calculation results, which are currently written to a dictionary and then moved into a Pandas dataframe. I think the dictionary representation might be a bit troublesome, and my idea was to basically flatten the nested structure that can arise in some of the feature calculations (e.g. different frequency bands for each channel) and hold the order of each of the features in a separate string array, then return a tuple[list[str], np.ndarray] for each of the features.
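To illustrate the flattening idea, here is a minimal sketch. The function name `flatten_features`, the underscore naming scheme and the example values are all assumptions for illustration, not existing PyNM code:

```python
import numpy as np


def flatten_features(nested: dict, prefix: str = "") -> tuple[list[str], np.ndarray]:
    """Flatten a nested dict of scalars into (names, values), keeping insertion order."""
    names: list[str] = []
    values: list[float] = []

    def _walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                # Join nesting levels with underscores, e.g. "fft_ch1_theta".
                _walk(child, f"{path}_{key}" if path else str(key))
        else:
            names.append(path)
            values.append(float(node))

    _walk(nested, prefix)
    return names, np.asarray(values, dtype=np.float64)


# Hypothetical nested result: frequency bands per channel.
fft_features = {
    "ch1": {"theta": 0.8, "beta": 1.5},
    "ch2": {"theta": 0.6, "beta": 1.1},
}
names, values = flatten_features(fft_features, prefix="fft")
# names  -> ["fft_ch1_theta", "fft_ch1_beta", "fft_ch2_theta", "fft_ch2_beta"]
# values -> 1D float64 array in the same order
```

Since the names do not change between iterations, they would only need to be computed once, and every later iteration could return just the array.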

If we were to do that, maybe we would be able to ditch dataframes altogether. Maybe we need to use them for the GUI for visualization, but in order to send data around parts of the program, I think we could stay within numpy all the time if we wanted.

Then storing to .npy would be quite fast. We just need to save a file with the header separate from the main data array. Plus, it supports compression with .npz for sparse data and it's quite fast according to this benchmark:
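As a sketch of the storage layout described here (this is not the benchmark itself), the header could live in a small text file next to a plain .npy, or both could go into one compressed .npz via `np.savez_compressed`. File names, feature names and the mostly-zero example data are assumptions for illustration:

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical feature names (the "header") and a mostly-zero data array.
names = ["fft_ch1_theta", "fft_ch1_beta", "fft_ch2_theta"]
data = np.zeros((1000, len(names)))
data[::100, 0] = 1.0

out_dir = Path(tempfile.mkdtemp())

# Option A: header in a text file, data in an uncompressed .npy file.
(out_dir / "feature_names.txt").write_text("\n".join(names))
np.save(out_dir / "features.npy", data)

# Option B: header and data together in one compressed .npz archive.
np.savez_compressed(out_dir / "features.npz", names=np.array(names), data=data)

# Loading option A.
loaded_names = (out_dir / "feature_names.txt").read_text().splitlines()
loaded_data = np.load(out_dir / "features.npy")

# Loading option B; the names are a plain string array, so no pickling is needed.
with np.load(out_dir / "features.npz") as archive:
    npz_names = archive["names"].tolist()
    npz_data = archive["data"]
```

Option A can be written incrementally in whatever chunks are convenient, while option B is more of an end-of-recording export, since the archive is written in one go.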

timonmerk · 2024-06-04T12:12:30Z

timonmerk
Jun 4, 2024
Maintainer

I played with polars a bit for a different project now, and it's quite amazing! However, the core problem, that we currently accumulate all computed features in RAM, still needs to be addressed. Following my previous calculation, I will try to implement sqlite and save features after every iteration. This option was the fastest and should not create too much overhead.

Also, this should not affect the other examples, since both pandas and polars provide methods to load from a database. This all comes at the cost of not having a human-readable csv file. But we could also save a snippet / head of the features simply for debugging purposes.
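A minimal sketch of the per-iteration sqlite write, using only the standard library `sqlite3` module. The table name, column names and dummy values are assumptions for illustration, not the actual implementation:

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical feature columns; fixed for the whole recording.
columns = ["ch1_theta", "ch1_beta", "ch2_theta"]
db_path = Path(tempfile.mkdtemp()) / "features.db"

con = sqlite3.connect(db_path)
col_defs = ", ".join(f'"{c}" REAL' for c in columns)
con.execute(f"CREATE TABLE IF NOT EXISTS features (time REAL, {col_defs})")

# One placeholder per column, plus one for the timestamp.
placeholders = ", ".join("?" for _ in range(len(columns) + 1))
insert_sql = f"INSERT INTO features VALUES ({placeholders})"

for i in range(10):
    t = i * 0.1
    feats = [float(i), float(i) * 2, float(i) * 3]  # stand-in for computed features
    con.execute(insert_sql, (t, *feats))
    con.commit()  # persist after every iteration so nothing accumulates in RAM

# A small human-readable snippet for debugging.
head = con.execute("SELECT * FROM features LIMIT 5").fetchall()
n_rows = con.execute("SELECT COUNT(*) FROM features").fetchone()[0]
con.close()
```

Afterwards the full table can be pulled back into a dataframe, e.g. with `pandas.read_sql_query("SELECT * FROM features", con)`, and the `head` rows could be dumped to a small csv for inspection.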

toni-neurosc · 2024-06-04T12:20:41Z

toni-neurosc
Jun 4, 2024
Collaborator Author

Coincidentally earlier this morning, when I erroneously thought I had fixed the RTD, I preemptively opened a new local branch called "no_pandas" where I wanted to eventually:

Replace all pandas instances with polars
See if I could get rid of the features_dict data structure and return features as ndarrays. It's uncertain whether it's going to be faster (there is a chance that having a memory-sparse dictionary is actually faster when it comes to building the final dataframe than a contiguous numpy array), but at the very minimum I would return separate dictionaries instead of passing a single one around.
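For the second point, a sketch of how separately returned per-feature results could be combined into one row, assuming each feature calculator returns its own (names, values) pair. The helper `combine_features` and the example feature names are hypothetical:

```python
import numpy as np


def combine_features(
    results: list[tuple[list[str], np.ndarray]],
) -> tuple[list[str], np.ndarray]:
    """Concatenate per-feature (names, values) pairs into a single row."""
    names: list[str] = []
    arrays: list[np.ndarray] = []
    for feat_names, feat_values in results:
        if len(feat_names) != len(feat_values):
            raise ValueError("each feature needs exactly one name per value")
        names.extend(feat_names)
        arrays.append(np.asarray(feat_values, dtype=np.float64))
    return names, np.concatenate(arrays)


# Hypothetical outputs of two independent feature calculators.
fft_result = (["fft_ch1_theta", "fft_ch1_beta"], np.array([0.8, 1.5]))
hjorth_result = (["hjorth_ch1_activity"], np.array([0.3]))

names, row = combine_features([fft_result, hjorth_result])
# names -> ["fft_ch1_theta", "fft_ch1_beta", "hjorth_ch1_activity"]
# row   -> one contiguous float64 array, ready to append to the feature matrix
```

Whether this beats the dictionary approach would still have to be measured; the sketch only shows that no shared dictionary has to be passed around.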
