
Whitening fix - compute covariance matrix in float #3505

Merged
7 commits merged into SpikeInterface:main on Nov 4, 2024

Conversation

JoeZiminski
Collaborator

@JoeZiminski commented Oct 25, 2024

In whiten.py there is a datatype argument that sets the datatype of the output recording. This cast is applied after whitening and does not affect the datatype used for computing the whitening matrix. Up until 0.100.8 the covariance matrix was always computed in float32, but in 0.101.0 the line random_data = np.astype(random_data, np.float32) (which sat just below this line) was removed. Now, if the recording is int16, the covariance matrix estimation is performed in int16 and will overflow in nearly all cases.

This PR reinstates that line, for now casting to float64. The recording will be cast back to the user-specified dtype or the input recording dtype, so I don't think there is any downside to using maximum precision. The only consideration is that we may want to move to GPU support in future, in which case using float32 would keep things consistent.
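As a rough illustration of the failure mode (a simplified sketch, not the actual whiten.py code; the names and values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the random chunks used to estimate the covariance matrix
random_data = rng.normal(0, 5000, size=(10_000, 4)).astype(np.int16)

# int16 accumulation wraps around, so the covariance estimate is garbage
cov_int16 = random_data.T @ random_data / random_data.shape[0]

# the reinstated cast: estimate the covariance in floating point instead
random_data = random_data.astype(np.float32)
cov_float = random_data.T @ random_data / random_data.shape[0]
```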

Unless I'm missing something, I think it would be worth rolling this out in an update ASAP and making a note of it in the release note header. My understanding is that the majority of recordings are stored as int16, so this regression may have caught out quite a lot of users since 0.101.0, and it is quite hard to track down.

TODOs

  • extend tests

@zm711
Collaborator

zm711 commented Oct 25, 2024

Out of curiosity, since Kilosort 1-3 use int16 (and even 4 prefers it), how do they deal with their whitening? I haven't looked.

@JoeZiminski
Collaborator Author

JoeZiminski commented Oct 25, 2024

I was curious about this also; they (see the sketch after this list):

  • have int16 raw data
  • scale to float32 for CAR, filtering, whitening
  • scale up by 200 and convert back to int16.
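
Rough numpy sketch of that round trip (placeholders only, not Kilosort's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
raw_int16 = rng.integers(-150, 150, size=(30_000, 4)).astype(np.int16)

data = raw_int16.astype(np.float32)                 # int16 -> float32
data -= np.median(data, axis=1, keepdims=True)      # CAR / filtering / whitening in float32
scaled = np.clip(data * 200.0, -32768, 32767)       # scale up by 200...
processed_int16 = scaled.astype(np.int16)           # ...and convert back to int16
```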

kilosort4

In kilosort4, the processing is done in float32, as mentioned here. The covariance matrix is computed in float32 (X is in float32 and torch.zeros initialises in float32 by default). You can see the rescaling by 200 and int16 conversion for example here.
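A tiny sketch of the float32-by-default point (not Kilosort4's actual code; the accumulation pattern here is only illustrative):

```python
import torch

n_chan, chunk_len = 16, 60_000
CC = torch.zeros(n_chan, n_chan)        # float32 by default
X = torch.randn(n_chan, chunk_len)      # chunk already converted to float32
CC += (X @ X.T) / chunk_len             # covariance accumulates in float32, no overflow
print(CC.dtype)                         # torch.float32
```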

Curiously, I can see in the data loading class here that if the data is uint16 it is cast to float. But in the class __init__ it states that the default datatype expected is int16. I can't find at the moment where the cast from int16 to float32 is performed (it is definitely performed for uint16). I need to look into this further; from the other comments and code sections, the data is definitely expected to be in float32 for processing.

EDIT: Ah, it is done here; the main purpose of the uint16 block is the centering around 0.

kilosort<4 (based on 2.5)

First, the whitening matrix is computed in float32 (from data processed by gpufilter.py, which performs CAR and filtering in float32). Then float32 conversion, filtering, and CAR are performed in the main preprocessing loop, as well as whitening. Finally, the cast back to int16 is performed here, with the 200 scaling incorporated into the whitening matrix.
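
To make that last point concrete, a hypothetical numpy version of folding the x200 gain into the whitening matrix before the int16 cast (Wrot and the chunk are stand-ins, not values from the real code):

```python
import numpy as np

n_chan = 4
Wrot = np.eye(n_chan, dtype=np.float32)            # placeholder whitening matrix
Wrot_scaled = Wrot * np.float32(200.0)             # gain baked into the matrix

chunk = np.random.default_rng(1).normal(0, 30, (30_000, n_chan)).astype(np.float32)
out_int16 = np.clip(chunk @ Wrot_scaled, -32768, 32767).astype(np.int16)
```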

@zm711
Collaborator

zm711 commented Oct 25, 2024

Cool, I guess they cast too. Did you find the commit/PR where it was changed? Was there a comment about dropping the cast? Some of the preprocessing dtype stuff is a mystery to me, so maybe someone had a reason.

@JoeZiminski
Collaborator Author

Thanks @zm711, good point. I was just looking at the version diff; I think it was here.

@yger
Collaborator

yger commented Oct 28, 2024

Then indeed, this is a big mistake, and random_data should always be cast to float32 before estimating the whitening matrix.

@alejoe91 added the preprocessing (Related to preprocessing module) label on Oct 29, 2024
@alejoe91
Member

@JoeZiminski are you planning to extend tests here or is it good to merge?

@zm711
Collaborator

zm711 commented Oct 31, 2024

Could we open an issue to add the regression test? We use MS5, which relies on the whitening machinery. Would love to have this on main for everyone.

@JoeZiminski
Collaborator Author

Hey @alejoe91 @zm711, sorry, I got waylaid with something but will add some tests today! Agree this will be good to merge ASAP.

@zm711
Collaborator

zm711 commented Oct 31, 2024

Sounds good to me. Thanks Joe!

@JoeZiminski
Collaborator Author

JoeZiminski commented Oct 31, 2024

Hey @alejoe91 @zm711, in the end adding the tests was taking longer than expected, so I'll make a new PR for this. Feel free to merge! At the next meeting we can discuss where to include a warning (e.g. maybe the 0.101.0 release notes, or the release notes for a new version).

In the end I reverted to np.float32 to a) match what was there before, b) match the other implementation (Kilosort), and c) match a GPU version if this is ever implemented. The overflowing computation on int16 ephys data is the dot product between two channel timeseries, whose length is determined by the random chunks (by default 10k samples). In an extreme worst case, if somehow the raw data sits around +32767, demeaning is not selected, and the user requests 1M samples instead of 10k, the value would be about 1.07e+15; the float32 maximum is about 3.4e+38, so I think we'll be fine!
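
For the record, the back-of-envelope check (plain numpy, just re-doing the arithmetic above):

```python
import numpy as np

worst_dot = np.float32(32767) ** 2 * np.float32(1_000_000)
print(worst_dot)                  # ~1.07e15
print(np.finfo(np.float32).max)   # ~3.4e38, so plenty of headroom
```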

@alejoe91 merged commit d78a0da into SpikeInterface:main on Nov 4, 2024
15 checks passed