Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hdf5 backend support forNwbRecordingExtractor #2298

Merged

Conversation

h-mayorquin
Copy link
Collaborator

@h-mayorquin h-mayorquin commented Dec 6, 2023

Depends on #2294. This makes the NWB extraction faster as it avoids the nwb validation.

@magland If you have some time to sparse it would be great to get your pair of eyes.

@h-mayorquin h-mayorquin self-assigned this Dec 6, 2023
@h-mayorquin h-mayorquin marked this pull request as ready for review December 6, 2023 11:27
@magland
Copy link
Member

magland commented Dec 6, 2023

@h-mayorquin Overall this looks very good.

I am a bit concerned about the electrical_series_name / electrical_series_location. I understand you are trying to maximize the chance that it will find the correct electrical series object, and provide helpful output when it is not found. But I think it's best to be explicit about the procedure that is used to determine this. Perhaps you could simplify wherever possible? And more importantly spell out the procedure in the docs for the class?

@zm711 zm711 added the extractors Related to extractors module label Dec 6, 2023
@h-mayorquin
Copy link
Collaborator Author

h-mayorquin commented Dec 6, 2023

@magland Thanks for taking a look.

I added further explanation in the docstring concerning electrical_series_location. I will iterate to improve even more.

I am a bit concerned about the electrical_series_name / electrical_series_location. I understand you are trying to maximize the chance that it will find the correct electrical series object, and provide helpful output when it is not found. But I think it's best to be explicit about the procedure that is used to determine this.

Yes, so to describe it here. The intention is for It to iterate over all the groups in the hdf5 file and find the one which name matches with electrical_series_name. As you said, it is a convenience function whose main goal is that users can keep accessing their data by specifying the electrical_series_name as they are doing with the current extractor in the main branch. Do you think it would be better to restrict users to specify the location when they are using hdf5-fast-mode instead? That certaintly would be easier for us to mantain. @alejoe91 what do you think?

@h-mayorquin
Copy link
Collaborator Author

I did some changes in the last commit as changing all the behavior is too drastic. I instead moved to a factory pattern where the old behavior is called when a flag "fast_mode" is set to False. That is, if fast_mode=True the class _NWBHDF5RecordingExtractor is used to build the recording using the h5py, otherwise, the class _NwbPureRecordingExtractor (the original recording) is used instead and in turn uses pynwb to build the recording.

The advantages of the factory pattern are:

  • Keeps the old behavior as a fallback.
  • Keeps the code organized without having to use if all over the place within the same __init__ constructor.
  • Paves the way for a third option in the future to add support to reading zarr nwb files in fast mode.
  • Will be easier to profile between modes.
  • Will be easier to detect inconsistencies by changing mode.

The disavantages are that is a tad more complex and it introduces more code to mantain. There is some code-duplication now between the _NWBHDF5RecordingExtractor and _NwbPureRecordingExtractor but I think that's fine for the moment. I will reduce this in a Future PR to keep things on this one from blowing up more than it already is.

I added tests for the two modes now. Also I feel unsatisfied with the name fast_mode for the flag, I welcome any suggestion for a better flag name.

@h-mayorquin
Copy link
Collaborator Author

OK, I changed fast_mode to io_backend which is literal that can be "pynwb" or "hdf5". I think this is more precise and will be useful when we start getting zarr files. It has the downside that is less clear about the purpose of using the hdf5 backend so I have indicated this in the docstring.

@h-mayorquin h-mayorquin changed the title Faster NWBExtractor NwbRecordingExtractor Add hdf5 backend support forNwbRecordingExtractor Dec 8, 2023
@h-mayorquin
Copy link
Collaborator Author

h-mayorquin commented Dec 13, 2023

remfile and hdf5 with electrical series name: 1 minute 15 seconds
remfile and hdf5 with electrical series location: 10.9 seconds
remfile and pynwb with electrical series name: 1 minute 23 seconds

fsspec just hangs for all the cases with this files. 50 minutes and still was not done.

image

@h-mayorquin
Copy link
Collaborator Author

After a discussion with @alejoe91 we decided to change the name to "use_pynwb" and the default should be False. When we add support for Zarr we should identify on the fly if the file is Zarr and use either h5py or the zarr library to extract the data and modify the documentation when use_pynwb=False. Meanwhile, the reading of zarr files will require use_pynwb=True.

Copy link
Member

@alejoe91 alejoe91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thansk for this amazing work @h-mayorquin

I have a couple of very minor final requests!

src/spikeinterface/extractors/nwbextractors.py Outdated Show resolved Hide resolved
src/spikeinterface/extractors/nwbextractors.py Outdated Show resolved Hide resolved
src/spikeinterface/extractors/nwbextractors.py Outdated Show resolved Hide resolved
sampling_frequency = electrical_series["starting_time"].attrs["rate"]
elif "timestamps" in electrical_series.keys():
t_start = electrical_series["timestamps"][0]
sampling_frequency = 1 / np.median(np.diff(electrical_series["timestamps"][:1000]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sampling_frequency = 1 / np.median(np.diff(electrical_series["timestamps"][:1000]))
sampling_frequency = 1 / np.median(np.diff(electrical_series["timestamps"][:samples_for_rate_estimation]))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that the samples_for_rate_estimation default is too large. I think tha 1000 samples should be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. Can you change to 1000?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be done.

Comment on lines +213 to +227
@pytest.mark.parametrize("electrical_series_name", ["ElectricalSeries1", "ElectricalSeries2"])
def test_that_hdf5_and_pynwb_extractors_return_the_same_data(path_to_nwbfile, electrical_series_name):
recording_extractor_hdf5 = NwbRecordingExtractor(
path_to_nwbfile,
electrical_series_name=electrical_series_name,
use_pynwb=False,
)

recording_extractor_pynwb = NwbRecordingExtractor(
path_to_nwbfile,
electrical_series_name=electrical_series_name,
use_pynwb=True,
)

check_recordings_equal(recording_extractor_hdf5, recording_extractor_pynwb)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!!

@h-mayorquin h-mayorquin merged commit 0a0570b into SpikeInterface:main Dec 14, 2023
10 of 11 checks passed
@h-mayorquin h-mayorquin deleted the fast_mode_hdf5_recording branch December 14, 2023 11:54
@magland
Copy link
Member

magland commented Dec 15, 2023

Thanks @h-mayorquin and @alejoe91 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extractors Related to extractors module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants