Support remfile and file-like in nwb rec extractor #2169
Conversation
Thanks @magland! You can add streaming tests here: https://github.com/SpikeInterface/spikeinterface/blob/main/src/spikeinterface/extractors/tests/test_nwb_s3_extractor.py
Thanks @alejoe91, I added some tests. Let's see if they pass!
You need to add file to the kwargs for pickling to work.
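To illustrate the point about kwargs and pickling, here is a hypothetical sketch (DummyNwbExtractor and _rebuild are made-up stand-ins, not the actual SpikeInterface code): extractors store their constructor arguments in self._kwargs so the object can be rebuilt when unpickled, and if the new file argument is left out of _kwargs, the reconstructed object silently loses it.

```python
# Hypothetical sketch, not the actual SpikeInterface implementation.
import pickle

def _rebuild(cls, kwargs):
    # Reconstruct the extractor from its stored constructor arguments
    return cls(**kwargs)

class DummyNwbExtractor:
    def __init__(self, file_path=None, file=None):
        self.file_path = file_path
        self.file = file
        # `file` must be included here, or the pickle round-trip drops it
        self._kwargs = dict(file_path=file_path, file=file)

    def __reduce__(self):
        # Rebuild from the stored kwargs instead of the instance dict
        return (_rebuild, (type(self), self._kwargs))

original = DummyNwbExtractor(file_path="sub-01.nwb")
restored = pickle.loads(pickle.dumps(original))
print(restored.file_path)  # sub-01.nwb
```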
Concerning this:
Also allows passing in a file-like object. This is important for dendro for embargoed datasets because we need to pass in a file-like object that is capable of renewing its remote url periodically as the presigned download url expires.
I am curious: why can't you get away with just re-initializing the object when the path changes?
# Condition / loop here to check whether the path changed
try:
    file_path = (
        "https://dandi-api-staging-dandisets.s3.amazonaws.com/blobs/5f4/b7a/5f4b7a1f-7b95-4ad8-9579-4df6025371cc"
    )
    file = remfile.File(file_path)
    # File path should be correct here
    rec = NwbRecordingExtractor(file_path=file_path)
In other words, why is just modifying the url (the file_path) not enough?
Thanks @h-mayorquin
It's because the recording object needs to be passed into other functions which may take a long time to execute. For example, where we don't have control over some_long_analysis.
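A sketch of why re-initialization isn't enough (RenewingFile and some_long_analysis are illustrative names, not from the codebase): once the recording is handed to long-running third-party code, nothing outside can swap in a fresh presigned URL, so renewal has to live inside the file-like object itself.

```python
# Illustrative sketch only; a real implementation would fetch bytes over HTTP.
import time

class RenewingFile:
    """Minimal file-like object that refreshes an expiring URL on read."""
    def __init__(self, get_presigned_url, ttl_sec=3600):
        self._get_url = get_presigned_url
        self._ttl = ttl_sec
        self.renew_count = 0
        self._renew()

    def _renew(self):
        self.url = self._get_url()       # e.g. ask the API for a fresh URL
        self._expires_at = time.time() + self._ttl
        self.renew_count += 1

    def read(self, size=-1):
        if time.time() >= self._expires_at:
            self._renew()                # transparent to the caller
        # ... a real implementation would fetch `size` bytes from self.url ...
        return b""

def some_long_analysis(file_like):
    # Stand-in for hours of processing that keeps reading from the file
    for _ in range(3):
        file_like.read(1024)

f = RenewingFile(lambda: "https://example.com/presigned", ttl_sec=0)
some_long_analysis(f)
print(f.renew_count)  # one initial renewal plus one per read, since ttl_sec=0
```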
Thank you very much!
if stream_mode == "fsspec":
    # only add stream_cache_path to kwargs if it was passed as an argument
    if stream_cache_path is not None:
        stream_cache_path = str(Path(self.stream_cache_path).absolute())

self.extra_requirements.extend(["pandas", "pynwb", "hdmf"])
self._electrical_series = electrical_series

# set serializability bools
# TODO: correct spelling of self._serializablility throughout SI
ooops
solved here #2238
I just added a couple docstring changes/clarifications. Feel free to use if you want.
Co-authored-by: Zach McKenzie <[email protected]>
Thanks, I accepted those commits. I wonder if we want to squash before merging. Maybe you don't care about a large number of commits, IDK.
Doesn't matter to me, whatever @alejoe91 and @samuelgarcia want when they merge.
@magland I would also like to highlight your comment about our lack of separation between positional and keyword-only arguments. I think it would make the API more flexible in the future and avoid the sub-optimal ordering you have pointed out.
Yes, it's a potentially big problem: if users do not use named keyword arguments in their scripts and you later want to add parameters to a function, it may break their code unless you add the new ones at the end, and that's not always desirable. I recommend that
This came up in #1800 when the parameter order was swapped; just linking an example of when this caused a problem, in support of using keyword-only arguments.
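The keyword-only idea above can be sketched as follows (SafeExtractor is an illustrative stand-in, not the actual extractor signature): placing a bare * in the definition makes everything after it keyword-only, so parameters added later cannot silently shift positional calls.

```python
# Illustrative sketch; real extractors have many more parameters.
class SafeExtractor:
    def __init__(self, file_path=None, *, stream_mode=None, file=None):
        self.file_path = file_path
        self.stream_mode = stream_mode
        self.file = file

ok = SafeExtractor("data.nwb", stream_mode="remfile")   # fine
try:
    SafeExtractor("data.nwb", "remfile")                # positional: rejected
except TypeError as e:
    print("rejected:", e)
```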
@magland is this ready to merge?
I can actually give it a try this afternoon as I need to do some reading from dandi. |
Hi, I tried to test this yesterday but hit a couple of bugs. I will fix the ones related to spikeinterface, but there are some files I can't load with remfile.

For asset path:
Of dandiset id:
The url of the blob:

This works:

import fsspec
import h5py
from pynwb import NWBHDF5IO

# URL of the file
file_url = 'https://dandiarchive.s3.amazonaws.com/blobs/413/cf0/413cf0f3-3498-485a-b099-84bc36d43ca6'
fs = fsspec.filesystem("http")
fsspec_file = fs.open(file_url, "rb")
# Use h5py to open the remote file
file = h5py.File(fsspec_file, 'r')
nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
nwbfile

But this does not:
The output is:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
dev_notebook.ipynb Cell 35 line 17
     14 cached_file = caching_file_system.open(path=file_url, mode="rb")
     15 # Use h5py to open the cached file
---> 17 file = h5py.File(cached_file, 'r')
     19 nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
     20 nwbfile

File ~/miniconda3/envs/neuroconv_env/lib/python3.10/site-packages/h5py/_hl/files.py:562, in File.__init__(...)
    553 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    554                  locking, page_buf_size, min_meta_keep, min_raw_keep,
    555                  alignment_threshold=alignment_threshold,
    556                  alignment_interval=alignment_interval,
    557                  meta_block_size=meta_block_size,
    558                  **kwds)
    559 fcpl = make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    560                  fs_persist=fs_persist, fs_threshold=fs_threshold,
    561                  fs_page_size=fs_page_size)
--> 562 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    564 if isinstance(libver, tuple):
    565     self._libver = libver

File ~/miniconda3/envs/neuroconv_env/lib/python3.10/site-packages/h5py/_hl/files.py:235, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    233 if swmr and swmr_support:
    234     flags |= h5f.ACC_SWMR_READ
--> 235 fid = h5f.open(name, flags, fapl=fapl)
    236 elif mode == 'r+':
    237     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()
File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()
File h5py/h5f.pyx:106, in h5py.h5f.open()

OSError: Unable to open file (file signature not found)
Note that remfile shares this problem with fsspec when the caching filesystem is used, so that might offer a clue. That is, the following produces a very similar error:

import fsspec
import h5py
from pathlib import Path
from fsspec.implementations.cached import CachingFileSystem
from pynwb import NWBHDF5IO

file_url = 'https://dandiarchive.s3.amazonaws.com/blobs/413/cf0/413cf0f3-3498-485a-b099-84bc36d43ca6'
caching_file_system = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    cache_storage=str(Path.cwd()),
)
cached_file = caching_file_system.open(path=file_url, mode="rb")
# Use h5py to open the cached file
file = h5py.File(cached_file, 'r')
nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
nwbfile

I am using h5py 3.10.0 and the latest version of requests.
I tried replacing

nwbfile = NWBHDF5IO(file, 'r', load_namespaces=True).read()

in your script with

nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()

and it seemed to work. Then I did the following timing test:

import remfile
import time
import fsspec
import h5py
from pynwb import NWBHDF5IO

# URL of the file
file_url = 'https://dandiarchive.s3.amazonaws.com/blobs/413/cf0/413cf0f3-3498-485a-b099-84bc36d43ca6'

for mode in ['fsspec', 'remfile']:
    if mode == 'fsspec':
        print('fsspec mode.......')
        timer = time.time()
        fs = fsspec.filesystem("http")
        fsspec_file = fs.open(file_url, "rb")
        # Use h5py to open the remote file
        file = h5py.File(fsspec_file, 'r')
        print(file.keys())
        nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
        print(nwbfile)
        elapsed_sec = time.time() - timer
        print(f'Elapsed time for fsspec mode: {elapsed_sec:.2f} sec')
    elif mode == 'remfile':
        print('remfile mode.......')
        timer = time.time()
        rfile = remfile.File(file_url, verbose=True)
        file = h5py.File(rfile, 'r')
        print(file.keys())
        nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
        print(nwbfile)
        elapsed_sec = time.time() - timer
        print(f'Elapsed time for remfile mode: {elapsed_sec:.2f} sec')

And I got:

fsspec mode.......
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'specifications', 'stimulus', 'timestamps_reference_time', 'units']>
/home/magland/miniconda3/envs/dev/lib/python3.8/site-packages/hdmf/spec/namespace.py:531: UserWarning: Ignoring cached namespace 'hdmf-common' version 1.5.1 because version 1.8.0 is already loaded.
warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/magland/miniconda3/envs/dev/lib/python3.8/site-packages/hdmf/spec/namespace.py:531: UserWarning: Ignoring cached namespace 'core' version 2.5.0 because version 2.6.0-alpha is already loaded.
warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/magland/miniconda3/envs/dev/lib/python3.8/site-packages/hdmf/spec/namespace.py:531: UserWarning: Ignoring cached namespace 'hdmf-experimental' version 0.2.0 because version 0.5.0 is already loaded.
warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
root pynwb.file.NWBFile at 0x140179901711552
Fields:
acquisition: {
ElectricalSeriesAp <class 'pynwb.ecephys.ElectricalSeries'>,
ElectricalSeriesLf <class 'pynwb.ecephys.ElectricalSeries'>,
OriginalVideoBodyCamera <class 'pynwb.image.ImageSeries'>,
OriginalVideoLeftCamera <class 'pynwb.image.ImageSeries'>,
OriginalVideoRightCamera <class 'pynwb.image.ImageSeries'>
}
devices: {
NeuropixelsProbe <class 'pynwb.device.Device'>
}
electrode_groups: {
NeuropixelsShank <class 'pynwb.ecephys.ElectrodeGroup'>
}
electrodes: electrodes <class 'hdmf.common.table.DynamicTable'>
experiment_description: IBL aims to understand the neural basis of decision-making in the mouse by gathering a whole-brain activity map composed of electrophysiological recordings pooled from multiple laboratories. We have systematically recorded from nearly all major brain areas with Neuropixels probes, using a grid system for unbiased sampling and replicating each recording site in at least two laboratories. These data have been used to construct a brain-wide map of activity at single-spike cellular resolution during a decision-making task. In addition to the map, this data set contains other information gathered during the task: sensory stimuli presented to the mouse; mouse decisions and response times; and mouse pose information from video recordings and DeepLabCut analysis.
file_create_date: [datetime.datetime(2023, 2, 17, 3, 15, 23, 896756, tzinfo=tzutc())]
identifier: c33e2740-5475-463e-bd16-d1c38da37463
institution: University College London
intervals: {
trials <class 'pynwb.epoch.TimeIntervals'>
}
lab: Hausser
processing: {
behavior <class 'pynwb.base.ProcessingModule'>
}
protocol: _iblrig_tasks_ephysChoiceWorld6.6.1
related_publications: ['https://doi.org/10.6084/m9.figshare.21400815.v6, https://doi.org/10.1101/2020.01.17.909838']
session_description: The full description of the session/task protocol can be found in Appendix 2 of International Brain Laboratory, et al. "Standardized and reproducible measurement of decision-making in mice." Elife 10 (2021); e63711.
session_id: 1d4a7bd6-296a-48b9-b20e-bd0ac80750a5
session_start_time: 2022-07-21 16:08:53.428769+01:00
subject: subject abc.IblSubject at 0x140179901711936
Fields:
date_of_birth: 2021-05-15 00:00:00+01:00
description: Mice were housed under a 12/12 h light/dark cycle (normal or inverted depending on the laboratory) with food and water 112 available ad libitum, except during behavioural training days. Electrophysiological recordings and behavioural training were performed during either the dark or light phase of the subject cycle depending on the laboratory. Subjects were obtained from either the Jackson Laboratory or Charles River.
expected_water_ml: 1.1400000000000001
last_water_restriction: 2021-06-01T17:42:13
remaining_water_ml: 1.1400000000000001
sex: M
species: Mus musculus
strain: C57BL/6
subject_id: PL015
url: https://openalyx.internationalbrainlab.org/subjects/PL015
weight: 0.030100000000000002 kg
timestamps_reference_time: 2022-07-21 16:08:53.428769+01:00
trials: trials <class 'pynwb.epoch.TimeIntervals'>
units: units <class 'pynwb.misc.Units'>
Elapsed time for fsspec mode: 393.38 sec
remfile mode.......
Loading 1 chunks starting at 0 (0.1024 million bytes)
Loading 1 chunks starting at 455542 (0.1024 million bytes)
Loading 1 chunks starting at 455838 (0.1024 million bytes)
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'specifications', 'stimulus', 'timestamps_reference_time', 'units']>
Loading 2 chunks starting at 455839 (0.2048 million bytes)
Loading 2 chunks starting at 455643 (0.2048 million bytes)
...
root pynwb.file.NWBFile at 0x140179901580528
Fields:
acquisition: {
ElectricalSeriesAp <class 'pynwb.ecephys.ElectricalSeries'>,
ElectricalSeriesLf <class 'pynwb.ecephys.ElectricalSeries'>,
OriginalVideoBodyCamera <class 'pynwb.image.ImageSeries'>,
OriginalVideoLeftCamera <class 'pynwb.image.ImageSeries'>,
OriginalVideoRightCamera <class 'pynwb.image.ImageSeries'>
}
devices: {
NeuropixelsProbe <class 'pynwb.device.Device'>
}
electrode_groups: {
NeuropixelsShank <class 'pynwb.ecephys.ElectrodeGroup'>
}
electrodes: electrodes <class 'hdmf.common.table.DynamicTable'>
experiment_description: IBL aims to understand the neural basis of decision-making in the mouse by gathering a whole-brain activity map composed of electrophysiological recordings pooled from multiple laboratories. We have systematically recorded from nearly all major brain areas with Neuropixels probes, using a grid system for unbiased sampling and replicating each recording site in at least two laboratories. These data have been used to construct a brain-wide map of activity at single-spike cellular resolution during a decision-making task. In addition to the map, this data set contains other information gathered during the task: sensory stimuli presented to the mouse; mouse decisions and response times; and mouse pose information from video recordings and DeepLabCut analysis.
file_create_date: [datetime.datetime(2023, 2, 17, 3, 15, 23, 896756, tzinfo=tzutc())]
identifier: c33e2740-5475-463e-bd16-d1c38da37463
institution: University College London
intervals: {
trials <class 'pynwb.epoch.TimeIntervals'>
}
lab: Hausser
processing: {
behavior <class 'pynwb.base.ProcessingModule'>
}
protocol: _iblrig_tasks_ephysChoiceWorld6.6.1
related_publications: ['https://doi.org/10.6084/m9.figshare.21400815.v6, https://doi.org/10.1101/2020.01.17.909838']
session_description: The full description of the session/task protocol can be found in Appendix 2 of International Brain Laboratory, et al. "Standardized and reproducible measurement of decision-making in mice." Elife 10 (2021); e63711.
session_id: 1d4a7bd6-296a-48b9-b20e-bd0ac80750a5
session_start_time: 2022-07-21 16:08:53.428769+01:00
subject: subject abc.IblSubject at 0x140179904705152
Fields:
date_of_birth: 2021-05-15 00:00:00+01:00
description: Mice were housed under a 12/12 h light/dark cycle (normal or inverted depending on the laboratory) with food and water 112 available ad libitum, except during behavioural training days. Electrophysiological recordings and behavioural training were performed during either the dark or light phase of the subject cycle depending on the laboratory. Subjects were obtained from either the Jackson Laboratory or Charles River.
expected_water_ml: 1.1400000000000001
last_water_restriction: 2021-06-01T17:42:13
remaining_water_ml: 1.1400000000000001
sex: M
species: Mus musculus
strain: C57BL/6
subject_id: PL015
url: https://openalyx.internationalbrainlab.org/subjects/PL015
weight: 0.030100000000000002 kg
timestamps_reference_time: 2022-07-21 16:08:53.428769+01:00
trials: trials <class 'pynwb.epoch.TimeIntervals'>
units: units <class 'pynwb.misc.Units'>
Elapsed time for remfile mode: 44.25 sec

So the difference in timing is 393 sec for fsspec vs 44 sec for remfile! I'll also note that the pynwb overhead is significant: you can begin to read the h5py file after just a second or two. That's why the nwb recording extractors I am using do not use pynwb at all, but go directly to the electrical series objects in the h5py file. I think in a separate PR we should explore using that method for the SI nwb recording extractor.
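The "go directly to the electrical series with h5py" idea can be sketched as follows (read_traces is an illustrative helper, and the demo file is synthetic; the group path follows the NWB layout acquisition/&lt;series_name&gt;/data):

```python
# Hedged sketch: slicing an electrical series dataset with h5py alone,
# skipping pynwb's namespace loading entirely.
import h5py
import numpy as np

def read_traces(h5_file, series_name, start_frame, end_frame):
    # h5py slices lazily, so only the requested frames are fetched
    dset = h5_file[f"acquisition/{series_name}/data"]
    return dset[start_frame:end_frame, :]

# In-memory HDF5 file standing in for a (remote) NWB file
f = h5py.File("demo.nwb", "w", driver="core", backing_store=False)
f.create_dataset(
    "acquisition/ElectricalSeriesAp/data",
    data=np.arange(20, dtype=np.int16).reshape(10, 2),  # 10 frames, 2 channels
)
traces = read_traces(f, "ElectricalSeriesAp", 2, 5)
print(traces.shape)  # (3, 2)
```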
@magland

import fsspec
import h5py
from pathlib import Path
from fsspec.implementations.cached import CachingFileSystem
from pynwb import NWBHDF5IO

file_url = 'https://dandiarchive.s3.amazonaws.com/blobs/413/cf0/413cf0f3-3498-485a-b099-84bc36d43ca6'
caching_file_system = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    cache_storage=str(Path.cwd()),
)
cached_file = caching_file_system.open(path=file_url, mode="rb")
# Use h5py to open the cached file
file = h5py.File(cached_file, 'r')

Can you point to the extractors that you are using?
https://github.com/scratchrealm/pc-spike-sorting/blob/main/common/NwbRecording.py

It doesn't have all the features of the SI version, but it has the ones I need for spike sorting.
Thanks a bunch, I will check them out.
Thanks a lot Jeremy and Ramon for fighting with this NWB streaming.
Wow, thanks @h-mayorquin! The network activity plot is pretty convincing. I just created an issue on pynwb to update their docs to suggest remfile as the preferred method.
Adds "remfile" stream mode option for NwbRecordingExtractor.
Also allows passing in a file-like object. This is important for dendro for embargoed datasets because we need to pass in a file-like object that is capable of renewing its remote url periodically as the presigned download url expires.
In terms of parameter order, I put the new parameter "file" after electrical_series_name because somebody may be calling this without naming the arguments (since * is not used in the constructor definition).
This is untested (are there unit tests for nwb extractor?)
@alejoe91 @samuelgarcia @luiztauffer