[Documentation]: Streaming NWB files - recommend using remfile as the preferred method #1791

Closed
magland opened this issue Nov 24, 2023 · 10 comments

magland (Contributor) commented Nov 24, 2023

What would you like changed or added to the documentation and why?

In the instructions for streaming NWB files, I propose that we present a new tool called remfile as the preferred option, because it is up to an order of magnitude more efficient than fsspec for the initial load of an NWB file and offers advantages for loading large datasets within those files as well. I'll provide some justification below. I am happy to draft a PR with the update if you would like to proceed, and I welcome any thoughts and feedback on this proposal.

I created remfile about 3-4 months ago to address the slowness of lazily reading remote NWB files with h5py and pynwb via fsspec. The why and how are described in the remfile README. Essentially...

Why? The conventional way of reading a remote HDF5 file is to use the fsspec library. I am not familiar with the inner workings of fsspec, but it does not appear to be optimized for reading HDF5 files. Efficient access to a remote HDF5 file requires reading small chunks of data to obtain the metadata, then large chunks of data, in parallel, to obtain the larger data arrays.

How? remfile creates a file-like object that reads the remote file in chunks using the requests library. It starts with a relatively small default chunk size, but when it detects that a large data array is being accessed, it adaptively switches to larger chunks. For very large data arrays, it uses multiple threads to read the data in parallel.
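
To make the mechanism concrete, here is a toy sketch of the idea (not remfile's actual implementation; it omits the adaptive chunk sizing, caching, threading, and retries described above):

```python
import requests

class RangeReader:
    """Toy file-like object that fetches remote bytes via HTTP Range requests."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # Ask the server for the total file size up front
        head = requests.head(url, allow_redirects=True)
        self.size = int(head.headers["Content-Length"])

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        else:
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.size - self.pos
        end = min(self.pos + size, self.size) - 1
        if end < self.pos:
            return b""
        # Fetch only the requested byte range from the server
        resp = requests.get(self.url, headers={"Range": f"bytes={self.pos}-{end}"})
        resp.raise_for_status()
        data = resp.content
        self.pos += len(data)
        return data
```

remfile improves on this by rounding requests to chunk boundaries, caching the chunks, growing the chunk size when it detects sequential reads of a large array, and parallelizing those reads with threads.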

The package itself is very small: essentially a single .py file of ~300 lines, with CI tests at almost 98% coverage.

It is now being used in several projects where the gain over fsspec is substantial. See this comment thread, where we found a huge difference in load time for an NWB file on DANDI (~390 sec for fsspec vs. ~43 sec for remfile). This finding was confirmed on a second computer by @h-mayorquin, who also included a very convincing network profile showing that remfile downloaded ~50x less data than fsspec on the initial NWB file load.

I'll repeat the test script from that thread here (you need to pip install remfile):

import time

import fsspec
import h5py
import remfile
from pynwb import NWBHDF5IO

# URL of the file on DANDI
file_url = 'https://dandiarchive.s3.amazonaws.com/blobs/413/cf0/413cf0f3-3498-485a-b099-84bc36d43ca6'

for mode in ['fsspec', 'remfile']:
    if mode == 'fsspec':
        print('fsspec mode.......')
        timer = time.time()
        fs = fsspec.filesystem("http")
        fsspec_file = fs.open(file_url, "rb")

        # Use h5py to open the remote file
        file = h5py.File(fsspec_file, 'r')

        print(file.keys())

        nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
        print(nwbfile)

        elapsed_sec = time.time() - timer
        print(f'Elapsed time for fsspec mode: {elapsed_sec:.2f} sec')
    elif mode == 'remfile':
        print('remfile mode.......')
        timer = time.time()
        rfile = remfile.File(file_url, verbose=True)

        file = h5py.File(rfile, 'r')

        print(file.keys())

        nwbfile = NWBHDF5IO(file=file, mode='r', load_namespaces=True).read()
        print(nwbfile)

        elapsed_sec = time.time() - timer
        print(f'Elapsed time for remfile mode: {elapsed_sec:.2f} sec')

Here is also another timing test at the h5py level: remfile example vs. fsspec example.

A couple of other notes...

Remfile retries HTTP requests on failure, just like fsspec.

Remfile provides an advantage when reading embargoed datasets from DANDI in long-running processes, where the presigned download URL may expire after an hour.

One caveat: compared with fsspec, remfile's local caching options are limited.
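
(For comparison, here is a sketch of fsspec's local caching, which remfile does not match; the cache directory name is arbitrary, and file_url is as in the script above.)

```python
import fsspec
from fsspec.implementations.cached import CachingFileSystem

# Wrap the HTTP filesystem so downloaded blocks persist locally across runs
cfs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    cache_storage="nwb_cache",  # arbitrary local directory
)
fsspec_file = cfs.open(file_url, "rb")
```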

As I said, I am happy to help draft a PR for this if you want to proceed. Also happy to provide further tests/analysis!

Do you have any interest in helping write or edit the documentation?

Yes.


oruebel (Contributor) commented Nov 25, 2023

> I am happy to draft a PR for this if you want to proceed

Thanks for the detailed issue and your willingness to help with a PR. Please feel free to create a PR and then we can iterate on the details there.

rly (Contributor) commented Nov 27, 2023

In my tests, I have also seen a 5-10X speedup using remfile over fsspec.

We have a draft PR (#1761) where we would list remfile as a third option on the streaming docs page.

We could move that to the second or even the first recommended method, with the noted caveat that its local caching options are limited.

magland (Contributor, Author) commented Nov 28, 2023

Thanks @rly and @oruebel

I created some benchmark scripts for comparing the performance of fsspec, ros3, and remfile. I even had the idea of running them in a GitHub Action and providing an auto-generated comparison table, which I thought would be a good thing to point to from the pynwb docs. However, I found that the performance fluctuated a lot and depended on all kinds of network factors; it's not as straightforward as CPU benchmarking. In fact, weirdly, fsspec usually performed comparably to remfile on the GitHub Actions server, whereas on my laptop I always see around an order of magnitude difference. I don't have an explanation for it.

ros3 seems to perform comparably to remfile in the tests I ran, both on GitHub Actions and on my laptop. The annoying thing about ros3 is that it requires an h5py build with the ros3 driver enabled. Are there other downsides of ros3?
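
(For reference, ros3 needs no extra file layer, but it only works if h5py was built with the driver; a sketch, with file_url as in the script above.)

```python
import h5py

# The ros3 driver is only available if it was compiled into this h5py build
if "ros3" in h5py.registered_drivers():
    file = h5py.File(file_url, "r", driver="ros3")
else:
    print("this h5py build was compiled without the ros3 driver")
```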

Bottom line: I don't think my benchmark tests are reliable at this point. Probably the best approach is to somehow measure the total data downloaded rather than the timing, but I'm not sure how to do that (I haven't done any research yet).

I'm happy for remfile to be listed third in that PR, considering it is a new tool. But it would be nice to add a warning that fsspec (if listed first) can be very slow, and to suggest remfile as an alternative.

CodyCBakerPhD (Collaborator) commented Nov 28, 2023

> I even had the idea of running them in a GitHub Action and providing an auto-generated comparison table.
> However, I found that the performance fluctuated a lot and depended on all kinds of network factors; it's not as straightforward as CPU benchmarking.
> Bottom line: I don't think my benchmark tests are reliable at this point. Probably the best approach is to somehow measure the total data downloaded rather than the timing, but I'm not sure how to do that (I haven't done any research yet).

I highly recommend setting up Airspeed Velocity (ASV) for this. I was going to do it once the cloud supplement kicks in, but feel free to get ahead of me on that.

See how sklearn uses it for its own benchmarking.

You can then even deploy it on all kinds of specialized computational infrastructure (e.g., via dendro on AWS), see how performance varies with things like volume IOPS and instance type, and compile/compare results across architectures.
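
A minimal ASV suite might look like the following sketch (the URL and all names are placeholders; network-bound benchmarks usually need a raised timeout):

```python
# benchmarks/streaming.py
import fsspec
import h5py
import remfile

FILE_URL = "https://example.org/path/to/remote.nwb"  # placeholder

class StreamingSuite:
    timeout = 600  # network reads can exceed ASV's 60 s default

    # ASV times any method whose name starts with time_
    def time_initial_load_fsspec(self):
        with fsspec.filesystem("http").open(FILE_URL, "rb") as f:
            h5py.File(f, "r").keys()

    def time_initial_load_remfile(self):
        h5py.File(remfile.File(FILE_URL), "r").keys()
```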

> Are there other downsides of ros3?

No automatic retries. That was/is a big pain for the NWB Inspector (it's been on my to-do list to swap in remfile over there).

Also, packaging is painful due to the reliance on conda-forge, which usually lags behind PyPI releases.

magland (Contributor, Author) commented Nov 28, 2023

> I highly recommend setting up Airspeed Velocity (ASV) for this. I was going to do it once the cloud supplement kicks in, but feel free to get ahead of me on that.

Thanks @CodyCBakerPhD, I'll take a look. Although I don't think this would solve the unreliability of network speed tests, would it? Like I said, I'd like to somehow measure the data downloaded rather than the download time; I believe that should be consistent across different settings.

> Are there other downsides of ros3?

> No automatic retries. That was/is a big pain for the NWB Inspector (it's been on my to-do list to swap in remfile over there).

I think this should be mentioned in the docs to help users decide.

oruebel (Contributor) commented Nov 28, 2023

> Bottom line: I don't think my benchmark tests are reliable at this point. Probably the best approach is to somehow measure the total data downloaded rather than the timing, but I'm not sure how to do that (I haven't done any research yet).

> weirdly, fsspec usually performed comparably to remfile on the GitHub Actions server, whereas on my laptop I always see around an order of magnitude difference

I assume variability in latency and network speed are the main factors. Presumably GitHub Actions has a high-bandwidth network, which may smooth out some of the factors that drive performance issues on home networks/Wi-Fi. I think it would be useful to quantify the variability itself, which presumably increases "the further away" (in terms of network transfer rate and latency) we are from the source data.

I think the "total data downloaded" may not be sufficient. With latency as a likely driver of performance, the number and sizes of the I/O requests are probably more informative; i.e., you could have roughly the same amount of data transferred via many more network requests.
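
One way to capture both, as a sketch (not part of any library): wrap the file-like object and tally the read() calls before handing it to h5py. Note that read() calls at this layer only approximate actual network requests, since fsspec and remfile do their own chunking and caching underneath.

```python
class CountingReader:
    """Wrap a file-like object and tally read() calls and bytes returned."""

    def __init__(self, fileobj):
        self._f = fileobj
        self.n_reads = 0
        self.bytes_read = 0
        self.read_sizes = []

    def read(self, size=-1):
        data = self._f.read(size)
        self.n_reads += 1
        self.bytes_read += len(data)
        self.read_sizes.append(len(data))
        return data

    def seek(self, offset, whence=0):
        return self._f.seek(offset, whence)

    def tell(self):
        return self._f.tell()

# Hypothetical usage, with fsspec_file as in the test script above:
# counted = CountingReader(fsspec_file)
# h5py.File(counted, "r").keys()  # trigger some reads
# print(counted.n_reads, counted.bytes_read, counted.read_sizes[:10])
```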

magland (Contributor, Author) commented Nov 28, 2023

> I think the "total data downloaded" may not be sufficient. With latency as a likely driver of performance, the number and sizes of the I/O requests are probably more informative; i.e., you could have roughly the same amount of data transferred via many more network requests.

Good point!

CodyCBakerPhD (Collaborator) commented

> Although I don't think this would solve the unreliability of network speed tests, would it?

That's all in how you set up and configure the particular benchmark tests (just as different pytests can have different setup conditions). My point is that it gives a standard platform for others (and remote instances of other configurations, such as an AWS instance in a region on the other side of the world) to run the same benchmarks in the same way, but on their own architecture, so you get a source of data that can be used to model that variability.

oruebel (Contributor) commented Nov 28, 2023

It would be great to have benchmarks run over time, but I think it may be best to do this in a separate place. Maybe https://github.com/NeurodataWithoutBorders/nwb-project-analytics would be a good place to create an ASV setup for NWB repos?

rly (Contributor) commented Jan 13, 2024

Fixed in #1761. You can see it in the dev branch of the pynwb readthedocs: https://pynwb.readthedocs.io/en/dev/tutorials/advanced_io/streaming.html
