Rich visualization of large datasets #24

Open
kushalkolar opened this issue Mar 19, 2023 · 6 comments

@kushalkolar

Hi, I just came across this repo. I haven't looked into DANDI in detail, but it seems like there's a nice API that can provide lazy loading and random access to files (assuming those file types support lazy loading). Correct me if I'm wrong?

Anyway, I'm writing a new library for very fast visualization in Jupyter notebooks; it can leverage Vulkan/WGPU through an expressive API. I'm curious to see how it would perform with DANDI.

https://github.com/kushalkolar/fastplotlib
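
To give a sense of the API, here is a minimal sketch of typical fastplotlib usage (the exact names, e.g. ImageWidget and its arguments, have shifted between versions, so treat this as illustrative):

import numpy as np
import fastplotlib as fpl

# random movie standing in for a calcium-imaging recording
movie = np.random.rand(100, 512, 512).astype(np.float32)

# ImageWidget renders the stack with a slider over the first (time) dimension
iw = fpl.ImageWidget(data=movie, cmap="gray")
iw.show()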

@kushalkolar
Author

Also, this repo isn't very active, so I'm wondering if the project is still actively supported.

@satra
Member

satra commented Mar 19, 2023

thanks for your interest and the pointer to the library.

this repo is actively supported, but it gets updated only as relevant to dandisets and as contributors have an interest in contributing to it.

indeed dandi supports remote reading in various formats, and the kind of example you describe should exist in several dandisets. i would suggest focusing on the microns and ibl datasets as they both have a very rich set of recordings.
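
for example, listing a dandiset's assets with the dandi client looks roughly like this (a sketch; the dandiset id here is just illustrative):

from dandi.dandiapi import DandiAPIClient

# browse the files in a dandiset without downloading anything
with DandiAPIClient() as client:
    dandiset = client.get_dandiset("000168", "draft")
    for asset in dandiset.get_assets():
        print(asset.path, asset.size)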

does the library work in the absence of a gpu? for example, would it work in a google colab notebook with or without a gpu?

@kushalkolar
Author

kushalkolar commented Oct 11, 2023

Hi, I finally got around to trying this, but it's very slow; the network seems to be the bottleneck. I'm getting 5 MB/s (megabytes, not megabits) over wifi, so I'll also try a wired connection.

import numpy as np

from dandi.dandiapi import DandiAPIClient
from fsspec import filesystem
from h5py import File

dandiset_id = "000168"
file_path = "jGCaMP8f/jGCaMP8f_ANM471993_cell01.nwb"

# Get the S3 location of the file on DANDI
with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, "draft").get_asset_by_path(file_path)
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)

# Create a virtual filesystem over the http protocol and open the remote HDF5 file
fs = filesystem("http")
file_system = fs.open(s3_url, "rb")
file = File(file_system, mode="r")

# lazy h5py Dataset; nothing is downloaded until it is indexed
data = file["acquisition"]["Registered movie 0"]["data"]

fetching single frames:

%%timeit
ix = np.random.randint(0, data.shape[0])
frame = data[ix]

30.2 s ± 1.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT: My local network is definitely not the bottleneck. It could be a bandwidth limitation on the hosting side, or overhead in how the file is accessed: the dandi client, the virtual filesystem layer, or NWB/HDF5 (I doubt it's NWB/HDF5). Are there any datasets hosted somewhere known to have very high bandwidth that could help rule this out?

EDIT 2: For this particular dataset, fetching single frames is also quite slow even when the data are on local disk. That is separate from the network issue and is possibly due to the chunk size or compression of this particular dataset; the network is still the main bottleneck:

308 ms ± 4.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You'd need a seek time of ~30 ms or less for useful random access. We've done this many times before with files on remote filesystems, but those were usually memmaps or zarr, not NWB files.
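
For what it's worth, the chunk layout and compression are easy to check with plain h5py once the file is on disk (a sketch; the local filename is assumed):

from h5py import File

# inspect the HDF5 storage layout of the movie dataset
with File("jGCaMP8f_ANM471993_cell01.nwb", mode="r") as f:
    dset = f["acquisition"]["Registered movie 0"]["data"]
    print("shape:      ", dset.shape)
    print("chunks:     ", dset.chunks)       # None means contiguous storage
    print("compression:", dset.compression)  # e.g. "gzip"

If each chunk spans many frames, reading a single frame forces fetching and decompressing every chunk it touches, which would explain slow single-frame reads even locally.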

@kushalkolar
Author

Is there any way to know what bandwidth is available for a given dandiset?
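
For instance, one rough way to measure it is to time a raw byte-range read straight off the asset URL, outside any HDF5 machinery (a sketch, reusing the s3_url resolved above; 8 MiB is an arbitrary probe size):

import time
import requests

n_bytes = 8 * 1024 * 1024
t0 = time.perf_counter()
r = requests.get(s3_url, headers={"Range": f"bytes=0-{n_bytes - 1}"})
elapsed = time.perf_counter() - t0
print(f"{len(r.content) / elapsed / 1e6:.1f} MB/s")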

@kushalkolar
Author

Regarding colab compatibility, we are trying to implement that here: vispy/jupyter_rfb#77

@satra
Member

satra commented Oct 12, 2023

@kushalkolar - a few things. we don't control the bandwidth, aws does, so it fluctuates with demand at any given point in time. there are indeed times when it can be slow.

you may also want to look into https://github.com/magland/remfile, at least for nwb files, and talk to @magland, who is building new viz tools as well.
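
remfile usage is roughly the following (a sketch based on its readme, reusing the s3_url from earlier in the thread):

import h5py
import remfile

# remfile serves byte ranges of the remote file to h5py on demand
rem = remfile.File(s3_url)
file = h5py.File(rem, mode="r")
data = file["acquisition"]["Registered movie 0"]["data"]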
