Rich visualization of large datasets #24

Open
kushalkolar opened this issue Mar 19, 2023 · 6 comments

@kushalkolar

Hi, I just came across this repo. I haven't looked into DANDI in detail, but it seems like there's a nice API that can provide lazy loading and random access to files (assuming those file types support lazy loading). Correct me if I'm wrong?

Anyway, I'm writing a new library for very fast visualization in Jupyter notebooks; it can leverage Vulkan/WGPU through an expressive API. I'm curious to see how it would perform with DANDI.

https://github.com/kushalkolar/fastplotlib
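
To give a sense of the API, here is a minimal sketch of typical fastplotlib usage (the exact names, e.g. ImageWidget and its arguments, have shifted between versions, so treat this as illustrative):

import numpy as np
import fastplotlib as fpl

# random movie standing in for a calcium-imaging recording
movie = np.random.rand(100, 512, 512).astype(np.float32)

# ImageWidget renders the stack with a slider over the first (time) dimension
iw = fpl.ImageWidget(data=movie, cmap="gray")
iw.show()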

@kushalkolar
Author

Also, this repo isn't very active, so I'm wondering if the project is still actively supported.

@satra
Member

satra commented Mar 19, 2023

thanks for your interest and the pointer to the library.

this repo is actively supported, but it gets updated only as relevant to dandisets and as contributors have an interest in contributing to it.

indeed dandi supports remote reading in various formats, and the kind of example you describe should exist in several dandisets. i would suggest focusing on the microns and ibl datasets as they both have a very rich set of recordings.
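
for example, listing a dandiset's assets with the dandi client looks roughly like this (a sketch; the dandiset id here is just illustrative):

from dandi.dandiapi import DandiAPIClient

# browse the files in a dandiset without downloading anything
with DandiAPIClient() as client:
    dandiset = client.get_dandiset("000168", "draft")
    for asset in dandiset.get_assets():
        print(asset.path, asset.size)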

does the library work in the absence of a gpu? for example, would it work in a google colab notebook with or without a gpu?

@kushalkolar
Author

kushalkolar commented Oct 11, 2023

Hi, I finally got around to trying this, but it's very slow; the network seems to be the bottleneck. I'm getting 5 MB/s (megabytes, not megabits) over wifi, so I'll also try a wired connection.

import numpy as np

from dandi.dandiapi import DandiAPIClient
from fsspec import filesystem
from h5py import File

dandiset_id = "000168"
file_path = "jGCaMP8f/jGCaMP8f_ANM471993_cell01.nwb"

# Get the S3 location of the file on DANDI
with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, "draft").get_asset_by_path(file_path)
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)

# Create a virtual filesystem over the http protocol and open the remote HDF5 file
fs = filesystem("http")
file_system = fs.open(s3_url, "rb")
file = File(file_system, mode="r")

# lazy h5py Dataset; nothing is downloaded until it is indexed
data = file["acquisition"]["Registered movie 0"]["data"]

fetching single frames:

%%timeit
ix = np.random.randint(0, data.shape[0])
frame = data[ix]

30.2 s ± 1.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT: My local network is definitely not the bottleneck. It could be a bandwidth limitation on the hosting side, or overhead in how the file is accessed: the dandi client, the virtual filesystem layer, or NWB/HDF5 (I doubt it's NWB/HDF5). Are there any datasets hosted somewhere known to have very high bandwidth that could help rule this out?

EDIT 2: For this particular dataset, fetching single frames is also quite slow even when the data are on local disk. That is separate from the network issue and is possibly due to the chunk size or compression of this particular dataset; the network is still the main bottleneck:

308 ms ± 4.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You'd need a seek time of ~30 ms or less for useful random access. We've done this many times before with files on remote filesystems, but those were usually memmaps or zarr, not NWB files.
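
For what it's worth, the chunk layout and compression are easy to check with plain h5py once the file is on disk (a sketch; the local filename is assumed):

from h5py import File

# inspect the HDF5 storage layout of the movie dataset
with File("jGCaMP8f_ANM471993_cell01.nwb", mode="r") as f:
    dset = f["acquisition"]["Registered movie 0"]["data"]
    print("shape:      ", dset.shape)
    print("chunks:     ", dset.chunks)       # None means contiguous storage
    print("compression:", dset.compression)  # e.g. "gzip"

If each chunk spans many frames, reading a single frame forces fetching and decompressing every chunk it touches, which would explain slow single-frame reads even locally.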

@kushalkolar
Author

Is there any way to know what bandwidth is available for a given dandiset?
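
For instance, one rough way to measure it is to time a raw byte-range read straight off the asset URL, outside any HDF5 machinery (a sketch, reusing the s3_url resolved above; 8 MiB is an arbitrary probe size):

import time
import requests

n_bytes = 8 * 1024 * 1024
t0 = time.perf_counter()
r = requests.get(s3_url, headers={"Range": f"bytes=0-{n_bytes - 1}"})
elapsed = time.perf_counter() - t0
print(f"{len(r.content) / elapsed / 1e6:.1f} MB/s")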

@kushalkolar
Author

Regarding colab compatibility, we are trying to implement that here: vispy/jupyter_rfb#77

@satra
Member

satra commented Oct 12, 2023

@kushalkolar - a few things. we don't control the bandwidth, aws does, so it fluctuates with demand at any given point in time. there are indeed times when it can be slow.

you may also want to look into https://github.com/magland/remfile, at least for nwb files, and talk to @magland, who is building new viz tools as well.
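
remfile usage is roughly the following (a sketch based on its readme, reusing the s3_url from earlier in the thread):

import h5py
import remfile

# remfile serves byte ranges of the remote file to h5py on demand
rem = remfile.File(s3_url)
file = h5py.File(rem, mode="r")
data = file["acquisition"]["Registered movie 0"]["data"]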
