Update readme
magland committed Aug 5, 2024
1 parent 46f7d5f commit 52a3672
Showing 4 changed files with 147 additions and 85 deletions.
167 changes: 82 additions & 85 deletions README.md

:warning: Please note, LINDI is currently under development and should not yet be used in practice.

LINDI is a cloud-friendly file format and Python library for working with scientific data, especially Neurodata Without Borders (NWB) datasets. It is an alternative to HDF5 and Zarr, but is compatible with both, with features that make it particularly well-suited for linking to remote datasets in the cloud such as those stored on the [DANDI Archive](https://www.dandiarchive.org/).

**What is a LINDI file?**

You can think of a LINDI file as a differently-formatted HDF5 file that is cloud-friendly and capable of linking to data chunks in remote files (such as on the DANDI Archive).

There are two types of LINDI files: the JSON/text format (.lindi.json) and the binary format (.lindi or .lindi.tar). In the JSON format, the hierarchical group structure, attributes, and small datasets are all stored in a JSON structure, with references to larger data chunks stored in external files. The binary format is a .tar file that contains this JSON file as well as optional internal data chunks that can be referenced by the JSON file in addition to the external chunks. The advantage of the JSON LINDI format is that it is human-readable and easily inspected and edited. The advantage of the binary LINDI format is that it can contain internal data chunks. Both formats are cloud-friendly in that they can be efficiently downloaded from cloud storage with random access.

**What are the main use cases?**

One use case is to represent an NWB file on DANDI using a condensed JSON file so that the entire group structure can be downloaded in a single request. Neurosift uses pre-generated LINDI JSON files to efficiently load NWB files from DANDI.

Another use case is to create amended NWB files that add additional data objects to existing NWB files without redundantly storing the entire NWB file. This is done by creating a binary LINDI file that references the original NWB file and adds additional data objects that are stored as internal data chunks.

**Why not use Zarr?**

Zarr provides a cloud-friendly alternative to HDF5, but an important limitation is that Zarr archives often contain thousands of individual files, making them cumbersome to manage. LINDI files are more like HDF5 in that they favor the single-file approach, but they are just as cloud-friendly as Zarr. A second limitation is that Zarr currently has no mechanism for referencing chunks in external datasets.

**Why not use HDF5?**

HDF5 is not cloud-friendly in that, with a remote HDF5 file, many small requests are required to obtain metadata before larger data chunks can be downloaded. Both JSON and binary LINDI files solve this problem by storing the entire group structure in a single JSON structure that can be downloaded in a single request. Furthermore, as with Zarr, HDF5 has no built-in mechanism for referencing chunks in external datasets.

**Does LINDI use Zarr?**

Yes, LINDI uses the Zarr format to store data, including attributes and group hierarchies. But instead of using directories and files, it stores all of the data in a single JSON structure, with references to large data chunks, which can either be found in remote files (e.g., in an HDF5 NWB file on DANDI) or in internal data chunks of a binary LINDI file. However, NWB depends on certain HDF5 features that are not supported by Zarr, so LINDI also provides mechanisms for representing these features in Zarr.

## Installation

```
cd lindi
pip install -e .
```

## Usage

**Creating and reading a LINDI file**

The simplest way to start is to use it like HDF5.

```python
import lindi

# Create a new lindi.json file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(10,), dtype='f')
    ds[...] = 12

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
```

You can inspect the example.lindi.json file to get an idea of how the data are stored. If you are familiar with the internal Zarr format, you will recognize the .zgroup and .zarray keys and the layout of the chunks.

Because the above dataset is very small, it can all fit reasonably inside the JSON file. For storing larger arrays (the usual case) it is better to use the binary format. Just leave off the .json extension.

```python
import numpy as np
import lindi

# Create a new lindi binary file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(1000, 1000), dtype='f')
    ds[...] = np.random.rand(1000, 1000)

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
```

**Loading a remote NWB file from DANDI**

```python
import json
import pynwb
import lindi

# Define the URL for a remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Load as LINDI and view using pynwb
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url)
with pynwb.NWBHDF5IO(file=f, mode="r") as io:
    nwbfile = io.read()
    print('NWB via LINDI')
    print(nwbfile)

    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore

# Save as LINDI JSON
f.write_lindi_file('example.nwb.lindi.json')

# Later, read directly from the LINDI JSON file
g = lindi.LindiH5pyFile.from_lindi_file('example.nwb.lindi.json')
with pynwb.NWBHDF5IO(file=g, mode="r") as io:
    nwbfile = io.read()
    print('')
    print('NWB from LINDI JSON:')
    print(nwbfile)

    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore
```

## Amending an NWB file

Basically, you save the remote NWB file as a local binary LINDI file and then add additional data objects to it.

TODO: finish this section

## Notes

This project was inspired by [kerchunk](https://github.com/fsspec/kerchunk) and [hdmf-zarr](https://hdmf-zarr.readthedocs.io/en/latest/index.html) and depends on [zarr](https://zarr.readthedocs.io/en/stable/), [h5py](https://www.h5py.org/) and [numcodecs](https://numcodecs.readthedocs.io/en/stable/).

## For developers

14 changes: 14 additions & 0 deletions examples/example_a.py
import lindi

# Create a new lindi.json file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(10,), dtype='f')
    ds[...] = 12

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
15 changes: 15 additions & 0 deletions examples/example_b.py
import numpy as np
import lindi

# Create a new lindi binary file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(1000, 1000), dtype='f')
    ds[...] = np.random.rand(1000, 1000)

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
36 changes: 36 additions & 0 deletions examples/example_c.py
import json
import pynwb
import lindi

# Define the URL for a remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Load as LINDI and view using pynwb
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url)
with pynwb.NWBHDF5IO(file=f, mode="r") as io:
    nwbfile = io.read()
    print('NWB via LINDI')
    print(nwbfile)

    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore

# Save as LINDI JSON
f.write_lindi_file('example.nwb.lindi.json')

# Later, read directly from the LINDI JSON file
g = lindi.LindiH5pyFile.from_lindi_file('example.nwb.lindi.json')
with pynwb.NWBHDF5IO(file=g, mode="r") as io:
    nwbfile = io.read()
    print('')
    print('NWB from LINDI JSON:')
    print(nwbfile)

    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore
