
[Bug]: reading unit table arrays via string keys much slower than reading as properties #141

Open · 3 tasks done · dyf opened this issue Nov 16, 2023 · 3 comments
Labels: category: enhancement (improvements of code or code behavior), priority: low (alternative solution already working and/or relevant to only specific user(s))
Milestone: 1.1.0

dyf commented Nov 16, 2023

What happened?

I am trying to read information from the units table in this file: s3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr.

I notice a large difference in read times when accessing units.spike_times vs units['spike_times']:

[screenshot: timing output showing that reading units['spike_times'][:] takes much longer than units.spike_times[:]]

This is surprising behavior.

Steps to Reproduce

from hdmf_zarr import NWBZarrIO
import time

nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'

io = NWBZarrIO(nwb_file, "r")
nwbfile_read = io.read()

# Time attribute-style access (returns the flat VectorData values).
t0 = time.time()
spike_times = nwbfile_read.units.spike_times[:]
print(time.time() - t0)

# Time dict-style access (returns the ragged, per-unit arrays).
t0 = time.time()
spike_times = nwbfile_read.units['spike_times'][:]
print(time.time() - t0)

Traceback

none

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

I am using @alejoe91's branch for reading from S3:

https://github.com/alejoe91/hdmf-zarr/tree/fix-linking-in-remote-zarr

Code of Conduct

oruebel (Contributor) commented Nov 17, 2023

Thanks for the helpful issue. I had to modify your script slightly to add anonymous access (io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True))), but I am able to run it now. For reference, the script I'm using for profiling is included below.

These calls are actually taking different code paths. If you print the types of the objects that are being sliced into:

print(type(nwbfile_read.units.spike_times))
print(type(nwbfile_read.units['spike_times']))

you should see:

<class 'hdmf.common.table.VectorData'>
<class 'hdmf.common.table.VectorIndex'> 

I.e., units.spike_times and units['spike_times'] return different objects. This is because the "spike_times" column is indexed, i.e., the column is a ragged array represented by two datasets: 1) spike_times, which holds the actual data values, and 2) spike_times_index, which segments those values to create the ragged array (i.e., a list of lists). Both of these are exposed as properties on the Units table, i.e., there is a property units.spike_times and a property units.spike_times_index. These properties exist to allow "low-level" access to the underlying data structure.
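To make the two-dataset layout concrete, here is a minimal, self-contained sketch using hdmf directly (the values and offsets are made up for illustration; this is not tied to the file above):

from hdmf.common.table import VectorData, VectorIndex
import numpy as np

# The flat values: all spike times for all units, concatenated.
spike_times = VectorData(
    name='spike_times',
    description='spike times for all units, concatenated',
    data=np.array([0.1, 0.2, 0.3, 0.5, 0.9]),
)

# The index: cumulative end-offsets into the flat data, one per unit.
# Here unit 0 owns values [0:2] and unit 1 owns values [2:5].
spike_times_index = VectorIndex(
    name='spike_times_index',
    data=[2, 5],
    target=spike_times,
)

print(spike_times[:])        # flat values: [0.1, 0.2, 0.3, 0.5, 0.9]
print(spike_times_index[:])  # ragged: roughly [[0.1, 0.2], [0.3, 0.5, 0.9]]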

Selection via slicing (e.g., units['spike_times']), on the other hand, is meant to insulate the user from these low-level details and to help with accessing elements in the table. I.e., when using the slicing syntax units['spike_times'], the class knows that this is a ragged column and uses the index column instead; the equivalent of units['spike_times'] is units.spike_times_index.

As a result, the output of these operations is also different. When slicing into units.spike_times[:] (a VectorData holding just the values), you get a single 1D numpy array of all the spike times of all the units. In contrast, when slicing into units['spike_times'] (or units.spike_times_index, which is a VectorIndex), you get a list of numpy arrays, where each array in the list has the spike times for a single unit.
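In terms of the reproduction script above, the two results would look roughly like this (the shapes shown are illustrative, not taken from the file):

flat = nwbfile_read.units.spike_times[:]       # one 1D numpy array, all units concatenated
ragged = nwbfile_read.units['spike_times'][:]  # list of 1D numpy arrays, one per unit

print(type(flat), flat.shape)     # e.g., <class 'numpy.ndarray'> (n_total_spikes,)
print(type(ragged), len(ragged))  # e.g., <class 'list'> n_units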

I'll need to do some more in-depth profiling, but my guess is that the difference in timing between these operations is likely due to:

  • the need to read more data, because we have to read both the spike_times and the spike_times_index datasets
  • the need for many smaller read operations (rather than a single large read operation); I assume the code that builds the per-unit arrays likely reads the array for each unit separately

The latter part may be something that could be improved by reading all the data into memory first and then segmenting the array into its parts, rather than reading the array for each unit separately; a sketch of that idea follows.
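A hypothetical sketch of that optimization, segmenting in memory using the end-offsets from the index column (this is not how hdmf currently does it; variable names follow the reproduction script above):

import numpy as np

units = nwbfile_read.units
flat = units.spike_times.data[:]                       # one bulk read of the flat values
offsets = np.asarray(units.spike_times_index.data[:])  # cumulative end-offsets, one per unit
# np.split expects the interior split points, so drop the final offset.
per_unit = np.split(flat, offsets[:-1])
# per_unit[i] now holds the spike times for unit i, with no per-unit remote reads.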

Profiling script

from hdmf_zarr import NWBZarrIO
from line_profiler import profile

@profile
def main():
    nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'
    io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True))
    nwbfile_read = io.read()
    spike_times = nwbfile_read.units.spike_times[:]
    spike_times = nwbfile_read.units['spike_times'][:]

if __name__ == '__main__':
    main()
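(To collect per-line timings for the decorated function, one option is to run the script under kernprof, e.g. kernprof -lv profile_script.py; the script filename here is assumed.)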

dyf (Author) commented Nov 22, 2023

Thanks @oruebel. This makes sense, but it's pretty surprising behavior given the shape of the API. If there are opportunities to improve and standardize performance, that would be great.

oruebel (Contributor) commented Jan 27, 2024

@dyf Just FYI, I've created a separate issue for this here: NeurodataWithoutBorders/nwb_benchmarks#13. We are working on setting up an nwb_benchmarks suite for performance tests, with an initial focus on cloud read. This is still early stage, but I wanted to let you know that we are working on evaluating performance.

@mavaylon1 added the category: enhancement (improvements of code or code behavior) and priority: low (alternative solution already working and/or relevant to only specific user(s)) labels on Apr 16, 2024
@mavaylon1 added this to the 0.8.0 milestone on Apr 16, 2024
@rly modified the milestones: 1.0.0 → 1.1.0 on Nov 27, 2024