
[Bug]: reading unit table arrays via string keys much slower than reading as properties #141

Open · 3 tasks done · dyf opened this issue Nov 16, 2023 · 3 comments
Labels: category: enhancement (improvements of code or code behavior), priority: low (alternative solution already working and/or relevant to only specific user(s))
Milestone: 1.1.0

dyf commented Nov 16, 2023

What happened?

I am trying to read information from the units table in this file: s3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr.

I notice a large difference in read times when accessing units.spike_times vs units['spike_times']:

[screenshot: timing output showing that reading units['spike_times'][:] takes much longer than units.spike_times[:]]

This is surprising behavior.

Steps to Reproduce

from hdmf_zarr import NWBZarrIO
import time

nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'

io = NWBZarrIO(nwb_file, "r")
nwbfile_read = io.read()

# Time attribute-style access (returns the flat VectorData values).
t0 = time.time()
spike_times = nwbfile_read.units.spike_times[:]
print(time.time() - t0)

# Time dict-style access (returns the ragged, per-unit arrays).
t0 = time.time()
spike_times = nwbfile_read.units['spike_times'][:]
print(time.time() - t0)

Traceback

none

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

I am using @alejoe91's branch for reading from S3:

https://github.com/alejoe91/hdmf-zarr/tree/fix-linking-in-remote-zarr

Code of Conduct

oruebel (Contributor) commented Nov 17, 2023

Thanks for the helpful issue. I had to modify your script slightly to add anonymous access (io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True))), but I am able to run it now. For reference, the script I'm using for profiling is included below.

These calls are actually taking different code paths. If you print the types of the objects that are being sliced into:

print(type(nwbfile_read.units.spike_times))
print(type(nwbfile_read.units['spike_times']))

you should see:

<class 'hdmf.common.table.VectorData'>
<class 'hdmf.common.table.VectorIndex'> 

I.e., units.spike_times and units['spike_times'] return different objects. This is because the "spike_times" column is indexed, i.e., the column is a ragged array represented by two datasets: 1) spike_times, which holds the actual data values, and 2) spike_times_index, which segments those values to create the ragged array (i.e., a list of lists). Both of these are exposed as properties on the Units table, i.e., there is a property units.spike_times and a property units.spike_times_index. These properties exist to allow "low-level" access to the underlying data structure.
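To make the two-dataset layout concrete, here is a minimal, self-contained sketch using hdmf directly (the values and offsets are made up for illustration; this is not tied to the file above):

from hdmf.common.table import VectorData, VectorIndex
import numpy as np

# The flat values: all spike times for all units, concatenated.
spike_times = VectorData(
    name='spike_times',
    description='spike times for all units, concatenated',
    data=np.array([0.1, 0.2, 0.3, 0.5, 0.9]),
)

# The index: cumulative end-offsets into the flat data, one per unit.
# Here unit 0 owns values [0:2] and unit 1 owns values [2:5].
spike_times_index = VectorIndex(
    name='spike_times_index',
    data=[2, 5],
    target=spike_times,
)

print(spike_times[:])        # flat values: [0.1, 0.2, 0.3, 0.5, 0.9]
print(spike_times_index[:])  # ragged: roughly [[0.1, 0.2], [0.3, 0.5, 0.9]]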

Selection via slicing (e.g., units['spike_times']), on the other hand, is meant to insulate the user from these low-level details and to help with accessing elements in the table. I.e., when using the slicing syntax units['spike_times'], the class knows that this is a ragged column and uses the index column instead; the equivalent of units['spike_times'] is units.spike_times_index.

As a result, the output of these operations is also different. When slicing into units.spike_times[:] (a VectorData holding just the values), you get a single 1D numpy array of all the spike times of all the units. In contrast, when slicing into units['spike_times'] (or units.spike_times_index, which is a VectorIndex), you get a list of numpy arrays, where each array in the list has the spike times for a single unit.
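In terms of the reproduction script above, the two results would look roughly like this (the shapes shown are illustrative, not taken from the file):

flat = nwbfile_read.units.spike_times[:]       # one 1D numpy array, all units concatenated
ragged = nwbfile_read.units['spike_times'][:]  # list of 1D numpy arrays, one per unit

print(type(flat), flat.shape)     # e.g., <class 'numpy.ndarray'> (n_total_spikes,)
print(type(ragged), len(ragged))  # e.g., <class 'list'> n_units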

I'll need to do some more in-depth profiling, but my guess is that the difference in timing between these operations is likely due to:

  • the need to read more data, because we have to read both the spike_times and the spike_times_index datasets
  • the need for many smaller read operations (rather than a single large read operation); I assume the code that builds the per-unit arrays likely reads the array for each unit separately

The latter part may be something that could be improved by reading all the data into memory first and then segmenting the array into its parts, rather than reading the array for each unit separately; a sketch of that idea follows.
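A hypothetical sketch of that optimization, segmenting in memory using the end-offsets from the index column (this is not how hdmf currently does it; variable names follow the reproduction script above):

import numpy as np

units = nwbfile_read.units
flat = units.spike_times.data[:]                       # one bulk read of the flat values
offsets = np.asarray(units.spike_times_index.data[:])  # cumulative end-offsets, one per unit
# np.split expects the interior split points, so drop the final offset.
per_unit = np.split(flat, offsets[:-1])
# per_unit[i] now holds the spike times for unit i, with no per-unit remote reads.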

Profiling script

from hdmf_zarr import NWBZarrIO
from line_profiler import profile

@profile
def main():
    nwb_file = 's3://aind-open-data/ecephys_625098_2022-08-15_08-51-36_nwb_2023-05-16_16-28-23/ecephys_625098_2022-08-15_08-51-36_nwb/ecephys_625098_2022-08-15_08-51-36_experiment1_recording1.nwb.zarr'
    io = NWBZarrIO(nwb_file, "r", storage_options=dict(anon=True))
    nwbfile_read = io.read()
    spike_times = nwbfile_read.units.spike_times[:]
    spike_times = nwbfile_read.units['spike_times'][:]

if __name__ == '__main__':
    main()
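(To collect per-line timings for the decorated function, one option is to run the script under kernprof, e.g. kernprof -lv profile_script.py; the script filename here is assumed.)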

dyf (Author) commented Nov 22, 2023

Thanks @oruebel. This makes sense, but it's pretty surprising behavior given the shape of the API. If there are opportunities to improve and standardize performance, that would be great.

oruebel (Contributor) commented Jan 27, 2024

@dyf Just FYI, I've created a separate issue for this here: NeurodataWithoutBorders/nwb_benchmarks#13. We are working on setting up an nwb_benchmarks suite for performance tests, with an initial focus on cloud read. This is still early stage, but I wanted to let you know that we are working on evaluating performance.

@mavaylon1 added the category: enhancement (improvements of code or code behavior) and priority: low (alternative solution already working and/or relevant to only specific user(s)) labels on Apr 16, 2024
@mavaylon1 added this to the 0.8.0 milestone on Apr 16, 2024
@rly modified the milestones: 1.0.0 → 1.1.0 on Nov 27, 2024