
[Bug]: Unbounded stream to dataset with H5DataIO timestamps fails #1929

Closed
cboulay opened this issue Jul 9, 2024 · 4 comments
Labels: category: bug, priority: high
Assignee: stephprince
Milestone: 2.9.0

cboulay commented Jul 9, 2024

What happened?

I am streaming multiple unbounded data streams to NWB (i.e., recording live data), with one stream->dataset per file; I intend to combine files after recording. Most of the documentation around iterative writing shows how to wrap a DataChunkIterator around a generator, but I find this cumbersome for unbounded data because the generator has to be consumed in another thread that pulls from a queue, while the main thread supplies data to the queue (see the sketch below).
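For concreteness, here is a rough sketch of that multi-threaded pattern. The queue, sentinel, chunk shape (16 x 2 channels), and names are my own illustrative choices, not a pynwb recipe:

import datetime
import queue
import tempfile
import threading
from pathlib import Path

import numpy as np
from hdmf.data_utils import DataChunkIterator
from pynwb import NWBHDF5IO, NWBFile, TimeSeries

q = queue.Queue()
STOP = object()  # sentinel that ends the stream

def chunks_from_queue():
    # Generator that blocks on the queue; each yielded (16, 2) array
    # becomes one time step along the unbounded first axis.
    while True:
        chunk = q.get()
        if chunk is STOP:
            return
        yield chunk

q.put(np.zeros((16, 2)))  # seed: the iterator may read its first chunk eagerly
data_iter = DataChunkIterator(
    data=chunks_from_queue(),
    maxshape=(None, 16, 2),
    dtype=np.dtype("float64"),
)

nwbfile = NWBFile(
    session_description="queue-fed DataChunkIterator demo",
    identifier="demo1234",
    session_start_time=datetime.datetime.now(datetime.timezone.utc),
)
nwbfile.add_acquisition(
    TimeSeries(name="stream", data=data_iter, rate=10.0, unit="n/a")
)

file_path = Path(tempfile.gettempdir()) / "dci_stream_demo.nwb"
with NWBHDF5IO(file_path, "w") as io:
    # write() only returns once the iterator is exhausted, so it runs in a
    # worker thread while the main thread keeps supplying live data.
    writer = threading.Thread(target=io.write, args=(nwbfile,))
    writer.start()
    for i in range(5):
        q.put(np.full((16, 2), float(i)))
    q.put(STOP)
    writer.join()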

The alternative approach (creating H5DataIO objects with an empty shape and a maxshape, then resizing the materialized datasets after io.write(), as in the reproduction below) is more flexible and seems to fit my use case better. However, it does not work for timestamps, although DataChunkIterator seems to.

Unfortunately, I don't fully trust the reported sample rate coming from my data sources, so I would prefer to store explicit timestamps rather than set only a start time and sampling rate.

If using a DataChunkIterator is preferable to using H5DataIO, are there other patterns I should try? Or do I bite the bullet and use the multi-threaded approach?

Steps to Reproduce

import datetime
import tempfile
from pathlib import Path

import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb import NWBHDF5IO, NWBFile, TimeSeries


def test_nwb_stream_timestamps():
    # srate = 10.0
    data_shape = 20, 16, 2
    # Use a float dtype so that fillvalue=np.nan below is representable
    test_data = np.arange(np.prod(data_shape), dtype=np.float64).reshape(data_shape)
    timestamps = np.arange(data_shape[0], dtype=np.float64)

    dataio = H5DataIO(
        shape=(0,) + data_shape[1:],
        dtype=test_data.dtype,
        maxshape=(None,) + data_shape[1:],
        fillvalue=np.nan,
    )
    timestampio = H5DataIO(
        shape=(0,),
        dtype=np.float64,
        maxshape=(None,),
        fillvalue=np.nan,
    )
    # Next line fails
    data_series = TimeSeries(
        name="TODO: name from params",
        data=dataio,
        timestamps=timestampio,
        # rate=srate,  # Comment previous line and uncomment this to fix error
        unit="n/a",
    )

    nwbfile = NWBFile(
        session_description="demonstrate streaming timestamps",
        identifier="abcd1234",
        session_start_time=datetime.datetime.now(datetime.timezone.utc),
    )
    nwbfile.add_acquisition(data_series)

    # Write the data to file
    file_path = Path(tempfile.gettempdir()) / "test_nwb_stream_timestamps.nwb"
    io = NWBHDF5IO(file_path, "w")
    io.write(nwbfile)

    # Append samples by resizing the now-materialized HDF5 datasets
    dataio.dataset.resize(len(dataio.dataset) + len(test_data), axis=0)
    dataio.dataset[-len(test_data):] = test_data
    timestampio.dataset.resize(len(timestampio.dataset) + len(timestamps), axis=0)
    timestampio.dataset[-len(timestamps):] = timestamps

    io.close()

Traceback

>       data_series = TimeSeries(
            name="TODO: name from params",
            data=dataio,
            timestamps=timestampio,
            # rate=srate,
            unit="n/a",  # TODO
        )

test_nwb.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:667: in func_call
    pargs = _check_args(args, kwargs)
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:639: in _check_args
    parsed = __parse_args(
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:351: in __parse_args
    valshape = get_data_shape(argval)
../../../.venv/lib/python3.9/site-packages/hdmf/utils.py:916: in get_data_shape
    if hasattr(data, 'maxshape'):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hdmf.backends.hdf5.h5_utils.H5DataIO object at 0x16db5fa00>
attr = 'maxshape'

    def __getattr__(self, attr):
        """Delegate attribute lookup to data object"""
        if attr == '__array_struct__' and not self.valid:
            # np.array() checks __array__ or __array_struct__ attribute dep. on numpy version
            raise InvalidDataIOError("Cannot convert data to array. Data is not valid.")
        if not self.valid:
>           raise InvalidDataIOError("Cannot get attribute '%s' of data. Data is not valid." % attr)
E           hdmf.data_utils.InvalidDataIOError: Cannot get attribute 'maxshape' of data. Data is not valid.

../../../.venv/lib/python3.9/site-packages/hdmf/data_utils.py:1104: InvalidDataIOError

Operating System

macOS

Python Executable

Python

Python Version

3.9

Package Versions

env.txt


@cboulay cboulay changed the title [Bug]: Unbounded stream to dataset with timestamps [Bug]: Unbounded stream to dataset with H5DataIO timestamps fails Jul 9, 2024
cboulay (Author) commented Jul 9, 2024

I may have jumped to a conclusion about the error. Even if I pass manual timestamps with np.arange(test_data.shape[0]), I still get the same error: it is the data argument that is missing maxshape. The error only surfaces when timestamps are provided because the TimeSeries __init__ call to self._check_time_series_dimension() only looks for maxshape when timestamps is not None.

But I did provide the maxshape argument to my H5DataIO object... so where did it go? It goes into the object's io_settings, but H5DataIO itself has no maxshape attribute or property.

I added the following to H5DataIO in hdmf/backends/hdf5/h5_utils.py

    @property
    def maxshape(self):
        # Surface the maxshape that was passed to the constructor
        return self.io_settings["maxshape"]

And with that, my test code works.

Edit: the examples in #1011 worked because DataChunkIterator has a getter for maxshape. So I'm guessing the "alternative approach" in the docs never worked with manual timestamps because of this maxshape problem.
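
To make the asymmetry concrete, here is a quick check. The variable names are illustrative, and it assumes an hdmf build without the maxshape property added above:

import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from hdmf.data_utils import DataChunkIterator, InvalidDataIOError

dci = DataChunkIterator(maxshape=(None,), dtype=np.dtype("float64"))
print(dci.maxshape)  # (None,) -- exposed through a property

h5io = H5DataIO(shape=(0,), dtype=np.float64, maxshape=(None,))
print(h5io.io_settings["maxshape"])  # (None,) -- stored here instead
try:
    _ = h5io.maxshape  # no property, and the (empty) data is not yet "valid"
except InvalidDataIOError as err:
    print(err)  # Cannot get attribute 'maxshape' of data. Data is not valid.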

stephprince (Contributor) commented

Hi @cboulay, thanks for the detailed code example and bug description!

I think using the "alternative approach" to create the H5DataIO objects for data and timestamps seems reasonable for your use case.

Your fix to add a maxshape property to the H5DataIO object seems like a good way to address this issue. I believe this issue should mostly be on the hdmf side, so thank you for already opening the relevant issue there. Would you be interested in filing a pull request in the hdmf repository to fix this bug?

stephprince added the category: bug and priority: high labels Jul 10, 2024
stephprince self-assigned this Jul 10, 2024
stephprince added this to the 2.8.0 milestone Jul 10, 2024
cboulay (Author) commented Jul 10, 2024

PR made: hdmf-dev/hdmf#1149

stephprince moved this from the 2.8.0 milestone to the 2.9.0 milestone Jul 23, 2024
cboulay (Author) commented Aug 28, 2024

The upstream PR has been merged.

cboulay closed this as completed Aug 28, 2024