Add file create data appending #1163

Open
wants to merge 2 commits into
base: dev
9 changes: 9 additions & 0 deletions src/pynwb/__init__.py
@@ -258,6 +258,15 @@ def export(self, **kwargs):
         kwargs['container'] = nwbfile
         call_docval_func(super().export, kwargs)

+    def read(self, **kwargs):
+
+        nwbfile = super().read(**kwargs)
+
+        if self.mode != 'r':
+            nwbfile._appendModificationEntry()
+
+        return nwbfile
+

 from . import io as __io  # noqa: F401,E402
 from .core import NWBContainer, NWBData  # noqa: F401,E402
13 changes: 11 additions & 2 deletions src/pynwb/file.py
@@ -309,10 +309,12 @@ def __init__(self, **kwargs):
             raise ValueError("'timestamps_reference_time' must be a timezone-aware datetime object.")

         self.fields['file_create_date'] = getargs('file_create_date', kwargs)
+
         if self.fields['file_create_date'] is None:
-            self.fields['file_create_date'] = datetime.now(tzlocal())
-        if isinstance(self.fields['file_create_date'], datetime):
+            self.fields['file_create_date'] = [datetime.now(tzlocal())]
+        elif isinstance(self.fields['file_create_date'], datetime):
             self.fields['file_create_date'] = [self.fields['file_create_date']]
+
         self.fields['file_create_date'] = list(map(_add_missing_timezone, self.fields['file_create_date']))

         fieldnames = [
@@ -749,6 +751,13 @@ def copy(self):

         return NWBFile(**kwargs)

+    def _appendModificationEntry(self):
+        """
+        Append an entry with the current timestamp to the file_create_date array
+        """
+
+        self.fields['file_create_date'].append(datetime.now(tzlocal()))
+

 def _add_missing_timezone(date):
     """
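The constructor change in src/pynwb/file.py normalizes file_create_date to a list in every case, and `_appendModificationEntry` then appends to that list. A minimal standalone sketch of that logic, using only the standard library, might look as follows. The names `normalize_file_create_date` and `append_modification_entry` are hypothetical, and `datetime.astimezone()` is an assumed stand-in for both dateutil's `tzlocal()` and the `_add_missing_timezone` helper used in the PR:

```python
from datetime import datetime, timezone


def normalize_file_create_date(value=None):
    """Return value as a list of timezone-aware datetimes (hypothetical helper)."""
    if value is None:
        # No date given: start the history with the current local time.
        value = [datetime.now().astimezone()]
    elif isinstance(value, datetime):
        # A single datetime becomes a one-element list.
        value = [value]
    # Naive datetimes are interpreted as local time and made timezone-aware.
    return [d if d.tzinfo is not None else d.astimezone() for d in value]


def append_modification_entry(dates):
    """Append the current local time, mirroring _appendModificationEntry."""
    dates.append(datetime.now().astimezone())


created = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
dates = normalize_file_create_date(created)
append_modification_entry(dates)
print(len(dates))  # prints 2
```

The first entry stays the original creation date and each later entry records one modification, which is the behaviour the unit test in this PR checks in memory.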
44 changes: 44 additions & 0 deletions tests/unit/test_file.py
@@ -1,4 +1,5 @@
 import numpy as np
+import os
 import pandas as pd

 from datetime import datetime
@@ -516,3 +517,46 @@ def test_reftime_tzaware(self):
                     'TEST124',
                     self.start_time,
                     timestamps_reference_time=self.ref_time_notz)
+
+
+class TestFileCreateDateArray(TestCase):
+
+    def setUp(self):
+        self.path = 'unittest_file_create_date.nwb'
+
+    def tearDown(self):
+        if os.path.exists(self.path):
+            os.remove(self.path)
+
+    def test_simple(self):
+        file_create_date = datetime.now(tzlocal())
+        nwbfile_init = NWBFile(' ', ' ',
+                               datetime.now(tzlocal()),
+                               file_create_date=file_create_date,
+                               institution='Rixdorf University, Berlin')
+
+        self.assertEqual(nwbfile_init.file_create_date, [file_create_date])
+        self.assertEqual(len(nwbfile_init.file_create_date), 1)
+
+        with NWBHDF5IO(self.path, 'w') as io:
+            io.write(nwbfile_init)
+
+        with NWBHDF5IO(self.path, 'r') as reader:
+            nwbfile = reader.read()
+
+        # no change as it was opened read-only
+        self.assertEqual(len(nwbfile.file_create_date), 1)
+
+        with NWBHDF5IO(self.path, 'r+') as writer:
+            nwbfile = writer.read()
+
+            # added one more entry as opened read/write
+            self.assertEqual(len(nwbfile.file_create_date), 2)
Contributor

Please also test the second round-trip, i.e., close the file and re-open it in read-mode and confirm that the change to file_create_date is still present. I am concerned that the file_create_date dataset is not chunked and therefore cannot grow, or the change is not saved for some reason.

Collaborator Author

@rly I've pushed something but I need to review that again tomorrow.

Collaborator Author

@rly You were right. The additional entry does not reach the file.

h5dump -A unittest_file_create_date.nwb | grep -A 10 file_create_date

HDF5 "unittest_file_create_date.nwb" {
GROUP "/" {
   ATTRIBUTE ".specloc" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_OBJECT }
      DATASPACE  SCALAR
      DATA {
      (0): GROUP 6512 /specifications 
      }
   }
   ATTRIBUTE "namespace" {
      DATATYPE  H5T_STRING {
--
   DATASET "file_create_date" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   GROUP "general" {
      DATASET "institution" {

Questions:

  • How can I fix that?
  • How can I require a newer hdmf version so that the tests pass?
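
The `DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }` line in the dump above lists the current dimensions followed by the maximum dimensions. Both are 1, so the dataset cannot grow; h5dump prints `H5S_UNLIMITED` as the maximum for a dimension that can. A small sketch of a check that could be used while debugging is given here. The names `parse_dataspace` and `can_grow` are hypothetical and not part of pynwb or hdmf:

```python
import re


def parse_dataspace(line):
    """Parse an h5dump 'DATASPACE SIMPLE { ( cur ) / ( max ) }' line.

    Returns (current_dims, max_dims); an unlimited dimension is None.
    """
    match = re.search(
        r"SIMPLE\s*\{\s*\(([^)]*)\)\s*/\s*\(([^)]*)\)\s*\}", line
    )
    if match is None:
        raise ValueError(f"not a simple dataspace line: {line!r}")

    def dims(text):
        return tuple(
            None if token.strip() == "H5S_UNLIMITED" else int(token)
            for token in text.split(",")
        )

    return dims(match.group(1)), dims(match.group(2))


def can_grow(line):
    """True if any dimension has a maximum larger than its current size."""
    current, maximum = parse_dataspace(line)
    return any(m is None or m > c for c, m in zip(current, maximum))


print(can_grow("DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }"))              # prints False
print(can_grow("DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }"))  # prints True
```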

Contributor

To fix that, the dataset has to be chunked. @ajtritt -- is there a way to chunk only the NWBFile.file_create_date dataset? I am also in favor of blanket chunking all datasets in NWB...

To use changes in a newer hdmf version, the changes must have been released on PyPI. The recent "mode" function addition isn't released yet, but we could do that this week if these issues are pressing.

Collaborator Author

A new hdmf would be nice!

Collaborator Author

How do I force the stored dataset to be chunked?

I tried

diff --git a/src/pynwb/io/file.py b/src/pynwb/io/file.py
index 1ddeb310..2ec342d9 100644
--- a/src/pynwb/io/file.py
+++ b/src/pynwb/io/file.py
@@ -3,6 +3,7 @@ from hdmf.build import ObjectMapper
 from .. import register_map
 from ..file import NWBFile, Subject
 from ..core import ScratchData
+from hdmf.backends.hdf5.h5_utils import H5DataIO


 @register_map(NWBFile)
@@ -156,6 +157,10 @@ class NWBFileMap(ObjectMapper):
         dates = list(map(dateutil_parse, datestr))
         return dates

+    @ObjectMapper.object_attr('file_create_date')
+    def file_create_date_obj_attr(self, container, manager):
+        return H5DataIO(container.file_create_date, chunks=True)
+
     @ObjectMapper.constructor_arg('file_name')
     def name(self, builder, manager):
         return builder.name

but that does not work.

Contributor

Not sure if I have the right solution for you, but a couple of thoughts:

  1. I think it is important to expose this behavior explicitly to the user. While doing this implicitly behind the scenes is convenient, it makes the process opaque.
  2. We should try not to mix front-end and backend functionality, i.e., using the HDF5-specific H5DataIO in the ObjectMapper (or Container) is problematic because it will not translate to other backends.
  3. This issue also came up with DynamicTable at some point, because we wanted all columns of the table to be chunked so they can be extended. @rly @ajtritt was that issue solved, and would the same strategy apply here?

Ultimately, I think the core issue is that we want specific datasets to be written in a resizable fashion (so they can grow). In the case of HDF5 that requires chunking, but for other backends that may or may not be the case. In that vein, I think what we may need is a generic (backend-agnostic) way to provide write hints, which in this case would say "make this dataset resizable". I'm wondering whether we could add I/O hints on the builder for this and, in the object mapper, a way to ask for I/O hints for fields. It would then be up to the backend to decide what to do with those I/O hints.
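
As a rough illustration of the write-hint idea, the sketch below separates a backend-neutral hint from its backend-specific translation. Everything in it is hypothetical: `WriteHints`, `hdf5_write_kwargs`, and `always_resizable_write_kwargs` do not exist in hdmf or pynwb. The only assumption about a real library is that h5py's `create_dataset` accepts `chunks=True` and a `maxshape` containing `None` for resizable dimensions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteHints:
    """Backend-neutral hints attached to a dataset (hypothetical)."""
    resizable: bool = False


def hdf5_write_kwargs(hints, ndim=1):
    """Translate hints into keyword arguments for an HDF5 dataset.

    HDF5 needs chunked storage and an unlimited maxshape for a dataset
    to be resizable after creation.
    """
    if not hints.resizable:
        return {}
    return {"chunks": True, "maxshape": (None,) * ndim}


def always_resizable_write_kwargs(hints):
    """A hypothetical backend whose arrays can always grow ignores the hint."""
    return {}


hints = WriteHints(resizable=True)
print(hdf5_write_kwargs(hints))              # prints {'chunks': True, 'maxshape': (None,)}
print(always_resizable_write_kwargs(hints))  # prints {}
```

With this split, the object mapper would only state the intent ("resizable"), and each backend would decide how, or whether, to act on it.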

Collaborator Author

@oruebel I totally agree that an HDF5-specific solution is the wrong thing to do here. But up to now I don't have any solution at all.

Collaborator Author

I'm starting to work on this again.

@oruebel

I think it is important to expose this behavior explicitly to the user. While doing this implicitly behind the scenes is convenient, it makes the process opaque.

Which implicit part are you concerned about? "Making the dataset chunked" or "adding new entries to file_create_date"? The latter is how nwb-schema says file_create_date should be handled.

Ultimately, I think the core issue is that we want specific datasets to be written in a resizable fashion (so they can grow). In the case of HDF5 that requires chunking, but for other backends that may or may not be the case. In that vein, I think what we may need is a generic (backend-agnostic) way to provide write hints, which in this case would say "make this dataset resizable". I'm wondering whether we could add I/O hints on the builder for this and, in the object mapper, a way to ask for I/O hints for fields. It would then be up to the backend to decide what to do with those I/O hints.

Yes, that would be required. Of course my hack above is a hack and cannot be merged as is, but I first wanted to get something working and then make the solution generalizable. I just saw that hdmf.build.DatasetBuilder has a chunks argument as well.

I seem to not understand how the object mappers work. According to https://pynwb.readthedocs.io/en/stable/overview_software_architecture.html?highlight=architecture#objectmapper I would think that

$ git diff .
diff --git a/src/pynwb/io/file.py b/src/pynwb/io/file.py
index 2c629ab7..a7057941 100644
--- a/src/pynwb/io/file.py
+++ b/src/pynwb/io/file.py
@@ -3,7 +3,7 @@ from hdmf.build import ObjectMapper
 from .. import register_map
 from ..file import NWBFile, Subject
 from ..core import ScratchData
-
+from hdmf.build import DatasetBuilder

 @register_map(NWBFile)
 class NWBFileMap(ObjectMapper):
@@ -152,6 +152,10 @@ class NWBFileMap(ObjectMapper):
         date = dateutil_parse(datestr)
         return date

+    @ObjectMapper.object_attr('file_create_date')
+    def file_create_date_obj_attr(self, container, manager):
+        return DatasetBuilder('file_create_date', data=container.file_create_date, chunks=True)
+
     @ObjectMapper.constructor_arg('file_create_date')
     def dateconversion_list(self, builder, manager):
         datestr = builder.get('file_create_date').data

should work, but it doesn't. Any hints?


+            writer.write(nwbfile)
+
+        with NWBHDF5IO(self.path, 'r') as reader:
+            nwbfile = reader.read()
+
+        # reopen again to check that it has still two entries
+        self.assertEqual(len(nwbfile.file_create_date), 2)