Add file create data appending #1163

Open
wants to merge 2 commits into
base: dev

Collaborator

@t-b t-b commented Jan 29, 2020

Close #990.

Requires hdmf-dev/hdmf#280.

@t-b t-b requested a review from rly January 29, 2020 23:35
rly previously approved these changes Jan 30, 2020
nwbfile = writer.read()

# added one more entry as opened read/write
self.assertEqual(len(nwbfile.file_create_date), 2)
Contributor

Please also test the second round-trip, i.e., close the file and re-open it in read-mode and confirm that the change to file_create_date is still present. I am concerned that the file_create_date dataset is not chunked and therefore cannot grow, or the change is not saved for some reason.
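The second round-trip requested above can be sketched with plain h5py as a stand-in for the pynwb test. This is a minimal sketch under assumptions: the file name `roundtrip.h5`, the helper `append_date`, and the timestamps are made up for illustration, and the dataset is created resizable up front, which is not what pynwb currently writes.

```python
import os
import tempfile

import h5py

# Variable-length UTF-8 strings, similar to what h5dump reports for the dataset.
str_dt = h5py.string_dtype(encoding="utf-8")


def append_date(path, stamp):
    """Hypothetical helper: open in append mode, grow by one entry, close."""
    with h5py.File(path, "a") as f:
        ds = f["file_create_date"]
        n = ds.shape[0]
        ds.resize((n + 1,))
        ds[n] = stamp


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.h5")

    # First write: one entry, stored with an unlimited maximum size.
    with h5py.File(path, "w") as f:
        ds = f.create_dataset(
            "file_create_date", shape=(1,), maxshape=(None,), dtype=str_dt
        )
        ds[0] = "2020-01-29T23:35:00+00:00"

    # Re-open read/write and add a second entry.
    append_date(path, "2020-01-30T00:19:00+00:00")

    # Second round-trip: re-open read-only and check what reached the file.
    with h5py.File(path, "r") as f:
        n_dates_on_disk = f["file_create_date"].shape[0]
        last = f["file_create_date"][n_dates_on_disk - 1]
        # h5py may return bytes or str depending on its version.
        last_on_disk = last.decode("utf-8") if isinstance(last, bytes) else last
```

The point of the final read-only open is that the assertion runs against the file contents rather than against the in-memory object that was just modified.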

Collaborator Author

@rly I've pushed something but I need to review that again tomorrow.

Collaborator Author

@rly You were right. The additional entry does not reach the file.

h5dump -A unittest_file_create_date.nwb | grep -A 10 file_create_date

HDF5 "unittest_file_create_date.nwb" {
GROUP "/" {
   ATTRIBUTE ".specloc" {
      DATATYPE  H5T_REFERENCE { H5T_STD_REF_OBJECT }
      DATASPACE  SCALAR
      DATA {
      (0): GROUP 6512 /specifications 
      }
   }
   ATTRIBUTE "namespace" {
      DATATYPE  H5T_STRING {
--
   DATASET "file_create_date" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
   GROUP "general" {
      DATASET "institution" {

Questions:

  • How can I fix that?
  • How can I require a newer hdmf version so that the tests pass?

Contributor

To fix that, the dataset has to be chunked. @ajtritt -- is there a way to chunk only the NWBFile.file_create_date dataset? I am also in favor of blanket chunking all datasets in NWB...

To use changes in a newer hdmf version, the changes must have been released on PyPI. The recent "mode" function addition isn't released yet, but we could do that this week if these issues are pressing.
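To illustrate the chunking requirement at the HDF5 level, here is a minimal h5py sketch. It does not use pynwb or hdmf; the dataset names `fixed` and `growable` and the timestamps are made up. In the h5dump output above, `DATASPACE SIMPLE { ( 1 ) / ( 1 ) }` lists the current size and the maximum size, so that dataset was written with a maximum of one element.

```python
import h5py

# Variable-length UTF-8 strings, similar to the file_create_date datatype.
str_dt = h5py.string_dtype(encoding="utf-8")

# In-memory file, so nothing is written to disk.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # Default layout: contiguous, maximum shape equals the initial shape.
    fixed = f.create_dataset("fixed", shape=(1,), dtype=str_dt)
    fixed[0] = "2020-01-29T23:35:00+00:00"
    fixed_layout = (fixed.chunks, fixed.maxshape)
    try:
        fixed.resize((2,))
        fixed_resizable = True
    except TypeError:
        # h5py refuses to resize datasets that are not chunked.
        fixed_resizable = False

    # An unlimited maxshape makes h5py pick a chunked layout automatically.
    growable = f.create_dataset(
        "growable", shape=(1,), maxshape=(None,), dtype=str_dt
    )
    growable[0] = "2020-01-29T23:35:00+00:00"
    growable.resize((2,))
    growable[1] = "2020-01-30T00:19:00+00:00"
    growable_shape = growable.shape
    growable_is_chunked = growable.chunks is not None
```

Note that in h5py, passing `chunks=True` on its own keeps the maximum shape equal to the initial shape; it is the larger or unlimited `maxshape` that allows the dataset to grow.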

Collaborator Author

A new hdmf would be nice!

Collaborator Author

How do I force the stored dataset to be chunked?

I tried

diff --git a/src/pynwb/io/file.py b/src/pynwb/io/file.py
index 1ddeb310..2ec342d9 100644
--- a/src/pynwb/io/file.py
+++ b/src/pynwb/io/file.py
@@ -3,6 +3,7 @@ from hdmf.build import ObjectMapper
 from .. import register_map
 from ..file import NWBFile, Subject
 from ..core import ScratchData
+from hdmf.backends.hdf5.h5_utils import H5DataIO


 @register_map(NWBFile)
@@ -156,6 +157,10 @@ class NWBFileMap(ObjectMapper):
         dates = list(map(dateutil_parse, datestr))
         return dates

+    @ObjectMapper.object_attr('file_create_date')
+    def file_create_date_obj_attr(self, container, manager):
+        return H5DataIO(container.file_create_date, chunks=True)
+
     @ObjectMapper.constructor_arg('file_name')
     def name(self, builder, manager):
         return builder.name

but that does not work.

Contributor

Not sure if I have the right solution for you, but a couple of thoughts:

  1. I think it is important to expose this behavior explicitly to the user. While doing this implicitly behind the scenes is convenient, it makes the process opaque.
  2. We should try not to mix front-end and backend functionality, i.e., using the HDF5-specific H5DataIO in the ObjectMapper (or Container) is problematic as this will not translate to other backends.
  3. This issue also has come up with DynamicTable at some point, because we wanted all columns of the table to be chunked so they can be extended. @rly @ajtritt was that issue solved and would that same strategy apply here?

Ultimately, I think the core issue is that we want specific datasets to be written in a resizable fashion (so they can grow). In the case of HDF5 that requires chunking, but for other backends that may or may not be the case. In that vein, I think what we may need is a generic (backend-agnostic) way to provide write hints, which in this case would say "make this dataset resizable". I'm wondering whether we could add I/O hints on the builder for this, and in the object mapper a way to ask for I/O hints for fields. It would then be up to the backend to decide what to do with those I/O hints.
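One way to picture the write-hint idea is the sketch below. Everything in it is hypothetical: `WriteHint`, `DatasetSketch`, `hdf5_write_kwargs` and `other_backend_write_kwargs` do not exist in hdmf or pynwb and only illustrate how a backend-agnostic hint could be attached to a dataset description and then interpreted per backend.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class WriteHint(Enum):
    """Hypothetical backend-agnostic write hints (not part of hdmf)."""
    RESIZABLE = auto()


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for a dataset builder that carries I/O hints."""
    name: str
    data: list
    hints: set = field(default_factory=set)


def hdf5_write_kwargs(ds):
    """How an HDF5 backend might translate hints into dataset creation options."""
    kwargs = {"shape": (len(ds.data),)}
    if WriteHint.RESIZABLE in ds.hints:
        # For HDF5, "resizable" means chunked storage with an unlimited maximum.
        kwargs["maxshape"] = (None,)
        kwargs["chunks"] = True
    return kwargs


def other_backend_write_kwargs(ds):
    """A backend whose storage can always grow may simply ignore the hint."""
    return {}


spec = DatasetSketch(
    name="file_create_date",
    data=["2020-01-29T23:35:00+00:00"],
    hints={WriteHint.RESIZABLE},
)
h5_kwargs = hdf5_write_kwargs(spec)
```

The front end only states the intent (the dataset must be able to grow); the decision to use chunking stays inside the HDF5 backend.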

Collaborator Author

@oruebel I totally agree that an HDF5-specific solution is the wrong thing to do here. But so far I don't have any solution at all.

Collaborator Author

I'm starting to work on this again.

@oruebel

I think it is important to expose this behavior explicitly to the user. While doing this implicitly behind the scenes is convenient, it makes the process opaque.

What implicit part are you concerned about? "Making the dataset chunked" or "adding new entries to file_create_date"? The latter is how nwb-schema says file_create_date should be handled.

Ultimately, I think the core issue is that we want specific datasets to be written in a resizable fashion (so they can grow). In the case of HDF5 that requires chunking, but for other backends that may or may not be the case. In that vein, I think what we may need is a generic (backend-agnostic) way to provide write hints, which in this case would say "make this dataset resizable". I'm wondering whether we could add I/O hints on the builder for this, and in the object mapper a way to ask for I/O hints for fields. It would then be up to the backend to decide what to do with those I/O hints.

Yes, that would be required. Of course, my hack above is a hack and cannot be merged as is, but I first wanted to get something working and then generalize the solution. I just saw that hdmf.build.DatasetBuilder has a chunks argument as well.

I seem to not understand how the object mappers work. According to https://pynwb.readthedocs.io/en/stable/overview_software_architecture.html?highlight=architecture#objectmapper I would think that

$ git diff .
diff --git a/src/pynwb/io/file.py b/src/pynwb/io/file.py
index 2c629ab7..a7057941 100644
--- a/src/pynwb/io/file.py
+++ b/src/pynwb/io/file.py
@@ -3,7 +3,7 @@ from hdmf.build import ObjectMapper
 from .. import register_map
 from ..file import NWBFile, Subject
 from ..core import ScratchData
-
+from hdmf.build import DatasetBuilder

 @register_map(NWBFile)
 class NWBFileMap(ObjectMapper):
@@ -152,6 +152,10 @@ class NWBFileMap(ObjectMapper):
         date = dateutil_parse(datestr)
         return date

+    @ObjectMapper.object_attr('file_create_date')
+    def file_create_date_obj_attr(self, container, manager):
+        return DatasetBuilder('file_create_date', data=container.file_create_date, chunks=True)
+
     @ObjectMapper.constructor_arg('file_create_date')
     def dateconversion_list(self, builder, manager):
         datestr = builder.get('file_create_date').data

should work, but it doesn't. Any hints?

@rly rly self-requested a review January 30, 2020 00:00
@t-b t-b force-pushed the add-file-create-data-appending branch from e6459da to 9137f2a Compare January 30, 2020 00:19
@bendichter
Contributor

@rly @t-b what's the status of this?

@t-b
Collaborator Author

t-b commented Nov 3, 2020

We need to find a way to tell pynwb that certain datasets in HDF5 need to be written as chunked by default. Only then are they appendable. I don't know how to do that.

@bendichter
Contributor

@t-b ah, ok. Sounds like a job for H5DataIO

t-b added 2 commits April 13, 2021 13:51
Using an if/elif chain is easier to understand.
… load

According to [1], the file_create_date entry holds

  A record of the date the file was created and of subsequent modifications.

But until now we never added additional entries to file_create_date.

We now do that when the file is not opened read-only.

[1]: https://nwb-schema.readthedocs.io/en/latest/format.html#nwb-n-file
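The behaviour described in this commit message can be sketched as a small pure function. This is an illustration under assumptions, not the code in the commit: the function name `dates_after_open` and the treatment of every mode other than `"r"` as writable are made up for the example.

```python
from datetime import datetime, timezone


def dates_after_open(file_create_date, mode, now=None):
    """Return the file_create_date list to use after opening a file.

    Hypothetical helper: read-only opens leave the list unchanged, any
    other mode appends one timestamp for the modification.
    """
    dates = list(file_create_date)  # copy, so the caller's list is not mutated
    if mode == "r":
        return dates
    if now is None:
        now = datetime.now(timezone.utc)
    dates.append(now)
    return dates


created = datetime(2020, 1, 29, 23, 35, tzinfo=timezone.utc)
modified = datetime(2021, 4, 13, 11, 58, tzinfo=timezone.utc)

read_only = dates_after_open([created], "r", now=modified)
read_write = dates_after_open([created], "a", now=modified)
```

Whether the appended entry actually persists still depends on the dataset being resizable in the backend, as discussed in the review thread.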
@t-b t-b force-pushed the add-file-create-data-appending branch from 3d70398 to 3fb6358 Compare April 13, 2021 11:58
Development

Successfully merging this pull request may close these issues.

Update file_create_date on write
4 participants