"Copy" via "export" is "larger" (10x fold in this silly example) than original! #1187

Closed
yarikoptic opened this issue Sep 5, 2024 · 5 comments
Labels: category: bug (errors in the code or code behavior); priority: medium (non-critical problem and/or affecting only a small set of users)


@yarikoptic
Contributor

Follow up to

If we use the same script as provided in #1186 with a non-broken hdmf 3.14.3, we get

❯ /tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb
Copying /tmp/simple2.nwb /tmp/simple2-copy.nwb
Now reading /tmp/simple2-copy.nwb
/tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb  5.06s user 1.36s system 211% cpu 3.033 total
❯ ls -l /tmp/simple2.nwb /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh 189120 Sep  5 15:24 /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh  19664 Sep  5 15:18 /tmp/simple2.nwb

So you can see that the "copied" file is 189k while the original is just 19k. Is that expected/desired/unavoidable?
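To put a number on the difference, here is a minimal sketch of the arithmetic. The helper name `size_ratio` is made up for illustration; the byte counts are the ones from the `ls -l` listing above.

```python
def size_ratio(original_bytes: int, copy_bytes: int) -> float:
    """Return how many times larger the copy is than the original."""
    return copy_bytes / original_bytes


# Byte counts taken from the `ls -l` listing in this report.
ratio = size_ratio(original_bytes=19664, copy_bytes=189120)
print(f"copy is {ratio:.1f}x the size of the original")
```

For files on disk, the two arguments could come from `os.path.getsize(path)`. With the sizes above the ratio is about 9.6, which is the "10x" in the title.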

output of diff -Naur <(h5dump /tmp/simple2.nwb) <(h5dump /tmp/simple2-copy.nwb): http://www.oneukrainian.com/tmp/simple2-h5dump.diff

Original file is produced using this pytest fixture https://github.com/dandi/dandi-cli/blob/HEAD/dandi/tests/fixtures.py#L101

PS: feel welcome to reassign to pynwb if the issue is there.

@mavaylon1 added the "category: bug" and "priority: medium" labels on Sep 5, 2024
@rly
Contributor

rly commented Sep 5, 2024

@yarikoptic Your dandi pytest fixture writes NWB files without caching the spec. The export call caches the spec by default. I believe that explains all of the diff. If you want to export without caching the spec, you currently cannot do that using pynwb but we are going to remedy that in a quick bugfix to pynwb.
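For readers following along, a rough sketch of what an export without the cached spec might look like once that bugfix is available. This is an assumption, not a confirmed API statement: it presumes a pynwb release in which `NWBHDF5IO.export` forwards a `cache_spec` argument to hdmf, and the helper name `export_copy` is hypothetical.

```python
def export_copy(src_path: str, dst_path: str, *, cache_spec: bool = True) -> None:
    """Export src_path to dst_path, optionally without caching the spec."""
    # Imported lazily so that defining this helper does not require pynwb.
    from pynwb import NWBHDF5IO

    with NWBHDF5IO(src_path, mode="r") as read_io:
        nwbfile = read_io.read()
        with NWBHDF5IO(dst_path, mode="w") as export_io:
            # Assumption: passing cache_spec here only works with a pynwb
            # release that contains the bugfix mentioned in this comment.
            export_io.export(src_io=read_io, nwbfile=nwbfile, cache_spec=cache_spec)
```

Calling `export_copy("/tmp/simple2.nwb", "/tmp/simple2-copy.nwb", cache_spec=False)` would then be expected to produce a copy without the `/specifications` group, so its size should stay close to the original.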

@yarikoptic
Contributor Author

yarikoptic commented Sep 6, 2024

Coolio, thanks @rly for the quick response!
Confirming on the above example that we now get the same size, and only the ids changed, as requested:

❯ /tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb && ls -l /tmp/simple2.nwb /tmp/simple2-copy.nwb
Copying /tmp/simple2.nwb /tmp/simple2-copy.nwb using pywnb 2.5.0.post0.dev15
Now reading /tmp/simple2-copy.nwb
/tmp/simple2.py /tmp/simple2.nwb /tmp/simple2-copy.nwb  3.32s user 2.43s system 229% cpu 2.510 total
-rw-rw-r-- 1 yoh yoh 19664 Sep  6 14:48 /tmp/simple2-copy.nwb
-rw-rw-r-- 1 yoh yoh 19664 Sep  5 15:18 /tmp/simple2.nwb
❯ diff -Naur <(h5dump /tmp/simple2.nwb) <(h5dump /tmp/simple2-copy.nwb)
--- /proc/self/fd/18	2024-09-06 14:48:31.938598041 -0400
+++ /proc/self/fd/19	2024-09-06 14:48:31.938598041 -0400
@@ -1,4 +1,4 @@
-HDF5 "/tmp/simple2.nwb" {
+HDF5 "/tmp/simple2-copy.nwb" {
 GROUP "/" {
    ATTRIBUTE "namespace" {
       DATATYPE  H5T_STRING {
@@ -45,7 +45,7 @@
       }
       DATASPACE  SCALAR
       DATA {
-      (0): "154bbc4f-4276-47db-bac9-f7cdc8880aa4"
+      (0): "c8b730fc-f3bf-4619-8069-c66f5ff0a9aa"
       }
    }
    GROUP "acquisition" {
@@ -183,7 +183,7 @@
             }
             DATASPACE  SCALAR
             DATA {
-            (0): "db410d65-a49a-4bd8-8ec9-ad6076d272e7"
+            (0): "eb09c10a-6ac9-461b-bb44-5bccd2551a3b"
             }
          }
          DATASET "date_of_birth" {

Now I wonder: how do I discover whether the original file had the spec cached, so that I export without caching only if the original didn't have it cached?

@rly
Contributor

rly commented Sep 10, 2024

The spec is cached in an HDF5 NWB file if the root group of the file has an attribute named ".specloc" (its value is set to "/specifications" to indicate that the cached spec is in the specifications group).
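A minimal sketch of that attribute check, assuming h5py is available. The helper names `has_cached_spec` and `file_has_cached_spec` are hypothetical; only the presence of the attribute is tested, not its value.

```python
from collections.abc import Mapping

SPEC_LOC_ATTR = ".specloc"  # attribute name given in the comment above


def has_cached_spec(root_attrs: Mapping) -> bool:
    """Return True if the root-level attributes include the spec location."""
    return SPEC_LOC_ATTR in root_attrs


def file_has_cached_spec(path: str) -> bool:
    """Open an HDF5 file read-only and check its root attributes."""
    # Imported lazily so the pure check above can be used without h5py.
    import h5py

    with h5py.File(path, "r") as f:
        return has_cached_spec(f.attrs)
```

Keeping the check on a plain mapping separate from the file access makes it easy to exercise with an ordinary dict, for example `has_cached_spec({".specloc": "/specifications"})`.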

@rly
Contributor

rly commented Sep 10, 2024

Alternatively, you can run pynwb.NWBHDF5IO.get_namespaces(path), which returns an empty dict if there are no cached namespaces.
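A small sketch of how that call could drive the decision asked about earlier in the thread. The helper names `cached_namespaces` and `should_cache_spec` are hypothetical, and the path is passed by keyword here as a precaution, on the assumption that `path` is an accepted keyword argument.

```python
def cached_namespaces(path: str) -> dict:
    """Return the namespaces cached in the file (empty dict if none)."""
    # Imported lazily so the decision helper below works without pynwb.
    from pynwb import NWBHDF5IO

    return NWBHDF5IO.get_namespaces(path=path)


def should_cache_spec(namespaces: dict) -> bool:
    """Cache the spec in the copy only if the original had it cached."""
    return bool(namespaces)
```

For example, `should_cache_spec(cached_namespaces("/tmp/simple2.nwb"))` would give the value to use for the copy: `False` for a file written without a cached spec, `True` otherwise.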

@rly
Contributor

rly commented Oct 2, 2024

I believe this issue has been resolved. @yarikoptic please reopen if not.

@rly closed this as completed on Oct 2, 2024
4 participants