Linking identical timestamps
For the example NWB file sub-despereaux_ses-despereaux-07_behavior+ecephys.nwb (87 GB), the "analog" and "e-series" TimeSeries have the same timestamps, which take up 2.2 GB each. The "sample_count" TimeSeries has almost the same timestamps; they seem to differ from the others by a factor of 1e9. If all three sets are equivalent, two of the three can link to the third to save 4.4 GB.
Compression
The timestamps of these large TimeSeries almost always differ by 1 from one sample to the next, so they compress well: a ~70% reduction in size by my tests.
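To illustrate why such regular timestamps compress well, here is a minimal sketch using only the Python standard library. The timestamps are synthetic and hypothetical (an integer counter with occasional gaps), and the delta-encoding step is an extra illustration, not something measured in the tests above; `zlib` stands in for the gzip filter.

```python
import array
import zlib

def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def to_bytes(values):
    """Serialize as 64-bit signed integers, like a raw int64 dataset."""
    return array.array("q", values).tobytes()

# Hypothetical timestamps: advance by 1, with a gap every 50,000 samples.
timestamps = []
t = 1_000_000
for i in range(200_000):
    t += 7 if i % 50_000 == 0 else 1
    timestamps.append(t)

raw = to_bytes(timestamps)
gzip_only = zlib.compress(raw, 4)
delta_then_gzip = zlib.compress(to_bytes(delta_encode(timestamps)), 4)

# Sizes in bytes: uncompressed, compressed directly, compressed after delta encoding.
print(len(raw), len(gzip_only), len(delta_then_gzip))
```

The exact ratios depend on the data type and the real gap pattern, so these sizes should not be read as a prediction for the actual file.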
The "sample_count" TimeSeries contains data values that are almost always increasing by a fixed amount. It would be more space efficient to store only the starting value and the timestamps at which this rule is broken. Alternatively, it would be quite efficient to compress these data (80% reduction by my tests).
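The "starting value plus exceptions" idea could be sketched roughly as follows. The function names and the choice to record `(index, value)` pairs at each break are assumptions made for illustration; this is not an existing NWB or rec_to_nwb feature.

```python
def encode_steps(values, step):
    """Return (start, exceptions): exceptions holds (index, value) pairs
    wherever values[i] != values[i - 1] + step."""
    exceptions = [
        (i, values[i])
        for i in range(1, len(values))
        if values[i] != values[i - 1] + step
    ]
    return values[0], exceptions

def decode_steps(start, exceptions, step, length):
    """Rebuild the full sequence from the start value and the exceptions."""
    breaks = dict(exceptions)
    out = [start]
    for i in range(1, length):
        # Use the recorded value at a break, otherwise apply the fixed step.
        out.append(breaks.get(i, out[-1] + step))
    return out

# Hypothetical sample counts that increase by 1 except at two jumps.
sample_count = [100, 101, 102, 110, 111, 112, 113, 120]
start, exceptions = encode_steps(sample_count, step=1)
print(start, exceptions)  # 100 [(3, 110), (7, 120)]
restored = decode_steps(start, exceptions, step=1, length=len(sample_count))
```

The trade-off is that readers would need custom logic to rebuild the array, whereas a compressed dataset stays readable by any HDF5 tool.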
The "analog" TimeSeries contains data values that change very slowly over time. It would be very efficient to compress these data (99% reduction by my tests).
The "e-series" TimeSeries contains data that could be compressed for a ~30% reduction in size by my tests. These data are the largest contributor (70 GB) to the overall file size.
Just applying compression to these TimeSeries and not linking the timestamps would result in the 87 GB file shrinking to about 50 GB, a ~32% reduction in file size. Linking the timestamps of two out of the three to the third would then save only an additional ~1 GB, rather than 4.4 GB, because the timestamps would already be compressed.
Compression would increase the write and read time of these data, however. Reading a 1e6 x 128 slice of TimeSeries data takes 0.12 seconds without compression and 1.68 seconds with compression (13x increase). This could be optimized further by modifying the chunk size.
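Chunk size matters because a compressed chunk has to be decompressed as a whole before any part of it can be read. As a rough sketch, the helper below counts how many chunks a read touches; both chunk shapes are hypothetical examples, not the shapes used in the tests above.

```python
def chunks_touched(start, stop, chunk_shape):
    """Number of chunks intersecting the half-open region [start, stop)."""
    count = 1
    for lo, hi, c in zip(start, stop, chunk_shape):
        first = lo // c          # index of the first chunk along this axis
        last = (hi - 1) // c     # index of the last chunk along this axis
        count *= last - first + 1
    return count

# The 1e6 x 128 read from the timing test, as (samples, channels).
read_start, read_stop = (0, 0), (1_000_000, 128)

# Hypothetical layout A: each chunk spans all 128 channels.
print(chunks_touched(read_start, read_stop, (16_384, 128)))  # 62

# Hypothetical layout B: each chunk holds a single channel.
print(chunks_touched(read_start, read_stop, (65_536, 1)))    # 2048
```

Fewer chunks is not automatically faster, since larger chunks mean more data decompressed per access, so the layout would still need to be benchmarked against the real access patterns.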
Based on these tests, I would recommend compressing the large TimeSeries data and timestamps (raw voltage data, analog data, and sample count data). The other data in the NWB file are very small relative to these.
Thanks @rly. Do you know what the "sample_count" timestamps are for?
And I agree that applying compression is a good idea. Do the read and write times scale linearly with the length of the recording? I wonder how long they would be for a real data file.
@lfrank will look into where "sample_count" is used. The "sample_count" timestamps are stored in nanoseconds intentionally.
@lfrank and I discussed modifying rec_to_nwb to do the reference between timestamps where applicable. I will take care of that, but I might need your help @khl02007 to test the new pipeline with actual data.
I will also run some more tests on the time it takes to decompress various datasets. We discussed that the decompression time for the data datasets is probably not worth it, but it may be worth it for the timestamps, which have a larger compression ratio.
@rly I recently read up on HDF5 and understand a bit more about compression now, and I think your suggestions are great! @lfrank is especially concerned with disk space usage, so it would be useful to compress the TimeSeries data and timestamps, maybe with the gzip and shuffle filters. I can also run some tests to benchmark the effect of compression on accessing the data through spikeinterface and on spike sorting. As for the chunk size, I think one of the dimensions should be the number of channels and the other something reasonable; this could also help with extraction of waveform snippets down the road. Have you worked more on this? I would be happy to work with you / help.
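For anyone unfamiliar with the shuffle filter, here is a small standard-library sketch of the idea: the bytes of each element are regrouped so that all first bytes come together, then all second bytes, and so on, before the compressor runs. The signal is synthetic and hypothetical, `zlib` stands in for the gzip filter, and the helper names are made up for illustration; in practice the filters would be enabled through the HDF5 library rather than reimplemented.

```python
import array
import math
import zlib

def byte_shuffle(raw, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))

def byte_unshuffle(shuffled, itemsize):
    """Invert byte_shuffle by interleaving the byte planes again."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)

# Hypothetical slowly varying 16-bit signal (not real recording data).
samples = array.array(
    "h", (int(1000 * math.sin(i / 500)) + (i % 7) for i in range(100_000))
)
raw = samples.tobytes()
itemsize = samples.itemsize

plain = zlib.compress(raw, 4)
shuffled = zlib.compress(byte_shuffle(raw, itemsize), 4)

# Sizes in bytes: uncompressed, gzip only, shuffle followed by gzip.
print(len(raw), len(plain), len(shuffled))
```

Shuffling tends to help when neighboring values share their high-order bytes, which is why it is often paired with gzip for slowly varying numeric data; whether it helps on the raw voltage data would need to be measured.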
edeno transferred this issue from LorenFrankLab/rec_to_nwb on Jul 11, 2023