
Reduce file size using linking and compression #21

Open
rly opened this issue Sep 2, 2021 · 3 comments

rly (Collaborator) commented Sep 2, 2021

Linking identical timestamps

For the example NWB file sub-despereaux_ses-despereaux-07_behavior+ecephys.nwb (87 GB), the "analog" and "e-series" TimeSeries have the same timestamps, which take up 2.2 GB of space each. The "sample_count" TimeSeries has almost the same timestamps; they appear to be off from the others by a factor of 1e9. If these are all the same, then two of the three can link to the third to save 4.4 GB of space.
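
For illustration, a minimal pynwb sketch of the linking (names and shapes here are made up; only the linking mechanism is the point). Passing an existing TimeSeries as the `timestamps` argument makes pynwb write an HDF5 link instead of a second copy of the array:

```python
import numpy as np
from pynwb import TimeSeries

# Hypothetical stand-ins for the real acquisitions.
timestamps = np.arange(0.0, 10.0, 0.001)  # shared timebase in seconds

eseries = TimeSeries(
    name="e-series",
    data=np.random.rand(len(timestamps), 128),
    unit="V",
    timestamps=timestamps,  # stored once
)

# Passing the TimeSeries itself (not its timestamps array) tells pynwb
# to link to eseries' timestamps rather than writing a second copy.
analog = TimeSeries(
    name="analog",
    data=np.random.rand(len(timestamps), 4),
    unit="V",
    timestamps=eseries,
)
```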

Compression

The timestamps of these large TimeSeries, whose values almost always differ by 1, can be compressed for a ~70% reduction in their size in my tests.
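
For reference, a sketch of how that compression can be requested when writing through hdmf/pynwb (the gzip level here is an arbitrary placeholder, not necessarily what my tests used):

```python
import numpy as np
from hdmf.backends.hdf5 import H5DataIO
from pynwb import TimeSeries

timestamps = np.arange(0.0, 10.0, 0.001)  # placeholder timebase

ts = TimeSeries(
    name="e-series",
    data=np.random.rand(len(timestamps), 128),
    unit="V",
    # Wrapping the array makes the HDF5 backend write it gzip-compressed.
    timestamps=H5DataIO(timestamps, compression="gzip", compression_opts=4),
)
```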

The "sample_count" TimeSeries contains data values that are almost always increasing by a fixed amount. It would be more space efficient to store only the starting value and the timestamps at which this rule is broken. Alternatively, it would be quite efficient to compress these data (80% reduction by my tests).

The "analog" TimeSeries contains data values that change very slowly over time. It would be very efficient to compress these data (99% reduction by my tests).

The "e-series" TimeSeries contain data that could be compressed for a ~30% reduction in file size by my tests. These data are the largest contributor (70 GB) to the overall file size.

Just applying compression to these TimeSeries and not linking the timestamps would shrink the 87 GB file to about 50 GB, a ~43% reduction in file size. Linking the timestamps of two of the three to the third would save an additional ~1 GB.
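
A rough sanity check on those numbers, covering only the two pieces quantified above (the rest of the drop to ~50 GB would come from the analog and sample_count data, whose sizes are not itemized here):

```python
# Back-of-the-envelope from the figures above, in GB.
eseries_saved = 70 * 0.30          # ~21.0: e-series at ~30% compression
timestamps_saved = 3 * 2.2 * 0.70  # ~4.6: three timestamp copies at ~70%
print(87 - eseries_saved - timestamps_saved)  # ~61.4 GB before analog/sample_count savings
```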

Compression would increase the write and read times of these data, however. Reading a 1e6 x 128 slice of TimeSeries data takes 0.12 seconds without compression and 1.68 seconds with compression (14x slower). This could be optimized further by tuning the chunk size.
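
A sketch of the kind of timing comparison behind those numbers, using h5py directly (array size, chunk shape, and file name are placeholders):

```python
import time
import h5py
import numpy as np

# Placeholder data; the real e-series is far larger.
data = np.random.randint(-1000, 1000, size=(100_000, 128), dtype=np.int16)

with h5py.File("bench.h5", "w") as f:
    f.create_dataset("raw", data=data)
    # Chunk along time; keep all channels together within a chunk.
    f.create_dataset("gz", data=data, chunks=(10_000, 128),
                     compression="gzip", shuffle=True)

with h5py.File("bench.h5", "r") as f:
    for name in ("raw", "gz"):
        t0 = time.perf_counter()
        _ = f[name][:]  # read the whole dataset back
        print(name, f"{time.perf_counter() - t0:.3f} s")
```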

Based on these tests, I would recommend compressing the large TimeSeries data and timestamps (raw voltage data, analog data, and sample count data). The other data in the NWB file are very small relative to these.

khl02007 commented Sep 8, 2021

Thanks @rly. Do you know what the "sample_count" timestamps are for?
And I agree that applying compression is a good idea. Do the read and write times scale linearly with the length of the recording? I wonder how long they would be for a real data file.

rly (Collaborator, Author) commented Sep 15, 2021

@lfrank will look into where "sample_count" is used. The "sample_count" timestamps are stored in nanoseconds intentionally.

@lfrank and I discussed modifying rec_to_nwb to link timestamps where applicable. I will take care of that, but I might need your help @khl02007 to test the new pipeline with actual data.

I will also run some more tests on the time it takes to decompress various datasets. We discussed that compression is probably not worth the decompression time for the data datasets, but it may be worth it for the timestamps, which have a higher compression ratio.

khl02007 commented
@rly I recently read up on HDF5 and understand a bit more about compression now. And I think your suggestions are great! @lfrank is especially concerned with disk space usage, so it would be useful to compress the TimeSeries data and timestamps, maybe with gzip and shuffle filters. I can also run some tests to benchmark the effect of compression on accessing the data through spikeinterface and on spike sorting. As for the chunk shape, I think one of the dimensions should be the number of channels and the other something reasonable; this could also help with extraction of waveform snippets down the road (see the sketch below). Have you worked more on this? Would be happy to work with you / help.
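
Putting gzip + shuffle + channel-aligned chunks together in hdmf terms might look like this sketch (the chunk length along time, 16384 samples, is an arbitrary value to tune):

```python
import numpy as np
from hdmf.backends.hdf5 import H5DataIO

n_channels = 128
data = np.random.randint(-1000, 1000, size=(1_000_000, n_channels), dtype=np.int16)

wrapped = H5DataIO(
    data,
    chunks=(16_384, n_channels),  # one chunk spans all channels
    compression="gzip",
    shuffle=True,  # byte-shuffle usually helps gzip on int16 samples
)
# `wrapped` can then be passed as the `data` argument of an ElectricalSeries.
```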

edeno transferred this issue from LorenFrankLab/rec_to_nwb on Jul 11, 2023