### Describe the bug

I was trying to export my profile to an archive. My profile's disk-objectstore contains compressed packed files (created with `verdi storage maintain --compress`). During the export, my free disk space quickly shrank because the disk-objectstore loosened all objects.

I think this should not happen, as it increases the disk space needed to create the archive severalfold, especially when the repository contains highly compressible files (~20% compression ratio). In my case, the repository contains lots of VASP XML files.
### Steps to reproduce

Steps to reproduce the behavior:

1. Run `verdi storage maintain --compress` on a repository with loose files
2. Run `verdi archive create -a` to create an archive
3. Go to the disk-objectstore container folder and verify that the loose files reappear
### Expected behavior

No extra storage should be used: the archive writer should be able to simply read each stream sequentially from the storage and write it to the archive.
### Your environment

- Operating system: Linux
- Python version: 3.9.12
- aiida-core version: 2.6.2
### Additional context
I have pinned down the cause: compressed objects in the pack can only be read sequentially as a stream, so to support arbitrary `seek` operations, disk-objectstore opts to loosen the object to disk on demand (aiidateam/disk-objectstore#142).

When writing the archive, the writer uses `seek(0, 2)` in order to find the size of the stream, which triggers the loosening of the object to disk. The relevant code is in `aiida-core/src/aiida/tools/archive/implementations/sqlite_zip/writer.py` (lines 170 to 178 at 9baf3ca):
```python
# the disk-objectstore PackedObjectReader handler, does not support SEEK_END,
# so for these objects we always use ZIP64 to be safe
kwargs['force_zip64'] = True
```
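For context, the size discovery that triggers the loosening follows the standard seek/tell pattern, sketched here on an in-memory stream (not the exact aiida-core code):

```python
import io
import os

# Sketch of the size-discovery pattern: seek to the end, record the
# position, then rewind. For a compressed packed object, the SEEK_END
# step is what forces disk-objectstore to loosen the object to disk.
stream = io.BytesIO(b'some repository object content')
stream.seek(0, os.SEEK_END)
size = stream.tell()
stream.seek(0)  # rewind before copying the content into the archive
print(size)
```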
I'm wondering if there is any other way to obtain the size of the object? It appears there is no repository API for doing so, although the information is certainly available in the disk-objectstore SQLite database (for packed objects) or through the file system (for loose files).
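For what it's worth, the underlying disk-objectstore `Container` already exposes the uncompressed size in the per-object metadata, so something along these lines (container path and hashkeys are placeholders) could read the size without touching the stream at all:

```python
from disk_objectstore import Container

container = Container('/path/to/container')  # placeholder path
hashkeys = ['<hashkey>']  # placeholder hashkeys of the objects to export

# meta['size'] is the uncompressed size recorded in the container's
# SQLite index, so no seek(0, 2) on the compressed stream is needed.
with container.get_objects_stream_and_meta(hashkeys) as triplets:
    for hashkey, stream, meta in triplets:
        print(hashkey, meta['size'])
```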
Alternatively, we could always set `force_zip64`, so there is no need to `seek` and `tell` at all. This also speeds up the archive process, but can result in a larger archive due to the extra ZIP64 headers.
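A minimal sketch of that alternative with the standard-library `zipfile` (file names are illustrative): with `force_zip64=True` the member is written as a purely sequential copy, so the size never needs to be known up front.

```python
import io
import shutil
import zipfile

# Stream an object of unknown size into a zip archive. With
# force_zip64=True the entry gets ZIP64 headers up front, so the
# source stream is only read sequentially and never seeked.
source_stream = io.BytesIO(b'object content of unknown length')  # stand-in for a repository stream
with zipfile.ZipFile('archive.zip', mode='w') as zf:
    with zf.open('repo/objectname', mode='w', force_zip64=True) as dest:
        shutil.copyfileobj(source_stream, dest)
```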
Hi @zhubonan, thanks for the report.
If I understand correctly, the issue only happens when the file repository is compressed for loose files without packing, because if it is packed, `force_zip64` will be used directly?

> When writing the archive, the writer uses `seek(0, 2)` in order to find the size of the stream, which triggers the loosening of the object to disk.
This I don't understand; I don't think `seek` will decompress the file. Am I missing something?
It happens for a packed repository with compression. There is a limitation when reading from a compressed stream: certain seek operations are not allowed, such as going directly to the end of the stream, because there is no way to know where the end is without reading the stream to the end.
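To illustrate the limitation (a self-contained sketch, not disk-objectstore code): for a zlib-compressed stream, the only way to learn the uncompressed size is to decompress it in full, which is why a cheap `SEEK_END` cannot be supported:

```python
import io
import zlib

def uncompressed_size(fileobj, chunk_size=64 * 1024):
    """Return the uncompressed size of a zlib stream.

    There is no header or index giving the size up front; the stream
    has to be decompressed to the end, hence no cheap SEEK_END.
    """
    decompressor = zlib.decompressobj()
    total = 0
    while chunk := fileobj.read(chunk_size):
        total += len(decompressor.decompress(chunk))
    total += len(decompressor.flush())
    return total

compressed = io.BytesIO(zlib.compress(b'x' * 10_000))
print(uncompressed_size(compressed))  # prints 10000
```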
We had a workaround for this such that the packed object is loosened (i.e. decompressed to a loose object on disk) if such an operation is requested. This was the PR I mentioned.

The problem is that the archive writer uses `seek(0, 2)` to locate the end of every object, so for a packed repository this results in all objects being loosened to disk, taking up a large amount of space.
Before that disk-objectstore PR was implemented, the seek would raise an error, so the writer set `force_zip64` to `True` instead, without setting the size information.
I see, I didn't realize that in `seek(0, 2)` the `2` is `os.SEEK_END`.

I think it should be possible to get the size of a compressed file by reading its header? But you make the right point: the sizes before and after compression should be in the SQLite database.