Because ZIP files may be appended to, only files specified in the central directory at the end of the file are valid. Scanning a ZIP file for local file headers is invalid (except in the case of corrupted archives), as the central directory may declare that some files have been deleted and other files have been updated.
Maybe consider adding a note about deleted and updated files to the list here:
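As a rough illustration of what "only files specified in the central directory are valid" implies for a reader, here is a minimal sketch (plain std Rust, not this crate's code) of locating the end-of-central-directory record by scanning backwards from the end of the file:

```rust
/// Minimal sketch (not this crate's code): locate the end-of-central-directory
/// (EOCD) record by scanning backwards from the end of the file for its
/// signature, and read how many entries the central directory declares.
/// Only those entries are part of the archive.
fn find_eocd(bytes: &[u8]) -> Option<(usize, u16)> {
    // The EOCD is 22 bytes plus a trailing comment of up to 65535 bytes,
    // so it must start somewhere in the last 22 + 65535 bytes.
    let earliest = bytes.len().saturating_sub(22 + 65_535);
    let latest = bytes.len().checked_sub(22)?;
    (earliest..=latest)
        .rev()
        .find(|&i| bytes[i..].starts_with(b"PK\x05\x06"))
        .map(|i| {
            // Total number of central directory entries: 2 bytes at offset 10.
            let total_entries = u16::from_le_bytes([bytes[i + 10], bytes[i + 11]]);
            (i, total_entries)
        })
}

fn main() {
    // Smallest valid archive: a lone EOCD record declaring zero entries.
    let empty_zip = [&b"PK\x05\x06"[..], &[0u8; 18][..]].concat();
    assert_eq!(find_eocd(&empty_zip), Some((0, 0)));
    println!("EOCD at offset 0, 0 entries");
}
```

A real parser would also validate the candidate record (for example, that the comment length matches the remaining bytes), since a false signature can occur inside the archive comment.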
Do you happen to know how a file would be updated or deleted in this way, such that its LFH becomes invalid? I could just be blind, but I can't find anything in the spec that strictly supports this.
Instead, this is a requirement of the spec:
Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.
which means you can't just delete a file by removing the CDR but leaving the actual LFH/data present.
One other thing that occurred to me was that you can't use Stored when storing an inner ZIP file, because we'll start matching on that inner ZIP file's signatures. Will add that now.
4.3.2 Each file placed into a ZIP file MUST be preceded by a "local
file header" record for that file. Each "local file header" MUST be
accompanied by a corresponding "central directory header" record within
the central directory section of the ZIP file.
I think one way to interpret that statement is:

- All files included in the ZIP must have a local file header.
- Each of these local file headers must be pointed to by the central directory.

So the local file headers of every included file must have a corresponding record in the central directory. It doesn't necessarily say that every local file header must be pointed to by the central directory.
...the central directory may declare that some files have been deleted and other files have been updated.
For example, we may start with a ZIP file that contains files A, B and C. File B is then deleted and C updated. This may be achieved by just appending a new file C to the end of the original ZIP file and adding a new central directory that only lists file A and the new file C.
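The scenario above can be shown end to end with a hand-rolled sketch (plain std Rust, nothing from this crate; CRC-32s are zeroed because only the structure is inspected): build A, B, C, append a new C plus a central directory listing only A and the new C, then compare what signature-scanning sees with what the central directory declares.

```rust
use std::collections::HashMap;

/// Minimal local file header + data for a Stored (uncompressed) entry.
fn lfh(name: &str, data: &[u8]) -> Vec<u8> {
    let mut v = Vec::new();
    v.extend_from_slice(b"PK\x03\x04"); // local file header signature
    v.extend_from_slice(&[20, 0, 0, 0, 0, 0, 0, 0, 0, 0]); // version, flags, method=0 (Stored), time, date
    v.extend_from_slice(&0u32.to_le_bytes()); // CRC-32 (zeroed for this sketch)
    v.extend_from_slice(&(data.len() as u32).to_le_bytes()); // compressed size
    v.extend_from_slice(&(data.len() as u32).to_le_bytes()); // uncompressed size
    v.extend_from_slice(&(name.len() as u16).to_le_bytes()); // file name length
    v.extend_from_slice(&0u16.to_le_bytes()); // extra field length
    v.extend_from_slice(name.as_bytes());
    v.extend_from_slice(data);
    v
}

/// Minimal central directory header pointing at an LFH offset.
fn cdh(name: &str, data_len: u32, lfh_offset: u32) -> Vec<u8> {
    let mut v = Vec::new();
    v.extend_from_slice(b"PK\x01\x02"); // central directory header signature
    v.extend_from_slice(&[20, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0]); // versions, flags, method, time, date
    v.extend_from_slice(&0u32.to_le_bytes()); // CRC-32
    v.extend_from_slice(&data_len.to_le_bytes()); // compressed size
    v.extend_from_slice(&data_len.to_le_bytes()); // uncompressed size
    v.extend_from_slice(&(name.len() as u16).to_le_bytes()); // file name length
    v.extend_from_slice(&[0; 8]); // extra, comment, disk start, internal attrs
    v.extend_from_slice(&0u32.to_le_bytes()); // external attrs
    v.extend_from_slice(&lfh_offset.to_le_bytes()); // offset of the entry's LFH
    v.extend_from_slice(name.as_bytes());
    v
}

/// Build A + B + C, then append an updated C and a new central directory
/// listing only A and the new C: B is "deleted", the old C is shadowed.
fn build_appended_zip() -> Vec<u8> {
    let mut zip = Vec::new();
    let mut offsets = HashMap::new();
    for (name, data) in [("A", b"aaaa".as_slice()), ("B", b"bbbb"), ("C", b"old!")] {
        offsets.insert(name, zip.len() as u32);
        zip.extend(lfh(name, data));
    }
    let new_c = zip.len() as u32; // the updated C is simply appended
    zip.extend(lfh("C", b"new!"));
    let cd_offset = zip.len() as u32;
    let mut cd = cdh("A", 4, offsets["A"]);
    cd.extend(cdh("C", 4, new_c));
    let cd_size = cd.len() as u32;
    zip.extend(cd);
    zip.extend_from_slice(b"PK\x05\x06"); // end of central directory record
    zip.extend_from_slice(&[0; 4]); // disk numbers
    zip.extend_from_slice(&2u16.to_le_bytes()); // entries on this disk
    zip.extend_from_slice(&2u16.to_le_bytes()); // total entries: only A and the new C
    zip.extend_from_slice(&cd_size.to_le_bytes());
    zip.extend_from_slice(&cd_offset.to_le_bytes());
    zip.extend_from_slice(&0u16.to_le_bytes()); // comment length
    zip
}

fn main() {
    let zip = build_appended_zip();
    let lfh_count = zip.windows(4).filter(|w| *w == b"PK\x03\x04").count();
    let cd_count = zip.windows(4).filter(|w| *w == b"PK\x01\x02").count();
    println!("LFHs by scanning: {lfh_count}, central directory entries: {cd_count}");
}
```

Scanning finds four local file headers, but the central directory lists only two entries, so the archive validly contains just A and the updated C.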
The link above also provides some rationale for this feature: it was originally useful with floppy disks, when ZIP was first designed. There are also several append-only storage systems, so I'd expect this approach is still used in those cases.
I think Stored inner ZIP files should be a solvable problem. Anything contained in the outer ZIP file will be preceded by a local file header that lists the (possibly compressed) length of the contained file. If you ignore all data for the next compressed_size bytes, you can avoid matching on an inner ZIP file.
(I haven't looked at the code for the streaming implementation in this crate yet so this may not be easily applicable to the current implementation)
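That skip could look roughly like this (a sketch over std::io with assumed names, not this crate's streaming API). One caveat worth noting for non-seekable streams: when general-purpose flag bit 3 is set, the sizes in the local file header are zero and deferred to a trailing data descriptor, so this simple skip doesn't apply.

```rust
use std::io::{self, Read};

/// Sketch (not this crate's API): with the reader positioned at a local
/// file header, parse the fixed fields and skip the name, extra field and
/// (possibly compressed) data without inspecting them, so an inner ZIP's
/// signatures are never matched. Returns the number of bytes skipped.
fn skip_entry<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut fixed = [0u8; 30]; // signature + fixed-size LFH fields
    r.read_exact(&mut fixed)?;
    assert_eq!(&fixed[0..4], b"PK\x03\x04", "not a local file header");
    let flags = u16::from_le_bytes([fixed[6], fixed[7]]);
    // Caveat: if general-purpose bit 3 is set, the sizes here are zero and
    // deferred to a data descriptor after the data, so this skip won't work.
    assert_eq!(flags & 0x0008, 0, "sizes deferred to data descriptor");
    let compressed_size = u32::from_le_bytes(fixed[18..22].try_into().unwrap()) as u64;
    let name_len = u16::from_le_bytes([fixed[26], fixed[27]]) as u64;
    let extra_len = u16::from_le_bytes([fixed[28], fixed[29]]) as u64;
    let to_skip = name_len + extra_len + compressed_size;
    io::copy(&mut r.take(to_skip), &mut io::sink())?; // discard without matching
    Ok(to_skip)
}

fn main() -> io::Result<()> {
    // A Stored entry whose data itself begins with a ZIP signature.
    let inner = b"PK\x03\x04 pretend inner zip";
    let mut buf = Vec::new();
    buf.extend_from_slice(b"PK\x03\x04");
    buf.extend_from_slice(&[20, 0, 0, 0, 0, 0, 0, 0, 0, 0]); // version, flags, method=Stored, time, date
    buf.extend_from_slice(&0u32.to_le_bytes()); // CRC-32 (irrelevant here)
    buf.extend_from_slice(&(inner.len() as u32).to_le_bytes()); // compressed size
    buf.extend_from_slice(&(inner.len() as u32).to_le_bytes()); // uncompressed size
    buf.extend_from_slice(&5u16.to_le_bytes()); // name length ("inner")
    buf.extend_from_slice(&0u16.to_le_bytes()); // extra length
    buf.extend_from_slice(b"inner");
    buf.extend_from_slice(inner);
    let mut reader = io::Cursor::new(buf);
    let skipped = skip_entry(&mut reader)?;
    println!("skipped {skipped} bytes; inner signature never inspected");
    Ok(())
}
```

The key design point is that the skip trusts compressed_size rather than searching for the next signature, so data that happens to contain "PK\x03\x04" is never misinterpreted as a header.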
It may be a good idea to add a few more notes to your list of caveats about decompressing from a non-seekable stream.
As in the Wikipedia quote above, maybe consider adding a note about deleted and updated files to the list here:
rs-async-zip/src/read/stream.rs, lines 17 to 28 in 6bca65b