
Parsing zip files in a streaming manner #9

Open
6 of 10 tasks
jimmywarting opened this issue Jul 3, 2019 · 2 comments
Labels: enhancement (New feature or request)
jimmywarting commented Jul 3, 2019

...without reading the whole content of the zip file.

I think we need to read it backwards: read the last bytes, which tell how large the central directory is, then read that many bytes. (Not exactly sure.)
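The "read it backwards" idea can be sketched as follows. This is a minimal illustration, not this project's code: per the zip format, the End Of Central Directory (EOCD) record sits at the end of the file, its fixed part is 22 bytes, and the central directory size and offset live at byte offsets 12 and 16 of that record.

```javascript
// Minimal sketch of locating the EOCD record by scanning backwards.

const EOCD_SIG = 0x06054b50 // "PK\x05\x06" read as a little-endian uint32
const EOCD_MIN = 22 // fixed part of the record, without the trailing comment

function findEOCD (bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  // The zip comment (up to 65535 bytes) may follow the record, so we
  // scan backwards from the end until the signature is found.
  for (let i = bytes.length - EOCD_MIN; i >= 0; i--) {
    if (view.getUint32(i, true) === EOCD_SIG) {
      return {
        totalEntries: view.getUint16(i + 10, true),
        centralDirSize: view.getUint32(i + 12, true),
        centralDirOffset: view.getUint32(i + 16, true)
      }
    }
  }
  return null // not a zip file (or the comment is longer than we scanned)
}
```

With the size and offset in hand, a seekable reader can then slice out exactly the central directory without touching the rest of the file.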

old stuff

Some basic idea of how it could work.

class ZipEntry {
  name = ''
  offset = 0
  size = 0 // or length
  zip64 = false
  comment = ''
  stream () {
    // returns a new stream that reads from offset to offset + size and inflates
  }
}

// some kind of zip parser based on ReadableStream that enqueues ZipEntries?
new ReadableStream({
  async start () {
    // return a promise of something
  }, 
  async pull (ctrl) {
    // the user is ready to get the next entry
    // read/parse central directory or something
    ctrl.enqueue(new ZipEntry(...))
  }
}).pipeTo(new WritableStream({
  write (zipEntry) {
    // do something with zip entry
    if (zipEntry.name.includes('.css')) {
      // new Response(zipEntry.stream()).blob().then(URL.createObjectURL).then(appendToDOM)
    } 
  }
}))
  • decide upon a public api
  • read zip64 format ( zip64 )
    • read from the central zip64 dir?
  • read compressed data by inflating (inflate compressed entries using pako #30)
  • read from the central dir
  • read stored data (non-deflated entries)
  • read from a readableStream (start to end)
  • prepended data (a.zip + b.zip)
  • read encrypted entries
@jimmywarting jimmywarting added this to the Backlog milestone Jul 3, 2019
@jimmywarting jimmywarting added the enhancement New feature or request label Jul 3, 2019
@jimmywarting jimmywarting self-assigned this Jul 10, 2019
@jimmywarting
Copy link
Contributor Author

jimmywarting commented Jul 12, 2019

I have 3 proposals for how we could create our read API. First I would like to give some background so you can understand some of the obstacles of reading a zip file.

The best solution is to read the end of the file and be able to seek/jump to different places multiple times. To do that we need to know the length of the file, either from the blob/file size, a content-length header, or some sort of FileLike object.
(Similar to what zip.js did with its TextReader, BlobReader, and HttpReader. But unlike zip.js we read content with a readable stream instead of doing multiple FileReads or range requests, and our reader class could instead be a FileLike object providing the 3 required items: size, stream() and slice().)

Doing a slice over HTTP would just clone a Request object and change the Range header.
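That FileLike-over-HTTP idea could look something like this. The class name (HttpFile) and its shape are illustrative assumptions, not this project's actual API, and it assumes the server honors Range requests:

```javascript
// Hypothetical FileLike reader backed by HTTP range requests.
class HttpFile {
  constructor (url, start = 0, end = Infinity) {
    this.url = url
    this.start = start
    this.end = end
    this.size = Number.isFinite(end) ? end - start : NaN
  }

  // slice() only narrows the byte window; no request is made yet.
  slice (start, end) {
    return new HttpFile(this.url, this.start + start, this.start + end)
  }

  // stream() issues the actual ranged fetch and resolves to its body.
  stream () {
    return fetch(this.url, {
      headers: { Range: `bytes=${this.start}-${this.end - 1}` }
    }).then(res => res.body)
  }
}
```

The key point is that slicing stays lazy and cheap, exactly like Blob.prototype.slice, so the parser can slice around freely and only pay for the ranges it actually streams.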

But you don't always know the zip content length, so you need to get the whole body before you are able to read anything. For example:

  • our zip writer produces a readableStream and the size is unknown.
  • a remote zip isn't always served with partial-request support or content-length information;
    getting a zip from any github master repo gives you a streamable zip of unknown size

So if the size (in our FileLike object) is not a number (it could be NaN, null, or undefined), then we would read the content using only stream(), from start to end, and never use slice(). (It would be less practical but could work.) Alternatively the reader could accept two types of objects (FileLike or ReadableStream).
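The dispatch between those two input types could be as small as this sketch (pickStrategy is a hypothetical helper name, not part of any proposed API):

```javascript
// Decide how to parse based on what the caller handed us:
// a FileLike with a known size is seekable, anything else is
// consumed as a plain forward-only stream.
function pickStrategy (input) {
  if (typeof input.slice === 'function' && Number.isFinite(input.size)) {
    return 'seek' // jump straight to the central directory at the end
  }
  return 'stream' // parse local file headers from start to end
}
```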

// using async iterator (current form)
for await (const entry of read(blob || readableStream)) {
  console.log(entry)
}

// just an iterator that returns promises
for (const it of read(blob || readableStream)) {
  const { value: entry, done } = await it
}

read(blob || readableStream).pipeTo(new WritableStream({
  write (entry) {
    console.log(entry)
  }
}))

// this could be fine for blob objects but not so much for streams
const entries = await read(blob || readableStream)
const entry = entries[0]
console.log(entry)
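The async-iterator proposal (the current form) boils down to read() being an async generator that lazily yields one entry at a time. A minimal sketch, with the actual parsing stubbed out and the entry names purely illustrative:

```javascript
// read() as an async generator: entries are produced lazily, one per
// central-directory record, so a huge archive never has to be
// materialized as an array up front.
async function * read (file) {
  // Stub: a real implementation would parse `file` here. These fake
  // records stand in for decoded central-directory entries.
  const fakeCentralDir = [
    { name: 'index.html', size: 120 },
    { name: 'style.css', size: 48 }
  ]
  for (const record of fakeCentralDir) {
    yield record
  }
}
```

This shape also composes with the other proposals: an async iterable can be wrapped in a ReadableStream for the pipeTo variant, or collected with Array.fromAsync for the "array of entries" variant.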

jimmywarting (Contributor, Author) commented:
FYI, I have figured out how to read the zip64 format now (a bit more complicated, but I can grasp it now).
I have almost succeeded in reading a zip64 file correctly; I just have to get the right (size) information from the "extraFields".

When I have managed to read it I can start working on making a zip64 file.
