rfc: a car block index format for external use. #9

Closed
wants to merge 3 commits into from

@olizilla olizilla commented Jan 29, 2024

The CAR v2 Multihash Sorted Index and friends were designed to be appended to a CAR v1 file, and used for rapid random access where you have both CAR and index locally.

We need to be able to easily make range requests for block bytes, but this isn't possible with the existing format. A format more appropriate for our use case is needed.
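
As a rough illustration of what such a lookup needs to yield, here is a minimal Python sketch that turns one index entry into an HTTP `Range` header. The `range_header` helper and the offset and length values are hypothetical and are not defined by the RFC.

```python
def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header covering exactly `length` bytes from `offset`.

    HTTP byte ranges are inclusive at both ends, hence the `- 1`.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Hypothetical entry: the block bytes start at byte 100 of the CAR and run for 50 bytes.
headers = range_header(100, 50)
```

With an entry like this, a single index lookup is enough to fetch only the block bytes, without reading the length prefix or CID that precede them in the CAR.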

Note that we want to move away from doing the work of indexing the blocks ourselves. We don't want to rely on bucket events, and we don't want to do large amounts of block reading and verifying per CAR unless users are willing to pay for us to do that. On the flip side, it is trivial to create the CAR index on the client side while the CAR is being assembled. Let's make sure that when we switch to user-created CAR indexes, they are in a format we can use as easily as our current block index db.
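
To show why client-side indexing is cheap, here is a hedged Python sketch that records the offset and length of each block's bytes while a CARv1 byte stream is assembled. It assumes the CARv1 layout of a varint-prefixed header followed by sections of `varint(len(cid) + len(data)) | cid | data`. The function names, the placeholder header, and the fake CID bytes are all hypothetical, and keying the index by CID bytes is only one option (the thread also discusses keying by multihash).

```python
from typing import Dict, Iterable, Tuple


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used for CARv1 length prefixes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_car_with_index(
    header: bytes, blocks: Iterable[Tuple[bytes, bytes]]
) -> Tuple[bytes, Dict[bytes, Tuple[int, int]]]:
    """Assemble CAR bytes and map each CID to (offset, length) of its block data."""
    car = bytearray()
    car += encode_varint(len(header)) + header
    index: Dict[bytes, Tuple[int, int]] = {}
    for cid, data in blocks:
        car += encode_varint(len(cid) + len(data))
        car += cid
        # The block bytes begin right after the CID, at the current end of the buffer.
        index[cid] = (len(car), len(data))
        car += data
    return bytes(car), index


# Placeholder inputs: a real header is dag-cbor and real CIDs are binary CIDs.
blocks = [(b"fake-cid-1", b"hello"), (b"fake-cid-2", b"world!")]
car_bytes, index = write_car_with_index(b"header-placeholder", blocks)
```

Each recorded `(offset, length)` pair points at the block bytes only, so the writer learns everything a range request needs without a second pass over the CAR.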

A Request For Comments!

Rendered view is here: https://github.com/web3-storage/RFC/blob/493365ac8ca023acdb32c069be9efb7072f3c126/rfc/car-block-indexing/README.md

License: MIT

Signed-off-by: Oli Evans <[email protected]>
@olizilla
Author

@mikeal you have thought about block indexes more than most. Is there an iteration of your multiblock idea that could include block bytes offset and length to let us make fetching block bytes from CARs in buckets via range requests less awful?

@mikeal

mikeal commented Jan 29, 2024

The best way to think about this is that all formats are interchangeable if the new format can be used to produce the other format, because we distinguish them by CID and we can generate the CID for equivalent formats whenever necessary.

So, if we have a better format that we know works perfectly for our needs, we shouldn't be shy about shipping it, so long as you can pass it into a function and get the old format out. If we want to be nice to old protocols, we can even put the CID for the old formats in the claims. We would still only need to include the new format as a block in the claim, as we can assume whoever wants the old format can get it from the new one, like we did to create the CID 😊

@alanshaw
Member

Some thoughts, not in meaningful order:

  1. We do want them to be super compact. Especially since we’re reading from CARs. They can be up to ~4GB in web3.storage and contain a LOT of blocks. We don’t really want read access speed to be affected adversely by downloading a large index so keeping it compact will help. CBOR encoding many CIDs is not going to yield the most compact format.
  2. AFAIK you can’t stream out dag-cbor encoded data. Whereas with a format more like an existing CARv2 index you can stream the data and improve access speeds, i.e. you can pause/stop/yield when you encounter the CID you’re looking for in the index. It also means you don’t have to hold the whole index in memory at any given time - you can just extract what you need.
  3. Either way I would specify an order for items in the spec. Bear in mind we might want to build an index that doesn’t cover the whole CAR. Having a deterministic encoding for an index that includes a specific set of blocks will ensure folks don’t duplicate information by encoding multiple indexes for the same data just with differently ordered entries.
  4. In multiformats spirit, having the index be prefixed by an identifier (as CARv2 indexes are) so we know what it is feels like a good idea.
  5. Overall I'm more inclined to create a new index format that is tailored for our needs than just using CBOR.
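
To make points 2 and 3 concrete, here is a minimal Python sketch of a fixed-width, digest-sorted index that can be scanned as a stream and abandoned as soon as the answer is known. The record layout (32-byte sha2-256 digest, 8-byte offset, 8-byte length, big-endian) and the names `RECORD`, `encode_index`, and `find_streaming` are assumptions for illustration only, not the format proposed in the RFC or an existing CARv2 index.

```python
import hashlib
import io
import struct
from typing import BinaryIO, Iterable, Optional, Tuple

# Hypothetical record layout: 32-byte digest, uint64 offset, uint64 length.
RECORD = struct.Struct(">32sQQ")


def encode_index(entries: Iterable[Tuple[bytes, int, int]]) -> bytes:
    """Encode entries sorted by digest, so the same set always yields the same bytes."""
    return b"".join(RECORD.pack(d, o, n) for d, o, n in sorted(entries))


def find_streaming(stream: BinaryIO, digest: bytes) -> Optional[Tuple[int, int]]:
    """Read one record at a time and stop as soon as the answer is known."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            return None  # end of index
        found, offset, length = RECORD.unpack(chunk)
        if found == digest:
            return offset, length
        if found > digest:
            return None  # entries are sorted, so the digest cannot appear later


# Made-up entries: three blocks of 100 bytes each.
entries = [(hashlib.sha256(b"block-%d" % i).digest(), i * 100, 100) for i in range(3)]
index_bytes = encode_index(entries)
```

Because the entries are sorted before encoding, two writers indexing the same set of blocks produce byte-identical indexes, and a reader holds only one 48-byte record in memory at a time.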

@olizilla
Author

olizilla commented Jan 30, 2024

Yes let's iterate on the format.

The cost of encoding a CID in dag-cbor is 2 bytes in addition to the byte representation of the CID.

Do we definitely want the index to store the CID or is the Multihash sufficient, as that is an easy way to shave bytes?

I can demo a streaming dag-cbor parser. The nice thing is the consumer can use the standard (non-streaming) dag-cbor encoder/decoder and see the index as cid linked data, or use a custom streaming parser for rapid index iteration.

CAR offset order is suggested, but CID-sorted order may be preferable.

I'm gonna assume that we agree that there is a problem here worth solving; that Multihash Sorted Index is not the format we want clients to be building as it does not work well for range requests of block bytes and we're gonna continue to need to do that for the foreseeable future.

@ribasushi

> Do we definitely want the index to store the CID or is the Multihash sufficient, as that is an easy way to shave bytes?

My take is that you do not want to have CIDs (that is, the first 2 varints) in the index. Multihash is all you are going to query by, which in turn warrants indexing by multihash alone.
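
A small Python sketch of what dropping those two varints means in practice: a binary CIDv1 is a version varint, a codec varint, and then the multihash, so the index key is simply what remains after the prefix. The helper names are hypothetical; the codec values `0x55` (raw) and `0x70` (dag-pb) and the sha2-256 multihash prefix `0x12 0x20` are standard multicodec values.

```python
import hashlib
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint, returning (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_from_cid(cid: bytes) -> bytes:
    """Strip the version and codec varints from a binary CID."""
    # A CIDv0 is already a bare sha2-256 multihash (0x12 0x20 + 32 bytes).
    if len(cid) == 34 and cid[:2] == b"\x12\x20":
        return cid
    version, pos = read_varint(cid)
    if version != 1:
        raise ValueError(f"unsupported CID version: {version}")
    _codec, pos = read_varint(cid, pos)
    return cid[pos:]


digest = hashlib.sha256(b"hello").digest()
multihash = b"\x12\x20" + digest      # sha2-256, 32-byte digest
cid_raw = b"\x01\x55" + multihash     # CIDv1, raw codec
cid_dag_pb = b"\x01\x70" + multihash  # CIDv1, dag-pb codec
```

Both `cid_raw` and `cid_dag_pb` reduce to the same multihash, which is why a multihash-keyed index answers queries for either CID while storing 2 fewer bytes per entry in this case.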

@alanshaw
Member

alanshaw commented Feb 5, 2024

So to keep "CID-less" and retain ability to materialise other existing CARv2 indexes you either have to store block header offset/size per item OR group by header size (instead of grouping by version and codec in proposed CIDIndexSorted). I'm leaning towards the latter for succinctness.

@olizilla
Author

olizilla commented Feb 5, 2024

Is "be as small as possible" the top priority for an external index file?

"Yes" is a reasonable answer, but it was not my main goal when opening this RFC. What I had in mind was being maximally usable for our use case (one lookup to find the details for a range request for just the block bytes) while also being compatible with as yet unknown others, without introducing yet another format.

I tried to demo in the RFC that we could store them as inter-planetary linked data for a 25% size penalty.

@alanshaw
Member

alanshaw commented Feb 6, 2024

OK, let's not block on this - the proposal SGTM.

@alanshaw
Member

Superseded by storacha/specs#121

@alanshaw alanshaw closed this May 11, 2024