rfc: a car block index format for external use. #9
The CAR v2 Multihash Sorted Index and friends were designed to be appended to a CAR v1 file, and used for rapid random access where you have both CAR and index locally.

We need to be able to easily make range requests for block bytes, but this isn't possible with the existing format. A format more appropriate for our use case is needed here.

Note that we want to move away from doing the work of indexing the blocks ourselves. We don't want to rely on bucket events, and we don't want to do large amounts of block reading and verifying per CAR unless users are willing to pay for us to do that. On the flip side, it is trivial to create the CAR index on the client side while the CAR is being assembled. Let's make sure that when we switch to user-created CAR indexes, they are in a format we can use as easily as our current block index db.

A Request For Comments!

License: MIT
Signed-off-by: Oli Evans <[email protected]>
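To illustrate why building the index on the client is cheap, here is a minimal sketch of recording block byte offsets while a CAR v1 is assembled. It relies on the CAR v1 framing (each section is `varint(len(cid) + len(data)) | cid | data`) and assumes the header is passed in as already-encoded bytes. The names `IndexEntry` and `write_blocks_with_index` are hypothetical and not part of any proposed format.

```python
from typing import Iterable, List, NamedTuple, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


class IndexEntry(NamedTuple):
    # Hypothetical entry shape, for illustration only.
    cid: bytes    # binary CID exactly as written into the CAR
    offset: int   # offset of the first block byte within the CAR
    length: int   # number of block bytes


def write_blocks_with_index(
    header: bytes, blocks: Iterable[Tuple[bytes, bytes]]
) -> Tuple[bytes, List[IndexEntry]]:
    """Assemble a CAR v1 byte string and note where each block's bytes land."""
    car = bytearray(encode_varint(len(header)) + header)
    index: List[IndexEntry] = []
    for cid, data in blocks:
        car += encode_varint(len(cid) + len(data))  # section length prefix
        car += cid
        # The block bytes start right here, after the varint and the CID.
        index.append(IndexEntry(cid, len(car), len(data)))
        car += data
    return bytes(car), index
```

Each entry then points straight at the block bytes, so `car[entry.offset : entry.offset + entry.length]` yields the block without parsing the varint or CID first.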
@mikeal you have thought about block indexes more than most. Is there an iteration of your multiblock idea that could include the block bytes offset and length, to make fetching block bytes from CARs in buckets via range requests less awful?
The best way to think about this is that all formats are interchangeable if the new format can be used to produce the other format, because we distinguish them by CID and we can generate the CID for equivalent formats whenever necessary. So, if we have a better format that we know works perfectly for our needs, we shouldn't be shy about shipping it, so long as you can pass it into a function and get the old format out. If we want to be nice to old protocols we can even put the CID for the old formats in the claims. We would still only need to include the new format as a block in the claim, as we can assume whoever wants the old format can get it from the new one, like we did to create the CID 😊
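As a rough sketch of "pass it into a function and get the old format out": assuming each new-format entry records the multihash, the offset and length of the block bytes, and the size of the frame header (varint plus CID) that precedes them, the frame-start offsets an older multihash-sorted index needs can be derived by subtraction. `BlockEntry` and `to_sorted_frame_offsets` are hypothetical names, and this produces plain rows, not the CAR v2 binary index layout.

```python
from typing import Iterable, List, NamedTuple, Tuple


class BlockEntry(NamedTuple):
    # Hypothetical new-format entry, for illustration only.
    multihash: bytes
    offset: int       # offset of the block bytes within the CAR
    length: int       # number of block bytes
    header_size: int  # bytes of varint + CID that precede the block bytes


def to_sorted_frame_offsets(entries: Iterable[BlockEntry]) -> List[Tuple[bytes, int]]:
    """Derive (multihash, frame start offset) rows, sorted by multihash bytes."""
    rows = [(e.multihash, e.offset - e.header_size) for e in entries]
    return sorted(rows)
```

The reverse direction is not possible without extra information, since a frame-start offset alone does not say where the block bytes begin.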
Some thoughts, not in meaningful order:
Yes, let's iterate on the format.

The cost of encoding a CID in dag-cbor is 2 bytes in addition to the byte representation of the CID. Do we definitely want the index to store the CID, or is the multihash sufficient? That is an easy way to shave bytes.

I can demo a streaming dag-cbor parser. The nice thing is the consumer can use the standard (non-streaming) dag-cbor encoder/decoder and see the index as CID-linked data, or use a custom streaming parser for rapid index iteration.

CAR offset order is suggested, but CID-sorted order may be preferable.

I'm gonna assume that we agree that there is a problem here worth solving: that Multihash Sorted Index is not the format we want clients to be building, as it does not work well for range requests of block bytes, and we're gonna continue to need to do that for the foreseeable future.
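To put numbers on the byte cost, here is a hand-rolled sketch that frames a CID as a DAG-CBOR link and a multihash as a plain CBOR byte string, without using any library. It assumes the DAG-CBOR link encoding of tag 42 (the 2 bytes `0xd8 0x2a`), followed by a byte string whose payload is a `0x00` identity multibase prefix plus the binary CID. The function names are hypothetical.

```python
def cbor_bytes_header(n: int) -> bytes:
    """CBOR major type 2 (byte string) header for a payload of n bytes."""
    if n < 24:
        return bytes([0x40 | n])
    if n < 0x100:
        return bytes([0x58, n])
    if n < 0x10000:
        return bytes([0x59]) + n.to_bytes(2, "big")
    raise ValueError("payload too large for this sketch")


def cbor_bytes(data: bytes) -> bytes:
    """Encode data as a plain CBOR byte string."""
    return cbor_bytes_header(len(data)) + data


def dag_cbor_link(cid: bytes) -> bytes:
    """Encode a binary CID as a DAG-CBOR link (tag 42 around a byte string)."""
    payload = b"\x00" + cid  # identity multibase prefix, then the CID
    return b"\xd8\x2a" + cbor_bytes_header(len(payload)) + payload


# A 36-byte CIDv1 (version, codec, then a 34-byte sha2-256 multihash).
cid = b"\x01\x55\x12\x20" + bytes(32)
multihash = cid[2:]
link_size = len(dag_cbor_link(cid))       # tag + header + prefix + CID
bytes_size = len(cbor_bytes(multihash))   # header + multihash
```

Under these assumptions the tag accounts for 2 bytes, and storing only the multihash as a byte string also drops the prefix byte and the two CID varints.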
My take is that you do not want to have CIDs (the first 2 varints) in the index. The multihash is all you are going to query by, which in turn warrants indexing by multihash alone.
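For reference, a minimal sketch of dropping "the first 2 varints" from a binary CID to get the multihash. It assumes a binary CIDv1 laid out as `varint(version) | varint(codec) | multihash`, and treats a 34-byte value starting with `0x12 0x20` as a CIDv0, which is already a bare sha2-256 multihash. The helper names are hypothetical.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_from_cid(cid: bytes) -> bytes:
    """Return the multihash portion of a binary CID."""
    if len(cid) == 34 and cid[:2] == b"\x12\x20":
        return cid  # CIDv0: the CID is the multihash itself
    _version, pos = read_varint(cid)
    _codec, pos = read_varint(cid, pos)
    return cid[pos:]
```

Two CIDs that differ only in codec map to the same multihash, which is why an index keyed by multihash alone can serve lookups for either.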
So to stay "CID-less" and retain the ability to materialise other existing CARv2 indexes, you either have to store the block header offset/size per item OR group by header size (instead of grouping by version and codec in proposed
Is "be as small as possible" the top priority for an external index file? Yes is a reasonable answer, but it was not my main goal when opening this RFC... being maximally usable for our use case (one lookup to find the details for a range request for just the block bytes) and also being compatible with as yet unknown others was what I had in mind... without introducing yet another format. I tried to demo in the RFC that we could store them as inter-planetary linked data for a 25% size penalty.
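The "one lookup" use case can be sketched as turning an index entry's offset and length directly into an HTTP range request. This assumes the index stores the offset and length of the block bytes themselves; the URL is a placeholder and the function names are hypothetical. HTTP byte ranges are inclusive at both ends, hence the `- 1`.

```python
import urllib.request
from typing import Dict


def range_header(offset: int, length: int) -> Dict[str, str]:
    """Build a Range header covering exactly `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def block_request(car_url: str, offset: int, length: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for just the block bytes."""
    return urllib.request.Request(car_url, headers=range_header(offset, length))


# Placeholder URL; nothing is fetched until the request is passed to urlopen().
req = block_request("https://example.com/bag.car", offset=100, length=50)
```

A server that honours the range should answer with `206 Partial Content` and only the requested bytes, which the client can then hash and check against the multihash.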
Ok, let's not block on this - the proposal SGTM.
Superseded by storacha/specs#121
Rendered view is here: https://github.com/web3-storage/RFC/blob/493365ac8ca023acdb32c069be9efb7072f3c126/rfc/car-block-indexing/README.md