rfc: a car block index format for external use. #9

olizilla · 2024-01-29T15:32:09Z

The CAR v2 Multihash Sorted Index and friends were designed to be appended to a CAR v1 file, and used for rapid random access where you have both CAR and index locally.

We need to be able to easily make range requests for block bytes, but this isn't possible with the existing format. A format better suited to our use case is needed.

Note that we want to move away from doing the work of indexing the blocks ourselves. We don't want to rely on bucket events, and we don't want to do large amounts of block reading and verifying per CAR unless users are willing to pay for us to do that. On the flip side, it is trivial to create the CAR index on the client side while the CAR is being assembled. Let's make sure that when we switch to user-created CAR indexes they are in a format we can use as easily as our current block index db.
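To illustrate what the index has to provide for this: a minimal sketch, assuming a hypothetical index entry that records the `offset` and `length` of the block bytes themselves (not of the enclosing CAR section). The function name is illustrative only.

```python
def range_header(offset: int, length: int) -> dict[str, str]:
    """Build an HTTP Range header covering `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}
```

With an entry like this, one index lookup is enough to fetch exactly the block bytes from a CAR in a bucket.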

A Request For Comments!

Rendered view is here: https://github.com/web3-storage/RFC/blob/493365ac8ca023acdb32c069be9efb7072f3c126/rfc/car-block-indexing/README.md

License: MIT

Signed-off-by: Oli Evans <[email protected]>

olizilla · 2024-01-29T15:35:34Z

@mikeal you have thought about block indexes more than most. Is there an iteration of your multiblock idea that could include block bytes offset and length to let us make fetching block bytes from CARs in buckets via range requests less awful?

mikeal · 2024-01-29T17:25:19Z

The best way to think about this is that all formats are interchangeable if the new format can be used to produce the other format, because we distinguish them by CID and we can generate the CID for equivalent formats whenever necessary.

So, if we have a better format we know works perfectly for our needs, we shouldn't be shy about shipping it, so long as you can pass it into a function and get the old format out. If we want to be nice to old protocols we can even put the CID for the old formats in the claims. We would still only need to include the new format as a block in the claim, as we can assume whoever wants the old format can get it from the new one, like we did to create the CID 😊

alanshaw · 2024-01-30T13:11:46Z

Some thoughts, not in meaningful order:

- We do want them to be super compact, especially since we're reading from CARs, which can be up to ~4GB in web3.storage and contain a LOT of blocks. We don't really want read access speed to be adversely affected by downloading a large index, so keeping it compact will help. CBOR encoding many CIDs is not going to yield the most compact format.
- AFAIK you can't stream out dag-cbor encoded data, whereas with a format more like an existing CARv2 index you can stream the data and improve access speeds, i.e. you can pause/stop/yield when you encounter the CID you're looking for in the index. It also means you don't have to hold the whole index in memory at any given time; you can just extract what you need.
- Either way I would specify an order for items in the spec. Bear in mind we might want to build an index that doesn't cover the whole CAR. Having a deterministic encoding for an index that includes a specific set of blocks will ensure folks don't duplicate information by encoding multiple indexes for the same data just with differently ordered entries.
- In multiformats spirit, having the index be prefixed by an identifier (as CARv2 indexes are) so we know what it is feels like a good idea.
- Overall I'm more inclined to create a new index format that is tailored for our needs than just using CBOR.
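A rough sketch of the streaming and ordering points above. The entry layout here (32-byte digest, then little-endian u64 offset and length) is a made-up fixed-width format for illustration only, not the CARv2 index layout or anything proposed in the RFC.

```python
import io
import struct
from typing import BinaryIO, Iterable, Optional, Tuple

# Hypothetical fixed-width entry: 32-byte digest, u64 offset, u64 length.
ENTRY = struct.Struct("<32sQQ")

def encode_index(entries: Iterable[Tuple[bytes, int, int]]) -> bytes:
    """Encode entries sorted by digest, so one block set always yields the same bytes."""
    return b"".join(ENTRY.pack(d, o, n) for d, o, n in sorted(entries))

def find(stream: BinaryIO, digest: bytes) -> Optional[Tuple[int, int]]:
    """Scan the index one entry at a time, stopping as soon as the answer is known."""
    while True:
        chunk = stream.read(ENTRY.size)
        if len(chunk) < ENTRY.size:
            return None  # reached the end of the index
        d, offset, length = ENTRY.unpack(chunk)
        if d == digest:
            return offset, length
        if d > digest:
            return None  # sorted order: the digest cannot appear later
```

Because entries are sorted before encoding, two clients indexing the same set of blocks produce byte-identical indexes, and a reader never needs to hold more than one entry in memory.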

olizilla · 2024-01-30T17:00:35Z

Yes let's iterate on the format.

The cost of encoding a CID in dag-cbor is 2 bytes in addition to the byte representation of the CID.

Do we definitely want the index to store the CID, or is the multihash sufficient? Dropping the CID prefix is an easy way to shave bytes.
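For reference, a sketch of how a single CID link is laid out in dag-cbor, going by the DAG-CBOR spec as I understand it. The 2 bytes mentioned above are CBOR tag 42; the byte string header and the `0x00` identity multibase prefix are counted separately here. The function name is illustrative.

```python
def dag_cbor_link(cid: bytes) -> bytes:
    """Encode a binary CID as a dag-cbor link: tag 42 + byte string of 0x00 || CID."""
    payload = b"\x00" + cid  # identity multibase prefix required by dag-cbor
    n = len(payload)
    if n < 24:
        header = bytes([0x40 | n])  # major type 2, length in the initial byte
    elif n < 256:
        header = bytes([0x58, n])  # major type 2, 1-byte length follows
    elif n < 65536:
        header = b"\x59" + n.to_bytes(2, "big")  # major type 2, 2-byte length
    else:
        raise ValueError("CID too long for this sketch")
    return b"\xd8\x2a" + header + payload  # 0xd8 0x2a is CBOR tag 42
```

For a 36-byte CIDv1 (raw codec, sha2-256) the encoded link is 41 bytes: 2 for the tag, 2 for the byte string header, and 1 for the multibase prefix.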

I can demo a streaming dag-cbor parser. The nice thing is the consumer can use the standard (non-streaming) dag-cbor encoder/decoder and see the index as cid linked data, or use a custom streaming parser for rapid index iteration.

CAR offset order is suggested, but cid sorted order may be preferable.

I'm gonna assume that we agree that there is a problem here worth solving; that Multihash Sorted Index is not the format we want clients to be building as it does not work well for range requests of block bytes and we're gonna continue to need to do that for the foreseeable future.

ribasushi · 2024-02-05T11:47:29Z

> Do we definitely want the index to store the CID or is the Multihash sufficient, as that is an easy way to shave bytes.

My take is that you do not want the CID prefix (the first 2 varints) in the index. Multihash is all you are going to query by, which in turn warrants indexing by multihash alone.
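As a sketch of what dropping "the first 2 varints" means for a binary CIDv1: read the version and codec varints and keep the rest. Helper names are illustrative; a CIDv0 is already a bare multihash and is rejected here for simplicity.

```python
from typing import Tuple

def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: last byte of the varint
            return value, pos
        shift += 7

def multihash_of(cid: bytes) -> bytes:
    """Strip the version and codec varints from a binary CIDv1."""
    version, pos = read_varint(cid)
    if version != 1:
        raise ValueError("expected a binary CIDv1")
    _codec, pos = read_varint(cid, pos)
    return cid[pos:]  # what remains is the multihash: code, size, digest
```

Two CIDs that differ only in codec map to the same multihash, so an index keyed this way answers lookups for either.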

alanshaw · 2024-02-05T12:07:17Z

So to keep "CID-less" and retain the ability to materialise other existing CARv2 indexes, you either have to store block header offset/size per item OR group by header size (instead of grouping by version and codec as in the proposed CIDIndexSorted). I'm leaning towards the latter for succinctness.

olizilla · 2024-02-05T13:31:36Z

Is "be as small as possible" the top priority for an external index file?

"Yes" is a reasonable answer, but it was not my main goal when opening this RFC. What I had in mind was being maximally usable for our use case (one lookup to find the details for a range request for just the block bytes) while staying compatible with as yet unknown others, without introducing yet another format.

I tried to demo in the RFC that we could store them as inter-planetary linked data for a 25% size penalty.

alanshaw · 2024-02-06T17:58:31Z

Ok, let's not block on this - the proposal SGTM.

alanshaw · 2024-05-11T16:09:00Z

Superseded by storacha/specs#121

chore: move to dir

7f0aaac

fix: header bytes is 58

493365a

alanshaw mentioned this pull request Feb 3, 2024

feat!: CID index sorted alanshaw/cardex#9

Open

alanshaw mentioned this pull request Feb 5, 2024

feat!: range index sorted alanshaw/cardex#10

Open

hannahhoward mentioned this pull request Mar 12, 2024

Completing content claims integration work with IPNI storacha/project-tracking#10

Closed

alanshaw mentioned this pull request Apr 17, 2024

feat: w3 index spec storacha/specs#121

Merged

alanshaw closed this May 11, 2024
