What integration to have between Store and Blob protocol implementations #1343
Comments
I would suggest trying to do an amortized migration from CAR → Blob. Specifically, I suggest the following:

This way the extra costs will be temporary, although sadly incurred on every new write, which is not great. Also, I suspect we can manage to do the Dynamo queries without doing one as CAR and another as blob, but I don't believe that would work for S3.
Actually, now that I'm thinking about it, we probably need to move from looking up whether we have the CAR/blob in S3 to looking it up in location claims, don't we? Because in the future we will not have it in S3, but we will have it in R2, so perhaps we should be checking the index instead. We do need to consider, however, that we may have content in S3/R2 before we have it indexed.
Yes, we need to look for claims, whether location or other. But the exact same problem happens there: the Dynamo/allocation store has the same thing happening, with a claim for the CAR CID or (TBD, we talked about raw, right?) a CID for the multihash.
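For illustration, a rough sketch of that lookup order; the `readClaims` helper and the claim shape are hypothetical stand-ins for whatever claims client we end up using:

```ts
import { CID } from 'multiformats/cid'

// Hypothetical claim shape; the real claims client will differ.
interface LocationClaim { content: CID, location: string[] }

async function haveBytes (
  content: CID,
  readClaims: (c: CID) => Promise<LocationClaim[]>,
  bucketHas: (c: CID) => Promise<boolean>
): Promise<boolean> {
  // Prefer location claims: content may live in R2 (or elsewhere)
  // without ever touching S3.
  if ((await readClaims(content)).length > 0) return true
  // Fall back to the bucket check, since content may land in S3/R2
  // before it is indexed.
  return bucketHas(content)
}
```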
Personally I'd punt on de-duping against old data. There's already a lot more to implement here than I'd imagined, and dealing with de-duping might make the code messy and hard to follow, and leave us with dependencies on buckets we may not be using in the future. When we get to the state where we're uploading to a node on a decentralized network, de-duping will happen at the level of the node you're uploading to, not some global store. If necessary we can implement de-duping with old data at a later date.
Agree with Alan here. I'm fine with not worrying about deduping for now. I would rather handle the migration in a script when we feel it's safe to deprecate.
This PR creates stores and wires up a new `upload-api` running the `blob/add`, `web3.storage/blob/allocate`, `web3.storage/blob/accept` and `ucan/conclude` capabilities. Tests are also imported from the `upload-api` implementation and run here.

As agreed on storacha/w3up#1343, there won't be any deduping between the old world and the new world. Therefore, we have a new `allocations` table and use a different key schema in `carpark`. We are writing blobs keyed by `base58btc`, as previously discussed, as `${base58btcEncodedMultihash}/${base58btcEncodedMultihash}.blob`. I added the `.blob` suffix, but I am happy to hear other suggestions. Depending on how we progress with the read side, should we consider creating a new bucket to fully isolate new content?

The `receipts` and `tasks` storage end up being more complicated, as they need to follow https://github.com/web3-storage/w3infra/blob/main/docs/ucan-invocation-stream.md#buckets. This is essentially the same as what happens in https://github.com/web3-storage/w3infra/blob/main/upload-api/ucan-invocation.js#L66, but at a different level, as this is a proactive write of tasks and receipts.
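For reference, a minimal sketch of that key schema; `blobKey` is a hypothetical helper name, and only the key layout is taken from this PR:

```ts
import { base58btc } from 'multiformats/bases/base58'

/** Derive the carpark key for a blob from its raw multihash bytes. */
function blobKey (multihash: Uint8Array): string {
  const encoded = base58btc.encode(multihash)
  return `${encoded}/${encoded}.blob`
}
```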
store protocol persisted state
Since we shipped w3up, the `store/*` protocol implementation is backed by two state stores:

- `storeTable`: `space` and `link` (CID with CAR codec)
- `carStoreBucket`: `${link}/${link}.car`
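To make that layout concrete, a sketch with hypothetical field names; the real w3infra records may differ:

```ts
import { CID } from 'multiformats/cid'

// storeTable: maps a CAR link to the space it was added to.
interface StoreTableRecord {
  space: string // DID of the space
  link: string  // CID with the CAR codec, as a string
}

// carStoreBucket: keys the CAR bytes by the CAR CID.
const carKey = (link: CID) => `${link}/${link}.car`
```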
blob protocol persisted state
On the other side, we are now implementing the `blob/*` protocol, which is less opinionated about the bag of blocks ingested. Therefore, the blob protocol receives `multihash` bytes and returns back `multihash` bytes, even though naturally it will need to encode this multihash internally (for instance in base64).

The blob protocol needs persisted state quite similar to the store protocol. To untie it from the "store" and "car" related namings, at the moment we are using names closer to the blob protocol:

- `allocationStorage` instead of `storeTable`
- `blobStorage` instead of `carStoreBucket`
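A sketch of the blob-side counterparts, again with hypothetical field names; base64 for the table key is only the example encoding mentioned above:

```ts
import { base64 } from 'multiformats/bases/base64'

// allocationStorage: maps a multihash to the space it was allocated to.
interface AllocationRecord {
  space: string     // DID of the space
  multihash: string // multihash bytes, encoded for use as a table key
}

// Encode the raw multihash bytes (e.g. base64) for the allocation key.
const allocationKey = (multihash: Uint8Array) => base64.encode(multihash)
```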
Note that the indexing SHOULD be quite similar, and it is likely out of scope for this issue to discuss it. The main thing is that the index keys will now be different for the same CARs uploaded.
Integrate new world with old world
The main problem we want to solve here is how to make both worlds work together, or whether it is actually desirable to do so.
When the `store/add` handler is called, the `carStoreBucket` is checked so that we know if that CAR is already stored. If so, we do not need to receive the bytes. Moreover, we check if the `storeTable` has a mapping of the CAR link to that space. Depending on the result of these ops, we can do one of the following:

In the `blob/add` handler, we MUST do the same set of verifications as the ones above. However, we MAY want to continue decoupling allocating on the user's space from requesting bytes to be written for content we have already received as a CAR before.

We can check if we already received a CAR with the same bytes (in other words, we can derive the CAR CID from the multihash by creating a CID with the CAR codec; see the sketch after the note below). However, this will also mean:
Note that this will be tied to looking up the bucket for now, but the same then applies to looking up claims for that content.
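For illustration, a rough sketch of that verification flow, assuming we do de-dupe against the old world; the store interfaces here are hypothetical, and `0x0202` is the CAR multicodec code used to derive the old-world CAR CID from the incoming multihash:

```ts
import { CID } from 'multiformats/cid'
import * as Digest from 'multiformats/hashes/digest'

const CAR_CODE = 0x0202

interface Stores {
  blobStorage: { has: (multihash: Uint8Array) => Promise<boolean> }
  carStoreBucket: { has: (link: CID) => Promise<boolean> }
  allocationStorage: {
    exists: (space: string, multihash: Uint8Array) => Promise<boolean>
    insert: (space: string, multihash: Uint8Array) => Promise<void>
  }
}

async function blobAdd (space: string, multihash: Uint8Array, stores: Stores) {
  // Derive the old-world CAR CID so legacy state can be checked too.
  const carLink = CID.create(1, CAR_CODE, Digest.decode(multihash))

  // Do we already have the bytes, in the new bucket or the old one?
  const haveBytes =
    (await stores.blobStorage.has(multihash)) ||
    (await stores.carStoreBucket.has(carLink))

  // Allocation on the user's space stays decoupled from receiving bytes.
  if (!(await stores.allocationStorage.exists(space, multihash))) {
    await stores.allocationStorage.insert(space, multihash)
  }

  // Only request an upload when the bytes are not yet stored anywhere.
  return { needsUpload: !haveBytes }
}
```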
Alternatively, we could just start from scratch with the new bucket in R2/other write targets. This would also tie in nicely with the previous discussions that a new bucket should exist once nucleation happens, instead of having the nucleated entity bill for historical content.
I would like your opinions to get to a decision. cc @hannahhoward @alanshaw @Gozala @reidlw