feat: add IPNI spec #85
base: main
Changes from 1 commit
720348a
05caee2
f01552e
8e5ee5a
458f660
e9ad459
f3922bc
ad9e329
2c212e6
3bc0977
40fb998
@@ -0,0 +1,237 @@ | ||||||||||
# W3 IPNI Protocol | ||||||||||
|
||||||||||
![status:wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) | ||||||||||
|
||||||||||
## Authors | ||||||||||
|
||||||||||
- [olizilla], [Protocol Labs] | ||||||||||
|
||||||||||
# Abstract | ||||||||||
|
||||||||||
For IPNI we assert that we can provide batches of multihashes by signing "Advertisements". | ||||||||||
|
||||||||||
|
||||||||||
With an inclusion claim, a user asserts that a CAR contains a given set of multihashes via a CAR index.
|
||||||||||
|
||||||||||
This spec describes how to merge these two concepts by adding an `ipni/offer` capability to submit an inclusion claim as an IPNI Advertisement. | ||||||||||
|
||||||||||
## Language | ||||||||||
|
||||||||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119). | ||||||||||
|
||||||||||
## Introduction | ||||||||||
|
||||||||||
**What this unlocks** (tl;dr) | ||||||||||
|
||||||||||
|
||||||||||
- Create 1 or more IPNI Adverts per user-uploaded CAR and set the ContextID to be the CAR CID (instead of arbitrary batches with no meaningful ContextID)
> **Review comment:** Shard CID + Space DID?
>
> **Review comment:** I think this should be Shard CID only. If we publish the same set of multihashes again to IPNI because someone adds the same CAR to a new space, I don't think we want to double the set of results that come back from IPNI for that multihash (1 per space it's in). All the records would have the same set of provider info (w3s), and there's no mechanism to determine which of the space DIDs the user should pass to us when reading (if at all). Related: I'm not sure what happens if we publish the same multihash multiple times to IPNI with different ContextIDs. I think you get multiple results back with the same provider info.
||||||||||
- With this we (or anyone, IPNI is open access) can now use IPNI to find which CAR a block is in. The ContextID bytes provide the CAR CID for any block lookup. The CAR CID can then be used to find the CAR index via our content-claims API.
- We can **delete** the IPNI records by CAR CID if the CAR is deleted. | ||||||||||
- Make IPNI advertising an explicit UCAN capability that clients can invoke rather than a side-effect of bucket events | ||||||||||
- With this we are free to write CARs anywhere. The user's agent invokes an `ipni/offer` capability to ask us to publish an IPNI ad for the blocks in their CAR.
- This empowers the user to opt-in or out as they need, and allows us to bill for the (small) cost of running that service. | ||||||||||
- Put the lime in the coconut. Put an inclusion claim in the IPNI advert metadata. | ||||||||||
- We show that the source of our provider claim is a user-signed inclusion content claim.
- We have to sign IPNI Adverts as the provider, so we can warn folks that this ad is only as good as the user-provided content claim it includes.
|
||||||||||
### Quick IPNI primer | ||||||||||
|
||||||||||
IPNI ingests and replicates billions of signed provider claims for where individual block CIDs can be retrieved from. | ||||||||||
|
||||||||||
Users can query IPNI servers for any CID, and it provides a set of provider addresses and transport info, along with a provider specific ContextID and optional metadata. | ||||||||||
|
||||||||||
http://cid.contact hosts an IPNI server that Protocol Labs maintains. *(at time of writing)* | ||||||||||
|
||||||||||
```bash
curl https://cid.contact/cid/bafybeicawc3qwtlecld6lmtvsndimoz3446xyaprgsxvhd3aapwa2twnc4 -sS | jq
```
|
||||||||||
```json
{
  "MultihashResults": [
    {
      "Multihash": "EiBAsLcLTWQSx+WydZNGhjs75z18AfE0r1OPYAPsDU7NFw==",
      "ProviderResults": [
        {
          "ContextID": "YmFndXFlZXJheTJ2ZWJsZGNhY2JjM3Z0em94bXBvM2NiYmFsNzV3d3R0aHRyamhuaDdvN2o2c2J0d2xmcQ==",
          "Metadata": "gBI=",
          "Provider": {
            "ID": "QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC",
            "Addrs": [
              "/dns4/elastic.dag.house/tcp/443/wss"
            ]
          }
        },
        {
          "ContextID": "YmFndXFlZXJheTJ2ZWJsZGNhY2JjM3Z0em94bXBvM2NiYmFsNzV3d3R0aHRyamhuaDdvN2o2c2J0d2xmcQ==",
          "Metadata": "oBIA",
          "Provider": {
            "ID": "QmUA9D3H7HeCYsirB3KmPSvZh3dNXMZas6Lwgr4fv1HTTp",
            "Addrs": [
              "/dns4/dag.w3s.link/tcp/443/https"
            ]
          }
        }
      ]
    }
  ]
}
```
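The `Multihash` and `ContextID` values in the response are base64-encoded bytes. As a minimal sketch (the variable names are illustrative, and the values are copied from the response above), they can be decoded with the Python standard library to see the multihash prefix and the provider-chosen ContextID:

```python
import base64

# Values copied from the cid.contact response shown above.
multihash_b64 = "EiBAsLcLTWQSx+WydZNGhjs75z18AfE0r1OPYAPsDU7NFw=="
context_id_b64 = "YmFndXFlZXJheTJ2ZWJsZGNhY2JjM3Z0em94bXBvM2NiYmFsNzV3d3R0aHRyamhuaDdvN2o2c2J0d2xmcQ=="

multihash = base64.b64decode(multihash_b64)
context_id = base64.b64decode(context_id_b64)

# A multihash is <hash function code><digest length><digest>.
# 0x12 is the code for sha2-256 and 0x20 is a 32 byte digest length.
hash_code, digest_len, digest = multihash[0], multihash[1], multihash[2:]
print(hex(hash_code), digest_len, digest.hex())

# IPNI treats the ContextID as opaque bytes; this provider happens to use ASCII text.
context_text = context_id.decode("ascii")
print(context_text)
```

Nothing in IPNI interprets the ContextID, so the decoding step here only works because this particular provider chose a printable value.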
|
||||||||||
web3.storage publishes the blocks it can provide by encoding a batch of multihashes as an IPLD object and writing it to S3 as an `Advertisement`, addressed by its CID.
|
||||||||||
An `Advertisement` includes `Provider` info, which claims that the batch of multihashes is available via bitswap or HTTP, and is signed with the provider PeerId private key. Each advert is a claim that this peer will provide that batch of multihashes.
|
||||||||||
Advertisements also include a CID link to any previous ones from the same provider forming a hash linked list. | ||||||||||
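To illustrate the hash linked list, here is a minimal sketch of walking a chain from the head back towards the oldest advert. It is a simplified model, not the IPNI implementation: the `put` and `walk` helpers are hypothetical, adverts are plain dicts, and a hex sha2-256 of the JSON bytes stands in for a real CID.

```python
import hashlib
import json

def put(store, node):
    """Store a node under the hex sha2-256 of its JSON bytes (a stand-in for a real CID)."""
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    store[key] = node
    return key

def walk(store, head, seen=frozenset()):
    """Yield (key, advert) from head backwards, stopping at the first advert already seen."""
    cursor = head
    while cursor is not None and cursor not in seen:
        ad = store[cursor]
        yield cursor, ad
        cursor = ad["PreviousID"]

store = {}
first = put(store, {"PreviousID": None, "Entries": "chunk-a"})
second = put(store, {"PreviousID": first, "Entries": "chunk-b"})
head = put(store, {"PreviousID": second, "Entries": "chunk-c"})

# A reader that has never seen this chain walks all the way back.
full = [ad["Entries"] for _, ad in walk(store, head)]
# A reader that already processed `first` only fetches the newer adverts.
catch_up = [ad["Entries"] for _, ad in walk(store, head, seen={first})]
print(full, catch_up)
```

Because each advert links to its predecessor, a reader only needs the head CID to discover everything it has not yet processed.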
|
||||||||||
The latest `head` CID of the ad list can be broadcast over gossipsub, to be replicated and indexed by all listeners, or POSTed over HTTP to specific IPNI servers as a notification to pull and index the latest ads from you at their earliest convenience. | ||||||||||
|
||||||||||
The advert `ContextID` allows providers to specify a custom grouping key for multiple adverts. You can update or remove multiple adverts by specifying the same ContextID. The value is an opaque byte array as far as IPNI is concerned, and is provided in the query response. | ||||||||||
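The grouping behaviour can be sketched with a toy in-memory index. This is an assumption-laden model for illustration only: `ToyIndex` is hypothetical, `Entries` is an inline list rather than a link to an `EntryChunk`, and removal is keyed on provider plus ContextID using the `IsRm` flag from the Advertisement schema shown later in this document.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory model: multihash -> set of (provider, context_id) records."""

    def __init__(self):
        self.records = defaultdict(set)

    def apply(self, ad):
        key = (ad["Provider"], ad["ContextID"])
        if ad.get("IsRm"):
            # A removal advert drops every record published under this provider + ContextID.
            for mh in list(self.records):
                self.records[mh].discard(key)
                if not self.records[mh]:
                    del self.records[mh]
        else:
            for mh in ad["Entries"]:
                self.records[mh].add(key)

    def find(self, mh):
        return sorted(self.records.get(mh, ()))

index = ToyIndex()
index.apply({"Provider": "w3s", "ContextID": b"car-1", "Entries": [b"mh-a", b"mh-b"]})
index.apply({"Provider": "w3s", "ContextID": b"car-1", "Entries": [b"mh-c"]})
index.apply({"Provider": "w3s", "ContextID": b"car-2", "Entries": [b"mh-a"]})
before = index.find(b"mh-a")

# One removal advert clears everything grouped under car-1, across both earlier adverts.
index.apply({"Provider": "w3s", "ContextID": b"car-1", "IsRm": True})
after = index.find(b"mh-a")
print(before, after)
```

In this model a multihash published under two ContextIDs yields two records, and removing one ContextID leaves the other in place.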
|
||||||||||
A `Metadata` field is also available for provider-specific retrieval hints that a user should send to the provider when making a request for the block, but the mechanism here is unclear (HTTP headers? bitswap what now?). Regardless, it is more space for provider-specified bytes... like maybe... a content claim! *(foreshadowing!)*
|
||||||||||
### How web3.storage integrates IPNI today | ||||||||||
|
||||||||||
w3s publishes IPNI advertisements as a side-effect of the E-IPFS CAR block indexer.
|
||||||||||
Each multihash in a CAR is sent to an SQS queue. The `publisher-lambda` takes batches from the queue, encodes and signs `Advertisement`s, and writes them to S3 as JSON.
|
||||||||||
The lambda makes an HTTP request to cid.contact to inform it when the head CID of the Advertisement linked list changes.
|
||||||||||
The cid.contact IPNI server fetches the new head Advertisement from our S3 bucket, along with any others in the chain it hasn't read yet, and updates its indexes.
|
||||||||||
Our `Advertisement`s contain arbitrary batches of multihashes defined by SQS queue batching config. The ContextID is set to opaque bytes (a custom hash of the hashes). | ||||||||||
|
||||||||||
#### Diagram | ||||||||||
|
||||||||||
```mermaid
flowchart TD
A[(dotstorage\nbucket)] -->|ObjectCreated fa:fa-car| B(bucket-to-indexer ƛ)
B -->|region/bucket/cid/cid.car| C[/indexer queue/]
C --> indexer(Indexer ƛ)
indexer --> |zQmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn| E[/multihash queue/]
E --> F(ipni Advertisement content ƛ)
F --> |PUT /advertCid|I
F --> |advert CID| G[/Advertisement queue/]
G --> H(ipni publish ƛ)
H --> |PUT /head|I[(Advert Bucket)]
H --> |POST head|IPNI[["`**IPNI**`"]]

carpark[(carpark\nbucket)] --> |ObjectCreated fa:fa-car|w3infra-carpark-consumer(carpark-consumer ƛ)
w3infra-carpark-consumer -->|region/bucket/cid/cid.car| C[/indexer queue/]

indexer ---> dynamo[Dynamo\nblocks index]
```
|
||||||||||
## Proposal | ||||||||||
|
||||||||||
Provide an `ipni/offer` UCAN ability to sign and publish an IPNI Advertisement for the set of multihashes in a CAR a user has stored with w3s, to make them discoverable via IPFS implementations and other IPNI consumers.
> **Review comment:** I wonder if we could make a goal of this to move closer to decentralize IPNS. Have a separate service that provides this capability that could be implemented by multiple parties. This would also allow us to reach out to IPNI team and see if they would like to run this service instead of us, making it available for other users of IPNS if they would like to get into the UCAN world. We could therefore decouple w3s from this system.
>
> **Review comment:** I'm into this from the angle: "let's expose our block level index info in a way that is easy to replicate rather than trapping it in a private db"... decentralise our indexes. I don't think IPNI team would jump at the chance to host this but I'm in favour of seeing it as a separate service. I think we'll bundle it into the client as part of the default upload flow to start with, but we can break it out and make it opt-in / hosted elsewhere once we have this in place.
||||||||||
|
||||||||||
```mermaid
sequenceDiagram
actor Alice
Alice->>w3s: ipni/offer (inclusion proof)
activate w3s
w3s-->>w3s: fetch & verify index
w3s-->>w3s: write advert
w3s-->>Alice: OK (advertisement CID)
w3s-->>ipni: publish head (CID)
deactivate w3s
ipni-->>w3s: fetch advert
activate ipni
ipni-->>ipni: index entries
deactivate ipni
Alice->>ipni: query (CID)
```
|
||||||||||
|
||||||||||
Invoke it with the CID for an [inclusion-claim] that associates a CAR CID with a [MultihashIndexSorted CARv2 Index] CID.
|
||||||||||
:::info | ||||||||||
Other CAR index forms may be supported in the future. A more convenient external CAR index format would provide the offset byte and block byteLength for a multihash from the start of the CAR file. | ||||||||||
::: | ||||||||||
|
||||||||||
|
||||||||||
```json
{
  "iss": "did:key:zAlice",
  "aud": "did:web:web3.storage",
  "att": [{
    "can": "ipni/offer",
    "with": "did:key:space", // user's space DID
    "nb": {
      "inclusion": CID // inclusion claim CID
    }
  }]
}
```

> **Review comment:** I'm a bit confused, is the CID here a link to the
|
||||||||||
**Inclusion claim**

```json
{
  "content": CID, // CAR CID
  "includes": CID // CARv2 Index CID
}
```
|
||||||||||
When `ipni/offer` is invoked the service must fetch the inclusion claim. The encoded claim block may be sent with the invocation. | ||||||||||
> **Review comment:** I would make sending it be a requirement instead. Also perhaps we should just stick
|
||||||||||
|
||||||||||
The service must fetch the CARv2 index and parse it to find the set of multihashes included in the CAR. See: [Verifying the CARv2 Index](#verifying-the-carv2-index)

> **Review comment:** I'm a bit confused here: fetch from where? Should the index not be sent along with the claim?
>
> **Review comment:** Probably not. If the client has already sent the CARv2 index to w3s when the CAR was stored, then the client should not be asked for it again. Also, in this case, the CARv2 index is probably already cached on the w3s service node from when it was stored. Getting the index from storage also means nothing needs to change if we change our decision about having the client create the CARv2 index or having the w3s service create it when the CAR is stored.
>
> **Review comment:** Clients currently don't send CAR indexes.
||||||||||
|
||||||||||
The set of multihashes must be encoded as 1 or more [IPNI Advertisements]. | ||||||||||
> **Review comment:** Is it really required to be 1? Could it not be 0 as well? I think the protocol should not imply that validation is required for at least one block, but that validation MAY happen for each block.
>
> **Review comment:** I think this comment is intended for the validation section. This is the "encode it as 1 or more adverts" part, which is just dealing with max block size.
||||||||||
|
||||||||||
```ipldsch
type Advertisement struct {
    PreviousID optional Link
    Provider String
    Addresses [String]
    Signature Bytes
    Entries Link
    ContextID Bytes
    Metadata Bytes
    IsRm Bool
    ExtendedProvider optional ExtendedProvider
}
```
|
||||||||||
- `Entries` must be the CID of an `EntryChunk` for a subset (or all) of multihashes in the CAR. | ||||||||||
- `ContextID` must be the byte encoded form of the CAR CID. | ||||||||||
> **Review comment (on lines +198 to +199):** I think it's easier to follow if you mention what context is first as it's referenced from the other field.
||||||||||
- `Metadata` must be the bytes of the inclusion claim. | ||||||||||
> **Review comment:** It's a shame to lose the provenance info, as in where the claim originated from. It would be nice to capture the source.
>
> **Review comment:** Also I think it would be nice if I am also very tempted to be storing advertisements in user space as opposed to our own custom bucket, if they delete it we can then publish delete advertisement.
>
> **Review comment:** I like the idea of storing advertisements in user space. That way the user pays to index the advertisements as part of the storage cost. We will need to generate events when a file is deleted so that a removal advertisement can be created. Can a user opt out of indexing?
>
> **Review comment:** General direction I'm advocating for is that if you upload content to our service it just sits there without being indexed or advertised anywhere. If the user wants to make it readable they must issue an invocation requesting a location claim to be made for the uploaded content, which will in turn index and advertise. Deletes happen on user invocation, which can be a trigger to remove an advertisement.
||||||||||
|
||||||||||
See: [Encoding the IPNI Advertisement](#encoding-the-ipni-advertisement) | ||||||||||
|
||||||||||
The Advertisement CID should be POSTed to an IPNI server. `cid.contact` is assumed initially. | ||||||||||
|
||||||||||
The Advertisement CID should be gossiped on the `/indexer/ingest/mainnet` topic so it can be replicated by other IPNI servers, ensuring many nodes can answer queries for the blocks we host.
|
||||||||||
|
||||||||||
### Verifying the CARv2 Index | ||||||||||
|
||||||||||
The service must fetch the CARv2 Index and may verify 1 or more multihashes from the index exist at the specified offsets in the associated CAR. | ||||||||||
|
||||||||||
The verifier should pick a set of multihashes at random, fetch the bytes from the CAR identified by each index entry, and verify the multihash of those bytes. The invocation must return an error if any entry is found to be invalid.
|
||||||||||
Random validation of a number of blocks allows us to detect invalid indexes and lets us tune how much work we are willing to do per CAR index.
|
||||||||||
Full validation of every block is not recommended as it opens us up to performing unbounded work. *We have seen CAR files with millions of tiny blocks.* | ||||||||||
> **Review comment:** Based on our previous chat, I was thinking the random validation would consider a % of the blocks in a CAR. But reading this now, it looks like a specific number. Would that be the case? I would be more in favour of a random %, but it is probably a good idea to add a custom MAX. Otherwise, there is the attack vector of uploading a gigantic CAR of tiny blocks to better try to not get bad ones validated.
>
> **Review comment:** Yeah percent sounds good, but maybe with a max number we're willing to consider... in the spec I'd probably just specify a "random sample that may include none or all of the blocks" though.
>
> **Review comment:** I suggest a configurable percentage (validation factor) from 0 to 100%. Any non-zero fraction of a number of blocks is rounded up to the nearest integer. So that 10 blocks at 3% validation, still validates one block.
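The random validation described in this section can be sketched as follows. This is a sketch under stated assumptions, not a settled design: the `fraction` and `max_checks` parameters and the round-up rule are taken from the review discussion rather than from the spec text, the index is modelled as hypothetical `(digest, offset, length)` tuples (the MultihashIndexSorted format itself records offsets, not block lengths), and only sha2-256 is handled.

```python
import hashlib
import math
import random

def sample_size(total, fraction, max_checks):
    """Number of entries to verify: a fraction of the index, rounded up, then capped."""
    if total == 0 or fraction <= 0:
        return 0
    return min(total, max_checks, math.ceil(total * fraction))

def verify_sample(car_bytes, index, fraction=0.03, max_checks=1000, rng=random):
    """Check a random sample of index entries against the CAR bytes.

    index is a list of (sha2-256 digest, offset, length) tuples (an assumed shape).
    Raises ValueError on the first entry whose bytes do not hash to the digest.
    """
    count = sample_size(len(index), fraction, max_checks)
    for digest, offset, length in rng.sample(index, count):
        block = car_bytes[offset:offset + length]
        if hashlib.sha256(block).digest() != digest:
            raise ValueError(f"index entry at offset {offset} does not match its multihash")
    return count
```

Capping the sample bounds the work per invocation, while the round-up rule means a small CAR with a non-zero fraction still has at least one block checked.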
||||||||||
|
||||||||||
|
||||||||||
### Encoding the IPNI Advertisement | ||||||||||
|
||||||||||
> The set of multihashes must be encoded as 1 or more [IPNI Advertisements]. | ||||||||||
|
||||||||||
Where the IPLD encoded size of an `EntryChunk` with the set of multihashes would exceed 4MiB (the upper limit for a block that can be transferred by libp2p), the set of multihashes must be split into multiple `EntryChunk` blocks.
|
||||||||||
```ipldsch
type EntryChunk struct {
    Entries [Bytes]
    Next optional Link
}
```

> **Review comment:** I find
>
> **Review comment:** Not in this spec. This is the IPNI vocabulary. https://github.com/ipni/specs/blob/main/IPNI.md#entrychunk-chain
>
> **Review comment:** It is baked into the IPNI spec and encoded into existing advertisements. An
> The term
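Splitting the multihashes to stay under the 4MiB limit can be sketched as a greedy partition. The overhead constants below are assumptions chosen for illustration, not measured values; a real implementation would check the actual encoded size of each `EntryChunk`.

```python
MAX_CHUNK_BYTES = 4 * 1024 * 1024  # 4MiB limit from the text above
PER_ENTRY_OVERHEAD = 3             # assumed allowance for per-entry encoding overhead
CHUNK_OVERHEAD = 128               # assumed allowance for the struct wrapper and Next link

def split_entries(multihashes, max_bytes=MAX_CHUNK_BYTES):
    """Greedily partition multihashes into lists whose estimated encoded size fits max_bytes."""
    chunks, current, size = [], [], CHUNK_OVERHEAD
    for mh in multihashes:
        cost = len(mh) + PER_ENTRY_OVERHEAD
        if current and size + cost > max_bytes:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current, size = [], CHUNK_OVERHEAD
        current.append(mh)
        size += cost
    if current:
        chunks.append(current)
    return chunks
```

Each returned list would become the `Entries` of one `EntryChunk`; order is preserved, so concatenating the chunks gives back the original set.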
|
||||||||||
It is possible to create long chains of `EntryChunk` blocks by setting the `Next` field to the CID of another `EntryChunk`, but this requires an entire `EntryChunk` to be fetched and decoded before the IPNI server can determine the next chunk to fetch.

> **Review comment:** This is not a problem since indexing is not guaranteed to be immediate, and it is much faster than having the same multihashes split over multiple advertisements.
||||||||||
|
||||||||||
The containing CAR CID provides a useful `ContextID` for grouping multiple (lightweight) Advertisement blocks, so it is recommended to split the set across multiple `Advertisement` blocks, each pointing to an `EntryChunk` holding a partition of the set of multihashes, with the `ContextID` set to the CAR CID.
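The recommended layout, one lightweight advert per `EntryChunk` with every advert sharing the CAR CID as `ContextID`, might look like the sketch below. It is illustrative only: `build_adverts` is a hypothetical helper, the adverts are unsigned dicts that omit `Provider`, `Addresses` and `Signature`, and the `PreviousID` values are placeholder strings where a real implementation would use the CID of the encoded, signed advert.

```python
def build_adverts(car_cid_bytes, claim_bytes, entry_chunk_cids, previous_id=None):
    """Return one unsigned advert dict per EntryChunk, chained oldest to newest."""
    adverts = []
    for chunk_cid in entry_chunk_cids:
        advert = {
            "PreviousID": previous_id,
            "Entries": chunk_cid,
            "ContextID": car_cid_bytes,  # same grouping key on every advert for this CAR
            "Metadata": claim_bytes,     # bytes of the inclusion claim
            "IsRm": False,
        }
        adverts.append(advert)
        # Placeholder link: stands in for the CID of the encoded, signed advert.
        previous_id = f"advert-for-{chunk_cid}"
    return adverts

adverts = build_adverts(b"car-cid", b"claim", ["chunk-1", "chunk-2", "chunk-3"])
print([a["PreviousID"] for a in adverts])
```

Because all of the adverts share one `ContextID`, they can later be updated or removed together by referring to the CAR CID.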
|
||||||||||
|
||||||||||
[MultihashIndexSorted CARv2 Index]: https://ipld.io/specs/transport/car/carv2/#format-0x0401-multihashindexsorted | ||||||||||
|
||||||||||
[inclusion-claim]: https://github.com/web3-storage/content-claims?tab=readme-ov-file#inclusion-claim | ||||||||||
|
||||||||||
[IPNI Advertisements]: https://github.com/ipni/specs/blob/main/IPNI.md#advertisements | ||||||||||
|
||||||||||
[olizilla]: https://github.com/olizilla |
> **Review comment:** `ipni/offer` implies an `ipni/accept` fx per our own conventions... It might be good to have an `ipni/accept` task that is executed when the advert has been written. The receipt might include the advert (C)ID and an identifier for the chain that it is included in.
>
> **Review comment:** I assume that the client is the consumer of that receipt, correct? If so, what does it do with this information? Knowing that an advertisement is published does not guarantee that it has yet been ingested by IPNI. Should this receipt have all the same data as the `Announce` message: advertisement CID, peerID, and addresses of where the chain is hosted? The peerID (publisher ID) would identify which chain the ad is on.