feat: add IPNI EntryChunk encoding #18

Merged · 4 commits · Dec 15, 2023
56 changes: 53 additions & 3 deletions README.md
@@ -5,7 +5,7 @@ Create signed Advertisement records for the [InterPlanetary Network Indexer](htt
> IPNI is a content routing system optimized to take billions of CIDs from large-scale data providers, and allow fast lookup of provider information using these CIDs over a simple HTTP REST API.
> – https://github.com/ipni

- This library handles encoding and signing of IPNI advertisements. To share them with an indexer follow the guidance in the spec [here](https://github.com/ipni/specs/blob/main/IPNI.md#advertisement-transfer)
+ This library handles encoding and signing of IPNI EntryChunk and Advertisement objects. To share them with an indexer, follow the guidance in the spec [here](https://github.com/ipni/specs/blob/main/IPNI.md#advertisement-transfer)

Supports single and [extended providers](https://github.com/ipni/specs/blob/main/IPNI.md#extendedprovider) by separating Provider and Advertisement creation.

@@ -23,7 +23,56 @@ Use `node` > 18. Install as a dependency from `npm`.
npm i @web3-storage/ipni
```

- ## Single provider
+ ## `EntryChunk`

Encode an IPNI `EntryChunk` as a dag-cbor block from one or more multihashes.

```js
import { EntryChunk } from '@web3-storage/ipni'
import { sha256 } from 'multiformats/hashes/sha2'

const hash = await sha256.digest(new Uint8Array())
const chunk = EntryChunk.fromMultihashes([hash])
const block = await chunk.export()

// the EntryChunk CID should be passed to an Advertisement as the `entries` Link.
console.log(`entries cid ${block.cid}`)
```
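
If you already have CIDs rather than raw multihashes, this PR's `entry-chunk.js` also adds an `EntryChunk.fromCids` helper that extracts the multihash bytes for you. A minimal sketch (the CID string here is just an example value):

```js
import { EntryChunk } from '@web3-storage/ipni'
import { CID } from 'multiformats/cid'

const cid = CID.parse('bagbaierarw3cf23e5fhc55yosqielfejjdl6rfrppotlnxl2lf6qultqi2ka')
const chunk = EntryChunk.fromCids([cid]) // stores only cid.multihash.bytes, not the full CID
const block = await chunk.export()
```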

Encode a chain of `EntryChunk`s from a CARv2 index, writing each encoded block to a bucket or block store.

Use `calculateEncodedSize()` to determine when to split the input into additional chunks.

Chain EntryChunks together as a CID linked list via the `next` parameter.

```js
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { MultihashIndexSortedReader } from 'cardex'
import { EntryChunk } from '@web3-storage/ipni'

const PREFERRED_BLOCK_SIZE = (1024 ** 2) * 1 // 1MiB

const carIndexReader = MultihashIndexSortedReader.createReader({
  reader: Readable.toWeb(fs.createReadStream(`car.idx`)).getReader()
})

let entryChunk = new EntryChunk()
while (true) {
  const { done, value } = await carIndexReader.read()
  if (done) break
  entryChunk.add(value.multihash.bytes)
  if (entryChunk.calculateEncodedSize() >= PREFERRED_BLOCK_SIZE) {
    const block = await entryChunk.export()
    writeEntryChunk(block) // your function to put the encoded block to a bucket
    entryChunk = new EntryChunk({ next: block.cid })
  }
}
const block = await entryChunk.export()
writeEntryChunk(block)
writeAdvert({ entries: block.cid }) // pass the final chunk CID as the Advertisement `entries` link
```

## `Advertisement`

Encode a signed advertisement for a new batch of entries available from a single provider.
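
A minimal sketch of the shape, mirroring `examples/car-index.js` from this PR; the peer id and context bytes are placeholders, load your real ones:

```js
import { Provider, Advertisement } from '@web3-storage/ipni'
import { createEd25519PeerId } from '@libp2p/peer-id-factory'

// a peer, addr, and protocol that will provide your entries
const provider = new Provider({
  protocol: 'http',
  addresses: '/dns4/example.org/tcp/443/https',
  peerId: await createEd25519PeerId() // placeholder: load your real peerId and private key
})

// `block.cid` from the EntryChunk example above becomes the entries link;
// the context id is any byte string that identifies this batch of entries
const advert = new Advertisement({
  providers: [provider],
  entries: block.cid,
  context: new TextEncoder().encode('my-batch-1'), // placeholder context id
  previous: null // or the CID of your previous advertisement
})
const advertBlock = await advert.export()
console.log(`advertisement cid ${advertBlock.cid}`)
```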

@@ -106,7 +155,7 @@ A `dag-json` encoded Advertisement (formatted for readability):
}
```

- ## Extended Providers
+ ### Extended Providers

Encode a signed advertisement with an Extended Providers section and no context id or entries cid, to announce that **all** previous and future entries are available from multiple providers or over different protocols.
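
A hedged sketch of what that could look like, assuming the same `Advertisement` constructor as above accepts multiple providers and `null` for `entries` and `context`; this PR does not show the extended-providers call directly, so treat the exact arguments (including the `'bitswap'` protocol value) as assumptions:

```js
import { Provider, Advertisement } from '@web3-storage/ipni'
import { createEd25519PeerId } from '@libp2p/peer-id-factory'

// assumption: one Provider per protocol/address the content is available on
const httpProvider = new Provider({
  protocol: 'http',
  addresses: '/dns4/example.org/tcp/443/https',
  peerId: await createEd25519PeerId() // placeholder peer id
})
const bitswapProvider = new Provider({
  protocol: 'bitswap',
  addresses: '/dns4/example.org/tcp/4001',
  peerId: await createEd25519PeerId() // placeholder peer id
})

// assumption: null entries + null context announces all previous and future entries
const advert = new Advertisement({
  providers: [httpProvider, bitswapProvider],
  entries: null,
  context: null,
  previous: null
})
const block = await advert.export()
```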

@@ -243,6 +292,7 @@

</details>


[`0x0900`]: https://github.com/multiformats/multicodec/blob/df81972d764f30da4ad32e1e5b778d8b619de477/table.csv?plain=1#L145
[`0x0910`]: https://github.com/multiformats/multicodec/blob/df81972d764f30da4ad32e1e5b778d8b619de477/table.csv?plain=1#L146
[`0x0920`]: https://github.com/multiformats/multicodec/blob/df81972d764f30da4ad32e1e5b778d8b619de477/table.csv?plain=1#L147
10 changes: 10 additions & 0 deletions advertisement.js
@@ -1,3 +1,5 @@
import * as Block from 'multiformats/block'
import * as DagCbor from '@ipld/dag-cbor'
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import { concat } from 'uint8arrays/concat'
@@ -158,6 +160,14 @@ export class Advertisement {
      new Uint8Array([IsRm])
    ])
  }

  /**
   * the dag-cbor encoded IPLD Block
   */
  async export () {
    const value = await this.encodeAndSign()
    return Block.encode({ codec: DagCbor, hasher: sha256, value })
  }
}

/**
177 changes: 177 additions & 0 deletions entry-chunk.js
@@ -0,0 +1,177 @@
import { isLink } from 'multiformats/link'
import { sha256 } from 'multiformats/hashes/sha2'
import * as Block from 'multiformats/block'
import * as DagCbor from '@ipld/dag-cbor'
import { tokensToLength } from 'cborg/length'
import { Token, Type } from 'cborg'

export const MAX_BLOCK_BYTES = (1024 ** 2) * 4
export const MAX_ENTRYCHUNK_CHAIN_LENGTH = 400 // or 65536? https://github.com/ipni/storetheindex/blob/e7ffb913a1191909d572febf09fb9aac6ef8bfab/deploy/manifests/prod/us-east-2/tenant/storetheindex/instances/inga/config.json#L83

const CID_TAG = new Token(Type.tag, 42)

/**
 * EntryChunk encodes an array of multihashes to the dag-cbor IPLD form.
 *
 * Call `export` to encode it as a dag-cbor block, and use the CID as the `entries`
 * field in an Advertisement.
 *
 * dag-cbor keeps the encoded size small (you may end up with a lot of them)
 * and lets us calculate the exact encoded size cheaply, so you can stay
 * within libp2p block size limits and let peers gossip your indexes.
 *
 * From the spec:
 * > If an advertisement has more CIDs than fit into a single block for purposes of data transfer,
 * > they may be split into multiple chunks, conceptually a linked list, by using Next as a reference to the next chunk.
 * >
 * > ...each EntryChunk should stay below 4MiB, and a linked list of entry chunks
 * > should be no more than 400 chunks long.
 * >
 * > Above these constraints, the list of entries should be split into multiple advertisements.
 * > This means each individual advertisement can hold up to ~40 million multihashes.
 *
 * @see https://github.com/ipni/specs/blob/main/IPNI.md#entrychunk-chain
 *
 * @typedef {import('./schema').Link } Link
 * @typedef {import('./schema').EntryChunkOutput} EntryChunkOutput
 * @typedef {import('multiformats').MultihashDigest} MultihashDigest
 */
export class EntryChunk {
  /**
   * @param {Object} config
   * @param {Uint8Array[]} [config.entries] array of multihash byte arrays
   * @param {Link} [config.next] cid for previous EntryChunk
   */
  constructor ({ entries, next } = {}) {
    if (entries && !Array.isArray(entries)) {
      throw new Error('entries must be an array')
    }
    if (next && !isLink(next)) {
      throw new Error('next must be a CID')
    }
    this.next = next
    this.entries = entries ?? []
    // the fixed cost of encoding, without the entries array.
    this._encodingOverhead = entryChunkPartialEncodingOverhead(next)
    // sum of the encoded entries bytelength, without the array wrapper.
    this._encodedEntriesLength = tokensToLength(this.entries.map(e => new Token(Type.bytes, { length: e.byteLength })))
  }

  /**
   * @param {Uint8Array} entry byte encoded multihash
   */
  add (entry) {
    this.entries.push(entry)
    this._encodedEntriesLength += tokensToLength(new Token(Type.bytes, { length: entry.byteLength }))
  }

  /**
   * dag-cbor encoded byteLength
   */
  calculateEncodedSize () {
    const arraySize = tokensToLength(new Token(Type.array, this.entries.length))
    return this._encodingOverhead + this._encodedEntriesLength + arraySize
  }
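
  // Worked example (illustrative, not in the original source): for 1000 sha2-256
  // multihashes (34 bytes each) with no `next` link, calculateEncodedSize() returns
  //   9     -- map(1) header + "Entries" key (this._encodingOverhead)
  // + 36000 -- 1000 x (2 byte cbor header + 34 bytes) (this._encodedEntriesLength)
  // + 3     -- array(1000) header (arraySize)
  // = 36012 bytes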

  /**
   * IPLD EntryChunk object shape
   */
  ipldView () {
    return encodeEntryChunk(this)
  }

  /**
   * dag-cbor encoded Block
   */
  async export () {
    return Block.encode({ codec: DagCbor, hasher: sha256, value: this.ipldView() })
  }

  /**
   * @param {MultihashDigest[]} multihashes
   */
  static fromMultihashes (multihashes) {
    const entries = multihashes.map(mh => mh.bytes)
    return new EntryChunk({ entries })
  }

  /**
   * @param {Link[]} cids
   */
  static fromCids (cids) {
    const entries = cids.map(c => c.multihash.bytes)
    return new EntryChunk({ entries })
  }
}

/**
 * Encode to the EntryChunk IPLD shape
 * @see https://github.com/ipni/specs/blob/main/IPNI.md#entrychunk-chain
 *
 * @param {Object} config
 * @param {Uint8Array[]} config.entries array of multihash byte arrays
 * @param {Link} [config.next] cid for previous EntryChunk
 */
export function encodeEntryChunk ({ entries, next }) {
  /** @type {EntryChunkOutput} */
  const entryChunk = {
    Entries: entries,
    ...(next ? { Next: next } : {})
  }
  return entryChunk
}

/**
 * Calculate the byteLength of the dag-cbor encoded bytes for an array of entries.
 *
 * We know the encoded shape, we're figuring out how many entries we can fit in a
 * 4MiB block. We have to derive this from the entries, as hash length can vary.
 *
 * Adapted from @ipld/car https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/buffer-writer.js#L215
 *
 * @param {Object} config
 * @param {Uint8Array[]} config.entries
 * @param {Link} [config.next]
 **/
export function calculateDagCborSize ({ entries, next }) {
  const tokens = [
    new Token(Type.map, next ? 2 : 1),
    new Token(Type.string, 'Entries'),
    new Token(Type.array, entries.length)
  ]
  for (const entry of entries) {
    tokens.push(new Token(Type.bytes, { length: entry.byteLength }))
  }
  if (next) {
    tokens.push(new Token(Type.string, 'Next'))
    tokens.push(CID_TAG)
    // CIDs are prefixed with 0x00 for _historical reasons_
    // see: https://github.com/ipld/js-dag-cbor/blob/83cd99cf8a04a7192d3e3d1e8f3f1c74d2f39a3b/src/index.js#L30C1-L32C11
    tokens.push(new Token(Type.bytes, { length: next.byteLength + 1 }))
  }
  return tokensToLength(tokens)
}

/**
 * Returns byteLength of a partially encoded EntryChunk
 * with optional Next link, but without the Entries array.
 * Just the fixed cost.
 *
 * @param {Link} [next] CID for previous EntryChunk
 */
export function entryChunkPartialEncodingOverhead (next) {
  const tokens = next
    ? [
        new Token(Type.map, 2),
        new Token(Type.string, 'Next'),
        CID_TAG,
        // CIDs are prefixed with 0x00 for _historical reasons_ see: https://github.com/ipld/js-dag-cbor/blob/83cd99cf8a04a7192d3e3d1e8f3f1c74d2f39a3b/src/index.js#L30C1-L32C11
        new Token(Type.bytes, { length: next.byteLength + 1 }),
        new Token(Type.string, 'Entries')
      ]
    : [
        new Token(Type.map, 1),
        new Token(Type.string, 'Entries')
      ]
  return tokensToLength(tokens)
}
70 changes: 70 additions & 0 deletions examples/car-index.js
@@ -0,0 +1,70 @@
import fs from 'node:fs'
import { Readable } from 'node:stream'
import { CID } from 'multiformats/cid'
import { base58btc } from 'multiformats/bases/base58'
import { createEd25519PeerId } from '@libp2p/peer-id-factory'
import { Provider, Advertisement } from '../index.js'
import { MultihashIndexSortedReader } from 'cardex'
import { EntryChunk } from '../entry-chunk.js'

/**
 * @typedef {import('../schema').Link } Link
 * @typedef {import('../schema').EntryChunkOutput} EntryChunkOutput
 * @typedef {import('multiformats').MultihashDigest} MultihashDigest
 */

// a peer, addr, and protocol that will provide your entries
const provider = new Provider({
  protocol: 'http',
  addresses: '/dns4/example.org/tcp/443/https',
  peerId: await createEd25519PeerId() // load your peerID and private key here
})

const carCid = CID.parse('bagbaierarw3cf23e5fhc55yosqielfejjdl6rfrppotlnxl2lf6qultqi2ka')
const carIndexStream = fs.createReadStream(`./${carCid.toString()}.car.idx`)
const carIndexReader = MultihashIndexSortedReader.createReader({ reader: Readable.toWeb(carIndexStream).getReader() })

const PREFERRED_BLOCK_SIZE = (1024 ** 2) * 1 // 1MiB

let previous = null
let entryChunk = new EntryChunk()

while (true) {
  const { done, value } = await carIndexReader.read()
  if (done) break
  console.log(`📌 ${base58btc.encode(value.multihash.bytes)} @ ${value.offset}`)
  entryChunk.add(value.multihash.bytes)
  if (entryChunk.calculateEncodedSize() >= PREFERRED_BLOCK_SIZE) {
    const entries = await writeEntryChunk(entryChunk)
    const context = carCid.bytes
    previous = await writeAdvert({ entries, context, provider, previous })
    entryChunk = new EntryChunk()
  }
}
const entries = await writeEntryChunk(entryChunk)
const context = carCid.bytes
previous = await writeAdvert({ entries, context, provider, previous })

/**
 * @param {EntryChunk} entryChunk
 */
async function writeEntryChunk (entryChunk) {
  const entryBlock = await entryChunk.export()
  console.log(`🧩 ${entryBlock.cid} # EntryChunk`)
  return entryBlock.cid
}

/**
 * @param {Object} config
 * @param {Link} config.entries
 * @param {Uint8Array} config.context
 * @param {Provider} config.provider
 * @param {Link|null} [config.previous=null]
 */
async function writeAdvert ({ entries, context, provider, previous = null }) {
  // an advertisement with a single http provider
  const advert = new Advertisement({ providers: [provider], entries, context, previous })
  const block = await advert.export()
  console.log(`🎟️ ${block.cid} # Advertisement`)
  return block.cid
}