[wip] Web Pathing Specification: initial outline with TODOs #453

143 changes: 143 additions & 0 deletions src/web-pathing.md
@@ -0,0 +1,143 @@
---
title: Web Pathing Specification
description: >
  This specification defines a subset of possible content paths that ensures
  compatibility with existing HTTP and Web Platform standards.
date: 2023-11-12
maturity: wip
editors:
  - name: Marcin Rataj
    github: lidel
    url: https://lidel.org/
    affiliation:
      name: Protocol Labs
      url: https://protocol.ai/
tags: ['architecture', 'httpGateways', 'webHttpGateways']
---

Web Pathing Specification defines a subset of possible content paths
that ensures compatibility with existing HTTP and Web Platform standards.

## Introduction

TODO: Clearly explain why the specification exists and what problem it solves.

This document specifies pathing for content paths in the `/ipfs` and `/ipns`
namespaces, and explains why a logical content root included in a content path
facilitates security isolation and relative pathing in web contexts.

The specification includes guidance on hash functions, multibases, CID
versions, and codecs, and on how they affect an implementation's ability to
translate a path into a traversal of a DAG.

The goal of this specification is to enable competing and interoperable
implementations, all while ensuring seamless traversal of paths within the web
ecosystem.

## Specification

TODO: Explain things in depth.
The resulting specification should be detailed enough to allow competing,
interoperable implementations.

### TODO: things to cover
Member Author

cc @Stebalien @dignifiedquire @hacdias @aschmahmann @Jorropo @rvagg @ribasushi @alanshaw @2color @autonome @darobin for visibility and sourcing early feedback on the scope of this spec.

Feel free to drop a comment about any tricky/painful pathing edge cases you've encountered over the years that we should clarify web behavior for by including them in this spec 🙏


- TODO: why it's called "web pathing": it ensures pathing is interoperable with how existing HTTP and the web platform work; it covers both /ipfs and /ipns namespace semantics; it defines a logical content root CID that can be mapped to the URL root (`/`), which enables subdomain/DNSLink gateways and ipfs:// and ipns:// protocol handlers to load existing datasets, websites, and assets with relative pathing without modifying them.
Contributor

why it's called "web pathing"

I'm curious about this myself. It doesn't strike me as being particularly web-specific, at least not immediately.

Collaborator

@bumblefudge bumblefudge Jan 17, 2024

Would "URL-safe pathing", "web-deterministic pathing", or "web-compatible pathing" be more precise? It isn't pathing FOR or OF the web, but rather a web-compatible subset of the pathing currently possible with the tech to date, right?


- TODO: how web pathing is applied to CLI Tools; path gateways; and origin contexts: subdomain/dnslink, ipfs:// ipns:// URIs
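
As a rough illustration of the contexts named above, the sketch below maps one content path onto a path gateway URL, a subdomain gateway URL, and a native `ipfs://` / `ipns://` URI. The host `gateway.example` and the helper names are hypothetical, and details such as CID case-folding for subdomains or DNSLink label inlining are deliberately left out.

```python
from typing import Dict, NamedTuple


class ContentPath(NamedTuple):
    namespace: str  # "ipfs" or "ipns"
    root: str       # logical content root (CID, IPNS name, or DNSLink name)
    subpath: str    # remainder of the path, always starting with "/"


def parse_content_path(path: str) -> ContentPath:
    """Split '/ipfs/<root>/<rest>' into namespace, content root, and subpath."""
    parts = path.split("/", 3)  # ['', namespace, root, rest?]
    if (len(parts) < 3 or parts[0] != ""
            or parts[1] not in ("ipfs", "ipns") or not parts[2]):
        raise ValueError(f"not a content path: {path!r}")
    rest = parts[3] if len(parts) == 4 else ""
    return ContentPath(parts[1], parts[2], "/" + rest)


def to_urls(path: str, gateway_host: str = "gateway.example") -> Dict[str, str]:
    """Return the three addressing forms for one content path (illustrative only)."""
    cp = parse_content_path(path)
    return {
        "path_gateway": f"https://{gateway_host}/{cp.namespace}/{cp.root}{cp.subpath}",
        "subdomain_gateway": f"https://{cp.root}.{cp.namespace}.{gateway_host}{cp.subpath}",
        "native_uri": f"{cp.namespace}://{cp.root}{cp.subpath}",
    }
```

In the subdomain and native forms the content root becomes the URL origin, which is what lets relative links inside a website resolve without modification.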

- TODO: MUSTs, SHOULDs and MAYs in relation to:

- TODO: multihash functions
Member

Is the intention of this section to clarify baseline multihash & codecs that must be supported to provide content for libraries such as @helia/verified-fetch?

- MUSTs
  - `sha2-256` (`0x12`)
  - `blake2b-256` (`0xb220`)
  - `blake3` (`0x1e`)
  - `identity` (`0x00`) (i.e. the data itself inlined in place of a hash)
    - TODO: Identity CIDs MUST NOT generate network I/O (such as Bitswap or HTTP requests), since the data is always available in the multihash itself
- SHOULDs
  - `sha2-384` (`0x20`, aka SHA-384; as specified by [FIPS 180-4](https://csrc.nist.gov/pubs/fips/180-4/upd1/final)) TODO: where is this used? why is this on the list?
  - sha3-512 TODO: code for such label does not exist, a typo in prior notes? follow up required
Comment on lines +60 to +61
Member Author

@lidel lidel Nov 12, 2023

@John-LittleBearLabs these two were included in your draft for WICG proposal, do you remember the reason/source?

I've found the code for the second one in https://github.com/multiformats/multicodec/blob/master/table.csv but not sure if we intended sha3 (0x14) or should switch to sha2 (0x13) here.

Contributor

What is meant by 'label'?

I don't recall, no. sha2-384 doesn't ring a bell - perhaps it was one of the comments that's now deleted (hackmd doesn't seem to let me mark things as resolved/hidden). As for sha3-512... it was probably not a good source; I think what it was was I found someone somewhere was talking about future-proofing hashes and I looked for one of the recommendations that also was marked as permanent in the table.

I'm definitely open to this list being altered.
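
To make the multihash list above concrete, here is a minimal sketch of reading the function code from a binary multihash and short-circuiting `identity` multihashes so that no network lookup is needed. The function names are hypothetical, and the MUST table simply mirrors the draft list, which is still open to change.

```python
from typing import Optional, Tuple

# Multihash function codes from the draft MUST list (multicodec table values).
MUST_HASHES = {0x12: "sha2-256", 0xB220: "blake2b-256", 0x1E: "blake3", 0x00: "identity"}


def read_uvarint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear marks the last varint byte
            return value, pos
        shift += 7


def parse_multihash(mh: bytes) -> Tuple[int, bytes]:
    """Split a multihash into (function code, digest) and check the declared length."""
    code, pos = read_uvarint(mh)
    length, pos = read_uvarint(mh, pos)
    digest = mh[pos:]
    if len(digest) != length:
        raise ValueError("digest length does not match the declared length")
    return code, digest


def inline_data(mh: bytes) -> Optional[bytes]:
    """Return the inlined bytes of an identity multihash, or None for real hashes."""
    code, digest = parse_multihash(mh)
    return digest if code == 0x00 else None
```

A resolver built along these lines would check `inline_data` first and only fall back to Bitswap or HTTP retrieval when it returns `None`.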


- TODO: multibases
  - MUSTs
    * f - base16
    * b - base32
    * k - base36
    * z - base58btc (case-sensitive!)
    * u - base64url (case-sensitive!)
  - SHOULDs
    * F - base16 (uppercase)
    * B - base32 (uppercase)
    * K - base36 (uppercase)
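
The prefix table above could be applied roughly as follows. This is a sketch under assumptions: the requirement levels copy the draft list, the function names are invented for illustration, and only base32 decoding is shown because it is available in the Python standard library.

```python
import base64

# Multibase prefix -> (encoding name, draft requirement level).
MULTIBASES = {
    "f": ("base16", "MUST"),
    "b": ("base32", "MUST"),
    "k": ("base36", "MUST"),
    "z": ("base58btc", "MUST"),
    "u": ("base64url", "MUST"),
    "F": ("base16upper", "SHOULD"),
    "B": ("base32upper", "SHOULD"),
    "K": ("base36upper", "SHOULD"),
}

# Encodings whose payload must never be case-folded (e.g. by DNS or a browser).
CASE_SENSITIVE_PREFIXES = {"z", "u"}


def identify_multibase(text: str) -> str:
    """Return the encoding name announced by the first character."""
    if not text or text[0] not in MULTIBASES:
        raise ValueError("unknown or missing multibase prefix")
    return MULTIBASES[text[0]][0]


def decode_base32_multibase(text: str) -> bytes:
    """Decode a 'b' or 'B' prefixed, unpadded RFC 4648 base32 string."""
    if text[:1] not in ("b", "B"):
        raise ValueError("not a base32 multibase string")
    payload = text[1:].upper()
    payload += "=" * (-len(payload) % 8)  # restore the padding multibase omits
    return base64.b32decode(payload)
```

The case-sensitive set matters in web contexts because hostnames are case-insensitive, so only the case-insensitive encodings are safe to place in a subdomain.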

- TODO: cid versions
  - MUST:
    - CIDv1 (`0x01`)
Member

if we MUST support CIDv1, we should call out the multibase/hash/codecs that aren't guaranteed to be supported by web-pathing spec implementers.

- CIDv0 (multihash encoded with `base58btc`, with implicit dag-pb `0x70` codec)
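
Since a CIDv0 is just a base58btc multihash with an implicit dag-pb codec, converting it to CIDv1 only requires prepending the version and codec bytes. The following is a minimal sketch with hypothetical function names; it accepts only the common sha2-256 CIDv0 form (46 characters starting with `Qm`) and emits lowercase base32.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text: str) -> bytes:
    """Decode base58btc text (raises ValueError on characters outside the alphabet)."""
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    leading_zeros = len(text) - len(text.lstrip("1"))  # each '1' is a zero byte
    return b"\x00" * leading_zeros + body


def cidv0_to_cidv1_base32(cid_v0: str) -> str:
    """Re-encode a CIDv0 string as CIDv1 (dag-pb) in lowercase base32."""
    if not (len(cid_v0) == 46 and cid_v0.startswith("Qm")):
        raise ValueError("not a CIDv0 string")
    multihash = b58decode(cid_v0)
    if len(multihash) != 34 or multihash[:2] != b"\x12\x20":
        raise ValueError("CIDv0 must wrap a 32-byte sha2-256 multihash")
    cid_v1 = b"\x01\x70" + multihash  # version 1, dag-pb codec, original multihash
    encoded = base64.b32encode(cid_v1).decode("ascii").rstrip("=").lower()
    return "b" + encoded  # 'b' is the multibase prefix for base32
```

The multihash bytes are unchanged by the conversion, so both text forms address the same block.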

- TODO: multicodecs that are required to facilitate path traversal
- DAG-PB
- RAW
- libp2p-key (for IPNS names)
Contributor

Is verifying an IPNS record outside the scope of this document? It's not exactly pathing, even if that's where in my codebase it happens to show up.

This makes me think there really are only 4 we care about, and 2 of them are MAY, and none of them are listed as permanent here.

Member Author

Good question. My initial idea is to refer to IPNS spec which states that only Ed25519 is a MUST (RSA is SHOULD, other key types are MAY).

- DAG-CBOR
- DAG-JSON
Member

we should MUST raw JSON as well, or is the intent to use RAW for that?


- TODO: MUST support UnixFS pathing
  - TODO: traversing HAMTs
  - TODO: traversing symlinks
Contributor

I have questions, and not sure if I should be instead commenting here ?

There's a few basic forms I could imagine this working in, and they're not necessarily incompatible:

  • /ipfs/cid1/a = "/ipfs/cid2/c" : /ipfs/cid1/a/b -> /ipfs/cid2/c/b
    • Replace all left of and including current path element with link contents.
    • IIRC I believe the gateway conformance test has this, so I'm guessing this is the real thing.
    • Are we allowed to link to /ipns/ namespace?
    • If so... even DNSLink? The link would still be immutable, but fully resolved what looks like part of your tree now depends on your local DNS setup?
  • /ipfs/cid1/a = "c" : /ipfs/cid1/a/b -> /ipfs/cid1/c/b (i.e. not starting with /)
    • Replace current path element with link contents.
    • I read someone talking about converting tar to car and if so there's an important special case...
    • "../b" : If allowed we might need rules about this.
  • /ipfs/cid1/a/b = "/c" : /ipfs/cid1/a/b -> /ipfs/cid1/c
    • Replace current path element and everything between the root and current element.
    • Need rules about DAGs that contain a directory under root named either /ipfs/ or /ipns/ etc.
    • I don't love features that break DAG symmetry, but others seem to 🤷

Contributor

Also, how does this interact with _redirects (since it both has to be in root and its redirects can be relative to the root)?
Site A has a _redirects with splat to /a.html
Site B has a symlink (called link) to a's root, and its own redirects splat to /b.html
ipfs://B/link/notfound.html
becomes what exactly?

In my current PR it would redirect to ipfs://B/link/a.html (e.g. it respects A's _redirects file, and does it relative to A's root). But if A did not have a redirect, it would be not found (e.g. B's _redirects is ignored).

Feels weird.

Member Author

@John-LittleBearLabs (I've realized we've discussed this during one of sync calls but did not reply here)

  • symlinks are generally underspecified and not used much. I would mark this as unspecified behavior in this spec until we land "Publish UnixFS specifications at specs.ipfs.tech" (#331)
  • that being said, if you have already implemented symlink support, that's OK; the only caveat is that following a symlink should not allow going beyond the content root (/ipfs/cid), so /ipfs/cid1/a pointing at /ipfs/cid2/b or ../cid2/b must error
  • rules from _redirects are executed only when the requested content path is missing within the same origin (based on the root CID). In the scenario you described, you operate under origin B, which is not aware of _redirects from origin A (so _redirects is not executed)
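
The containment caveat from this reply can be sketched as below. This is only one possible reading: the function and exception names are hypothetical, relative targets are resolved against the directory that holds the link, and every absolute target is rejected, which is stricter than some of the forms discussed earlier in this thread.

```python
import posixpath


class EscapesContentRoot(ValueError):
    """Raised when a symlink target would leave the content root."""


def resolve_symlink(link_path: str, target: str) -> str:
    """Resolve `target` for a symlink stored at `link_path`.

    Both the argument and the result are paths relative to the content root,
    e.g. link_path='docs/a' and target='../img/x' give 'img/x'.
    """
    if target.startswith("/"):
        # Assumption: absolute targets (including /ipfs/... and /ipns/...) always error.
        raise EscapesContentRoot(f"absolute symlink target: {target!r}")
    parent = posixpath.dirname(link_path.strip("/"))
    stack = [segment for segment in parent.split("/") if segment]
    for segment in target.split("/"):
        if segment in ("", "."):
            continue
        if segment == "..":
            if not stack:
                raise EscapesContentRoot(f"target climbs above the root: {target!r}")
            stack.pop()
        else:
            stack.append(segment)
    return "/".join(stack)
```

With this rule, a link at `a` pointing at `../cid2/b` or `/ipfs/cid2/b` fails, while a link at `docs/a` pointing at `../img/x` stays inside the same root.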

- TODO: make sure [UnixFS spec draft](https://github.com/ipfs/specs/pull/331) includes relevant descriptions, only refer to them from here, don't duplicate content

- TODO MUST support DAG-CBOR/JSON pathing
  - TODO `/ipfs/cbor-cid/unixfs-file`
  - TODO `/ipfs/unixfs-dir-cid/dag-cbor-file/cbor-field` (boxo/gateway errors on this ([spec→traversing-cbor notes](https://specs.ipfs.tech/http-gateways/path-gateway/#traversing-through-dag-json-and-dag-cbor)), but we should specify behavior when someone wants to support this)
  - TODO make it clear if both DAG variants of CBOR and JSON are a MUST, or if JSON is a SHOULD (right now conformance tests require both as a MUST).

- TODO: MUST: specify what happens when we can't traverse part of the path
  - TODO: separate errors for traversal failures due to a missing codec vs missing content
Contributor

👍

- TODO: `/ipfs/valid-cid-dag-pb/invalid-path` (logical "not found", translates to HTTP 404 to indicate content does not exist; mention implicit HTTP caching of 404 vs 500)
Member Author

TODO: a browser/HTTP specific section with additional behaviors that are possible when HTTP redirects can be executed:

- TODO: `/ipfs/cid/unknown-codec-block/some/path` is requested (logical "path parser error", translates to HTTP 500 error page due to missing decoder)
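
The two failure classes described above ("not found" versus "path parser error") can be kept apart with distinct error types. The sketch below uses a toy in-memory DAG; the block layout, the codec set, and all names are hypothetical and only illustrate how each error would map to an HTTP status.

```python
from typing import Dict, List, Tuple

SUPPORTED_CODECS = {"dag-pb", "raw"}  # hypothetical decoder set for this sketch

# Toy block store: block id -> (codec name, {link name: child block id}).
Blocks = Dict[str, Tuple[str, Dict[str, str]]]


class PathNotFound(Exception):
    """The path does not exist under the content root (HTTP 404)."""


class MissingDecoder(Exception):
    """A block on the path uses a codec that cannot be decoded (HTTP 500)."""


def traverse(blocks: Blocks, root: str, segments: List[str]) -> str:
    """Follow `segments` from `root` and return the id of the final block."""
    current = root
    for name in segments:
        codec, links = blocks[current]
        if codec not in SUPPORTED_CODECS:
            raise MissingDecoder(f"cannot decode {codec} block {current}")
        if name not in links:
            raise PathNotFound(f"no link named {name!r} in {current}")
        current = links[name]
    return current


def http_status(blocks: Blocks, root: str, segments: List[str]) -> int:
    """Translate the traversal outcome into an HTTP status code."""
    try:
        traverse(blocks, root, segments)
    except PathNotFound:
        return 404
    except MissingDecoder:
        return 500
    return 200
```

Keeping the two errors separate also keeps the caching story clear, since a 404 states that the content does not exist while a 500 only says this implementation could not look.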

- TODO: MUST describe handling of non-ASCII characters
  - TODO: don't invent anything new, refer to URL percent-encoding, like we did in [IPIP-383](https://github.com/ipfs/specs/pull/383)
  - TODO: non-ASCII characters (percent-encoding of Unicode and arbitrary binary data)
  - TODO: MUST: explicitly cover Unicode and that UTF-8 is the implicit default
  - TODO: have an answer for non-UTF-8 (e.g. UTF-16) code points (a MAY and error if they are not supported? or error since this is web pathing, and web URL encoding uses UTF-8?)
  - TODO: edge case: handling filenames that already look percent-encoded https://github.com/ipfs/gateway-conformance/issues/115
  - TODO/TBD notes for implementers: mixing percent-encoded and raw paths is a very common case across the stack; writing down a sane MUST rule of thumb for implementers could improve resiliency across systems (e.g. if a path includes `%` and produced a 404, retry with the percent-decoded value?)
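
As a sketch of the percent-encoding points above, the example below encodes a single path segment as UTF-8 and shows one way the "retry with the percent-decoded value" idea could work. The `lookup_with_retry` helper and its set-of-names directory model are hypothetical; the draft has not settled on this rule.

```python
from typing import Optional, Set
from urllib.parse import quote, unquote


def encode_segment(name: str) -> str:
    """Percent-encode one path segment as UTF-8, including '/' and '%'."""
    return quote(name, safe="")


def lookup_with_retry(entries: Set[str], raw_segment: str) -> Optional[str]:
    """Try the segment verbatim first, then its percent-decoded form.

    Trying the verbatim form first keeps filenames that merely look
    percent-encoded (such as '%41.txt') reachable.
    """
    if raw_segment in entries:
        return raw_segment
    if "%" in raw_segment:
        try:
            decoded = unquote(raw_segment, errors="strict")
        except UnicodeDecodeError:
            return None  # the escapes do not form valid UTF-8
        if decoded in entries:
            return decoded
    return None
```

One open question this leaves is what to do when both the verbatim and the decoded names exist in the same directory; here the verbatim name always wins.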

- TODO: path normalization
  - TODO: note that paths are equivalent, but HTTP 301 SHOULD be used in HTTP context to ensure clients always end up on normalized paths
  - TODO: handling redundant slashes `///` (301 to resolved URL? `path.Clean`?)
  - TODO: handling `.` and `..` (301 to resolved URL? `path.Clean`?)
  - TODO: trailing slash `/` required for enumerable map-like entities (UnixFS dir, DAG-CBOR document?)
  - TODO: CID normalization (to canonical text representation version and multibase)
    - /ipfs to CIDv1 in base32
    - /ipns to CIDv1 with libp2p-key in base36
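
One possible shape for the path normalization items above is sketched below. It is not Go's `path.Clean`: as assumptions for illustration, `..` is clamped at the content root (similar to how URL parsers clamp at `/`), a trailing slash is preserved, and CID normalization is left out. The function names are hypothetical.

```python
def normalize_content_path(path: str) -> str:
    """Collapse redundant slashes and resolve '.' and '..' below the content root."""
    parts = path.split("/")
    if (len(parts) < 3 or parts[0] != ""
            or parts[1] not in ("ipfs", "ipns") or not parts[2]):
        raise ValueError(f"not a content path: {path!r}")
    namespace, root, rest = parts[1], parts[2], parts[3:]
    stack = []
    for segment in rest:
        if segment in ("", "."):
            continue  # redundant slash or current-directory marker
        if segment == "..":
            if stack:
                stack.pop()  # never climbs above the content root
        else:
            stack.append(segment)
    normalized = "/".join(["", namespace, root] + stack)
    # Keep a trailing slash when the input ended with '/', '/.' or '/..'.
    if rest and rest[-1] in ("", ".", ".."):
        normalized += "/"
    return normalized


def needs_redirect(path: str) -> bool:
    """True when an HTTP context would answer with a 301 to the normalized path."""
    return normalize_content_path(path) != path
```

Normalizing before any comparison, for example before denylist matching, means equivalent spellings of a path are treated as the same path.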

### Test fixtures

TODO: List relevant CIDs. Describe how implementations can use them to determine
specification compliance.

TODO: [gateway-conformance](https://github.com/ipfs/gateway-conformance) tests for all MUSTs in this spec
Contributor

👍 👍

This ensures uniform behavior across implementations and contexts such as gateways vs `ipfs://` in browsers.

### Security

TODO: Explain the security implications/considerations relevant to the spec.

TODO: length limit for entire path
Member

Are we limiting to browser URLs, or do we want to support longer lengths? https://stackoverflow.com/a/417184/592760 is a really thorough answer talking about variants.

TODO: length limit for a path segment
Member

I don't think we should limit path segment lengths, but we should prevent / in path segments opposite of IPLD pathing

TODO: content path normalization should be performed before comparing paths
TODO: mention how arbitrary content paths can be blocked via denylists defined in [IPIP-383](https://github.com/ipfs/specs/pull/383)

### Privacy and User Control

TODO: Note if there are any privacy or user control considerations that should be
taken into account by the implementers.

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).