Gateway: WARC and WACZ archive replay #524

Closed
hacdias opened this issue Dec 14, 2023 · 3 comments
Labels
need/triage Needs initial labeling and prioritization topic/gateway Issues related to HTTP Gateway

Comments

@hacdias
Member

hacdias commented Dec 14, 2023

This is mostly food for thought, and perhaps Boxo is not the right place for this.

What

When a WARC or WACZ archive is opened through the gateway, we should be able to replay its contents directly instead of just downloading the file. Replay would therefore be part of the "trusted" gateway. Some interesting links:

How

I see two main ways:

  1. Implement our own custom WARC/WACZ replay website
  2. Somehow integrate ReplayWeb.page into our gateway for these kinds of files: https://replayweb.page/docs/embedding
@hacdias hacdias added the need/triage and topic/gateway labels Dec 14, 2023
@Jorropo
Contributor

Jorropo commented Dec 14, 2023

Is this similar to MP4 or WebM, where we need to update MIME type detection and send the right header, or does this need other server-side logic?

@lidel
Member

lidel commented Dec 14, 2023

My understanding is that this is about an HTML+JS reader being returned for a specific content type.

The way I see it, this is already possible if you publish the HTML replayer along with the WACZ.

WebRecorder calls this "self-hosting" in the docs you linked.
This is what https://webrecorder.github.io/save-tweet-now/ does :-)

Demo: https://bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y.ipfs.w3s.link/

(It takes a while to announce a new CID, but you can get the data out of web3.storage instantly via https://w3s.link/ipfs/bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y?format=car, import it into a local node, and then open it via the localhost subdomain gateway, where it will work fine.)
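A minimal sketch of the two URLs involved in that workflow, assuming the CID from the demo above. The helper names are hypothetical, and port 8080 is assumed to be the local node's gateway port (the Kubo default); adjust it if your node is configured differently. The downloaded CAR can then be loaded into the local node with `ipfs dag import`.

```python
def car_download_url(cid: str, gateway: str = "https://w3s.link") -> str:
    """URL that asks a gateway for the DAG under `cid` as a CAR stream."""
    return f"{gateway.rstrip('/')}/ipfs/{cid}?format=car"


def localhost_subdomain_url(cid: str, port: int = 8080) -> str:
    """Subdomain-gateway URL on a local node; each root CID gets its own origin.

    The port is an assumption (Kubo's default gateway port).
    """
    return f"http://{cid}.ipfs.localhost:{port}/"


CID = "bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y"
print(car_download_url(CID))         # fetch this, then `ipfs dag import` the file
print(localhost_subdomain_url(CID))  # open this in a browser afterwards
```

The subdomain form matters here because the replayer registers a service worker, which needs an origin of its own rather than a shared path gateway origin.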

$ ipfs ls bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y
bafybeihg4khv4s4hjt6kko2c4gxi5n3mjqexj3jblllo2fsf4kvv366gmq 31076   favicon.ico
bafybeiekl2fjylgfa5zaadjl6gev63kybtv6eei22te6fecp3nitx6wbtq 793     index.html
bafybeicyiq4443ymeqhoqinwsjg6r4crhemhvflpyozyyo5nnvmfr6uxoe -       replay/
bafybeiepzlhosc52ehpubmmmcsg6edsypvmozvfegjcni5ivceljmmzdpa 474493  ui.js
bafybeifgvpnjrr5zpeyx46qhqn7u35fhs4srto46xvn3py35iw2jui3nay -       webarchive/
bafybeiho5npz7tetl24pefenmuviggyucrj6qerry3whmqu2w6zruemas4 1540637 webarchive.wacz
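To check a listing like the one above programmatically, here is a rough sketch that parses it into entries. It assumes the three-column "CID, size, name" layout shown here, with "-" as the size of a directory; other Kubo versions or flags may print a different layout, and `parse_ipfs_ls` is a hypothetical helper, not part of any IPFS tooling.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    cid: str
    size: Optional[int]  # None for directories (printed as "-")
    name: str


def parse_ipfs_ls(output: str) -> List[Entry]:
    """Parse `ipfs ls` text output, assuming 'CID size name' columns."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first two runs of whitespace so names may contain spaces.
        cid, size, name = line.split(None, 2)
        entries.append(Entry(cid, None if size == "-" else int(size), name.strip()))
    return entries


LISTING = """\
bafybeihg4khv4s4hjt6kko2c4gxi5n3mjqexj3jblllo2fsf4kvv366gmq 31076   favicon.ico
bafybeiekl2fjylgfa5zaadjl6gev63kybtv6eei22te6fecp3nitx6wbtq 793     index.html
bafybeicyiq4443ymeqhoqinwsjg6r4crhemhvflpyozyyo5nnvmfr6uxoe -       replay/
bafybeiepzlhosc52ehpubmmmcsg6edsypvmozvfegjcni5ivceljmmzdpa 474493  ui.js
bafybeifgvpnjrr5zpeyx46qhqn7u35fhs4srto46xvn3py35iw2jui3nay -       webarchive/
bafybeiho5npz7tetl24pefenmuviggyucrj6qerry3whmqu2w6zruemas4 1540637 webarchive.wacz
"""

for entry in parse_ipfs_ls(LISTING):
    print(entry.name, entry.size)
```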

So it already works; it is just a matter of putting the .wacz in a directory with an index.html replayer.
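As an illustration of what such an index.html could contain, here is a sketch that generates one. This is not the actual 793-byte file from the demo. The `<replay-web-page>` element and its `source`, `url`, and `replayBase` attributes are taken from my reading of the ReplayWeb.page embedding docs and should be verified there; the start URL is a placeholder; and the sketch assumes `ui.js` and a `replay/` directory holding the service worker are published in the same directory, as in the listing above.

```python
from html import escape


def replayer_index_html(wacz_name: str, start_url: str) -> str:
    """Render a minimal replayer page that sits next to the .wacz file.

    Element and attribute names are assumptions based on the ReplayWeb.page
    embedding docs; `ui.js` and `replay/` are expected to be siblings.
    """
    source = escape(wacz_name, quote=True)
    url = escape(start_url, quote=True)
    return (
        "<!doctype html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        '<script src="ui.js"></script>\n'
        "</head>\n"
        "<body>\n"
        f'<replay-web-page source="{source}" url="{url}" '
        'replayBase="./replay/"></replay-web-page>\n'
        "</body>\n"
        "</html>\n"
    )


# Placeholder start URL; use the page that was actually archived.
print(replayer_index_html("webarchive.wacz", "https://example.com/"))
```

Using only relative paths keeps the whole directory self-contained under one root CID, so the replayer version is pinned together with the archive.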

And I think that is enough.

We don't want to be responsible for deciding which replayer version should run when the WACZ publisher did not specify one.

Self-hosting (shipping an index.html replayer along with the .wacz archive) makes more sense:

  • it ensures the replayer works with the specific archive, and both are trustless when loaded from a local gateway
    • the publisher of the root CID controls the end-user experience, not the boxo maintainers.
    • it solves the problem where some archives work with the latest version of https://cdn.jsdelivr.net/npm/[email protected]/ui.js, while others only work with an older one. Nothing is perfect; bugs happen. By shipping the replayer with the data, different working combinations can exist at the same time on the same gateway.
  • we avoid feature creep: the boxo/gateway library does not have to be in the business of maintaining readers for various content types (someone could want the same for .zip, .docx, .mov, etc.)
    • we don't want to maintain HTML+JS readers for any content type; that is the job of the publisher and/or the user agent (web browser).

@hacdias I think this means we can close this (bundling the replayer with the WACZ already works, and bundling it with boxo is out of scope)?

The only potential UX improvement that comes to mind is to include some help text next to .wacz files in generated HTML directory listings when there is no index.html, hinting at embedding#self-hosting.
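The condition for showing that hint could look roughly like this. It is a language-neutral sketch written in Python, not boxo code (boxo is written in Go), and `wacz_hint_targets` is a hypothetical name.

```python
from typing import List


def wacz_hint_targets(entry_names: List[str]) -> List[str]:
    """Return the .wacz entries that should get a self-hosting hint.

    A hint is only useful when the directory has no index.html, because
    otherwise the gateway serves the index page instead of a listing.
    """
    names = [name.rstrip("/") for name in entry_names]
    if any(name.lower() == "index.html" for name in names):
        return []
    return [name for name in names if name.lower().endswith(".wacz")]


# Directory from the demo: it has index.html, so no hint is needed.
print(wacz_hint_targets(["favicon.ico", "index.html", "replay/", "ui.js", "webarchive.wacz"]))
# A bare archive without a replayer: the .wacz entry gets the hint.
print(wacz_hint_targets(["notes.txt", "webarchive.wacz"]))
```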

@hacdias
Member Author

hacdias commented Dec 15, 2023

@lidel you're right, I got carried away on this one 😆 and immediately opened an issue. If you use their browser extension, you can configure it to connect to your IPFS node. From there, it can "share" to IPFS and get a CID back. That CID will already include the reader itself, as well as the WARC file.

Therefore, I think we can close this.

@hacdias hacdias closed this as completed Dec 15, 2023
@hacdias hacdias closed this as not planned Dec 15, 2023