Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New spec: Request body canonicalization #141

Open
tw4l opened this issue Apr 20, 2023 · 10 comments · May be fixed by #149
Open

New spec: Request body canonicalization #141

tw4l opened this issue Apr 20, 2023 · 10 comments · May be fixed by #149
Assignees

Comments

@tw4l
Copy link
Member

tw4l commented Apr 20, 2023

POST canonicalization (or POST append) is implemented in pywb, warcio.js, and cdxj-indexer, but doesn't have a written specification.

This was first implemented in pywb in webrecorder/pywb#636. Various implementations are described here: webrecorder/replayweb.page#69 (comment)

A proper specification would likely help in resolving issues such as webrecorder/pywb#768 as well as generally making it clear to users and developers the rationale and expected behavior for POST canonicalization.

@tw4l tw4l self-assigned this Apr 20, 2023
@tw4l tw4l moved this from Triage to Design In Progress in Webrecorder Projects Apr 20, 2023
@tw4l tw4l moved this from Design In Progress to Ready for Dev in Webrecorder Projects Apr 25, 2023
@tw4l tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects May 2, 2023
@ato
Copy link

ato commented Jun 8, 2023

I had an attempt at documenting it while writing a Java implementation.

http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/

There's some interesting quirks to it. JSON null is encoded as the string "None". It can also produce slightly different output when run on old versions of Python (< 3.7).

@tw4l
Copy link
Member Author

tw4l commented Jun 12, 2023

Thank you @ato , this is very helpful! Documenting the quirks especially - I'll have to look into the Python < 3.7 issues!

@tw4l
Copy link
Member Author

tw4l commented Jul 26, 2023

We will be modifying pywb and warcio.js to be consistent - necessary issues have been opened. Notably, we'll be making pywb use native JSON values rather than Pythonic values (e.g. True, None) in the canonicalized URL.

We'll need to make sure that this isn't a breaking change and that already-canonicalized URLs created by pywb will either continue to work with pywb and wabac.js's fuzzy matching (preferred) or that we have a conversion process available.

@kaij
Copy link

kaij commented Aug 17, 2023

@tw4l I have the modified outbackcdx from @ato using index version 5 running and indexed a warc using cdx-indexer -j -p. The urlkeys in outbackcdx look good for POST request (they include __wb_method as well as the parameters as described in atos draft at http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/).

However, I can't get the replay of POST requests working. It seems the __wb_method and value parameters are not getting through to outbackcdx (latest pywb 2.7.4) in the request from pywb. Before looking too far - did I miss something in the config? (I'm using index_paths: cdx+http://outbackcdx:8080/nb-webarchive)

@ato
Copy link

ato commented Aug 17, 2023

Pywb needs this patch to pass them through:
nla/pywb@2bb97fc

@kaij
Copy link

kaij commented Aug 17, 2023

It works very nicely now, thanks @ato!

Do you see any problems for using index version 5 in production? (besides having to recreate the index if the spec changes at a later time)

@ato
Copy link

ato commented Aug 17, 2023

We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.

@kaij
Copy link

kaij commented Aug 17, 2023

We run index version 5 it in production and haven't had any issues so far. The only reason it's not the default yet is because the upgrade process needs a bit more polishing.

Sounds good! Mentioning upgrade... so there is a way to upgrade the index? (so far I only used a newly created index) It would of course simplify things a lot.

@ato
Copy link

ato commented Aug 17, 2023

I've written up some notes about upgrading here: nla/outbackcdx#117
If you have any further OutbackCDX questions please post them there as we've drifted off the topic of the POST canonicalization spec. :-)

@tw4l tw4l moved this from Implementing to Ready in Webrecorder Projects Apr 1, 2024
@tw4l tw4l moved this from Ready to Implementing in Webrecorder Projects Apr 2, 2024
@tw4l tw4l linked a pull request Apr 4, 2024 that will close this issue
@tw4l tw4l moved this from Implementing to In Review in Webrecorder Projects Apr 4, 2024
@tw4l tw4l changed the title New spec: POST canonicalization New spec: Request body canonicalization Apr 4, 2024
@tw4l
Copy link
Member Author

tw4l commented Apr 4, 2024

@ato PR is in and would love your eyes on it if you have the bandwith: #149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging a pull request may close this issue.

4 participants