Decode overencoded external Zimit2 URLs #1267

Jaifroid · 2024-07-10T10:53:26Z

As can be seen, this is a minor fix. Although we decode the entire URL, this should be an improvement in 95% of cases where the main URL does not contain encoded characters.

The main area of worry is PDFs, which is a common use case, so that functionality needs thorough testing with this workaround.

audiodude · 2024-07-10T17:09:09Z

Are the URLs from zimit2 "overencoded" or are they just simply "encoded"? That is, we know for sure they have been URL encoded once from the form of a working, clickable URL, right? So this process should be relatively safe.

I assume for the URL https://google.com/?q=foo%2fbar, zimit2 is providing https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar, so a single decoding pass is completely appropriate.

The problem is coordination. If the upstream bug is fixed and zimit2 starts providing https://google.com/?q=foo%2fbar, then this code will return https://google.com/?q=foo/bar which may be a problem.
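
To make the coordination risk concrete, here is a minimal sketch using only the URLs quoted in this comment. It assumes, as this comment does, that the whole URL has been encoded exactly once; the variable names are illustrative and are not taken from the PR.

```javascript
// Illustrative only: the URL is the one quoted in this comment.
const original = 'https://google.com/?q=foo%2fbar';

// Case 1: the whole URL has been encoded once (the assumption made here).
const encodedOnce = encodeURIComponent(original);
console.log(encodedOnce);
// https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar

// A single decoding pass restores the original exactly.
const restored = decodeURIComponent(encodedOnce);
console.log(restored === original); // true

// Case 2: upstream is fixed and already supplies the original URL,
// but the reader still applies its decoding pass.
const decodedTooFar = decodeURIComponent(original);
console.log(decodedTooFar); // https://google.com/?q=foo/bar
```

In the second case the escaped slash inside the `q` parameter is lost, which is the failure mode described above.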

Jaifroid · 2024-07-10T17:20:40Z

The problem is that they are irregularly encoded. The querystring part, including the question mark and separators, is encoded to get past the stripping of querystrings in libzim, but the rest of the URL is not. So we have difficulty in deciding whether a %3F in a URL marks the start of an overencoded querystring, or is a legitimately encoded question mark within the URL itself.

When it's a ZIM link (a link to an article in the ZIM), this isn't a problem, because it's now stored that way (for Zimit2 ZIMs) in the ZIM. For ZIM links, we just need a string, which is a key rather than an actual URL, to get it from the ZIM, so it really doesn't matter to us how it is encoded. It's just a string.

But the problem arises because, at the time that warc2zim preprocesses URLs, it doesn't know whether an asset is in the ZIM or not, so it goes ahead and "overencodes" querystrings even in external links. To be clear, this decoding step only affects external links. Yes, it's crude: it doesn't attempt to check whether a %3F looks like a legitimate querystring or not, it just decodes the whole lot. Usually this will be fine, probably in 95% of cases. Sometimes it won't, but I don't think there's a solution for that except fixing it upstream. The only inconvenience to a user clicking on such a link is the possibility that the external link they're navigating to won't work. It doesn't affect internal ZIM links.
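
Roughly, the decoding step described here could be sketched as below. This is not the code merged in this PR: the function name `decodeExternalHref` and the scheme-based test for "external" are assumptions made for illustration only.

```javascript
// Hypothetical helper, NOT the code from this PR.
function decodeExternalHref(href) {
    // Assumption: a link counts as "external" if it starts with http://, https:// or //.
    const isExternal = /^(?:https?:)?\/\//i.test(href);
    // ZIM links are only lookup keys, so they are left exactly as stored.
    if (!isExternal) return href;
    try {
        // Crude by design: every escape in the URL is decoded, not just the querystring.
        return decodeURIComponent(href);
    } catch (e) {
        // decodeURIComponent throws a URIError on malformed sequences (e.g. a lone '%'),
        // in which case the href is returned unchanged.
        return href;
    }
}

console.log(decodeExternalHref('https://example.com/page%3Fid%3D1')); // https://example.com/page?id=1
console.log(decodeExternalHref('example.com/page%3Fid%3D1'));         // example.com/page%3Fid%3D1
```

The `try`/`catch` is a defensive choice in this sketch, so that a malformed external link degrades to the current behaviour instead of throwing.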

It's definitely temporary, and should be removed once the upstream issue is fixed (but that could take a long time).

Not great, I know, but arguably a significant improvement over the current situation...

Jaifroid · 2024-07-10T17:38:16Z

So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:

href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'

For this to work, we do decodeURIComponent on it, and the resulting href is now:

https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/?share=facebook&nb=1

This works. But of course we can imagine cases where the part preceding the querystring had a legitimately encoded ? or =, which should not be decoded. It's unlikely, but possible...
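
For anyone who wants to check this in a console, the snippet below runs the example href through the standard `URL` parser before and after decoding. The variable names are illustrative only.

```javascript
// href as found in the mesquartierschinois ZIM (quoted above)
const shareHref = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1';

// Before decoding, the parser sees no querystring at all:
// the escaped "?share=..." is treated as part of the path.
const shareBefore = new URL(shareHref);
console.log(shareBefore.search === ''); // true

// After one decodeURIComponent pass, the querystring is recognised again.
const shareAfter = new URL(decodeURIComponent(shareHref));
console.log(shareAfter.pathname);                  // /2014/07/18/que-faire/
console.log(shareAfter.search);                    // ?share=facebook&nb=1
console.log(shareAfter.searchParams.get('share')); // facebook
```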

Jaifroid · 2024-07-10T17:40:30Z

I won't merge this if you think it's a bad idea. I'm a bit undecided, but generally I prefer code that actually works "most of the time" 🤣.

Jaifroid · 2024-07-10T17:46:01Z

> I assume for the URL https://google.com/?q=foo%2fbar, zimit2 is providing https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar, so a single decoding pass is completely appropriate.

Sorry, should have answered this directly. Unfortunately, zimit2 is providing `https://google.com/%3Fq%3Dfoo%252fbar`, i.e. a mix of unencoded and encoded...

audiodude · 2024-07-10T17:52:40Z

> So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:
>
> href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'

This looks like a "once" encoded URL to me. I'm still confused.

What if the original URL on the page was:

https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome ?

warc2zim would encode only the querystring and produce:

https://example.com/login%3Fmode%3Duser%26next%3Dhttps%253A%252F%252Fexample.com%252Fwelcome

And a single decoding pass would produce the original string, with the "next" URL still properly encoded, right? So no problem.

Are you anticipating equal signs and such encoded directly into url paths, a la https://example.com/site/equals%3Darecool/welcome%3Ffoo%3Dbar ?
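
The two cases in this comment can be tried out with the sketch below. `overencodeQuery` is a hypothetical stand-in for the upstream behaviour as it is described in this thread (everything from the first `?` onwards is percent-encoded, the rest is left alone); it is not warc2zim's actual code, and it ignores fragments.

```javascript
// Hypothetical stand-in for the upstream encoding described in this thread.
function overencodeQuery(url) {
    const queryStart = url.indexOf('?');
    if (queryStart === -1) return url; // no querystring, nothing to encode
    return url.slice(0, queryStart) + encodeURIComponent(url.slice(queryStart));
}

// Case 1: escapes only inside the querystring. One decoding pass round-trips.
const loginUrl = 'https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome';
const loginStored = overencodeQuery(loginUrl);
console.log(loginStored);
// https://example.com/login%3Fmode%3Duser%26next%3Dhttps%253A%252F%252Fexample.com%252Fwelcome
console.log(decodeURIComponent(loginStored) === loginUrl); // true

// Case 2: an escape that was already in the path. It is decoded as well,
// so the round trip no longer gives back the original URL.
const pathUrl = 'https://example.com/site/equals%3Darecool/welcome?foo=bar';
const pathStored = overencodeQuery(pathUrl);
console.log(pathStored); // https://example.com/site/equals%3Darecool/welcome%3Ffoo%3Dbar
console.log(decodeURIComponent(pathStored) === pathUrl); // false
```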

audiodude

Okay I think I still don't fully understand the upstream issue more than just "URL encoding is pretty broken".

But if you're confident that this fixes it ~95% of the time, and the result of it being broken is just that an external link doesn't work, then LGTM.

Jaifroid · 2024-07-11T08:35:49Z

> Okay I think I still don't fully understand the upstream issue more than just "URL encoding is pretty broken".
>
> But if you're confident that this fixes it ~95% of the time, and the result of it being broken is just that an external link doesn't work, then LGTM.

Yeah, encoding querystrings was a pragmatic decision taken so as not to break existing readers, but one with wider consequences than originally envisioned. We're kind of stuck with it...

I'll do some careful testing on this before merge, to be sure it doesn't break more external links than I am guessing it will. Need to test some multilingual cases and especially ~~Chinese Wikipedia~~ (EDIT: no, that won't do, as this fix applies to Zimit2 only... need a multilingual Zimit2 resource).

Jaifroid · 2024-07-11T08:46:27Z

> So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:
> href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'
>
> This looks like a "once" encoded URL to me. I'm still confused.
>
> What if the original URL on the page was:
>
> https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome ?
>
> warc2zim would encode only the querystring and produce:
>
> https://example.com/login%3Fmode%3Duser%26next%3Dhttps%253A%252F%252Fexample.com%252Fwelcome
>
> And a single decoding pass would produce the original string, with the "next" URL still properly encoded, right? So no problem.
>
> Are you anticipating equal signs and such encoded directly into url paths, a la https://example.com/site/equals%3Darecool/welcome%3Ffoo%3Dbar ?

Yes, your example (https%253A%252F%252Fexample.com%252Fwelcome) is what I mean by "overencoded". And yes, we could have encoded strings in the path component of the URL (as opposed to the querystring component) that would get wrongly decoded by this decoding step. For example, https://gutenberg.org/shakespeare/what's-in-a-name%3F.html would be wrongly decoded as https://gutenberg.org/shakespeare/what's-in-a-name?.html.
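
To show why that matters, the sketch below parses the illustrative Gutenberg URL from this comment with the standard `URL` API before and after the decoding pass. The variable names are made up for the example.

```javascript
// Illustrative URL from this comment; the %3F is a legitimate escaped '?' in the path.
const gutenbergUrl = "https://gutenberg.org/shakespeare/what's-in-a-name%3F.html";

const gutenbergIntended = new URL(gutenbergUrl);
const gutenbergDecoded = new URL(decodeURIComponent(gutenbergUrl));

// As intended, the whole filename is part of the path.
console.log(gutenbergIntended.pathname); // /shakespeare/what's-in-a-name%3F.html

// After decoding, the path is cut short and ".html" becomes a querystring.
console.log(gutenbergDecoded.pathname);  // /shakespeare/what's-in-a-name
console.log(gutenbergDecoded.search);    // ?.html
```

So a link like this would end up pointing at a different path on the external site.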

Jaifroid · 2024-07-11T13:30:21Z

OK, so I've tested extensively on lowtechmagazine, including its Arabic and Vietnamese pages, and I haven't encountered anything that is not working due to this code. So will commit now.

Decode overencoded external Zimit2 URLs

2b62e55

Jaifroid added the upstream, bug-non-critical, and zimit labels Jul 10, 2024

Jaifroid added this to the v4.1 milestone Jul 10, 2024

Jaifroid self-assigned this Jul 10, 2024

Jaifroid marked this pull request as draft July 10, 2024 11:01

Jaifroid marked this pull request as ready for review July 10, 2024 17:34

Jaifroid requested a review from audiodude July 10, 2024 17:39

audiodude approved these changes Jul 10, 2024


Jaifroid merged commit 3e7f5db into main Jul 11, 2024
9 checks passed

Jaifroid deleted the Decode-overencoded-external-zimit2-urls branch July 11, 2024 13:31
