Decode overencoded external Zimit2 URLs #1267

Merged
Jaifroid merged 1 commit into main from Decode-overencoded-external-zimit2-urls
Jul 11, 2024

Jaifroid
Member

@Jaifroid Jaifroid commented Jul 10, 2024

Fixes #1258.

As can be seen, this is a minor fix. Although we decode the entire URL, this should be an improvement in the roughly 95% of cases where the main URL does not contain encoded characters.

The main area of worry is PDFs, which is a common use case, so that functionality needs thorough testing with this workaround.

@Jaifroid Jaifroid added the labels upstream (issues that need to be dealt with at scraper level or some other repo), bug-non-critical (for bugs that it would be nice to fix rather than critical to fix) and zimit (code relating to the support of Zimit-style archives) Jul 10, 2024
@Jaifroid Jaifroid added this to the v4.1 milestone Jul 10, 2024
@Jaifroid Jaifroid self-assigned this Jul 10, 2024
@Jaifroid Jaifroid marked this pull request as draft July 10, 2024 11:01
@audiodude
Collaborator

Are the URLs from zimit2 "overencoded" or are they just simply "encoded"? That is, we know for sure they have been URL encoded once from the form of a working, clickable URL, right? So this process should be relatively safe.

I assume for the URL https://google.com/?q=foo%2fbar, zimit2 is providing https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar, so a single decoding pass is completely appropriate.

The problem is coordination. If the upstream bug is fixed and zimit2 starts providing https://google.com/?q=foo%2fbar, then this code will return https://google.com/?q=foo/bar which may be a problem.
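
To make the coordination concern concrete, here is a minimal sketch using the standard `decodeURIComponent`. The two input strings are the ones from this comment; the variable names are purely illustrative.

```javascript
// Fully encoded form, as assumed in the comment above.
const fullyEncoded = 'https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar';
// Correct form that a fixed upstream might start providing.
const alreadyCorrect = 'https://google.com/?q=foo%2fbar';

// One pass restores the original URL and leaves the inner %2f intact.
const decodedOnce = decodeURIComponent(fullyEncoded);

// The same pass applied to an already-correct URL also decodes the inner %2f.
const decodedTooFar = decodeURIComponent(alreadyCorrect);

console.log(decodedOnce);   // https://google.com/?q=foo%2fbar
console.log(decodedTooFar); // https://google.com/?q=foo/bar
```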

@Jaifroid
Member Author

The problem is that they are irregularly encoded. The querystring part, including the question mark and separators, is encoded to get past the stripping of querystrings in libzim, but the rest of the URL is not. So we have difficulty deciding whether a %3F in a URL marks an overencoded querystring or is a legitimately encoded question mark within the URL itself.

When it's a ZIM link (a link to an article in the ZIM), this isn't a problem, because it's now stored that way (for Zimit2 ZIMs) in the ZIM. For ZIM links, we just need a string, which is a key rather than an actual URL, to get it from the ZIM, so it really doesn't matter to us how it is encoded. It's just a string.

But the problem arises because, at the time that warc2zim preprocesses URLs, it doesn't know whether an asset is in the ZIM or not. So it will go ahead and "overencode" querystrings even in external links. To be clear, this decoding step only affects external links. Yes, it's crude: it doesn't attempt to check whether a %3F looks like a legitimate querystring or not, it just decodes the whole lot. Usually this will be fine, probably in 95% of cases. Sometimes it won't, but I don't think there's a solution for that except fixing it upstream. The only inconvenience to a user clicking on such a link is the possibility that the external link they're navigating to won't work. It doesn't affect internal ZIM links.
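
The decoding step described above could be sketched roughly as follows. This is not the code from this PR: `decodeExternalHref` is a hypothetical helper, and the test for "external" (an href starting with `http://` or `https://`) is an assumption made only for illustration.

```javascript
// Hypothetical helper, not the code merged in this PR.
// Assumption: a link counts as "external" if it starts with http:// or https://.
function decodeExternalHref(href) {
    const isExternal = /^https?:\/\//i.test(href);
    // ZIM links are only used as lookup keys, so they are left untouched.
    if (!isExternal) return href;
    try {
        // Crude by design: decodes the whole URL, not just the querystring.
        return decodeURIComponent(href);
    } catch (err) {
        // decodeURIComponent throws a URIError on malformed escapes (e.g. a lone "%").
        return href;
    }
}

console.log(decodeExternalHref('https://example.com/page%3Fa%3D1%26b%3D2'));
console.log(decodeExternalHref('A/page%3Fa%3D1'));
```

Falling back to the untouched href on a `URIError` means a malformed link is no worse off than it was before decoding.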

It's definitely temporary, and should be removed once the upstream issue is fixed (but that could take a long time).

Not great, I know, but arguably a significant improvement over the current situation...

@Jaifroid Jaifroid marked this pull request as ready for review July 10, 2024 17:34
@Jaifroid
Member Author

So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:

href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'

For this to work, we do decodeURIComponent on it, and the resulting href is now:

https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/?share=facebook&nb=1

This works. But of course we can imagine cases where the part preceding the querystring had a legitimately encoded ? or =, which should not be decoded. It's unlikely, but possible...

@Jaifroid Jaifroid requested a review from audiodude July 10, 2024 17:39
@Jaifroid
Member Author

I won't merge this if you think it's a bad idea. I'm a bit undecided, but generally I prefer code that actually works "most of the time" 🤣.

@Jaifroid
Member Author

> I assume for the URL https://google.com/?q=foo%2fbar, zimit2 is providing https%3A%2F%2Fgoogle.com%2F%3Fq%3Dfoo%252fbar, so a single decoding pass is completely appropriate.

Sorry, should have answered this directly. Unfortunately, zimit2 is providing `https://google.com/%3Fq%3Dfoo%252fbar`, i.e. a mix of unencoded and encoded...

@audiodude
Collaborator

> So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:
>
> href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'

This looks like a "once" encoded URL to me. I'm still confused.

What if the original URL on the page was:

https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome ?

warc2zim would encode only the querystring and produce:

https://example.com/login%3Fmode%3Duser%26next%3Dhttps%253A%252F%252Fexample.com%252Fwelcome

And a single decoding pass would produce the original string, with the "next" URL still properly encoded, right? So no problem.

Are you anticipating equal signs and such encoded directly into url paths, a la https://example.com/site/equals%3Darecool/welcome%3Ffoo%3Dbar ?
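
The round trip in the `login` example can be checked with a short sketch. `encodeQuerystringOnly` is a hypothetical emulation of the behaviour described in this thread, not warc2zim's actual code; it assumes everything from the first `?` onwards is percent-encoded as a unit.

```javascript
// Hypothetical emulation of the behaviour described above, not warc2zim's code.
// Assumption: everything from the first "?" onwards is percent-encoded as a unit.
function encodeQuerystringOnly(url) {
    const q = url.indexOf('?');
    if (q === -1) return url; // no querystring, nothing to encode
    return url.slice(0, q) + encodeURIComponent(url.slice(q));
}

const original = 'https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome';
const stored = encodeQuerystringOnly(original);

// A single decoding pass undoes the extra layer and keeps "next" encoded.
const roundTrip = decodeURIComponent(stored);

console.log(stored);
console.log(roundTrip === original); // true
```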

Collaborator

@audiodude audiodude left a comment

Okay I think I still don't fully understand the upstream issue more than just "URL encoding is pretty broken".

But if you're confident that this fixes it ~95% of the time, and the result of it being broken is just that an external link doesn't work, then LGTM.

@Jaifroid
Member Author

Jaifroid commented Jul 11, 2024

> Okay I think I still don't fully understand the upstream issue more than just "URL encoding is pretty broken".
>
> But if you're confident that this fixes it ~95% of the time, and the result of it being broken is just that an external link doesn't work, then LGTM.

Yeah, encoding querystrings was a pragmatic decision taken so as not to break existing readers, but one with wider consequences than originally envisioned. We're kind of stuck with it...

I'll do some careful testing on this before merge, to be sure it doesn't break more external links than I am guessing it will. Need to test some multilingual cases and especially Chinese Wikipedia (EDIT: no, that won't do, as this fix applies to Zimit2 only... need a multilingual Zimit2 resource).

@Jaifroid
Member Author

Jaifroid commented Jul 11, 2024

> > So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:
> > href = 'https://mesquartierschinois.wordpress.com/2014/07/18/que-faire/%3Fshare%3Dfacebook%26nb%3D1'
>
> This looks like a "once" encoded URL to me. I'm still confused.
>
> What if the original URL on the page was:
>
> https://example.com/login?mode=user&next=https%3A%2F%2Fexample.com%2Fwelcome ?
>
> warc2zim would encode only the querystring and produce:
>
> https://example.com/login%3Fmode%3Duser%26next%3Dhttps%253A%252F%252Fexample.com%252Fwelcome
>
> And a single decoding pass would produce the original string, with the "next" URL still properly encoded, right? So no problem.
>
> Are you anticipating equal signs and such encoded directly into url paths, a la https://example.com/site/equals%3Darecool/welcome%3Ffoo%3Dbar ?

Yes, your example (https%253A%252F%252Fexample.com%252Fwelcome) is what I mean by "overencoded". And yes, we could have encoded strings in the URL component (as opposed to the querystring component) that would get wrongly decoded by this decoding step. For example, https://gutenberg.org/shakespeare/what's-in-a-name%3F.html would be wrongly decoded as https://gutenberg.org/shakespeare/what's-in-a-name?.html.
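
A minimal sketch of why that wrong decoding matters, using the standard `decodeURIComponent` and the WHATWG `URL` class (available in browsers and Node.js). The URL is the illustrative one from the comment above; the variable names are assumptions.

```javascript
// Illustrative URL with a legitimately encoded "?" in the path.
const href = "https://gutenberg.org/shakespeare/what's-in-a-name%3F.html";
const decoded = decodeURIComponent(href);

// Before decoding, the whole file name is part of the path.
const before = new URL(href);
// After decoding, the "?" starts a querystring and the path is cut short.
const after = new URL(decoded);

console.log(before.pathname); // /shakespeare/what's-in-a-name%3F.html
console.log(after.pathname);  // /shakespeare/what's-in-a-name
console.log(after.search);    // ?.html
```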

@Jaifroid
Member Author

OK, so I've tested extensively on lowtechmagazine, including its Arabic and Vietnamese pages, and I haven't encountered anything that is not working due to this code. So will commit now.

@Jaifroid Jaifroid merged commit 3e7f5db into main Jul 11, 2024
9 checks passed
@Jaifroid Jaifroid deleted the Decode-overencoded-external-zimit2-urls branch July 11, 2024 13:31