-
-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode overencoded external Zimit2 URLs #1267
Conversation
Are the URLs from zimit2 "overencoded" or are they just simply "encoded"? That is, we know for sure they have been URL encoded once from the form of a working, clickable URL, right? So this process should be relatively safe. I assume for the URL The problem is coordination. If the upstream bug is fixed and zimit2 starts providing |
The problem is that they are irregularly encoded. The querystring part, including the question mark and separators, is encoded to get past the stripping of querystrings in libzim, but the rest of the URL is not. So we have difficulty in deciding whether a URL with When it's a ZIM link (a link to an article in the ZIM), this isn't a problem, because it's now stored that way (for Zimit2 ZIMs) in the ZIM. For ZIM links, we just need a string, which is a key rather than an actual URL, to get it from the ZIM, so it really doesn't matter to us how it is encoded. It's just a string. But the problem arises because at the time that warc2zim preprocesses URLs, it doesn't know whether an asset is in the ZIM or not. So it will go ahead and "overencode" querystrings even in external links. To be clear this decoding step only affects external links. Yes, it's crude. It doesn't attempt to check whether a It's definitely temporary, and should be removed once the upstream issue is fixed (but that could take a long time). Not great, I know, but arguably a significant improvement over the current situation... |
So here's a concrete example from mesquartierschinois ZIM. Clicking on a "share with Facebook" icon, we get:
For this to work, we do
This works. But of course we can imagine cases where the part preceding the querystring had a legitimately encoded |
I won't merge this if you think it's a bad idea. I'm a bit undecided, but generally I prefer code that actually works "most of the time" 🤣. |
Sorry, should have answered this directly. Unfortunately, zimit2 is providing ' https://google.com/%3Fq%3Dfoo%252fbar`. I.e., a mix of unencoded and encoded... |
This looks like a "once" encoded URL to me. I'm still confused. What if the original URL on the page was:
warc2zim would encode only the querystring and produce:
And a single decoding pass would produce the original string, with the "next" URL still properly encoded, right? So no problem. Are you anticipating equal signs and such encoded directly into url paths, a la |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I think I still don't fully understand the upstream issue more than just "URL encoding is pretty broken".
But if you're confident that this fixes it ~95% of the time, and the result of it being broken is just that an external link doesn't work, then LGTM.
Yeah, encoding querystrings was a pragmatic decision taken so as not to break existing readers, but one with wider consequences than originally envisioned. We're kind of stuck with it... I'll do some careful testing on this before merge, to be sure it doesn't break more external links than I am guessing it will. Need to test some multilingual cases and especially |
Yes your example ( |
OK, so I've tested extensively on lowtechmagazine including its Arabic at Vietnamese pages, and I haven't encountered anything that is not working due to this code. So will commit now. |
Fixes #1258.
As can be seen, this is a minor fix. Although we decode the entire URL, this should be an improvement in 95% of cases where the main URL does not contain encoded characters.
The main area of worry is PDFs, which is a common use case, so that functionality needs thorough testing with this workaround.