Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGRESSION: Remove new inline JS from ZIMs produced by dev (at least 1.14.0) to comply with restrictive CSPs #2096

Open
Jaifroid opened this issue Oct 31, 2024 · 11 comments · May be fixed by #2111
Assignees
Labels
bug regression First as tragedy, then as farce ;-)
Milestone

Comments

@Jaifroid
Copy link
Collaborator

This is a regression of #1578 and its fix may depend on a fix to #2074. However, it is useful to revisit this issue, as a lot of work was done in the past to eradicate inline JS from mwOffliner ZIMs. The issue is that where a restrictive CSP is imposed that bans inline JS as opposed to attached JS (a real-life example of this is in Chrome extensions, so this negatively affects Kiwix JS Browser Extension), no inline JS can be run (as opposed to JS in attached scripts).

In the latest ZIMs scraped from the mobile-html endpoint using dev 1.14.0 (at least), e.g. in wikipedia_bm_all_nopic_2024-10.zim, I believe the functionality to unhide hidden sections depends on the following piece of inline JS:

Image

As this is blocked by the CSP directive enforced in Chromium extensions, it seriously affects the usability of such ZIMs in Chromium extensions running in ServiceWorkerLocal mode.

This is a regression because, as mentioned, we did a lot of work previously to remove inline JS from all Kiwix scrapers (see the list here kiwix/kiwix-js#865), so we should be very careful not to re-introduce them in development. If it is in fact needed, then the script containing the desired function to run should attach itself internally to an appropriate browser event, and not be run from the HTML.

The fix to this issue may simply be to fix #2074 to ensure all sections are open by default. Then I believe it won't be necessary to run the inline JS here, unless it is doing other things that are desirable (though they should not be essential to be able to read the ZIM in browsers that don't run any JS from the ZIM).

@audiodude
Copy link
Member

I'm familiar with CSP and I understand why inline scripts need to be removed. My feeling is that this entire area was overlooked in the transition from mobile-endpoint to mobile-html.

However, my reading of the code and the README for "pcs", which is the service provided by these scripts, is that they are mostly for theming and image queueing/lazy loading. That is, I don't think this is the script which unhides the sections (#2074).

Agree it should be fixed anyways, but I wonder if we could ask upstream to let us opt out of the inline scripts, rather than "forcibly" removing them at the time of scraping. If we don't do that, we will always be playing catch up if new instances are introduced.

@audiodude
Copy link
Member

My inclination is to just strip all <script> tags indiscriminately and see what breaks. @Jaifroid do you know if there are Kiwix clients where we would benefit from these scripts?

@Jaifroid
Copy link
Collaborator Author

Jaifroid commented Nov 4, 2024

@audiodude It's definitely pcs.js that is doing the unhiding. You can easily see the loaded script in the KJS Browser Extension console when running in ServiceWorker mode (not ServiceWorkerLocal or Restricted, where it is blocked). The code style.dsiplay = "" removes all properties from the style.display attribute of each Section. In the raw HTML, these sections have style="display: none;", and the inline script runs the onBodyStart() function that pcs.js exports into the document window's scope. The snippet shown below is inside the onBodyStart() function, and it explicitly iterates over each Section and removes its display: none; property (or any other property), therefore unhiding them:

Image

Regarding your plan of removing all inline scripts, unfortunately at least in this case, it would result in invisible Sections. You can tell what's broken by opening wikipedia_bm_all_nopic_2024-10.zim in KJS Browser Extension in Restricted mode. That mode does what you propose: strips all inline scripts and also can't run attached scripts. The result is we only see the lede of each article...

But that is something to deal with in #2074 (by stripping style="display: none;" ourselves, and not depending on packaged JS to do it). The point of this issue is more generic: if we have to run onBodyStart(), or any other code, it should not be executed by inline scripting. Instead, we should attach our own script that executes it on DOMContentLoaded or similar (or it could just be a deferred script that will simply execute the function).

Removing all scripts could also affect lazy loading of images, unless mwOffliner handles lazy loading itself. I guess we don't want two competing systems doing that! There may also be other dynamic functions on other pages you probably don't want to strip. NB I'm not against running any JS at all from the ZIM, it's just not feasible to have basic display of content on all pages depending on running an inline script!

@audiodude
Copy link
Member

audiodude commented Nov 4, 2024

Thanks for the detailed response, I missed the code in pcs.js and didn't see how it was called from onBodyStart().

I have #2097 out (and waiting for your review!) to fix #2074.

There is no problem with lazy loading of images, mwoffliner already resolves all images, downloads them ahead of time, and embeds them in the ZIM.

if we have to run onBodyStart(), or any other code, it should not be executed by inline scripting. Instead, we should attach our own script that executes it on DOMContentLoaded or similar (or it could just be a deferred script that will simply execute the function).

I'm not sure I understand this. Are you saying that <script> tags in the <body> are a no-no, but ones in <head> are fine? If that's the case, yes we could collect any <body> scripts and wrap them in a DOMContentLoaded event listener, and hope for the best. Does restricted mode run any JavaScript, I thought no? I guess this would be for other clients?

@Jaifroid
Copy link
Collaborator Author

Jaifroid commented Nov 4, 2024

It's for the Chrome extension, which can run JS but blocks inline scripts as per its CSP, when running locally in Service Worker mode... For all I know this is the only example of inline scripts with such a disastrous effect of they cannot be run, so maybe the point will be moot...

@Jaifroid
Copy link
Collaborator Author

Jaifroid commented Nov 4, 2024

@audiodude I already reviewed #2097 which looks great.

@audiodude
Copy link
Member

@Jaifroid I'm having trouble reproducing. When I load a ZIM in the browser extension that should have the problem and view source with CTRL-U, I don't see any <script> tags at all. I assume this is because the extension is modifying the document source? Or maybe Chrome is stripping the tags?

What I'd like to do is scrape an article or two with 1.14-dev with the bug, and again with my fix, and compare.

@Jaifroid
Copy link
Collaborator Author

Jaifroid commented Dec 3, 2024

@audiodude Don't use Ctrl-U to view. Use DevTools (Ctrl-Shift-I, or F12). Here's a step-by-step:

  1. Open extension or if you don't have it installed merely go to https://browser-extension.kiwix.org, and you an use that as a pseudo-extension
  2. Open the ZIM wikipedia_bm_all_nopic_2024-10.zim using the selectors in Configuration
  3. When landing page loads, click on "Mali" (one of the first links)
  4. Open DevTools (F12), and make sure you can see both the page and the DevTools panel (rearrange windows if necessary)
  5. Go to the "</>Elements" tab (in Chrome - see screenshot), or the "Inspector" tab (in Firefox)
  6. Use the "Select an element" picker (top left, circled in screenshot below), click it, then click an element in the Mali article, so that the inspector zones in on that element in the HTML
  7. In the DevTools pane, search with Ctrl-F, and search for pcs.c1.Page (indicated in screenshot)
  8. You should be in one of two inline script elements

Image

@audiodude
Copy link
Member

Thanks for the extensively detailed instructions! I guess the reason it doesn't work with "view source" is because the article is dynamically loaded into an iframe?

@Jaifroid
Copy link
Collaborator Author

Jaifroid commented Dec 3, 2024

Possibly, tbh I never use view source, since much more extensive tools are available in DevTools! Quite likely the HTML in the iframe is not shown in view source, because the browser would assume it's a Cross Origin site. In our case, it's Same Origin because we insert that content.

@audiodude
Copy link
Member

I imagine the iframe source isn't shown in "view source" because it's loaded after the page loads, but nbd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug regression First as tragedy, then as farce ;-)
Projects
None yet
3 participants