May 2022 - Replace Splash with Playwright
New Features
Playwright
Captures are now made with Playwright instead of Splash. This is a major improvement because Playwright drives actual, up-to-date browsers in headless mode (instead of qt-webkit from ~2016). You can read more about the research that led to this change in the discussion.
The other main advantages of using Playwright are:
- Easier to install: Docker, which was needed to run Splash, is no longer required
- Much better control over what happens in the browser while capturing: Playwright makes it very simple to instrument everything in the browser. The capturing module already tries to solve reCaptcha if it detects one on the page.
The capture is made by a standalone Python module that you can also use in your own tools.
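To give an idea of what a headless capture with Playwright looks like, here is a minimal sketch that uses Playwright's own Python API directly. It is not the API of the standalone capture module: the `capture` function, its parameters, and the keys of the returned dictionary are assumptions made for illustration. It assumes Playwright is installed (`pip install playwright`, then `playwright install chromium`).

```python
from typing import Any, Dict


def capture(url: str, timeout_ms: int = 30_000) -> Dict[str, Any]:
    """Load `url` in headless Chromium and return a few capture artifacts.

    Hypothetical helper for illustration only; names and return shape are
    assumptions, not the interface of the actual capture module.
    """
    # Imported lazily so this file can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(ignore_https_errors=True)
        page = context.new_page()

        # Instrument the browser: record the URL of every request it issues.
        requests = []
        page.on("request", lambda request: requests.append(request.url))

        # Navigate and wait until the network has been quiet for a moment.
        response = page.goto(url, wait_until="networkidle", timeout=timeout_ms)

        result = {
            "final_url": page.url,  # URL after any redirects
            "status": response.status if response else None,
            "html": page.content(),  # rendered DOM as HTML
            "png": page.screenshot(full_page=True),  # screenshot as bytes
            "requests": requests,
        }

        context.close()
        browser.close()
    return result
```

Calling `capture("https://example.com")` would return the rendered HTML, a full-page screenshot, and the list of requested URLs. The `page.on("request", ...)` hook is one example of the kind of instrumentation mentioned above.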
De-duplication
If the exact same capture is triggered multiple times within 5 minutes, the new request is skipped and the requester is redirected to the existing capture.
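The de-duplication rule can be sketched as a small in-memory lookup keyed by a fingerprint of the capture settings. This is only an illustration under stated assumptions: the `CaptureDeduplicator` class, the way settings are fingerprinted, and the in-memory storage are hypothetical and not necessarily how the project implements it.

```python
import hashlib
import json
import time
import uuid
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute window described above


class CaptureDeduplicator:
    """Hypothetical helper: reuse a recent capture with identical settings."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window
        self._clock = clock  # injectable so the window can be tested
        # fingerprint -> (capture id, time the capture was triggered)
        self._recent: Dict[str, Tuple[str, float]] = {}

    @staticmethod
    def _fingerprint(settings: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, settings: dict) -> Tuple[str, bool]:
        """Return (capture_id, is_new) for the given capture settings."""
        now = self._clock()
        key = self._fingerprint(settings)
        previous = self._recent.get(key)
        if previous is not None and now - previous[1] < self._window:
            # Same settings within the window: point to the earlier capture.
            return previous[0], False
        capture_id = str(uuid.uuid4())
        self._recent[key] = (capture_id, now)
        return capture_id, True
```

A web front end could call `submit()` for each request and redirect to the returned identifier whenever `is_new` is false. In this sketch the window is measured from the original capture and is not extended by duplicate requests.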
Fixes
- Avoid discarding a capture on network error: when a redirect further down the chain fails, the chain up to that point is kept
- Issue when the MISP event was submitted as unpublished
- [Docker] Properly handle archiving
- [Docker] Init SRI hashes
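The first fix in the list above (keeping a partial redirect chain on network error) can be illustrated with the sketch below. It is a simplified model, not the project's code: `follow_redirects`, `CaptureResult`, and the injected `fetch` callable are hypothetical names, and a real capture follows redirects inside the browser rather than in a loop like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CaptureResult:
    chain: List[str] = field(default_factory=list)  # URLs visited, in order
    error: Optional[str] = None  # set if a hop failed


def follow_redirects(start_url: str,
                     fetch: Callable[[str], Optional[str]],
                     max_hops: int = 10) -> CaptureResult:
    """Follow redirects, keeping the chain collected so far if a hop fails.

    `fetch` (an assumption for this sketch) returns the next URL for a
    redirect, None for a final page, and raises OSError on network failure.
    """
    result = CaptureResult()
    url = start_url
    for _ in range(max_hops):
        result.chain.append(url)
        try:
            next_url = fetch(url)
        except OSError as exc:  # includes ConnectionError and TimeoutError
            # Record the error but keep everything captured up to this point.
            result.error = f"{url}: {exc}"
            break
        if next_url is None:
            break
        url = next_url
    return result
```

The point of the design is that a failure on a later hop is stored next to the partial chain instead of causing the whole result to be thrown away.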
Changes
- Improve subsequent capture template on long URLs
- Improve view of the capture page on small-ish screens
- General maintenance and code cleanup
- Improvement in the tree generation on edge cases
- Bump JS/CSS libraries
- Update bundled-in User-Agent file
- Use pydeep2, which comes with a bundled libfuzzy and is easier to install