
JavaScript crawler architecture and implementation plan


Goal

The goal of this document is to define how the JavaScript crawler feature is going to work, and the steps required to implement it.

Architecture diagram


The JS crawler will become part of the web_spider plugin, where we'll add the browser wrapper. The wrapper should have a very simple API (a sketch of the interface follows the list below):

  • start() creates all the browser processes and prepares them for work
  • load(uri) loads a URI in a browser engine and returns a task ID
  • is_ready(task_id) returns True when the URI has finished loading
  • get_fuzzable_requests(task_id) returns a list of all the HTTP requests that were sent by the browser while loading the URI associated with the task_id
  • kill(task_id) kills the browser associated with the task_id
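
To make the interface concrete, here is a minimal sketch of what the wrapper could look like, assuming Pebble for the process pool; the load_in_browser helper, the proxy port and all method bodies are illustrative placeholders, not the actual w3af implementation:

```python
import uuid

from pebble import ProcessPool


def load_in_browser(uri, proxy_address):
    # Placeholder: in the real wrapper this runs inside a sub-process,
    # drives a browser configured to use the proxy, and returns the
    # HTTP requests captured while loading the URI
    return []


class BrowserWrapper(object):

    def __init__(self, pool_size=10, proxy_address=('127.0.0.1', 44444)):
        self._pool_size = pool_size
        self._proxy_address = proxy_address
        self._pool = None
        self._tasks = {}    # task_id -> ProcessFuture

    def start(self):
        # Create the pool of browser processes (the capture proxy would
        # also be started here)
        self._pool = ProcessPool(max_workers=self._pool_size)

    def load(self, uri):
        # Schedule the page load in a browser sub-process, return a task id
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.schedule(
            load_in_browser, args=(uri, self._proxy_address))
        return task_id

    def is_ready(self, task_id):
        return self._tasks[task_id].done()

    def get_fuzzable_requests(self, task_id):
        # HTTP requests captured by the proxy while loading the URI
        return self._tasks[task_id].result()

    def kill(self, task_id):
        # Cancelling a running Pebble future terminates the worker process
        self._tasks[task_id].cancel()
```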

The wrapper's responsibilities are:

  • Starting new browser processes using a multiprocessing pool (Pebble)
  • Starting an HTTP proxy
  • Configuring the browsers to navigate via the proxy (sketched after this list)
  • Two-way communication with the browser
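
To illustrate the proxy configuration point, this is one way a browser could be started and pointed at the capture proxy, assuming Selenium and headless Chrome as the engine; the engine choice and the proxy port are assumptions, not part of this design:

```python
from selenium import webdriver


def start_browser_via_proxy(proxy_host='127.0.0.1', proxy_port=44444):
    # Route all browser traffic through the wrapper's capture proxy so
    # every HTTP request the page triggers can be recorded
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--proxy-server=http://%s:%s' % (proxy_host, proxy_port))
    options.add_argument('--ignore-certificate-errors')
    return webdriver.Chrome(options=options)
```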

The browser process will be responsible for:

  • Two-way communication with the wrapper: receive instructions such as "load page X", "click on Y", "return status", "return DOM", and send requests back, such as "please fill this form" (see the sketch after this list)
  • Injecting JS code into the page in order to click on elements and instrument the page
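
One way to picture that two-way channel is a command loop over a multiprocessing pipe; the message format and command names below are assumptions made purely for illustration:

```python
from multiprocessing import Pipe, Process


def browser_process_main(conn):
    # Runs inside the browser sub-process: receive commands from the
    # wrapper, act on them and send the results back over the pipe
    while True:
        message = conn.recv()

        if message['command'] == 'load':
            # Here the real process would drive the browser engine and
            # inject the instrumentation JS into the page
            conn.send({'status': 'loading', 'uri': message['uri']})
        elif message['command'] == 'return_dom':
            conn.send({'dom': '<html>...</html>'})
        elif message['command'] == 'exit':
            break


if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    browser = Process(target=browser_process_main, args=(child_conn,))
    browser.start()

    parent_conn.send({'command': 'load', 'uri': 'http://target.example/'})
    print(parent_conn.recv())

    parent_conn.send({'command': 'exit'})
    browser.join()
```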

All browsers run in separate sub-processes, so that unresponsive browsers can be killed and so that the crawler can "bypass the GIL". There will be around 10 processes running browsers in w3af.

The HTTP proxy is responsible for capturing all HTTP requests sent by the browser processes. This allows us to capture new injection points.
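
w3af has its own proxy infrastructure, so the snippet below is only meant to show the capture idea; it uses a mitmproxy-style addon purely as an illustration:

```python
from mitmproxy import http


class CaptureRequests(object):
    """Record every request the browsers send through the proxy."""

    def __init__(self):
        self.captured = []

    def request(self, flow: http.HTTPFlow):
        # Each captured request is a potential new injection point
        self.captured.append((flow.request.method, flow.request.pretty_url))


addons = [CaptureRequests()]
```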

Implementation steps

Proof of concept

The proof of concept step allows us to confirm that the architecture meets all the requirements for a working JavaScript crawler.

The proof of concept step will only send the URL to the browser and wait for it to complete loading. During this load phase the site might redirect the user to another location using JS, send XMLHttpRequests, etc. All of those HTTP requests will be captured; without the browser engine they were invisible to w3af.
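
A sketch of this load-and-wait behaviour, again assuming a Selenium-driven browser as in the earlier snippet; waiting for document.readyState is a simplification, since a real implementation would also have to wait for pending XMLHttpRequests:

```python
from selenium.webdriver.support.ui import WebDriverWait


def load_and_wait(driver, uri, timeout=30):
    # Load the URI and block until the browser reports the page as loaded.
    # Any JS redirects and XMLHttpRequests fired in this window go through
    # the proxy and are captured there.
    driver.get(uri)
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script('return document.readyState') == 'complete')
```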

Click on links

The crawler is now extended to click on links, buttons, etc.

This step gives users more value, since the JavaScript crawler is now actually interacting with the site.
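
A rough sketch of this step under the same Selenium assumption; the selector and error handling are illustrative, and a real crawler would also need to handle clicks that navigate away from the page:

```python
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# Elements the crawler considers clickable (illustrative selector)
CLICKABLE_SELECTOR = 'a, button, [onclick]'


def click_everything(driver):
    # Click every element that looks clickable; any HTTP requests the
    # clicks trigger are captured by the proxy as new injection points
    for element in driver.find_elements(By.CSS_SELECTOR, CLICKABLE_SELECTOR):
        try:
            element.click()
        except WebDriverException:
            # Hidden, detached or covered elements can't be clicked; skip
            continue
```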

Submit forms

The crawler is now extended to fill forms and submit them.

This is the final step in the implementation and the one that gives users the most value: w3af now interacts with the site like any real-life user would.
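
A sketch of the form handling under the same Selenium assumption; the fill value and the set of input types are arbitrary choices, and since submitting a form navigates away, a real implementation would reload the page between forms:

```python
from selenium.webdriver.common.by import By

TEXT_INPUTS = 'input[type=text], input[type=email], textarea'


def fill_and_submit_forms(driver, fill_value='w3af'):
    # Fill every text-like input with a marker value and submit the form;
    # the resulting requests are captured by the proxy
    for form in driver.find_elements(By.TAG_NAME, 'form'):
        for text_input in form.find_elements(By.CSS_SELECTOR, TEXT_INPUTS):
            text_input.clear()
            text_input.send_keys(fill_value)
        form.submit()
```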

Login forms

In some rare cases the application login form uses multiple redirects and potentially JS to authenticate the user after the username and password are entered into the form.

Currently users could write a specific auth plugin to handle that case, even without JS support: the plugin can have hard-coded steps to run, regular expressions, etc. that mimic what a browser would do. The bad thing about this is that most users won't write the plugin and will simply stop using w3af instead.

This is a nice to have, but the authentication plugins should be able to handle JavaScript too. It would be nice to have a new auth plugin where the user could enter a CSS selector for the inputs to fill and a CSS selector for the button to press; the browser would then perform all the steps and return the session cookie to the framework.
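
As a sketch of what such an auth plugin could do with user-supplied CSS selectors (the function name and parameters are hypothetical, not an existing w3af API):

```python
from selenium.webdriver.common.by import By


def js_login(driver, login_url, username, password,
             username_css, password_css, submit_css):
    # Fill the login form using the user-supplied CSS selectors, click the
    # submit button and hand the resulting session cookies back to the
    # framework
    driver.get(login_url)
    driver.find_element(By.CSS_SELECTOR, username_css).send_keys(username)
    driver.find_element(By.CSS_SELECTOR, password_css).send_keys(password)
    driver.find_element(By.CSS_SELECTOR, submit_css).click()
    return driver.get_cookies()
```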
