
modCrawler

Crawler based on a modified browser to detect online tracking. Used in the canvas fingerprinting and evercookie detection experiments in our CCS 2014 paper, The Web Never Forgets. Visit The Web Never Forgets website for more info.

Installation

We strongly suggest you use a virtual machine, container, or similar isolation when installing modCrawler.

git clone https://github.com/fpdetective/modCrawler
cd modCrawler
./setup.sh
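
One way to get the suggested isolation is a throwaway container. The sketch below assumes Docker and an Ubuntu base image, neither of which is part of the setup instructions above:

docker run -it --name modcrawler ubuntu:14.04 bash
# inside the container, mirror the steps above
# (git is assumed to be missing from the base image;
#  setup.sh may pull in further packages on its own)
apt-get update && apt-get install -y git
git clone https://github.com/fpdetective/modCrawler
cd modCrawler
./setup.sh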

Running the tests

Please run the tests before running the crawler. The simplest way is to run py.test from within the test directory; py.test will discover and run all the tests. Alternatively, you can run an individual test from the command line, such as python -m test.runenv_test. Both invocations are shown below.
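
Assuming you start from the repository root:

cd test
py.test                      # discovers and runs the whole test suite
cd ..
python -m test.runenv_test   # runs a single test module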

Command line parameters

Below we give a description of the parameters that are passed to the agents.py module.

  • --urls: path to a file that contains the list of URLs to crawl
  • --max_rank: highest line number (rank) of the URLs to be crawled (if the URL file contains rank info)
  • --min_rank (optional): lowest line number (rank) of the URLs to be crawled (if the URL file contains rank info)
  • --max_proc: maximum number of browsers that will run in parallel
  • --flash: Flash support (0: disable, 1: enable (default))
  • --cookie: cookie policy (0: allow all (default), 1: allow first-party only, 2: disable all, 3: allow third-party cookies from visited sites)
  • --upload: upload crawl results to a remote server via SSH (0: don't upload (default), 1: upload; the SSH server info should be completed in crawler/common.py)

Example:

  • To crawl the top 100 URLs in the etc/top-1m.csv file using 10 parallel crawlers (Flash disabled):

    • python crawl.py --urls etc/top-1m.csv --max_rank 100 --max_proc 10 --flash 0
  • To crawl URLs ranked 100-1000 in the etc/top-1m.csv file using 5 parallel crawlers (Flash enabled):

    • python crawl.py --urls etc/top-1m.csv --max_rank 1000 --min_rank 100 --max_proc 5
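
The flags can be combined freely. As a hypothetical example (the flag values are illustrative, built only from the options documented above), the run below crawls the top 500 URLs with 8 parallel browsers, allowing first-party cookies only and disabling Flash:

python crawl.py --urls etc/top-1m.csv --max_rank 500 --max_proc 8 --cookie 1 --flash 0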

After the crawl

[Screenshot: HTML report generated after a crawl]

modCrawler will store the data about the crawls in the jobs directory. For convenience, it places a symlink called latest that points to the directory of the most recent crawl.

During the crawl, you can follow the debug log:

tail -f jobs/latest/debug.log

Once the crawl has finished, you can find the crawl data in the jobs/latest/ directory.

  • crawl.sqlite: SQLite crawl database.
  • ...report.html: an HTML report that gives an overview of the results; the file name depends on the date and crawl parameters.
  • debug.log: debug logs.
  • error.log: error logs; the file is not created if there are no errors.

In addition, the crawl directory is gzipped and stored in the jobs directory.
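
The schema of crawl.sqlite is not documented here, but you can inspect it directly with the standard sqlite3 command-line shell (assumed to be installed; the table names it prints are whatever the crawler created):

sqlite3 jobs/latest/crawl.sqlite ".tables"   # list the tables in the crawl database
sqlite3 jobs/latest/crawl.sqlite ".schema"   # dump the CREATE statement for each table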

Building your own browser

The setup.sh script will download a modified Firefox that logs canvas fingerprinting related function calls. Alternatively, you can build your own Firefox using the provided browser patch. Make sure you use the right .mozconfig file for building (e.g., export MOZCONFIG=~/path/to/gecko-dev/.mozconfig-ffstd). The following assumes you checked out the Firefox repository into ~/dev/gecko-dev/:

cd ~/dev/gecko-dev/
git fetch
git checkout GECKO4401_2016020518_RELBRANCH
git apply ~/dev/modCrawler/browser_patch/0001-Log-canvas-fingerprinting-related-function-calls.patch
./mach build
cd firefox-static
make package
# copy the package from the dist dir to the destination
cp dist/*.bz2 /path/to/modCrawler/bins

You need to place your freshly built browser in the bins/ff-mod directory to make sure it is used by the crawler. Please consult the Mozilla build documentation for any errors you may run into.
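
As a sketch of that last step, assuming the package is a firefox-*.tar.bz2 whose top-level directory should become bins/ff-mod (the exact archive layout and the expected contents of ff-mod depend on your build configuration and are assumptions here):

cd /path/to/modCrawler/bins
mkdir -p ff-mod
tar -xjf firefox-*.bz2 -C ff-mod --strip-components=1   # unpack the build into bins/ff-mod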
