Webdriver Crawler

This repository contains the webdriver crawler for the second assignment of Capita Selecta in Cyber Security (2023) at Radboud University. The crawler is built with Selenium and drives Chrome.

The folder crawl_data contains all JSON files produced by the crawl-accept and crawl-noop runs.

The folder crawler_src contains all code and data needed to run the webdriver crawler.

The file accept_words.txt contains a list of words used to scan for and click consent buttons.
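
As an illustration of how such a word list could be used, here is a minimal sketch. It is not the crawler's actual code. It assumes accept_words.txt holds one word or phrase per line and that a button counts as a consent button when its visible text exactly matches an entry, ignoring case and extra whitespace. The function names are hypothetical. The elements only need a .text attribute and a .click() method, which Selenium WebElement objects provide.

```python
from typing import Iterable, Optional


def load_accept_words(path: str) -> list[str]:
    """Read one word or phrase per line, skipping blank lines (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


def is_consent_text(text: str, accept_words: Iterable[str]) -> bool:
    """True when the text equals an accept word, ignoring case and whitespace."""
    normalized = " ".join(text.split()).lower()
    return normalized in {word.strip().lower() for word in accept_words}


def click_first_consent(elements, accept_words: Iterable[str]) -> Optional[str]:
    """Click the first element whose text matches; return its text, else None.

    `elements` is any iterable of objects exposing `.text` and `.click()`,
    for example the result of a Selenium find_elements call.
    """
    words = list(accept_words)
    for element in elements:
        label = element.text or ""
        if is_consent_text(label, words):
            element.click()
            return label
    return None
```

With Selenium, the candidates might come from something like driver.find_elements(By.TAG_NAME, "button"). Exact matching is a deliberate choice in this sketch: it avoids clicking links such as "Learn more about accepting cookies", at the cost of missing buttons with longer labels.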

The file services.json contains a list of tracker entities and their associated tracker domains.
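
To show how an entity list like this can be used to attribute requests to trackers, here is a minimal sketch. The layout of services.json is an assumption: the code expects a JSON object that maps each entity name to a list of its domains, which may differ from the real file. All function names and the example domains are hypothetical.

```python
import json
from typing import Optional


def load_services(path: str) -> dict[str, list[str]]:
    """Load the entity -> domains mapping (assumed layout, see lead-in)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_domain_index(services: dict[str, list[str]]) -> dict[str, str]:
    """Invert the mapping so each tracker domain points at its entity."""
    index: dict[str, str] = {}
    for entity, domains in services.items():
        for domain in domains:
            index[domain.lower().rstrip(".")] = entity
    return index


def find_tracker_entity(hostname: str, index: dict[str, str]) -> Optional[str]:
    """Return the entity owning the hostname or one of its parent domains."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the full hostname first, then drop the leftmost label each step.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in index:
            return index[candidate]
    return None


# Hypothetical data in the assumed layout.
example_services = {
    "ExampleAds": ["ads.example"],
    "ExampleMetrics": ["metrics.example"],
}
example_index = build_domain_index(example_services)
```

Matching on whole domain labels means cdn.ads.example is attributed to ExampleAds, while notads.example is not.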

Installation

pip install -r requirements.txt

Usage

# Run a crawl on a single domain without accepting cookies (--noop):
python script.py -u www.google.com --noop

# Run a crawl on the 500 domains requested in the assignment, accepting cookies (--accept):
python script.py -i tranco-top-500-safe.csv --accept

Required Download

Running the crawler may fail with the following error: ERROR: No matching issuer found. To fix this, add the ca.crt file from the repository root to your trusted Chrome certificates. Chrome will then trust the webdriver, which resolves the error.

