jonatelintelo/webdriver-crawler
Webdriver Crawler

This repository contains the webdriver crawler for the second assignment of Capita Selecta in Cyber Security (2023) at Radboud University. The crawler is built with Selenium and targets Chrome.

The folder crawl-data contains all JSON files produced by the crawl-accept and crawl-noop runs.

The folder crawl-src contains all code and data to run the webdriver crawler.

The file accept_words.txt contains a list of words used to scan for and click on consent buttons.

The file services.json contains a list of tracker entities and their associated tracker domains.
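As a rough sketch of how these two data files might be consumed, the helpers below match button text against the accept-word list and map a request domain to its tracker entity. The structure assumed for services.json (entity names mapped to lists of domains) and the function names are illustrative assumptions, not taken from this repository's code.

```python
def load_accept_words(path):
    """Read one consent word per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def is_consent_button(element_text, accept_words):
    """True if a button's visible text matches a known accept word."""
    return element_text.strip().lower() in accept_words


def entity_for_domain(domain, services):
    """Map a request domain to its tracker entity, if any.

    Assumes services.json maps entity names to domain lists, e.g.
    {"ExampleTracker": ["tracker.example", "cdn.tracker.example"]}.
    Matches the domain itself or any of its subdomains.
    """
    for entity, domains in services.items():
        if any(domain == d or domain.endswith("." + d) for d in domains):
            return entity
    return None
```

With this shape, a crawl can classify each third-party request domain by looking it up once against the loaded services mapping.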

Installation

pip install -r requirements.txt

Usage

# Crawl a single domain without accepting cookies (--noop):
python script.py -u www.google.com --noop

# Crawl the 500 domains requested in the assignment, accepting cookies (--accept):
python script.py -i tranco-top-500-safe.csv --accept
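The usage examples above imply a small CLI with a target option (single domain or CSV input) and a consent mode. A hedged sketch of how such an interface could be defined with argparse is shown below; the actual argument handling in script.py may differ.

```python
import argparse


def build_parser():
    """Build a CLI matching the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(description="Webdriver crawler")

    # Exactly one crawl target: a single domain or a CSV of domains.
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("-u", metavar="DOMAIN", help="crawl a single domain")
    target.add_argument("-i", metavar="CSV", help="CSV file listing domains to crawl")

    # Exactly one consent mode.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--accept", action="store_true",
                      help="try to accept cookie consent dialogs")
    mode.add_argument("--noop", action="store_true",
                      help="do not interact with consent dialogs")
    return parser
```

Mutually exclusive groups make argparse reject contradictory invocations such as passing both --accept and --noop.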

Required Download

Running the crawler may fail with the following error: ERROR: No matching issuer found. To fix this, add ca.crt to your trusted Chrome certificates. This makes the webdriver's certificate trusted by Chrome and resolves the error.
