Skip to content

brave-experiments/cookiemonster-tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tools for interacting with Cookiemonster API

Run a crawl

Init

Dependencies

Initialize the Python virtual environment and install the requirements. For example:

python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt

List of domains to crawl

You will need a list of domains to crawl. It presumes a list of Tranco domains, though other lists might work.

Run the get_latest_tranco.sh script (curls and unzips the latest top 1M Tranco list of domains with subdomains).

Alternatively, you can obtain by hand the latest (zipped) Tranco list with subdomains here: https://tranco-list.eu/top-1m-incl-subdomains.csv.zip

Using

You need to pass as input a CSV file of domains. You also need to pass an output file. If the supplied output file already exists, it will be appended to, not overwritten. Also, you can pass in a --skip parameter to skip the first N rows. All this helps restart crawls.

The script will both print to stdout and write to the output file.

python3 crawl.py -i top-1m.csv -o sept14-crawl-results.txt

About

Scripts to help work with Cookiemonster API

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published