HTTP crawling tool for darknet sites.
Although it was originally conceived for the I2P anonymous network, this tool can also be used to crawl other HTTP-based websites, such as those found in Tor, Freenet and/or the surface web.
The crawler automatically extracts links to other darknet sites, thus providing an overall view of the darknet site inter-connections along with other useful information.
To operate on a darknet, access to it must be implemented in crawler/darknet/spiders/spider.py using the functions of spiderBase.py. Spiders for I2P and Freenet are currently implemented.
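As an illustration only, and assuming a Scrapy-style spider layout as suggested by the crawler/darknet/spiders/ folder, a new darknet spider could look roughly like the sketch below. The class name, seed URL and proxy address (the default I2P HTTP proxy on 127.0.0.1:4444) are hypothetical and do not reflect the actual spiderBase.py API.

# Illustrative sketch only (not the project's actual spiderBase.py API):
# a minimal Scrapy-style spider that routes requests through the local
# I2P HTTP proxy and extracts outgoing links to other darknet sites.
import scrapy

class ExampleI2PSpider(scrapy.Spider):
    name = "example_i2p"
    start_urls = ["http://example.i2p/"]  # hypothetical seed eepsite

    def start_requests(self):
        for url in self.start_urls:
            # 127.0.0.1:4444 is the default I2P HTTP proxy exposed by the router.
            yield scrapy.Request(url, meta={"proxy": "http://127.0.0.1:4444"})

    def parse(self, response):
        # Extract links to other sites to map darknet inter-connections.
        for href in response.css("a::attr(href)").getall():
            yield {"source": response.url, "target": href}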
The crawler requires an adequate environment to run. The mandatory elements are:
- Linux Ubuntu 16.04 and above (it may also run on older versions)
- I2P router (latest version), FProxy or the appropriate darknet proxy
- MySQL 5.7, though another DBMS such as SQLite can be used.
- Python 3.7 environment (plus the dependencies listed in requirements.txt)
As a Python-based tool, we recommend using virtual environments. In the following, we use conda (https://www.anaconda.com) to create and manage Python environments.
Database
- Download and install a database management system. We use MySQL, but another one can be used.
sudo apt install -y mysql-server-5.7 mysql-client-5.7
- Creating the schema and user.
The schema i2p_database, the user i2p and the password password will be created after executing the following commands.
$ sudo mysql
mysql> create database i2p_database;
mysql> create user 'i2p'@'localhost' identified by 'password';
mysql> grant all privileges on `i2p_database`.* to 'i2p'@'localhost';
mysql> quit;
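You can optionally check that the new user can access the schema, for example:
$ mysql -u i2p -p i2p_database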
Python environment and dependencies
- Creating a virtual environment.
$ conda create -n py37 python=3.7
$ conda activate py37
(py37) $
- Installing python dependencies.
(py37) $ cd <root_project_folder>/crawler/
(py37) $ pip install -r requirements.txt
- Database access from Python.
We use the Pony ORM for the data persistence layer, so the database connection must be configured. Please edit the corresponding line in the file connection_settings.py, which is located in <root_project_folder>/crawler/database/.
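For reference, a minimal sketch of what the Pony ORM binding settings could look like for the MySQL schema created above is shown below; the actual variable names used in connection_settings.py may differ.

# Illustrative sketch only: the variable names in the real connection_settings.py may differ.
connection_settings = {
    "provider": "mysql",
    "host": "localhost",
    "user": "i2p",
    "passwd": "password",
    "db": "i2p_database",
}

# Pony ORM would then bind a Database object with these settings, e.g.:
# from pony.orm import Database
# db = Database()
# db.bind(**connection_settings)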
Now it is time to crawl the darknet. Every time you want to start a new crawling procedure, we recommend following the steps below.
- Database population.
We recommend dropping and re-creating the schema before running the crawler, for a clean and fresh run.
$ sudo mysql
mysql> drop database i2p_database;
mysql> create database i2p_database;
mysql> quit;
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python populate.py
- Spiders' crawling output.
Spiders output JSON files into specific folders, so these folders should already exist. If they do not, please create them. For a clean and fresh run, delete all files in those folders.
(py37) $ cd <root_project_folder>/crawler/darknet/spiders/
(py37) $ mkdir finished ongoing
- Supervising the crawling procedure: log.
In order to supervise the crawling procedure, a log file is created in a specific folder. If the "logs" folder does not exist, please create it. For a clean and fresh run, delete this file.
(py37) $ cd <root_project_folder>/
(py37) $ mkdir logs
- Starting the crawling process.
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python manager.py &> /dev/null
If you want to supervise the crawling procedure, please see <root_project_folder>/logs/darknetcrawler.log. Additional information is also stored in the database.
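For example, you can follow the log in real time with:
(py37) $ tail -f <root_project_folder>/logs/darknetcrawler.log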
Note: the crawling procedure outputs a large amount of logs and information on standard output, so we recommend launching the crawler with &> /dev/null appended, but this is up to the user.
- Roberto Magán-Carrión
- Alberto Abellán-Galera
- Gabriel Maciá-Fernández
- Emilio Figueras Martín
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE.md file for details.