Dev tor integration #1

Open
wants to merge 223 commits into base: master

Commits (223)
82da64c
Added authors to readme
gmacia Jan 18, 2019
85c0527
An eepsite must be in just one status.
robertomagan Jan 20, 2019
d223c16
Changing UNKWNON status to DISCARDED
robertomagan Jan 20, 2019
bf0ce00
Refactoring configuration constants.
robertomagan Jan 20, 2019
95b02ad
Refactoring in common functionality.
robertomagan Jan 20, 2019
f6fa9e5
Changing ok_files and fail_files lists.
robertomagan Jan 21, 2019
40f30e9
Adding extra debug traces
robertomagan Jan 22, 2019
5c6ef78
Sometimes, a spider tags an empty json file as *.ok. This way, the cu…
robertomagan Jan 22, 2019
49168ab
Minor changes on post-processing *.ok files
robertomagan Jan 23, 2019
fd0fb76
logs folder
robertomagan Jan 23, 2019
90f5f4e
Changing order in error_to_pending call
robertomagan Jan 23, 2019
7e8518f
Changing the way of removing *.ok files
robertomagan Jan 23, 2019
fc62b27
Changing the way of removing *.ok files
robertomagan Jan 24, 2019
bc503a8
Dealing with large json files for extracting their connections.
robertomagan Jan 25, 2019
0762b5c
Using Scrapy Pipelines for overwriting JSON files thus keeping last s…
robertomagan Jan 28, 2019
ff5d7fa
Bug fixed
robertomagan Jan 28, 2019
ac6003b
Dealing with large json files for extracting their connections.
robertomagan Jan 29, 2019
d355e29
Setting up the inferred site language in the database
robertomagan Jan 30, 2019
b2f0511
Getting original seeds.
robertomagan Jan 30, 2019
967cd8b
Code refactoring: files and folders.
robertomagan Feb 13, 2019
ae7f318
Deleting exception folders ...
robertomagan Feb 13, 2019
bdcec3f
Moving method getting the crawler status to siteutils.
robertomagan Feb 13, 2019
216340f
First approach on developing site discovering process.
robertomagan Feb 13, 2019
fdd0505
Adding memory limit algorithm to the spider
Abellan09 Feb 17, 2019
f78856e
Adding the eepsite's total number of internal pages
Abellan09 Feb 18, 2019
60d653d
Analyzing details from the main page of an eepsite
Abellan09 Feb 21, 2019
5bced33
Discovering procedure: sequential approach.
robertomagan Feb 21, 2019
dd631d3
Discovering process: distributed approach.
robertomagan Feb 21, 2019
13b8a4d
Adding regex package to requirements.txt file
robertomagan Feb 21, 2019
cf54fee
Some minor processing status flow changes to reset ERROR tries when th…
robertomagan Feb 25, 2019
55b16cc
Comment add_fake_discovery_info
robertomagan Feb 25, 2019
c40349e
Some code lines have been added for better debugging.
robertomagan Feb 26, 2019
d974e74
Filtering and limiting the sample size in order not to crash the Goog…
Abellan09 Feb 26, 2019
c5ce9d0
Fixing bugs in discovering process and adding some debug code lines.
robertomagan Mar 1, 2019
a1c4069
Simple thread status management.
robertomagan Mar 1, 2019
43e0bfe
Simple thread status management.
robertomagan Mar 1, 2019
eef8aba
Adding HTTP response time to SiteProcessingLog entity.
robertomagan Mar 5, 2019
87418c1
Adding HTTP response time to SiteProcessingLog entity.
robertomagan Mar 5, 2019
f2760e6
Dividing into groups the number of words of the main page to facilita…
Abellan09 Mar 6, 2019
1f86924
Adding new info about the home, title, links on data base
robertomagan Mar 8, 2019
0f28866
Adding links that cause exceptions to visited_links
Abellan09 Mar 13, 2019
609bde9
Changing a function's name
Abellan09 Mar 13, 2019
235c802
Loading the spider's status properly
Abellan09 Mar 13, 2019
fb35caa
Putting in the json file only the final results of the crawling that …
Abellan09 Mar 13, 2019
5cb6ac2
Managing atomic database transactions when a new site is crawled. Bug…
robertomagan Mar 16, 2019
0f8e974
Minor logging bugs fixed.
robertomagan Mar 16, 2019
e8f812c
Some minor changes in spider (init function)
Abellan09 Mar 16, 2019
5715043
Adding the functionality of counting the number of scripts and images th…
Abellan09 Mar 18, 2019
f8cc750
Some minor changes in spider debugging
Abellan09 Mar 20, 2019
a4fe136
Changing the way of counting the number of scripts and images that conta…
Abellan09 Mar 22, 2019
1898bd5
Updating README and minor changes in manager to manage fail and ok si…
robertomagan Mar 23, 2019
c65103e
Update README.md
robertomagan Mar 23, 2019
ad6a2ea
All seeds found after adding floodfill sources and i2prouter addressb…
robertomagan Mar 25, 2019
c76ec9e
Some minor changes in README
Abellan09 Mar 28, 2019
9548156
Adding ways to get the sites in specific status depending on their statu…
robertomagan Apr 5, 2019
aacd2a9
The number of tries to discover a site is equally distributed within …
robertomagan Apr 5, 2019
c60f09b
Getting seeds from floodfill node. A major bug has been solved in lin…
robertomagan Apr 5, 2019
0067c63
Managing IOError when seed files do not exist.
robertomagan Apr 5, 2019
d291649
all_seeds.txt
robertomagan Apr 5, 2019
5b8b1a5
Logging management. Spiders output log information to logs/spider.log.
robertomagan Apr 8, 2019
742211b
Reducing logging information.
robertomagan Apr 8, 2019
f8b9394
Minor bug when restoring spiders in ONGOING status.
robertomagan Apr 8, 2019
6b3f8b3
Re-raise exception in parse method.
robertomagan Apr 8, 2019
7425712
Re-raise exception in parse method.
robertomagan Apr 8, 2019
f2ec8e7
More logging details.
robertomagan Apr 8, 2019
1c7cfe2
More logging details.
robertomagan Apr 8, 2019
02678e7
Monitoring process script.
robertomagan Apr 18, 2019
96c1c14
Debugging improvements.
robertomagan Apr 18, 2019
cc2894a
Returning main page tokenized words
Abellan09 Apr 18, 2019
8232f11
Checking status coherence of scrapy processes on DB and sub-processes.
robertomagan Apr 29, 2019
1c92d35
Adding log lines for debugging purposes.
robertomagan Apr 30, 2019
bb97fc3
Scenario configuration parameters.
robertomagan Apr 30, 2019
c2d6247
Saving 'non_visited_links' list in the disk instead of in memory
Abellan09 May 2, 2019
0f254fc
Adding comments to some functions
Abellan09 May 2, 2019
764b2a5
Removing the site name in non_visited
Abellan09 May 4, 2019
342f2ec
Debug logs.
robertomagan May 10, 2019
81636c8
Adding home site text to db.
robertomagan May 11, 2019
3c5aa9d
Generating UUID for the crawling process.
robertomagan May 11, 2019
7a51f88
Changing where the crawling procedure setting is.
robertomagan May 13, 2019
10e8f63
Deployment process in a distributed environment.
robertomagan May 13, 2019
71b001a
Minor changes on deployment/management scripts
robertomagan May 13, 2019
8858f60
Minor changes on deployment/management scripts
robertomagan May 13, 2019
1a4b64e
Minor changes on deployment/management scripts
robertomagan May 13, 2019
f298752
Minor changes on deployment/management scripts. Adding psutil depende…
robertomagan May 14, 2019
0855e9b
Minor changes on deployment/management scripts. Adding psutil depende…
robertomagan May 14, 2019
dd39d38
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
fda28c9
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
1f4564a
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
135aa85
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
78280b6
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
49e538e
Setting DISCOVERING status when a HTTP error is raised.
robertomagan May 20, 2019
4d8ca1c
Adding virtualized host instances and solving some minor bug in stat…
robertomagan May 20, 2019
888dff6
Deployment and management scripts modifications.
robertomagan May 23, 2019
44328a2
Deleting pid.txt file
robertomagan May 23, 2019
fda61df
Now we can also get the status of just one VM
robertomagan May 23, 2019
85a204f
Deployment improvements
robertomagan May 23, 2019
22d9f2a
Deployment improvements
robertomagan May 23, 2019
41f800f
Deployment improvements
robertomagan May 23, 2019
a25008c
Minor changes, variables response_code and response_time in http requ…
robertomagan May 23, 2019
d892778
Adding root_path to the setup_bbdd.sh script
robertomagan May 23, 2019
145f581
Relative path to remotely run the manager
robertomagan May 24, 2019
5d28639
Absolute path to remotely stop the manager
robertomagan May 24, 2019
3ad4cfe
BBDD VM does not have processes to stop
robertomagan May 24, 2019
a1e109b
Configuring manager
robertomagan May 24, 2019
a83776c
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
eba160a
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
5f14135
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
7db30e3
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
74c3f02
Minor changes in logging and management processes.
robertomagan May 25, 2019
4db905b
Minor changes in logging and management processes.
robertomagan May 25, 2019
812dac8
To control sys.out output
robertomagan May 25, 2019
ef240bf
Console log and specific VM stop.
robertomagan May 25, 2019
2ebf0ca
exists() method launched an exception if the site being created…
robertomagan May 26, 2019
43ea928
Managing transaction integrity exception.
robertomagan May 26, 2019
5bdd5e9
Managing transaction integrity exception in distributed environments.
robertomagan May 28, 2019
e5b6a00
The usual way to start the crawling process.
robertomagan May 28, 2019
05a121a
Adding missing html5lib to requirements.txt
robertomagan May 31, 2019
e94192f
Installing project dependencies when setting up the environment.
robertomagan May 31, 2019
7408aac
Fixing some bugs on managing scripts.
robertomagan May 31, 2019
537dc9b
Fair distribution of initial seeds among all the manager instances.
robertomagan Jun 3, 2019
2392213
Config files.
robertomagan Jun 3, 2019
c42de9c
Increasing the period of seed self-assignment to 5 mins.
robertomagan Jun 3, 2019
e697ffa
Config for zappa machines.
robertomagan Jun 3, 2019
8b24305
Config for metis machines.
robertomagan Jun 3, 2019
118fb99
Config for zappa machines.
robertomagan Jun 3, 2019
bab90f9
Bug fixed in processing ok files when an exception is raised.
robertomagan Jun 4, 2019
b3b0bb2
Config.
robertomagan Jun 4, 2019
a4e62de
Config.
robertomagan Jun 4, 2019
2aaecf2
Config.
robertomagan Jun 4, 2019
478c283
Config.
robertomagan Jun 4, 2019
cd8bb59
Config.
robertomagan Jun 7, 2019
afbf299
Config.
robertomagan Jun 7, 2019
8bd3db4
Config.
robertomagan Jun 7, 2019
e5717d8
Config.
robertomagan Jun 7, 2019
e29e824
Config.
robertomagan Jun 7, 2019
70c131c
Config.
robertomagan Jun 7, 2019
684603d
Config.
robertomagan Jun 7, 2019
1fe4a05
Config.
robertomagan Jun 7, 2019
21db122
Performance improvements on managing running alive single discovering…
robertomagan Oct 21, 2019
db7d2c6
Fixing some errors
Abellan09 Nov 10, 2019
e8e227e
Updating ...
robertomagan Nov 22, 2019
92a64cc
Adding a utility to update siteconnectivitysummary table after the e…
robertomagan Nov 22, 2019
a3a60ae
Minor changes in README
Abellan09 Jan 14, 2020
9e72b16
Merge branch 'master' of https://github.com/nesg-ugr/I2P_Crawler
Abellan09 Jan 14, 2020
5de71a0
Updating notebooks for analysis ...
robertomagan Feb 3, 2020
df9b64b
New scripts for backing up an experiment from all machines
robertomagan Apr 20, 2020
e0aa387
Snapshot scripts minor changes
robertomagan May 4, 2020
388f63c
Script modifications ...
robertomagan May 6, 2020
f97cb4b
Adding new figures ..
robertomagan May 7, 2020
184d918
Adding and setting up new discovered eepsites now as seeds
robertomagan May 7, 2020
b807c18
Setting up database connection on zappa machines
robertomagan May 11, 2020
3e0bc49
Setting up database connection on zappa machines
robertomagan May 11, 2020
afafc35
Setting up database connection on metis machines
robertomagan May 11, 2020
7d4115a
Adding vai host (instead of zappa) and vai configuration.
robertomagan May 15, 2020
9335205
Vai configuration update
robertomagan May 15, 2020
9344644
Setting up database
robertomagan May 15, 2020
e7dd149
Vai configuration update
robertomagan May 15, 2020
47e67d9
Metis configuration update
robertomagan May 15, 2020
572bae1
Some initial seeds were duplicated. Fixed!
robertomagan May 15, 2020
4cc47ea
Vai's database connection configuration.
robertomagan May 15, 2020
976e69d
Config.
robertomagan May 15, 2020
23803fe
Config.
robertomagan May 15, 2020
097e878
Adding database post-processing utilities
robertomagan Jun 9, 2020
2c97b9e
Removing waiting for a minute ...
robertomagan Jun 11, 2020
f2943e1
Freenet scripts
EmilioFigueras Jul 30, 2020
bdbc01e
Update Git Ignore
EmilioFigueras Jul 30, 2020
b3cf3eb
Update Config
EmilioFigueras Jul 30, 2020
2785bbc
Update freenet scripts
EmilioFigueras Jul 30, 2020
e540cfa
Update freenet scripts
EmilioFigueras Jul 30, 2020
bd93fa1
Update freenet scripts
EmilioFigueras Jul 30, 2020
af92fa6
Update freenet scripts
EmilioFigueras Jul 30, 2020
d8871e5
Update freenet scripts
EmilioFigueras Jul 30, 2020
b785e26
Update freenet scripts
EmilioFigueras Jul 30, 2020
20dae9a
Update freenet scripts
EmilioFigueras Jul 31, 2020
a3be2c5
Update freenet scripts
EmilioFigueras Jul 31, 2020
5531725
minor changes to logs
EmilioFigueras Jul 31, 2020
f70080c
Update freenet scripts
EmilioFigueras Aug 3, 2020
f1a69b6
Bug fixes and settings changes.
EmilioFigueras Aug 4, 2020
52021b3
Changed nomenclature from I2P to Darknet
EmilioFigueras Aug 8, 2020
0c444ef
Fixed a minor bug with freesite urls
EmilioFigueras Aug 11, 2020
1904698
Seed list updated
EmilioFigueras Aug 17, 2020
2867278
pull updated
EmilioFigueras Aug 17, 2020
f771d4b
gitignore updated
EmilioFigueras Aug 17, 2020
e36ba04
Remove UUID
EmilioFigueras Aug 17, 2020
7774e2c
Code analysis performed
EmilioFigueras Aug 20, 2020
ea9ad72
Database deleted
EmilioFigueras Aug 21, 2020
fb11908
Notebook updated
EmilioFigueras Aug 30, 2020
3374e6c
Config.
robertomagan Dec 29, 2020
f3bbbc6
README update
robertomagan Dec 29, 2020
954cfb5
README update
robertomagan Dec 29, 2020
91c8368
Update
EmilioFigueras Jul 25, 2021
417faa9
Updated the name of the configuration parameters
EmilioFigueras Jul 25, 2021
c68adb7
Add files via upload
EmilioFigueras Sep 14, 2021
ec3edc1
Delete c4darknet_modules2.pdf
EmilioFigueras Sep 14, 2021
8ec932c
Add files via upload
EmilioFigueras Sep 14, 2021
b9d00e2
Update README.md
EmilioFigueras Sep 14, 2021
bebff25
Merge branch 'master' into master
EmilioFigueras Sep 17, 2021
7080a9b
Merge pull request #1 from EmilioFigueras/master
nesg-ugr Sep 18, 2021
5abb0f0
Update README.md
EmilioFigueras Sep 19, 2021
a02b0ab
Merge pull request #2 from EmilioFigueras/patch-1
EmilioFigueras Sep 19, 2021
df2d6d0
Update LICENSE
robertomagan Dec 28, 2021
7fd1e3c
Update README.md
robertomagan Jan 14, 2022
911f4ed
I added some comments ...
robertomagan Feb 16, 2023
ab318c3
TOR_Spider class
edx173 Mar 1, 2023
07cf8db
Adding the TOR proxy and comments
edx173 Mar 2, 2023
7bd6cbc
TOR_Spider, proxy configuration and discovertythread.py modifications…
edx173 May 10, 2023
c1f5525
New deployment scripts
robertomagan Dec 12, 2023
f8382e0
New path for remote script execution
robertomagan Dec 12, 2023
4024687
Database connections settings updated
robertomagan Dec 12, 2023
4da82b6
Configuration script updated
robertomagan Dec 12, 2023
b1598aa
Configuration script updated
robertomagan Dec 12, 2023
15bf738
Configuration script updated
robertomagan Dec 12, 2023
b104099
Deployment script and configuration files - Ready for crawling TOR sites
robertomagan Dec 20, 2023
7b23555
Homogenizing paths for crawling different darknets
robertomagan Dec 29, 2023
7a65f8e
Adding dummy floodfill_seeds.txt file for tor and freenet.
robertomagan Dec 29, 2023
8128557
Adding new i2p seeds and vms for the deployment
robertomagan Dec 29, 2023
a1b87ac
Adding language types for language detection engines.
robertomagan Dec 29, 2023
cba7525
Adding scripts for dumping and downloading BBDD after the experiments.
robertomagan Jan 3, 2024
9a1c104
Updating script
robertomagan Jan 3, 2024
f3dd859
Updating script
robertomagan Jan 3, 2024
7f18b2e
Setting up complete crawling scenario of 15 vms in total.
robertomagan Jan 3, 2024
61b1d58
Freenet seed - url quotes removed
robertomagan Jan 3, 2024
4ba7ea4
Experimental setup with only 2 machines crawling per darknet
robertomagan Apr 10, 2024
6 changes: 6 additions & 0 deletions .gitignore
100755 → 100644
@@ -107,3 +107,9 @@ venv.bak/
.mypy_cache/

/notes

#Data experiments
notebooks/data

#UUID
crawler/uuid.txt
2 changes: 1 addition & 1 deletion LICENSE
100755 → 100644
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2018 Alberto Abellán Galera
Copyright (c) 2020 Emilio Figueras Martín

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
162 changes: 92 additions & 70 deletions README.md
100755 → 100644
@@ -1,121 +1,143 @@
# I2P CRAWLER
# c4darknet: Crawl for Darknet

This tool enables crawling on the I2P Darknet.
HTTP crawling tool for darknet sites.

## How to install

You can launch the crawler on Windows 10 and Ubuntu Linux (from version 16.04) systems.

In both systems, just download or clone this repository:

```
git clone https://github.com/Abellan09/i2p_crawler
```
<div align="center">
<img src="c4darknet_modules.png" alt="c4darknet functional modules" width="70%"/>
</div>

Then, you have to install/configure some things:
Although it was originally conceived to be used for the I2P anonymous network,
this tool can also be used for crawling some other HTTP-based web sites
like those found in TOR, Freenet and/or the surface web.

First of all, it is necessary to install an instance of I2P.
Second, you need Python 2.7 and Scrapy.
It is recommended to install DB Browser for SQLite to manage the database easily.
Last, you have to create the "ongoing" and "finished" directories.
The crawler automatically extracts links to other darknet sites, thus getting an overall
view of the darknet site inter-connections and some other useful information.

### Windows
To function in a darknet, it is necessary to implement access to it in
crawler/darknet/spiders/spider.py using the functions of spiderBase.py (a minimal
illustration follows below). Currently, spiders for I2P and Freenet are implemented.
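
As an orientation only, here is a minimal, hypothetical sketch of the general idea: a Scrapy spider that routes its requests through a local darknet HTTP proxy. It does not reflect the actual spiderBase.py interface, and the proxy address and start URL are assumptions (I2P's HTTP proxy usually listens on 127.0.0.1:4444, while TOR needs an HTTP-to-SOCKS bridge such as Privoxy).

```
import scrapy


class MinimalDarknetSpider(scrapy.Spider):
    """Toy spider that sends every request through a local darknet HTTP proxy."""

    name = "minimal_darknet"
    # Assumed defaults: I2P HTTP proxy on 127.0.0.1:4444 (use e.g. Privoxy for TOR).
    proxy = "http://127.0.0.1:4444"
    start_urls = ["http://stats.i2p/"]  # illustrative seed

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's HttpProxyMiddleware honours the 'proxy' key in request.meta.
            yield scrapy.Request(url, callback=self.parse, meta={"proxy": self.proxy})

    def parse(self, response):
        # Keep only the page title and outgoing links, similar in spirit to how
        # the crawler builds its darknet connectivity view.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "links": response.css("a::attr(href)").getall(),
        }
```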

1) I2P.

Download (and execute) the installer from [I2P](https://geti2p.net/es/download).

2) Python and Scrapy
## How to install

Download (and execute) the installer from [Python](https://www.python.org/downloads).
Then, to install scrapy: ```pip install scrapy```
#### Requirements

3) DB Browser for SQLite.
The crawler relies on an adequate environment to run. The mandatory elements
are:

Download (and execute) the installer from [SQLite Browser](https://sqlitebrowser.org).
- Linux **Ubuntu 16.04** and above (it can also run on older versions)
- **I2P router** (latest version), **FProxy** or an appropriate darknet proxy
- **MySQL 5.7**, though some other DBMS can be used, like SQLite
- **Python 3.7** environment (+ dependencies found in requirements.txt)

4) Ongoing and finished directories.
#### Installation steps
As a Python-based tool, we recommend using virtual environments. In the following, we are going
to use conda (https://www.anaconda.com) to create and manage Python environments.

Go to the root of the cloned project.
Change directory to ~/spiders and create the directories inside it.
**Database**
1) Download and install a database management system. We chose MySQL, but others can be used.

```
cd /i2p_crawler/crawler/i2p/i2p/spiders
mkdir ongoing
mkdir finished
sudo apt install -y mysql-server-5.7 mysql-client-5.7
```

### Linux
2) Creating schema and users.

1) I2P.
The schema ```i2p_database```, user ```i2p``` and password ```password``` will be created after
executing the following commands.

```
sudo apt-add-repository ppa:i2p-maintainers/i2p
sudo apt-get update
sudo apt-get install i2p
$ sudo mysql
mysql> create database i2p_database;
mysql> create user 'i2p'@'localhost' identified by 'password';
mysql> grant all privileges on `i2p_database`.* to 'i2p'@'localhost';
mysql> quit;
```

2) Python and Scrapy
**Python environment and dependencies**

1) Creating a virtual environment.
```
sudo apt install python2.7
sudo apt install python-pip
sudo pip install scrapy
$ conda create -n py37 python=3.7
$ conda activate py37
(py37) $
```

3) DB Browser for SQLite.

2) Installing python dependencies.
```
sudo add-apt-repository -y ppa:linuxgndu/sqlitebrowser
sudo apt-get update
sudo apt-get install sqlitebrowser
(py37) $ cd <root_project_folder>/crawler/
(py37) $ pip install -r requirements.txt
```

4) Ongoing and finished directories.
3) Database access from Python.

We use Pony ORM for the data persistence layer, so the database connection must be configured.
Please edit the corresponding line in ```connection_settings.py```, which is located
in ```<root_project_folder>/crawler/database/```.
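
As a minimal sketch (assuming the MySQL schema and credentials created above, not the project's actual file contents), a Pony ORM binding could look like this; the variable names in the real ```connection_settings.py``` may differ.

```
# connection_settings.py -- illustrative sketch; the real file's contents may differ
from pony.orm import Database

db = Database()

# Bind Pony ORM to the MySQL schema and credentials created in the previous steps.
db.bind(
    provider="mysql",
    host="localhost",
    user="i2p",
    passwd="password",
    db="i2p_database",
)
```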

### Crawling
Now it is time to crawl the darknet. Every time you want to start a new crawling procedure,
we recommend following the steps below.

1) Database population.

We recommend dropping and re-creating the schema before running the crawler, for a clean and fresh run.

```
cd /i2p_crawler/crawler/i2p/i2p/spiders
mkdir ongoing
mkdir finished
$ sudo mysql
mysql> drop database i2p_database;
mysql> create database i2p_database;
mysql> quit;
```

## Usage example

First, you have to start an instance of I2P (it is recommended to keep the instance active for as long as possible, for better results).
```
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python populate.py
```

In Windows, just click on the "Start I2P" button; in Linux, start the service with ```i2prouter start```
2) Spiders crawling output.

Then, go to the directory ~i2p_crawler/crawler/i2p with ```cd ~/i2p_crawler/crawler/i2p/``` and run the crawler:
Spiders output JSON files to specific folders, so those folders should already exist.
If they do not, please create them. For a clean and fresh run, delete all files in those folders.

```
python manager.py
(py37) $ cd <root_project_folder>/crawler/darknet/spiders/
(py37) $ mkdir finished ongoing
```

The script "manager.py" will try to crawl the entire I2P network (all the eepsites it finds). This can take a very long time (difficult to estimate).
If you prefer to crawl only one eepsite, run the spider "spider.py" as follows:
3) Supervising crawling procedure: log.

In order to supervise the crawling procedure, a log file is created in a specific folder.
If the "logs" folder does not exist, please create it. For a clean and fresh run, delete this file.

```
scrapy crawl i2p -a url=URL -o OUTPUT.json
(py37) $ cd <root_project_folder>/
(py37) $ mkdir logs
```

Where "URL" is the URL of the eepsite you want to crawl and "OUTPUT" is the name of the file where the crawling results will be stored.
For example:
4) Starting the crawling process.


```
scrapy crawl i2p -a url=http://eepsite.example.i2p -o output_example.json
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python manager.py &> /dev/null
```

## Built With
If you want to supervise the crawling procedure, please see
```<root_project_folder>/logs/darknetcrawler.log```. Also, more information is stored in
the database.


* [Python](https://www.python.org) - Used language.
* [Scrapy](https://scrapy.org) - Used crawling framework.
*Note:* The crawling procedure outputs tons of logs and information on standard output, so we recommend
launching the crawler appending ```&> /dev/null```, but it is up to the user.

## Author
## Authors

* **Alberto Abellán**
* **Roberto Magán-Carrión**
* **Alberto Abellán-Galera**
* **Gabriel Maciá-Fernández**
* **Emilio Figueras Martín**

See also the list of [contributors](https://github.com/Abellan09/i2p_crawler/graphs/contributors) who participated in this project.
See also the list of [contributors](https://github.com/EmilioFigueras/c4darknet/graphs/contributors) who participated in this project.

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Binary file added c4darknet_modules.png
Binary file added c4i2p_modules.png
File renamed without changes.
39 changes: 26 additions & 13 deletions crawler/i2p/i2p/settings.py → crawler/darknet/darknetsettings.py
100755 → 100644
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-

# Scrapy settings for i2p project
# Scrapy settings for darknet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. More settings and their documentation in:
@@ -9,13 +9,15 @@
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'i2p'
import settings

SPIDER_MODULES = ['i2p.spiders']
NEWSPIDER_MODULE = 'i2p.spiders'
BOT_NAME = 'darknet'

SPIDER_MODULES = ['darknet.spiders']
NEWSPIDER_MODULE = 'darknet.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'i2p (+http://www.yourdomain.com)'
#USER_AGENT = 'darknet (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
@@ -27,8 +29,8 @@
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
DOWNLOAD_TIMEOUT = 30 # 30s
RETRY_TIMES = 2
DOWNLOAD_TIMEOUT = settings.HTTP_TIMEOUT
RETRY_TIMES = settings.MAX_CRAWLING_ATTEMPTS_ON_ERROR
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
@@ -48,14 +50,14 @@
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'i2p.middlewares.I2PSpiderMiddleware': 543,
# 'darknet.middlewares.DarknetSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'i2p.middlewares.I2PProxyMiddleware': 200,
'i2p.middlewares.I2PFilterMiddleware': 300,
'darknet.middlewares.DarknetProxyMiddleware': 200,
'darknet.middlewares.DarknetFilterMiddleware': 300,
}

# Enable or disable extensions
@@ -66,9 +68,9 @@

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'i2p.pipelines.I2PPipeline': 300,
#}
ITEM_PIPELINES = {
'darknet.pipelines.DarknetPipeline': 300,
}

# The maximum depth that will be allowed to crawl for any site:
DEPTH_LIMIT = 3
@@ -80,3 +82,14 @@
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_STORAGES_BASE = {
# '': 'darknet.exportutils.CustomFileFeedStorage',
# 'file': 'darknet.exportutils.CustomFileFeedStorage'
#}

# CUSTOM CONFIGURATION
PATH_ONGOING_SPIDERS = "darknet/spiders/ongoing/"
PATH_FINISHED_SPIDERS = "darknet/spiders/finished/"
PATH_LOG = '../logs/'
PATH_DATA = '../data/'
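
The refactored settings above import a project-level ```settings``` module and read ```HTTP_TIMEOUT``` and ```MAX_CRAWLING_ATTEMPTS_ON_ERROR``` from it. A hedged sketch of such a module follows; only the constant names come from the diff, and the values are assumptions based on the previously hard-coded defaults (30 s timeout, 2 retries).

```
# settings.py -- project-wide crawler constants (illustrative sketch; values assumed)

# Seconds to wait for an HTTP response before the request times out.
HTTP_TIMEOUT = 30

# Retries for a failed request before the site is flagged as ERROR.
MAX_CRAWLING_ATTEMPTS_ON_ERROR = 2
```
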
2 changes: 1 addition & 1 deletion crawler/i2p/i2p/items.html → crawler/darknet/items.html
100755 → 100644
@@ -9,7 +9,7 @@
<td valign=bottom>&nbsp;<br>
<font color="#ffffff" face="helvetica, arial">&nbsp;<br><big><big><strong>items</strong></big></big></font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/items.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\items.py</a></font></td></tr></table>
><font color="#ffffff" face="helvetica, arial"><a href="i2p">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/items.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\items.py</a></font></td></tr></table>
<p><tt>#&nbsp;-*-&nbsp;coding:&nbsp;utf-8&nbsp;-*-</tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
17 changes: 10 additions & 7 deletions crawler/i2p/i2p/items.py → crawler/darknet/items.py
100755 → 100644
@@ -7,15 +7,18 @@

import scrapy

class I2P_spider_state(scrapy.Item):
class Darknet_spider_state(scrapy.Item):

'''
EN: Item that represents the state of the spider.
SP: Item que representa el estado del spider.
'''
eepsite = scrapy.Field()
'''

darksite = scrapy.Field()
visited_links = scrapy.Field()
non_visited_links = scrapy.Field()
language = scrapy.Field()
extracted_eepsites = scrapy.Field()
extracted_darksites = scrapy.Field()
total_darksite_pages = scrapy.Field()
title = scrapy.Field()
size_main_page = scrapy.Field()
main_page_tokenized_words = scrapy.Field()
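
For illustration only, the sketch below shows one way the renamed item could be populated from a crawl summary. The field names come from the diff above; the import path, helper function and ```summary``` dictionary keys are assumptions.

```
from darknet.items import Darknet_spider_state  # assumed import path


def build_state_item(summary):
    """Pack a (hypothetical) crawl-summary dict into a Darknet_spider_state item."""
    item = Darknet_spider_state()
    item["darksite"] = summary["site"]                        # e.g. "example.i2p"
    item["visited_links"] = summary["visited"]                # pages already crawled
    item["non_visited_links"] = summary["pending"]            # pages still queued
    item["language"] = summary["language"]                    # inferred main-page language
    item["extracted_darksites"] = summary["outgoing_sites"]   # other darknet sites linked
    item["total_darksite_pages"] = len(summary["visited"]) + len(summary["pending"])
    item["title"] = summary["title"]
    item["size_main_page"] = summary["main_page_size"]
    item["main_page_tokenized_words"] = summary["tokens"]
    return item
```
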
2 changes: 1 addition & 1 deletion crawler/i2p/i2p/middlewares.html → crawler/darknet/middlewares.html
100755 → 100644
@@ -9,7 +9,7 @@
<td valign=bottom>&nbsp;<br>
<font color="#ffffff" face="helvetica, arial">&nbsp;<br><big><big><strong>middlewares</strong></big></big></font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/middlewares.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\middlewares.py</a></font></td></tr></table>
><font color="#ffffff" face="helvetica, arial"><a href="i2p">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/middlewares.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\middlewares.py</a></font></td></tr></table>
<p><tt>#&nbsp;-*-&nbsp;coding:&nbsp;utf-8&nbsp;-*-</tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">