Dev tor integration #1

Open
wants to merge 223 commits into base: master

Commits (223)
82da64c
Added authors to readme
gmacia Jan 18, 2019
85c0527
An eepsite must be in just one status.
robertomagan Jan 20, 2019
d223c16
Changing UNKWNON status to DISCARDED
robertomagan Jan 20, 2019
bf0ce00
Refactoring configuration constants.
robertomagan Jan 20, 2019
95b02ad
Refactoring in common functionality.
robertomagan Jan 20, 2019
f6fa9e5
Changing ok_files and fail_files lists.
robertomagan Jan 21, 2019
40f30e9
Adding extra debug traces
robertomagan Jan 22, 2019
5c6ef78
Sometimes, a spider tags an empty json file as *.ok. This way, the cu…
robertomagan Jan 22, 2019
49168ab
Minor changes on post-processing *.ok files
robertomagan Jan 23, 2019
fd0fb76
logs folder
robertomagan Jan 23, 2019
90f5f4e
Changing order in error_to_pending call
robertomagan Jan 23, 2019
7e8518f
Changing the way of removing *.ok files
robertomagan Jan 23, 2019
fc62b27
Changing the way of removing *.ok files
robertomagan Jan 24, 2019
bc503a8
Dealing with large json files for extracting their connections.
robertomagan Jan 25, 2019
0762b5c
Using Scrapy Pipelines for overwriting JSON files thus keeping last s…
robertomagan Jan 28, 2019
ff5d7fa
Bug fixed
robertomagan Jan 28, 2019
ac6003b
Dealing with large json files for extracting their connections.
robertomagan Jan 29, 2019
d355e29
Setting up the inferred site language in the database
robertomagan Jan 30, 2019
b2f0511
Getting original seeds.
robertomagan Jan 30, 2019
967cd8b
Code refactoring: files and folders.
robertomagan Feb 13, 2019
ae7f318
Deleting exception folders ...
robertomagan Feb 13, 2019
bdcec3f
Moving method getting the crawler status to siteutils.
robertomagan Feb 13, 2019
216340f
First approach on developing site discovering process.
robertomagan Feb 13, 2019
fdd0505
Adding memory limit algorithm to the spider
Abellan09 Feb 17, 2019
f78856e
Adding the eepsite's total number of internal pages
Abellan09 Feb 18, 2019
60d653d
Analyzing details from the main page of an eepsite
Abellan09 Feb 21, 2019
5bced33
Discovering procedure: sequential approach.
robertomagan Feb 21, 2019
dd631d3
Discovering process: distributed approach.
robertomagan Feb 21, 2019
13b8a4d
Adding regex package to requirements.txt file
robertomagan Feb 21, 2019
cf54fee
Some minor processing status flow changes to reset ERROR tries when th…
robertomagan Feb 25, 2019
55b16cc
Comment add_fake_discovery_info
robertomagan Feb 25, 2019
c40349e
Some code lines have been added for better debugging.
robertomagan Feb 26, 2019
d974e74
Filtering and limiting the sample size in order not to crash the Goog…
Abellan09 Feb 26, 2019
c5ce9d0
Fixing bugs in discovering process and adding some debug code lines.
robertomagan Mar 1, 2019
a1c4069
Simple thread status management.
robertomagan Mar 1, 2019
43e0bfe
Simple thread status management.
robertomagan Mar 1, 2019
eef8aba
Adding HTTP response time to SiteProcessingLog entity.
robertomagan Mar 5, 2019
87418c1
Adding HTTP response time to SiteProcessingLog entity.
robertomagan Mar 5, 2019
f2760e6
Dividing into groups the number of words of the main page to facilita…
Abellan09 Mar 6, 2019
1f86924
Adding new info about the home, title, links on data base
robertomagan Mar 8, 2019
0f28866
Adding links that cause exceptions to visited_links
Abellan09 Mar 13, 2019
609bde9
Changing a function's name
Abellan09 Mar 13, 2019
235c802
Loading the spider's status properly
Abellan09 Mar 13, 2019
fb35caa
Putting in the json file only the final results of the crawling that …
Abellan09 Mar 13, 2019
5cb6ac2
Managing atomic database transactions when a new site is crawled. Bug…
robertomagan Mar 16, 2019
0f8e974
Minor logging bugs fixed.
robertomagan Mar 16, 2019
e8f812c
Some minor changes in spider (init function)
Abellan09 Mar 16, 2019
5715043
Adding the functionality of counting the number of scripts and images th…
Abellan09 Mar 18, 2019
f8cc750
Some minor changes in spider debugging
Abellan09 Mar 20, 2019
a4fe136
Changing the way of counting the number of scripts and images that conta…
Abellan09 Mar 22, 2019
1898bd5
Updating README and minor changes in manager to manage fail and ok si…
robertomagan Mar 23, 2019
c65103e
Update README.md
robertomagan Mar 23, 2019
ad6a2ea
All seeds found after adding floodfill sources and i2prouter addressb…
robertomagan Mar 25, 2019
c76ec9e
Some minor changes in README
Abellan09 Mar 28, 2019
9548156
Adding ways to get the sites in specific status depending on their statu…
robertomagan Apr 5, 2019
aacd2a9
The number of tries to discover a site is equally distributed within …
robertomagan Apr 5, 2019
c60f09b
Getting seeds from floodfill node. A major bug has been solved in lin…
robertomagan Apr 5, 2019
0067c63
Managing IOError when seed files do not exist.
robertomagan Apr 5, 2019
d291649
all_seeds.txt
robertomagan Apr 5, 2019
5b8b1a5
Logging management. Spiders output log information to logs/spider.log.
robertomagan Apr 8, 2019
742211b
Reducing logging information.
robertomagan Apr 8, 2019
f8b9394
Minor bug when restoring spiders in ONGOING status.
robertomagan Apr 8, 2019
6b3f8b3
Re-raise exception in parse method.
robertomagan Apr 8, 2019
7425712
Re-raise exception in parse method.
robertomagan Apr 8, 2019
f2ec8e7
More logging details.
robertomagan Apr 8, 2019
1c7cfe2
More logging details.
robertomagan Apr 8, 2019
02678e7
Monitoring process script.
robertomagan Apr 18, 2019
96c1c14
Debugging improvements.
robertomagan Apr 18, 2019
cc2894a
Returning main page tokenized words
Abellan09 Apr 18, 2019
8232f11
Checking status coherence of scrapy processes on DB and sub-processes.
robertomagan Apr 29, 2019
1c92d35
Adding log lines for debugging purposes.
robertomagan Apr 30, 2019
bb97fc3
Scenario configuration parameters.
robertomagan Apr 30, 2019
c2d6247
Saving 'non_visited_links' list in the disk instead of in memory
Abellan09 May 2, 2019
0f254fc
Adding comments to some functions
Abellan09 May 2, 2019
764b2a5
Removing the site name in non_visited
Abellan09 May 4, 2019
342f2ec
Debug logs.
robertomagan May 10, 2019
81636c8
Adding home site text to db.
robertomagan May 11, 2019
3c5aa9d
Generating UUID for the crawling process.
robertomagan May 11, 2019
7a51f88
Changing where the crawling procedure setting is.
robertomagan May 13, 2019
10e8f63
Deployment process in a distributed environment.
robertomagan May 13, 2019
71b001a
Minor changes on deployment/management scripts
robertomagan May 13, 2019
8858f60
Minor changes on deployment/management scripts
robertomagan May 13, 2019
1a4b64e
Minor changes on deployment/management scripts
robertomagan May 13, 2019
f298752
Minor changes on deployment/management scripts. Adding psutil depende…
robertomagan May 14, 2019
0855e9b
Minor changes on deployment/management scripts. Adding psutil depende…
robertomagan May 14, 2019
dd39d38
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
fda28c9
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
1f4564a
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
135aa85
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
78280b6
Minor changes on deployment/management scripts.
robertomagan May 14, 2019
49e538e
Setting DISCOVERING status when a HTTP error is raised.
robertomagan May 20, 2019
4d8ca1c
Adding virtualized host instances and solving some minor bug in stat…
robertomagan May 20, 2019
888dff6
Deployment and management scripts modifications.
robertomagan May 23, 2019
44328a2
Deleting pid.txt file
robertomagan May 23, 2019
fda61df
Now we can also get the status of just one VM
robertomagan May 23, 2019
85a204f
Deployment improvements
robertomagan May 23, 2019
22d9f2a
Deployment improvements
robertomagan May 23, 2019
41f800f
Deployment improvements
robertomagan May 23, 2019
a25008c
Minor changes, variables response_code and response_time in http requ…
robertomagan May 23, 2019
d892778
Adding root_path to the setup_bbdd.sh script
robertomagan May 23, 2019
145f581
Relative path to remotely run the manager
robertomagan May 24, 2019
5d28639
Absolute path to remotely stop the manager
robertomagan May 24, 2019
3ad4cfe
BBDD VM does not have processes to stop
robertomagan May 24, 2019
a1e109b
Configuring manager
robertomagan May 24, 2019
a83776c
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
eba160a
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
5f14135
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
7db30e3
Each crawling process must manage only its known sites.
robertomagan May 24, 2019
74c3f02
Minor changes in logging and management processes.
robertomagan May 25, 2019
4db905b
Minor changes in logging and management processes.
robertomagan May 25, 2019
812dac8
To control sys.out output
robertomagan May 25, 2019
ef240bf
Console log and specific VM stop.
robertomagan May 25, 2019
2ebf0ca
exists() method launched an exception if the site being created…
robertomagan May 26, 2019
43ea928
Managing transaction integrity exception.
robertomagan May 26, 2019
5bdd5e9
Managing transaction integrity exception in distributed environments.
robertomagan May 28, 2019
e5b6a00
The usual way to start the crawling process.
robertomagan May 28, 2019
05a121a
Adding missing html5lib to requirements.txt
robertomagan May 31, 2019
e94192f
Installing project dependencies when setting up the environment.
robertomagan May 31, 2019
7408aac
Fixing some bugs on managing scripts.
robertomagan May 31, 2019
537dc9b
Fair distribution of initial seeds among all the manager instances.
robertomagan Jun 3, 2019
2392213
Config files.
robertomagan Jun 3, 2019
c42de9c
Increasing the period of seed self-assignment to 5 mins.
robertomagan Jun 3, 2019
e697ffa
Config for zappa machines.
robertomagan Jun 3, 2019
8b24305
Config for metis machines.
robertomagan Jun 3, 2019
118fb99
Config for zappa machines.
robertomagan Jun 3, 2019
bab90f9
Bug fixed in processing ok files when an exception is raised.
robertomagan Jun 4, 2019
b3b0bb2
Config.
robertomagan Jun 4, 2019
a4e62de
Config.
robertomagan Jun 4, 2019
2aaecf2
Config.
robertomagan Jun 4, 2019
478c283
Config.
robertomagan Jun 4, 2019
cd8bb59
Config.
robertomagan Jun 7, 2019
afbf299
Config.
robertomagan Jun 7, 2019
8bd3db4
Config.
robertomagan Jun 7, 2019
e5717d8
Config.
robertomagan Jun 7, 2019
e29e824
Config.
robertomagan Jun 7, 2019
70c131c
Config.
robertomagan Jun 7, 2019
684603d
Config.
robertomagan Jun 7, 2019
1fe4a05
Config.
robertomagan Jun 7, 2019
21db122
Performance improvements on managing running alive single discovering…
robertomagan Oct 21, 2019
db7d2c6
Fixing some errors
Abellan09 Nov 10, 2019
e8e227e
Updating ...
robertomagan Nov 22, 2019
92a64cc
Adding a utility to update siteconnectivitysummary table after the e…
robertomagan Nov 22, 2019
a3a60ae
Minor changes in README
Abellan09 Jan 14, 2020
9e72b16
Merge branch 'master' of https://github.com/nesg-ugr/I2P_Crawler
Abellan09 Jan 14, 2020
5de71a0
Updating notebooks for analysis ...
robertomagan Feb 3, 2020
df9b64b
New scripts for backing up an experiment from all machines
robertomagan Apr 20, 2020
e0aa387
Snapshot scripts minor changes
robertomagan May 4, 2020
388f63c
Script modifications ...
robertomagan May 6, 2020
f97cb4b
Adding new figures ..
robertomagan May 7, 2020
184d918
Adding and setting up new discovered eepsites now as seeds
robertomagan May 7, 2020
b807c18
Setting up database connection on zappa machines
robertomagan May 11, 2020
3e0bc49
Setting up database connection on zappa machines
robertomagan May 11, 2020
afafc35
Setting up database connection on metis machines
robertomagan May 11, 2020
7d4115a
Adding vai host (instead of zappa) and vai configuration.
robertomagan May 15, 2020
9335205
Vai configuration update
robertomagan May 15, 2020
9344644
Setting up database
robertomagan May 15, 2020
e7dd149
Vai configuration update
robertomagan May 15, 2020
47e67d9
Metis configuration update
robertomagan May 15, 2020
572bae1
Some initial seeds were duplicated. Fixed!
robertomagan May 15, 2020
4cc47ea
Vai's database connection configuration.
robertomagan May 15, 2020
976e69d
Config.
robertomagan May 15, 2020
23803fe
Config.
robertomagan May 15, 2020
097e878
Adding database post-processing utilities
robertomagan Jun 9, 2020
2c97b9e
Removing waiting for a minute ...
robertomagan Jun 11, 2020
f2943e1
Freenet scripts
EmilioFigueras Jul 30, 2020
bdbc01e
Update Git Ignore
EmilioFigueras Jul 30, 2020
b3cf3eb
Update Config
EmilioFigueras Jul 30, 2020
2785bbc
Update freenet scripts
EmilioFigueras Jul 30, 2020
e540cfa
Update freenet scripts
EmilioFigueras Jul 30, 2020
bd93fa1
Update freenet scripts
EmilioFigueras Jul 30, 2020
af92fa6
Update freenet scripts
EmilioFigueras Jul 30, 2020
d8871e5
Update freenet scripts
EmilioFigueras Jul 30, 2020
b785e26
Update freenet scripts
EmilioFigueras Jul 30, 2020
20dae9a
Update freenet scripts
EmilioFigueras Jul 31, 2020
a3be2c5
Update freenet scripts
EmilioFigueras Jul 31, 2020
5531725
minor changes to logs
EmilioFigueras Jul 31, 2020
f70080c
Update freenet scripts
EmilioFigueras Aug 3, 2020
f1a69b6
Bug fixes and settings changes.
EmilioFigueras Aug 4, 2020
52021b3
Changed nomenclature from I2P to Darknet
EmilioFigueras Aug 8, 2020
0c444ef
Fixed a minor bug with freesite urls
EmilioFigueras Aug 11, 2020
1904698
Seed list updated
EmilioFigueras Aug 17, 2020
2867278
pull updated
EmilioFigueras Aug 17, 2020
f771d4b
gitignore updated
EmilioFigueras Aug 17, 2020
e36ba04
Remove UUID
EmilioFigueras Aug 17, 2020
7774e2c
Code analysis performed
EmilioFigueras Aug 20, 2020
ea9ad72
Database deleted
EmilioFigueras Aug 21, 2020
fb11908
Notebook updated
EmilioFigueras Aug 30, 2020
3374e6c
Config.
robertomagan Dec 29, 2020
f3bbbc6
README update
robertomagan Dec 29, 2020
954cfb5
README update
robertomagan Dec 29, 2020
91c8368
Update
EmilioFigueras Jul 25, 2021
417faa9
Updated the name of the configuration parameters
EmilioFigueras Jul 25, 2021
c68adb7
Add files via upload
EmilioFigueras Sep 14, 2021
ec3edc1
Delete c4darknet_modules2.pdf
EmilioFigueras Sep 14, 2021
8ec932c
Add files via upload
EmilioFigueras Sep 14, 2021
b9d00e2
Update README.md
EmilioFigueras Sep 14, 2021
bebff25
Merge branch 'master' into master
EmilioFigueras Sep 17, 2021
7080a9b
Merge pull request #1 from EmilioFigueras/master
nesg-ugr Sep 18, 2021
5abb0f0
Update README.md
EmilioFigueras Sep 19, 2021
a02b0ab
Merge pull request #2 from EmilioFigueras/patch-1
EmilioFigueras Sep 19, 2021
df2d6d0
Update LICENSE
robertomagan Dec 28, 2021
7fd1e3c
Update README.md
robertomagan Jan 14, 2022
911f4ed
I added some comments ...
robertomagan Feb 16, 2023
ab318c3
TOR_Spider class
edx173 Mar 1, 2023
07cf8db
Adding the TOR proxy and comments
edx173 Mar 2, 2023
7bd6cbc
TOR_Spider, proxy configuration and discovertythread.py modifications…
edx173 May 10, 2023
c1f5525
New deployment scripts
robertomagan Dec 12, 2023
f8382e0
New path for remote script execution
robertomagan Dec 12, 2023
4024687
Database connections settings updated
robertomagan Dec 12, 2023
4da82b6
Configuration script updated
robertomagan Dec 12, 2023
b1598aa
Configuration script updated
robertomagan Dec 12, 2023
15bf738
Configuration script updated
robertomagan Dec 12, 2023
b104099
Deployment script and configuration files - Ready for crawling TOR sites
robertomagan Dec 20, 2023
7b23555
Homogenizing paths for crawling different darknets
robertomagan Dec 29, 2023
7a65f8e
Adding dummy floodfill_seeds.txt file for tor and freenet.
robertomagan Dec 29, 2023
8128557
Adding new i2p seeds and vms for the deployment
robertomagan Dec 29, 2023
a1b87ac
Adding language types for language detection engines.
robertomagan Dec 29, 2023
cba7525
Adding scripts for dumping and downloading BBDD after the experiments.
robertomagan Jan 3, 2024
9a1c104
Updating script
robertomagan Jan 3, 2024
f3dd859
Updating script
robertomagan Jan 3, 2024
7f18b2e
Setting up complete crawling scenario of 15 vms in total.
robertomagan Jan 3, 2024
61b1d58
Freenet seed - url quotes removed
robertomagan Jan 3, 2024
4ba7ea4
Experimental setup with only 2 machines crawling per darknet
robertomagan Apr 10, 2024
6 changes: 6 additions & 0 deletions .gitignore
100755 → 100644
@@ -107,3 +107,9 @@ venv.bak/
.mypy_cache/

/notes

#Data experiments
notebooks/data

#UUID
crawler/uuid.txt
2 changes: 1 addition & 1 deletion LICENSE
100755 → 100644
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2018 Alberto Abellán Galera
Copyright (c) 2020 Emilio Figueras Martín

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
162 changes: 92 additions & 70 deletions README.md
100755 → 100644
@@ -1,121 +1,143 @@
# I2P CRAWLER
# c4darknet: Crawl for Darknet

This tool enables crawling on the I2P Darknet.
HTTP crawling tool for darknet sites.

## How to install

You can launch the crawler on Windows 10 and Ubuntu Linux (from version 16.04) systems.

In both systems, just download or clone this repository:

```
git clone https://github.com/Abellan09/i2p_crawler
```
<div align="center">
<img src="c4darknet_modules.png" alt="c4darknet functional modules" width="70%"/>
</div>

Then, you have to install/configure some things:
Although it was originally conceived to be used for the I2P anonymous network,
this tool can also be used for crawling some other HTTP-based web sites
like those found in TOR, Freenet and/or the surface web.

First of all, it is necessary to install an instance of I2P.
Second, you need Python 2.7 and Scrapy.
It is recommended to install DB Browser for SQLite to manage the database easily.
Last, you have to create the "ongoing" and "finished" directories.
The crawler automatically extracts links to other darknet sites, thus getting an overall
view of the darknet site inter-connections and some other useful information.

### Windows
To function in a darknet, it is necessary to implement access to it in
crawler/darknet/spiders/spider.py using the functions of spiderBase.py (a minimal
illustration follows below). Currently, spiders for I2P and Freenet are implemented.
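
As an orientation only, here is a minimal, hypothetical sketch of the general idea: a Scrapy spider that routes its requests through a local darknet HTTP proxy. It does not reflect the actual spiderBase.py interface, and the proxy address and start URL are assumptions (I2P's HTTP proxy usually listens on 127.0.0.1:4444, while TOR needs an HTTP-to-SOCKS bridge such as Privoxy).

```
import scrapy


class MinimalDarknetSpider(scrapy.Spider):
    """Toy spider that sends every request through a local darknet HTTP proxy."""

    name = "minimal_darknet"
    # Assumed defaults: I2P HTTP proxy on 127.0.0.1:4444 (use e.g. Privoxy for TOR).
    proxy = "http://127.0.0.1:4444"
    start_urls = ["http://stats.i2p/"]  # illustrative seed

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's HttpProxyMiddleware honours the 'proxy' key in request.meta.
            yield scrapy.Request(url, callback=self.parse, meta={"proxy": self.proxy})

    def parse(self, response):
        # Keep only the page title and outgoing links, similar in spirit to how
        # the crawler builds its darknet connectivity view.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "links": response.css("a::attr(href)").getall(),
        }
```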

1) I2P.

Download (and execute) the installer from [I2P](https://geti2p.net/es/download).

2) Python and Scrapy
## How to install

Download (and execute) the installer from [Python](https://www.python.org/downloads).
Then, to install scrapy: ```pip install scrapy```
#### Requirements

3) DB Browser for SQLite.
The crawler relies on an adequate environment to run. The mandatory elements
are:

Download (and execute) the installer from [SQLite Browser](https://sqlitebrowser.org).
- Linux **Ubuntu 16.04** and above (it can also run on older versions)
- **I2P router** (latest version), **FProxy** or an appropriate darknet proxy
- **MySQL 5.7**, though some other DBMS can be used, like SQLite
- **Python 3.7** environment (+ dependencies found in requirements.txt)

4) Ongoing and finished directories.
#### Installation steps
As a Python-based tool, we recommend using virtual environments. In the following, we are going
to use conda (https://www.anaconda.com) to create and manage Python environments.

Go to the root of the cloned project.
Change directory to ~/spiders and create the directories inside it.
**Database**
1) Download and install a database management system. We chose MySQL, but others can be used.

```
cd /i2p_crawler/crawler/i2p/i2p/spiders
mkdir ongoing
mkdir finished
sudo apt install -y mysql-server-5.7 mysql-client-5.7
```

### Linux
2) Creating schema and users.

1) I2P.
The schema ```i2p_database```, user ```i2p``` and password ```password``` will be created after
executing the following commands.

```
sudo apt-add-repository ppa:i2p-maintainers/i2p
sudo apt-get update
sudo apt-get install i2p
$ sudo mysql
mysql> create database i2p_database;
mysql> create user 'i2p'@'localhost' identified by 'password';
mysql> grant all privileges on `i2p_database`.* to 'i2p'@'localhost';
mysql> quit;
```

2) Python and Scrapy
**Python environment and dependencies**

1) Creating a virtual environment.
```
sudo apt install python2.7
sudo apt install python-pip
sudo pip install scrapy
$ conda create -n py37 python=3.7
$ conda activate py37
(py37) $
```

3) DB Browser for SQLite.

2) Installing python dependencies.
```
sudo add-apt-repository -y ppa:linuxgndu/sqlitebrowser
sudo apt-get update
sudo apt-get install sqlitebrowser
(py37) $ cd <root_project_folder>/crawler/
(py37) $ pip install -r requirements.txt
```

4) Ongoing and finished directories.
3) Database access from Python.

We use Pony ORM for the data persistence layer, so the database connection must be configured.
Please edit the corresponding line in ```connection_settings.py```, which is located
in ```<root_project_folder>/crawler/database/```.
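
As a minimal sketch (assuming the MySQL schema and credentials created above, not the project's actual file contents), a Pony ORM binding could look like this; the variable names in the real ```connection_settings.py``` may differ.

```
# connection_settings.py -- illustrative sketch; the real file's contents may differ
from pony.orm import Database

db = Database()

# Bind Pony ORM to the MySQL schema and credentials created in the previous steps.
db.bind(
    provider="mysql",
    host="localhost",
    user="i2p",
    passwd="password",
    db="i2p_database",
)
```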

### Crawling
Now it is time to crawl the darknet. Every time you want to start a new crawling procedure,
we recommend following the steps below.

1) Database population.

We recommend dropping and re-creating the schema before running the crawler, for a clean and fresh run.

```
cd /i2p_crawler/crawler/i2p/i2p/spiders
mkdir ongoing
mkdir finished
$ sudo mysql
mysql> drop database i2p_database;
mysql> create database i2p_database;
mysql> quit;
```

## Usage example

First, you have to start an instance of I2P (it is recommended to keep the instance active for as long as possible, for better results).
```
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python populate.py
```

In Windows, just click on the "Start I2P" button; in Linux, start the service with ```i2prouter start```
2) Spiders crawling output.

Then, go to the directory ~i2p_crawler/crawler/i2p with ```cd ~/i2p_crawler/crawler/i2p/``` and run the crawler:
Spiders output JSON files to specific folders, so those folders should already exist.
If they do not, please create them. For a clean and fresh run, delete all files in those folders.

```
python manager.py
(py37) $ cd <root_project_folder>/crawler/darknet/spiders/
(py37) $ mkdir finished ongoing
```

The script "manager.py" will try to crawl the entire I2P network (all the eepsites it finds). This can take a very long time (difficult to estimate).
If you prefer to crawl only one eepsite, run the spider "spider.py" as follows:
3) Supervising crawling procedure: log.

In order to supervise the crawling procedure, a log file is created in a specific folder.
If the "logs" folder does not exist, please create it. For a clean and fresh run, delete this file.

```
scrapy crawl i2p -a url=URL -o OUTPUT.json
(py37) $ cd <root_project_folder>/
(py37) $ mkdir logs
```

Where "URL" is the URL of the eepsite you want to crawl and "OUTPUT" is the name of the file where the crawling results will be stored.
For example:
4) Starting the crawling process.


```
scrapy crawl i2p -a url=http://eepsite.example.i2p -o output_example.json
(py37) $ cd <root_project_folder>/crawler/
(py37) $ python manager.py &> /dev/null
```

## Built With
If you want to supervise the crawling procedure, please see
```<root_project_folder>/logs/darknetcrawler.log```. Also, more information is stored in
the database.


* [Python](https://www.python.org) - Used language.
* [Scrapy](https://scrapy.org) - Used crawling framework.
*Note:* The crawling procedure outputs tons of logs and information on standard output, so we recommend
launching the crawler appending ```&> /dev/null```, but it is up to the user.

## Author
## Authors

* **Alberto Abellán**
* **Roberto Magán-Carrión**
* **Alberto Abellán-Galera**
* **Gabriel Maciá-Fernández**
* **Emilio Figueras Martín**

See also the list of [contributors](https://github.com/Abellan09/i2p_crawler/graphs/contributors) who participated in this project.
See also the list of [contributors](https://github.com/EmilioFigueras/c4darknet/graphs/contributors) who participated in this project.

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
Binary file added c4darknet_modules.png
Binary file added c4i2p_modules.png
File renamed without changes.
39 changes: 26 additions & 13 deletions crawler/i2p/i2p/settings.py → crawler/darknet/darknetsettings.py
100755 → 100644
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-

# Scrapy settings for i2p project
# Scrapy settings for darknet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. More settings and their documentation in:
@@ -9,13 +9,15 @@
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'i2p'
import settings

SPIDER_MODULES = ['i2p.spiders']
NEWSPIDER_MODULE = 'i2p.spiders'
BOT_NAME = 'darknet'

SPIDER_MODULES = ['darknet.spiders']
NEWSPIDER_MODULE = 'darknet.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'i2p (+http://www.yourdomain.com)'
#USER_AGENT = 'darknet (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
@@ -27,8 +29,8 @@
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
DOWNLOAD_TIMEOUT = 30 # 30s
RETRY_TIMES = 2
DOWNLOAD_TIMEOUT = settings.HTTP_TIMEOUT
RETRY_TIMES = settings.MAX_CRAWLING_ATTEMPTS_ON_ERROR
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
@@ -48,14 +50,14 @@
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'i2p.middlewares.I2PSpiderMiddleware': 543,
# 'darknet.middlewares.DarknetSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'i2p.middlewares.I2PProxyMiddleware': 200,
'i2p.middlewares.I2PFilterMiddleware': 300,
'darknet.middlewares.DarknetProxyMiddleware': 200,
'darknet.middlewares.DarknetFilterMiddleware': 300,
}

# Enable or disable extensions
@@ -66,9 +68,9 @@

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'i2p.pipelines.I2PPipeline': 300,
#}
ITEM_PIPELINES = {
'darknet.pipelines.DarknetPipeline': 300,
}

# The maximum depth that will be allowed to crawl for any site:
DEPTH_LIMIT = 3
@@ -80,3 +82,14 @@
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_STORAGES_BASE = {
# '': 'darknet.exportutils.CustomFileFeedStorage',
# 'file': 'darknet.exportutils.CustomFileFeedStorage'
#}

# CUSTOM CONFIGURATION
PATH_ONGOING_SPIDERS = "darknet/spiders/ongoing/"
PATH_FINISHED_SPIDERS = "darknet/spiders/finished/"
PATH_LOG = '../logs/'
PATH_DATA = '../data/'
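
The refactored settings above import a project-level ```settings``` module and read ```HTTP_TIMEOUT``` and ```MAX_CRAWLING_ATTEMPTS_ON_ERROR``` from it. A hedged sketch of such a module follows; only the constant names come from the diff, and the values are assumptions based on the previously hard-coded defaults (30 s timeout, 2 retries).

```
# settings.py -- project-wide crawler constants (illustrative sketch; values assumed)

# Seconds to wait for an HTTP response before the request times out.
HTTP_TIMEOUT = 30

# Retries for a failed request before the site is flagged as ERROR.
MAX_CRAWLING_ATTEMPTS_ON_ERROR = 2
```
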
2 changes: 1 addition & 1 deletion crawler/i2p/i2p/items.html → crawler/darknet/items.html
100755 → 100644
@@ -9,7 +9,7 @@
<td valign=bottom>&nbsp;<br>
<font color="#ffffff" face="helvetica, arial">&nbsp;<br><big><big><strong>items</strong></big></big></font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/items.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\items.py</a></font></td></tr></table>
><font color="#ffffff" face="helvetica, arial"><a href="i2p">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/items.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\items.py</a></font></td></tr></table>
<p><tt>#&nbsp;-*-&nbsp;coding:&nbsp;utf-8&nbsp;-*-</tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
17 changes: 10 additions & 7 deletions crawler/i2p/i2p/items.py → crawler/darknet/items.py
100755 → 100644
@@ -7,15 +7,18 @@

import scrapy

class I2P_spider_state(scrapy.Item):
class Darknet_spider_state(scrapy.Item):

'''
EN: Item that represents the state of the spider.
SP: Item que representa el estado del spider.
'''
eepsite = scrapy.Field()
'''

darksite = scrapy.Field()
visited_links = scrapy.Field()
non_visited_links = scrapy.Field()
language = scrapy.Field()
extracted_eepsites = scrapy.Field()
extracted_darksites = scrapy.Field()
total_darksite_pages = scrapy.Field()
title = scrapy.Field()
size_main_page = scrapy.Field()
main_page_tokenized_words = scrapy.Field()
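
For illustration only, the sketch below shows one way the renamed item could be populated from a crawl summary. The field names come from the diff above; the import path, helper function and ```summary``` dictionary keys are assumptions.

```
from darknet.items import Darknet_spider_state  # assumed import path


def build_state_item(summary):
    """Pack a (hypothetical) crawl-summary dict into a Darknet_spider_state item."""
    item = Darknet_spider_state()
    item["darksite"] = summary["site"]                        # e.g. "example.i2p"
    item["visited_links"] = summary["visited"]                # pages already crawled
    item["non_visited_links"] = summary["pending"]            # pages still queued
    item["language"] = summary["language"]                    # inferred main-page language
    item["extracted_darksites"] = summary["outgoing_sites"]   # other darknet sites linked
    item["total_darksite_pages"] = len(summary["visited"]) + len(summary["pending"])
    item["title"] = summary["title"]
    item["size_main_page"] = summary["main_page_size"]
    item["main_page_tokenized_words"] = summary["tokens"]
    return item
```
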
2 changes: 1 addition & 1 deletion crawler/i2p/i2p/middlewares.html → crawler/darknet/middlewares.html
100755 → 100644
@@ -9,7 +9,7 @@
<td valign=bottom>&nbsp;<br>
<font color="#ffffff" face="helvetica, arial">&nbsp;<br><big><big><strong>middlewares</strong></big></big></font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/middlewares.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\middlewares.py</a></font></td></tr></table>
><font color="#ffffff" face="helvetica, arial"><a href="i2p">index</a><br><a href="file:///C:/users/alberto/dropbox/universidad/5.%20tfg/tfg_crawler_i2p/crawler/i2p/i2p/middlewares.py">c:\users\alberto\dropbox\universidad\5. tfg\tfg_crawler_i2p\crawler\i2p\i2p\middlewares.py</a></font></td></tr></table>
<p><tt>#&nbsp;-*-&nbsp;coding:&nbsp;utf-8&nbsp;-*-</tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">