Generic Web Crawler in Java

Introduction

A simple, generic web crawler that downloads emails from a website.

  • Crawling starts from the given start point.
  • A page is identified as an email on the basis of a filter: it must contain "From", "Date", and "Subject" to be considered an email (see the sketch after this list).
  • Emails are downloaded to a user-provided path.
  • If a page redirects to another domain or outside of the parent location, it is not picked up for crawling.
  • The program can survive internet connection loss and resume from the last run.
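
A minimal Java sketch of the email filter and the same-domain check described above; the class and method names are illustrative, not the project's actual API:

```java
import java.net.URI;

// Illustrative sketch only; the real project may structure these checks differently.
public class PageFilters {

    // A page is treated as an email only if it contains all three markers.
    public static boolean looksLikeEmail(String pageContent) {
        return pageContent.contains("From")
                && pageContent.contains("Date")
                && pageContent.contains("Subject");
    }

    // Skip pages that resolve to another domain or outside the parent location.
    public static boolean isWithinParent(URI startPoint, URI candidate) {
        return candidate.getHost() != null
                && candidate.getHost().equalsIgnoreCase(startPoint.getHost())
                && candidate.getPath().startsWith(startPoint.getPath());
    }
}
```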

How to crawl

  • Check out the project, compile it, and create a jar.
  • Run the jar with two arguments: the start point and the path to download emails to (see the example below).
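
The usage from the original instructions; the site URL and download path are placeholders to replace with your own:

```sh
# usage: java -jar <jar-name> <start-point> <path-to-download-emails>
java -jar advancedCrawler-0.0.1.one-jar.jar http://test-site.com /home/user/DownloadEmails
```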

Configuration

application.properties
## Input
* base.URL = URL to crawl for emails.
* download.directory = directory path to save emails in.
* recovery.dir = directory name for the backup files used to recover from internet connection loss and resume from the last run.


## Thread
* min.threads = minimum number of threads.
* max.threads = maximum number of threads.
* wait.time = delay time; the Invigilator waits for 'wait.time' seconds and then checks the difference in the URL queue size.
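
A sample application.properties using these keys; the values are illustrative only (the URL and path reuse the example from "How to crawl"):

```properties
# Illustrative values; adjust for your environment.
base.URL=http://test-site.com
download.directory=/home/user/DownloadEmails
recovery.dir=recovery
min.threads=2
max.threads=10
wait.time=30
```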

log4j.properties
* log4j.appender.file.File = path of the file to save the log to.
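
A minimal log4j 1.x configuration around that key; everything other than log4j.appender.file.File is standard file-appender boilerplate assumed here, and the log path is illustrative:

```properties
# Minimal file-appender setup; the log path is illustrative.
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/home/user/crawler.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
```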

Class Diagram

(class diagram image)

Flow Diagrams

Main Flow

(main flow diagram image)

URLProcessor Manager Flow

(URLProcessor manager flow diagram image)

URLProcessor Worker Flow

(URLProcessor worker flow diagram image)

Invigilator Flow

(Invigilator flow diagram image)

Recovery Manager Flow

(recovery manager flow diagram image)

Recovery Worker Flow

(recovery worker flow diagram image)
