Release v0.7.0 · VIDA-NYU/ache

There have been more than 100 commits since the last release, 0.6.0, on July 8. The following are some of the improvements.

ACHE is now simpler to use and to configure:

Added more specific configuration samples for focused crawling and in-depth website crawling
Stopwords are now an optional parameter, and an embedded stopword list is used by default
Added utility tools for working with CDR (Common Data Repository) files
Added utility to print frontier links along with relevance scores
Added configuration for HTTP connection pool size

ACHE is faster: we fixed synchronization and parallelism issues, which improved crawler efficiency by 980% (a simple benchmark is available at #56).

ACHE is more resilient thanks to fixes for bugs related to:

Extraction of malformed URLs during HTML parsing
Failures due to handling of URLs with IPv4 addresses
Failure to train the linking classifier for certain configuration values
Corruption of binary data improperly stored in strings
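
The release notes do not describe how the binary-data corruption occurred, so the following is only a minimal sketch of the general failure mode, not ACHE's actual code. It shows why decoding arbitrary bytes into a Java `String` is lossy, and one lossless alternative (Base64) for cases where a text representation is unavoidable. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of ACHE.
final class BinaryInStringDemo {

    // Lossy: byte sequences that are not valid UTF-8 are replaced
    // with U+FFFD while decoding, so the original bytes are lost.
    static byte[] roundTripThroughUtf8(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Lossless: Base64 maps every byte sequence to ASCII text and back.
    static byte[] roundTripThroughBase64(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // 0x89 and 0xFF are not valid on their own in UTF-8.
        byte[] data = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, 0x00};
        System.out.println(Arrays.equals(data, roundTripThroughUtf8(data)));   // false
        System.out.println(Arrays.equals(data, roundTripThroughBase64(data))); // true
    }
}
```

Keeping binary payloads as `byte[]` end to end avoids the problem entirely; Base64 is only needed when the storage format requires text.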

Added URL normalization for links extracted from web pages, so that less duplicate content is fetched
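
As an illustration of how normalization reduces duplicate fetches, here is a minimal sketch using only `java.net.URI`. The specific rules (lowercasing scheme and host, dropping default ports and fragments, resolving dot segments) are assumptions chosen for the example; the release notes do not list the rules ACHE applies, and the class name is hypothetical.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch; the rules below are assumptions, not ACHE's rule set.
final class UrlNormalizerSketch {

    static String normalize(String url) {
        // URI.normalize() resolves "." and ".." path segments.
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Absolute URL expected: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        if ((scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443)) {
            port = -1;
        }

        // An empty path and "/" refer to the same resource.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Rebuild without the fragment (and without user info),
        // since the fragment is never sent to the server.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }
}
```

With these rules, `HTTP://Example.COM:80/a/./b/../c?x=1#top` and `http://example.com/a/c?x=1` map to the same string, so a frontier keyed on the normalized form would fetch the page only once.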

Cleaned up log messages and added logging of structured data to CSV files for:

Download requests
Links selected to be downloaded

Added detailed software metrics that allow better monitoring and detection of problems. The added metrics include counts; 1-, 5-, and 15-minute rates; mean; median; and 75th, 95th, 98th, and 99th percentiles for:

URL fetch time
Download page processing time
Current download queue size
Current processing and pending downloads in queue
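
To make the percentile figures above concrete, the following is a small plain-Java sketch of how a mean and nearest-rank percentiles can be computed from recorded fetch times. It is an illustration only: the release notes do not say which metrics library ACHE uses or how it samples values, the time-windowed rates are not covered, and the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; not ACHE's metrics implementation.
final class FetchTimeStats {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    // Expects a non-empty array sorted in ascending order.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return (double) sum / samples.length;
    }

    public static void main(String[] args) {
        // Made-up fetch times in milliseconds, for illustration only.
        long[] fetchMillis = {120, 85, 430, 95, 2100, 110, 90, 150, 105, 98};
        long[] sorted = fetchMillis.clone();
        Arrays.sort(sorted);

        System.out.println("count  = " + sorted.length);
        System.out.println("mean   = " + mean(sorted));
        System.out.println("median = " + percentile(sorted, 50));
        System.out.println("p75    = " + percentile(sorted, 75));
        System.out.println("p99    = " + percentile(sorted, 99));
    }
}
```

High percentiles are useful for fetch times because a few slow hosts can dominate the tail while leaving the median almost unchanged.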

ACHE has improved data management:

Added new page repository that stores multiple pages in rolling compressed files
Added a new alternative database backend based on Facebook's RocksDB key-value store that improves efficiency and JVM memory management.

Some stability problems were solved, such as:

Limiting the size of downloader thread-pool queues
Properly closing repository files during crawler shutdown
Avoiding starting the crawler shutdown multiple times
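
Two of the stability fixes above follow well-known patterns from `java.util.concurrent`, sketched below under stated assumptions. A bounded work queue keeps pending downloads from growing without limit, and an atomic flag ensures the shutdown sequence runs at most once. The class name, the choice of `CallerRunsPolicy` for back-pressure, and the timeout are all hypothetical; this is not ACHE's downloader.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not ACHE's downloader.
final class BoundedDownloaderSketch {

    private final ThreadPoolExecutor pool;
    private final AtomicBoolean shutdownStarted = new AtomicBoolean(false);

    BoundedDownloaderSketch(int threads, int maxQueued) {
        // The bounded queue caps memory used by pending tasks. When it is
        // full, CallerRunsPolicy makes the submitting thread run the task
        // itself, which slows down producers instead of queueing forever.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueued),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    void submit(Runnable task) {
        pool.execute(task);
    }

    // Returns true only for the call that actually performed the shutdown.
    boolean shutdown() {
        // compareAndSet succeeds for exactly one caller, so repeated or
        // concurrent shutdown requests become no-ops.
        if (!shutdownStarted.compareAndSet(false, true)) {
            return false;
        }
        pool.shutdown(); // already queued tasks are still executed
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true;
    }
}
```

Resources such as repository files would be closed inside the guarded section of `shutdown()`, after the pool has drained, so they are released exactly once.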

Other minor improvements, such as:

Migrated code base to Java 8
More refactoring, code cleaning, and tests (coverage 44%)
