v0.7.0

@aecio aecio released this 27 Nov 16:38
· 618 commits to master since this release

There have been more than 100 commits since the last release, 0.6.0, on July 8. The following are some of the improvements.

ACHE is now simpler to use and to configure:

  • Added more specific configuration samples for focused crawling and in-depth website crawling
  • Stopwords are now an optional parameter, and an embedded stopword list is used by default
  • Added utility tools for working with CDR (Common Data Repository) files
  • Added utility to print frontier links along with relevance scores
  • Added configuration for HTTP connection pool size

ACHE is faster: we fixed synchronization and parallelism issues, which improved crawler efficiency by 980% (a simple benchmark is available at #56).

ACHE is more resilient thanks to fixes for bugs related to:

  • Extraction of malformed URLs during HTML parsing
  • Failures due to handling of URLs with IPv4 addresses
  • Failure to train the linking classifier for certain configuration values
  • Corruption of binary data improperly stored in strings
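The last item is a common pitfall in Java code, so a generic illustration may help. The release notes do not say how ACHE fixed it; the class and method names below are hypothetical, and Base64 is only one possible way to carry bytes through a text field. The sketch shows that decoding arbitrary bytes as UTF-8 and re-encoding them does not return the original bytes, while a Base64 round trip does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration, not ACHE code.
final class BinaryInStringSketch {

    // Decodes raw bytes as UTF-8 text and encodes them again.
    // Byte sequences that are not valid UTF-8 are replaced with U+FFFD,
    // so the returned array can differ from the input.
    static byte[] lossyRoundTrip(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes raw bytes as Base64 text and decodes them again.
    // Base64 maps every byte sequence to ASCII text and back without loss.
    static byte[] base64RoundTrip(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(text);
    }
}
```

When the data never needs to be text, keeping it as a `byte[]` end to end avoids the problem entirely.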

URL normalization was added for links extracted from web pages, so less duplicate content is fetched.
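To show why normalization reduces duplicate fetches, here is a minimal sketch built on `java.net.URI`. It is not ACHE's implementation, and the exact rules ACHE applies are not listed in these notes. The class name and the chosen rules (lowercase scheme and host, drop default ports, resolve dot segments, drop the fragment, use `/` for an empty path) are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration, not ACHE code.
final class UrlNormalizerSketch {

    // Returns a canonical form of an absolute http(s) URL so that
    // equivalent spellings of the same address compare equal.
    static String normalize(String url) {
        URI uri = URI.create(url).normalize(); // resolves "." and ".." path segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop ports that are the default for the scheme.
        int port = uri.getPort();
        if (("http".equals(scheme) && port == 80) || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }

        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder out = new StringBuilder();
        out.append(scheme).append("://").append(host);
        if (port != -1) {
            out.append(':').append(port);
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // The fragment is intentionally omitted: it is not sent to the server.
        return out.toString();
    }
}
```

With these rules, `HTTP://Example.COM:80/a/./b/../c?x=1#frag` and `http://example.com/a/c?x=1` map to the same string, so a frontier keyed on the normalized form would schedule the page only once.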

Cleaned log messages and added logging of structured data in CSV files regarding:

  • Download requests
  • Links selected to be downloaded

Added detailed software metrics that allow better monitoring and detection of problems. The added metrics include counts; 1-, 5-, and 15-minute rates; mean; median; and 75th, 95th, 98th, and 99th percentiles for:

  • URL fetch time
  • Download page processing time
  • Current download queue size
  • Current processing and pending downloads in queue
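For readers unfamiliar with percentile metrics, the sketch below computes a percentile from a set of recorded values using the nearest-rank method. It is only an illustration: the class name is hypothetical, and these notes do not specify how ACHE or its metrics library samples values or estimates percentiles.

```java
import java.util.Arrays;

// Hypothetical illustration, not ACHE code.
final class LatencyPercentilesSketch {

    // Nearest-rank percentile: the smallest recorded value such that at least
    // p percent of all recorded values are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples recorded");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

A 99th-percentile fetch time far above the median usually points to a small number of slow hosts rather than a general slowdown, which is why the tail percentiles are reported alongside the mean.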

ACHE has improved data management:

  • Added new page repository that stores multiple pages in rolling compressed files
  • Added a new alternative database backend based on Facebook's RocksDB key-value store that improves efficiency and JVM memory management

Some stability problems were solved, such as:

  • Limiting the size of downloader thread-pool queues
  • Properly closing repository files during crawler shutdown
  • Avoiding starting the crawler shutdown multiple times
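The first item can be illustrated with the standard `java.util.concurrent` API. The sketch below is an assumption about the general technique, not ACHE's actual downloader code: the class and method names are hypothetical, and the rejection policy ACHE uses is not stated in these notes. An unbounded work queue lets pending tasks pile up until memory runs out, whereas a bounded queue makes the overload visible to the submitter.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not ACHE code.
final class BoundedDownloaderPoolSketch {

    // Creates a fixed-size pool whose work queue holds at most queueCapacity
    // pending tasks. When all threads are busy and the queue is full, execute()
    // throws RejectedExecutionException instead of queueing without limit.
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,               // core and maximum pool size
                0L, TimeUnit.MILLISECONDS,      // no keep-alive for extra threads
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }
}
```

A caller can catch the rejection and retry later, or swap in `ThreadPoolExecutor.CallerRunsPolicy` so that the submitting thread runs the task itself and slows down as a result.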

Other minor improvements, such as:

  • Migrated code base to Java 8
  • More refactoring, code cleaning, and tests (coverage 44%)