v0.8.0

@aecio aecio released this 27 Apr 18:16

We are pleased to announce version 0.8.0 of ACHE Focused Crawler.

This release includes more complete, reorganized documentation (available at http://ache.readthedocs.io/en/latest/) and a new REST API for real-time crawler monitoring.

The following is a detailed log of the major changes since the last version.

  • Added frontier load time metrics (issue #59)
  • Updated some library versions in build.gradle
  • Updated the Gradle wrapper to version 3.2.1
  • Added Dockerfile
  • Added connection timeouts to BingSearchAzureAPI
  • Changed seed finder to use SimpleHttpFetcher
  • Added option to configure a custom user agent string
  • Added option to not start the console reporter in MetricsManager
  • Changed the set_version script to work on macOS
  • Updated test dependency (Jetty) to version 9.3.6
  • Rewrote all CLI programs to use only the airline library
  • Shut down the crawler and log the error on any failure (any Throwable)
  • Simple WekaTargetClassifier refactoring
  • Added argument --seedsPath to the SeedFinder command to specify the directory in which the seed file is stored
  • Replaced the deprecated installApp Gradle task with installDist in conda.recipe
  • Fixed type of links extracted from sitemaps
  • REST API for real-time metrics monitoring (issue #67)
  • Removed dependency on the linkclassifier.features file from LinkClassifierBreadthSearch (issue #65)
  • Created an initial version of a web-based crawler dashboard for visualizing system metrics (issue #68)
  • Avoid creating empty files when not necessary in FilesTargetRepository
  • Added Memex CDRv3 support
  • Added Elasticsearch indexer to AcheToCdrFileExporter and renamed it to AcheToCdrExporter
  • Capture exceptions and retry on failures during Elasticsearch bulk indexing
  • Refactoring of TargetClassifierFactory
  • Added command annotation to MigrateToFilesTargetRepository tool
  • Added a simple in-memory duplicate detection tool
  • Added a new regex-based target classifier that matches multiple fields (issue #69)
  • Created an initial version of the documentation using Sphinx and published it online at http://ache.readthedocs.io/ (issue #66)
  • Added additional system descriptions and a scaffold for missing documentation (issue #66)
  • Added badge with link to documentation in README.md (issue #66)
  • Added an index to page-classifiers documentation page
  • Improved documentation on page classifiers
  • Added a tool to run a classifier over the contents of a file
  • Adjusted regex matcher to use DOTALL mode (issue #69)
  • Rename test file correctly
  • Write a CSV file with queries, classification results, and URLs (issue #71)
  • Moved SeedFinder documentation from wiki to Sphinx documentation
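
As an illustration of why the regex matcher change to DOTALL mode (issue #69) matters for multi-line page content, here is a minimal Java sketch. It is not ACHE's actual code: the class name DotallDemo, the helper matches, and the sample page text are hypothetical, and only the standard java.util.regex API is used. Without DOTALL, "." does not match line terminators, so a pattern that has to span several lines of a page fails to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of ACHE.
public class DotallDemo {

    // Returns true if the whole text matches the regex,
    // optionally compiling the pattern with Pattern.DOTALL.
    static boolean matches(String regex, String text, boolean dotall) {
        int flags = dotall ? Pattern.DOTALL : 0;
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        // Made-up page content containing a line break.
        String page = "<title>Ebola</title>\n<body>outbreak report</body>";
        String regex = ".*Ebola.*outbreak.*";

        // Default mode: '.' stops at the newline, so the match fails.
        System.out.println(matches(regex, page, false)); // prints false

        // DOTALL mode: '.' also matches the newline, so the match succeeds.
        System.out.println(matches(regex, page, true));  // prints true
    }
}
```

The same effect can be obtained by prefixing a pattern with the embedded flag (?s), which is standard java.util.regex syntax.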