v0.8.0
We are pleased to announce version 0.8.0 of the ACHE Focused Crawler.
This release includes more complete, reorganized documentation (available at http://ache.readthedocs.io/en/latest/) and a new REST API for real-time crawler monitoring.
The following is a detailed log of the major changes since the last version.
- Added frontier load time metrics (issue #59)
- Updated some library versions in build.gradle
- Updated Gradle wrapper to version 3.2.1
- Added Dockerfile
- Added connection timeouts to BingSearchAzureAPI
- Changed seed finder to use SimpleHttpFetcher
- Added option to configure a custom user agent string
- Added option to not start the console reporter in MetricsManager
- Changed set_version script to work on macOS
- Updated test dependency (Jetty) to version 9.3.6
- Rewrote all CLI programs to use only the airline library
- Shut down the crawler and log the error on any failure (any Throwable)
- Minor refactoring of WekaTargetClassifier
- Added --seedsPath argument to the SeedFinder command to specify the directory where the seed file is stored
- Replaced the deprecated installApp Gradle command with installDist in conda.recipe
- Fixed type of links extracted from sitemaps
- Added REST API for real-time metrics monitoring (issue #67)
- Removed dependency on linkclassifier.features file from LinkClassifierBreadthSearch (issue #65)
- Created an initial version of a web-based crawler dashboard for visualizing system metrics (issue #68)
- Avoided creating unnecessary empty files in FilesTargetRepository
- Added Memex CDRv3 support
- Added Elasticsearch indexer to AcheToCdrFileExporter and renamed it to AcheToCdrExporter
- Added exception handling and retries on failures during Elasticsearch bulk indexing
- Refactored TargetClassifierFactory
- Added command annotation to MigrateToFilesTargetRepository tool
- Added a simple in-memory duplicate detection tool
- Added a new regex-based target classifier that matches multiple fields (issue #69)
- Created an initial version of the documentation using Sphinx and published it online at http://ache.readthedocs.io/ (issue #66)
- Added additional system descriptions and a scaffold for missing documentation (issue #66)
- Added badge with link to documentation in README.md (issue #66)
- Added an index to page-classifiers documentation page
- Improved documentation on page classifiers
- Added a tool to run a classifier over the contents of a file
- Adjusted regex matcher to use DOTALL mode (issue #69)
- Corrected the name of a test file
- Added CSV output with queries, classification results, and URLs (issue #71)
- Moved SeedFinder documentation from wiki to Sphinx documentation
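
As an illustration of why the regex matcher change to DOTALL mode (issue #69) matters for multi-line page content, here is a minimal sketch using the standard java.util.regex API. It is not ACHE's actual matcher code: the class name DotallExample, the helper matchesWhole, and the sample page and pattern are all hypothetical and chosen only for demonstration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of ACHE.
public class DotallExample {

    // Returns true when the entire input matches the regex under the given flags.
    public static boolean matchesWhole(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Sample content spanning two lines (assumed for illustration).
        String page = "<title>Ebola</title>\n<body>outbreak news</body>";
        String regex = ".*ebola.*outbreak.*";
        int base = Pattern.CASE_INSENSITIVE;

        // Without DOTALL, '.' does not match line terminators,
        // so the pattern cannot span both lines.
        System.out.println(matchesWhole(regex, page, base));                   // false

        // With DOTALL, '.' also matches '\n', so the whole page matches.
        System.out.println(matchesWhole(regex, page, base | Pattern.DOTALL));  // true
    }
}
```

Without DOTALL, a pattern written against the full text of a page would typically have to account for line breaks explicitly, which is presumably what the change avoids.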