v0.9.0
We are pleased to announce version 0.9.0 of the ACHE Focused Crawler! We also recently reached the milestones of 100+ stars on GitHub, 55+ forks, and 1,000+ commits in the current git repository. We would like to thank all users for the feedback we have received in the past year.
This is a large release: it brings many improvements to the documentation and several new features. What follows is a detailed log of the major changes since the last version:
- Fixed multiple bugs and improved exception handling
- Several improvements made to ACHE documentation
- Allow use of multiple data formats simultaneously (issue #92)
- Added new data storage format using the standard WARC format (issue #64)
- Added new data storage format using Apache Kafka (issue #123)
- Re-crawling of sitemaps.xml files at fixed time intervals (issue #73)
- Allow configuration of cookies in ache.yml (issue #81)
- Allow configuration of full User-Agent string
- Fixed memory issues that would cause OutOfMemoryError (issue #63)
- Support for robots exclusion protocol a.k.a. robots.txt (issue #46)
- Added new HTTP fetcher implementation using the okhttp3 library, with support for multiple SSL cipher suites
- Non-HTML pages are no longer parsed as HTML
- Training of new link classifiers (Online Learning) in a background thread (issue #76)
- Added REST API endpoint to stop crawler
- Added REST API endpoint to add new seeds to the crawl
- Added documentation for the REST API
- Persist run-time crawl metrics across crawler restarts (issue #101)
- Added support for per-domain wildcard link filters (issue #121)
- Added more detailed metrics for HTTP response codes (issue #120)
- Changed referrer policies in the search dashboard for better security
- Added various configuration options for timeouts in both fetcher implementations (issue #122)
- Added support for Basic HTTP authentication in the web interface (issue #129)
- Added REST API endpoints to support monitoring using Prometheus.io (issue #128)
- Added page relevance metrics for better monitoring (issue #119)
- Added parameters for the Elasticsearch index and type names to the /startCrawl REST API (issue #107)
- Support for serving the web interface from a non-root path (issue #137)
- Added a button to stop the crawler in the web user interface (issue #139)
- Upgraded the searchkit library to 2.2.0, which supports Elasticsearch 5.x
- Upgraded the crawler-commons library to version 0.8
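Since the new monitoring endpoints expose metrics in the Prometheus format (issue #128), a standard Prometheus scrape configuration is enough to collect them. The sketch below is illustrative only: the target host/port and the metrics path are assumptions, so check your ACHE deployment and the REST API documentation for the actual values.

```yaml
# Illustrative Prometheus scrape configuration for ACHE metrics.
# The target host/port and metrics_path are assumptions -- adjust
# them to match your ACHE deployment and the REST API docs.
scrape_configs:
  - job_name: 'ache'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8080']
```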
Notice: there were breaking changes in some data formats:
- Repositories for relevant and irrelevant pages are now stored in the same folder (or the same Elasticsearch index), and page entries include new properties that identify pages as relevant or irrelevant according to the target page classifier output. Double-check the data formats documentation page and make the appropriate changes if needed.