Releases: VIDA-NYU/ache

0.15.0

12 Mar 23:24
326f067

We are pleased to announce version 0.15.0 of ACHE Crawler!

This version includes several dependency updates and fixes a robots.txt serialization bug that only occurs when the robots.txt feature is enabled. Because of this fix, data from previous crawls that used robots.txt may not be backward compatible. We also plan to upgrade Elasticsearch support in the next version, so this may be the last version to support legacy Elasticsearch versions (e.g., <6.x).

Following is a detailed log of the changes since the last version:

  • Bump okhttp from 3.14.0 to 4.9.3
  • Bump jackson-* libraries from 2.13.1 to 2.13.3
  • Bump logback-classic from 1.2.9 to 1.2.11
  • Bump slf4j-api from 1.7.32 to 1.7.36
  • Bump RoaringBitmap from 0.9.23 to 0.9.27
  • Bump metrics-* libraries from 4.2.7 to 4.2.17
  • Bump aws-java-sdk-s3 from 1.12.131 to 1.12.225
  • Remove aws-java-sdk-s3 dependency from main project
  • Add support for Elasticsearch 7.x and 8.x indexing (#282)
  • Bump jetty-server from 9.4.44.v20210927 to 9.4.48.v20220622
  • Bump kryo-serializers from 0.42 to 0.43
  • Bump RoaringBitmap from 0.9.27 to 0.9.39
  • Bump tika-parsers from 1.18 to 1.28.4
  • Bump gradle-node-plugin to version 3.5.1 and node.js to 18.14.2
  • Migrate tests from jUnit 4 to 5
  • Migrate test assertions from Hamcrest to AssertJ
  • Bump org.apache.httpcomponents:httpclient from 4.5.13 to 4.5.14
  • Bump ch.qos.logback:logback-classic from 1.2.+ to 1.4.5
  • Fix robots.txt serialization bug
  • Bump jackson-* libraries from 2.13.3 to 2.14.2
  • Bump org.apache.commons:commons-lang3 from 3.4 to 3.12.0
  • Bump org.apache.commons:commons-compress from 1.21 to 1.22
  • Bump org.apache.kafka:kafka-clients from 3.2.0 to 3.4.0
  • Bump com.squareup.okhttp3:okhttp from 4.9.3 to 4.10.0

v0.14.0

07 Feb 02:26
0.14.0
8176952

We are pleased to announce version 0.14.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

  • Remove support for CDR 3.1 format in Kafka target repository
  • Move tools and memex packages to the ache-tools sub-project
  • Moved forked crawler-commons classes to a separate sub-project
  • Remove tika dependency from ache and crawler-commons sub-project
  • Synchronize crawler-commons/http-fetcher with the upstream library
  • Setup gradle build using GitHub Actions
  • Build docker image with multi-arch support (amd64, arm64)
  • Upgrade build to Gradle 7.3.3
  • Upgrade gradle-node-plugin to version 3.0.1
  • Upgrade ache-dashboard npm dependencies
  • Pin slf4j-api version to 1.7.32
  • Bump airline from 0.8 to 0.9
  • Bump aws-java-sdk-s3 from 1.12.129 to 1.12.131
  • Bump crawler-commons from 1.1 to 1.2
  • Bump com.github.kt3k.coveralls from 2.10.2 to 2.12.0
  • Bump commons-codec from 1.10 to 1.15
  • Bump commons-compress from 1.12 to 1.21
  • Bump commons-lang3 from 3.4 to 3.12.0
  • Bump commons-validator from 1.6 to 1.7
  • Bump guava from 20.0 to 23.0
  • Bump jetty-server from 9.3.6.v20151106 to 9.4.44.v20210927
  • Bump kryo from 4.0.0 to 4.0.2
  • Bump kafka-clients from 0.11.0.1 to 3.0.0
  • Bump logback-classic from 1.1.+ to 1.2.9
  • Bump mockito-core from 1.10.+ to 4.2.0
  • Bump npm from 6.14.10 to 8.3.0
  • Bump rocksdbjni from 6.2.2 to 6.25.3
  • Bump RoaringBitmap from 0.7.8 to 0.9.23
  • Bump smile-core from 1.5.0 to 1.5.3
  • Bump lucene-analyzers-common from 7.3.1 to 8.10.1
  • Bump webarchive-commons from 1.1.8 to 1.1.9
  • Bump jsoup from 1.10.3 to 1.14.3
  • Bump junit from 4.12 to 4.13.2
  • Bump jackson-* libraries from 2.8.5 to 2.13.1
  • Bump metrics-* libraries from 3.1.3 to 4.2.7
  • Replace SparkJava framework (unmaintained) with Javalin 4.2.0
  • Add timeout configurations for the TOR fetcher
  • Update and improve the documentation
  • Change documentation theme to sphinx_material
  • Add support for HTTP BASIC auth for Elasticsearch data format

v0.13.0

07 Jan 20:46
ab6bf7f

We are pleased to announce version 0.13.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

  • Upgrade gradle-node-plugin to version 2.2.4
  • Upgrade gradle wrapper to version 6.6.1
  • Upgrade crawler-commons to version 1.1
  • Reorganized gradle module directory structure
  • Rename root package to achecrawler
  • Use multi-stage build to reduce Docker image size
  • Refactor Elasticsearch repository and make it wait until the server is ready
  • Upgrade npm dependencies

v0.12.0

18 Jan 17:58

We are pleased to announce version 0.12.0 of ACHE Crawler!

Following is a detailed log of the changes since the last version:

  • Upgrade crawler-commons dependency to version 0.9
  • Removed Elasticsearch transport-client-based repository
  • Removed Elasticsearch 1.4.4 binaries dependency
  • Added DumpDataFromElasticsearch tool for dumping documents from Elasticsearch
    repositories
  • Added configuration for minimum relevance in link selectors
  • Added configuration for selecting whether to re-crawl sitemaps and
    robots.txt links
  • Added documentation about relevance_threshold parameters to the target page
    classifiers documentation page
  • Added support for crawling via HTTP proxy in okhttp3 fetcher (by @maqzi)
  • Added tracking of more HTTP error messages (301, 302, 3xx, 402) (by @maqzi)
  • Upgrade crawler-commons library to version 1.0
  • Upgrade commons-validator library to version 1.6
  • Upgrade okhttp3 library to version 3.14.0
  • Fix issue #177: Links from recent TLDs are considered invalid
  • Upgrade RocksDB dependency (rocksdbjni) to version 6.2.2
  • Added error code details to RocksDB exception logs
  • Upgrade gradle-node-plugin to version 1.3.1
  • Upgrade npm version to 6.10.2
  • Upgrade ache-dashboard npm dependencies
  • Upgrade gradle wrapper to version 5.6.1
  • Update Dockerfile to use openjdk:11-jdk (Java 11)
  • Added content_type field to RegexTargetClassifier
  • Change default link classifier to LinkClassifierBreadthSearch
  • Update io.airlift:airline dependency to version 0.8
  • Update gradle build script to use new plugins DSL
  • Update coveralls gradle plugin to version 2.9.0
  • Update searchkit to version ^2.4.0

v0.11.0

01 Jun 18:56

We are pleased to announce version 0.11.0 of ACHE Crawler! Besides several technical improvements, we are really glad to announce the very first ACHE release under the Apache License 2 (APLv2).

Following is a detailed log of the major changes since the last version:

  • Removed dependency on Weka and reimplemented all machine-learning code using SMILE.
  • Added option to skip cross-validation on ache buildModel command
  • Added option to configure max number of features on ache buildModel command
  • Changed license from GNU GPL to Apache 2.0
  • Added tool (ache run ReplayCrawl) to replay old crawls using a new configuration file
  • Added near-duplicate page detection using min-hashing and LSH
  • Support ELASTIC format in Kafka data format (issue #155)
  • Upgrade react-scripts to get rid of vulnerable transitive dependency (hoek:4.2.0)
  • Upgrade npm version to 5.8.0 on gradle build script
  • Changed smile target page classifier to use Platt's scaling only when the
    parameter 'relevance_threshold' is provided in the pageclassifier.yml file.
  • Added Ansible scripts for automatic deployment
  • Added RocksDB-based target repository (RocksDBTargetRepository)
  • Fixed bug in ache-dashboard that prevented reloading search page on the browser
    page refresh (issue #163)
  • Support Elasticsearch 6.x (issue #158)
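
The near-duplicate detection item above names two techniques, min-hashing and LSH. The following is a minimal Python sketch of the general idea only, not ACHE's Java implementation: the function names, the 3-word shingles, the SHA-1-based hash family, and the 64-hash/16-band parameters are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """64-bit hash of item under the hash function identified by seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """Keep, for each seeded hash function, the minimum hash over all items."""
    return [min((_hash(seed, item) for item in items), default=0)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_band_keys(signature, bands=16):
    """Split the signature into bands; pages sharing any key are candidates."""
    rows = len(signature) // bands
    return [(band, tuple(signature[band * rows:(band + 1) * rows]))
            for band in range(bands)]


page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy dog near the river bank"
page_c = "completely unrelated sentence about crawler configuration and storage formats"

sig_a = minhash_signature(shingles(page_a))
sig_b = minhash_signature(shingles(page_b))
sig_c = minhash_signature(shingles(page_c))
```

In a sketch like this, only pages that share at least one band key need a full signature comparison, which is what keeps the lookup cheap as the number of crawled pages grows.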

v0.10.0

16 Jan 06:21

We are pleased to announce version 0.10.0 of ACHE Crawler! This release contains very important changes, which include support for running multiple crawlers in a single server (multi-tenancy), and the start of our migration to Apache License 2 (APLv2).

Following is a detailed log of the major changes since last version:

  • Upgraded gradle-node plugin to version 1.2.0
  • Removed BerkeleyDB dependency (issue #143)
  • Allow for running multiple crawlers in a single server (issue #103)
  • REST API endpoints modified to support multiple crawlers (issue #103)
  • Web interface modified to support multiple crawlers (issue #103)
  • Display more metrics in crawler monitoring page
  • Upgrade RocksDB (org.rocksdb:rocksdbjni) to version 5.8.7 (issue #142)
  • Upgraded build script plugin "gradle-node" to version 1.2.0
  • Upgraded javascript dependencies from crawler web-interface:
    • react to version 16.2.0
    • react-vis to version 1.7.9
    • searchkit to version 2.3.0
    • npm to version 5.6.0
  • Allow cookies to be modified dynamically via REST API endpoint (issue #114)
  • Added crawlerId field to JSON output of target repositories to track provenance of crawled pages

v0.9.0

07 Nov 19:47

We are pleased to announce version 0.9.0 of ACHE Focused Crawler! We also recently reached the milestone of 100+ stars on GitHub, 55+ forks, and 1000+ commits in the current git repository. We would like to thank all users for the feedback we have received in the past year.

This is a large release and it brings many improvements to the documentation and several new features. Following is a detailed log of major changes since last version:

  • Fixed multiple bugs and handling of exceptions
  • Several improvements made to ACHE documentation
  • Allow use of multiple data formats simultaneously (issue #92)
  • Added new data storage format using the standard WARC format (issue #64)
  • Added new data storage format using Apache Kafka (issue #123)
  • Re-crawling of sitemaps.xml files using fixed time intervals (issue #73)
  • Allow configuration of cookies in ache.yml (issue #81)
  • Allow configuration of full User-Agent string
  • Fixed memory issues that would cause OutOfMemoryError (issue #63)
  • Support for robots exclusion protocol a.k.a. robots.txt (issue #46)
  • Added new HTTP fetcher implementation using okhttp3 library with support for multiple SSL cipher suites
  • Non-HTML pages are no longer parsed as HTML
  • Training of new link classifiers (Online Learning) in a background thread (issue #76)
  • Added REST API endpoint to stop crawler
  • Added REST API endpoint to add new seeds to the crawl
  • Added documentation for the REST API
  • Persist run-time crawl metrics across crawler restarts (issue #101)
  • Added support for per-domain wildcard link filters (issue #121)
  • Add more detailed metrics for HTTP response codes (issue #120)
  • Changed referrer policies in the search dashboard for better security
  • Added various configuration options for timeouts in both fetcher implementations (issue #122)
  • Added support for Basic HTTP authentication in the web interface (issue #129)
  • Added REST API endpoints to support monitoring using Prometheus.io (issue #128)
  • Add page relevance metrics for better monitoring (issue #119)
  • Add parameters for elasticsearch index and type names through the /startCrawl REST API (issue #107)
  • Support for serving web interface from non-root path (issue #137)
  • Added button to stop crawler in web user interface (issue #139)
  • Upgraded searchkit library to 2.2.0 which supports Elasticsearch 5.x
  • Upgrade crawler-commons library to version 0.8

Note that there were breaking changes in some data formats:

  • Repositories for relevant and irrelevant pages are now stored in the same folder (or same Elasticsearch index) and page entries include new properties to identify pages as relevant or irrelevant according to the target page classifier output. Double check the data formats documentation page and make sure you make appropriate changes if needed.

v0.8.0

27 Apr 18:16

We are pleased to announce version 0.8.0 of ACHE Focused Crawler.

This release includes a more complete and reorganized documentation (available at http://ache.readthedocs.io/en/latest/) and a new REST API for real-time crawler monitoring.

Following is the detailed log of major changes since last version.

  • Added frontier load time metrics (issue #59)
  • Update some library versions on build.gradle
  • Update gradle wrapper to version 3.2.1
  • Added Dockerfile
  • Added connection timeouts to BingSearchAzureAPI
  • Changed seed finder to use SimpleHttpFetcher
  • Added option to configure a custom user agent string
  • Added option of not starting console reporter in MetricsManager
  • Change set_version script to work on MacOS
  • Updated test dependency (Jetty) to version 9.3.6
  • Rewrite all CLI programs using only airline library
  • Shutdown crawler and log errors on any error (any Throwable)
  • Simple WekaTargetClassifier refactoring
  • Added argument --seedsPath to specify the directory to store the seed file in SeedFinder command
  • Replaced the deprecated installApp by installDist gradle command in conda.recipe
  • Fixed type of links extracted from sitemaps
  • REST API for real-time metrics monitoring (issue #67)
  • Remove dependency on linkclassifier.features file from LinkClassifierBreadthSearch (issue #65)
  • Create an initial version of web-based crawler dashboard for visualization of system metrics (issue #68)
  • Avoid creating empty files when not necessary in FilesTargetRepository
  • Added Memex CDRv3 support
  • Added Elasticsearch indexer to AcheToCdrFileExporter and rename it to AcheToCdrExporter
  • Capture exceptions and retry on failures during ElasticSearch bulk indexing
  • Refactoring of TargetClassifierFactory
  • Added command annotation to MigrateToFilesTargetRepository tool
  • Added a simple in-memory duplicate detection tool
  • Added a new regex-based target classifier that matches multiple fields (issue #69)
  • Created an initial version of documentation using the documentation generation system Sphinx and published documentation online at http://ache.readthedocs.io/ (issue #66)
  • Added additional system descriptions and a scaffold for missing documentation (issue #66)
  • Added badge with link to documentation in README.md (issue #66)
  • Added an index to page-classifiers documentation page
  • Improved documentation on page classifiers
  • Added a tool to run a classifier over a file content
  • Adjusted regex matcher to use DOTALL mode (issue #69)
  • Rename test file correctly
  • Write a CSV with queries, classification result, and URLs (issue #71)
  • Moved SeedFinder documentation from wiki to Sphinx documentation
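
The DOTALL change listed above (issue #69) matters because, by default, `.` in a regular expression does not match line breaks, so a pattern can silently fail on multi-line page content. A small Python sketch of the effect follows; ACHE itself is written in Java, where the equivalent flag is `Pattern.DOTALL`, and the sample page text and pattern here are made up for illustration.

```python
import re

# Hypothetical page fragment whose title spans two lines.
page = "<title>Ebola\noutbreak</title>"

# Same pattern, compiled without and with DOTALL.
default_pattern = re.compile(r"<title>.*outbreak.*</title>")
dotall_pattern = re.compile(r"<title>.*outbreak.*</title>", re.DOTALL)

# Without DOTALL, ".*" stops at the newline, so the pattern cannot match.
default_match = default_pattern.search(page) is not None

# With DOTALL, "." also matches "\n", so the pattern matches across lines.
dotall_match = dotall_pattern.search(page) is not None
```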

v0.7.0

27 Nov 16:38

There have been more than 100 commits since the last release, 0.6.0, on July 8. Following are some of the improvements.

ACHE is now simpler to use and to configure:

  • Added more specific configuration samples for focused crawling and in-depth website crawling
  • Stopwords are now an optional parameter, and an embedded stopword list is used by default
  • Added utility tools for working with CDR (Common Data Repository) files
  • Added utility to print frontier links along with relevance scores
  • Added configuration for HTTP connection pool size

ACHE is faster: we fixed synchronization and parallelism issues, which improved crawler efficiency by 980% (a simple benchmark is available at #56).

ACHE is more resilient due to fixes for bugs related to:

  • Extraction of malformed URLs during HTML parsing
  • Failures due to handling of URLs with IPv4 addresses
  • Failure to train the link classifier for certain configuration values
  • Corruption of binary data improperly stored in strings

URL normalization was added for links extracted from web pages, so less duplicate content is fetched.
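
To illustrate why normalization reduces duplicate fetches, here is a minimal Python sketch. These notes do not describe which rules ACHE applies, so the rules below (lowercasing scheme and host, dropping default ports and fragments, using "/" for an empty path) and the name `normalize_url` are assumptions, not ACHE's actual behavior.

```python
from urllib.parse import urlsplit, urlunsplit

# Ports implied by the scheme; stating them explicitly adds no information.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical string.

    Simplifications in this sketch: user info is dropped and IPv6
    literals are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


seen = set()
for link in ("HTTP://Example.COM:80/a/b?x=1#top", "http://example.com/a/b?x=1"):
    seen.add(normalize_url(link))
```

Both links in the usage loop collapse to the same key, so a frontier that stores normalized URLs would schedule the page once.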

Cleaned log messages and added logging of structured data in CSV files regarding:

  • Download requests
  • Links selected to be downloaded

Added detailed software metrics that allow better monitoring and detection of problems. The added metrics include counts, 1-, 5-, and 15-minute rates, mean, median, and 75th, 95th, 98th, and 99th percentiles for:

  • URL fetch time
  • Download page processing time
  • Current download queue size
  • Current processing and pending downloads in queue
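
As a rough illustration of the summary statistics listed above, the Python sketch below computes count, mean, median, and the 75th, 95th, 98th, and 99th percentiles over a batch of samples. It is not how ACHE collects metrics: the nearest-rank percentile definition, the `summarize` helper, and the sample values are assumptions, and the 1-, 5-, and 15-minute rates are left out.

```python
import statistics


def percentile(ordered, p):
    """Nearest-rank percentile of an ascending list, for integer p in 1..100."""
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return count, mean, median, and selected percentiles of samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    summary = {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
    }
    for p in (75, 95, 98, 99):
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


# Hypothetical URL fetch times in milliseconds: 1, 2, ..., 100.
fetch_times_ms = list(range(1, 101))
fetch_summary = summarize(fetch_times_ms)
```

High percentiles are useful here because a handful of very slow fetches can hide behind a healthy-looking mean.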

ACHE has improved data management:

  • Added new page repository that stores multiple pages in rolling compressed files
  • Added a new alternative database backend based on Facebook's RocksDB key-value store that improves efficiency and JVM memory management.

Some stability problems were solved, such as:

  • Limiting the size of downloader thread-pool queues
  • Properly closing repository files during crawler shutdown
  • Avoiding starting the crawler shutdown multiple times

Other minor improvements, such as:

  • Migrated code base to Java 8
  • More refactoring, code cleaning, and tests (coverage 44%)

v0.6.0

09 Jul 02:38

We are pleased to announce version 0.6.0 of ACHE Focused Crawler. Here we list the major changes since last version.

New features, improvements and bug fixes:

  • Implementation of the SeedFinder algorithm, which leverages search engine APIs to automatically create a large and diverse seed URL set to bootstrap the crawler.
  • Added a flexible way to use different handlers for different types of links, which allows different extractors for each content type, such as HTML, media files, XML sitemaps, etc.
  • Support for the sitemap.xml protocol, which allows the crawler to automatically discover all links along with some metadata specified by webmasters.
  • More bug fixes and code refactoring.
  • More unit tests and integration tests (coverage raised to 42%)