cc-extraction-framework

Java framework for extracting and processing isA-pairs from CommonCrawl data. It contains classes for distributed extraction, entity disambiguation of tuple entities, building local taxonomies from isA-pairs, and merging these into larger global taxonomies. Functionality for exporting tuples and taxonomy graphs is also included.
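
To illustrate the idea of building local taxonomies from isA-pairs and merging them into a global one, here is a minimal sketch using only the Java standard library. It is not this framework's API: the class `TaxonomySketch`, the methods `buildLocal` and `merge`, and the representation of a pair as a two-element array `{instance, class}` are all hypothetical names chosen for this example, and the merge is assumed to be a plain union of edges with no entity disambiguation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not a class from this framework.
public class TaxonomySketch {

    // Builds a local taxonomy: each class (hypernym) maps to its instances (hyponyms).
    // Assumption: every pair is a two-element array {instance, class}.
    public static Map<String, Set<String>> buildLocal(List<String[]> isAPairs) {
        Map<String, Set<String>> taxonomy = new LinkedHashMap<>();
        for (String[] pair : isAPairs) {
            taxonomy.computeIfAbsent(pair[1], k -> new TreeSet<>()).add(pair[0]);
        }
        return taxonomy;
    }

    // Merges local taxonomies into a global one by taking the union of
    // the instance sets for each class.
    public static Map<String, Set<String>> merge(List<Map<String, Set<String>>> locals) {
        Map<String, Set<String>> global = new LinkedHashMap<>();
        for (Map<String, Set<String>> local : locals) {
            for (Map.Entry<String, Set<String>> e : local.entrySet()) {
                global.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> a = buildLocal(List.of(
                new String[] {"berlin", "city"},
                new String[] {"paris", "city"}));
        Map<String, Set<String>> b = buildLocal(List.of(
                new String[] {"berlin", "capital"},
                new String[] {"rome", "city"}));
        // prints {city=[berlin, paris, rome], capital=[berlin]}
        System.out.println(merge(List.of(a, b)));
    }
}
```

The sketch keeps classes in insertion order and instances sorted so that the output is deterministic; a real merge would likely also need to reconcile differently spelled or ambiguous entities.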

Dependencies

This framework is compatible with extractor classes from the Web Data Commons (WDC) Extraction Framework. It therefore has dependencies that can be resolved by importing the WDC Extraction Framework.

To store data in SQLite databases, this project relies on sqlite-jdbc, which must be imported into the project before the SQLite storage can be used.