Incremental-KGC is a Python package for efficiently materializing subsequent versions of knowledge graphs. It uses the previous snapshot of the data source and the previous version of the knowledge graph to materialize only the data that has changed. Both additions and deletions in the data source are currently supported.
The functionality is provided through the `load_kg()` function, which has the following arguments:

`mapping_file`
: The path of the mapping file.

`snapshot_file`
: The path of the snapshot file.

`aux_data_path`
: The path of the auxiliary directory.

`old_graph`
: An `rdflib.Graph` that contains the previously generated version, or `None`.

`engine`
: The name of the mapping engine used to materialize the graph. Currently only `morph` and `rdfizer` are supported.

`method`
: Either `disk` or `memory`. If `disk`, the auxiliary data is stored on disk, under the `aux_data_path` directory. If `memory`, the auxiliary data is stored in memory. `memory` is only supported when `engine` is `morph`.

`mapping_optimization`
: If `True` (default), the mappings are reduced to contain only the rules from the data sources that were updated.
The behavior of `load_kg()` depends on whether the knowledge graph is being materialized for the first time. When materializing for the first time, the argument `old_graph` should be `None`. An example is shown below.
```python
g = load_kg(mapping_file='mapping.csv.ttl',
            snapshot_file='snapshot_file',
            aux_data_path='./.aux',
            old_graph=None,
            method='disk',
            engine='morph',
            mapping_optimization=True)
```
This call to `load_kg()` returns the new version of the knowledge graph as an `rdflib.Graph` instance. It also saves the snapshot to `snapshot_file`, which is necessary for further updates.
For subsequent updates, the argument `old_graph` should be an `rdflib.Graph` instance that was generated from the same mapping file. An example is shown below. Note that the graph `g` from the previous snippet is passed as `old_graph`.
```python
g = load_kg(mapping_file='mapping.csv.ttl',
            snapshot_file='snapshot_file',
            aux_data_path='./.aux',
            old_graph=g,
            method='disk',
            engine='morph',
            mapping_optimization=True)
```
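If the previous version is not kept in memory between runs, it can be persisted and reloaded with `rdflib` before being passed as `old_graph`. The following is a minimal sketch, assuming a serialized Turtle round-trip preserves the graph for this purpose (the file name `kg_v1.ttl` is illustrative):

```python
import rdflib

# Persist the first version so it can be reused in a later run
# (the file name is illustrative).
g.serialize(destination='kg_v1.ttl', format='turtle')

# In a later run, reload the previous version and pass it as old_graph.
old = rdflib.Graph()
old.parse('kg_v1.ttl', format='turtle')

g = load_kg(mapping_file='mapping.csv.ttl',
            snapshot_file='snapshot_file',
            aux_data_path='./.aux',
            old_graph=old,
            method='disk',
            engine='morph',
            mapping_optimization=True)
```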
Currently only CSV data sources are supported. If you are interested in supporting more, please read the Contributing section.
Currently only the Morph-KGC and SDM-RDFizer mapping engines are supported. If you are interested in supporting more, please read the Contributing section.
If you are interested in contributing to the project, please make sure to open a pull request with the changes, the corresponding tests, and documentation. The following sections describe the process of supporting new data source types and mapping engines; however, any additional improvement or extension is welcome.
To support a new data source, the first step is to choose a Python object that can represent one file. For instance, a CSV file can be represented as a `pandas.DataFrame`. Then, the following functions need to be extended. Please read the documentation of each function before making changes.
- `_process_source()`. The function must return three Python objects, each one with the following information (a sketch of both functions follows this list):
    - The set of current data, which represents the data from the current version of the data source.
    - The set of new data, which represents the data present in the data source but not in the snapshot.
    - The set of removed data, which represents the data present in the snapshot but not in the data source.
- `_save_data_to_file()`. This function must serialize the Python object used to represent each source.
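As an illustration, the sketch below shows what these two functions could look like for a hypothetical CSV source represented as a `pandas.DataFrame`. The signatures and file-handling details are illustrative assumptions, not the package's actual internals:

```python
import pandas as pd

def _process_source(source_path, snapshot_path):
    # Illustrative sketch: compute the three sets for a CSV source.
    current = pd.read_csv(source_path)     # current version of the data source
    snapshot = pd.read_csv(snapshot_path)  # previously stored snapshot

    # An outer merge with an indicator column marks where each row comes from.
    merged = current.merge(snapshot, how='outer', indicator=True)

    # Rows present in the source but not in the snapshot (additions).
    new = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

    # Rows present in the snapshot but not in the source (deletions).
    removed = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')

    return current, new, removed

def _save_data_to_file(data, path):
    # Illustrative sketch: serialize the object that represents the source.
    data.to_csv(path, index=False)
```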
In order to support a new mapping engine, the function `_materialize_set()` must be extended. The function should return an `rdflib.Graph` containing the generated triples. Note that if the new mapping engine is not written in Python, it is possible to run it as a script with `subprocess.run` and then read the output triples with `rdflib`.
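As a rough sketch of this approach, the example below runs a hypothetical command-line engine and parses its N-Triples output. The command name, flags, and output file are illustrative assumptions:

```python
import subprocess
import rdflib

def _materialize_set(mapping_file, output_file='output.nt'):
    # Run the external engine as a subprocess (the command is hypothetical).
    subprocess.run(
        ['some-engine', '--mapping', mapping_file, '--output', output_file],
        check=True,
    )
    # Load the generated triples into an rdflib.Graph and return it.
    graph = rdflib.Graph()
    graph.parse(output_file, format='nt')
    return graph
```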
Additionally, if the engine is a Python library, the corresponding import statement should be added to `load_kg()`.
Tests are placed under the `test/` directory. Feel free to add new tests and create new directories.
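For example, a minimal test could materialize a small fixture and check the result. This is only a sketch; the import path, the fixture file, and the use of pytest's `tmp_path` fixture are illustrative assumptions:

```python
from incremental_kgc import load_kg  # illustrative import path

def test_first_materialization(tmp_path):
    # Materialize a small fixture for the first time (fixture path is illustrative).
    g = load_kg(mapping_file='test/fixtures/mapping.csv.ttl',
                snapshot_file=str(tmp_path / 'snapshot'),
                aux_data_path=str(tmp_path / '.aux'),
                old_graph=None,
                method='disk',
                engine='morph',
                mapping_optimization=True)
    # The returned rdflib.Graph should contain the generated triples.
    assert len(g) > 0
```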