PyTorch implementation of the NLDB 2023 paper "Expanding Domain-Specific Knowledge Graphs with Unknown Facts".
Many knowledge graphs have been created to support intelligent applications such as search engines and recommendation systems. Some domain-specific knowledge graphs naturally share similar content (e.g., Freebase contains information about actors and movies, which is the core of IMDB). Adding relevant facts, or triples, from one knowledge graph to another domain-specific knowledge graph is key to expanding the latter's coverage. Facts from one knowledge graph may contain unknown entities or relations that do not occur in the existing knowledge graph, but this does not mean that such facts are irrelevant and cannot be added. However, adding irrelevant facts would violate the inherent nature of the existing knowledge graph: only facts that conform to its subject matter can be added. It is therefore vital to filter out irrelevant facts to avoid such violations. This paper presents an embedding method, called UFD, that computes the relevance of unknown facts to an existing domain-specific knowledge graph, so that relevant new facts from another knowledge graph can be added to it. A new dataset, called UFD-303K, is created for evaluating unknown fact detection. Experiments show that our embedding method is very effective at distinguishing relevant unknown facts and adding them to the existing knowledge graph.
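The filtering idea above can be illustrated with a minimal sketch: a triple is serialized into a text sequence (as a BERT-style encoder would consume it), scored for relevance to the target knowledge graph, and added only if the score passes a threshold. The scorer below is a hypothetical placeholder standing in for the trained UFD model, and the serialization format is an assumption for illustration, not the paper's exact input encoding.

```python
def serialize_triple(head, relation, tail):
    """Turn a (head, relation, tail) fact into a text sequence,
    roughly as a BERT-style encoder would consume it (illustrative only)."""
    return f"[CLS] {head} [SEP] {relation} [SEP] {tail} [SEP]"

def filter_relevant_facts(facts, score_fn, threshold=0.5):
    """Keep only facts whose relevance score to the target KG
    meets the threshold; the rest are rejected as irrelevant."""
    return [fact for fact in facts if score_fn(fact) >= threshold]

# Hypothetical relevance scorer for a movie-domain KG; in the paper
# this role is played by the learned UFD embedding model.
facts = [
    ("Tom Hanks", "acted_in", "Forrest Gump"),   # relevant to a movie KG
    ("Aspirin", "treats", "Headache"),           # irrelevant to a movie KG
]
movie_relations = {"acted_in", "directed"}
score = lambda fact: 1.0 if fact[1] in movie_relations else 0.0

print(filter_relevant_facts(facts, score))
# only the movie fact survives the filter
```

The point of the sketch is the pipeline shape (serialize, score, threshold), not the scoring function itself, which the paper learns from data.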
# install virtual environment module
python3 -m pip install --user virtualenv
# create virtual environment
python3 -m venv env_name
source env_name/bin/activate
# install python packages
pip install requests
pip install -r requirements.txt
python main.py --task_name kg --do_train --do_link_predict --data_dir ./data/FB15K237 --pre_process_data ./pre_process_data --bert_model bert-base-cased --max_seq_length 300 --train_batch_size 32 --learning_rate 5e-5 --num_train_epochs 5.0 --output_dir ./output_FB15K237/ --gradient_accumulation_steps 4 --eval_batch_size 32
--do_train
: set this flag to train model.
--do_link_predict
: set this flag to run link prediction (test).
--data_dir
: the path of the dataset.
--bert_model
: the path/type of the pre-trained model.
--max_seq_length
: maximum sequence length of input text.
--train_batch_size
: batch size for training.
--learning_rate
: learning rate for training.
--gradient_accumulation_steps
: number of gradient accumulation steps.
--eval_batch_size
: batch size for evaluation.
--pre_process_data
: the path of pre-processed data.
--output_dir
: the path of output directory.
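Combining the flags above, an evaluation-only run might look like the following. This is a sketch assuming a trained model is already in the output directory; it simply drops --do_train from the training command and is not a verified invocation.

```shell
python main.py --task_name kg --do_link_predict \
  --data_dir ./data/FB15K237 --pre_process_data ./pre_process_data \
  --bert_model bert-base-cased --max_seq_length 300 \
  --eval_batch_size 32 --output_dir ./output_FB15K237/
```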
Download data from here
This paper has been accepted by the 28th International Conference on Natural Language & Information Systems (NLDB 2023). The published version can be viewed via this link. If you use any code from our repo in your paper, please cite:
@InProceedings{mhu2023ufd,
author="Hu, Miao
and Lin, Zhiwei
and Marshall, Adele",
editor="M{\'e}tais, Elisabeth
and Meziane, Farid
and Sugumaran, Vijayan
and Manning, Warren
and Reiff-Marganiec, Stephan",
title="Expanding Domain-Specific Knowledge Graphs with Unknown Facts",
booktitle="Natural Language Processing and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="352--364"
}
Feel free to contact Miao Hu ([email protected]) if you have any further questions.