Enhancing Complex Question Answering over Knowledge Graphs through Evidence Pattern Retrieval, WWW 2024

nju-websoft/EPR-KGQA

EPR-KGQA: Complex Questions Answering over Knowledge Graph via Evidence Pattern Retrieval

Project for the WWW'24 paper: Enhancing Complex Question Answering over Knowledge Graphs through Evidence Pattern Retrieval

Overview

  • EPR-KGQA is
    • an information-retrieval-style KGQA system that explicitly models structural dependency via evidence pattern retrieval.
    • the best-performing method under the supervision of only question-answer pairs on ComplexWebQuestions (as of 2024-02).
    • a method that does not rely on manually annotated formal queries or relation paths.
image
Facts about the question “What country, containing Stahuis, does Germany border?”. The correct answer, Netherlands, is underlined. The noisy answer, Austria, does not contain Stahuis, but the names of the relations connecting them express similar meanings. Systems that are insensitive to structural dependencies may be confused by such noise.

Evidence pattern retrieval (EPR)

  • We implement EPR by indexing the atomic adjacency patterns (APs) of resource pairs.

  • We enumerate the combinations of retrieved APs to construct candidate evidence patterns (EPs).

  • Candidate EPs are scored using the BERT-base model, and the best one is selected to extract a subgraph for answer reasoning.
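
The enumeration and selection steps above can be illustrated with a minimal sketch. It is not the repository's implementation: the AP tuples and relation names are simplified, hypothetical stand-ins, the connectivity filter is an assumption about what makes a combination a plausible EP, and a token-overlap function stands in for the BERT-base ranker.

```python
from itertools import combinations

# Hypothetical toy APs: each is a pair of adjacent resources (entities or relations).
APS = [
    ("Germany", "location.adjoins"),
    ("location.adjoins", "location.contains"),
    ("location.contains", "Stahuis"),
    ("location.adjoins", "location.capital"),
]

def is_connected(aps):
    """True if the APs form a single component when linked by shared resources."""
    aps = list(aps)
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(aps)):
            if j not in seen and set(aps[i]) & set(aps[j]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(aps)

def candidate_eps(aps, max_size):
    """Yield every connected combination of up to max_size APs."""
    for size in range(1, max_size + 1):
        for combo in combinations(aps, size):
            if is_connected(combo):
                yield combo

def overlap_score(question, ep):
    """Stand-in scorer (NOT BERT): number of EP tokens that occur in the question."""
    q_tokens = set(question.lower().replace("?", "").replace(",", "").split())
    ep_tokens = set()
    for ap in ep:
        for resource in ap:
            ep_tokens.update(resource.lower().replace(".", " ").split())
    return len(q_tokens & ep_tokens)

question = "What country, containing Stahuis, does Germany border?"
candidates = list(candidate_eps(APS, 3))
best = max(candidates, key=lambda ep: overlap_score(question, ep))
print(len(candidates), best)
```

In this toy setting the selected EP chains Germany, the adjacency relation, the containment relation, and Stahuis, which is the structure the example question requires.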

Experimental Results

image

The best results of IR methods are in bold, and the second-best results are underlined. denotes that the method requires gold query annotation of all training questions. denotes few-shot methods.

Project Organization

  • config.py
  • config_CWQ.yaml
  • config_WebQSP.yaml
  • preprocess
    • adjacent_info_prepare.py
    • heuristic_path_search.py
    • do_preprocess.py
  • atomic_pattern_retrieval
    • generate_positive_rr_aps_by_cached_paths.py
    • generate_training_data_for_ap_retrieval.py
    • train_biencoder.sh
    • biencoder
      • biencoder.py
      • run_biencoder.py
      • faiss_indexer.py
      • biencoder_inference.py
  • evidence_pattern_retrieval
    • ep_size_threshold.py
    • ep_construction.py
    • generate_candidate_eps.py
    • generate_ep_ranking_data.py
    • train_ep_ranking.sh
    • predict_ep_ranking.sh
    • BERT_Ranker
      • model_config.py
      • BertRanker.py
      • train_bert_ranker.py
  • subgraph_extraction
    • subgraph_extraction.py
    • convert_to_nsm_input.py
  • my_utils
    • fact.py
    • freebase.py
    • data_item.py
    • io_utils.py
    • logger.py
    • rel_base.py
    • ap_utils.py
    • ep_utils.py
  • data
    • dataset
      • CWQ
        • ComplexWebQuestions_train.json
        • ComplexWebQuestions_dev.json
        • ComplexWebQuestions_test.json
        • CWQ_full_with_int_id.jsonl
      • WebQSP
        • train_simple.jsonl
        • dev_simple.jsonl
        • test_simple.jsonl
        • WebQSP.train.json
        • WebQSP.test.json
    • cache
      • relation_info_fb.json
      • type_info_fb.json
      • rel_conn_fb.jsonl
      • rr_aps_fb.json
      • rr_aps_forward_reverse_dict.json
      • rr_aps_tag_dict.json
      • CWQ
        • cached_paths.jsonl
      • WebQSP
        • cached_paths.jsonl
    • CWQ
      • ap_retrieval
      • ep_retrieval
      • subgraph_extraction
    • WebQSP
      • ap_retrieval
      • ep_retrieval
      • subgraph_extraction
  • NSM_H

Reproducing the Results

If you encounter any difficulties in reproducing the results, please feel free to reach out to me via email ([email protected]), and I can provide you with the specific data you need.

For a quick start, we have uploaded the critical models and data. You can download them from this link.

Note that our method issues a large number of queries against the knowledge graph and trains multiple models. Because queries may time out and the trained models vary slightly between runs (for example, across different graphics card models), the reproduced results may differ slightly from those reported in the paper.

Freebase Setup

Both datasets use Freebase as the knowledge graph (we use the official Freebase data dump from here). You may refer to Freebase Setup or Virtuoso Guide to set up a Virtuoso triplestore service. After starting your Virtuoso service, update the connection settings in my_utils/freebase.py (ODBC, used for large queries) and in evidence_pattern_retrieval/generate_ep_ranking_data.py (SPARQLWrapper, used for small queries).
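
To check that the SPARQL endpoint is reachable before running the pipeline, a small query can help. The sketch below is an assumption-laden example rather than code from this repository: the endpoint URL is Virtuoso's usual local default and may differ on your machine, `m.0345h` is only a sample MID-style identifier, and the helper names are hypothetical.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"

def build_name_query(mid):
    """Build a SPARQL query for the English name(s) of a Freebase entity."""
    return (
        f"PREFIX ns: <{FREEBASE_NS}> "
        f"SELECT ?name WHERE {{ ns:{mid} ns:type.object.name ?name . "
        f"FILTER(lang(?name) = 'en') }}"
    )

def query_names(mid, endpoint="http://localhost:8890/sparql"):
    """Run the name query against the endpoint (assumed URL; adjust to your setup)."""
    # Imported lazily so build_name_query works without SPARQLWrapper installed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    client = SPARQLWrapper(endpoint)
    client.setQuery(build_name_query(mid))
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    return [b["name"]["value"] for b in bindings]

if __name__ == "__main__":
    print(build_name_query("m.0345h"))
```

Calling `query_names("m.0345h")` with the service running should return a non-empty list if the dump was loaded with English labels.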

Conda Environment

We have exported the project's dependencies to requirements.txt, so you only need to follow these steps to create the required environment.

First, use Conda to create a virtual environment named EPR_KGQA,

conda create -n EPR_KGQA python=3.7

and activate it:

conda activate EPR_KGQA

Then use pip to install the dependent packages based on requirements.txt.

pip install -r requirements.txt

Preprocessing

In the preprocessing stage, we query Freebase for relation information (data/cache/relation_info_fb.json), type information (data/cache/type_info_fb.json), and adjacent-relation information (data/cache/rel_conn_fb.jsonl). For CWQ and WebQSP, we also query the paths between topic entities and answers to use as supervision (data/cache/CWQ/cached_paths.jsonl and data/cache/WebQSP/cached_paths.jsonl).

You can download the above cached files from this link and place them in the corresponding paths. If you do, the following preprocessing steps can be skipped.
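
Before deciding whether preprocessing can be skipped, it may help to confirm that every cached file is in place. The following is a minimal sketch; the function name is hypothetical, and the file list is taken only from the paths mentioned above.

```python
from pathlib import Path

# Cache files shared by both datasets, as listed in this README.
SHARED_CACHE = [
    "data/cache/relation_info_fb.json",
    "data/cache/type_info_fb.json",
    "data/cache/rel_conn_fb.jsonl",
]

def missing_cache_files(repo_root, dataset):
    """Return the expected cache files that do not exist under repo_root."""
    expected = SHARED_CACHE + [f"data/cache/{dataset}/cached_paths.jsonl"]
    return [p for p in expected if not (Path(repo_root) / p).is_file()]

if __name__ == "__main__":
    for ds in ("CWQ", "WebQSP"):
        print(ds, missing_cache_files(".", ds))
```

An empty list for a dataset suggests its preprocessing step can be skipped.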

CWQ

cd EPR_KGQA
export PYTHONPATH=.
python preprocess/do_preprocess.py

WebQSP

cd EPR_KGQA
export PYTHONPATH=.
python preprocess/do_preprocess.py --config config_WebQSP.yaml

Atomic Pattern Retrieval

We achieve EPR through the indexing and retrieval of atomic patterns. We train a biencoder and build faiss index based on the trained model to retrieve candidate RR-APs, and query ER-APs by topic entities.
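
As a rough illustration of the retrieval step, the sketch below ranks indexed patterns by inner product with a question embedding. It uses a brute-force pure-Python search as a stand-in for a faiss index, and the three-dimensional vectors and pattern keys are hypothetical; a real biencoder would produce the embeddings from text.

```python
import heapq

def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, index, k):
    """Return the k (score, key) pairs with the highest inner product to query_vec."""
    scored = ((inner_product(query_vec, vec), key) for key, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical embeddings of three RR-APs.
ap_index = {
    "adjoins|contains": [0.9, 0.1, 0.0],
    "adjoins|capital": [0.2, 0.8, 0.1],
    "contains|population": [0.1, 0.2, 0.9],
}
question_vec = [1.0, 0.2, 0.0]

results = top_k(question_vec, ap_index, 2)
for score, key in results:
    print(f"{key}: {score:.2f}")
```

An exact inner-product index computes the same ranking; the benefit of a dedicated index is speed when the number of patterns is large.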

To save time, we have uploaded the RR-APs for Freebase (data/cache/rr_aps_fb.json) to this link. You can download the file and place it in the corresponding path.

CWQ

Train Biencoder: We have uploaded the trained model to this link. If you place it in the corresponding path, the following training steps can be skipped.

cd EPR_KGQA
export PYTHONPATH=.
python atomic_pattern_retrieval/generate_positive_rr_aps_by_cached_paths.py
python atomic_pattern_retrieval/generate_training_data_for_ap_retrieval.py
chmod +x atomic_pattern_retrieval/train_biencoder.sh
sh -x atomic_pattern_retrieval/train_biencoder.sh CWQ

For biencoder inference, run the following command:

cd EPR_KGQA
export PYTHONPATH=.
python atomic_pattern_retrieval/biencoder/biencoder_inference.py

WebQSP

Train Biencoder: We have uploaded the trained model to this link. If you place it in the corresponding path, the following training steps can be skipped.

cd EPR_KGQA
export PYTHONPATH=.
python atomic_pattern_retrieval/generate_positive_rr_aps_by_cached_paths.py --config config_WebQSP.yaml
python atomic_pattern_retrieval/generate_training_data_for_ap_retrieval.py --config config_WebQSP.yaml
chmod +x atomic_pattern_retrieval/train_biencoder.sh 
sh -x atomic_pattern_retrieval/train_biencoder.sh WebQSP

For biencoder inference, run the following command:

cd EPR_KGQA
export PYTHONPATH=.
python atomic_pattern_retrieval/biencoder/biencoder_inference.py --config config_WebQSP.yaml

Evidence Pattern Retrieval

CWQ

For evidence pattern ranking, we have uploaded the trained model to this link. If you place it in the corresponding path, the ranking-model training step (sh -x evidence_pattern_retrieval/train_ep_ranking.sh CWQ) can be skipped. You can also comment out the code that generates training data in evidence_pattern_retrieval/generate_ep_ranking_data.py, leaving only the code that generates prediction data.

For a quick start, we have uploaded the names of the topic entities (data/CWQ/ep_retrieval/CWQ_entity_name_dict.json), which are used when generating training data, to this link. Downloading the file and placing it in the corresponding path saves query time.

cd EPR_KGQA
export PYTHONPATH=.
python evidence_pattern_retrieval/generate_candidate_eps.py # ep construction
python evidence_pattern_retrieval/generate_ep_ranking_data.py
chmod +x evidence_pattern_retrieval/train_ep_ranking.sh
sh -x evidence_pattern_retrieval/train_ep_ranking.sh CWQ
chmod +x evidence_pattern_retrieval/predict_ep_ranking.sh
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh CWQ test 7 100 # ds_tag, split, epoch, topk 
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh CWQ dev 7 100 # ds_tag, split, epoch, topk 
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh CWQ train 7 100 # ds_tag, split, epoch, topk
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh CWQ test 7 80 # ds_tag, split, epoch, topk

WebQSP

For evidence pattern ranking, we have uploaded the trained model to this link. If you place it in the corresponding path, the ranking-model training step (sh -x evidence_pattern_retrieval/train_ep_ranking.sh WebQSP) can be skipped. You can also comment out the code that generates training data in evidence_pattern_retrieval/generate_ep_ranking_data.py, leaving only the code that generates prediction data.

For a quick start, we have uploaded the names of the topic entities (data/WebQSP/ep_retrieval/WebQSP_entity_name_dict.json), which are used when generating training data, to this link. Downloading the file and placing it in the corresponding path saves query time.

cd EPR_KGQA
export PYTHONPATH=.
python evidence_pattern_retrieval/generate_candidate_eps.py --config config_WebQSP.yaml
python evidence_pattern_retrieval/generate_ep_ranking_data.py --config config_WebQSP.yaml
chmod +x evidence_pattern_retrieval/train_ep_ranking.sh 
sh -x evidence_pattern_retrieval/train_ep_ranking.sh WebQSP # about 2 hours
chmod +x evidence_pattern_retrieval/predict_ep_ranking.sh
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh WebQSP test 6 100 # ds_tag, split, epoch, topk 
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh WebQSP dev 6 100 # ds_tag, split, epoch, topk 
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh WebQSP train 6 100 # ds_tag, split, epoch, topk
CUDA_VISIBLE_DEVICES=0 sh -x evidence_pattern_retrieval/predict_ep_ranking.sh WebQSP test 6 80 # ds_tag, split, epoch, topk 
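
The four prediction calls per dataset differ only in the split and topk arguments, so they can be generated rather than typed by hand. This is an optional convenience sketch, not part of the repository; the helper name is hypothetical, and the split, epoch, and topk values are copied from the commands in this section.

```python
SCRIPT = "evidence_pattern_retrieval/predict_ep_ranking.sh"

# (split, topk) pairs used in the commands of this section.
DEFAULT_RUNS = (("test", 100), ("dev", 100), ("train", 100), ("test", 80))

def predict_commands(ds_tag, epoch, runs=DEFAULT_RUNS):
    """Build one argv list per prediction run: ds_tag, split, epoch, topk."""
    return [
        ["sh", "-x", SCRIPT, ds_tag, split, str(epoch), str(topk)]
        for split, topk in runs
    ]

if __name__ == "__main__":
    for ds_tag, epoch in (("CWQ", 7), ("WebQSP", 6)):
        for argv in predict_commands(ds_tag, epoch):
            print(" ".join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` with `CUDA_VISIBLE_DEVICES` set in the environment.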

Subgraph Extraction

In the subgraph extraction module, we extract subgraphs based on EP and convert them into the input format of NSM.
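
To show why extracting a subgraph from an EP filters out structurally inconsistent facts, here is a toy in-memory sketch. It is not the repository's implementation, which queries Freebase: the facts, relation names, and the `?x` variable syntax are hypothetical simplifications based on the example question in the overview.

```python
FACTS = [
    ("Germany", "adjoins", "Netherlands"),
    ("Germany", "adjoins", "Austria"),
    ("Netherlands", "contains", "Stahuis"),
    ("Austria", "contains", "Vienna"),
]

# Toy evidence pattern: some ?x that Germany adjoins and that contains Stahuis.
PATTERN = [
    ("Germany", "adjoins", "?x"),
    ("?x", "contains", "Stahuis"),
]

def unify(pattern_triple, fact, binding):
    """Extend binding so pattern_triple matches fact, or return None on conflict."""
    new = dict(binding)
    for p, f in zip(pattern_triple, fact):
        if p.startswith("?"):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new

def match(pattern, facts, binding=None, used=()):
    """Yield (binding, used_facts) for every complete match of the pattern."""
    binding = {} if binding is None else binding
    if not pattern:
        yield binding, used
        return
    for fact in facts:
        new = unify(pattern[0], fact, binding)
        if new is not None:
            yield from match(pattern[1:], facts, new, used + (fact,))

def extract_subgraph(pattern, facts):
    """Collect every fact that takes part in at least one complete match."""
    subgraph = set()
    for _, used in match(pattern, facts):
        subgraph.update(used)
    return subgraph

subgraph = extract_subgraph(PATTERN, FACTS)
print(sorted(subgraph))
```

The facts about Austria are dropped because no single binding of `?x` satisfies both triples of the pattern, so the reasoner never sees that candidate.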

CWQ

cd EPR_KGQA
export PYTHONPATH=.
python subgraph_extraction/subgraph_extraction.py
python subgraph_extraction/convert_to_nsm_input.py

WebQSP

cd EPR_KGQA
export PYTHONPATH=.
python subgraph_extraction/subgraph_extraction.py --config config_WebQSP.yaml
python subgraph_extraction/convert_to_nsm_input.py --config config_WebQSP.yaml

NSM Reasoning

In the answer reasoning module, we use NSM as the reasoner to obtain the final answer(s) based on subgraphs.

CWQ

First, download the files from this link and place them in the corresponding paths. Then run the following commands:

cd NSM_H
export PYTHONPATH=.
chmod +x ../answer_reasoning/train_nsm.sh
sh -x ../answer_reasoning/train_nsm.sh CWQ
chmod +x ../answer_reasoning/predict_nsm.sh
sh -x ../answer_reasoning/predict_nsm.sh CWQ

WebQSP

First, download the files from this link and place them in the corresponding paths. Then run the following commands:

cd NSM_H
export PYTHONPATH=.
chmod +x ../answer_reasoning/train_nsm.sh
sh -x ../answer_reasoning/train_nsm.sh WebQSP
chmod +x ../answer_reasoning/predict_nsm.sh
sh -x ../answer_reasoning/predict_nsm.sh WebQSP

Citation

@inproceedings{epr-kgqa,
  author = {Ding, Wentao and Li, Jinmao and Luo, Liangchuan and Qu, Yuzhong},
  title = {Enhancing Complex Question Answering over Knowledge Graphs through Evidence Pattern Retrieval},
  year = {2024},
  booktitle = {Proceedings of the ACM Web Conference 2024},
  series = {WWW '24}
}

Acknowledgements

Our project uses WSDM2021_NSM (the Neural State Machine for KBQA) as the answer reasoner.