Update 08/2023

***** Migrating our models and data to HuggingFace *****

Due to the restricted access of the Azure blob storage, we are migrating our models and data to HuggingFace. Please use the following command to download our models and data. The resource keys are the same as in download_data.py.

python download_data_hf.py --resource resource_key --output_dir your_output_dir 

Update 06/2023

***** Adding code for COS *****

The code and model for the paper "Chain-of-Skills: A Configurable Model for Open-domain Question Answering" (ACL 2023) have been added. See full paper here

Update

***** Adding code for CORE *****

The code and data for the paper "Open-domain Question Answering via Chain of Reasoning over Heterogeneous Knowledge" (Findings of EMNLP 2022) have been added. See full paper here

Open Domain Question Answering with A Unified Knowledge Interface

This repository contains the code and data for the paper "Open Domain Question Answering with A Unified Knowledge Interface" (ACL 2022). See full paper here

Note that our code is adapted from the DPR repo and plms-graph2text.

Knowledge Sources

We provide the following knowledge resources. To download one, copy its link below and run "wget link". Whole: each table is a single unit. Chunked: each table is split into chunks of approximately 100 tokens.

| Source | Format | Whole | Chunked |
| --- | --- | --- | --- |
| Table | Raw | link | link |
| Table | Verbalized | link | link |
| KB | Verbalized | link | link |
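
To illustrate the difference between the whole and chunked formats, here is a minimal sketch of splitting one table into chunks of roughly 100 tokens. It is not the script used to build the released files: the function name `chunk_table_rows`, the whitespace tokenization, and the row-level greedy grouping are all assumptions made for illustration.

```python
from typing import List


def chunk_table_rows(rows: List[str], max_tokens: int = 100) -> List[List[str]]:
    """Greedily group consecutive rows into chunks of at most ~max_tokens tokens."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for row in rows:
        n_tokens = len(row.split())  # whitespace tokens (an assumption)
        # Start a new chunk when adding this row would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(row)
        current_len += n_tokens
    if current:
        chunks.append(current)
    return chunks


# Five rows of 40 tokens each: at most two rows fit into a 100-token chunk.
rows = ["year team wins losses " * 10] * 5
chunks = chunk_table_rows(rows, max_tokens=100)
print([len(c) for c in chunks])
```

A single row longer than the budget is kept intact in its own chunk, which is why the chunk size is only approximate.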

Environments

Our verbalizer and retriever require different environments, so you should build two separate environments. For the DPR part:

cd DPR
pip install .

For the verbalizer, the code has been tested on Python 3.8, PyTorch 1.7.1, Transformers 3.3.1, and pytorch-lightning 0.9.0. You can install the required packages with:

cd Verbalizer
pip install -r requirements.txt

Verbalizing your own knowledge resource

First, download our trained verbalizer using this link. Next, prepare your data in a format similar to Verbalizer/data/test.source. Then run the verbalizer with

bash generate.sh your_data_folder verbalizer_ckpt output_dir 1 your_data_filename 10 1 0 

See generate.sh for more information on the arguments. In short, our verbalizer uses data parallelism, i.e. each spawned process uses 1 GPU and works on 1 shard of data independently. After generation is done, run the following for beam selection

python post_processing.py --verbalizer_output output_file_from_previous_step --verbalizer_input your_data_filename 

If you would like to re-train a new verbalizer, run

bash train.sh data t5-large T-F_ID-T 1 5

Retriever data and models

To download the trained retriever models and the encoded knowledge sources, use download_data.py

python download_data.py --resource model.retriever.nq_v_tables --output_dir your_output_dir 

Check the resource_map in download_data.py for more information. We provide the following 3 models on NQ (metrics on NQ test set)

| Knowledge Sources | Format | Resource key | R20 | R100 | EM |
| --- | --- | --- | --- | --- | --- |
| Text+Table | Raw | nq_raw_tables | 86.9 | 91.9 | 54.7 |
| Text+Table | Verbalized | nq_v_tables | 87.0 | 91.7 | 55.2 |
| Text+Table+KB | Verbalized | nq_v_all | 85.6 | 91.2 | 55.1 |
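
For readers unfamiliar with the metrics, the sketch below shows how such numbers are conventionally computed in open-domain QA. It assumes R20 and R100 denote top-20 and top-100 retrieval recall (the percentage of questions for which at least one of the top-k passages contains a gold answer) and EM denotes exact match after answer normalization. The function names are hypothetical and this is not the evaluation code used for the table.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in gold_answers)


def recall_at_k(retrieved: List[List[str]], gold: List[List[str]], k: int) -> float:
    """Percentage of questions whose top-k passages contain any gold answer."""
    hits = 0
    for passages, answers in zip(retrieved, gold):
        top_k = [normalize_answer(p) for p in passages[:k]]
        norm_answers = [a for a in map(normalize_answer, answers) if a]
        if any(a in p for p in top_k for a in norm_answers):
            hits += 1
    return 100.0 * hits / len(retrieved)


# Toy data: two questions with two ranked passages each.
retrieved = [
    ["Paris is the capital of France.", "Berlin is in Germany."],
    ["The Nile flows north.", "The Amazon is in South America."],
]
gold = [["Paris"], ["Amazon"]]
print(recall_at_k(retrieved, gold, k=1))
print(recall_at_k(retrieved, gold, k=2))
print(exact_match("the Eiffel Tower", ["Eiffel Tower"]))
```

The substring check on normalized text is a simplification; token-level matching is a common alternative and can give slightly different recall values.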

If you do not already have the chunked Wikipedia passages, you can download them using this link. It's the same copy provided by the DPR repo. To run inference with the retriever

python dense_retriever.py model_file=downloaded_model_file qa_dataset=[nq_test] \
    ctx_datatsets=[dpr_wiki,verbalized_table] encoded_ctx_files=[download_path/wiki*,download_path/verbalized*] \
    out_file=[output_location] 

Note that you will need to update the file paths in /DPR/conf/ctx_sources/default_sources.yaml and /DPR/conf/datasets/retriever_default.yaml. If you have limited CPU RAM, you can add the validation_workers=1 argument.

Training your own retriever

If you would like to train your own retriever model, you can download the retriever training data using download_data.py. We provide the training data for the above 3 settings on NQ and for WebQ with all verbalized knowledge. To run retriever training,

python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py train=biencoder_nq \
    train_datasets=[nq_train_bm25,nq_train_dpr_v_tab_index,nq_train_v_tab_bm25,nq_train_v_tab_dpr] dev_datasets=[nq_dev_v_tab_index] \
    output_dir=your_output_dir \
    checkpoint_file_name=dpr_biencoder

For more information on the train dataset keys, check /DPR/conf/datasets/encoder_train_default.yaml. Note that you will also need to update the corresponding file paths there.

| Task | Knowledge Sources | Format | Resource key | Train set keys | Dev set keys |
| --- | --- | --- | --- | --- | --- |
| NQ | Text+Table | Raw | nq_raw_tables | nq_train_bm25,nq_train_dpr_raw_tab_index,nq_train_raw_tab_bm25,nq_train_raw_tab_dpr | nq_dev_raw_tab_index |
| NQ | Text+Table | Verbalized | nq_v_tables | nq_train_bm25,nq_train_dpr_v_tab_index,nq_train_v_tab_bm25,nq_train_v_tab_dpr | nq_dev_v_tab_index |
| NQ | Text+Table+KB | Verbalized | nq_v_all | nq_train_bm25,nq_train_dpr_v_all_index,nq_train_v_tab_bm25,nq_train_v_kb_bm25 | nq_dev_v_all_index |
| WebQ | Text+Table+KB | Verbalized | webq_v_all | webq_train,webq_train_v_kb_bm25,webq_train_v_tab_bm25,webq_train_dpr_v_all_index | webq_dev_v_all_index |
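
If you switch between settings often, a small helper can assemble the dataset overrides for the training command from the table above. This is only a convenience sketch: `DATASET_KEYS` and `build_train_overrides` are hypothetical names that do not exist in this repository, and the `train=` override is left out on purpose because the appropriate value for WebQ is not stated here.

```python
from typing import Dict, List, Tuple

# Copied from the table above: resource key -> (train set keys, dev set keys).
DATASET_KEYS: Dict[str, Tuple[List[str], List[str]]] = {
    "nq_raw_tables": (
        ["nq_train_bm25", "nq_train_dpr_raw_tab_index",
         "nq_train_raw_tab_bm25", "nq_train_raw_tab_dpr"],
        ["nq_dev_raw_tab_index"],
    ),
    "nq_v_tables": (
        ["nq_train_bm25", "nq_train_dpr_v_tab_index",
         "nq_train_v_tab_bm25", "nq_train_v_tab_dpr"],
        ["nq_dev_v_tab_index"],
    ),
    "nq_v_all": (
        ["nq_train_bm25", "nq_train_dpr_v_all_index",
         "nq_train_v_tab_bm25", "nq_train_v_kb_bm25"],
        ["nq_dev_v_all_index"],
    ),
    "webq_v_all": (
        ["webq_train", "webq_train_v_kb_bm25",
         "webq_train_v_tab_bm25", "webq_train_dpr_v_all_index"],
        ["webq_dev_v_all_index"],
    ),
}


def build_train_overrides(resource_key: str, output_dir: str) -> List[str]:
    """Return the dataset-related overrides for train_dense_encoder.py."""
    train_keys, dev_keys = DATASET_KEYS[resource_key]
    return [
        f"train_datasets=[{','.join(train_keys)}]",
        f"dev_datasets=[{','.join(dev_keys)}]",
        f"output_dir={output_dir}",
        "checkpoint_file_name=dpr_biencoder",
    ]


overrides = build_train_overrides("nq_v_tables", "your_output_dir")
print(" ".join(overrides))
```

The printed string can be appended to the `python -m torch.distributed.launch ... train_dense_encoder.py` invocation shown earlier.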

Training on NQ takes about 1.5 days to finish. After training is done, you can generate the encoded embeddings with the following command,

python generate_dense_embeddings.py model_file=your_best_model_checkpoint ctx_src=verbalized_table \
    out_file=your_output_location batch_size=2048 shard_id=0 num_shards=1 gpu_id=0 num_gpus=1 

This script uses data parallelism, i.e. each process works on 1 shard of data independently. If you have multiple GPUs, you can therefore run multiple processes that work on different shards of the data.
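
To make the sharding concrete, the sketch below shows one common way to map shard_id and num_shards to a contiguous slice of the passage list. It assumes equal-size contiguous shards with the last one possibly shorter; the function name `shard_bounds` is hypothetical, and the actual slicing in generate_dense_embeddings.py may differ.

```python
import math
from typing import Tuple


def shard_bounds(total: int, num_shards: int, shard_id: int) -> Tuple[int, int]:
    """Return the [start, end) passage indices handled by one shard."""
    shard_size = math.ceil(total / num_shards)
    start = min(shard_id * shard_size, total)
    end = min(start + shard_size, total)
    return start, end


# Ten passages split across four processes.
bounds = [shard_bounds(total=10, num_shards=4, shard_id=i) for i in range(4)]
print(bounds)
```

Under this scheme every passage belongs to exactly one shard, so the per-shard output files can be loaded together at retrieval time.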

Training your own reader

If you are only interested in training a reader model, we also provide the reader training data (i.e. retriever results) for NQ in the above 3 settings. Again, you can download them using download_data.py. Note that we did not use the reader model implemented under the DPR folder; we used UnitedQA instead.

Cite

@inproceedings{ma-etal-2022-open,
    title = "Open Domain Question Answering with A Unified Knowledge Interface",
    author = "Ma, Kaixin  and
      Cheng, Hao  and
      Liu, Xiaodong  and
      Nyberg, Eric  and
      Gao, Jianfeng",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.113",
    doi = "10.18653/v1/2022.acl-long.113",
    pages = "1605--1620",
}

@inproceedings{ma-etal-2022-open-domain,
    title = "Open-domain Question Answering via Chain of Reasoning over Heterogeneous Knowledge",
    author = "Ma, Kaixin  and
      Cheng, Hao  and
      Liu, Xiaodong  and
      Nyberg, Eric  and
      Gao, Jianfeng",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.392",
    pages = "5360--5374",
}

@inproceedings{ma-etal-2023-chain,
    title = "Chain-of-Skills: A Configurable Model for Open-Domain Question Answering",
    author = "Ma, Kaixin  and
      Cheng, Hao  and
      Zhang, Yu  and
      Liu, Xiaodong  and
      Nyberg, Eric  and
      Gao, Jianfeng",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.89",
    doi = "10.18653/v1/2023.acl-long.89",
    pages = "1599--1618",
}