- Download the Wikipedia corpus from DPR
mkdir data
cd data
# Download the JSONL corpus converted and hosted by Tevatron
# (use the /resolve/ URL; the /blob/ URL returns an HTML page, not the file)
wget https://huggingface.co/datasets/Tevatron/wikipedia-nq-corpus/resolve/main/corpus.jsonl.gz
gunzip corpus.jsonl.gz
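If `gunzip` reports that a file is not in gzip format, the download most likely returned a web page instead of the archive. A minimal sketch of a check that can be run on any of the `.gz` downloads in this section before decompressing; `is_gzip` is a hypothetical helper name and is not part of this repository.

```shell
# Hypothetical helper (not part of this repository): succeeds only if the
# file passes gzip's built-in integrity test (gzip -t).
is_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Example usage, run from the data directory before decompressing:
#   is_gzip corpus.jsonl.gz && gunzip corpus.jsonl.gz
```

Because `gzip -t` only reads the archive, the check leaves the downloaded file untouched.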
- Download Training datasets from DPR-NQ
mkdir nq
# Get train and dev sets
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz
python convert_hn_format.py --input_path biencoder-nq-train.json.gz --output_path nq/nq-train.jsonl
python convert_hn_format.py --input_path biencoder-nq-dev.json.gz --output_path nq/nq-dev.jsonl
# Get test set
wget "https://huggingface.co/datasets/Tevatron/wikipedia-nq/resolve/main/nq-test.jsonl.gz?download=true" -O "nq/nq-test.jsonl.gz"
gunzip nq/nq-test.jsonl.gz
- Download Training datasets from DPR-TQA
mkdir trivia
# Get train and dev sets
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-train.json.gz
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-trivia-dev.json.gz
python convert_hn_format.py --input_path biencoder-trivia-train.json.gz --output_path trivia/trivia-train.jsonl
python convert_hn_format.py --input_path biencoder-trivia-dev.json.gz --output_path trivia/trivia-dev.jsonl
# Get test set
wget "https://huggingface.co/datasets/Tevatron/wikipedia-trivia/resolve/main/trivia-test.jsonl.gz?download=true" -O "trivia/trivia-test.jsonl.gz"
gunzip trivia/trivia-test.jsonl.gz
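Once the splits are converted and decompressed, a quick line count can confirm that each file exists and is non-empty, since JSONL stores one example per line. This is only a sketch: `count_records` and `report_splits` are hypothetical helper names, and the file paths in the usage comment are the output paths used in the commands above.

```shell
# Hypothetical helpers (not part of this repository).

# Print the number of newline-terminated lines in a file.
count_records() {
    # Reading from stdin makes wc print only the count, without the file name.
    wc -l < "$1" | tr -d ' '
}

# Print a record count for each file; fail on the first missing or empty file.
report_splits() {
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "$f: $(count_records "$f") records"
        else
            echo "$f: missing or empty" >&2
            return 1
        fi
    done
}

# Example usage, run from the data directory:
#   report_splits nq/nq-train.jsonl nq/nq-dev.jsonl nq/nq-test.jsonl
#   report_splits trivia/trivia-train.jsonl trivia/trivia-dev.jsonl trivia/trivia-test.jsonl
```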
Fine-tuning on QA datasets uses a two-stage pipeline:
- s1: BM25 negatives
- s2: BM25 negatives + negatives mined from s1
These pipelines are implemented in train_dpr_nq.sh and train_dpr_trivia.sh. The checkpoint pre-trained on Wikipedia and BookCorpus is released as bowdpr/bowdpr_wiki. Assuming the downloaded checkpoint is already placed in examples/results/$MODEL_NAME/model ($MODEL_NAME can be any name you like), run the fine-tuning pipeline with:
bash train_dpr_nq.sh $MODEL_NAME
Or
bash train_dpr_trivia.sh $MODEL_NAME
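Since the training scripts expect the checkpoint under examples/results/$MODEL_NAME/model, a small guard can catch a misplaced download before a long run starts. The following is a sketch under assumptions: `require_checkpoint` is a hypothetical helper, the path is resolved relative to the current directory, and the commented download line assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed.

```shell
# Hypothetical helper (not part of this repository): fail early if the
# checkpoint directory for the given model name is missing or empty.
require_checkpoint() {
    model_dir="examples/results/$1/model"
    if [ ! -d "$model_dir" ]; then
        echo "checkpoint directory not found: $model_dir" >&2
        return 1
    fi
    # An existing but empty directory is treated as missing too.
    if [ -z "$(ls -A "$model_dir")" ]; then
        echo "checkpoint directory is empty: $model_dir" >&2
        return 1
    fi
}

# Example usage (assumption: run from the directory that contains examples/):
#   huggingface-cli download bowdpr/bowdpr_wiki --local-dir "examples/results/$MODEL_NAME/model"
#   require_checkpoint "$MODEL_NAME" && bash train_dpr_nq.sh "$MODEL_NAME"
```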
We have released the fine-tuned NQ and Trivia retrievers on Hugging Face. Run the following commands to test their retrieval performance.
# Save the scores of retrieval results to this folder. Change to any temporary folder as you wish
mkdir -p results/nq
bash test_dpr_nq.sh bowdpr/bowdpr_wiki_nqft results/nq
mkdir -p results/trivia
bash test_dpr_nq.sh bowdpr/bowdpr_wiki_triviaft results/trivia