This repository is used only to supply scoring functionality to our model for the 2018 FEVER competition.
I have already modified requirements.txt so that it does not install unnecessary modules.
This is the PyTorch implementation of the FEVER pipeline baseline described in the NAACL2018 paper: FEVER: A large-scale dataset for Fact Extraction and VERification.
Unlike other tasks and despite recent interest, research in textual claim verification has been hindered by the lack of large-scale manually annotated datasets. In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,441 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach using both baseline and state-of-the-art components and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
The baseline model consists of two components: Evidence Retrieval (DrQA) + Textual Entailment (Decomposable Attention).
- Visit http://fever.ai to find out more about the shared task and download the data.
This was tested and evaluated using the Python 3.6 version of Anaconda 5.0.1, which can be downloaded from anaconda.com
Mac OSX users may have to install xcode before running git or installing packages (gcc may fail). See this post on StackExchange
Support for Windows operating systems is not provided.
To train the Decomposable Attention models, a GPU is highly recommended. Training takes about 3 hours on a GTX 1080Ti, whereas training on a CPU takes days. We offer a pre-trained model.tar.gz that can be downloaded. To use the pretrained model, simply replace any path to a model.tar.gz file with the path to the file you downloaded (e.g. logs/da_nn_sent/model.tar.gz could become ~/Downloads/model.tar.gz).
- v0.2 - updated the Information Retrieval component to use a modified version of DrQA that allows multi-threaded document/sentence retrieval. This yields a >10x speed-up in the IR stage of the pipeline, as I/O waits no longer block computation of TF*IDF vectors.
- v0.1 - original implementation (tagged as naacl2018)
Create a virtual environment for FEVER with Python 3.6 and activate it.
conda create -n fever python=3.6
source activate fever
Manually install PyTorch (version 0.3.1). Different distributions should follow the instructions from pytorch.org
conda install pytorch=0.3.1 torchvision -c pytorch
Clone the repository
git clone https://github.com/sheffieldnlp/fever-baselines
cd fever-baselines
Install requirements (run export LANG=C.UTF-8 if installation of DrQA fails)
pip install -r requirements.txt
Download the FEVER dataset from our website into the data directory
mkdir data
mkdir data/fever-data
#To replicate the paper, download paper_dev and paper_test files. These are concatenated for the shared task
wget -O data/fever-data/train.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/train.jsonl
wget -O data/fever-data/dev.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/paper_dev.jsonl
wget -O data/fever-data/test.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/paper_test.jsonl
#To train the model for the shared task (the test set will be released in July 2018)
wget -O data/fever-data/dev.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/shared_task_dev.jsonl
wget -O data/fever-data/test.jsonl https://s3-eu-west-1.amazonaws.com/fever.public/shared_task_test.jsonl
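After downloading, it can be useful to check the class balance of a split. The sketch below counts labels in a JSON Lines file, one JSON object per line. The field name `label` and the label strings in the sample are assumptions about the file layout, so check them against the files you downloaded.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count the `label` field over an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "label" is an assumed field name; records without it are grouped.
        counts[record.get("label", "<missing>")] += 1
    return counts


# Tiny in-memory sample standing in for a real split.
sample = [
    '{"id": 1, "label": "SUPPORTS", "claim": "A claim."}',
    '{"id": 2, "label": "REFUTES", "claim": "Another claim."}',
    '{"id": 3, "label": "SUPPORTS", "claim": "A third claim."}',
]
label_counts = count_labels(sample)
print(label_counts)
```

To run it on a real split, pass an open file handle, for example `count_labels(open("data/fever-data/train.jsonl", encoding="utf-8"))`.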
Download pretrained GloVe Vectors
wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
unzip glove.6B.zip -d data/glove
gzip data/glove/*.txt
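The last command leaves the vectors gzip-compressed. As a quick sanity check they can be read without decompressing to disk. This is a minimal sketch, assuming the usual GloVe text layout of one `word v1 v2 ...` entry per line; `read_vectors` is a hypothetical helper, not part of this repository.

```python
import gzip
import io


def read_vectors(source, limit=None):
    """Parse 'word v1 v2 ...' lines from a gzip-compressed text source.

    `source` may be a path or a binary file object.
    """
    vectors = {}
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in values]
    return vectors


# In-memory sample so the sketch runs without the real download.
raw = "the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n".encode("utf-8")
vectors = read_vectors(io.BytesIO(gzip.compress(raw)))
print(len(vectors), len(vectors["the"]))
```

For the real files, pass the path of one of the `.txt.gz` files under data/glove and a small `limit` to avoid loading everything into memory.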
The data preparation consists of three steps: downloading the articles from Wikipedia, indexing these for the Evidence Retrieval, and performing the negative sampling for training.
Download the pre-processed Wikipedia articles from our website and unzip them into the data folder.
wget https://s3-eu-west-1.amazonaws.com/fever.public/wiki-pages.zip
unzip wiki-pages.zip -d data
Construct an SQLite Database and build TF-IDF index (go grab a coffee while this runs)
PYTHONPATH=src python src/scripts/build_db.py data/wiki-pages data/fever/fever.db
PYTHONPATH=src python src/scripts/build_tfidf.py data/fever/fever.db data/index/
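The index filename used in later commands encodes bigrams (`ngram=2`) and a hash size of 16777216 buckets. The sketch below illustrates the general idea of a hashed n-gram TF-IDF ranker on a toy corpus. It is not DrQA's implementation: the whitespace tokenizer, the `zlib.crc32` hash, and the exact weighting formula are simplifying assumptions chosen so the example is self-contained.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 24  # 16777216, the hash size that appears in the index name


def ngrams(text, n=2):
    """Unigrams plus all n-grams up to length n, from whitespace tokens."""
    tokens = text.lower().split()
    grams = list(tokens)
    for size in range(2, n + 1):
        grams += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return grams


def hashed_counts(text):
    """Map each n-gram to a hash bucket and count occurrences."""
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in ngrams(text))


def build_index(docs):
    """Return per-document bucket counts and a smoothed IDF per bucket."""
    counts = {doc_id: hashed_counts(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for bucket_counts in counts.values():
        doc_freq.update(bucket_counts.keys())
    n_docs = len(docs)
    idf = {b: math.log((n_docs + 1) / (df + 1)) + 1.0 for b, df in doc_freq.items()}
    return counts, idf


def rank(query, counts, idf, k=5):
    """Rank documents by the dot product of TF-IDF weighted vectors."""
    q = hashed_counts(query)
    scores = {}
    for doc_id, c in counts.items():
        score = sum(math.log1p(q[b]) * math.log1p(c[b]) * idf[b] ** 2 for b in q if b in c)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = {
    "Paris": "paris is the capital of france",
    "Berlin": "berlin is the capital of germany",
    "Nile": "the nile is a river in africa",
}
counts, idf = build_index(docs)
ranked = rank("capital of france", counts, idf, k=2)
print(ranked)
```

Hashing n-grams into a fixed number of buckets keeps the index size bounded regardless of vocabulary, at the cost of occasional collisions.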
Sample training data for the NotEnoughInfo class. There are two sampling methods evaluated in the paper: using the nearest neighbour (similarity between TF-IDF vectors) and random sampling.
#Using nearest neighbor method
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split train
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split dev
And random sampling
#Using random sampling method
PYTHONPATH=src python src/scripts/dataset/neg_sample_evidence.py data/fever/fever.db
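The difference between the two sampling methods can be sketched on a toy corpus. This is an illustration only and not the logic of the scripts above: word overlap stands in for TF-IDF similarity, and `pages`, `sample_nearest`, and `sample_random` are hypothetical names.

```python
import random


def word_overlap(a, b):
    """Jaccard word overlap, a crude stand-in for TF-IDF similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def sample_nearest(claim, pages, rng):
    """Sample a sentence from the page most similar to the claim."""
    best = max(pages, key=lambda title: word_overlap(claim, " ".join(pages[title])))
    return best, rng.choice(pages[best])


def sample_random(pages, rng):
    """Sample a sentence from a page chosen uniformly at random."""
    title = rng.choice(sorted(pages))
    return title, rng.choice(pages[title])


pages = {
    "Nile": ["The Nile is a river in Africa .", "It flows north ."],
    "Paris": ["Paris is the capital of France .", "It lies on the Seine ."],
}
rng = random.Random(0)  # fixed seed for repeatability
claim = "The Nile flows through Egypt"
nearest_title, nearest_sentence = sample_nearest(claim, pages, rng)
random_title, random_sentence = sample_random(pages, rng)
print(nearest_title, "|", nearest_sentence)
```

Nearest-neighbour sampling yields evidence that is topically close to the claim, which gives harder NotEnoughInfo training examples than sentences drawn at random.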
Model 1: Multilayer Perceptron (expected oracle dev set performance: 62.27%)
#If using a GPU, set
export GPU=1
#If more than one GPU,
export CUDA_DEVICE=0 #(or any CUDA device id. default is 0)
# Using nearest neighbor sampling method for NotEnoughInfo class (better)
PYTHONPATH=src python src/scripts/rte/mlp/train_mlp.py data/fever/fever.db data/fever/train.ns.pages.p1.jsonl data/fever/dev.ns.pages.p1.jsonl --model ns_nn_sent --sentence true
#Or, using random sampled data for NotEnoughInfo (worse)
PYTHONPATH=src python src/scripts/rte/mlp/train_mlp.py data/fever/fever.db data/fever/train.ns.rand.jsonl data/fever/dev.ns.rand.jsonl --model ns_rand_sent --sentence true
Model 2: Decomposable Attention (expected dev set performance: 77.97%)
#if using a CPU, set
export CUDA_DEVICE=-1
#if using a GPU, set
export CUDA_DEVICE=0 #or cuda device id
# Using nearest neighbor sampling method for NotEnoughInfo class (better)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_nn_ora_sent.json logs/da_nn_sent --cuda-device $CUDA_DEVICE
#Or, using random sampled data for NotEnoughInfo (worse)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_rs_ora_sent.json logs/da_rs_sent --cuda-device $CUDA_DEVICE
Score:
PYTHONPATH=src python src/scripts/score.py --predicted logs/da_nn_sent_test --actual data/fever-data/test.jsonl
Model 1: Multi-layer perceptron
PYTHONPATH=src python src/scripts/rte/mlp/eval_mlp.py data/fever/fever.db --model ns_nn_sent --sentence true --log logs/mlp_nn_sent
Model 2: Decomposable Attention
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db logs/da_nn_sent/model.tar.gz data/fever/dev.ns.pages.p1.jsonl
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/dev.jsonl --out-file data/fever/dev.sentences.p5.s5.jsonl --max-page 5 --max-sent 5
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/test.jsonl --out-file data/fever/test.sentences.p5.s5.jsonl --max-page 5 --max-sent 5
For legacy evidence retrieval (including NLTK-based retrieval), see the readme in the naacl2018 tag.
Model 1: Multi-layer perceptron
PYTHONPATH=src python src/scripts/rte/mlp/eval_mlp.py data/fever/fever.db data/fever/dev.sentences.p5.s5.jsonl --model ns_nn_sent --sentence true --log logs/mlp_nn_sent_dev
PYTHONPATH=src python src/scripts/rte/mlp/eval_mlp.py data/fever/fever.db data/fever/test.sentences.p5.s5.jsonl --model ns_nn_sent --sentence true --log logs/mlp_nn_sent_test
Model 2: Decomposable Attention
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db logs/da_nn_sent/model.tar.gz data/fever/dev.sentences.p5.s5.jsonl --log logs/da_nn_sent_dev
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db logs/da_nn_sent/model.tar.gz data/fever/test.sentences.p5.s5.jsonl --log logs/da_nn_sent_test
Score:
PYTHONPATH=src python src/scripts/score.py --predicted_labels logs/da_nn_sent_test --predicted_evidence data/fever/test.sentences.p5.s5.jsonl --actual data/fever-data/test.jsonl
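The paper reports two numbers: label accuracy that ignores evidence, and a stricter accuracy that also requires correct evidence. The sketch below shows how such a pair of metrics could be computed. It is not the code in score.py: the record layout is simplified (gold evidence as groups of `[page, sentence_id]` pairs), and the field names and label strings are assumptions.

```python
def is_strictly_correct(instance):
    """Label must match; verifiable claims also need one full gold evidence group."""
    if instance["predicted_label"] != instance["label"]:
        return False
    if instance["label"] == "NOT ENOUGH INFO":
        return True  # no evidence is required for this class
    predicted = {tuple(e) for e in instance["predicted_evidence"]}
    return any(
        all(tuple(e) in predicted for e in group)
        for group in instance["evidence"]
    )


def score(instances):
    """Return (strict accuracy, label-only accuracy)."""
    n = len(instances)
    label_acc = sum(i["predicted_label"] == i["label"] for i in instances) / n
    strict_acc = sum(is_strictly_correct(i) for i in instances) / n
    return strict_acc, label_acc


instances = [
    # Correct label and the full gold evidence group is retrieved.
    {"label": "SUPPORTS", "predicted_label": "SUPPORTS",
     "evidence": [[["Paris", 0]]], "predicted_evidence": [["Paris", 0], ["Paris", 3]]},
    # Correct label, but the gold sentence is missing.
    {"label": "REFUTES", "predicted_label": "REFUTES",
     "evidence": [[["Nile", 2]]], "predicted_evidence": [["Nile", 0]]},
    # NotEnoughInfo needs only the label.
    {"label": "NOT ENOUGH INFO", "predicted_label": "NOT ENOUGH INFO",
     "evidence": [], "predicted_evidence": []},
    # Wrong label.
    {"label": "SUPPORTS", "predicted_label": "REFUTES",
     "evidence": [[["Berlin", 1]]], "predicted_evidence": [["Berlin", 1]]},
]
strict_acc, label_acc = score(instances)
print(strict_acc, label_acc)
```

The gap between the two numbers shows how often the classifier predicts the right label without the retrieval stage having found the supporting sentences.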
Prepare Submission for Codalab:
PYTHONPATH=src python src/scripts/prepare_submission.py --predicted_labels logs/da_nn_sent_test --predicted_evidence data/fever/test.sentences.p5.s5.jsonl --out_file predictions.jsonl
zip submission.zip predictions.jsonl
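Before uploading, it may be worth checking that every line of predictions.jsonl parses and has the expected shape. The validator below is a hypothetical helper, not part of this repository. The field names `predicted_label` and `predicted_evidence` are inferred from the command-line flags above, and the label strings and the `[page, sentence_id]` pair layout are assumptions to verify against the shared task instructions.

```python
import json

# Assumed label strings; adjust if the task uses different ones.
ALLOWED_LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}


def validate_line(line):
    """Return a list of problems found in one line of a predictions file."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    if record.get("predicted_label") not in ALLOWED_LABELS:
        problems.append("missing or unknown predicted_label")
    evidence = record.get("predicted_evidence")
    if not isinstance(evidence, list) or not all(
        isinstance(e, list) and len(e) == 2 for e in evidence
    ):
        problems.append("predicted_evidence is not a list of pairs")
    return problems


good = '{"predicted_label": "SUPPORTS", "predicted_evidence": [["Paris", 0]]}'
bad = '{"predicted_label": "MAYBE", "predicted_evidence": "Paris"}'
print(validate_line(good))
print(validate_line(bad))
```

To check a whole file, call `validate_line` on each line of predictions.jsonl and report the line numbers that return a non-empty list.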