MediaWiki AddLink Extension Model and API
This repository contains the necessary code for the link recommendation model for Wikipedia articles. It consists of code both to train the model (including all the necessary pre-processing) and to query the model to get link recommendations for an individual article. The method is context-free and can be scaled to (virtually) any language, provided that we have enough existing links to learn from.
Once the model and all the utility files are computed (see "Training the model" below), they can be loaded and used to build an API that adds new links to a Wikipedia page automatically. For this we have the following utilities:
- command-line tool:
python addlink-query_links.py -l de -p Garnet_Carter
This will return all recommended links for a given page (-p) in a given wiki (-l). You can also specify the threshold for the probability of a link (-t, default=0.9).
- interactive notebook:
addlink-query_notebook.ipynb
This allows you to inspect the recommendations in a notebook.
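If you want to call the command-line tool from your own code, a minimal sketch using subprocess is shown below; the page, wiki, and threshold values are only examples, and the output is simply whatever the script prints.

import subprocess

# example: query recommendations for one page of German Wikipedia,
# keeping only links with probability >= 0.8 (the -t flag described above)
result = subprocess.run(
    ["python", "addlink-query_links.py", "-l", "de", "-p", "Garnet_Carter", "-t", "0.8"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)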
Notes:
- currently this works only on stat1008 in the analytics cluster, as the underlying data from the trained model is available only locally there
- we need to set up a python virtual environment:
virtualenv -p /usr/bin/python3 venv_query/
source venv_query/bin/activate
pip install -r requirements.txt
This environment contains only the packages required for querying the model and is thus lighter than the environment for training the model (see below).
- on the stat-machines, make sure you have the http-proxy set up https://wikitech.wikimedia.org/wiki/HTTP_proxy
- you might have to install the following nltk-package manually:
python -m nltk.downloader punkt
Before the API can be loaded, some datasets need to be pre-computed for each target language.
It is essential to follow these steps sequentially because some scripts may require the output of previous ones.
You can run the pipeline for a given language (change the variable LANG):
./run-pipeline.sh
Notes:
- we need to set up a python virtual environment:
virtualenv -p /usr/bin/python3 venv/
source venv/bin/activate
pip install -r requirements_train.txt
- some parts of the script rely on the spark cluster, using a specific conda-environment from a specific stat-machine (stat1008).
- on the stat-machines, make sure you have the http-proxy set up https://wikitech.wikimedia.org/wiki/HTTP_proxy
- you might have to install the following nltk-package manually:
python -m nltk.downloader punkt
The anchor dictionary is the main dictionary used to find candidates and mentions; the bigger, the better (barring memory issues). For English, this is a ~2 GB pickle file.
compute with:
PYSPARK_PYTHON=python3.7 PYSPARK_DRIVER_PYTHON=python3.7 spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G ./scripts/generate_anchor_dictionary_spark.py $LANG
store in:
./data/<LANG>/<LANG>.anchors.pkl
- normalising link-titles (e.g. capitalize the first letter) and anchors (lowercase the anchor-string) via scripts/utils.py (see the sketch below)
- for candidate links, we resolve redirects and only keep main-namespace articles
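To illustrate the normalisation described above, here is a minimal sketch; the function names are hypothetical, and the actual implementations in scripts/utils.py may differ in detail.

def normalise_title(title):
    # link titles: capitalize the first letter (as described above)
    title = title.strip()
    return title[:1].upper() + title[1:]

def normalise_anchor(anchor):
    # anchors: compare case-insensitively by lowercasing the string
    return anchor.strip().lower()

print(normalise_title("garnet Carter"))    # -> "Garnet Carter"
print(normalise_anchor("Miniature Golf"))  # -> "miniature golf"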
Generating the anchor dictionary also produces the following two helper dictionaries:
./data/<LANG>/<LANG>.pageids.pkl
- this is a dictionary of all main-namespace, non-redirect articles with the mapping {page_title: page_id}
./data/<LANG>/<LANG>.redirects.pkl
- this is a dictionary of all main-namespace redirect articles with the mapping {page_title: page_title_rd}, where page_title_rd is the title of the redirected-to article.
Note that the default setup uses the spark-cluster from stat1008 (in order to use the anaconda-wmf newpyter setup); this is necessary for filtering the anchor-dictionary by link-probability. Alternatively, one can run the non-spark version:
python ./scripts/generate_anchor_dictionary.py <LANG>
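Once the three pickles exist, looking up candidate links for a mention might look roughly like the following sketch; the internal structure of the dictionaries (anchor string mapping to candidate page titles) is an assumption and should be checked against the actual files.

import pickle

lang = "de"  # example language
with open(f"./data/{lang}/{lang}.anchors.pkl", "rb") as f:
    anchors = pickle.load(f)
with open(f"./data/{lang}/{lang}.redirects.pkl", "rb") as f:
    redirects = pickle.load(f)
with open(f"./data/{lang}/{lang}.pageids.pkl", "rb") as f:
    pageids = pickle.load(f)

mention = "miniature golf"  # anchor strings are stored lowercased (see above)
for candidate in anchors.get(mention, {}):
    candidate = redirects.get(candidate, candidate)  # resolve redirects
    if candidate in pageids:  # keep only main-namespace, non-redirect pages
        print(candidate, pageids[candidate])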
The wikipedia2vec embeddings model semantic relationships between articles. Get the tool from https://github.com/wikipedia2vec/wikipedia2vec, then run:
wikipedia2vec train --min-entity-count=0 --dim-size 50 --pool-size 10 "/mnt/data/xmldatadumps/public/"$LANG"wiki/latest/"$LANG"wiki-latest-pages-articles.xml.bz2" "./data/"$LANG"/"$LANG".w2v.bin"
store in
./data/<LANG>/<LANG>.w2v.bin
We keep only the vectors of main-namespace articles that are not redirects by running
python filter_dict_w2v.py $LANG
and storing the resulting dictionary as a pickle
./data/<LANG>/<LANG>.w2v.filtered.pkl
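The filtered dictionary can then be used to compute a semantic-relatedness feature between a page and a candidate link target. A minimal sketch follows, assuming the pickle maps page titles to 50-dimensional numpy vectors (the exact key format is an assumption).

import pickle
import numpy as np

lang = "de"
with open(f"./data/{lang}/{lang}.w2v.filtered.pkl", "rb") as f:
    w2v = pickle.load(f)

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# example: semantic relatedness between a page and a candidate link target
print(cosine(w2v["Garnet_Carter"], w2v["Miniature_golf"]))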
The navigation embeddings model how current Wikipedia readers navigate through Wikipedia.
compute via:
PYSPARK_PYTHON=python3.7 PYSPARK_DRIVER_PYTHON=python3.7 spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G ./scripts/generate_features_nav2vec-01-get-sessions.py -l $LANG
- gets reading sessions from the webrequest logs of 1 week (this can be changed)
python ./scripts/generate_features_nav2vec-02-train-w2v.py -l $LANG -rfin True
- fits a word2vec-model with 50 dimensions (this and other hyperparameters can also be changed)
This will generate an embedding stored in
./data/<LANG>/<LANG>.nav.bin
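Conceptually, step 02 treats each reading session as a "sentence" of visited pages and fits a word2vec model over them. A rough sketch of this idea with gensim is shown below; the actual script may use different parameters and data handling.

from gensim.models import Word2Vec

# toy reading sessions: each session is a sequence of visited pages
sessions = [
    ["Garnet_Carter", "Miniature_golf", "Golf"],
    ["Golf", "Golf_course", "Miniature_golf"],
]

# 50-dimensional embeddings, mirroring the default mentioned above
model = Word2Vec(sentences=sessions, vector_size=50, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("./data/de/de.nav.bin", binary=True)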
We keep only the vectors of main-namespace articles that are not redirects by running
python filter_dict_nav.py $LANG
and storing the resulting dictionary as a pickle
./data/<LANG>/<LANG>.nav.filtered.pkl
There is a backtesting dataset used to a) test the accuracy of the model and b) train the model. We mainly want to extract fully formed and linked sentences as our ground truth.
compute with:
python ./scripts/generate_backtesting_data.py $LANG
Datasets are then stored in:
./data/<LANG>/training/sentences_train.csv
./data/<LANG>/testing/sentences_test.csv
We need a dataset with features and training labels (true link, false link).
compute with:
python ./scripts/generate_training_data.py <LANG>
This generates a file stored in:
./data/<LANG>/training/link_train.csv
The link model is the main prediction model: it takes (Page_title, Mention, Candidate Link) and produces a probability of linking.
compute with:
python ./scripts/generate_addlink_model.py <LANG>
store in:
./data/<LANG>/<LANG>.linkmodel.bin
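To see how the saved model is used at query time, here is a hedged sketch. It assumes the link model is a gradient-boosted classifier with a scikit-learn-style predict_proba interface saved via xgboost (an assumption; adapt the loading code to whatever generate_addlink_model.py actually writes), and the feature values are placeholders.

import numpy as np
import xgboost as xgb  # assumption: the link model is an XGBoost classifier

lang = "de"
model = xgb.XGBClassifier()
model.load_model(f"./data/{lang}/{lang}.linkmodel.bin")

# features for one (page, mention, candidate) triple, e.g. embedding
# similarities and anchor statistics; the exact feature set is defined
# by the training script, and these numbers are placeholders
features = np.array([[0.63, 0.41, 0.12, 3.0]])
prob = model.predict_proba(features)[0, 1]
print(prob >= 0.9)  # recommend the link only above the probability threshold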
Evaluate the prediction algorithm on a set of sentences in the training set using micro-precision and micro-recall.
compute with (first 10000 sentences):
python generate_backtesting_eval.py -l $LANG -nmax 10000
store in:
./data/<LANG>/<LANG>.backtest.eval
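For reference, micro-precision and micro-recall aggregate true positives, false positives, and false negatives over all evaluated sentences before dividing; a small self-contained sketch of that computation (independent of the evaluation script itself):

def micro_precision_recall(per_sentence_counts):
    # per_sentence_counts: list of (true_positives, false_positives, false_negatives)
    tp = sum(c[0] for c in per_sentence_counts)
    fp = sum(c[1] for c in per_sentence_counts)
    fn = sum(c[2] for c in per_sentence_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# example: per-sentence counts of predicted vs. ground-truth links
print(micro_precision_recall([(2, 1, 0), (1, 0, 2), (3, 1, 1)]))  # -> (0.75, 0.666...)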
The pickle-dictionaries (anchors, pageids, redirects, w2v, nav) are converted to sqlite-databases using the sqlitedict-package in order to reduce the memory footprint when reading these dictionaries while generating link recommendations for individual articles.
computed via
python ./scripts/generate_sqlite_data.py $LANG
stored in
./data/<LANG>/<LANG>.anchors.sqlite
./data/<LANG>/<LANG>.pageids.sqlite
./data/<LANG>/<LANG>.redirects.sqlite
./data/<LANG>/<LANG>.w2v.filtered.sqlite
./data/<LANG>/<LANG>.nav.filtered.sqlite
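A minimal sketch of this conversion and of reading the result back with sqlitedict is shown below; the commit strategy and the example key are illustrative assumptions, and the actual script may store the data differently.

import pickle
from sqlitedict import SqliteDict

lang = "de"

# convert one pickle dictionary into an sqlite-backed dictionary
with open(f"./data/{lang}/{lang}.anchors.pkl", "rb") as f:
    anchors = pickle.load(f)
with SqliteDict(f"./data/{lang}/{lang}.anchors.sqlite") as db:
    for key, value in anchors.items():
        db[key] = value
    db.commit()  # write everything to disk in one go

# at query time, entries are read on demand instead of loading the full pickle
with SqliteDict(f"./data/{lang}/{lang}.anchors.sqlite") as db:
    mention = "miniature golf"
    if mention in db:
        print(db[mention])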