MAD

Misalignment detection (MAD)

This repository contains the code for the paper "Misalignment detection for web scraped corpora: a supervised regression approach".

Dependencies

python3
numpy
scipy
scikit-learn
pandas
pytorch (tested with 0.4)
InferSent (git clone https://github.com/facebookresearch/InferSent.git )

A Dockerfile is provided in docker_mad/Dockerfile.

Labeled dataset

The labeled dataset used to train the supervised regression model:
DATA/supervised_dataset/labeled.tsv

A tab-separated file in which each line contains:
source \t target \t source_2_cross \t target_2_cross \t source_2_en \t target_2_en \t MAD_score
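
As a minimal sketch of how this file could be parsed, the snippet below reads the seven fields with the standard `csv` module. The column names come from the format description above; the helper name `read_labeled`, the absence of a header row, and the assumption that MAD_score parses as a float are ours and should be checked against the actual file.

```python
import csv
import io

# Column names taken from the format description; the order is assumed to match.
LABELED_COLUMNS = [
    "source", "target",
    "source_2_cross", "target_2_cross",
    "source_2_en", "target_2_en",
    "MAD_score",
]


def read_labeled(lines):
    """Yield one dict per row of the labeled TSV, with MAD_score as a float.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper, not part of the repository.
    """
    # QUOTE_NONE keeps quote characters inside sentences as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if len(row) != len(LABELED_COLUMNS):
            raise ValueError(
                f"line {row_number}: expected {len(LABELED_COLUMNS)} fields, "
                f"got {len(row)}"
            )
        record = dict(zip(LABELED_COLUMNS, row))
        record["MAD_score"] = float(record["MAD_score"])
        yield record


# Tiny in-memory demo with made-up rows (not taken from the real dataset).
demo = io.StringIO(
    "Hello\tBonjour\tx\ty\tHello\tHello\t0.1\n"
    'He said "hi"\tAu revoir\tx\ty\tHe said "hi"\tGoodbye\t0.9\n'
)
labeled = list(read_labeled(demo))
print(labeled[1]["source"], labeled[1]["MAD_score"])
```

For the real file, pass a handle opened with `open("DATA/supervised_dataset/labeled.tsv", encoding="utf-8", newline="")` instead of the demo buffer.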

Notebook

The notebook in which the machine learning pipeline is explained:
https://github.com/ArneDefauw/MAD/blob/master/1-MAD_notebook.ipynb

Pre-trained MAD model

MAD/models/estimator_SVC.save

Using MAD

MAD.py

Example:

If en.sentences is a file containing English sentences, ga.sentences is a file containing Irish sentences, and ga.sentences.translated contains the English translation of ga.sentences, then the MAD score for each sentence pair can be calculated with the command:

python MAD.py \
--path_src en.sentences \
--path_tgt ga.sentences \
--path_tgt_translated ga.sentences.translated \
--output_folder /MAD

The script creates the file MAD_scores, which contains on each line the MAD score for the corresponding sentence pair.
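
Because MAD_scores is line-aligned with the input files, the scores can be joined back to the sentence pairs by position. The sketch below shows one way to do this; `pair_with_scores` is a hypothetical helper that is not part of MAD.py, and it assumes each line of MAD_scores holds a single number. Which end of the score range indicates a better alignment is not stated here and should be checked against the paper or the notebook before filtering.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel used to detect inputs of different length


def pair_with_scores(src_lines, tgt_lines, score_lines):
    """Return (source, target, score) triples, one per input line.

    Each argument is an iterable of text lines, e.g. an open file handle.
    Raises ValueError if the three inputs do not have the same length.
    """
    triples = []
    rows = zip_longest(src_lines, tgt_lines, score_lines, fillvalue=_MISSING)
    for line_number, (src, tgt, score) in enumerate(rows, start=1):
        if src is _MISSING or tgt is _MISSING or score is _MISSING:
            raise ValueError(f"inputs differ in length at line {line_number}")
        triples.append((src.rstrip("\n"), tgt.rstrip("\n"), float(score)))
    return triples


# Made-up example data standing in for en.sentences, ga.sentences, MAD_scores.
src = ["The cat sleeps.\n", "Thank you.\n"]
tgt = ["Tá an cat ina chodladh.\n", "Fáilte.\n"]
scores = ["0.12\n", "0.87\n"]

pairs = pair_with_scores(src, tgt, scores)
# Sort by score so the extremes of the range can be inspected by hand.
ranked = sorted(pairs, key=lambda triple: triple[2])
for source, target, score in ranked:
    print(f"{score:.2f}\t{source}\t{target}")
```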

Data used for intrinsic evaluation

The folder 'MAD/Gold_standard' contains the data used for intrinsic evaluation on the alignment gold standards (EN-FR and EN-GA).

The folders 'documents_en_fr' and 'documents_en_ga' contain the 13 and 11 documents, respectively, used to create the alignment gold standards. The file 'url_keys_matched' is a tab-separated file containing the document ID, the URL of the English document, the URL of the foreign-language document, and the similarity score produced by Malign (https://github.com/paracrawl/Malign).

The files en_fr_GS.en | en_fr_GS.fr and en_ga_GS.en | en_ga_GS.ga contain the gold standards for EN-FR and EN-GA, respectively.

'corpus_en_fr' and 'corpus_en_ga' are tab-separated files containing the alignments produced by Hunalign. Each line contains:

url_src \t url_tgt \t src \t tgt \t BiCleaner_score \t MAD_score

Data used for extrinsic evaluation

The data used to train our NMT engines, apart from the open-source baseline training data described in the paper, can be found here:
https://github.com/ArneDefauw/MAD/releases/download/v1.0.0/corpus_en_fr.gz
https://github.com/ArneDefauw/MAD/releases/download/v1.0.0/corpus_en_ga.gz

They are tab-separated files in which each line contains:

url_src \t url_tgt \t src \t tgt \t tgt_translated_to_src \t BiCleaner_score \t MAD_score
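
The following is a rough sketch of how one of these archives might be streamed without unpacking it first. It assumes the `.gz` files are plain gzip-compressed, UTF-8 encoded TSV with no header row and with both score columns parseable as floats; `read_corpus` and `CORPUS_COLUMNS` are illustrative names, not part of the repository.

```python
import csv
import gzip

# Column names taken from the format description; the order is assumed to match.
CORPUS_COLUMNS = [
    "url_src", "url_tgt", "src", "tgt",
    "tgt_translated_to_src", "BiCleaner_score", "MAD_score",
]


def read_corpus(path_or_file):
    """Yield one dict per line of a gzip-compressed corpus file.

    `path_or_file` is a filename or a binary file object; both are accepted
    by gzip.open. Hypothetical helper, written under the assumptions above.
    """
    with gzip.open(path_or_file, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside sentences as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row_number, row in enumerate(reader, start=1):
            if len(row) != len(CORPUS_COLUMNS):
                raise ValueError(
                    f"line {row_number}: expected {len(CORPUS_COLUMNS)} "
                    f"fields, got {len(row)}"
                )
            record = dict(zip(CORPUS_COLUMNS, row))
            record["BiCleaner_score"] = float(record["BiCleaner_score"])
            record["MAD_score"] = float(record["MAD_score"])
            yield record
```

After downloading an archive, `for record in read_corpus("corpus_en_ga.gz"): ...` iterates over the sentence pairs one at a time, so the whole corpus never has to be held in memory.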