MAD

Misalignment detection (MAD)

This repository contains the code for the paper "Misalignment detection for web scraped corpora: a supervised regression approach".

Dependencies

python3
numpy
scipy
scikit-learn
pandas
pytorch (tested with 0.4)
InferSent (git clone https://github.com/facebookresearch/InferSent.git )

A Dockerfile can be found in docker_mad/Dockerfile.

Labeled dataset

The labeled dataset used to train the supervised regression model:
DATA/supervised_dataset/labeled.tsv

A tab-separated file; each line contains:
source \t target \t source_2_cross \t target_2_cross \t source_2_en \t target_2_en \t MAD_score
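
The column layout above can be loaded with pandas, which is already listed under Dependencies. The following is a minimal sketch, not part of the repository: the helper name `load_labeled` is hypothetical, and it assumes the file has no header row, no quoted fields, and a numeric value in the MAD_score column.

```python
import csv
import io

import pandas as pd

# Column order as listed in this README.
LABELED_COLUMNS = [
    "source", "target",
    "source_2_cross", "target_2_cross",
    "source_2_en", "target_2_en",
    "MAD_score",
]


def load_labeled(path_or_buffer):
    """Read the labeled dataset into a DataFrame (hypothetical helper).

    Assumptions: no header row, no quoted fields, numeric MAD_score.
    """
    df = pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=LABELED_COLUMNS,
        quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
        keep_default_na=False,    # keep sentences such as "NA" as strings
    )
    # Raises if a score cannot be parsed, e.g. when a header row is present.
    df["MAD_score"] = pd.to_numeric(df["MAD_score"])
    return df


if __name__ == "__main__":
    # Tiny in-memory sample standing in for DATA/supervised_dataset/labeled.tsv
    sample = io.StringIO("Hello\tBonjour\ta\tb\tHello\tHello\t0.9\n")
    print(load_labeled(sample))
```

If the file turns out to have a header row, dropping `header=None` and `names=` would be the natural adjustment.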

Notebook

Notebook explaining the machine learning pipeline:
https://github.com/ArneDefauw/MAD/blob/master/1-MAD_notebook.ipynb

Pre-trained MAD model

MAD/models/estimator_SVC.save

Using MAD

MAD.py

example:

If en.sentences is a file containing English sentences, ga.sentences is a file containing Irish sentences, and ga.sentences.translated contains the translation of ga.sentences into English, then the MAD score for each sentence pair can be calculated with the command:

python MAD.py \
--path_src en.sentences \
--path_tgt ga.sentences \
--path_tgt_translated ga.sentences.translated \
--output_folder /MAD

The code creates the file MAD_scores, in which each line holds the MAD score for the corresponding sentence pair.
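
Because MAD_scores is line-aligned with the input files, the scores can be joined back to the sentence pairs by reading the three files in parallel. The sketch below is illustrative only: the function names are hypothetical, it assumes each line of MAD_scores holds a single number, and it leaves the keep/discard rule to the caller because this README does not state a threshold or which direction of the score indicates a better alignment.

```python
from itertools import zip_longest


def read_scored_pairs(src_lines, tgt_lines, score_lines):
    """Yield (src, tgt, score) triples from three line-aligned iterables.

    Assumption: each entry of score_lines is a single number.
    """
    missing = object()
    rows = zip_longest(src_lines, tgt_lines, score_lines, fillvalue=missing)
    for line_no, (src, tgt, score) in enumerate(rows, start=1):
        if src is missing or tgt is missing or score is missing:
            raise ValueError(f"inputs differ in length at line {line_no}")
        yield src.rstrip("\n"), tgt.rstrip("\n"), float(score)


def filter_pairs(triples, keep):
    """Return the (src, tgt) pairs whose score satisfies the predicate `keep`."""
    return [(src, tgt) for src, tgt, score in triples if keep(score)]


def filter_files(path_src, path_tgt, path_scores, keep):
    """Apply filter_pairs to files on disk (all paths supplied by the caller)."""
    with open(path_src, encoding="utf-8") as src, \
         open(path_tgt, encoding="utf-8") as tgt, \
         open(path_scores, encoding="utf-8") as scores:
        return filter_pairs(read_scored_pairs(src, tgt, scores), keep)
```

A call might look like `filter_files("en.sentences", "ga.sentences", "/MAD/MAD_scores", keep=lambda s: s >= 0.5)`, where both the score path and the cut-off value are placeholders rather than values taken from the repository.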

Data used for intrinsic evaluation

The folder 'MAD/Gold_standard' contains the data used for intrinsic evaluation on the alignment Gold Standards (EN-FR and EN-GA).

The folders 'documents_en_fr' and 'documents_en_ga' contain the 13 and 11 documents, respectively, used to create the alignment Gold Standards. The file 'url_keys_matched' is a tab-separated file containing the doc id, the URL of the English document, the URL of the foreign-language document and the similarity score produced by Malign (https://github.com/paracrawl/Malign).

The files en_fr_GS.en | en_fr_GS.fr and en_ga_GS.en | en_ga_GS.ga contain the Gold Standards for EN-FR and EN-GA, respectively.

'corpus_en_fr' and 'corpus_en_ga' are tab-separated files containing the alignments produced by Hunalign; each line contains:

url_src \t url_tgt \t src \t tgt \t BiCleaner_score \t MAD_score

Data used for extrinsic evaluation

The data used to train our NMT engines, apart from the open source baseline training data described in the paper, can be found here:
https://github.com/ArneDefauw/MAD/releases/download/v1.0.0/corpus_en_fr.gz
https://github.com/ArneDefauw/MAD/releases/download/v1.0.0/corpus_en_ga.gz

These are tab-separated files; each line contains:

url_src \t url_tgt \t src \t tgt \t tgt_translated_to_src \t BiCleaner_score \t MAD_score
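
The released corpora are gzip-compressed, so they can be streamed row by row without unpacking them first. This is a minimal sketch using only the Python standard library: `CorpusRow`, `parse_row` and `iter_corpus` are hypothetical names, and the code assumes UTF-8 text, exactly seven fields per line, and numeric values in both score columns.

```python
import gzip
from typing import Iterator, NamedTuple


class CorpusRow(NamedTuple):
    """One line of the released corpus, in the column order listed above."""
    url_src: str
    url_tgt: str
    src: str
    tgt: str
    tgt_translated_to_src: str
    bicleaner_score: float
    mad_score: float


def parse_row(line: str) -> CorpusRow:
    """Split one tab-separated line into a CorpusRow."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    # Assumption: both score columns hold plain numbers.
    return CorpusRow(*fields[:5], float(fields[5]), float(fields[6]))


def iter_corpus(path: str) -> Iterator[CorpusRow]:
    """Stream rows from a gzip-compressed corpus file (assumed UTF-8)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield parse_row(line)
```

After downloading one of the archives, `for row in iter_corpus("corpus_en_ga.gz"): ...` would iterate over it one sentence pair at a time.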
