This repository contains the code for the paper:

Phonetic and Visual Priors for Decipherment of Informal Romanization
Maria Ryskina, Matthew R. Gormley, Taylor Berg-Kirkpatrick
ACL 2020

Please contact mryskina@cs.cmu.edu for any questions.

This implementation relies on the OpenFst library and the OpenGrm Ngram library.

Requirements

g++ >= 5.4.0
OpenFst == 1.7.X (originally implemented with 1.7.4)
OpenGrm == 1.3.X (originally implemented with 1.3.8)

You can install the dependencies automatically by building a Conda environment from environment.yml, contributed by Michele Corazza (@ashmikuz).

Data

The data files for Russian and Arabic must be stored in the ./data/ru/ and ./data/ar/ directories, respectively.

Russian: ru.tgz contains the full preprocessed romanized Russian dataset, including the symbol tables and priors. The language model data file is a preprocessed version of the vktexts.txt file from the social media segment of the Taiga Corpus; the rest of the data was collected by the authors.

Arabic: ar.tgz contains only the files for the symbol tables and priors. The BOLT Egyptian Arabic SMS/Chat and Transliteration dataset used in this paper is distributed by LDC; a script to preprocess the LDC data into the required format will be added shortly.
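
As a minimal sketch of getting the archives into the directory layout described above: the helper below assumes that ru.tgz and ar.tgz have been downloaded into the repository root and that each archive stores its files at the top level rather than inside a ru/ or ar/ subdirectory. Neither assumption is confirmed here, and unpack_dataset is a hypothetical name, not something shipped with this repository.

```shell
# Hypothetical helper (not part of this repository): extract one archive
# into ./data/<lang>/. Assumes the archive stores its files at the top level.
unpack_dataset() {
    lang="$1"      # "ru" or "ar"
    archive="$2"   # path to the downloaded .tgz file
    mkdir -p "data/$lang"
    tar -xzf "$archive" -C "data/$lang"
}

# Example calls, assuming the archives sit in the current directory:
# unpack_dataset ru ru.tgz
# unpack_dataset ar ar.tgz
```

Running tar -tzf ru.tgz first lists the archive contents without extracting them; if the listing already starts with a ru/ directory, extracting into ./data/ instead would give the expected layout.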

Usage

Run make to build the code. If the OpenFst and OpenGrm libraries are installed somewhere other than the default location (/usr/local/), you need to specify the correct include (-I) and lib (-L) paths in the makefile.
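
For a non-default install location, the flags to add can be derived from the install prefix. The sketch below is illustrative only: openfst_flags is a hypothetical helper, /opt/openfst is a made-up prefix, and the check relies on OpenFst placing its headers under include/fst/ in the chosen prefix. Where exactly the flags go depends on how the makefile in this repository is written.

```shell
# Hypothetical helper (not part of this repository): print the -I and -L
# flags that correspond to a given install prefix.
openfst_flags() {
    prefix="${1:-/usr/local}"
    # Warn on stderr if the OpenFst headers are not where we expect them.
    if [ ! -f "$prefix/include/fst/fstlib.h" ]; then
        echo "warning: fst/fstlib.h not found under $prefix/include" >&2
    fi
    printf '%s\n' "-I$prefix/include -L$prefix/lib"
}

# Example with a made-up prefix; copy the printed flags into the makefile.
# openfst_flags /opt/openfst
```

The printed flags would be added to the compile and link commands in the makefile before running make again.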

To reproduce the supervised experiments described in the paper, run:

./decipher --dataset {ar|ru} --supervised

To reproduce the unsupervised experiments:

./decipher --dataset {ar|ru} --freeze-at 100 [--prior {phonetic|visual|combined}]
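
To cover all three priors for one dataset, the command above can be repeated once per prior. The following sketch only prints the command lines so they can be reviewed before anything runs; unsupervised_cmds is a hypothetical helper name, and the flags are copied verbatim from the usage line above.

```shell
# Hypothetical helper (not part of this repository): print one unsupervised
# run per prior for the given dataset. Nothing is executed here.
unsupervised_cmds() {
    dataset="$1"
    case "$dataset" in
        ar|ru) ;;
        *) echo "dataset must be 'ar' or 'ru'" >&2; return 1 ;;
    esac
    for prior in phonetic visual combined; do
        echo "./decipher --dataset $dataset --freeze-at 100 --prior $prior"
    done
}

# Print the three commands for Russian.
unsupervised_cmds ru
```

Piping the output to a shell (unsupervised_cmds ru | sh) would run the experiments one after another, assuming ./decipher has been built and the data is in place.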

To see the full usage statement and the command line option descriptions, run:

./decipher --help

Reference

@inproceedings{ryskina2020phonetic,
 title={Phonetic and Visual Priors for Decipherment of Informal Romanization},
 author={Ryskina, Maria and Gormley, Matthew R. and Berg-Kirkpatrick, Taylor},
 booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
 year={2020}
}
