This repository contains the code for the paper:

Phonetic and Visual Priors for Decipherment of Informal Romanization
Maria Ryskina, Matthew R. Gormley, Taylor Berg-Kirkpatrick
ACL 2020

Please contact [email protected] for any questions.

This implementation relies on the OpenFst library and the OpenGrm Ngram library.

Requirements

  • g++ >= 5.4.0
  • OpenFst == 1.7.X (originally implemented with 1.7.4)
  • OpenGrm == 1.3.X (originally implemented with 1.3.8)

You can install the dependencies automatically by building a Conda environment from environment.yml, contributed by Michele Corazza (@ashmikuz).

Data

The data files for Russian and Arabic must be stored in the ./data/ru/ and ./data/ar/ directories respectively.

Russian: ru.tgz contains the full preprocessed romanized Russian dataset, including the symbol tables and priors. The language model data file is a preprocessed version of the vktexts.txt file from the social media segment of the Taiga Corpus; the rest of the data was collected by the authors.

Arabic: ar.tgz contains only the files for the symbol tables and priors. The BOLT Egyptian Arabic SMS/Chat and Transliteration dataset used in this paper is distributed by LDC; a script to preprocess the LDC data into the required format will be added shortly.
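As a rough sketch of the directory layout described above, the archives could be unpacked along these lines. The helper name `unpack_lang` is hypothetical, and the sketch assumes each archive stores its files at the top level rather than inside a `ru/` or `ar/` folder, which this README does not state; listing the archive with `tar -tzf` first is a cheap way to check.

```shell
#!/bin/sh
# Hypothetical helper: unpack <archive> into ./data/<lang>/.
# Assumes the archive stores its files at the top level; if it already
# contains a <lang>/ folder, extract into ./data/ instead.
unpack_lang() {
  lang="$1"
  archive="$2"
  mkdir -p "data/$lang" || return 1
  tar -xzf "$archive" -C "data/$lang"
}

# Example (run from the repository root, after inspecting the archive):
#   tar -tzf ru.tgz | head
#   unpack_lang ru ru.tgz
#   unpack_lang ar ar.tgz
```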

Usage

Run make to build the code. If the OpenFst and OpenGrm libraries are installed somewhere other than the default location (/usr/local/), specify the correct include (-I) and lib (-L) paths in the makefile.
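For a non-default install, the two flags might be derived from a single prefix as in the sketch below. The function name `fst_flags` is hypothetical, and where the flags go in the makefile depends on its variable names, which are not shown here. With the Conda environment active, `$CONDA_PREFIX` would typically be the prefix to pass.

```shell
#!/bin/sh
# Hypothetical helper: print the -I and -L flags for a given install prefix.
# OpenFst headers normally sit under <prefix>/include/fst and OpenGrm
# headers under <prefix>/include/ngram, with libraries in <prefix>/lib.
fst_flags() {
  prefix="${1:-/usr/local}"
  echo "-I$prefix/include -L$prefix/lib"
}

# Example: flags for a user-local install (the path is illustrative).
fst_flags "$HOME/.local"
```

If the libraries are shared objects in a non-default location, the dynamic linker may also need to find them at run time, for example via `LD_LIBRARY_PATH` on Linux.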

To reproduce the supervised experiments described in the paper, run:

./decipher --dataset {ar|ru} --supervised

To reproduce the unsupervised experiments:

./decipher --dataset {ar|ru} --freeze-at 100 [--prior {phonetic|visual|combined}]
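To cover every dataset and prior combination from the usage line above, a small loop can generate the six command lines. This is only a convenience sketch: the function name `unsupervised_commands` is hypothetical, and the flags and values are copied from the usage line without any further assumptions about the binary.

```shell
#!/bin/sh
# Print one unsupervised command line per dataset/prior pair.
# Flags and values are taken verbatim from the usage line in this README.
unsupervised_commands() {
  for dataset in ar ru; do
    for prior in phonetic visual combined; do
      echo "./decipher --dataset $dataset --freeze-at 100 --prior $prior"
    done
  done
}

# Dry run: list the six commands without executing anything.
unsupervised_commands
```

Once the list looks right, piping it to a shell (`unsupervised_commands | sh`) would run the experiments one after another.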

To see the full usage statement and the command line option descriptions, run:

./decipher --help

Reference

@inproceedings{ryskina2020phonetic,
 title={Phonetic and Visual Priors for Decipherment of Informal Romanization},
 author={Ryskina, Maria and Gormley, Matthew R. and Berg-Kirkpatrick, Taylor},
 booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
 year={2020}
}