This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:
"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.
- train.txt - Contains 50,000 lines of diacritized Arabic text which can be used as training dataset
- val.txt - Contains 2,500 lines of diacritized Arabic text which can be used as validation dataset
- test.txt - Contains 2,500 lines of diacritized Arabic text which can be used as testing dataset
- constants
- ARABIC_LETTERS_LIST.pickle - Contains list of Arbaic letters
- CLASSES_LIST.pickle - Contains list of all possible classes
- DIACRITICS_LIST.pickle - Contains list of all diacritics
- count_characters.py - Counts the number of Arabic letters and diacritics in a file
- count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
- diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
- diacritics_rate_extractor.py - Keeps lines with p% diacritics to Arabic characters rate or more in all files from a folder
- file_lookup.py - Searches for a line in all files from a folder
- fix_fathatan.py - Changes after-Alif fathatan to before-Alit fathatan in a file
- remove_diacritics.py - Removes diacritics from a file
- transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
- pre_process_tashkeela_corpus.ipynb - Pre-process Tashkeela Corpus data
- ali-soft - Contains some bugs that exist in Ali-Soft system
- farasa - Contains Farasa system output, fixed output, and DER/WER statistics
- harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
- madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
- mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
- shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
- tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them
Note: All codes in this repository tested on Ubuntu 18.04
The project is available as open source under the terms of the MIT License.