A toolbox for phonologizing French infant-directed speech corpora

These scripts clean orthographic transcriptions of CHILDES corpora and convert them into an approximate phonological transcription based on the Lexique380 French dictionary, with the addition of several French phonological rules: obligatory liaison, liquid deletion, schwa insertion (optional), resyllabification (enchaînement) and je-devoicing. All scripts are written in Python 2.

Input files

This phonologizer works with CHILDES orthographic transcriptions of speech written in CHAT format. In order to process a corpus, you must download the collection of *.cha files for the given corpus from CHILDES, and store them in the directory corpora/corpus_name/ under a new subdirectory called raw/. We do not provide the *.cha files for the corpora we have processed here, but they can be downloaded from https://childes.talkbank.org/.
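
Before running the scripts, it can help to confirm that the *.cha files are where the toolbox expects them. The following is a minimal sketch and not part of the toolbox: the function names list_raw_transcripts and corpora_missing_raw_files are hypothetical, and the only thing taken from the description above is the corpora/corpus_name/raw/ layout.

```python
import glob
import os


def list_raw_transcripts(corpus_name, corpora_root="corpora"):
    """Return the sorted *.cha files found under corpora/<corpus_name>/raw/."""
    raw_dir = os.path.join(corpora_root, corpus_name, "raw")
    return sorted(glob.glob(os.path.join(raw_dir, "*.cha")))


def corpora_missing_raw_files(corpora_root="corpora"):
    """Return the names of corpus directories with no *.cha file in raw/."""
    missing = []
    if not os.path.isdir(corpora_root):
        return missing
    for name in sorted(os.listdir(corpora_root)):
        is_corpus_dir = os.path.isdir(os.path.join(corpora_root, name))
        if is_corpus_dir and not list_raw_transcripts(name, corpora_root):
            missing.append(name)
    return missing
```

Calling corpora_missing_raw_files() from the repository root would return the corpora that still need their transcriptions downloaded from CHILDES.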

Main scripts

The phonologizer is split into three scripts, which must be run in this order: clean_corpus.py, phonologize.py and compile.py.

Cleaning

The cleaning script clean_corpus.py takes the orthographic transcriptions located under each directory corpora/corpus_name/raw/ and cleans them up, returning a simplified transcription with no annotations. This transcription is a concatenation of all the *.cha files found in the directory. During this step, utterances from children are filtered out.

The output of this script is a single file called extract.txt, located under corpora/corpus_name/clean/.
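
To give an idea of what this step involves, here is a simplified sketch of filtering and stripping CHAT lines. It is not the logic of clean_corpus.py: the function clean_chat_lines is hypothetical, it assumes that child utterances carry the conventional CHAT speaker code CHI, and it handles only a few annotation types (bracketed codes, angle-bracket scope markers and utterance terminators). Continuation lines and other CHAT markup are ignored.

```python
import re

CHILD_CODE = "CHI"  # assumption: the target child uses CHAT's usual speaker code


def clean_chat_lines(lines):
    """Keep non-child main-tier utterances and strip a few simple annotations."""
    cleaned = []
    for line in lines:
        # Main-tier lines look like "*MOT:<tab>text ."; @headers and %tiers do not match.
        match = re.match(r"^\*([A-Z0-9]+):\s*(.*)$", line.strip())
        if match is None or match.group(1) == CHILD_CODE:
            continue
        text = match.group(2)
        text = re.sub(r"\[[^\]]*\]", " ", text)  # bracketed codes such as [/]
        text = re.sub(r"[<>]", " ", text)        # scope markers around retraced words
        text = re.sub(r"[.?!]+\s*$", "", text)   # utterance terminator
        text = " ".join(text.split())            # collapse leftover whitespace
        if text:
            cleaned.append(text)
    return cleaned
```

Each returned string corresponds to one adult utterance with the handled annotations removed.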

Phonologizing

The script phonologize.py takes the cleaned-up files extract.txt from each corpus in the directory corpora/ and produces an approximate phonological transcription based on the Lexique380 French dictionary, with the addition of the following French phonological rules: obligatory liaison, liquid deletion, schwa insertion (optional), resyllabification (enchaînement) and je-devoicing. For more information on these rules please read the accompanying paper Carbajal et al (2018).pdf.

As the rules are applied in a chain, an output file is produced after each rule, indicating the rules that have been applied so far with a letter: _L for liaison, _D for liquid deletion, _S for schwa insertion and _E for enchaînement + je-devoicing. For instance, the output file with all rules except schwa insertion applied is called phonologized_L_D_E.txt. These files will be saved in the directory output/corpus_name/. Additionally, lists of applied cases (and, for liaison only, also rejected cases) will be printed after each rule for debugging. These can be found in the same directory as the phonologized output.
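
The naming scheme can be illustrated with a short sketch. The helper names below are hypothetical and not taken from phonologize.py, and the sketch assumes the letters always appear in the order L, D, S, E, which is consistent with the example phonologized_L_D_E.txt.

```python
RULE_ORDER = ["L", "D", "S", "E"]  # assumed chain order: liaison, deletion, schwa, enchainement


def output_filename(applied_rules):
    """Build the output file name for a set of applied rule letters."""
    unknown = set(applied_rules) - set(RULE_ORDER)
    if unknown:
        raise ValueError("unknown rule letters: %s" % ", ".join(sorted(unknown)))
    suffix = "".join("_" + letter for letter in RULE_ORDER if letter in applied_rules)
    return "phonologized" + suffix + ".txt"


def chain_filenames(applied_rules):
    """List the file names produced after each successive rule in the chain."""
    ordered = [letter for letter in RULE_ORDER if letter in applied_rules]
    return [output_filename(ordered[:i + 1]) for i in range(len(ordered))]
```

For example, chain_filenames(["L", "D", "E"]) lists the three files written when schwa insertion is skipped, ending with phonologized_L_D_E.txt.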

Auxiliary files containing lists of words and dictionaries necessary for processing the phonological rules are contained in the directory auxiliary/.

Compiling corpora

To obtain one final corpus compiled from multiple corpora, use the compile.py script. This script gathers all output files of a specified kind (e.g., orthographic or phonological transcriptions) located in the directory output/ and concatenates them into a single corpus. It lets you set several parameters, such as the age range of the children at the time of recording and the inclusion of file info at the beginning of each utterance, as well as some final modifications to the phonological transcription, namely the merging of the rounded and unrounded front nasal vowels (not contrastive in many varieties of French) and the removal of geminates across word boundaries.

The resulting compilation will be saved in the directory compiled_corpus/.
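
The sketch below illustrates three of these options under stated assumptions. It is not the code of compile.py. All function names are hypothetical. Age labels are assumed to follow the 0y0m style seen in the compiled corpus file name, with inclusive bounds. Transcriptions are assumed to use one character per phoneme and spaces between words, and any identical pair of symbols at a word boundary is treated as a geminate. The symbols passed to merge_symbols are placeholders, not necessarily the Lexique codes for the nasal vowels.

```python
import re


def age_to_months(label):
    """Convert an age label such as '2y0m' to a number of months."""
    match = re.match(r"^(\d+)y(\d+)m$", label)
    if match is None:
        raise ValueError("unrecognised age label: %r" % label)
    return 12 * int(match.group(1)) + int(match.group(2))


def in_age_range(label, min_label="0y0m", max_label="2y0m"):
    """Check whether an age falls within the range (bounds assumed inclusive)."""
    return age_to_months(min_label) <= age_to_months(label) <= age_to_months(max_label)


def merge_symbols(utterance, source, target):
    """Replace every occurrence of one phoneme symbol with another."""
    return utterance.replace(source, target)


def remove_boundary_geminates(utterance):
    """Drop a word-initial symbol that repeats the previous word's final symbol."""
    result = []
    for word in utterance.split():
        if result and result[-1][-1] == word[0]:
            word = word[1:]
        if word:  # a one-symbol word may disappear entirely
            result.append(word)
    return " ".join(result)
```

With these definitions, in_age_range("1y6m") is true for the default range, while in_age_range("2y1m") is false.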

Phonologized corpus of French infant-directed speech

Using these scripts, we built a phonologized corpus of infant-directed speech based on CHILDES corpora, called corpus_phono_L_D_E_0y0m_2y0m.txt, which can be found in the directory compiled_corpus/. In this corpus we applied liaison, liquid deletion, enchaînement and je-devoicing, but not schwa insertion. Furthermore, we merged the rounded and unrounded front nasal vowels, but we did not remove geminates. For further details regarding the included corpora and the processing steps, please read the documentation file Carbajal et al (2018).pdf.

Citing this work

If you use the toolbox and/or the sample phonologized corpus, please cite the following paper:

Carbajal, M.J., Bouchon, C., Dupoux, E. & Peperkamp, S. (2018) A toolbox for phonologizing French infant-directed speech corpora. Retrieved from: https://github.com/juliacarbajal/french_phonologizer
