punkProse

Punctuation generation for speech transcripts using lexical, syntactic and prosodic features.

This is a modification of the forked repository: training is reduced to a single stage and more word-level prosodic features are added. This version lets you use any combination of word-aligned features.

Prosodically annotated files are in proscript format (https://github.com/alpoktem/proscript). For example data and extraction scripts see: https://github.com/alpoktem/ted_preprocess

How does it perform?

The English punctuation model was trained on a prosodically annotated TED corpus consisting of 1038 talks (155174 sentences). Link to dataset: http://hdl.handle.net/10230/33981

Punctuation generation accuracy with respect to human transcription:

PUNCTUATION	PRECISION	RECALL	F-SCORE
Comma (,)	61.3	48.9	54.4
Question Mark (?)	71.8	70.6	71.2
Period (.)	82.6	83.5	83.0
Overall	73.7	67.3	70.3

These scores are obtained with a model trained with leveled pause duration and mean f0 features together with word and POS tags.
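As a quick consistency check on the table above, the following sketch recomputes each F-score from the reported precision and recall. It assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); the README does not state the formula. Because the inputs are already rounded to one decimal, the recomputed values can differ from the table in the last digit.

```python
# Reported precision, recall and F-score, copied from the table above.
REPORTED = {
    "Comma (,)": (61.3, 48.9, 54.4),
    "Question Mark (?)": (71.8, 70.6, 71.2),
    "Period (.)": (82.6, 83.5, 83.0),
    "Overall": (73.7, 67.3, 70.3),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recompute(reported=REPORTED):
    """Return {label: (recomputed F1, reported F-score)}."""
    return {label: (f1_score(p, r), f) for label, (p, r, f) in reported.items()}


if __name__ == "__main__":
    for label, (recomputed, reported) in recompute().items():
        print(f"{label:<18} recomputed={recomputed:.2f} reported={reported:.1f}")
```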

Example Run

Requirements:
- Python 3.x
- NumPy
- Theano
- PyYAML (imported as yaml)

The data directory (path $datadir) should have the same layout as the output folder (data/corpus) in https://github.com/alpoktem/ted_preprocess. Vocabularies and sampled training/testing/development sets are stored here.

The sample run explained here is provided in run.sh.

Training

Training is done on sequenced data stored in train_samples under $datadir.

The dataset features to train with are passed with the -f flag, repeated once per feature. Other training parameters are specified in the parameters.yaml file. To train with word, pause, POS and mean f0:

modelId="mod_word-pause-pos-mf0"

python main.py -m $modelId -f word -f pause_before -f pos -f f0_mean -p parameters.yaml
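The repeated -f flag in the command above can be handled with argparse's append action. The sketch below is a minimal, hypothetical parser that mirrors the three short flags shown (-m, -f, -p); it is not taken from main.py, and the long option names and attribute names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags of the training command."""
    parser = argparse.ArgumentParser(description="Training CLI sketch")
    # Long option names below are illustrative assumptions.
    parser.add_argument("-m", "--model-id", required=True,
                        help="identifier used to name the trained model")
    # action="append" collects every -f occurrence into one list, in order.
    parser.add_argument("-f", "--feature", action="append", dest="features",
                        help="word-aligned feature to train with (repeatable)")
    parser.add_argument("-p", "--params-file", required=True,
                        help="YAML file with the remaining training parameters")
    return parser


def parse_training_args(argv):
    """Parse a list of command-line tokens and normalise the feature list."""
    args = build_parser().parse_args(argv)
    args.features = args.features or []
    return args


if __name__ == "__main__":
    demo = parse_training_args(
        ["-m", "mod_word-pause-pos-mf0",
         "-f", "word", "-f", "pause_before", "-f", "pos", "-f", "f0_mean",
         "-p", "parameters.yaml"]
    )
    print(demo.model_id, demo.features, demo.params_file)
```

Collecting the features into an ordered list keeps the choice of feature combination entirely on the command line, which matches the "any combination of word-aligned features" goal stated earlier.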

Testing

Testing is done on proscript data using punctuator.py. The input is either a single file (-i <input-file>) or a directory (-d <input-directory>). Any punctuation information already present in this data is ignored. Predictions for each file in the $test_samples directory are written to the $out_predictions directory. Input files must contain the features that the model was trained with.

model_name="Model_single-stage_""$modelId""_h100_lr0.05.pcl"

python punctuator.py -m $model_name -d $test_samples -o $out_predictions
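The model file name above follows a visible pattern: a fixed prefix, the model ID, and the suffix _h100_lr0.05.pcl. The helper below rebuilds that name in Python. It is a hypothetical convenience function, and it assumes that h100 and lr0.05 encode a hidden layer size of 100 and a learning rate of 0.05 taken from parameters.yaml; the README does not confirm this.

```python
def model_file_name(model_id, hidden_size=100, learning_rate=0.05):
    """Rebuild the model file name pattern shown in the README.

    Assumption: the 'h' and 'lr' parts are the hidden size and learning rate.
    """
    return f"Model_single-stage_{model_id}_h{hidden_size}_lr{learning_rate}.pcl"


if __name__ == "__main__":
    print(model_file_name("mod_word-pause-pos-mf0"))
```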

Scoring the testing output:

Predictions are compared with ground-truth data using error_calculator.py. It takes either two files to compare or two directories containing the ground-truth and prediction files. Use -r to reduce punctuation marks.

python error_calculator.py -g $groundtruthData -p $out_predictions -r
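To illustrate the kind of comparison this step performs, the sketch below computes per-mark precision, recall and F1 from two aligned label sequences, one label per word slot. It is not the implementation in error_calculator.py: the function name, the input representation (plain Python lists) and the use of an empty string for "no punctuation" are all assumptions made for the example.

```python
from collections import Counter


def score_punctuation(groundtruth, predictions, no_punct=""):
    """Per-mark precision, recall and F1 for two aligned label sequences.

    Each sequence holds one punctuation label per word slot; `no_punct`
    marks slots without punctuation and is not scored as a class.
    """
    if len(groundtruth) != len(predictions):
        raise ValueError("sequences must be aligned and of equal length")

    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for gold, pred in zip(groundtruth, predictions):
        if gold != no_punct:
            gold_count[gold] += 1
        if pred != no_punct:
            pred_count[pred] += 1
            if pred == gold:
                true_pos[pred] += 1

    scores = {}
    for mark in sorted(set(gold_count) | set(pred_count)):
        precision = true_pos[mark] / pred_count[mark] if pred_count[mark] else 0.0
        recall = true_pos[mark] / gold_count[mark] if gold_count[mark] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[mark] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    gold = ["", ",", "", ".", "", "?"]
    pred = ["", ",", ",", ".", "", "."]
    for mark, (p, r, f) in score_punctuation(gold, pred).items():
        print(f"{mark!r}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```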

Citing

More details can be found in the publication: https://link.springer.com/chapter/10.1007/978-3-319-68456-7_11

This work can be cited as:

@inproceedings{punkProse,
	author = {Alp Oktem and Mireia Farrus and Leo Wanner},
	title = {Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech},
	booktitle = {5th International Conference on Statistical Language and Speech Processing SLSP 2017},
	year = {2017},
	address = {Le Mans, France}
}
