This repository provides a PyTorch implementation of the paper Phoneme Boundary Detection using Learnable Segmental Features.
Phoneme Boundary Detection using Learnable Segmental Features
Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi
45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)
Phoneme boundary detection is an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken phonemes were not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superior to the baseline models and reaches state-of-the-art performance in terms of F1 and R-value. We further explore the use of phonetic transcription as additional supervision and show this yields minor improvements in performance but substantially better convergence rates. We additionally evaluate the model on a Hebrew corpus and demonstrate such phonetic supervision can be beneficial in a multi-lingual setting.
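For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how F1 and R-value can be computed from precision and recall (both as fractions in [0, 1]). It uses the R-value definition commonly attributed to Räsänen et al. (2009); whether this repository computes the metrics in exactly this way is an assumption, and the function names are illustrative only.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def r_value(precision: float, recall: float) -> float:
    """R-value from precision and recall (fractions, not percentages).

    Over-segmentation is OS = recall / precision - 1, i.e. the ratio of
    predicted to reference boundaries minus one.
    """
    if precision == 0:
        raise ValueError("precision must be non-zero to compute over-segmentation")
    over_seg = recall / precision - 1
    # Distance from the ideal point (recall = 1, OS = 0).
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    # Signed distance from the line of zero over-segmentation bias.
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    return 1 - (abs(r1) + abs(r2)) / 2


if __name__ == "__main__":
    print(f1_score(0.9, 0.8), r_value(0.9, 0.8))
```

Unlike F1, the R-value penalizes over-segmentation, so a model cannot raise its score simply by predicting many boundaries.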
If you find this implementation useful, please consider citing our work:
@inproceedings{kreuk2020phoneme,
title={Phoneme boundary detection using learnable segmental features},
author={Kreuk, Felix and Sheena, Yaniv and Keshet, Joseph and Adi, Yossi},
booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={8089--8093},
year={2020},
organization={IEEE}
}
loguru==0.4.1
boltons==20.0.0
pandas==1.0.0
pytorch-lightning==0.6.0
SoundFile==0.10.3.post1
test-tube==0.7.5
torch==1.4.0
torchaudio==0.4.0
torchvision==0.4.2
tqdm==4.42.1
git clone https://github.com/felixkreuk/SegFeat.git
cd SegFeat
The dataloader in dataloader.py assumes the dataset is structured as follows:
timit_directory
│
└───val
│ │ X.wav
│ └─ X.phn
│
└───test
│ │ Y.wav
│ └─ Y.phn
│
└───train
│ Z.wav
└─ Z.phn
Where X.wav is a raw waveform signal, and X.phn contains its corresponding phoneme boundaries, labeled in the following format:
0 9640 h#
9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
...
Where the two numbers on each line represent the onset and offset of the phoneme (in samples), and the last element represents the phoneme identity.
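As an illustration of the .phn format described above, here is a minimal sketch of a parser that turns the file contents into (onset, offset, phoneme) tuples and extracts the boundary positions. It is not the code used in dataloader.py; the function names are hypothetical, and the 16 kHz default sample rate is an assumption that matches TIMIT.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]


def parse_phn(text: str) -> List[Segment]:
    """Parse .phn content into (onset, offset, phoneme) tuples, in samples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        segments.append((int(parts[0]), int(parts[1]), parts[2]))
    return segments


def boundaries_in_seconds(segments: List[Segment], sample_rate: int = 16000) -> List[float]:
    """Return the sorted, de-duplicated boundary times in seconds.

    sample_rate=16000 is an assumption (TIMIT recordings are 16 kHz).
    """
    samples = sorted({s for onset, offset, _ in segments for s in (onset, offset)})
    return [s / sample_rate for s in samples]


if __name__ == "__main__":
    example = "0 9640 h#\n9640 11240 sh\n11240 12783 iy\n"
    segs = parse_phn(example)
    print(segs)
    print(boundaries_in_seconds(segs))
```

Adjacent segments share a boundary (the offset of one phoneme is the onset of the next), which is why the boundaries are collected into a set before sorting.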
python main.py --wav_path /path/to/timit/dataset --dataset timit --delta_feats --dist_feats
- If --load_ckpt /path/to/model.ckpt is present, training will resume from the given checkpoint.
- Testing will begin when training finishes (when max epochs is reached or training is stopped via early stopping).
- For more details regarding possible run arguments, please see python main.py --help:
usage: main.py [-h] [--wav_path WAV_PATH] [--dataset {timit,buckeye}]
[--run_dir RUN_DIR] [--exp_name EXP_NAME]
[--load_ckpt LOAD_CKPT] [--gpus GPUS] [--devrun]
[--devrun_size DEVRUN_SIZE] [--lr LR] [--optimizer OPTIMIZER]
[--momentum MOMENTUM] [--epochs EPOCHS] [--batch_size N]
[--dropout DROPOUT] [--seed SEED] [--patience PATIENCE]
[--gamma GAMMA] [--overfit OVERFIT]
[--val_percent_check VAL_PERCENT_CHECK]
[--val_check_interval VAL_CHECK_INTERVAL]
[--val_ratio VAL_RATIO] [--rnn_input_size RNN_INPUT_SIZE]
[--rnn_hidden_size RNN_HIDDEN_SIZE] [--rnn_dropout RNN_DROPOUT]
[--birnn] [--rnn_layers RNN_LAYERS]
[--min_seg_size MIN_SEG_SIZE] [--max_seg_size MAX_SEG_SIZE]
[--max_len MAX_LEN] [--feats {mfcc,mel,spect}] [--random_trim]
[--delta_feats] [--dist_feats] [--normalize]
[--bin_cls BIN_CLS] [--phn_cls PHN_CLS] [--n_fft N_FFT]
[--hop_length HOP_LENGTH] [--n_mels N_MELS] [--n_mfcc N_MFCC]
segmentation
optional arguments:
-h, --help show this help message and exit
--wav_path WAV_PATH
--dataset {timit,buckeye}
--run_dir RUN_DIR directory for saving run outputs (logs, ckpt, etc.)
--exp_name EXP_NAME experiment name
--load_ckpt LOAD_CKPT
path to a pre-trained model, if provided, training
will resume from that point
--gpus GPUS
--devrun dev run on a dataset of size `devrun_size`
--devrun_size DEVRUN_SIZE
size of dataset for dev run
--lr LR initial learning rate
--optimizer OPTIMIZER
--momentum MOMENTUM momentum
--epochs EPOCHS upper epoch limit
--batch_size N batch size
--dropout DROPOUT dropout probability value
--seed SEED random seed
--patience PATIENCE patience for early stopping
--gamma GAMMA gamma margin
--overfit OVERFIT gamma margin
--val_percent_check VAL_PERCENT_CHECK
how much of the validation set to check
--val_check_interval VAL_CHECK_INTERVAL
validation check every K epochs
--val_ratio VAL_RATIO
precentage of validation from train
--rnn_input_size RNN_INPUT_SIZE
number of inputs
--rnn_hidden_size RNN_HIDDEN_SIZE
RNN hidden layer size
--rnn_dropout RNN_DROPOUT
dropout
--birnn BILSTM, if define will be biLSTM
--rnn_layers RNN_LAYERS
number of lstm layers
--min_seg_size MIN_SEG_SIZE
minimal size of segment, examples with segments
smaller than this will be ignored
--max_seg_size MAX_SEG_SIZE
see `min_seg_size`
--max_len MAX_LEN maximal size of sequences
--feats {mfcc,mel,spect}
type of acoustic features to use
--random_trim if this flag is on seuqences will be randomly trimmed
--delta_feats if this flag is on delta features will be added
--dist_feats if this flag is on the euclidean features will be
added (see paper)
--normalize flag to normalize features
--bin_cls BIN_CLS coefficient of binary classification loss
--phn_cls PHN_CLS coefficient of phoneme classification loss
--n_fft N_FFT n_fft for feature extraction
--hop_length HOP_LENGTH
hop_length for feature extraction
--n_mels N_MELS number of mels
--n_mfcc N_MFCC number of mfccs
To run a test epoch, run the following command:
python main.py --wav_path /path/to/timit/ --dataset timit --delta_feats --dist_feats --load_ckpt segmentor.ckpt --test