Scalability and efficiency are desired in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and that of the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; instead, it bridges the computational capacity of advanced neural network models with traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps, with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ uses fewer than 1 million parameters, significantly fewer than many other generative models.
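For readers new to LPC, here is a minimal, self-contained sketch of the analysis/synthesis split that CQ builds on: an all-pole filter A(z) whitens a frame into a residual (excitation), and filtering the residual through 1/A(z) reconstructs the frame. The synthetic signal, frame length, and LPC order below are illustrative, not the exact settings used in this repository.

```python
import numpy as np
import scipy.signal
import librosa

# Build a synthetic AR signal so the example is self-contained.
rng = np.random.default_rng(0)
frame = scipy.signal.lfilter([1.0], [1.0, -0.9], rng.standard_normal(512))

order = 16
a = librosa.lpc(frame, order=order)        # A(z) coefficients [1, a_1, ..., a_p]

# Analysis: the residual is the prediction error of the frame.
residual = scipy.signal.lfilter(a, [1.0], frame)

# Synthesis: filtering the residual through 1/A(z) recovers the frame.
recon = scipy.signal.lfilter([1.0], a, residual)
print(np.max(np.abs(frame - recon)))       # ~0 up to numerical error
```

CQ's point is that the residual, rather than the raw waveform, is what the neural codec compresses, while the LPC quantization itself is learned jointly with it.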
Please consider citing our papers if this work helps your research.
@inproceedings{zhen2020cq,
author={Kai Zhen and Mi Suk Lee and Jongmo Sung and Seungkwon Beack and Minje Kim},
title={{Efficient And Scalable Neural Residual Waveform Coding with Collaborative Quantization}},
year=2020,
booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020},
doi={10.1109/ICASSP40776.2020.9054347}
url={https://ieeexplore.ieee.org/document/9054347}
}
@inproceedings{Zhen2019,
author={Kai Zhen and Jongmo Sung and Mi Suk Lee and Seungkwon Beack and Minje Kim},
title={{Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding}},
year=2019,
booktitle={Proc. Interspeech 2019},
pages={3396--3400},
doi={10.21437/Interspeech.2019-1816},
url={http://dx.doi.org/10.21437/Interspeech.2019-1816}
}
- Project Page - I: https://saige.sice.indiana.edu/research-projects/neural-audio-coding/
- Project Page - II: http://kaizhen.us/collaborative-quantization
- utilities.py: supporting functions for Hann windowing, waveform segmentation, and objective measure calculation (see the sketch after this list)
- lpc_utilities.py: LPC analyzer, synthesizer, and related functions implemented in Python
- neural_speech_coding_module.py: model configuration, training, and evaluation for a single neural codec
- cmrl.py: model training and evaluation with multiple cascaded neural codecs
- loss_terms_and_measures.py: loss functions and helpers to calculate objective measures such as PESQ
- nn_core_operator.py: fundamental operations such as convolution and quantization
- constants.py: definitions of the frame size, sample rate, and other initializations
- main.py: the entry point
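As a rough picture of what the windowing utilities do, here is a hypothetical numpy sketch of Hann-windowed segmentation with overlap-add reconstruction; the frame and hop sizes are illustrative, not necessarily those in constants.py.

```python
import numpy as np
from scipy.signal import windows

def segment(x, frame=512, hop=256):
    """Slice x into Hann-windowed, 50%-overlapping frames."""
    w = windows.hann(frame, sym=False)      # periodic Hann: exact COLA
    n = 1 + (len(x) - frame) // hop
    return np.stack([w * x[i * hop:i * hop + frame] for i in range(n)])

def overlap_add(frames, hop=256):
    """Reconstruct a waveform by overlap-adding the windowed frames."""
    n, frame = frames.shape
    y = np.zeros((n - 1) * hop + frame)
    for i, f in enumerate(frames):
        y[i * hop:i * hop + frame] += f
    return y

x = np.random.randn(16000)
y = overlap_add(segment(x))
# Interior samples match x exactly: periodic Hann windows at
# 50% overlap sum to a constant gain of 1 (constant overlap-add).
```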
The experiments are conducted on the TIMIT corpus: https://catalog.ldc.upenn.edu/LDC93S1
python main.py --learning_rate_tanh 0.0002 # learning rate for the 1st codec
--learning_rate_greedy_followers '0.00002 0.000002' # learning rates for the added codecs and finetuning
--epoch_tanh 200 # number of epochs for the 1st codec
--epoch_greedy_followers '50 50' # number of epochs for the added codecs and finetuning
--batch_size 128
--num_resnets 2 # number of neural codecs involved
--training_mode 4 # see main.py for specifications
--base_model_id '1993783' # used for finetuning and evaluation
--from_where_step 2 # used for finetuning and evaluation
--suffix '_greedy_all_' # suffix of the name of the model to be saved
--bottleneck_kernel_and_dilation '9 9 100 20 1 2' # configuration of the ResNet block
--save_unique_mark 'follower_all' # name of the model to be saved
--the_strides '2' # stride value for the downsampling CNN layer
--coeff_term '60 10 10 0' # weights of the loss terms
--res_scalar 1.0
--pretrain_step 2 # number of pre-training steps with no quantization
--target_entropy 2.2 # target entropy (see the quantizer sketch below)
--num_bins_for_follower '32 32' # number of quantization bins per codec
--is_cq 1 # enable collaborative quantization (1) or disable it (0)
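The --num_bins_for_follower and --target_entropy flags relate to the soft-to-hard quantization of [4]: during training, each code value is softly assigned to learnable bins so the operation stays differentiable, and an entropy estimate over bin usage is penalized toward the target bitrate. The numpy sketch below illustrates the idea only; the names and the softness constant alpha are hypothetical, not this repository's implementation.

```python
import numpy as np

def soft_to_hard_quantize(z, bins, alpha=100.0):
    """z: (N,) code values; bins: (K,) learnable bin centers."""
    d = (z[:, None] - bins[None, :]) ** 2              # squared distances
    d -= d.min(axis=1, keepdims=True)                  # softmax stabilization
    p = np.exp(-alpha * d)
    p /= p.sum(axis=1, keepdims=True)                  # soft assignment (N, K)
    z_soft = p @ bins                                  # differentiable surrogate
    z_hard = bins[np.argmax(p, axis=1)]                # used at test time
    q = p.mean(axis=0)                                 # bin usage probabilities
    entropy = -(q * np.log2(q + 1e-12)).sum()          # bits/sample estimate
    return z_soft, z_hard, entropy

z = np.random.randn(1024)
bins = np.linspace(-1, 1, 32)                          # e.g. 32 bins as above
_, _, H = soft_to_hard_quantize(z, bins)
# Training adds a penalty nudging H toward --target_entropy (e.g. 2.2 bits).
```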
python main.py --training_mode 0 # base_model_id must be set correctly; other settings need not change
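To sanity-check decoded outputs with an objective measure such as PESQ (computed by loss_terms_and_measures.py), one option outside this repository is the pesq pip package; the file names below are placeholders.

```python
import soundfile as sf
from pesq import pesq

ref, sr = sf.read('reference.wav')   # 16 kHz mono reference (placeholder path)
deg, _ = sf.read('decoded.wav')      # decoded output at the same rate
print(pesq(sr, ref, deg, 'wb'))      # wide-band PESQ (ITU-T P.862.2)
```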
Our work builds upon several recent publications on end-to-end speech coding, trainable quantizers, and LPCNet.
- [1] D. O’Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
- [2] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
- [3] S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
- [4] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 1141–1151.
Some of the code is borrowed from https://github.com/sri-kankanahalli/autoencoder-speech-compression