Papers accepted to ICASSP 2021 related to voice conversion (VC)
- ICASSP2021_paper_list-VC
- VC
- Zero-shot and low-resource VC
- cross-lingual VC
- Toolkit
- Dataset
- Reading Note
- 1. Fragmentvc: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention.
- 2. Maskcyclegan-vc: learning non-parallel voice conversion with filling in frames.
- 3. PPG-based singing voice conversion with adversarial representation learning.
- 4. Again-vc: a one-shot voice conversion using activation guidance and adaptive instance normalization.
- 5. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset.
- 6. Towards natural and controllable cross-lingual voice conversion based on neural tts model and phonetic posteriorgram
- 7. End-to-end lyrics recognition with voice to singing style transfer.
- Maskcyclegan-vc: learning non-parallel voice conversion with filling in frames. (paper,page)
- Non-autoregressive sequence-to-sequence voice conversion.
- Non-parallel many-to-many voice conversion by knowledge transfer from a text-to-speech model.
- Non-parallel many-to-many voice conversion using local linguistic tokens.
- PPG-based singing voice conversion with adversarial representation learning.(paper,demo)
- Again-vc: a one-shot voice conversion using activation guidance and adaptive instance normalization. (paper,code)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations. (paper,Espnet parameter)
- Fragmentvc: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. (paper,code)
- End-to-end lyrics recognition with voice to singing style transfer. (paper,demo)
- One-shot voice conversion based on speaker aware module
- Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. (paper,code)
- Towards low-resource stargan voice conversion using weight adaptive instance normalization. (paper,code)
- Zero-shot voice conversion with adjusted speaker embeddings and simple acoustic features.
- Extending parrotron: an end-to-end, speech conversion and speech recognition model for atypical speech.
- Multi-task wavernn with an integrated architecture for cross-lingual voice conversion
- Towards natural and controllable cross-lingual voice conversion based on neural tts model and phonetic posteriorgram (paper)
- crank: an open-source software for nonparallel voice conversion based on vector-quantized variational autoencoder (paper,code)
- EVC: multi-speaker and multi-lingual emotional speech. (parallel voice conversion dataset) Contains 10 speakers each for Chinese and English, with 350 parallel utterances in total; the average utterance length is 2.9 s. Emotion categories: 1) happy, 2) sad, 3) neutral, 4) angry, and 5) surprise.
1. Fragmentvc: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention.
- Summary: Wav2Vec is used to extract speaker-independent content features from the source speaker, speaker information is drawn from the target speaker's speech features via cross-attention, and the decoding stage reconstructs a mel-spectrogram with the source speaker's content and the target speaker's timbre. With a two-stage training scheme, the model is trained with only an L1 loss, without disentanglement strategies or parallel corpora.
- Code: https://github.com/yistLin/FragmentVC
- Content features of the source speaker are extracted with a wav2vec model pre-trained on LibriSpeech.
- The target encoder consists of 1-D convolutions with ReLU activations.
- Extractor: a Transformer block with self-attention and cross-attention. Three Extractors and the corresponding three Conv1d layers are connected in a stack.
- Smoother: a Transformer block with self-attention only. In both the Smoother and the Extractor, the feed-forward layer is replaced by a single 1-D convolution.
- Since Wav2Vec features inevitably retain some source-speaker information, the authors remove the residual connection in Extractor 1, arguing that this strips out as much of that information as possible (a minimal sketch of the block follows this list).
- Training strategy: Stage 1: the same utterance from the same speaker serves as both the target and the source input, training the model to reconstruct mel-spectrogram features from Wav2Vec features. Stage 2: the target and source inputs still come from the same speaker, but the target input is now a concatenation of 10 speech segments while the source input is a single utterance. Note that at the start of this stage the source utterance is drawn from the 10 utterances forming the target input; as training proceeds, the probability of drawing it from outside those 10 is gradually increased, until the probability that the source comes from the target's 10 utterances reaches 0 (see the sampling sketch below).
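The notes above pin down the Extractor's structure but not its sizes, so here is a minimal PyTorch sketch under assumed hyperparameters (`d_model`, `n_heads`, and the conv kernel size are guesses, not the paper's values); `residual=False` mimics Extractor 1, where the cross-attention residual path is dropped:

```python
import torch.nn as nn

class Extractor(nn.Module):
    """Sketch of a FragmentVC Extractor block: self-attention, cross-attention
    into the target-encoder features, and a Conv1d replacing the usual
    feed-forward layer. Sizes and layer order are assumptions."""
    def __init__(self, d_model=512, n_heads=8, residual=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=9, padding=4)
        self.residual = residual  # False for Extractor 1

    def forward(self, x, tgt):
        # x, tgt: (time, batch, d_model)
        x = x + self.self_attn(x, x, x)[0]
        attn = self.cross_attn(x, tgt, tgt)[0]
        x = x + attn if self.residual else attn   # Extractor 1 drops this residual
        y = self.conv(x.permute(1, 2, 0)).permute(2, 0, 1)  # conv as feed-forward
        return x + y
```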
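And a toy sketch of the stage-2 source sampling; the notes only say the out-of-set probability grows from 0 to 1 over training, so the linear ramp (and the helper name) below are assumptions:

```python
import random

def sample_stage2_pair(utterances, step, total_steps):
    """Pick the target set (10 segments) and one source utterance for
    FragmentVC stage 2; assumes len(utterances) > 10 for one speaker."""
    target_set = random.sample(utterances, 10)     # concatenated as target input
    p_outside = min(1.0, step / total_steps)       # assumed linear ramp 0 -> 1
    if random.random() < p_outside:
        source = random.choice([u for u in utterances if u not in target_set])
    else:
        source = random.choice(target_set)         # early on: source among the 10
    return target_set, source
```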
2. Maskcyclegan-vc: learning non-parallel voice conversion with filling in frames.
- Summary: building on CycleGAN-VC2 and borrowing the training schemes of BERT and image inpainting, a mask is applied to the source speech and the conversion network is trained to fill in the masked regions.
- Several masking probabilities are compared, including: 1) a fixed probability, and 2) a probability sampled at random from a range. Experiments show that sampling the mask probability from [0, 50]% works best.
- Several masking patterns are compared, including: 1) masking consecutive frames, 2) masking non-consecutive frames, 3) masking a frequency band, and 4) masking scattered points. The consecutive-frame mask, i.e. the $m$ in the paper's figure, performs best (a mask-generation sketch follows).
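A minimal NumPy sketch of that winning setting: one contiguous block of frames with its length ratio drawn uniformly from [0, 0.5]. The helper name and the multiplicative-mask convention are assumptions, not the authors' code:

```python
import numpy as np

def make_fif_mask(n_frames, max_ratio=0.5):
    """Filling-in-frames mask: zero out one contiguous block of frames,
    with the masked fraction drawn uniformly from [0, max_ratio]."""
    mask_len = int(n_frames * np.random.uniform(0.0, max_ratio))
    start = np.random.randint(0, n_frames - mask_len + 1)
    mask = np.ones(n_frames, dtype=np.float32)
    mask[start:start + mask_len] = 0.0
    return mask  # multiply the mel-spectrogram by this along the time axis
```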
3. PPG-based singing voice conversion with adversarial representation learning.
- Summary: PPG features are used to capture the linguistic content of the source singer. However, since the source singer's prosody and rhythm are also important in SVC, a Singer Confusion Module is further introduced to supply source information beyond timbre.
- The Singer Confusion Module is trained adversarially to learn singer-independent mel-spectrogram features. To further ensure that the learned features carry information other than timbre (singer identity), a Mel-Regressive Representation Learning module is added, which fuses the learned mel features with a speaker embedding and reconstructs the original song.
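The notes only state that this module is trained adversarially; a gradient reversal layer is one common way to implement such a confusion objective, so the PyTorch sketch below is an assumption about the mechanism, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass,
    so the features are pushed to *confuse* the singer classifier."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

class SingerConfusion(nn.Module):
    def __init__(self, feat_dim, n_singers):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_singers)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) singer-independent mel features
        return self.classifier(GradReverse.apply(feats))
```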
4. Again-vc: a one-shot voice conversion using activation guidance and adaptive instance normalization.
- Summary: AGAIN-VC builds on AdaIN-VC but drops the separate speaker encoder; instead, the mean and standard deviation computed by the instance normalization inside the content encoder carry the speaker information. During decoding, this speaker information is injected into the content embedding via adaptive instance normalization (AdaIN).
- On instance normalization (IN): for a mel-spectrogram $Z$, IN is defined as $\operatorname{IN}(Z)=\frac{Z-\mu(Z)}{\sigma(Z)}$, where the mean $\mu$ and standard deviation $\sigma$ are computed channel-wise over time. In AdaIN-VC, the authors argue that these time-invariant parameters $\mu$ and $\sigma$ represent the speaker information.
- Adaptive instance normalization (AdaIN) can be viewed as the inverse of IN, except that the $\mu$ and $\sigma$ of the target speaker are used. For example, to keep the content of $H$ while transferring the style of $Z$: $\operatorname{AdaIN}(\boldsymbol{H}, \mu(\boldsymbol{Z}), \sigma(\boldsymbol{Z}))=\sigma(\boldsymbol{Z}) \operatorname{IN}(\boldsymbol{H})+\mu(\boldsymbol{Z})$ (see the sketch after this list).
- An interesting point is the choice of activation function. The authors argue that adding an activation at the encoder output (the "Activations" block in the figure) better removes source-speaker information from the content embedding; among the activations compared, sigmoid works best. In the paper's probe experiment, a speaker classifier is trained separately on the content embedding $C$ and on the speaker info $S$ formed by $\mu$ and $\sigma$: ideally, accuracy on $C$ should be as low as possible, while accuracy on $S$ should be as high as possible.
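A small PyTorch sketch of those two operations exactly as written above (the (batch, channels, time) layout and the function names are mine, not the authors' code):

```python
import torch

def in_stats(z, eps=1e-5):
    """Channel-wise, time-invariant mean/std of z: (batch, channels, time).
    These (mu, sigma) act as the speaker info S."""
    mu = z.mean(dim=-1, keepdim=True)
    sigma = z.std(dim=-1, keepdim=True) + eps
    return mu, sigma

def adain(h, mu_z, sigma_z):
    """AdaIN(H, mu(Z), sigma(Z)) = sigma(Z) * IN(H) + mu(Z):
    keep the content of h, impose the speaker stats of z."""
    mu_h, sigma_h = in_stats(h)
    return sigma_z * ((h - mu_h) / sigma_h) + mu_z
```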
5. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset.
- Summary: an autoencoder is used to disentangle the emotion information in speech, while a separate emotion feature extractor supplies the target emotion information, from which speech carrying the target emotion is reconstructed. The model takes the form of a VAW-GAN.
- Released Data: https://github.com/HLTSingapore/Emotional-Speech-Data
- Code: https://kunzhou9646.github.io/controllable-evc/
- Unlike VC, emotional style transfer transforms only the emotion, while preserving the content and the timbre (speaker identity) of the target speaker.
- Training uses parallel corpora.
6. Towards natural and controllable cross-lingual voice conversion based on neural tts model and phonetic posteriorgram
- Summary: with PPG features as a bridge, cross-lingual voice conversion is essentially recast as a source speech - ASR - TTS problem.
- Controllable: since the model's input and output are of equal length, up- or down-sampling the input PPGs adjusts the speaking rate; it is in this sense that the authors call the model controllable (see the resampling sketch after this list).
- The TTS acoustic model is FastSpeech, so the authors also call the proposed model FastSpeech-VC. LPCNet serves as the vocoder.
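A minimal sketch of that rate control: resample the PPG sequence along time before feeding the length-preserving model (the function name and linear interpolation are assumptions; more input frames yield slower output speech):

```python
import torch.nn.functional as F

def change_speed(ppgs, rate):
    """Resample PPGs along time. ppgs: (batch, time, dim);
    rate > 1 lengthens the sequence, hence slows the converted speech."""
    x = ppgs.transpose(1, 2)                      # (batch, dim, time)
    new_len = max(1, int(x.size(-1) * rate))
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)
```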
7. End-to-end lyrics recognition with voice to singing style transfer.
- Summary: to address the shortage of data for end-to-end lyrics transcription, the paper proposes a data augmentation method that converts natural speech into singing voice (V2S, voice to singing). Concretely, the WORLD speech synthesis system generates singing data, taking the F0 of a singing voice together with the spectral envelope and aperiodicity parameters of natural speech as inputs.
- Compared with pairing singing and natural speech at random, the authors find that choosing pairs with similar F0 yields better synthesized singing (a WORLD-based sketch follows).
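A rough sketch of that V2S recipe with the pyworld bindings for WORLD; the file handling and the naive frame-count alignment are my assumptions:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def voice_to_singing(speech_path, singing_path):
    """WORLD analysis of both signals, then resynthesis from the singing F0
    plus the speech spectral envelope and aperiodicity.
    Assumes mono audio; both files must share one sampling rate."""
    speech, fs = sf.read(speech_path)
    singing, fs_sing = sf.read(singing_path)
    assert fs == fs_sing, "resample one of the signals first"
    f0_sing, _, _ = pw.wav2world(singing.astype(np.float64), fs)
    _, sp_speech, ap_speech = pw.wav2world(speech.astype(np.float64), fs)
    n = min(len(f0_sing), sp_speech.shape[0])     # crude frame alignment
    return pw.synthesize(f0_sing[:n], sp_speech[:n].copy(), ap_speech[:n].copy(), fs)
```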