Voicebank Development

oxygen-dioxide edited this page Oct 8, 2024 · 14 revisions

Precautions

  • The package format for DiffSinger voicebanks and vocoders may change in the future and break compatibility.
  • All text files should be encoded in UTF-8. If you don't know what that means, use English letters and digits only.
  • You can only use English letters and digits for file names and folder names.
  • After editing a voicebank, you don't need to restart OpenUtau to apply the changes. Just refresh singers and reopen the ustx project.
  • If you run into a bug, feel free to provide feedback. It's suggested to include a full screenshot of your OpenUtau window, your ustx file and the OpenUtau log file.
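
The encoding and naming precautions can be checked before packaging. The following is a minimal sketch, not part of OpenUtau: the function name `check_names_and_encoding` and the list of text-file extensions are assumptions made for illustration, and the name check only tests for ASCII, which is looser than "English letters and digits only".

```python
import tempfile
from pathlib import Path

# Extensions treated as text files in this sketch (an assumption).
TEXT_SUFFIXES = {".txt", ".yaml"}


def check_names_and_encoding(root: Path) -> list[str]:
    """Return a list of human-readable problems found under `root`."""
    problems = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        # Flag file and folder names that contain non-ASCII characters.
        if not path.name.isascii():
            problems.append(f"non-ASCII name: {rel}")
        # Text files must decode cleanly as UTF-8.
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                problems.append(f"not UTF-8: {rel}")
    return problems


if __name__ == "__main__":
    # Small demo on a throwaway folder with one valid and one broken file.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "character.txt").write_text("name=demo", encoding="utf-8")
        (root / "broken.txt").write_bytes(b"name=\xff\xfe")
        print(check_names_and_encoding(root))
```

Point it at your voicebank folder (for example `Path("Singers/mysinger")`) and fix anything it reports.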

Voicebank packaging

Voicebanks are located in the Singers folder under OpenUtau's directory.

OpenUtau.exe
Singers
└─mysinger     
  ├─dsdur             #Diffsinger timing model folder
  | ├─dur.onnx        #duration model, onnx
  | ├─linguistic.onnx #linguistic encoder model, onnx
  | ├─dsconfig.yaml   
  | ├─dsdict.yaml     #OpenUtau yaml dictionary
  | └─phonemes.txt    #phoneme list, auto generated when exporting your model to onnx
  ├─dspitch           #Diffsinger pitch model folder (optional)
  | ├─pitch.onnx      #pitch model, onnx
  | ├─linguistic.onnx #linguistic encoder model, onnx
  | ├─dsconfig.yaml   
  | ├─dsdict.yaml     #OpenUtau yaml dictionary
  | └─phonemes.txt    #phoneme list       
  ├─dsvariance        #Variance (energy and breathiness) model folder (optional)
  | ├─variance.onnx   #Variance model, onnx
  | ├─linguistic.onnx #linguistic encoder model, onnx
  | ├─dsconfig.yaml   
  | ├─dsdict.yaml     #OpenUtau yaml dictionary
  | └─phonemes.txt    #phoneme list
  ├─dsvocoder         #Vocoder folder
  | ├─vocoder.onnx    #Vocoder model, onnx
  | └─vocoder.yaml
  ├─character.txt     #Basic information of your voicebank
  ├─character.yaml    #Voicebank information for OpenUtau
  ├─dsconfig.yaml     #Voicebank information for Diffsinger renderer
  ├─phonemes.txt      #phonemes list, auto generated when exporting your model to onnx
  └─acoustic.onnx     #acoustic model in onnx format
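
To catch a missing file early, the layout above can be checked with a short script. This is a sketch under assumptions: the `LAYOUT` table simply mirrors the tree shown here, `dsvocoder` is treated as optional because shipping a custom vocoder is optional, and treating every other listed file as required is a guess rather than documented OpenUtau behavior. The function name `missing_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# File names copied from the folder tree above.
LAYOUT = {
    ".": ["character.txt", "character.yaml", "dsconfig.yaml",
          "phonemes.txt", "acoustic.onnx"],
    "dsdur": ["dur.onnx", "linguistic.onnx", "dsconfig.yaml",
              "dsdict.yaml", "phonemes.txt"],
    "dspitch": ["pitch.onnx", "linguistic.onnx", "dsconfig.yaml",
                "dsdict.yaml", "phonemes.txt"],
    "dsvariance": ["variance.onnx", "linguistic.onnx", "dsconfig.yaml",
                   "dsdict.yaml", "phonemes.txt"],
    "dsvocoder": ["vocoder.onnx", "vocoder.yaml"],
}
# Folders that are only checked when they exist.
OPTIONAL_FOLDERS = {"dspitch", "dsvariance", "dsvocoder"}


def missing_files(voicebank: Path) -> list[str]:
    """List expected files that are absent, relative to the voicebank folder."""
    missing = []
    for folder, names in LAYOUT.items():
        base = voicebank / folder
        if folder in OPTIONAL_FOLDERS and not base.is_dir():
            continue  # optional folder not shipped: nothing to check
        for name in names:
            if not (base / name).is_file():
                missing.append((Path(folder) / name).as_posix())
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(missing_files(Path(tmp)))  # empty folder: everything required is missing
```

Model file names can be changed in dsconfig.yaml, so a voicebank that uses different names would need the table adjusted.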

character.txt

Only the first line "name" is necessary. All the other lines are optional.

name= Name of your voicebank
image= Voicebank logo (if used, please pack the logo .png or .bmp file into your voicebank)
author= Voicebank author
voice= Voice provider
web= Official website of your voicebank

Example:

name=Zhibin Diffsinger
image=zhibin.png
author=Chisong
voice=Chisong
web=http://zhibin.club/
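
The key=value layout is easy to read programmatically. The sketch below is only an illustration of the format; `parse_character_txt` is a hypothetical helper, and OpenUtau's own parser may treat whitespace or unknown keys differently.

```python
def parse_character_txt(text: str) -> dict[str, str]:
    """Parse key=value lines; lines without '=' are ignored."""
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        # Split on the first '=' only, so URLs containing '=' stay intact.
        key, value = line.split("=", 1)
        info[key.strip()] = value.strip()
    return info


example = """name=Zhibin Diffsinger
image=zhibin.png
author=Chisong
voice=Chisong
web=http://zhibin.club/
"""
info = parse_character_txt(example)
# Only "name" is necessary, so it is the only key checked here.
if "name" not in info:
    raise ValueError("character.txt must contain a name= line")
```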

character.yaml

Don't manually edit this file

text_file_encoding: utf-8
portrait_opacity: 0.67
default_phonemizer: OpenUtau.Core.DiffSinger.DiffSingerPhonemizer
singer_type: diffsinger

dsconfig.yaml

(Note: dsconfig.yaml isn't the config file used in voicebank training. Please follow the format below)

phonemes: phonemes.txt    
acoustic: acoustic.onnx   
vocoder: nsf_hifigan      #Vocoder package name used by your voicebank

# Expression-related configs. You can copy this part from your training config.

# random_pitch_shifting and use_key_shift_embed are related to gender expression. See https://github.com/openvpi/DiffSinger/releases/tag/v1.6.0
# This part is needed only when your voicebank is exported with --expose_gender.

# random_time_stretching and use_speed_embed are related to velocity expression. See https://github.com/openvpi/DiffSinger/releases/tag/v1.7.0
# This part is needed only when your voicebank is exported with --expose_velocity.
augmentation_args:
  random_pitch_shifting:
    range: [-5., 5.]
    scale: 1.5
  random_time_stretching:
    domain: log
    range: [0.5, 2.0]
    scale: 1.5
use_key_shift_embed: true
use_speed_embed: true

use_energy_embed: true       # Whether your voicebank supports energy expression
use_breathiness_embed: true  # Whether your voicebank supports breathiness expression
use_voicing_embed: true      # Whether your voicebank supports voicing expression
use_tension_embed: true      # Whether your voicebank supports tension expression

# These 2 lines are needed if your voicebank uses shallow diffusion
use_shallow_diffusion: true
max_depth: 1000 # K_step when training the voicebank
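
The comments in the config above imply a few dependencies between keys. The following sketch checks them on an already-loaded config dictionary (how you load the YAML is up to you). The function name `check_acoustic_dsconfig` and the exact rules are assumptions derived from those comments, not OpenUtau's actual validation.

```python
def check_acoustic_dsconfig(config: dict) -> list[str]:
    """Return a list of likely inconsistencies in a root dsconfig.yaml."""
    problems = []
    augmentation = config.get("augmentation_args") or {}

    # Gender expression: the embed flag goes with the pitch shifting range.
    if config.get("use_key_shift_embed") and "random_pitch_shifting" not in augmentation:
        problems.append("use_key_shift_embed is set but random_pitch_shifting is missing")

    # Velocity expression: the embed flag goes with the time stretching range.
    if config.get("use_speed_embed") and "random_time_stretching" not in augmentation:
        problems.append("use_speed_embed is set but random_time_stretching is missing")

    # Shallow diffusion needs max_depth (K_step from training).
    if config.get("use_shallow_diffusion") and "max_depth" not in config:
        problems.append("use_shallow_diffusion is set but max_depth is missing")

    return problems


# The example config from this page, written as a Python dictionary.
example_config = {
    "phonemes": "phonemes.txt",
    "acoustic": "acoustic.onnx",
    "vocoder": "nsf_hifigan",
    "augmentation_args": {
        "random_pitch_shifting": {"range": [-5.0, 5.0], "scale": 1.5},
        "random_time_stretching": {"domain": "log", "range": [0.5, 2.0], "scale": 1.5},
    },
    "use_key_shift_embed": True,
    "use_speed_embed": True,
    "use_shallow_diffusion": True,
    "max_depth": 1000,
}
problems = check_acoustic_dsconfig(example_config)
```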

Duration model

Create a subfolder named "dsdur" under your voicebank folder containing these files:

  • linguistic.onnx, dur.onnx, phonemes.txt are exported from python code. Note that you can't mix linguistic.onnx and dur.onnx from different voicebanks. You can only copy the whole dsdur folder.

dsdict.yaml

dsdict.yaml is an OpenUtau YAML dictionary.

You can use dict-to-opu.py to convert your DiffSinger dictionary to an OpenUtau YAML dictionary. Note that the phoneme type guessing doesn't work for multi-syllable languages like English; for those, correct the type of each phoneme manually.

Usage:

python dict-to-opu.py <input> <output>

# symbols part: type of each phoneme.
# type can be vowel, stop, affricate, aspirate, liquid, nasal, fricative and semivowel, but OpenUtau only cares whether a phoneme is a vowel or not.
symbols:
- symbol: SP
  type: vowel
- symbol: AP
  type: vowel
- symbol: a
  type: vowel
- symbol: h
  type: fricative

# entries: grapheme to phonemes dictionary
entries:
- grapheme: SP
  phonemes: [SP]
- grapheme: AP
  phonemes: [AP]
- grapheme: a
  phonemes: [a]
- grapheme: ha
  phonemes: [h, a]
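
To make the dictionary structure concrete, here is a rough sketch of building it from a DiffSinger dictionary. It is not the logic of dict-to-opu.py. It assumes the input has one `grapheme<TAB>phonemes` entry per line with phonemes separated by spaces, and it uses a naive guess: a phoneme that ends any entry is a vowel, and everything else gets the placeholder type `stop`. The function names are hypothetical.

```python
import json


def quote(text: str) -> str:
    # A JSON string is also a valid double-quoted YAML scalar; quoting keeps
    # every symbol a string regardless of how the YAML parser types bare words.
    return json.dumps(text, ensure_ascii=False)


def dict_to_opu_yaml(dict_text: str) -> str:
    """Build OpenUtau-style dictionary text from 'grapheme<TAB>ph1 ph2' lines."""
    # SP and AP are added up front, following the example above.
    entries = [("SP", ["SP"]), ("AP", ["AP"])]
    for line in dict_text.splitlines():
        grapheme, sep, phones = line.strip().partition("\t")
        phonemes = phones.split()
        if not sep or not phonemes:
            continue  # skip blank or malformed lines
        entries.append((grapheme, phonemes))

    symbols = []     # phonemes in first-seen order
    vowels = set()   # naive guess: the last phoneme of an entry is a vowel
    for _, phonemes in entries:
        vowels.add(phonemes[-1])
        for phoneme in phonemes:
            if phoneme not in symbols:
                symbols.append(phoneme)

    lines = ["symbols:"]
    for symbol in symbols:
        kind = "vowel" if symbol in vowels else "stop"
        lines += [f"- symbol: {quote(symbol)}", f"  type: {kind}"]
    lines.append("entries:")
    for grapheme, phonemes in entries:
        joined = ", ".join(quote(p) for p in phonemes)
        lines += [f"- grapheme: {quote(grapheme)}", f"  phonemes: [{joined}]"]
    return "\n".join(lines) + "\n"


yaml_text = dict_to_opu_yaml("a\ta\nha\th a\n")
```

The guess breaks down for languages where syllables can end in consonants, which is why the types need manual correction there.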

dsconfig.yaml

phonemes: phonemes.txt      #phoneme list
linguistic: linguistic.onnx #linguistic model
dur: dur.onnx               #duration model
hop_size: 512
sample_rate: 44100
predict_dur: true

Pitch model (optional)

Create a subfolder named "dspitch" under your voicebank folder containing these files:

  • linguistic.onnx, pitch.onnx, phonemes.txt are exported from python code. Note that you can't mix linguistic.onnx and pitch.onnx from different voicebanks. You can only copy the whole dspitch folder.
  • Use the same dsdict.yaml as for the duration model.

dsconfig.yaml

phonemes: phonemes.txt      #phoneme list
linguistic: linguistic.onnx #linguistic model
pitch: pitch.onnx           #pitch model
hop_size: 512
sample_rate: 44100
predict_dur: true
use_expr: true              #Include this line if your pitch model supports pitch expressiveness (PEXP)

Variance model

If your voicebank supports energy (ENE), breathiness (BREC), tension (TENC) or voicing (VOIC) expression, you have to include a variance model in your voicebank.

Create a subfolder named "dsvariance" under your voicebank folder containing these files:

  • linguistic.onnx, variance.onnx, phonemes.txt are exported from python code. Note that you can't mix linguistic.onnx and variance.onnx from different voicebanks. You can only copy the whole dsvariance folder.
  • Use the same dsdict.yaml as for the duration model.

dsconfig.yaml

linguistic: linguistic.onnx
variance: variance.onnx
phonemes: phonemes.txt
hop_size: 512
sample_rate: 44100
predict_dur: true
# Which parameters does your variance model support
predict_energy: true
predict_breathiness: true
predict_voicing: true
predict_tension: true
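
Since each of these expressions needs both an embed flag in the root dsconfig.yaml and a variance model, the two configs can be cross-checked. The sketch below works on already-loaded dictionaries. The pairing of `use_*_embed` with `predict_*` keys is an assumption based on the matching names, and `expressions_without_variance` is a hypothetical helper.

```python
from typing import Optional

# Assumed pairing between acoustic-model flags and variance-model flags.
FLAG_PAIRS = {
    "use_energy_embed": "predict_energy",
    "use_breathiness_embed": "predict_breathiness",
    "use_voicing_embed": "predict_voicing",
    "use_tension_embed": "predict_tension",
}


def expressions_without_variance(acoustic: dict, variance: Optional[dict]) -> list[str]:
    """Return acoustic flags that are enabled but not predicted by the variance model.

    Pass variance=None when the voicebank has no dsvariance folder.
    """
    variance = variance or {}
    return [
        use_flag
        for use_flag, predict_flag in FLAG_PAIRS.items()
        if acoustic.get(use_flag) and not variance.get(predict_flag)
    ]


acoustic_cfg = {"use_energy_embed": True, "use_tension_embed": True}
variance_cfg = {"predict_energy": True, "predict_breathiness": True}
unmatched = expressions_without_variance(acoustic_cfg, variance_cfg)
```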

Custom vocoder

If you fine-tuned your vocoder, you can ship it with your voicebank.

When rendering, if your voicebank has a dsvocoder subfolder, OpenUtau loads the vocoder from that subfolder. Otherwise, OpenUtau loads the installed vocoder whose name is declared in dsconfig.yaml.

Create a subfolder named "dsvocoder" under your voicebank folder containing these files:

  • vocoder.onnx is exported from python code.

vocoder.yaml

model: "vocoder.onnx" #onnx model file name
num_mel_bins: 128
hop_size: 512
sample_rate: 44100
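
The lookup order described in this section can be sketched as follows. This only illustrates the rule and is not OpenUtau's code: `resolve_vocoder_dir` is a hypothetical function, and `installed_root` stands in for wherever OpenUtau keeps installed vocoders, which this page does not specify.

```python
import tempfile
from pathlib import Path


def resolve_vocoder_dir(voicebank: Path, vocoder_name: str, installed_root: Path) -> Path:
    """Pick the folder the vocoder would be loaded from.

    voicebank:      the voicebank folder, e.g. Singers/mysinger
    vocoder_name:   the `vocoder` value from the root dsconfig.yaml
    installed_root: hypothetical folder holding installed vocoder packages
    """
    bundled = voicebank / "dsvocoder"
    if bundled.is_dir():
        return bundled  # a vocoder shipped with the voicebank takes priority
    return installed_root / vocoder_name  # fall back to the installed vocoder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        voicebank = Path(tmp) / "mysinger"
        voicebank.mkdir()
        installed = Path(tmp) / "installed"
        print(resolve_vocoder_dir(voicebank, "nsf_hifigan", installed))
```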

Export DiffSinger Script (.ds)

A DiffSinger script (.ds) is the input file for DiffSinger command-line inference. If you are a voicebank developer, you can use DiffSinger scripts to preview your voicebank during training.

Notice: A DiffSinger script only contains phonemes and parameters. It can't be edited or converted to singing synthesis project formats. Note that .ds files may be incompatible across DiffSinger versions or training configs. If you want to share your work, please share the .ustx file instead.

How to export a DiffSinger script

  • Choose the appropriate phonemizer and ensure the project plays correctly in OpenUtau for DiffSinger. If you need to export the "gender" expression, select a voicebank that supports gender and was trained with the same config as your voicebank.
  • Click "File → Export Project → Export DiffSinger Script"