Phonemizer

The following applies to the phonemizers based on the DiffSinger variance phoneme duration model, including:

DIFFS (located in the General category)
DIFFS ZH (Mandarin Chinese)
DIFFS ZH-YUE (Cantonese)
DIFFS JA (Japanese)
DIFFS EN (English)
DIFFS ES (Spanish)
DIFFS IT (Italian)
DIFFS KO (Korean)
DIFFS PT (Portuguese)
DIFFS RU (Russian)

Usage

Users can input lyrics in the following ways:

Input the lyric directly. By default, the pronunciation given for the lyric in the dictionary is used. If the lyric is not in the dictionary, the phonemizer tries to interpret it as a sequence of phonemes. If that also fails, the phonemizer shows an error.

Lyric followed by a phoneme sequence in square brackets, with the phonemes separated by spaces. The phoneme sequence takes precedence over the dictionary pronunciation. For example, read[r iy d].

A phoneme sequence alone in square brackets, with the phonemes separated by spaces. For example, [r iy d].
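
The three input forms above can be told apart with a small parser. The following is a minimal sketch, not OpenUtau's actual implementation; the helper name `parse_lyric` and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical helper: optional word, then an optional [phoneme sequence].
_LYRIC_RE = re.compile(r"^(?P<word>[^\[\]]*)(?:\[(?P<phonemes>[^\[\]]*)\])?$")

def parse_lyric(lyric):
    """Split a lyric into (word, phonemes); phonemes is None when no bracket is given."""
    match = _LYRIC_RE.match(lyric.strip())
    if match is None:
        raise ValueError(f"malformed lyric: {lyric!r}")
    word = match.group("word").strip()
    raw = match.group("phonemes")
    # Phonemes inside the brackets are separated by spaces.
    phonemes = raw.split() if raw is not None else None
    return word, phonemes

print(parse_lyric("read"))          # ('read', None)
print(parse_lyric("read[r iy d]"))  # ('read', ['r', 'iy', 'd'])
print(parse_lyric("[r iy d]"))      # ('', ['r', 'iy', 'd'])
```

When `phonemes` is not `None`, a caller would use it as is; otherwise it would look the word up in the dictionary.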

File structure

DiffSinger Variance phonemizers and renderers are independent of each other. In principle, any UTAU, NNSVS/ENUNU or DiffSinger voicebank developer can make their voicebank support DiffSinger Variance phonemizers. Conversely, DiffSinger voicebanks can also use other phonemizers, which does not affect audio rendering, auto-pitch or the other DiffSinger features.

To support DiffSinger Variance phonemizers, you need to create a folder named dsdur inside your voicebank that contains the following files:

dsconfig.yaml
phonemes.txt (list of phonemes)
linguistic.onnx (linguistic encoder model)
dur.onnx (phoneme duration model)
Dictionary files
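
As a quick sanity check before loading a voicebank, the file list above can be verified with a short script. This is only a sketch: the function name `missing_dsdur_files` is hypothetical, and it assumes the file names listed above and a dictionary file whose name starts with `dsdict` and ends with `.yaml`.

```python
from pathlib import Path

# File names as listed above (assumed here to be fixed).
REQUIRED_FILES = ("dsconfig.yaml", "phonemes.txt", "linguistic.onnx", "dur.onnx")

def missing_dsdur_files(voicebank_dir):
    """Return the names of required files that are absent from <voicebank>/dsdur."""
    dsdur = Path(voicebank_dir) / "dsdur"
    missing = [name for name in REQUIRED_FILES if not (dsdur / name).is_file()]
    # Dictionary file names depend on the phonemizer, e.g. dsdict.yaml or dsdict-en.yaml.
    has_dictionary = dsdur.is_dir() and any(dsdur.glob("dsdict*.yaml"))
    if not has_dictionary:
        missing.append("dsdict*.yaml")
    return missing
```

An empty list means the folder contains every file the phonemizer expects; otherwise the list names what still has to be added.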

dsconfig.yaml

phonemes: phonemes.txt      # phoneme list
linguistic: linguistic.onnx # linguistic encoder model
dur: dur.onnx               # phoneme duration model
hop_size: 512
sample_rate: 44100
predict_dur: true           # the predict_dur setting used during training
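
To see what hop_size and sample_rate mean together, the sketch below computes the length of one model frame. It assumes, as is usual for DiffSinger models, that one frame spans hop_size audio samples; the function name `frame_ms` is hypothetical.

```python
def frame_ms(hop_size, sample_rate):
    """Length of one frame in milliseconds, assuming one frame spans hop_size samples."""
    return 1000.0 * hop_size / sample_rate

# Values from the dsconfig.yaml example above.
print(round(frame_ms(512, 44100), 2))  # 11.61
```

With these values a frame lasts about 11.61 ms, so a duration of 10 frames corresponds to roughly 116 ms.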

Dictionary file

An OpenUtau YAML dictionary contains a word-to-phoneme dictionary and the type of each phoneme. This format is used because OpenUtau needs to know whether each phoneme is a vowel or a consonant in order to split words into syllables, which is required to support languages with multisyllabic words.

Use dict-to-opu.py to convert your DiffSinger dictionary to an OpenUtau YAML dictionary. The converter guesses the type of each phoneme, but the guess is only reliable for monosyllabic languages such as Mandarin Chinese and Japanese. For other languages, the entries part is correct, but you should manually check the type of each phoneme in the symbols part.

The dictionary format is as follows:

# symbols part: the type of each phoneme. It must include all the phonemes supported by the voicebank.
# type can be vowel, stop, affricate, aspirate, liquid, nasal, fricative or semivowel.
# OpenUtau only checks whether a phoneme is a vowel or a semivowel. Every other type is treated as a consonant.
symbols:
- symbol: SP
  type: vowel
- symbol: AP
  type: vowel
- symbol: a
  type: vowel
- symbol: h
  type: fricative

# entries part: dictionary from words to phonemes.
entries:
- grapheme: SP
  phonemes: [SP]
- grapheme: AP
  phonemes: [AP]
- grapheme: a
  phonemes: [a]
- grapheme: ha
  phonemes: [h, a]
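
Two rules from the comments above lend themselves to an automated check: every type other than vowel or semivowel counts as a consonant, and every phoneme used in the entries part must be declared in the symbols part. The sketch below works on plain Python data that mirrors the YAML structure; the helper names are hypothetical, the `phonemes` key is assumed for every entry, and a real tool would first load the file with a YAML parser.

```python
# In-memory mirror of the example dictionary (assumed structure).
SYMBOLS = [
    {"symbol": "SP", "type": "vowel"},
    {"symbol": "AP", "type": "vowel"},
    {"symbol": "a", "type": "vowel"},
    {"symbol": "h", "type": "fricative"},
]
ENTRIES = [
    {"grapheme": "SP", "phonemes": ["SP"]},
    {"grapheme": "AP", "phonemes": ["AP"]},
    {"grapheme": "a", "phonemes": ["a"]},
    {"grapheme": "ha", "phonemes": ["h", "a"]},
]

def is_consonant(symbol_type):
    """Every type other than vowel or semivowel counts as a consonant."""
    return symbol_type not in ("vowel", "semivowel")

def undeclared_phonemes(symbols, entries):
    """Return phonemes used in entries that have no type in the symbols part."""
    declared = {item["symbol"] for item in symbols}
    used = {p for entry in entries for p in entry["phonemes"]}
    return sorted(used - declared)

consonants = [s["symbol"] for s in SYMBOLS if is_consonant(s["type"])]
print(consonants)                             # ['h']
print(undeclared_phonemes(SYMBOLS, ENTRIES))  # []
```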

Phonemizer	Dictionary file name
DIFFS	dsdict.yaml
DIFFS ZH	dsdict-zh.yaml
DIFFS ZH-YUE	dsdict-zh-yue.yaml
DIFFS JA	dsdict-ja.yaml
DIFFS EN	dsdict-en.yaml
DIFFS ES	dsdict-es.yaml
DIFFS IT	dsdict-it.yaml
DIFFS KO	dsdict-ko.yaml
DIFFS PT	dsdict-pt.yaml
DIFFS RU	dsdict-ru.yaml

G2P

Some phonemizers are based on OpenUtau's built-in G2P module. When writing dictionaries for these phonemizers, you can define in the replacements section how to convert the phoneme sequences output by OpenUtau G2P into phonemes supported by the voicebank. The format of this section is as follows:

# symbols and entries are the same as above. The symbols part still uses the phoneme set used by the voicebank.

# replacements part: the mapping from the phonemes output by OpenUtau G2P to the phoneme set used by the voicebank.
replacements:
- {from: b, to: B2}
- {from: ch, to: CH2}
- {from: d, to: D2}
- {from: f, to: F2}
- {from: g, to: G2}
- {from: k, to: K2}
- {from: l, to: L2}
- {from: m, to: M2}
- {from: n, to: N2}
- {from: p, to: P2}
- {from: r, to: R2}
- {from: s, to: S2}
- {from: sh, to: SH2}
- {from: t, to: T2}
- {from: v, to: V2}
- {from: w, to: W2}
- {from: y, to: Y2}
- {from: z, to: Z2}
- {from: zh, to: ZH2}

The entries part of the dictionary takes precedence over G2P. If a word exists in the dictionary, the pronunciation defined by the dictionary is used instead of the pronunciation given by G2P.
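
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not OpenUtau's code: `phonemize` and `fake_g2p` are hypothetical names, the G2P model is replaced by a stand-in function, and replacements are assumed to apply only to G2P output, not to dictionary pronunciations.

```python
def phonemize(word, entries, replacements, g2p):
    """Use the dictionary pronunciation if present, else map G2P output through replacements."""
    if word in entries:
        return list(entries[word])
    # Phonemes without a replacement rule pass through unchanged.
    return [replacements.get(p, p) for p in g2p(word)]

ENTRIES = {"read": ["R2", "iy", "D2"]}
REPLACEMENTS = {"b": "B2", "d": "D2", "r": "R2"}

def fake_g2p(word):
    """Stand-in for a G2P model: pretends every word is pronounced 'b iy d'."""
    return ["b", "iy", "d"]

print(phonemize("read", ENTRIES, REPLACEMENTS, fake_g2p))  # ['R2', 'iy', 'D2']
print(phonemize("bead", ENTRIES, REPLACEMENTS, fake_g2p))  # ['B2', 'iy', 'D2']
```

The first call never reaches the G2P stand-in because the word is in the dictionary; the second falls back to G2P and has `b` and `d` rewritten to the voicebank's phonemes.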

The following are the phoneme sets used by each G2P:

DIFFS EN (English)

vowels: aa, ae, ah, ao, aw, ay, eh, er, ey, ih, iy, ow, oy, uh, uw
consonants: b, ch, d, dh, f, g, hh, jh, k, l, m, n, ng, p, r, s, sh, t, th, v, w, y, z, zh

DIFFS ES (Spanish)

vowels: a, e, i, o, u
consonants: b, B, ch, d, D, f, g, G, gn, I, k, l, ll, m, n, p, r, rr, s, t, U, w, x, y, Y, z

DIFFS IT (Italian)

vowels: a, a1, e, e1, EE, i, i1, o, o1, OO, u, u1
consonants: b, d, dz, dZZ, f, g, JJ, k, l, LL, m, n, nf, ng, p, r, rr, s, SS, t, ts, tSS, v, w, y, z

DIFFS PT (Portuguese)

vowels: a, a~, e, e~, E, i, i~, o, o~, O, u, u~
consonants: b, d, dZ, f, g, j, j~, J, k, l, L, m, n, p, r, R, s, S, t, tS, v, w, w~, X, z, Z

DIFFS RU (Russian)

vowels: a, aa, ay, ee, i, ii, ja, je, jo, ju, oo, u, uj, uu, y, yy
consonants: b, bb, c, ch, d, dd, f, ff, g, gg, h, hh, j, k, kk, l, ll, m, mm, n, nn, p, pp, r, rr, s, sch, sh, ss, t, tt, v, vv, z, zh, zz
