Data preparation #177

akbar20gh · 2018-04-17T16:28:17Z

Hi,
I want to run Eesen on my own dataset, using an LSTM-CTC network.
What data preparation is needed?
Which of the egs recipes would be a helpful starting point?

fmetze · 2018-04-18T15:58:30Z

It depends. You can start from any of the existing data preparation scripts. Look at the local/prepare_data.sh scripts in the individual recipes, and see which one you can most easily adapt to your own data. Without more information, it's impossible to predict which one will be the easiest to adapt.

akbar20gh · 2018-04-18T22:03:15Z

Thanks.
If I use the data and lang that I used in Kaldi, will it work?
What are the differences between data preparation in Kaldi and in Eesen?

My second question was: after data preparation, which egs recipe is useful for running the code (train and decode)?

akbar20gh · 2018-05-12T13:38:45Z

Any ideas about data preparation?
All of the data preparation in the egs is local to each recipe, not general.

riebling · 2018-05-14T14:44:14Z

You're right, there is no general format that data sets must conform to; for each data format, there are local, non-general preparation steps. Switchboard, WSJ, Tedlium are all different. For your data set, some adaptation of data preparation from existing examples is required, unless it happens to be (or can be processed into being) in the format of one of the examples. However: a generalized example would make a nice addition to Eesen!

Just a few things that may differ between your data and the examples include:

Language
Dictionary / lexicon
Language model
(Human) transcribed audio in text format
Phone or token set (used by Dictionary / lexicon)
Audio format / filenaming convention

As Florian mentioned earlier, without more information about your data (such as the items above), we cannot give much additional help. These steps in Tedlium, for example, do data prep:

  # Use the same data preparation script from Kaldi
  local/tedlium_prepare_data.sh --data-dir db/TEDLIUM_release2 || exit 1

  # Construct the phoneme-based lexicon
  local/tedlium_prepare_phn_dict.sh || exit 1;

  # Compile the lexicon and token FSTs
  utils/ctc_compile_dict_token.sh data/local/dict_phn data/local/lang_phn_tmp data/lang_phn || exit 1;

  # Compose the decoding graph
  local/tedlium_decode_graph.sh data/lang_phn || exit 1;

In this example, the CMUDict phone set and phonetic pronunciation dictionary are used, as well as the CMUSphinx language model. All of these are included in the TEDLIUM data download.

akbar20gh · 2018-05-14T19:14:27Z

Thanks, all. Here is what I did.
First, I created the directory data/local/dict_char:
data/local/dict_char$ tree
.
├── lexicon.txt
├── units_nosil.txt
└── units.txt

My lexicon.txt has one entry per line: the word, a space, then its characters separated by spaces.
/data/local/dict_char$ less lexicon.txt

<SPOKEN_NOISE> <SPOKEN_NOISE>

]/ ] /
]/, ] / ,
]/. ] / .
]a,ab ] a , a b
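
A lexicon in this "word, then its characters" layout can be generated from a plain word list. The following is only a minimal sketch: `make_char_lexicon` is a hypothetical helper name, not an Eesen script, and it assumes one word per line with no embedded spaces. With multi-byte text, whether `length` and `substr` count characters or bytes depends on the awk implementation and locale, so check the output on your own data.

```shell
# Hypothetical helper (not part of Eesen): spell each word as space-separated characters.
# Reads one word per line on stdin, writes "word c1 c2 ..." on stdout; blank lines are skipped.
make_char_lexicon() {
  awk 'NF > 0 {
    word = $1
    line = word
    for (i = 1; i <= length(word); i++)
      line = line " " substr(word, i, 1)
    print line
  }'
}

# Example: the noise entry is written by hand because it maps to itself as a single unit.
{
  echo "<SPOKEN_NOISE> <SPOKEN_NOISE>"
  printf '%s\n' "]/" "]/," "]a,ab" | make_char_lexicon
}
```

For the three sample words this reproduces the `]/ ] /`, `]/, ] / ,` and `]a,ab ] a , a b` entries shown above.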

units.txt lists the units with their numbers:
/local/dict_char$ less units.txt
0
<SPOKEN_NOISE> 1
2
3
' 4
, 5
. 6
/ 7

units_nosil.txt lists the units that are not silence.
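
Both unit files can be derived from the lexicon rather than written by hand. This is a sketch under assumptions, not the procedure any Eesen recipe necessarily follows: `list_units` and `number_units` are hypothetical helper names, the input lexicon is assumed to contain only the non-noise entries, the noise unit is placed first, and numbering starts at 1. Adjust the order and starting index to whatever your setup expects.

```shell
# Hypothetical helper: list the distinct units used in a lexicon (fields 2 onward).
list_units() {
  cut -d' ' -f2- "$1" | tr ' ' '\n' | grep -v '^$' | LC_ALL=C sort -u
}

# Hypothetical helper: number the units read from stdin, starting at 1 (an assumption).
number_units() {
  awk '{ print $1 " " NR }'
}

work=$(mktemp -d)   # scratch directory for the example
printf '%s\n' "]/ ] /" "]/, ] / ," "]/. ] / ." > "$work/lexicon_nosil.txt"

# Units without the noise symbol.
list_units "$work/lexicon_nosil.txt" > "$work/units_nosil.txt"

# Full unit table: the noise symbol first, then the remaining units.
{ echo "<SPOKEN_NOISE>"; cat "$work/units_nosil.txt"; } | number_units > "$work/units.txt"
cat "$work/units.txt"
```

Sorting with `LC_ALL=C` keeps the unit order independent of the machine's locale, so the numbering stays stable between runs.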

Then I passed lexicon.txt to utils/sym2int.pl to make lexicon_numbers.txt:
utils/sym2int.pl -f 2- data/local/dict_char/units.txt < data/local/dict_char/lexicon.txt > data/local/dict_char/lexicon_numbers.txt
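
To see what this mapping step produces, and to catch units that are missing from units.txt before compiling, the substitution can be imitated in plain awk. `map_units_to_ints` below is a hypothetical stand-in written for illustration; it is not utils/sym2int.pl and may differ from it in details such as error handling. The small unit table is also made up for the example.

```shell
# Hypothetical stand-in for the mapping step (not the real utils/sym2int.pl):
# keep field 1 (the word) and replace fields 2 onward with indices from the unit table.
# $1 = unit table with "unit index" per line, $2 = lexicon.
map_units_to_ints() {
  awk 'NR == FNR { id[$1] = $2; next }
       {
         out = $1
         for (i = 2; i <= NF; i++) {
           if (!($i in id)) {
             print "undefined unit: " $i > "/dev/stderr"
             exit 1
           }
           out = out " " id[$i]
         }
         print out
       }' "$1" "$2"
}

work=$(mktemp -d)   # scratch directory for the example
printf '%s\n' "<SPOKEN_NOISE> 1" ", 2" "/ 3" "] 4" > "$work/units.txt"
printf '%s\n' "<SPOKEN_NOISE> <SPOKEN_NOISE>" "]/ ] /" "]/, ] / ," > "$work/lexicon.txt"

map_units_to_ints "$work/units.txt" "$work/lexicon.txt" > "$work/lexicon_numbers.txt"
cat "$work/lexicon_numbers.txt"
```

With this example table the result is `<SPOKEN_NOISE> 1`, `]/ 4 3` and `]/, 4 3 2`: the word column is untouched and every unit becomes its index.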

Then I built lang_char by passing data/local/dict_char to utils/ctc_compile_dict_token.sh:
utils/ctc_compile_dict_token.sh --dict-type "char" --space-char "" \
  data/local/dict_char data/local/lang_char_tmp data/lang_char

Next, I created the directory data/local/nist_lm and put the language model in it, in ARPA format, as "lm.arpa.gz".
Then I adapted and ran
local/decode_graph.sh data/lang_char
to make TLG.fst.

I think that is all of the data preparation, maybe!
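
Before compiling the graph, a quick check that the files named in this thread exist can save a failed run. This is a minimal sketch: `check_prepared_files` is a hypothetical helper, and the file list is taken only from the steps described above, so it is an assumption about what your copy of the scripts needs rather than a statement of their actual requirements.

```shell
# Hypothetical pre-flight check: report each expected file that is missing or empty.
# $1 = dictionary directory, $2 = language model directory.
# Returns non-zero if any file is absent or empty.
check_prepared_files() {
  dict_dir=$1
  lm_dir=$2
  status=0
  for f in "$dict_dir/lexicon.txt" "$dict_dir/units.txt" \
           "$dict_dir/units_nosil.txt" "$dict_dir/lexicon_numbers.txt" \
           "$lm_dir/lm.arpa.gz"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f"
      status=1
    fi
  done
  return "$status"
}

# Example call with the paths used in this thread.
check_prepared_files data/local/dict_char data/local/nist_lm \
  || echo "data preparation is incomplete"
```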
