data preparation #177

Open
akbar20gh opened this issue Apr 17, 2018 · 5 comments

Comments

@akbar20gh

akbar20gh commented Apr 17, 2018

Hi,
I want to run Eesen on my own dataset, using an LSTM-CTC network.
What data preparation is needed?
Which of the egs recipes would be helpful to start from?

@fmetze
Contributor

fmetze commented Apr 18, 2018

It depends. You can start from any of the existing data preparation scripts. Look at the local/prepare_data.sh scripts in the individual recipes and see which one you can most easily adapt to your own data. Without more information, it is impossible to predict which one will be the easiest to adapt.

@akbar20gh
Author

akbar20gh commented Apr 18, 2018

Thanks.
If I use the data and lang directories that I prepared for Kaldi, will that work?
What are the differences between data preparation in Kaldi and in Eesen?

My second question was: once the data is prepared, which egs recipe is useful for running the code (training and decoding)?

@akbar20gh
Author

Does nobody have any advice about data preparation?
All of the data preparation scripts in egs are local to their recipes; none of them is general.

@riebling
Contributor

You're right, there is no general format that data sets must conform to; for each data format, there are local, non-general preparation steps. Switchboard, WSJ, Tedlium are all different. For your data set, some adaptation of data preparation from existing examples is required, unless it happens to be (or can be processed into being) in the format of one of the examples. However: a generalized example would make a nice addition to Eesen!

Just a few things that may differ between your data and the examples include:

  • Language
  • Dictionary / lexicon
  • Language model
  • (Human) transcribed audio in text format
  • Phone or token set (used by Dictionary / lexicon)
  • Audio format / filenaming convention

As Florian mentioned earlier, without more information (such as above) about your data, we cannot give much additional help. These steps in Tedlium, for example, do data prep:

  # Use the same data preparation script from Kaldi
  local/tedlium_prepare_data.sh --data-dir db/TEDLIUM_release2 || exit 1

  # Construct the phoneme-based lexicon
  local/tedlium_prepare_phn_dict.sh || exit 1;

  # Compile the lexicon and token FSTs
  utils/ctc_compile_dict_token.sh data/local/dict_phn data/local/lang_phn_tmp data/lang_phn || exit 1;

  # Compose the decoding graph
  local/tedlium_decode_graph.sh data/lang_phn || exit 1;

In this example, the CMUDict phone set and phonetic pronunciation dictionary are used, as well as the CMUSphinx language model. All of these are included in the TEDLIUM data download.
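
For data that matches none of the existing recipes, what the first of these steps needs to produce is, as far as I know, a Kaldi-style data directory. Here is a minimal sketch of one, assuming the usual Kaldi file names (wav.scp, text, utt2spk, spk2utt). The utterance IDs, speaker IDs, audio paths, and transcripts are made up for illustration, and the awk line is only a stand-in for Kaldi's utils/utt2spk_to_spk2utt.pl:

```shell
set -eu

# Scratch directory standing in for data/train (hypothetical layout).
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
data="$work/data/train"
mkdir -p "$data"

# wav.scp: <utterance-id> <path to the audio file>
cat > "$data/wav.scp" <<'EOF'
spk1_utt1 /path/to/audio/spk1_utt1.wav
spk1_utt2 /path/to/audio/spk1_utt2.wav
spk2_utt1 /path/to/audio/spk2_utt1.wav
EOF

# text: <utterance-id> <transcript>
cat > "$data/text" <<'EOF'
spk1_utt1 HELLO WORLD
spk1_utt2 GOOD MORNING
spk2_utt1 HELLO AGAIN
EOF

# utt2spk: <utterance-id> <speaker-id>
cat > "$data/utt2spk" <<'EOF'
spk1_utt1 spk1
spk1_utt2 spk1
spk2_utt1 spk2
EOF

# spk2utt: <speaker-id> <utterance-id> ..., derived from utt2spk.
awk '{ utts[$2] = utts[$2] " " $1 }
     END { for (s in utts) print s utts[s] }' "$data/utt2spk" \
  | LC_ALL=C sort > "$data/spk2utt"

cat "$data/spk2utt"
```

Prefixing each utterance ID with its speaker ID keeps the files in the same sort order, which the Kaldi tools expect.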

@akbar20gh
Author

Thanks, all.
Here is what I did.
First I created the directory data/local/dict_char:
data/local/dict_char$ tree
.
├── lexicon.txt
├── units_nosil.txt
└── units.txt

My lexicon.txt maps each word to its characters, separated by spaces:
/data/local/dict_char$ less lexicon.txt

<SPOKEN_NOISE> <SPOKEN_NOISE>


]/ ] /
]/, ] / ,
]/. ] / .
]a,ab ] a , a b
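
A lexicon in this word-to-characters format could be generated from a plain word list roughly as follows. This is only a sketch: the file name words.txt and the three sample words are made up, and it assumes ASCII text, since some awk implementations count bytes rather than characters in length and substr.

```shell
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical word list, one word per line (ASCII only).
printf '%s\n' 'ab' 'a,b' 'cab' > "$work/words.txt"

{
  # The noise entry maps to itself, as in the lexicon above.
  echo '<SPOKEN_NOISE> <SPOKEN_NOISE>'
  # Emit "<word> <char1> <char2> ..." for every word.
  awk '{
    line = $1
    n = length($1)
    for (i = 1; i <= n; i++) line = line " " substr($1, i, 1)
    print line
  }' "$work/words.txt"
} > "$work/lexicon.txt"

cat "$work/lexicon.txt"
```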

units.txt lists the units together with their integer IDs:
/local/dict_char$ less units.txt
0
<SPOKEN_NOISE> 1
2
3
' 4
, 5
. 6
/ 7

units_nosil.txt lists the units that are not silence.
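
Both unit files can be derived from lexicon.txt instead of being written by hand. A rough sketch with a toy lexicon, assuming that <SPOKEN_NOISE> is the only non-character unit and that IDs start at 1 (which, as far as I can tell, is what the existing recipes do); this is not the exact code from any recipe:

```shell
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Toy lexicon in the "<word> <unit> <unit> ..." format.
cat > "$work/lexicon.txt" <<'EOF'
<SPOKEN_NOISE> <SPOKEN_NOISE>
ab a b
a,b a , b
cab c a b
EOF

# units_nosil.txt: every distinct unit on the right-hand side,
# excluding the noise unit.
cut -d' ' -f2- "$work/lexicon.txt" \
  | tr ' ' '\n' \
  | grep -v '^<SPOKEN_NOISE>$' \
  | LC_ALL=C sort -u > "$work/units_nosil.txt"

# units.txt: noise unit first, then the other units, numbered from 1.
{ echo '<SPOKEN_NOISE>'; cat "$work/units_nosil.txt"; } \
  | awk '{ print $1 " " NR }' > "$work/units.txt"

cat "$work/units.txt"
```

Deriving the units from the lexicon guarantees that every symbol used in a pronunciation has an ID, so the later symbol-to-integer mapping cannot hit an unknown unit.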

Then I passed lexicon.txt through utils/sym2int.pl to create lexicon_numbers.txt:
utils/sym2int.pl -f 2- data/local/dict_char/units.txt < data/local/dict_char/lexicon.txt > data/local/dict_char/lexicon_numbers.txt
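
To show what this mapping produces, here is a small awk stand-in that replaces every field from the second one onward with its ID from units.txt. It is an illustration on toy files, not the implementation of utils/sym2int.pl; stopping on an unknown unit is my own choice for the sketch.

```shell
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Toy unit table: "<unit> <id>".
cat > "$work/units.txt" <<'EOF'
<SPOKEN_NOISE> 1
, 2
a 3
b 4
EOF

# Toy lexicon: "<word> <unit> <unit> ...".
cat > "$work/lexicon.txt" <<'EOF'
<SPOKEN_NOISE> <SPOKEN_NOISE>
ab a b
a,b a , b
EOF

# First file fills the id table; second file is rewritten line by line.
awk 'NR == FNR { id[$1] = $2; next }
{
  out = $1                          # keep the word itself as a symbol
  for (i = 2; i <= NF; i++) {
    if (!($i in id)) {
      print "unknown unit: " $i > "/dev/stderr"
      exit 1
    }
    out = out " " id[$i]
  }
  print out
}' "$work/units.txt" "$work/lexicon.txt" > "$work/lexicon_numbers.txt"

cat "$work/lexicon_numbers.txt"
```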

Then I built lang_char by passing data/local/dict_char to utils/ctc_compile_dict_token.sh:
utils/ctc_compile_dict_token.sh --dict-type "char" --space-char "" \
  data/local/dict_char data/local/lang_char_tmp data/lang_char

Next, I created the directory data/local/nist_lm, put the language model in ARPA format (lm.arpa.gz) in it, and changed the decode graph script so that running
local/decode_graph.sh data/lang_char
builds TLG.fst.

I think that is all of the data preparation, maybe!
