data preparation #177
Comments
It depends. You can start from any of the existing data preparation scripts. Look at the local/prepare_data.sh scripts in the individual recipes and see which one you can most easily adapt to your own data. Without more information, it's impossible to predict which one will be the easiest to adapt.
Thanks. What I meant by my second question is: once data preparation is done, which egs is useful for running the code (train and decode)?
No idea about data preparation?
You're right, there is no general format that data sets must conform to; for each data format, there are local, non-general preparation steps. Switchboard, WSJ, Tedlium are all different. For your data set, some adaptation of data preparation from existing examples is required, unless it happens to be (or can be processed into being) in the format of one of the examples. However: a generalized example would make a nice addition to Eesen! Just a few things that may differ between your data and the examples include:
As Florian mentioned earlier, without more information (such as above) about your data, we cannot give much additional help. These steps in Tedlium, for example, do data prep:
In this example, the CMUDict phone set and phonetic pronunciation dictionary are used, as well as the CMUSphinx language model. All of these are included in the TEDLIUM data download.
Thanks all. My lexicon.txt has each word followed by a space and its characters; units.txt lists the units with their numbers; and units_nosil.txt lists the units that are not silence. Then I make lang_char by passing data/local/dict_char to utils/ctc_compile_dict_token.sh. I think that's all of the data preparation, maybe!
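For anyone following along, here is a minimal sketch of how the three files described above could be generated from a plain word list. It only mirrors the layout described in this comment (word followed by its characters, units with numbers, non-silence units). The function name `build_char_dict`, the numbering starting at 1, and the alphabetical ordering are assumptions for illustration, not something taken from the Eesen scripts; check an existing character-based recipe for any extra tokens (noise, space, and so on) that it expects.

```python
def build_char_dict(words):
    """Build character-based dictionary contents from an iterable of words.

    Returns three lists of lines:
      lexicon     -> "word c h a r s"   (word, then its characters)
      units_nosil -> one character unit per line
      units       -> "unit index"       (index starts at 1: an assumption)
    """
    # De-duplicate and sort so the output is reproducible.
    vocab = sorted({w.strip() for w in words if w.strip()})

    # Each lexicon entry spells the word out character by character.
    lexicon = [f"{w} {' '.join(w)}" for w in vocab]

    # Every distinct character seen in the vocabulary becomes a unit.
    units_nosil = sorted({ch for w in vocab for ch in w})

    # Pair each unit with a running number.
    units = [f"{u} {i}" for i, u in enumerate(units_nosil, start=1)]

    return lexicon, units_nosil, units


if __name__ == "__main__":
    # Hypothetical word list; in practice this would come from the transcripts.
    lexicon, units_nosil, units = build_char_dict(["hello", "world", "hello"])
    for name, lines in [("lexicon.txt", lexicon),
                        ("units_nosil.txt", units_nosil),
                        ("units.txt", units)]:
        print(f"== {name} ==")
        print("\n".join(lines))
```

The lines returned for each file can then be written into data/local/dict_char before running utils/ctc_compile_dict_token.sh as described in the comment.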
Hi,
I want to run Eesen on my own dataset, using an LSTM-CTC network.
What data preparation is needed?
Which egs can I use, and which would be helpful?