Missing file 'get_all_titles_from_spoken_wikipedia.py' #9

Open
thomaschhh opened this issue Nov 13, 2023 · 5 comments

Comments

@thomaschhh
Contributor

I am currently looking into building the training dataset, but the referenced file seems to be nowhere to be found:

## To generate this file use ${NEMO_PATH}/examples/nlp/spellchecking_asr_customization/evaluation/get_all_titles_from_spoken_wikipedia.py --input_folder en/en/english --output_file spoken_wiki_titles.txt

./build_training_data.sh: 25: /nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/NeMo/examples/nlp/spellchecking_asr_customization/evaluation/get_all_titles_from_spoken_wikipedia.py: not found

@bene-ges
Owner

It's in /nemo_compatible/scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py
I fixed the comment.

@thomaschhh
Contributor Author

That's working, thanks.

I am wondering though where the input_folder is supposed to be.

## To generate this file use ${NEMO_COMPATIBLE_PATH}/scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py --input_folder en/en/english --output_file spoken_wiki_titles.txt

@bene-ges
Owner

Oh, it's the spoken wikipedia folder.
It should appear after downloading and unzipping spoken_wikipedia, see this code.
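As a rough illustration of the point above, here is a minimal sketch that checks the unzipped folder is actually present before assembling the command quoted earlier in the thread. The script path, the two flags, and the `NEMO_COMPATIBLE_PATH` variable come from that quoted comment; the helper name `build_titles_command` is hypothetical, and nothing here is taken from the repository's own code.

```python
import os
import sys
from pathlib import Path

# Script location relative to the repo root, as given in the quoted comment.
SCRIPT_RELATIVE_PATH = (
    "scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py"
)


def build_titles_command(repo_root, input_folder, output_file):
    """Hypothetical helper: validate the input folder and return an argv list."""
    input_path = Path(input_folder)
    if not input_path.is_dir():
        # Typical when spoken_wikipedia has not been downloaded and unzipped yet.
        raise FileNotFoundError(f"input folder not found: {input_path}")
    if not any(input_path.iterdir()):
        raise ValueError(f"input folder is empty: {input_path}")
    script = Path(repo_root) / SCRIPT_RELATIVE_PATH
    return [
        sys.executable,
        str(script),
        "--input_folder",
        str(input_path),
        "--output_file",
        str(output_file),
    ]


if __name__ == "__main__":
    # Falls back to the current directory if the variable is not set (an assumption).
    repo_root = os.environ.get("NEMO_COMPATIBLE_PATH", ".")
    try:
        cmd = build_titles_command(repo_root, "en/en/english", "spoken_wiki_titles.txt")
    except (FileNotFoundError, ValueError) as err:
        print(f"cannot run yet: {err}")
    else:
        print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the folder exists; the check only fails early with a clearer message than a missing-path error from inside the script.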

@thomaschhh
Contributor Author

Looks like the dataset is no longer available:

WARNING: cannot verify corpora.uni-hamburg.de's certificate, issued by ‘CN=GEANT OV RSA CA 4,O=GEANT Vereniging,C=NL’:
  Issued certificate has expired.
HTTP request sent, awaiting response... 500 Service unavailable (with message)
2023-11-14 11:24:59 ERROR 500: Service unavailable (with message).

@bene-ges
Owner

OK, I added spoken_wiki_titles.txt to the repo; that should be sufficient for training.
If you later need the full spoken_wikipedia for evaluation and it is still unavailable, tell me and I will upload my copy to Hugging Face.
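
One way to act on this in a local setup is sketched below: prefer the titles file that is already in the repo, and only fall back to regenerating it when the unzipped corpus folder is available. The function name `choose_titles_source` and its return labels are hypothetical, and the paths in the usage lines are just the ones mentioned in this thread.

```python
from pathlib import Path


def choose_titles_source(titles_file, corpus_folder):
    """Hypothetical helper: return 'existing', 'generate', or 'unavailable'."""
    titles = Path(titles_file)
    # A non-empty titles file means the generation step can be skipped for training.
    if titles.is_file() and titles.stat().st_size > 0:
        return "existing"
    # Otherwise the file can only be regenerated from the unzipped corpus.
    if Path(corpus_folder).is_dir():
        return "generate"
    return "unavailable"


if __name__ == "__main__":
    decision = choose_titles_source("spoken_wiki_titles.txt", "en/en/english")
    print(f"titles source: {decision}")
```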

2 participants