Missing file 'get_all_titles_from_spoken_wikipedia.py' #9

thomaschhh · 2023-11-13T10:48:53Z

I am looking into building the training dataset, but the script referenced in this comment is nowhere to be found:

nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh

Line 23 in 27bce6d

    ## To generate this file use ${NEMO_PATH}/examples/nlp/spellchecking_asr_customization/evaluation/get_all_titles_from_spoken_wikipedia.py --input_folder en/en/english --output_file spoken_wiki_titles.txt

Running the script fails with:

    ./build_training_data.sh: 25: /nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/NeMo/examples/nlp/spellchecking_asr_customization/evaluation/get_all_titles_from_spoken_wikipedia.py: not found


bene-ges · 2023-11-13T11:07:53Z

It's in /nemo_compatible/scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py
I fixed the comment.
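
For anyone who hits the same "not found" error because a comment points at an outdated path, the following is a minimal sketch for locating the script by name inside a local checkout. The function name `find_script` and the command-line handling are illustrative assumptions, not part of the repository.

```python
from pathlib import Path
from typing import List

# File name taken from the issue; the search root is whatever checkout you pass in.
SCRIPT_NAME = "get_all_titles_from_spoken_wikipedia.py"


def find_script(repo_root: str, name: str = SCRIPT_NAME) -> List[Path]:
    """Return every regular file called `name` below `repo_root`, sorted."""
    root = Path(repo_root)
    if not root.is_dir():
        raise NotADirectoryError(f"{repo_root} is not a directory")
    # rglob walks the tree recursively and matches on the file name.
    return sorted(p for p in root.rglob(name) if p.is_file())


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python find_script.py /path/to/nemo_compatible
    search_root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_script(search_root):
        print(path)
```

Pointing it at the root of the nemo_compatible checkout should print the path under scripts/nlp/en_spellmapper/evaluation mentioned above, provided the checkout is recent enough to contain the file.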

thomaschhh · 2023-11-14T08:23:52Z

That's working, thanks.

I am wondering, though, where the input_folder (en/en/english) is supposed to be located.

nemo_compatible/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh

Line 23 in 45fdcea

    ## To generate this file use ${NEMO_COMPATIBLE_PATH}/scripts/nlp/en_spellmapper/evaluation/get_all_titles_from_spoken_wikipedia.py --input_folder en/en/english --output_file spoken_wiki_titles.txt

bene-ges · 2023-11-14T10:14:30Z

Oh, it's the Spoken Wikipedia folder.
It should appear after downloading and unzipping spoken_wikipedia; see this code.

thomaschhh · 2023-11-14T10:26:39Z

Looks like the dataset is no longer available:

WARNING: cannot verify corpora.uni-hamburg.de's certificate, issued by ‘CN=GEANT OV RSA CA 4,O=GEANT Vereniging,C=NL’:
  Issued certificate has expired.
HTTP request sent, awaiting response... 500 Service unavailable (with message)
2023-11-14 11:24:59 ERROR 500: Service unavailable (with message).

bene-ges · 2023-11-14T10:59:25Z

OK, I added spoken_wiki_titles.txt to the repo; that should be sufficient for training.
If you later need the full spoken_wikipedia for evaluation and it is still unavailable, tell me and I will upload my copy to Hugging Face.
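
As a rough pre-flight check before running build_training_data.sh, the sketch below reports whether a titles file is already present or whether the unpacked corpus folder is available to regenerate it. The function name `titles_source` and its return labels are hypothetical. The default paths are the ones quoted in this thread and are resolved relative to the current directory, which is an assumption about where you run it; the thread does not say where in the repo the committed spoken_wiki_titles.txt lives.

```python
from pathlib import Path


def titles_source(titles_file: str = "spoken_wiki_titles.txt",
                  input_folder: str = "en/en/english") -> str:
    """Report how the titles file can be obtained in the current layout.

    Returns "existing" if the titles file is present and non-empty,
    "regenerate" if the unpacked corpus folder is present instead,
    and "missing" if neither is available.
    """
    titles = Path(titles_file)
    if titles.is_file() and titles.stat().st_size > 0:
        return "existing"
    if Path(input_folder).is_dir():
        # The corpus is unpacked, so the extraction script can be run on it.
        return "regenerate"
    return "missing"


if __name__ == "__main__":
    print(titles_source())
```

A result of "missing" corresponds to the situation in this thread before the file was added: the download failed, so there was neither a corpus folder nor a titles file.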
