Running create_text_dataset.py gets Killed and takes too long #88

vishwa27yvs · 2024-04-07T16:20:37Z

I am trying to generate text prompts that include all the files from the repository. From the code, I understand I can do so with the following command:

python create_text_dataset.py --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-2 --file_source all

However, the estimated run time for the test set alone is about 9 hours. Output below:

2024-04-07 12:11:37,130 WARNING Disabling caching
2024-04-07 12:11:39,756 INFO Found {'train', 'dev', 'test'} splits
Adding text inputs:   1%|▊                                                        | 33/2294 [08:06<9:04:44, 14.46s/it]

Is this expected, or am I doing something wrong?
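
For reference, the 9-hour figure follows directly from the progress bar: 2294 test instances at roughly 14.46 s each. A quick sanity check in Python, using only the numbers from the log above (the function name is just for illustration):

```python
def estimate_hours(num_instances: int, seconds_per_instance: float) -> float:
    """Convert an instance count and a per-instance time into hours."""
    return num_instances * seconds_per_instance / 3600.0


# Numbers copied from the tqdm line: 33 of 2294 done at 14.46 s/it.
total = estimate_hours(2294, 14.46)
remaining = estimate_hours(2294 - 33, 14.46)

print(f"total ~ {total:.1f} h, remaining ~ {remaining:.1f} h")
# prints: total ~ 9.2 h, remaining ~ 9.1 h
```

So the estimate is consistent with the per-instance rate; the question is whether about 14 s per instance is itself expected.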


vishwa27yvs · 2024-04-08T00:36:51Z

Update: I ran the same command twice, and both times the process was killed at instance 2177/2294. Output below:

2024-04-07 12:11:37,130 WARNING Disabling caching
2024-04-07 12:11:39,756 INFO Found {'train', 'dev', 'test'} splits
Adding text inputs:  95%|████████████████████████████████████████████████████▏  | 2176/2294 [4:33:25<27:47, 14.13s/it]
create_all_files_benchmark.sh: line 2: 2659349 Killed                  python create_text_dataset.py --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-2 --file_source all

I tried it on two different machines and the output is exactly the same (the process is killed at the same instance), so I am not sure whether this is a memory issue or something else. It would be great to know how to resolve this.
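
A bare `Killed` from the shell means the process received SIGKILL, and on Linux that is often (though not always) the kernel's out-of-memory killer. One way to check whether memory is the cause is to log peak resident memory as the loop advances. The sketch below uses only the standard-library `resource` module (Unix only). `process_instance` is a hypothetical stand-in for the per-instance work, not a function from the repository.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def process_instance(index: int) -> str:
    """Hypothetical placeholder for building one text input."""
    return "x" * 1000


def run(num_instances: int, log_every: int = 100) -> list:
    results = []
    for i in range(num_instances):
        results.append(process_instance(i))
        if (i + 1) % log_every == 0:
            # Steady growth here right up to the kill points at memory.
            print(f"{i + 1}/{num_instances} peak RSS: {peak_rss_mb():.1f} MiB")
    return results


if __name__ == "__main__":
    run(300)
```

If the logged peak climbs toward the machine's RAM shortly before instance 2177, memory is the likely culprit. If it stays flat, the kill probably has another cause.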

VecherVhatuX · 2024-04-09T08:20:44Z

Hi @vishwa27yvs,
I explained why this operation takes so long here: #58 (comment)
The full pipeline would need to be rewritten to speed it up.

john-b-yang · 2024-06-17T18:09:40Z

Tagging @carlosejimenez here to address this.

vishwa27yvs changed the title ~~Running create_text_dataset.py takes too long~~ Running create_text_dataset.py gets Killed and takes too long Apr 8, 2024

john-b-yang added the "inference" label (This issue is related to running inference) Apr 15, 2024

john-b-yang assigned carlosejimenez Jun 17, 2024

john-b-yang added the "in progress" label (We are actively working on this issue.) Jun 17, 2024
