Running create_text_dataset.py gets Killed and takes too long #88

Open
vishwa27yvs opened this issue Apr 7, 2024 · 3 comments
Labels: in progress (We are actively working on this issue.), inference (This issue is related to running inference)

Comments

@vishwa27yvs

I am trying to generate textual prompts in which all files from the repository are included in the prompt. From the code, I understand that I can do so with the following command:

python create_text_dataset.py --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-2 --file_source all

However, the estimated run time for the test set alone is about 9 hours. The output is below:

2024-04-07 12:11:37,130 WARNING Disabling caching
2024-04-07 12:11:39,756 INFO Found {'train', 'dev', 'test'} splits
Adding text inputs:   1%|▊                                                        | 33/2294 [08:06<9:04:44, 14.46s/it]

Is this expected, or am I doing something wrong?
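
As a quick sanity check on the progress bar above, the remaining time can be recomputed from the numbers tqdm reports. This is only a sketch that assumes the per-instance rate stays constant; the helper name `remaining_hours` is made up for illustration.

```python
# Numbers taken from the tqdm line in the log above.
TOTAL_INSTANCES = 2294     # size of the split being processed
DONE = 33                  # instances finished when the log was captured
SECONDS_PER_ITEM = 14.46   # rate reported by tqdm


def remaining_hours(total: int, done: int, rate_s: float) -> float:
    """Hours left, assuming the per-instance rate stays constant."""
    return (total - done) * rate_s / 3600


if __name__ == "__main__":
    hours = remaining_hours(TOTAL_INSTANCES, DONE, SECONDS_PER_ITEM)
    print(f"about {hours:.2f} hours remaining")
```

The result is a little over 9 hours, consistent with the `9:04:44` estimate in the log, so the bar is reporting the measured rate rather than a display glitch.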

@vishwa27yvs
Author

vishwa27yvs commented Apr 8, 2024

Update: I ran the same command twice, and both times the process was killed at 2177/2294 instances. The output is below:

2024-04-07 12:11:37,130 WARNING Disabling caching
2024-04-07 12:11:39,756 INFO Found {'train', 'dev', 'test'} splits
Adding text inputs:  95%|████████████████████████████████████████████████████▏  | 2176/2294 [4:33:25<27:47, 14.13s/it]
create_all_files_benchmark.sh: line 2: 2659349 Killed                  python create_text_dataset.py --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-2 --file_source all

I tried it on two different machines and the output is exactly the same (the process is killed at the same instance), so I am not sure whether this is a memory issue or something else. It would be great to know how to resolve this.
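
A bare `Killed` from the shell with no Python traceback generally means the process received SIGKILL, and on Linux one common source is the kernel's out-of-memory killer. One way to test the memory hypothesis is to log peak memory while iterating over the instances. The sketch below is not code from the repository: `with_memory_log` and `peak_rss_mib` are hypothetical helpers, and it uses only the standard-library `resource` module (Unix only).

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def with_memory_log(instances, every=100):
    """Yield each instance unchanged, printing peak memory every `every` items.

    Hypothetical helper: wrap it around whatever loop processes the dataset.
    """
    for i, instance in enumerate(instances, start=1):
        yield instance
        if i % every == 0:
            print(f"{i} instances processed, peak RSS {peak_rss_mib():.0f} MiB",
                  file=sys.stderr)


if __name__ == "__main__":
    # Stand-in workload; replace range(500) with the real instance iterator.
    for _ in with_memory_log(range(500), every=100):
        pass
```

If the reported peak climbs steadily toward the machine's RAM before the kill, memory is the likely cause; a kill at the same instance with flat memory would point to that instance's repository being unusually large or to something else entirely.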

@vishwa27yvs vishwa27yvs changed the title Running create_text_dataset.py takes too long Running create_text_dataset.py gets Killed and takes too long Apr 8, 2024
@VecherVhatuX

Hi @vishwa27yvs,
I explained why this step takes so long in #58 (comment).
Speeding it up would require rewriting the full pipeline.

@john-b-yang john-b-yang added the inference This issue is related to running inference label Apr 15, 2024
@john-b-yang john-b-yang added the in progress We are actively working on this issue. label Jun 17, 2024
@john-b-yang
Member

Tagging @carlosejimenez here to address this.
