
about preprocess_raw_data.py #9

Open
lijiashan2020 opened this issue Apr 12, 2022 · 5 comments

Comments

@lijiashan2020

When I run the command as follows:

python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 1.0

it generates six files in the directory /extendplus/jiashan/equidock_public/src/cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0:

label_test.pkl  ligand_graph_test.bin  receptor_graph_test.bin
label_val.pkl   ligand_graph_val.bin   receptor_graph_val.bin

However, the remaining three files could not be generated, and the run fails with the following error:

Processing  ./cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/label_frac_1.0_train.pkl
Num of pairs in  train  =  39901
Killed

Could you help me solve this problem?
Thanks!

@octavian-ganea
Owner

Generating the full DIPS training data takes a lot of time, and you have to check whether you have enough resources for it. Can you try generating just a fraction of it first, e.g., -data_fraction 0.1?
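Concretely, that would be the original command with only the -data_fraction flag lowered (all other flags unchanged):

python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 0.1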

@lijiashan2020
Author

Thank you for your reply! I can now run the command successfully after modifying the parameters. Thank you very much for your help!

@lizhenping

> Thank you for your reply! I can now run the command successfully after modifying the parameters. Thank you very much for your help!

I ran it with 160 GB of RAM for five hours and it still failed with the same error. It really needs a huge amount of resources.
Marking this here in the hope it is useful for others.

@lizhenping

Marking this: I used 25 CPUs and 400 GB of RAM, and the processing took 15 hours.

@Octopus125

I had the same problem. The main cause is insufficient memory: preprocessing the training data of the DIPS dataset requires a large amount of memory, and I could not complete it in one pass even on a server with 256 GB of memory.

One workaround is to process the data in batches. /DIPS/data/DIPS/interim/pairs-pruned/pairs-postprocessed-train.txt stores all the PDB files waiting to be preprocessed, so you can divide this txt file into several parts, preprocess each part separately, and then merge the generated files together (see the sketch below). I divided the training data into two parts and finished the preprocessing successfully on a server with 256 GB of memory.
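A rough sketch of the split-and-merge approach follows. The part file names, the per-part output directory names, and the assumption that the label .pkl files are plain pickled lists that can simply be concatenated are mine, not from the repo; adjust the paths and names to whatever preprocess_raw_data.py actually produces on your machine.

# split_and_merge.py -- rough sketch, not from the repo; adjust paths/names to your setup
import pickle
from pathlib import Path

import dgl  # the *.bin graph files are DGL files; merging assumes dgl.load_graphs / dgl.save_graphs

# 1) Split the list of PDB pairs into two halves (path as given in the comment above).
pairs_file = Path("DIPS/data/DIPS/interim/pairs-pruned/pairs-postprocessed-train.txt")
lines = pairs_file.read_text().splitlines()
half = len(lines) // 2
for i, part in enumerate([lines[:half], lines[half:]]):
    pairs_file.with_name(f"pairs-postprocessed-train-part{i}.txt").write_text("\n".join(part) + "\n")

# Run preprocess_raw_data.py once per part (for example, by temporarily swapping each
# part file in as pairs-postprocessed-train.txt) and keep each run's output in its own
# directory, e.g. cv_0_part0/ and cv_0_part1/ (hypothetical names).

# 2) Merge the per-part train outputs back into a single set of files.
part_dirs = [Path("cv_0_part0"), Path("cv_0_part1")]
out_dir = Path("cv_0")

# Labels: assuming each label pickle holds a list that can be concatenated directly.
labels = []
for d in part_dirs:
    with open(d / "label_frac_1.0_train.pkl", "rb") as f:
        labels.extend(pickle.load(f))
with open(out_dir / "label_frac_1.0_train.pkl", "wb") as f:
    pickle.dump(labels, f)

# Graphs: dgl.load_graphs returns (list_of_graphs, label_dict); concatenate and re-save.
for name in ["ligand_graph_train.bin", "receptor_graph_train.bin"]:
    graphs = []
    for d in part_dirs:
        g, _ = dgl.load_graphs(str(d / name))
        graphs.extend(g)
    dgl.save_graphs(str(out_dir / name), graphs)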
