How can I create the input features for my own dataset by running the .py file? Is it possible to simplify the input to only the protein sequence? #9

Open
Zkkkkkui opened this issue Sep 30, 2022 · 10 comments

@Zkkkkkui

No description provided.

@FreshAirTonight
Owner

Yes. The input feature generation is similar to AF2. A sample shell script is here for individual sequences.

@Zkkkkkui
Author

Zkkkkkui commented Oct 1, 2022

What about this .py file: run_af2c_fea.py, which is also said to be used to get features?

@FreshAirTonight
Owner

The shell script calls the Python script you mentioned to do the job.

@Zkkkkkui
Author

Zkkkkkui commented Oct 3, 2022

Could you please give an example of running the script on Colab, just to generate the features for a protein sequence from UniProt?

@Zkkkkkui
Author

Hi! I still have a problem getting the features for my dataset. I am not able to run AF2Complex locally, and I am not sure how to run these .sh scripts. I tried taking a features.pkl file from the AlphaFold output after running a protein prediction, but when I used it in this Colab to predict a complex, it always failed like this:
KeyError: 'msa'

CalledProcessError Traceback (most recent call last)
in
39
40 # with io.capture_output() as captured:
---> 41 get_ipython().run_line_magic('shell', 'python -u ../run_af2c_mod.py {pred_params}')
42 print(f'DONE! (predictions available on {FLAGS.output_dir}' )
Could you explain this and help me with it? Thank you!

@FreshAirTonight
Owner

The Colab notebook we provided only takes features.pkl files of individual monomers. If you use another AlphaFold notebook to generate the input features, make sure that you use the monomer pipeline, not the multimer pipeline. Then pack these pickle files into a single tarball.
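Before uploading, it can help to check that each pickle really holds monomer-style features. The following is a minimal sketch, not part of AF2Complex: the helper names `load_features` and `missing_keys` are hypothetical, and the only key it checks by default is `msa`, taken from the `KeyError: 'msa'` reported above. It assumes the pickle holds a dict-like feature mapping.

```python
import gzip
import pickle
from pathlib import Path


def load_features(path):
    """Load a features pickle, using gzip when the file name ends in .gz."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


def missing_keys(features, required=("msa",)):
    """Return the required keys that are absent from the feature mapping.

    The default tuple only contains 'msa' (from the error in this thread);
    extend it if you know which other keys your pipeline needs.
    """
    return [key for key in required if key not in features]
```

Calling `missing_keys(load_features("HgcA/features.pkl.gz"))` and getting an empty list means the `msa` key is present; a non-empty list suggests the file came from a different pipeline.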

In this example, you have a heterodimer HgcAB composed of two monomers, HgcA and HgcB. Organize the feature input as follows:

hgc.tar
├── hgc
│   ├── HgcA
│   │   └── features.pkl.gz
│   └── HgcB
│       └── features.pkl.gz

Then tar this folder into a single tarball and upload it to our notebook. Note that our code can take gzipped pickle files directly. It is up to you whether or not to gzip the pickle files before you make the tarball.
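The layout above can be staged and packed with a few lines of Python. This is a sketch under stated assumptions: `build_feature_tarball` is a hypothetical helper, the input paths point at uncompressed monomer features.pkl files, and gzipping each pickle is optional as noted above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def build_feature_tarball(monomer_pickles, complex_name, work_dir):
    """Stage <complex_name>/<monomer>/features.pkl.gz and tar the folder.

    monomer_pickles maps a monomer name (e.g. "HgcA") to the path of its
    uncompressed features.pkl file. Returns the path of the tarball.
    """
    work_dir = Path(work_dir)
    root = work_dir / complex_name
    for name, pkl_path in monomer_pickles.items():
        target_dir = root / name
        target_dir.mkdir(parents=True, exist_ok=True)
        # Compress each pickle while copying it into place.
        with open(pkl_path, "rb") as src, \
                gzip.open(target_dir / "features.pkl.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
    tar_path = work_dir / f"{complex_name}.tar"
    with tarfile.open(str(tar_path), "w") as tar:
        # arcname keeps the top-level folder name inside the archive.
        tar.add(str(root), arcname=complex_name)
    return tar_path
```

For the example in this thread, the call would look like `build_feature_tarball({"HgcA": "HgcA/features.pkl", "HgcB": "HgcB/features.pkl"}, "hgc", ".")`, where the two input paths are placeholders for wherever your pickles live.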

After you upload the attached tarball, you may run a test to predict a heterodimer using the target syntax: HgcA/HgcB 433 HgcAB
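If you have several complexes, the target line can be assembled programmatically. The sketch below rests on an assumption that this thread does not confirm: that the number in the middle (433 above) is the total residue count of all chains. `format_target` is a hypothetical helper and the chain lengths in the usage note are made up for illustration.

```python
def format_target(chain_lengths, output_name):
    """Build a target line of the form 'A/B <total> <name>'.

    chain_lengths maps monomer names to their sequence lengths, in the
    order the chains should appear. ASSUMPTION: the middle field is the
    sum of those lengths.
    """
    names = "/".join(chain_lengths)
    total = sum(chain_lengths.values())
    return f"{names} {total} {output_name}"
```

For instance, `format_target({"ProtA": 120, "ProtB": 80}, "ProtAB")` yields `ProtA/ProtB 200 ProtAB`; check the result against the project's own target-list examples before relying on it.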

@Zkkkkkui
Author

Thank you for the instructions! I have successfully obtained the features from another AF notebook (it used 0 sequence templates, and I am not sure whether that matters compared to your examples) and ran predictions on some protein complexes. However, the false negative rate seems very high (the proteins were supposed to interact, but the output said they did not). Is there any way to improve that?

@FreshAirTonight
Owner

There are many things to try, such as:

  1. Different DL models if you haven't tried them all, including the monomer DL models
  2. More recycles, between 8 and 20
  3. Add structural templates if possible

@fereidoon27

@FreshAirTonight
In AlphaFold, MSAs are built using jackhmmer and HHblits. To avoid the extensive data downloads and CPU processing, precomputed MSAs and the feature.pkl file can be used instead.

Due to my limited resources, I'm focusing on the GPU-based second step and considering tools like ColabFold to create the feature.pkl files.

What files and steps are needed to create the feature.pkl file? Is there a tool available on Google Colab or Kaggle for this?

@FreshAirTonight
Owner

@fereidoon27 An example feature generation script, run_fea_gen.sh, can be found under the example folder. If you have limited resources, consider using the uniprot mode, in which MSA construction uses only the UniProt library (create dummy files for the other sequence libraries to get around the file checks). With this option, you can generate features for hundreds of protein sequences of moderate length on a decent workstation (e.g., with a 4 TB NVMe drive and 24 cores).

You may use precomputed MSAs as well. Under the intended output folder of a protein, create a subfolder named msas and place the MSA files in it. The only MSA file required in the uniprot mode is uniprot_hits.sto. Add pdb_hits.sto for templates if you need them.
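Placing the precomputed files can be scripted. This is a minimal sketch, assuming that the "intended output folder of a protein" is `<output_dir>/<protein_name>`; `stage_precomputed_msas` is a hypothetical helper, and only the folder name msas and the file names uniprot_hits.sto and pdb_hits.sto come from the description above.

```python
import shutil
from pathlib import Path


def stage_precomputed_msas(output_dir, protein_name, uniprot_sto, pdb_sto=None):
    """Copy precomputed MSA files into <output_dir>/<protein_name>/msas.

    uniprot_sto is required; pdb_sto is optional and only needed when
    structural templates are wanted. Returns the msas directory path.
    """
    msa_dir = Path(output_dir) / protein_name / "msas"
    msa_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(uniprot_sto, msa_dir / "uniprot_hits.sto")
    if pdb_sto is not None:
        shutil.copyfile(pdb_sto, msa_dir / "pdb_hits.sto")
    return msa_dir
```

After staging, the feature generation script can be pointed at the same output folder so that it finds the existing MSA files.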
