How can I create the input features for my own dataset by running the .py file? Is it possible to simplify the input to only the protein sequence? #9

Zkkkkkui · 2022-09-30T10:39:27Z

No description provided.

FreshAirTonight · 2022-09-30T13:59:50Z

Yes. The input feature generation is similar to AF2. A sample shell script is here for individual sequences.

Zkkkkkui · 2022-10-01T11:02:43Z

What about this .py file: run_af2c_fea.py, which is also said to be used to get features?

FreshAirTonight · 2022-10-01T20:16:02Z

The shell script calls the Python script you referred to (run_af2c_fea.py) to do the job.

Zkkkkkui · 2022-10-03T09:51:28Z

Could you please give an example of running the script on Colab, just to generate the features of a single protein sequence from UniProt?

Zkkkkkui · 2022-10-10T22:06:27Z

Hi! I still have problems getting the features for my dataset. I am not able to run AF2Complex locally, and I am not sure how to run these .sh scripts. I tried taking a features.pkl file from the AlphaFold output after running a protein prediction, but when I used it in this Colab to predict a complex, it always failed like this:
KeyError: 'msa'

CalledProcessError Traceback (most recent call last)
in
39
40 # with io.capture_output() as captured:
---> 41 get_ipython().run_line_magic('shell', 'python -u ../run_af2c_mod.py {pred_params}')
42 print(f'DONE! (predictions available on {FLAGS.output_dir}' )
Could you explain this and help me with it? Thank you!

FreshAirTonight · 2022-10-11T00:53:01Z

The Colab notebook we provided only takes features.pkl files of individual monomers. If you use another AlphaFold notebook to generate input features, make sure that you use the monomer pipeline, not the multimer pipeline, to generate the input. Then tar these pickle files into a single tarball.
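A quick way to investigate a `KeyError: 'msa'` before uploading is to open the pickle and look at its keys. The following is a minimal sketch, not part of AF2Complex: the helper names `load_features` and `missing_keys` are hypothetical, and it assumes the file holds a plain Python dict of features (unpickling real feature files also requires NumPy to be installed).

```python
import gzip
import pickle
from pathlib import Path


def load_features(path):
    """Load a features pickle, transparently handling a .gz suffix."""
    path = Path(path)
    # Assumption: gzipped files are recognizable by their .gz extension.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


def missing_keys(features, required=("msa",)):
    """Return the required keys that are absent from the feature dict."""
    return [key for key in required if key not in features]
```

For example, `missing_keys(load_features("HgcA/features.pkl.gz"))` returns an empty list when the `msa` entry is present, and `sorted(load_features(...))` lists every key so the file can be compared against one of the provided examples.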

In this example, you have a heterodimer HgcAB composed of two monomers, HgcA and HgcB. Organize the feature input as follows:

hgc.tar
├── hgc
│   ├── HgcA
│   │   └── features.pkl.gz
│   └── HgcB
│       └── features.pkl.gz

Then tar this folder into a single tarball and upload it to our notebook. Note that our code can take gzipped pickle files directly. It is up to you whether or not to gzip the pickle files before you make the tarball.
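The packing step can be sketched with Python's standard `tarfile` module. This is only an illustration under stated assumptions: the function name `make_feature_tarball` is hypothetical, and it assumes the folder already follows the layout above, with one subfolder per monomer holding `features.pkl` or `features.pkl.gz`.

```python
import tarfile
from pathlib import Path


def make_feature_tarball(root_dir, tar_path):
    """Pack <root>/<monomer>/features.pkl[.gz] into an uncompressed tarball."""
    root = Path(root_dir).resolve()
    # Collect the per-monomer feature files so an empty folder is caught early.
    found = sorted(p for p in root.glob("*/features.pkl*") if p.is_file())
    if not found:
        raise FileNotFoundError(f"no features.pkl files found under {root}")
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps the top-level folder name (e.g. "hgc") inside the archive.
        tar.add(root, arcname=root.name)
    return [f"{root.name}/{p.relative_to(root).as_posix()}" for p in found]
```

Calling `make_feature_tarball("hgc", "hgc.tar")` on the layout above would return the two archived feature paths, which makes it easy to confirm that both monomers were included before uploading.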

After you upload the attached tarball, you may run a test to predict a heterodimer using the target syntax: HgcA/HgcB 433 HgcAB

Zkkkkkui · 2022-10-13T12:43:15Z

Thank you for the instructions! I successfully generated the features with another AF notebook (it used 0 sequence templates, and I am not sure whether that matters compared to your examples) and ran predictions on some protein complexes. However, the false negative rate seems very high (proteins that are supposed to interact were not predicted to interact). Is there any way to improve that?

FreshAirTonight · 2022-10-14T02:00:49Z

Many things to try, such as:

Different DL models, if you haven't tried them all, including the monomer DL models
More recycles, between 8 and 20
Structural templates, if possible

fereidoon27 · 2024-06-22T05:31:29Z

@FreshAirTonight
In AlphaFold, MSAs are built using jackhmmer and HHblits. To avoid the extensive data downloads and CPU processing, precomputed MSAs and the feature.pkl file can be used instead.

Due to my limited resources, I'm focusing on the GPU-based second step and considering tools like ColabFold to create the feature.pkl files.

What files and steps are needed to create the feature.pkl file? Is there a tool available on Google Colab or Kaggle for this?

FreshAirTonight · 2024-06-22T13:52:41Z

@fereidoon27 An example feature generation script, run_fea_gen.sh, can be found under the example folder. If you have limited resources, consider using the uniprot mode, in which MSA construction uses only the UniProt library (create dummy files for the other sequence libraries to get around the file checks). With this option, you can generate features for hundreds of protein sequences of moderate length on a decent workstation (e.g., with a 4 TB NVMe drive and 24 cores).

You may use precomputed MSAs as well. Under the intended output folder of a protein, create a subfolder named msas and place the MSA files in it. The only MSA file required in the uniprot mode is uniprot_hits.sto. Add pdb_hits.sto for templates if you need them.
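The folder arrangement described above can be sketched as follows. The function name `stage_precomputed_msas` and its parameters are hypothetical, and the sketch assumes that a protein's output folder is `<output_dir>/<protein>`; only the subfolder name `msas` and the file names `uniprot_hits.sto` and `pdb_hits.sto` come from the comment above.

```python
import shutil
from pathlib import Path


def stage_precomputed_msas(output_dir, protein, uniprot_sto, pdb_sto=None):
    """Copy precomputed MSA files into <output_dir>/<protein>/msas."""
    # Assumption: each protein has its own output folder named after it.
    msa_dir = Path(output_dir) / protein / "msas"
    msa_dir.mkdir(parents=True, exist_ok=True)
    # uniprot_hits.sto is the only MSA required in the uniprot mode.
    shutil.copyfile(uniprot_sto, msa_dir / "uniprot_hits.sto")
    # pdb_hits.sto is optional and only needed when templates are wanted.
    if pdb_sto is not None:
        shutil.copyfile(pdb_sto, msa_dir / "pdb_hits.sto")
    return msa_dir
```

After staging the files this way, the feature generation script would be pointed at the same output folder; how it picks up the existing MSAs is handled by the script itself and is not shown here.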
