Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SignInt & FTP fix #1989

Merged
merged 34 commits into from
Dec 2, 2024
Merged

SignInt & FTP fix #1989

merged 34 commits into from
Dec 2, 2024

Conversation

karolamik13
Copy link
Contributor

@karolamik13 karolamik13 commented Nov 12, 2024

In this code, we have the Signature Interactions pipeline with Foldseek analysis (provided by Anupam Banerjee) together with the merged pull request #1988 by James Krieger.

I created the following functions as a pipeline of SignInt analysis based on Anupam's code:
(1) runFoldseek() - this function processes a PDB file to extract a specified chain's sequence, searches for
homologous structures using foldseek, and prepares alignment outputs for further analysis

Before using the function, Foldseek needs to be installed, and the PDB database for screening needs to be downloaded:
More information can be found:
https://github.com/steineggerlab/foldseek?tab=readme-ov-file#databasesand
This function will not work under Windows due to the Unix commands (cat, awk, etc.).
Perhaps in the future, it can be adapted to Windows machines.
This command uses createFoldseekAlignment() and extractMultiModelPDB().

(2) createFoldseekAlignment() - Aligns sequences from prot_seq with homologous sequences identified in prot_foldseek,
generating a multiple sequence alignment.

(3) extractMultiModelPDB() - Extracts individual PDB models from multimodel PDB and places them into the pointed directory.

(4) calcSignatureInteractions() - Analyzes protein structures to identify various interactions using InSty.

How to use the code, for example:

  1. Install Feedseek and download the database as explained above (in this example, it is downloaded to ~/Downloads)
  2. In Python or IPython:

from prody import *
runFoldseek('5kqm.pdb', 'A', database_folder='~/Downloads/pdb')
calcSignatureInteractions('shortlisted_resind.msa', './struc_homologs')

Default names for the results are shortlisted_resind.msa, and all homologous PDB structures will be extracted to folder struc_homologs in the working directory by runFoldseek(). Next, calcSignatureInteractions() will analyze the aligned 3D structures of homologous and provide InSty results (including raw data and automatically generated plots).

The folder will contain such files (PNGs, txt):
aligned_structures_extended.pdb
aligned_structures.pdb
HydrogenBonds_consensus.txt
HydrogenBonds_plot_part10.png
HydrogenBonds_plot_part11.png
HydrogenBonds_plot_part12.png
HydrogenBonds_plot_part13.png
HydrogenBonds_plot_part14.png
HydrogenBonds_plot_part15.png
HydrogenBonds_plot_part16.png
HydrogenBonds_plot_part17.png
HydrogenBonds_plot_part18.png
HydrogenBonds_plot_part19.png
HydrogenBonds_plot_part1.png
HydrogenBonds_plot_part20.png
HydrogenBonds_plot_part2.png
HydrogenBonds_plot_part3.png
HydrogenBonds_plot_part4.png
HydrogenBonds_plot_part5.png
HydrogenBonds_plot_part6.png
HydrogenBonds_plot_part7.png
HydrogenBonds_plot_part8.png
HydrogenBonds_plot_part9.png
Hydrophobic_consensus.txt
Hydrophobic_plot_part10.png
Hydrophobic_plot_part11.png
Hydrophobic_plot_part1.png
Hydrophobic_plot_part2.png
Hydrophobic_plot_part3.png
Hydrophobic_plot_part4.png
Hydrophobic_plot_part5.png
Hydrophobic_plot_part6.png
Hydrophobic_plot_part7.png
Hydrophobic_plot_part8.png
Hydrophobic_plot_part9.png
PiCation_consensus.txt
PiCation_plot_part1.png
PiStacking_consensus.txt
PiStacking_plot_part1.png
prot.foldseek
prot_struc.msa
RepulsiveIonicBonding_consensus.txt
RepulsiveIonicBonding_plot_part1.png
SaltBridges_consensus.txt
SaltBridges_plot_part1.png
SaltBridges_plot_part2.png
seq_match_reconfirm.txt
shortlisted.foldseek
shortlisted.msa
shortlisted_resind.msa
struc_homologs

"""Remove empty strings from a list."""
return [elem for elem in row if elem != '']

def log_message(message, level="INFO"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have lots of prody things to do this functionality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one would you suggest? I used here LOGGER.info (below). Should I just use any particular function?

def log_message(message, level="INFO"):
"""Log a message with a specified log level."""
LOGGER.info("[{}] {}".format(level, message))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that doesn’t work but we’ll need to check


# Run foldseek and other commands
subprocess.run([
'foldseek', 'easy-search', 'inp.pdb', database_folder, 'prot.foldseek',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably check that that the database_folder exists before running foldseek and having it fail if it doesn't

Also, how are we handling errors from all these subprocess commands?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that, but I don't know which problems we might have with the subprocess. Maybe Anupam can say something more about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesmkrieger and @karolamik13 I just discussed with Anupam about this PR. I will test and make sure that a few checks are in place for these subprocesses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Anthony!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Great. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesmkrieger and @karolamik13 I have tested this with ADK and everything is working as expected (except one small thing with the plotting, which I am debugging now). I added in some checks to make sure the subprocesses are covered. Once I am done, I will open a PR for your branch, Karolina and you can have a look at my changes. There was a bug I found and fixes, so that change is necessary to merge in, the other checks are helpful but not essential.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Great!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@karolamik13
Copy link
Contributor Author

I made a typo in the commits. It should be: 3to1 letter code is now replaced by AAMAP

Copy link
Contributor

@jamesmkrieger jamesmkrieger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn’t seem to be any return line?

@karolamik13
Copy link
Contributor Author

Are you referring to merging the main ProDy branch? I wanted to update it to be sure that everything is working fine.

@jamesmkrieger
Copy link
Contributor

Are you referring to merging the main ProDy branch? I wanted to update it to be sure that everything is working fine.

I meant the runDali commit

@karolamik13
Copy link
Contributor Author

Are you referring to merging the main ProDy branch? I wanted to update it to be sure that everything is working fine.

I meant the runDali commit

Ok. In such a case, I don't understand :) I added runDali here in this pull request and not in a separate branch because, we have runFoldseek here as well, and I will have to change the calcSignatureInteractions to include Dali.

@jamesmkrieger
Copy link
Contributor

I meant there’s no object returned by runDali, but I guess that’s ok as it produces files

I’d also suggest calling these functions something more specific, especially for Dali where we already have classes and functions already

@karolamik13
Copy link
Contributor Author

@jamesmkrieger and @atbogetti, now we have runFoldseek, runDali, and runBLAST.
All three to prepare a folder with homolog structures (by sequence or spatial similarity)
Preparation is the following: reach the database, obtain information about PDB codes, filter (if applied), extract particular chains from PDBs, add hydrogens and missing side chains using fixer (pdbfixer or openbabel), and finally align them.
Foldseek creates additional things, while Dali and BLAST create only folders with prepared files.

I will add some improvements to runDali and runBLAST, for example, the optional removal of temporary files and an option to provide Dali or BLAST input files instead of reaching the servers. I also plan to modify calcSignatureInteractions to include Dali and BLAST analysis instead Foldseek. MSA file will be optional. Unless we would like to include your script, James, with extracting sequences? It can be used together with runDali and runBLAST to create an MSA file. What do you think?

@jamesmkrieger
Copy link
Contributor

Dali and BLAST do give alignments/mappings to the reference so we could get a full MSA directly possibly, but I don't know for sure. Using buildPDBEnsemble should also work for Dali, but I don't know if we can use the info from BLAST.

@karolamik13
Copy link
Contributor Author

I have now changed the calcSignatureInteractions() function to include folder analysis only, without the MSA file.
The code needs further reorganization and changes in process_data(). I would remove the fixer and structural alignment directly to runFoldseek() as I did for runDali and runBLAST, but I don't want to modify Anupam's code completely. Currently, calcSignatureInteractions(), when receiving a folder - will compute all interactions for all models in the particular folder (we are giving the path) OR when we are providing a folder and MSA file also, fixer, structural alignment, etc., are performed.

We can discuss in the upcoming days what would be the best.

@jamesmkrieger
Copy link
Contributor

I'd suggest having a wrapper function that includes the fixer, structural superposition, etc. and runs the method specific code after.

I guess you're using a match chains function that takes an MSA as an optional argument, so the user can provide it if they trust it and use some other ProDy method for sequence alignment otherwise?

There's a lot of code here so I haven't managed to look at it all yet. Hopefully I'll have a chance in the second half of next week

@karolamik13
Copy link
Contributor Author

I reorganized runFoldseek() and calcSignatureInteractions().
We still need to modify process_data(), which is used when except the folder we are also providing the MSA file in calcSignatureInteractions(). I might need help with that. If we do not modify it, interactions will be calculated again. But I am not sure whether there is an easy way to just remove the part that re-calculates interactions because bonds (the name of the variable) are then used to read indices for MSA.

Now, to initiate calculations, we can use:

runBLAST('1OL5', 'A', seqid=90)
runDali('5kqm', 'A')
runFoldseek('5kqm.pdb', 'A', database_folder='/home/user_name/Downloads/foldseek/pdb')
[foldseek need to be installed and PDB database need to download into '/home/user_name/Downloads/foldseek/' folder.

Each function will create the folder with downloaded, fixed (by adding missing atoms), and aligned structures.
Next, we are using:

calcSignatureInteractions('struc_homologs')
where 'struc_homologs' is the name of a folder (or full pathway) where runDali/runBlast/runFoldseek downloaded all the structures.
or if we have Foldseek data:
calcSignatureInteractions('struc_homologs',mapping_file='shortlisted.msa')

@karolamik13
Copy link
Contributor Author

I made pretty significant improvements in process_data(), which is a party of calcSignatureInteractions(), and in calcSignatureInteractions(), too, to properly include the MSA file. Unfortunately, if we have an MSA file, some information is obtained from computed interactions. And those are included to generate conservation plots. Therefore, interactions cannot be computed in a similar way as for the folder analysis.

Also, I noticed that in the comment above, I gave the wrong MSA file. That one is (by default) correct:

calcSignatureInteractions('./struc_homologs', mapping_file='shortlisted_resind.msa')

@karolamik13
Copy link
Contributor Author

@atbogetti and @jamesmkrieger, the code is ready for me. Feel free to test it on different structures.
runBLAST, runDali, runFoldseek and than calcSigantureInteractions as shown above.

@atbogetti
Copy link
Member

@karolamik13 I have approved this PR. I tested your most recent changes on PDB=1AKE and it's working well.

@karolamik13
Copy link
Contributor Author

Great. Thank you!

@karolamik13 karolamik13 merged commit f9d6716 into prody:main Dec 2, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants