Tutorial
Installation and Configuration steps must be completed before this part.
This tutorial describes how to run ASMC on a family of homologous proteins, named Amine Dehydrogenases (AmDHs), when the active site residues are known (cf. ASMC with user-refined pocket).
A directory named tutorial/ is available at ASMC/docs/ and contains the following input files:
- ADH4.pdb: PDB ID 6G1M, chain B.
- DH35.pdb: PDB ID 6IAU, chain B.
- DHP6.pdb: PDB ID 6IAQ, chain A.
- MATA.pdb: PDB ID 7ZBO, chain A.
- pocket.csv: list of amino acid residues considered as part of the active site.
- sequences.fasta: a set of 954 protein sequences in FASTA format (950 AmDHs + 4 reference AmDHs).
The last file required is reference_file, which must be written as follows, replacing <path_to_ASMC> with the path to where the ASMC repository was downloaded:
<path_to_ASMC>/ASMC/docs/tutorial/ADH4.pdb
<path_to_ASMC>/ASMC/docs/tutorial/DH35.pdb
<path_to_ASMC>/ASMC/docs/tutorial/DHP6.pdb
<path_to_ASMC>/ASMC/docs/tutorial/MATA.pdb
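The reference file can also be generated instead of typed by hand. The following is a minimal sketch in Python 3; the function name `write_reference_file` and its arguments are illustrative and are not part of ASMC. It only assumes the one-path-per-line layout shown above.

```python
from pathlib import Path

# The four reference structures shipped in ASMC/docs/tutorial/.
REFERENCE_PDBS = ["ADH4.pdb", "DH35.pdb", "DHP6.pdb", "MATA.pdb"]


def write_reference_file(path_to_asmc, output="reference_file"):
    """Write one PDB path per line, following the layout shown above.

    `path_to_asmc` stands for the <path_to_ASMC> placeholder, i.e. the
    directory into which the ASMC repository was downloaded.
    (Hypothetical helper, not provided by ASMC.)
    """
    tutorial_dir = Path(path_to_asmc) / "ASMC" / "docs" / "tutorial"
    lines = [str(tutorial_dir / name) for name in REFERENCE_PDBS]
    Path(output).write_text("\n".join(lines) + "\n")
    return lines
```

Calling `write_reference_file("<path_to_ASMC>")` with the real download location creates reference_file in the current directory.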
NB: if the active site is unknown, please consider this section.
Change the working directory to <path_to_ASMC>/ASMC/docs/tutorial/ and run ASMC, passing reference_file, pocket.csv and sequences.fasta with the -r, -p and -s options, respectively.
cd <path_to_ASMC>/ASMC/docs/tutorial/
asmc run --log run_asmc.log --threads 6 -r reference_file -p pocket.csv -s sequences.fasta
The whole process can be monitored by checking the run_asmc.log file.
Once completed, the following output files are available:
- models.txt: list of models generated by MODELLER.
- identity_targets_refs.tsv: identity percentage between each protein sequence and its reference.
- active_site_alignment.fasta: active site sequences for each protein, in FASTA format.
- groups_0.12_min_5.tsv: clustering computed by DBSCAN, here with eps=0.12 and min_samples=5 (values computed automatically when not provided by the user).
- GX.fasta: one FASTA file per DBSCAN group, here with X = [-1:3].
- groups_logo.png: sequence logos for all DBSCAN groups, with the number of sequences per group indicated in the bottom right-hand corner.
- models/: directory containing all the 3D models listed in models.txt.
- pairwise/: directory containing all the pairwise structural alignments, computed by US-align, in FASTA format.
- superposition/: directory containing all the PDB models, each with its 3D coordinates aligned on its reference.
Proteins belonging to group -1 (G-1.fasta) must be considered "outliers", since DBSCAN was unable to assign them to a cluster of at least 5 members. This does not mean that these proteins are uninteresting, and users are advised to examine them.
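To see how many sequences ended up in each group, including the outliers, the FASTA headers of the GX.fasta files can be counted. This is a minimal sketch; the helpers `count_fasta_records` and `group_sizes` are hypothetical and not part of ASMC, and the code only assumes the G<label>.fasta naming listed above.

```python
import re
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def group_sizes(output_dir="."):
    """Map each DBSCAN group label to its number of sequences.

    Assumes group files are named G<label>.fasta (G-1.fasta, G0.fasta, ...).
    Files that do not match this pattern are ignored.
    """
    sizes = {}
    for fasta in Path(output_dir).glob("G*.fasta"):
        match = re.fullmatch(r"G(-?\d+)\.fasta", fasta.name)
        if match:
            sizes[int(match.group(1))] = count_fasta_records(fasta)
    return dict(sorted(sizes.items()))
```

Run from the output directory, `group_sizes()` returns a dictionary keyed by group label, so `group_sizes().get(-1, 0)` gives the number of outliers.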
If a group is large enough, users can try to generate sub-clusters using the Re-Clustering procedure, by adjusting the --eps parameter.
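The effect of eps on a group can be illustrated with a toy example. The sketch below is a simplified, pure-Python DBSCAN written for illustration only: it is not ASMC's implementation, and the 1-D points and distance function are made up. It shows one group at eps=0.12 splitting into two sub-clusters at a lower eps, while an isolated point keeps the label -1.

```python
def dbscan(points, eps, min_samples, dist):
    """Toy DBSCAN: returns one label per point (0, 1, ... or -1 for noise)."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # only core points keep expanding
    return labels


# Made-up data: two tight subgroups 0.10 apart, plus one isolated point.
points = [0.00, 0.01, 0.02, 0.03, 0.04,
          0.14, 0.15, 0.16, 0.17, 0.18,
          0.60]


def distance(a, b):
    return abs(a - b)


wide = dbscan(points, eps=0.12, min_samples=5, dist=distance)
narrow = dbscan(points, eps=0.05, min_samples=5, dist=distance)
print(wide)    # one cluster of ten points, last point is noise
print(narrow)  # two clusters of five points, last point is noise
```

On real data the appropriate eps depends on the distances between the sequences of the group being re-clustered, so several values may need to be tried.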
Several Python scripts were designed to further analyze ASMC clusters; see the section How to deal with ASMC outputs for details.