File Format
For structure file names, replace all characters such as |, : and . with _.
For sequence IDs in FASTA files, you can use any ID, but all |, : and . characters will be automatically replaced by _. However, for UniProt headers like sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName, the UniqueIdentifier is extracted and used as the sequence ID.
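The ID rules above can be sketched in Python as follows. This is only an illustration and not ASMC's own code: the function name normalize_sequence_id is hypothetical, and it is an assumption that the whole header line (minus the leading >) is treated as the ID.

```python
import re


def normalize_sequence_id(header):
    """Return a sequence ID following the rules described above.

    Hypothetical helper for illustration, not part of ASMC.
    """
    header = header.lstrip(">").strip()
    # UniProt-style header: sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName
    match = re.match(r"^(?:sp|tr)\|([^|]+)\|", header)
    if match:
        return match.group(1)
    # Any other header: replace every |, : and . with _
    return re.sub(r"[|:.]", "_", header)


uniprot_id = normalize_sequence_id(">sp|UniqueIdentifier|EntryName")
custom_id = normalize_sequence_id(">my.seq:1|x")
```

Renaming structure files with the same substitution keeps file names and sequence IDs consistent.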
Input of:
- subcommand run with the option -r/--ref
- subcommand identity with the option -r/--ref-str
Mandatory to run ASMC based on structures.
This file contains one reference structure path per line, e.g.:
/home/User/data/RefA.pdb
/home/User/data/RefZ.pdb
Input of:
- subcommand run with the option -p/--pocket
Output of:
- subcommand run if the option -p/--pocket isn't provided
File used to indicate the active site positions. The format is ID,Chain,pos..., e.g.:
RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198
If not provided, this file will be built by ASMC.
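As a minimal sketch of how such a file could be read in Python, assuming one comma-separated line per reference exactly as in the example above (the names parse_pocket_line and pockets are hypothetical and not part of ASMC):

```python
def parse_pocket_line(line):
    """Split one 'ID,Chain,pos...' line into (ID, chain, list of positions)."""
    ref_id, chain, *positions = line.strip().split(",")
    return ref_id, chain, [int(p) for p in positions]


example = """\
RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198
"""

# Map each reference ID to its chain and active site positions
pockets = {}
for line in example.splitlines():
    if line.strip():
        ref_id, chain, positions = parse_pocket_line(line)
        pockets[ref_id] = (chain, positions)
```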
Input of:
- subcommand run with the option -m/--models
- subcommand identity with the option -r/--ref-str*
*only the first column is necessary.
Output of:
- subcommand run if the option -s/--seqs is provided
File built by ASMC if the modelling step is performed. Otherwise, you must build the file yourself and provide it to the subcommand run, e.g.:
/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb
/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb
Input of:
- subcommand run with the option -a/--active-sites
Output of:
- subcommand run if the option -a/--active-sites isn't provided
- subcommand run also returns a FASTA file for each group, which can be used with the option -a/--active-sites
This is simply a FASTA file containing all active site sequences to be clustered, e.g.:
>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA
Input of:
- subcommand run with the option -M/--msa
If only one reference is used, the file should contain two pieces of information:
- The active site positions in the reference sequence
- The path to the multiple sequence alignment
refA,55,57,59,77,101,102,129,130,131,145,148
/home/User/data/multiple_sequence_alignment.fasta
If there are multiple references, the file must contain 1) the pocket positions of each reference and 2) the path to a file similar to identity_targets_refs.tsv (see below), in addition to the path to the multiple sequence alignment:
refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
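To make the layout concrete, here is a hedged Python sketch that separates the pocket lines from the path lines. It assumes that a pocket line is an ID followed by comma-separated integers and that every other non-empty line is a path; the function name parse_msa_info is hypothetical, and ASMC's own parser may behave differently.

```python
def parse_msa_info(text):
    """Return (pockets, paths) read from the content of an -M/--msa file.

    Assumption: pocket lines look like 'ID,pos,pos,...' and all other
    non-empty lines are file paths, kept in their original order.
    """
    pockets = {}
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) > 1 and all(f.isdigit() for f in fields[1:]):
            pockets[fields[0]] = [int(f) for f in fields[1:]]
        else:
            paths.append(line)
    return pockets, paths


example = """\
refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
"""

pockets, paths = parse_msa_info(example)
```

With a single reference, the same function would return one pocket entry and one path.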
Input of:
- subcommand identity with the option -R/--ref-seq
This file contains one reference ID per line, e.g.:
RefA
RefZ
Input of:
- subcommand compare with the options -f1 and -f2
- subcommand to_xlsx with the option -f/--file
Output of:
- subcommand run
The x corresponds to the -e/--eps value of the subcommand run. By default, the value is auto, meaning it is chosen automatically before clustering, based on the distribution of normalised distances.
The y corresponds to the --min-samples value of the subcommand run. By default, the value is auto, meaning 5 if the number of samples is ≤ 1500 and 25 otherwise.
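The default rule for --min-samples can be written as a short Python sketch. The function name resolve_min_samples is hypothetical and only mirrors the rule stated above; it is not taken from the ASMC source.

```python
def resolve_min_samples(value, n_samples):
    """Return the effective min-samples value.

    'auto' follows the rule described above: 5 for up to 1500 samples,
    25 for more. Any other value is used as given.
    """
    if value == "auto":
        return 5 if n_samples <= 1500 else 25
    return int(value)


small_dataset = resolve_min_samples("auto", 1500)
large_dataset = resolve_min_samples("auto", 1501)
explicit = resolve_min_samples("10", 300)
```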
The format is ID Active_site_sequence Group_id, e.g.:
ID1 ACQGINFIRVDYEIHIGMGGT -1
ID2 SAEGINLMRNSFVQHVGHQGT 0
ID3 SAEGINFVRNSFVQHVGHQGT 0
ID4 SCEGVNFVRVDRLVHVGLIGT 1
ID5 SCEGVNFIRVDRLVHVGLIGT 1
Note: The group numbering starts at 0, and -1 is the ID for outliers.
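A small Python sketch of how this file could be split into groups and outliers is given below. It assumes whitespace-separated columns as in the example above, and the names group_members and outliers are hypothetical.

```python
from collections import defaultdict


def group_members(text):
    """Map each group ID to the list of sequence IDs it contains."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        seq_id, _sequence, group_id = line.split()
        groups[int(group_id)].append(seq_id)
    return dict(groups)


example = """\
ID1 ACQGINFIRVDYEIHIGMGGT -1
ID2 SAEGINLMRNSFVQHVGHQGT 0
ID3 SAEGINFVRNSFVQHVGHQGT 0
ID4 SCEGVNFVRVDRLVHVGLIGT 1
ID5 SCEGVNFIRVDRLVHVGLIGT 1
"""

groups = group_members(example)
# Group -1 holds the outliers, so it is set aside from the real groups
outliers = groups.pop(-1, [])
```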
Input of:
- subcommand run, as a path in a file to provide to the option -M/--msa (see above)
- subcommand compare with the option -id
Output of:
- subcommand run if the option -s/--seqs is provided (homology modelling performed by ASMC with MODELLER)
- subcommand identity
Output example for the subcommand identity:
id1 refA 62.50
id2 refA 68.75
id3 refZ 68.75
id4 refZ 50.00
id5 refZ 62.50
Note: the identity_targets_refs.tsv returned by the subcommand run has 4 columns; the last column contains the value of the --id option.
Output of:
- subcommand compare
The format of this file is: ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ, e.g.:
ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ
ID22 0 FGSNLGCYEVFMYP 0 FGSNLGCYEVFMYP 0 REFC 16.81 LPSQLDWYEVMEYP
ID45 0 ILSKVAWFEVFVPG -1 ILS-VAWFEAVIYP 5 REFB 18.14 VLSAAAWYEIIVYP
ID48 0 VGSEVTWYESAMYP 0 VGSSVTWYESAMYP 1 REFD 26.85 LGSQVTWYEIIIYP
ID61 0 IASQMGWYEAIIYP 0 IASQMGWYEAIIYP 0 REFB 39.82 VLSAAAWYEIIVYP
ID67 0 ILSAAAWYEIIVYP 0 ILSAAAWYEIIVYP 0 REFB 51.77 VLSAAAWYEIIVYP
Note: The values in the G1 and G2 columns are just the IDs of the groups in their respective runs. Two 0 values do not mean that the two groups have identical members. However, multiple runs with the same parameters on the same active site alignment always return the same clusters.
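As an illustration, the following Python sketch reads the table and lists the IDs with a non-zero DIFF value. It assumes whitespace-separated columns with a header line, as in the example above; the function name read_compare_table is hypothetical. Group IDs are deliberately not compared, for the reason given in the note.

```python
def read_compare_table(text):
    """Return one dict per data row, keyed by the header column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


example = """\
ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ
ID22 0 FGSNLGCYEVFMYP 0 FGSNLGCYEVFMYP 0 REFC 16.81 LPSQLDWYEVMEYP
ID45 0 ILSKVAWFEVFVPG -1 ILS-VAWFEAVIYP 5 REFB 18.14 VLSAAAWYEIIVYP
ID48 0 VGSEVTWYESAMYP 0 VGSSVTWYESAMYP 1 REFD 26.85 LGSQVTWYEIIIYP
"""

rows = read_compare_table(example)
# Keep only the entries whose DIFF column is greater than zero
changed = [row["ID"] for row in rows if int(row["DIFF"]) > 0]
```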