File Format

Name convention

For structure file names, replace characters such as |, : and . with _.

For sequence IDs in a FASTA file, any ID can be used, but all |, : and . characters are automatically replaced by _. However, for UniProt headers such as sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName, the UniqueIdentifier is extracted and used as the sequence ID.
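
The convention above can be sketched in a few lines of Python. This is only an illustration, not ASMC's own code: the function name sanitize_id is hypothetical, and it assumes that it receives the ID part of the header only (no trailing description for non-UniProt headers).

```python
import re

# Characters that the naming convention replaces with "_".
_FORBIDDEN = re.compile(r"[|:.]")


def sanitize_id(header: str) -> str:
    """Hypothetical helper mirroring the naming convention described above."""
    header = header.lstrip(">").strip()
    parts = header.split("|")
    # UniProt-style header: sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    # Any other ID: replace |, : and . with _
    return _FORBIDDEN.sub("_", header)


uniprot_id = sanitize_id("sp|P12345|NAME_HUMAN")
custom_id = sanitize_id("my:seq.1|x")
```

The same substitution can be applied to a structure file name (without its extension) before passing it to ASMC.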

Files

Reference structure file

Input of:

subcommand run with the option -r/--ref
subcommand identity with the option -r/--ref-str

Mandatory to run ASMC based on structures.

This file contains one reference structure path per line, e.g.:

/home/User/data/RefA.pdb
/home/User/data/RefZ.pdb

Pocket csv file

Input of:

subcommand run with the option -p/--pocket

Output of:

subcommand run if the option -p/--pocket isn't provided

This file indicates the active site positions. The format is ID,Chain,pos..., e.g.:

RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198

If not provided, this file will be built by ASMC.
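
As an illustration of the ID,Chain,pos... layout, the following sketch reads a pocket file into a dictionary. The function name read_pocket_csv is hypothetical and not part of ASMC; it assumes one reference per line and integer positions.

```python
import csv
import io


def read_pocket_csv(handle):
    """Map each reference ID to (chain, list of active site positions)."""
    pockets = {}
    for row in csv.reader(handle):
        if not row:  # skip empty lines
            continue
        ref_id, chain, *positions = row
        pockets[ref_id] = (chain, [int(p) for p in positions])
    return pockets


# In-memory example using the two lines shown above.
example = io.StringIO(
    "RefA,A,55,57,59,77,101,102,129,130,131,145,148\n"
    "RefZ,A,89,91,93,118,142,143,170,171,172,197,198\n"
)
pockets = read_pocket_csv(example)
```

With a real file, pass an open file object, for example open("pocket.csv", newline=""), where the file name is a placeholder.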

Models file

Input of:

subcommand run with the option -m/--models
subcommand identity with the option -r/--ref-str*

*only the first column is necessary.

Output of:

subcommand run if the option -s/--seqs is provided

This file is built by ASMC if the modelling step is performed. Otherwise, you must build it yourself and provide it to the subcommand run, e.g.:

/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb
/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb
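
A minimal sketch for reading this file is given below. It assumes the two columns are separated by whitespace (a space or a tab) and that paths contain no spaces; the function name read_models_file is hypothetical. Lines with only one column are accepted, since the subcommand identity only needs the first one.

```python
def read_models_file(lines):
    """Return a list of (model_path, reference_path or None) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:  # skip empty lines
            continue
        model = fields[0]
        reference = fields[1] if len(fields) > 1 else None
        pairs.append((model, reference))
    return pairs


example = [
    "/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb",
    "/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb",
]
pairs = read_models_file(example)
```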

Active sites alignment

Input of:

subcommand run with the option -a/--active-sites

Output of:

subcommand run if the option -a/--active-sites isn't provided
subcommand run also returns a fasta file for each group which can be used with the option -a/--active-sites

This is simply a FASTA file containing all the active site sequences to be clustered, e.g.:

>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA

Input for --msa options

Input of:

subcommand run with the option -M/--msa

If only one reference is used, the file should contain two pieces of information:

The active site positions in the reference sequence
The path to the multiple sequence alignment

refA,55,57,59,77,101,102,129,130,131,145,148
/home/User/data/multiple_sequence_alignment.fasta

If there are multiple references, the file must contain, in addition to the path to the multiple sequence alignment, 1) the pocket positions of each reference and 2) the path to a file similar to identity_targets_refs.tsv (see below):

refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
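
The two layouts can be told apart with a small reader such as the sketch below. It is not ASMC's parser: the function name read_msa_input is hypothetical, and it assumes that any line containing a comma is a reference line (ID followed by positions, without a chain column) and that every other non-empty line is a path, in the order shown in the examples.

```python
def read_msa_input(lines):
    """Split the file content into reference pockets and file paths."""
    pockets = {}
    paths = []
    for line in lines:
        line = line.strip()
        if not line:  # skip empty lines
            continue
        if "," in line:
            # Reference line: ID,pos,pos,...
            ref_id, *positions = line.split(",")
            pockets[ref_id] = [int(p) for p in positions]
        else:
            # Path line: identity file (multiple references only) or MSA
            paths.append(line)
    return pockets, paths


example = [
    "refA,55,57,59,77,101,102,129,130,131,145,148",
    "RefZ,89,91,93,118,142,143,170,171,172,197,198",
    "/home/User/data/identity_targets_refs.tsv",
    "/home/User/data/multiple_sequence_alignment.fasta",
]
pockets, paths = read_msa_input(example)
```

With a single reference, paths holds only the alignment path; with several references, the examples above place the identity file before the alignment.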

Reference sequence file

Input of:

subcommand identity with the option -R/--ref-seq

This file contains one reference ID per line, e.g:

RefA
RefZ

groups_x_min_y.tsv

Input of:

subcommand compare with the options -f1 and -f2
subcommand to_xlsx with the option -f/--file

Output of:

subcommand run

The x corresponds to the -e/--eps value of the subcommand run. By default, the value is auto, meaning that it is chosen automatically before the clustering, based on the distribution of the normalised distances.

The y corresponds to the --min-samples value of the subcommand run. By default, the value is auto, meaning 5 if the number of samples is ≤ 1500 and 25 otherwise.
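
The rule for the auto value of --min-samples, as stated above, amounts to the following. The function name default_min_samples is hypothetical and only restates the rule; it is not taken from the ASMC source.

```python
def default_min_samples(n_samples: int) -> int:
    """Return the 'auto' value of --min-samples for a given number of samples."""
    # 5 up to and including 1500 samples, 25 above that
    return 5 if n_samples <= 1500 else 25


small = default_min_samples(1500)
large = default_min_samples(1501)
```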

The format is ID Active_site_sequence Group_id, e.g:

ID1	ACQGINFIRVDYEIHIGMGGT	-1
ID2	SAEGINLMRNSFVQHVGHQGT	0
ID3	SAEGINFVRNSFVQHVGHQGT	0
ID4	SCEGVNFVRVDRLVHVGLIGT	1
ID5	SCEGVNFIRVDRLVHVGLIGT	1

Note: The group numbering starts at 0, and -1 is the ID for the outliers.
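
To work with this file outside ASMC, the members of each group can be collected as in the sketch below. It assumes three tab-separated columns and no header line, as in the example; the function name read_groups is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle):
    """Map each group ID to a list of (sequence ID, active site sequence)."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:  # skip empty or malformed lines
            continue
        seq_id, active_site, group_id = row
        groups[int(group_id)].append((seq_id, active_site))
    return dict(groups)


example = io.StringIO(
    "ID1\tACQGINFIRVDYEIHIGMGGT\t-1\n"
    "ID2\tSAEGINLMRNSFVQHVGHQGT\t0\n"
    "ID3\tSAEGINFVRNSFVQHVGHQGT\t0\n"
    "ID4\tSCEGVNFVRVDRLVHVGLIGT\t1\n"
)
groups = read_groups(example)
outliers = groups.get(-1, [])  # -1 is the ID for the outliers
```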

identity_targets_refs.tsv

Input of:

subcommand run as a path in a file to provide to the -M/--msa, see above
subcommand compare with the option -id

Output of:

subcommand run if the -s/--seqs is provided (homology modelling performed by ASMC with MODELLER)
subcommand identity

Output example for the subcommand identity:

id1	refA	62.50
id2	refA	68.75
id3	refZ	68.75
id4	refZ	50.00
id5	refZ	62.50

Note: the identity_targets_refs.tsv file returned by the subcommand run has 4 columns; the last column contains the value of the --id option.
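
Because the file has 3 or 4 columns depending on the subcommand that wrote it, a reader can rely on the first three only. The sketch below does so; the function name read_identity_tsv is hypothetical, and it assumes tab-separated columns with no header line.

```python
import csv
import io


def read_identity_tsv(handle):
    """Return (target ID, reference ID, percent identity) for each line."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:  # skip empty or malformed lines
            continue
        # Any fourth column (written by the subcommand run) is ignored here.
        records.append((row[0], row[1], float(row[2])))
    return records


example = io.StringIO("id1\trefA\t62.50\nid3\trefZ\t68.75\n")
records = read_identity_tsv(example)
```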

active_site_checking.tsv

Output of:

subcommand compare

The format of this file is: ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ, e.g:

ID	G1	SEQ1	G2	SEQ2	DIFF	REF_ID	PERC_ID	REF_SEQ
ID22	0	FGSNLGCYEVFMYP	0	FGSNLGCYEVFMYP	0	REFC	16.81	LPSQLDWYEVMEYP
ID45	0	ILSKVAWFEVFVPG	-1	ILS-VAWFEAVIYP	5	REFB	18.14	VLSAAAWYEIIVYP
ID48	0	VGSEVTWYESAMYP	0	VGSSVTWYESAMYP	1	REFD	26.85	LGSQVTWYEIIIYP
ID61	0	IASQMGWYEAIIYP	0	IASQMGWYEAIIYP	0	REFB	39.82	VLSAAAWYEIIVYP
ID67	0	ILSAAAWYEIIVYP	0	ILSAAAWYEIIVYP	0	REFB	51.77	VLSAAAWYEIIVYP

Note: The values in the G1 and G2 columns are simply the group IDs from their respective runs. Two 0s do not mean that the two groups have the same members. However, multiple runs with the same parameters on the same active sites alignment always return the same clusters.
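
As a usage illustration, the sketch below lists the IDs whose DIFF value is greater than 0. It assumes that the file starts with the header line shown in the example and is tab-separated; the function name ids_with_differences is hypothetical.

```python
import csv
import io


def ids_with_differences(handle):
    """Return the IDs of the rows whose DIFF column is greater than 0."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["ID"] for row in reader if int(row["DIFF"]) > 0]


example = io.StringIO(
    "ID\tG1\tSEQ1\tG2\tSEQ2\tDIFF\tREF_ID\tPERC_ID\tREF_SEQ\n"
    "ID22\t0\tFGSNLGCYEVFMYP\t0\tFGSNLGCYEVFMYP\t0\tREFC\t16.81\tLPSQLDWYEVMEYP\n"
    "ID45\t0\tILSKVAWFEVFVPG\t-1\tILS-VAWFEAVIYP\t5\tREFB\t18.14\tVLSAAAWYEIIVYP\n"
    "ID48\t0\tVGSEVTWYESAMYP\t0\tVGSSVTWYESAMYP\t1\tREFD\t26.85\tLGSQVTWYEIIIYP\n"
)
changed = ids_with_differences(example)
```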
