File Format

eddy-elisee edited this page Sep 12, 2024 · 10 revisions

Name convention

In structure file names, replace characters such as |, : and . with _.

For sequence IDs in a fasta file, you can use any ID, but all |, : and . characters are automatically replaced by _. However, for UniProt headers such as sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName, the UniqueIdentifier is extracted and used as the sequence ID.
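
The convention above can be sketched in a few lines of Python. This is only an illustration, not ASMC's own code: the function name normalise_sequence_id is hypothetical, and taking the first whitespace-separated token of the header as the ID is an assumption.

```python
import re


def normalise_sequence_id(header: str) -> str:
    """Illustrative helper (hypothetical name) applying the naming convention."""
    # Assumption: the ID is the first whitespace-separated token of the header.
    tokens = header.lstrip(">").split()
    if not tokens:
        return ""
    token = tokens[0]
    # UniProt-style headers: keep only the UniqueIdentifier field.
    uniprot = re.match(r"(?:sp|tr)\|([^|]+)\|", token)
    if uniprot:
        return uniprot.group(1)
    # Any other ID: replace |, : and . with _.
    return re.sub(r"[|:.]", "_", token)
```

With this sketch, a header such as sp|UniqueIdentifier|EntryName yields UniqueIdentifier, while an arbitrary ID such as my.seq:1 becomes my_seq_1.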

Files

Reference structure file

Input of:

  • subcommand run with the option -r/--ref
  • subcommand identity with the option -r/--ref-str

Mandatory to run ASMC based on structures.

This file contains one reference structure path per line, e.g.:

/home/User/data/RefA.pdb
/home/User/data/RefZ.pdb

Pocket csv file

Input of:

  • subcommand run with the option -p/--pocket

Output of:

  • subcommand run if the option -p/--pocket isn't provided

File used to indicate the active site positions. The format is ID,Chain,pos..., e.g.:

RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198

If not provided, this file will be built by ASMC.
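
As a rough sketch of how such a file could be read outside ASMC, the following Python snippet parses each line into a chain and a list of integer positions. The function name read_pocket_csv is hypothetical, and the assumption that every position is a plain integer comes only from the example above.

```python
import csv
import io


def read_pocket_csv(handle):
    """Map each reference ID to (chain, [positions]). Hypothetical helper."""
    pockets = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        ref_id, chain, *positions = (field.strip() for field in row)
        # Assumption: positions are plain integers, as in the example.
        pockets[ref_id] = (chain, [int(p) for p in positions])
    return pockets


example = io.StringIO(
    "RefA,A,55,57,59,77,101,102,129,130,131,145,148\n"
    "RefZ,A,89,91,93,118,142,143,170,171,172,197,198\n"
)
pockets = read_pocket_csv(example)
```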

Models file

Input of:

  • subcommand run with the option -m/--models
  • subcommand identity with the option -r/--ref-str*

*Only the first column is required.

Output of:

  • subcommand run if the option -s/--seqs is provided

This file is built by ASMC if the modelling step is performed. Otherwise, you must build it yourself and provide it to the subcommand run, e.g.:

/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb
/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb
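
If you build this file yourself, a quick check of its content can help. The sketch below is an illustration, not part of ASMC: read_models_file is a hypothetical name, and it assumes the two columns are separated by whitespace and that paths contain no spaces. The second column is optional here because only the first is needed for the subcommand identity.

```python
def read_models_file(lines):
    """Return (model_path, reference_path or None) pairs. Hypothetical helper."""
    pairs = []
    for line in lines:
        # Assumption: columns are whitespace-separated, paths have no spaces.
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        model = fields[0]
        reference = fields[1] if len(fields) > 1 else None
        pairs.append((model, reference))
    return pairs


pairs = read_models_file([
    "/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb",
    "/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb",
])
```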

Active sites alignment

Input of:

  • subcommand run with the option -a/--active-sites

Output of:

  • subcommand run if the option -a/--active-sites isn't provided
  • subcommand run also returns a fasta file for each group which can be used with the option -a/--active-sites

This is simply a fasta file containing all active site sequences to be clustered, e.g.:

>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA

Input for the -M/--msa option

Input of:

  • subcommand run with the option -M/--msa

The file should contain two pieces of information if only one reference is used:

  • The active site positions in the reference sequence
  • The path to the multiple sequence alignment

refA,55,57,59,77,101,102,129,130,131,145,148
/home/User/data/multiple_sequence_alignment.fasta

If there are multiple references, the file must contain 1) the pocket positions of each reference and 2) the path to a file similar to identity_targets_refs.tsv (see below), followed by the path to the multiple sequence alignment:

refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
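
One way to read this layout is sketched below. It is not how ASMC necessarily parses the file: read_msa_input is a hypothetical name, and the rule used to tell the two kinds of lines apart (an ID followed only by comma-separated integers is a positions line, anything else is a path) is an assumption based on the examples above.

```python
def read_msa_input(lines):
    """Split the file into reference positions and paths. Hypothetical helper."""
    positions = {}
    paths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        ref_id, *fields = line.split(",")
        # Assumption: a positions line is an ID followed only by integers.
        if fields and all(f.strip().isdigit() for f in fields):
            positions[ref_id] = [int(f) for f in fields]
        else:
            paths.append(line)
    return positions, paths


positions, paths = read_msa_input([
    "refA,55,57,59,77,101,102,129,130,131,145,148",
    "RefZ,89,91,93,118,142,143,170,171,172,197,198",
    "/home/User/data/identity_targets_refs.tsv",
    "/home/User/data/multiple_sequence_alignment.fasta",
])
```

With a single reference the paths list would hold only the alignment path, and with several references it would hold the identity file followed by the alignment, in the order shown in the examples.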

Reference sequence file

Input of:

  • subcommand identity with the option -R/--ref-seq

This file contains one reference ID per line, e.g.:

RefA
RefZ

groups_x_min_y.tsv

Input of:

  • subcommand compare with the options -f1 and -f2
  • subcommand to_xlsx with the option -f/--file

Output of:

  • subcommand run

The x corresponds to the -e/--eps value of the subcommand run. By default, this value is auto, meaning that it is chosen automatically before the clustering, based on the distribution of normalised distances.

The y corresponds to the --min-samples value of the subcommand run. By default, this value is auto, meaning 5 if the number of samples is ≤ 1500 and 25 otherwise.
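
The auto rule for --min-samples stated above amounts to a single threshold test. The function below is only a restatement of that rule in Python; the name default_min_samples is hypothetical and this is not ASMC's own code.

```python
def default_min_samples(n_samples: int) -> int:
    """Restate the documented 'auto' rule: 5 up to 1500 samples, 25 above."""
    return 5 if n_samples <= 1500 else 25
```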

The format is ID Active_site_sequence Group_id (tab-separated), e.g.:

ID1	ACQGINFIRVDYEIHIGMGGT	-1
ID2	SAEGINLMRNSFVQHVGHQGT	0
ID3	SAEGINFVRNSFVQHVGHQGT	0
ID4	SCEGVNFVRVDRLVHVGLIGT	1
ID5	SCEGVNFIRVDRLVHVGLIGT	1

Note: group numbering starts at 0, and -1 is the ID of the outliers.
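
For downstream analysis, the file can be loaded into a mapping from group ID to members. The following is a minimal sketch with a hypothetical function name, read_groups, assuming three tab-separated columns and no header line, as in the example above.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle):
    """Map each group ID to its (ID, sequence) members. Hypothetical helper."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or incomplete lines
        seq_id, sequence, group_id = row[:3]
        groups[int(group_id)].append((seq_id, sequence))
    return dict(groups)


example = io.StringIO(
    "ID1\tACQGINFIRVDYEIHIGMGGT\t-1\n"
    "ID2\tSAEGINLMRNSFVQHVGHQGT\t0\n"
    "ID3\tSAEGINFVRNSFVQHVGHQGT\t0\n"
)
groups = read_groups(example)
outliers = groups.get(-1, [])  # -1 holds the outliers
```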

identity_targets_refs.tsv

Input of:

  • subcommand run, as a path listed in the file provided to the option -M/--msa (see above)
  • subcommand compare with the option -id

Output of:

  • subcommand run if the option -s/--seqs is provided (homology modelling performed by ASMC with MODELLER)
  • subcommand identity

Output example for the subcommand identity:

id1	refA	62.50
id2	refA	68.75
id3	refZ	68.75
id4	refZ	50.00
id5	refZ	62.50

Note: the identity_targets_refs.tsv returned by the subcommand run has 4 columns; the last column contains the value of the --id option.
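
Because the file has three columns from the subcommand identity and four from the subcommand run, a reader may need to accept both. The sketch below is an illustration only: read_identity_table is a hypothetical name, the file is assumed to be tab-separated without a header line, and the fourth column is kept as raw text since its exact content is not detailed here.

```python
import csv
import io


def read_identity_table(handle):
    """Return (target, reference, identity, extra) tuples. Hypothetical helper."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 3:
            continue  # skip empty or incomplete lines
        target, reference, identity = fields[:3]
        # Optional fourth column (present in the output of the subcommand run).
        extra = fields[3] if len(fields) > 3 else None
        rows.append((target, reference, float(identity), extra))
    return rows


example = io.StringIO("id1\trefA\t62.50\nid3\trefZ\t68.75\n")
rows = read_identity_table(example)
```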

active_site_checking.tsv

Output of:

  • subcommand compare

The format of this file is ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ, e.g.:

ID	G1	SEQ1	G2	SEQ2	DIFF	REF_ID	PERC_ID	REF_SEQ
ID22	0	FGSNLGCYEVFMYP	0	FGSNLGCYEVFMYP	0	REFC	16.81	LPSQLDWYEVMEYP
ID45	0	ILSKVAWFEVFVPG	-1	ILS-VAWFEAVIYP	5	REFB	18.14	VLSAAAWYEIIVYP
ID48	0	VGSEVTWYESAMYP	0	VGSSVTWYESAMYP	1	REFD	26.85	LGSQVTWYEIIIYP
ID61	0	IASQMGWYEAIIYP	0	IASQMGWYEAIIYP	0	REFB	39.82	VLSAAAWYEIIVYP
ID67	0	ILSAAAWYEIIVYP	0	ILSAAAWYEIIVYP	0	REFB	51.77	VLSAAAWYEIIVYP

Note: The values in the G1 and G2 columns are only the group IDs from their respective runs. Two 0s do not mean that both groups have the same members. However, multiple runs with the same parameters on the same active sites alignment always return the same clusters.