Skip to content

Latest commit

 

History

History
655 lines (577 loc) · 46.6 KB

README.md

File metadata and controls

655 lines (577 loc) · 46.6 KB

codonHomologizer.pl

WHAT IS THIS:

This script, given a set of amino acid sequences and a codon usage table, constructs the underlying genetic sequences to optimize them for crossover events. It takes 2 or more protein sequences, aligns them to optimize codon homology (using a custom amino acid weight matrix as input to a multiple sequence alignment tool (muscle)), and then recodes the amino acid sequences to be more homologous at the DNA level. It utilizes the supplied codon usage table to prefer codons that are more common in a particular organism.

For information on stretchAlign.pl, see below.

DETAILS:

Note, the alignment step is a multiple sequence alignment, but the sequence construction step is pair-wise. Each sequence is produced multiple times, optimized for crossover for every other sequence.

To omit codons from incorporation in the resulting sequence, omit them from the codon usage file (or comment them out - refer to the format for the -c file below).

This script produces "Crossover Metrics" intended to be used to evaluate how good each pair of sequences is in terms of their potential to be involved in crossover events. Here is a listing of the columns in the Crossover Metrics table:

Source             - The file the metrics are based on.
Pair               - Sequential/arbitrary unique pair ID.
Aln Len            - The length of the alignment produced.
Seqs               - Sequence IDs.
%ID                - % identity: identicals/non-gap-length.
%Aligned NTs       - Non-gap aln length / aln length.
Contig Ident >= N  - Sum of contiguous identical NTs >= N.
Rare Cdns(Usg<=N)  - Number of codons incorporated whose usage is <= N.
Rarest Codon       - Codon:usage.  Rarest codon incorporated.
Frame Shift bases  - Number of NTs aligned out of frame.
Seg Aln Score[0-1] - Segment alignment score, a value between 0 and 1 (inclusive).

The Seg Aln Score is only present if the -d option is used to align for example, similar protein structural domains with one another. It can also be used to force stretches of identity to align in pairs of sequences by supplying it a file generated by stretchAlign.pl (provided with this package)

To evaluate a sequence (nt alignment) produced by another tool, you can use the -e option. All this does is add entries to the Crossover Metrics table at the end of the run.

INSTALL

Requires the multiple sequence alignment tool called muscle, available here:

http://www.drive5.com/muscle/downloads.htm

After installing muscle, cd into the codonHomologizer directory and run the following:

perl Makefile.PL
make
sudo make install

USAGE

codonHomologizer.pl -c "input file(s)" [options]

 -c                   REQUIRED Codon usage file.
 -i                   OPTIONAL [stdin if present] Amino acid sequence file.
 --help               OPTIONAL Print general info and file formats.
 --extended           OPTIONAL Print extended usage.

ADVANCED USAGE

codonHomologizer.pl -c "input file(s)" [-i,-f,-o,-a,-s,--aa-aln-suffix,-x,-y,-z,-r,-w,-e,--stretch-reporting-min,--usage-reporting-max,-d,-n,--dir,--outfile,--help,--dry-run]
codonHomologizer.pl -c "input file(s)" [-i,-f,-o,-a,-s,--aa-aln-suffix,-x,-y,-z,-r,-w,-e,--stretch-reporting-min,--usage-reporting-max,-d,-n,--dir,--outfile,--help,--dry-run] < input_file

 -c,--codon-file,     REQUIRED Tab-delimited file containing the single
 --codon-usage-file            letter amino acid (AA) code, codon, and
                               [optional] usage score (see -n).  This
                               script will only use the codons in this file
                               to construct sequences and for the matrix
                               used in the alignment step.  All AAs must be
                               present.  If usage is unimportant, make the
                               usage scores all the same (e.g. 1.0) or
                               supply -n to ignore the usage column.
                               Commented out lines in the file will be
                               ignored.  See --help for more details.
 -i,--aa-file,*       OPTIONAL [stdin if present] Input file(s).  Space
                               separated, globs OK (e.g. $flag "*.input
                               [A-Z].{?,??}.inp").  See --help for file
                               format.  May be supplied on standard in.
                               When standard input detected and -i is given
                               only 1 argument, it will be used as a file
                               name stub for appending outfile suffixes.
                               See --extended --help for advanced usage
                               examples.
                               *No flag required.
 -f,--aa-seq-format   OPTIONAL [auto]{auto, fasta, fastq} Amino acid
                               sequence file format.  Applies to files
                               submitted using -i and -e.
 -o,--dna-suffix,     OPTIONAL [.fna] The extension of the output DNA
 --dna-extension               sequence file that is constructed from and
                               appended to the input amino acid sequence
                               file (see -i).
 -a,--align-codons    OPTIONAL [1] Align the codons in the output DNA
                               sequence file to match the amino acid
                               alignment (see -o).  This is the default
                               behavior.  Use --no-align-codons to generate
                               sequences without gap characters.
 -s,--matrix-suffix,  OPTIONAL [none] The extension of the output protein
 --matrix-extension            weight matrix file that is appended to and
                               constructed from the codon usage file (see
                               -c).  The generation of this file is not
                               required except for re-use and
                               reproducability.  It can be re-used by
                               supplying it using the -w option.
 --aa-aln-suffix      OPTIONAL [none] The aligned amino acid sequence
                               generated by the alignment tool (-x/-y) can
                               be optionally saved to a file.  While this
                               file is not necessary, it can be useful for
                               debugging.  The end goal is a DNA sequence.
                               Note, the default outfile suffix for the DNA
                               output is '.fna'.  If you want to output the
                               amino acid alignment, it is recommended to
                               use '.faa' to differentiate it from the DNA
                               output file.
 -x,--alignment-tool  OPTIONAL [muscle]{muscle, clustalw} The system
                               command line tool to be used in a system
                               call to align the amino acid sequences (see
                               -i).  The path to the executable can be
                               provided by -y and the command line options
                               can be tweaked using -z.
 -y,                  OPTIONAL [auto] The path to the executable selected
 --alignment-exe-path          by -x.  The default is simply the name of
                               the executable selected by -x (e.g. 'muscle'
                               or 'clustalw'), assuming it's in your PATH.
 -z,                  OPTIONAL [auto] Command line options provided to the
 --alignment-opts-str          sequence alignment tool specified by -y.
                               The default is determined by the selection
                               made with the -x option.  For muscle, the
                               default options are [-gapopen -12 -gapextend
                               -3].  For clustalw, the default options are
                               [-pwgapopen -12 -pwgapext -3 -quiet].  Each
                               command has a series of prohibited options
                               that are controlled by the script.

                               Muscle prohibited options: [-in -out -clw
                               -fasta -html -phys -pyi -msf -clwstrict
                               -clwout -clwstrictout -msfout -htmlout
                               -physout -phyiout -matrix]

                               Clustalw prohibited options: [-infile
                               -pwmatrix -matrix -output -outfile -type
                               -interactive]

                               It is possible that other options other than
                               these that are prohibited could cause
                               problems, but the lists here represent only
                               those specific options added by this script
                               or those that directly conflict with those
                               added by this script.
 -r,                  OPTIONAL [0] Treat the input AA sequence file as
 --precomputed-aa-             pre-aligned.  If you already have an amino
 alignment                     acid alignment and want to simply generate
                               nucleotide sequences from them that have as
                               much homology as possible, supply your AA
                               alignment file using -i and supply this
                               option to prevent the script from stripping
                               the alignment characters (e.g. gaps '-') and
                               re-aligning them.
 -w,                  OPTIONAL If you do not want this script to construct
 --precomputed-                a matrix based on homology and codon usage
 weight-matrix                 (using -c) to be used in aligning the amino
                               acid sequences, supply your preferred matrix
                               here.  One reason to do this would be to
                               sacrifice DNA homology to get an alignment
                               that has more biological significance.  The
                               matrix constructed by this script optimizes
                               for homology and ingores biological protein
                               similarity to promote as many crossovers as
                               possible in disparate sequences.  A codon
                               usage table (-c) is still required in order
                               to construct sequences with optimized
                               homology while allowing homologous ties to
                               defer to the more used codons.
 -e,                  OPTIONAL Aligned nucleotide sequence file.  The
 --evaluate-nt-aln-            sequence generated by other tools can be
 file,--nt-file                supplied for evaluation using the Crossover
                               Metrics generated at the end of the script.
                               Only one such file can be supplied, but it
                               may contain any number of aligned sequences.
 --stretch-reporting- OPTIONAL [5,11] After nucleotide sequence
 min                           construction (see -o) or the submission of a
                               nucleotide alignment (see -e), metrics are
                               reported to gauge the effectiveness of the
                               alignment and codon usage to produce
                               sequences that are likely to be involved in
                               crossover events.  Each possible pair of
                               sequences will report the number of
                               alignment positions involved in a contiguous
                               stretch of identity that is equal to or
                               larger than the number(s) supplied to this
                               option.  For example, if this minimum value
                               is 5 and a pair of sequences each has an
                               aligned segment of ATTGCTA (and no other
                               stretches of identity that are 5 or longer),
                               the number reported for 'Contig Ident >= 5'
                               will be 7.  The cutoff must be an unsigned
                               integer.
 --usage-reporting-   OPTIONAL [0.1] After nucleotide sequence construction
 max                           (see -o) or the submission of a nucleotide
                               alignment (see -e), metrics are reported to
                               gauge the effectiveness of the alignment and
                               codon usage to produce sequences that are
                               likely to be involved in crossover events.
                               Generally, codons more highly used are
                               utilized, but lesser used codons can get
                               used if another codon pair between aligned
                               amino acids has more homology.  Each
                               sequence will have reported the number of
                               rare codons that were included for each pair
                               of sequences.  You can set the max cutoff
                               using this option.  Multiple space-delimited
                               values can be supplied.  Any unsigned number
                               is acceptable.
 -d,--segment-file,   OPTIONAL To align segments of each sequence (e.g.
 --domain-file                 structural/secondary domains or any regions
                               that should be aligned together)
                               independently and stitch the global
                               alignment back together, you can provide a
                               segment file containing pairs of coordinates
                               that should be aligned sequentially within
                               each sequence.

                               A companion script (stretchAlign.pl) has
                               been provided in this package that can
                               generate a segment file that forces
                               codonHomologizer to align the largest number
                               of stretches of identity possible.  It
                               generates a file for every pair of
                               sequences.
 -n,                  OPTIONAL [0] If codon usage is not important to you,
 --ignore-codon-usage          you can ignore usage by supplying this
                               option.  Any usage values in the codon usage
                               file (see -c) will be ignored and the weight
                               matrix used in the alignment step and the
                               sequence construction will only be based on
                               homology (and homology placement/flexibility)
                               .  Note, the hierarchy of factors is:
                               homology, homology flexibility, and usage.
                               See --help --extended for more details.
 --dir,--outdir       OPTIONAL [none] All outfiles will be put in this
                               directory.
 --outfile,           OPTIONAL [stdout] Output file(s) - a named outfile
 --output-file                 that is a mutually exclusive alternative to
                               supplying an outfile suffix.  Space
                               separated.
 --help               OPTIONAL [1] Print general info and file format
                               descriptions.  Includes advanced usage
                               examples with --extended.
 --dry-run            OPTIONAL Run without generating output files.
 --verbose            OPTIONAL Verbose mode/level.  (e.g. --verbose 2)
 --quiet              OPTIONAL Quiet mode.
 --overwrite          OPTIONAL Overwrite existing output files.  By
                               default, existing output files will not be
                               over-written.  Supply twice to safely*
                               remove pre-existing output directories (see
                               --outdir).  Mutually exclusive with
                               --skip-existing and --append.
                               *Will not remove a directory containing
                               manually touched files.
 --skip-existing      OPTIONAL Skip existing output files.  Mutually
                               exclusive with --overwrite and --append.
 --append             OPTIONAL Append to existing output files.  Mutually
                               exclusive with --overwrite and
                               --skip-existing.
 --force              OPTIONAL Prevent script-exit upon critical error and
                               continue processing.  Supply twice to
                               additionally prevent skipping the processing
                               of input files that cause errors.  Use this
                               option with extreme caution.  This option
                               will not over-ride over-write protection.
                               See also --overwrite or --skip-existing.
 --header,--noheader  OPTIONAL [On] Print commented script version, date,
                               and command line call to each output file.
 --debug              OPTIONAL Debug mode/level.  (e.g. --debug --debug)
                               Values less than 0 debug the template code
                               that was used to create this script.
 --error-type-limit   OPTIONAL [5] Limits each type of error/warning to
                               this number of outputs.  Intended to
                               declutter output.  Note, a summary of
                               warning/error types is printed when the
                               script finishes, if one occurred or if in
                               verbose mode.  0 = no limit.  See also
                               --quiet.
 --version            OPTIONAL Print version info.  Includes template
                               version with --extended.
 --save-as-default    OPTIONAL Save the command line arguments.  Saved
                               defaults are printed at the bottom of this
                               usage output and used in every subsequent
                               call of this script.  Supplying this flag
                               replaces current defaults with all options
                               that are provided with this flag.  Values
                               are stored in [$defdir].
                               See --help --extended for changing script
                               behavior when no arguments are supplied on
                               the command line.
 --pipeline-mode      OPTIONAL Supply this flag to include the script name
                               in errors, warnings, and debug messages.  If
                               not supplied, the script will try to
                               determine if it is running within a series
                               of piped commands or as a part of a parent
                               script.
 --extended           OPTIONAL Print extended usage/help/version/header
                               (and errors/warnings where noted).  Supply
                               alone for extended usage.  Includes extended
                               version in output file headers.
                               Incompatible with --noheader.  See --help &
                               --version.

INPUT FORMAT: -i, --aa-file

The amino acid sequence file can be in fasta or fastq format. There must be 2 or more sequences. This is an unaligned sequence file. Any alignment characters present such as gap characters ('-') or spaces will be removed. It may be upper or lower case. Codons for any ambiguous nucleotides will be generated as NNN.

INPUT FORMAT: -c, --codon-file, --codon-usage-file

A tab- or space- delimited file of codon usage scores. The first column is the single-character amino acid code, the second column is the codon (3 nucleotides), and the third column us the codon's usage score. The score can be any integer or fraction. The scores will be normalized, so the score column does not need to sum to any value. Additional columns are optional and ignored. Not all codons are required to be present. All amino acids and the stop character ('*') are required to be present. To exclude a codon from incorporation into all recoded sequences, you can either remove that row from the file or comment it out by inserting a '#' at the beginning of the line. Empty lines are permissible. Example:

G	GGG	0.11	Gly
G	GGA	0.21	Gly
G	GGT	0.49	Gly
G	GGC	0.19	Gly
E	GAG	0.29	Glu
E	GAA	0.71	Glu
D	GAT	0.65	Asp
D	GAC	0.35	Asp
...

INPUT FORMAT: -w, --precomputed-weight-matrix

The format of the weight matrix file is the same as is used by sequence alignment tools such as clustalw and muscle. There is a row and column for every single-letter aminco acid code and each cell contains a score representing how good of a match the pair of amino acids is (i.e. how similar they are). The last row and column (represented by '*') contains the minimum score. Examples of these matrices can be found (at the time of this writing) here:

ftp://ftp.ncbi.nih.gov/blast/matrices/

Example (BLOSUM62):

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

INPUT FORMAT: -e, --evaluate-nt-aln-file, --nt-file

The aligned nucleotide sequence file can be in fasta or fastq format. There must be 2 or more aligned sequences of the same length and they must encode a protein (i.e. the first character must be in the first 5'->3' frame). It may be upper or lower case.

INPUT FORMAT: -d, --segment-file, --domain-file

A segment file is a tab-delimited file consisting of a series of amino acid coordinates (starting from the first amino acid as coordinate 1) in each protein. Each row of the file is for a different sequence in the input sequence file (see -i). The first column is the sequence ID (which must match all the IDs in the sequence file from the first non-white-space character after the defline character to the next white-space character (or the end of the line)). E.g. a defline of ">id1 my protein" would match a value in the first column of the segment file of "id1". See the usage for option -d for details on how the coordinates are treated.

A segment file contains coordinate pairs for each protein that are aligned with one-another independently. There must be the same number of coordinates on each row. An even or odd number of coordinates are acceptable, but they are treated in pairs, meaning each odd and even columned coordinate is treated as an inclusive pair such that the AAs corresponding to the pair are aligned with that segment [inclusive]. In the event that there are an odd number of coordinate columns, the coordinate in the last column is paired with the last coordinate in the protein.

Gaps between pairs are also aligned in an exclusive coordinate fashion, but if one protein contains a gap and another has no gap between pairs, e.g. 1-10 and 5-15 (where no characters in the first sequence are aligned with the first 4 characters of the second sequence), gap characters are artificially inserted. If the first coordinate (1) or last coordinate of a protein are not included, the first and last segments are aligned such that the end coordinate is inclusive and the inner coordinate is exclusive.

Example:

Mj	24	119	160	256	352	482	487	711
Pf	1	100	152	269	360	544	552	769
Sp	1	148	220	342	405	488	500	796
Hs	38	176	230	371	419	514	516	817

Coordinates aligned:

Mj	1-23	24-119	120-159	160-256	257-351	352-482	483-486	487-711	712-end
Pf	*	1-100	101-151	152-269	270-359	360-544	545-551	552-769	770-end
Sp	*	1-148	149-219	220-342	343-404	405-488	489-499	500-796	797-end
Hs	1-37	38-176	177-229	230-371	372-418	419-514	515-515	516-817	818-end
  • Sequence excluded from segment alignment and filled in with gap characters.

If a coordinate is larger than the protein provided, an error or warning will be printed. If the second coordinate in the last pair is larger than the protein, it will be automatically adjusted to the last possible protein coordinate with a warning. Any other coordinate out of range will generate a fatal error.

Coordinates on each row must be in sequential ascending order.

OUTPUT FORMAT: -o, --dna-suffix, --dna-extension, --outfile, --output-file

Fasta sequence file.

OUTPUT FORMAT: -s, --matrix-suffix, --matrix-extension

The format of the weight matrix file is the same as is used by sequence alignment tools such as clustalw and muscle. There is a row and column for every single-letter aminco acid code and each cell contains a score representing how good of a match the pair of amino acids is (i.e. how similar they are). The last row and column (represented by '*') contains the minimum score. Examples of these matrices can be found (at the time of this writing) here:

ftp://ftp.ncbi.nih.gov/blast/matrices/

Example (BLOSUM62):

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
                                                                                                                                                                                                                                                                                                                     * OUTPUT FORMAT: --aa-aln-suffix

The format output by the tool selected by -x/-y.

stretchAlign.pl

WHAT IS THIS:

This script, given a set of very dissimilar amino acid sequences, will locate all (sequentially ordered) stretches of identical residues between every possible pair and output a table of coordinates. The coordinate table can then be supplied to the companion script "codonHomologizer.pl" to force the alignment to align these stretches of identity together. It will preferentially select a combination of stretches that will create the fewest gaps in the final alignment produced by codonHomologizer (ignoring end gaps).

A word of caution: This is an exhaustive & recursive search. There are combinations of parameters that can yield very long runtimes. The search can be very fast by narrowing the relative coordinate window in which to search for identity. A minimum of 3 idential AA residues is enforced, but you can use --force to over-ride the restriction.

DETAILS:

This script finds identical stretches of the requested length only, however if a stretch of identity is some multiple of the requested stretch size, the resulting coordinates in the output table will be merged.

A hash is used in the search. The search is case-insensitive.

This script was created for very dissimilar sequences because existing alignment tools do not align to optimize stretches of identity, yet stretches of identity are required for crossover events to occur.^ If your sequences are very similar, this script is unnecessary.

^ https://www.ncbi.nlm.nih.gov/pubmed/27671930

USAGE (stretchAlign)

stretchAlign.pl -i "input file(s)" [options]

 -i                   REQUIRED [stdin if present] Amino acid sequence file.
 -s                   OPTIONAL [3] Stretch size.
 --help               OPTIONAL Print general info and file formats.
 --extended           OPTIONAL Print extended usage.

ADVANCED USAGE (stretchAlign)

stretchAlign.pl -i "input file(s)" [-s,-g,-l,-r,-m,-t,-p,-f,-o,-x,-e,--dir,--outfile,--help,--dry-run]
stretchAlign.pl -i "input file(s)" [-s,-g,-l,-r,-m,-t,-p,-f,-o,-x,-e,--dir,--outfile,--help,--dry-run] < input_file

 -i,--aa-file,*       REQUIRED [stdin if present] Input file(s).  Space
                               separated, globs OK (e.g. $flag "*.input
                               [A-Z].{?,??}.inp").  See --help for file
                               format.  May be supplied on standard in.
                               When standard input detected and -i is given
                               only 1 argument, it will be used as a file
                               name stub for appending outfile suffixes.
                               See --extended --help for advanced usage
                               examples.
                               *No flag required.
 -s,--stretch-size    OPTIONAL [3] This is the length with which to look
                               for perfect matches between pairs of
                               sequences.  Only a series of matches of this
                               length (or multiples of this length) will be
                               found.  The script must be re-run to look
                               for multiple lengths.
 -g,                  OPTIONAL [auto*] Matches between sequences are only
 --offset-max-global           found if they occur within this relative
                               coordinate range.  E.g. If this value is set
                               to N, a match at coordinate 100 in sequence
                               1 will only be reported if it occurs
                               (/starts) in sequence 2 between coordinates
                               100-N and 100+N.However, matches are forced
                               to be sequential.  In other words, once a
                               match is located, the next match is
                               restricted to everything to the right in
                               each sequence.  Must be greater than or
                               equal to --offset-max-local.  *If the
                               difference between lengths of the sequences
                               is greater than [50], the default range is
                               the lesser of either the sequence size
                               difference or [100].
 -l,                  OPTIONAL [auto*] Each sequential match between
 --offset-max-local            sequences are only allowed to be found if
                               they occur within this coordinate range
                               relative to the positions of the previous
                               match.  E.g. If this value is set to N and
                               the previous match was at positions 100 and
                               200 for sequences 1 and 2 respectively, a
                               match at coordinate 1000 in sequence 1 will
                               only be reported if it occurs (/starts) in
                               sequence 2 between coordinates 1100-N and
                               1100+N.However, matches are forced to be
                               sequential.  In other words, once a match is
                               located, the next match is restricted to
                               everything to the right in each sequence.
                               Must be less than or equal to
                               --offset-max-global.  *The default value is
                               the value of --offset-max-global.
 -r,                  OPTIONAL [1] The maximum proportion of anticipated
 --gap-ratio-max-              gaps to be created between a pair of
 local                         stretches (see -s).  In other words, this is
                               the difference in size of the sequences
                               between a pair of stretches over the longer
                               of the two.  A value between 0 and 1 is
                               recommended, but it can be greater than 1.
                               See -m for an alternative gap control.
 -m,--gap-max         OPTIONAL [infinity] Stretch positions are restricted
                               to a range (see -g and -l), but many gaps
                               can be created in the resulting global
                               alignment after submitting to
                               codonHomologizer.  While this maximum cannot
                               control preceisely how many gaps will be
                               used in the alignment, it can ensure that
                               the sum total of length differences between
                               the stretches (see -s) found is kept under
                               this maximum.
 -t,--time-limit      OPTIONAL [60] Report the best solution found for a
                               pair of sequences after searching for this
                               number of seconds.  Every possible pair will
                               get this many seconds to find the largest
                               number of stretches that meet all the
                               requirements.  See -p to reset this time
                               restriction upon finding a better solution.
                               Set to 0 to search all possible solutions
                               with no time limit.
 -p,                  OPTIONAL [1] As long as the solution keeps improving,
 --improvement-reset           reset the --time-limit.  Only finding
                               solutions with more stretches (see -s) will
                               reset the timer.  Improved solutions with
                               fewer alignment gaps, but the same number of
                               stretches will not reset the timer.
 -f,--aa-seq-format   OPTIONAL [auto]{auto, fasta, fastq} Amino acid
                               sequence file format.  Applies to files
                               submitted using -i.
 -o,--outfile-suffix  OPTIONAL [.seg] The extension of the output
                               tab-delimited coordinate file of matching
                               stretches (to be supplied to
                               codonHomologizer.pl using the -d option).
                               An outfile for every pair of sequences in -i
                               will be created using a numeric ID per pair
                               and prepended to this suffix (e.g. .1.seg,
                               .2.seg,...).
 -x,--aa-pair-suffix  OPTIONAL [.faa] Every pair of sequences used to find
                               matching stretches (see -i and -s) generates
                               a stretch coordinate file (see -o).  If
                               there are more than 2 sequences in the
                               supplied sequence file (see -i), a
                               corresponding fasta file (per pair) will be
                               generated with this extension in order to be
                               supplied with the stretch coordinate file to
                               codonHomologizer using codonHomologizer's -i
                               option.  An outfile for every pair of
                               sequences will be created using a numeric ID
                               per pair and prepended to this suffix (e.g.
                               .1$aa_suffix, .2$aa_suffix,...).  The pair
                               ID will correspond to the pair ID of the
                               --outfile-suffix.
 -e,                  OPTIONAL [0] Pad the coordinates of the found
 --expand-coords-by            stretches by this amount.  This can help
                               coerce the alignment algorithm used in
                               codonHomologizer to abutt non-matching amino
                               acids with the stretch so that selected
                               codons could potentially lengthen the
                               stretch of homology by a nucleotide or 2.
 --dir,--outdir       OPTIONAL [none] All outfiles will be put in this
                               directory.
 --outfile,           OPTIONAL [stdout] Output file(s) - a named outfile
 --output-file                 that is a mutually exclusive alternative to
                               supplying an outfile suffix.  Space
                               separated.
 --help               OPTIONAL [1] Print general info and file format
                               descriptions.  Includes advanced usage
                               examples with --extended.
 --dry-run            OPTIONAL Run without generating output files.
 --verbose            OPTIONAL Verbose mode/level.  (e.g. --verbose 2)
 --quiet              OPTIONAL Quiet mode.
 --overwrite          OPTIONAL Overwrite existing output files.  By
                               default, existing output files will not be
                               over-written.  Supply twice to safely*
                               remove pre-existing output directories (see
                               --outdir).  Mutually exclusive with
                               --skip-existing and --append.
                               *Will not remove a directory containing
                               manually touched files.
 --skip-existing      OPTIONAL Skip existing output files.  Mutually
                               exclusive with --overwrite and --append.
 --append             OPTIONAL Append to existing output files.  Mutually
                               exclusive with --overwrite and
                               --skip-existing.
 --force              OPTIONAL Prevent script-exit upon critical error and
                               continue processing.  Supply twice to
                               additionally prevent skipping the processing
                               of input files that cause errors.  Use this
                               option with extreme caution.  This option
                               will not over-ride over-write protection.
                               See also --overwrite or --skip-existing.
 --header,--noheader  OPTIONAL [On] Print commented script version, date,
                               and command line call to each output file.
 --debug              OPTIONAL Debug mode/level.  (e.g. --debug --debug)
                               Values less than 0 debug the template code
                               that was used to create this script.
 --error-type-limit   OPTIONAL [5] Limits each type of error/warning to
                               this number of outputs.  Intended to
                               declutter output.  Note, a summary of
                               warning/error types is printed when the
                               script finishes, if one occurred or if in
                               verbose mode.  0 = no limit.  See also
                               --quiet.
 --version            OPTIONAL Print version info.  Includes template
                               version with --extended.
 --save-as-default    OPTIONAL Save the command line arguments.  Saved
                               defaults are printed at the bottom of this
                               usage output and used in every subsequent
                               call of this script.  Supplying this flag
                               replaces current defaults with all options
                               that are provided with this flag.  Values
                               are stored in [$defdir].
                               See --help --extended for changing script
                               behavior when no arguments are supplied on
                               the command line.
 --pipeline-mode      OPTIONAL Supply this flag to include the script name
                               in errors, warnings, and debug messages.  If
                               not supplied, the script will try to
                               determine if it is running within a series
                               of piped commands or as a part of a parent
                               script.
 --extended           OPTIONAL Print extended usage/help/version/header
                               (and errors/warnings where noted).  Supply
                               alone for extended usage.  Includes extended
                               version in output file headers.
                               Incompatible with --noheader.  See --help &
                               --version.

INPUT FORMAT: -i, --aa-file

The amino acid sequence file can be in fasta or fastq format. There must be 2 or more sequences. This is an unaligned sequence file. Any alignment characters present such as gap characters ('-') or spaces will be removed. It may be upper or lower case.

OUTPUT FORMAT: -o, --outfile-suffix, --outfile, --output-file

Tab-delimited text file. The first column is the sequence ID, as parsed from the deflines of -i.

OUTPUT FORMAT: -x, --aa-pair-suffix

Fasta format.