This script, given a set of amino acid sequences and a codon usage table, constructs the underlying genetic sequences to optimize them for crossover events. It takes 2 or more protein sequences, aligns them to optimize codon homology (using a custom amino acid weight matrix as input to a multiple sequence alignment tool (muscle)), and then recodes the amino acid sequences to be more homologous at the DNA level. It utilizes the supplied codon usage table to prefer codons that are more common in a particular organism.
For information on stretchAlign.pl, see below.
Note, the alignment step is a multiple sequence alignment, but the sequence construction step is pair-wise. Each sequence is produced multiple times, optimized for crossover for every other sequence.
To omit codons from incorporation in the resulting sequence, omit them from the codon usage file (or comment them out - refer to the format for the -c file below).
This script produces "Crossover Metrics" intended to be used to evaluate how good each pair of sequences is in terms of their potential to be involved in crossover events. Here is a listing of the columns in the Crossover Metrics table:
Source - The file the metrics are based on.
Pair - Sequential/arbitrary unique pair ID.
Aln Len - The length of the alignment produced.
Seqs - Sequence IDs.
%ID - % identity: identicals/non-gap-length.
%Aligned NTs - Non-gap aln length / aln length.
Contig Ident >= N - Sum of contiguous identical NTs >= N.
Rare Cdns(Usg<=N) - Number of codons incorporated whose usage is <= N.
Rarest Codon - Codon:usage. Rarest codon incorporated.
Frame Shift bases - Number of NTs aligned out of frame.
Seg Aln Score[0-1] - Segment alignment score, a value between 0 and 1 (inclusive).
The Seg Aln Score is only present if the -d option is used to align for example, similar protein structural domains with one another. It can also be used to force stretches of identity to align in pairs of sequences by supplying it a file generated by stretchAlign.pl (provided with this package)
To evaluate a sequence (nt alignment) produced by another tool, you can use the -e option. All this does is add entries to the Crossover Metrics table at the end of the run.
Requires the multiple sequence alignment tool called muscle, available here:
http://www.drive5.com/muscle/downloads.htm
After installing muscle, cd into the codonHomologizer directory and run the following:
perl Makefile.PL
make
sudo make install
codonHomologizer.pl -c "input file(s)" [options]
-c REQUIRED Codon usage file.
-i OPTIONAL [stdin if present] Amino acid sequence file.
--help OPTIONAL Print general info and file formats.
--extended OPTIONAL Print extended usage.
codonHomologizer.pl -c "input file(s)" [-i,-f,-o,-a,-s,--aa-aln-suffix,-x,-y,-z,-r,-w,-e,--stretch-reporting-min,--usage-reporting-max,-d,-n,--dir,--outfile,--help,--dry-run]
codonHomologizer.pl -c "input file(s)" [-i,-f,-o,-a,-s,--aa-aln-suffix,-x,-y,-z,-r,-w,-e,--stretch-reporting-min,--usage-reporting-max,-d,-n,--dir,--outfile,--help,--dry-run] < input_file
-c,--codon-file, REQUIRED Tab-delimited file containing the single
--codon-usage-file letter amino acid (AA) code, codon, and
[optional] usage score (see -n). This
script will only use the codons in this file
to construct sequences and for the matrix
used in the alignment step. All AAs must be
present. If usage is unimportant, make the
usage scores all the same (e.g. 1.0) or
supply -n to ignore the usage column.
Commented out lines in the file will be
ignored. See --help for more details.
-i,--aa-file,* OPTIONAL [stdin if present] Input file(s). Space
separated, globs OK (e.g. $flag "*.input
[A-Z].{?,??}.inp"). See --help for file
format. May be supplied on standard in.
When standard input detected and -i is given
only 1 argument, it will be used as a file
name stub for appending outfile suffixes.
See --extended --help for advanced usage
examples.
*No flag required.
-f,--aa-seq-format OPTIONAL [auto]{auto, fasta, fastq} Amino acid
sequence file format. Applies to files
submitted using -i and -e.
-o,--dna-suffix, OPTIONAL [.fna] The extension of the output DNA
--dna-extension sequence file that is constructed from and
appended to the input amino acid sequence
file (see -i).
-a,--align-codons OPTIONAL [1] Align the codons in the output DNA
sequence file to match the amino acid
alignment (see -o). This is the default
behavior. Use --no-align-codons to generate
sequences without gap characters.
-s,--matrix-suffix, OPTIONAL [none] The extension of the output protein
--matrix-extension weight matrix file that is appended to and
constructed from the codon usage file (see
-c). The generation of this file is not
required except for re-use and
reproducability. It can be re-used by
supplying it using the -w option.
--aa-aln-suffix OPTIONAL [none] The aligned amino acid sequence
generated by the alignment tool (-x/-y) can
be optionally saved to a file. While this
file is not necessary, it can be useful for
debugging. The end goal is a DNA sequence.
Note, the default outfile suffix for the DNA
output is '.fna'. If you want to output the
amino acid alignment, it is recommended to
use '.faa' to differentiate it from the DNA
output file.
-x,--alignment-tool OPTIONAL [muscle]{muscle, clustalw} The system
command line tool to be used in a system
call to align the amino acid sequences (see
-i). The path to the executable can be
provided by -y and the command line options
can be tweaked using -z.
-y, OPTIONAL [auto] The path to the executable selected
--alignment-exe-path by -x. The default is simply the name of
the executable selected by -x (e.g. 'muscle'
or 'clustalw'), assuming it's in your PATH.
-z, OPTIONAL [auto] Command line options provided to the
--alignment-opts-str sequence alignment tool specified by -y.
The default is determined by the selection
made with the -x option. For muscle, the
default options are [-gapopen -12 -gapextend
-3]. For clustalw, the default options are
[-pwgapopen -12 -pwgapext -3 -quiet]. Each
command has a series of prohibited options
that are controlled by the script.
Muscle prohibited options: [-in -out -clw
-fasta -html -phys -pyi -msf -clwstrict
-clwout -clwstrictout -msfout -htmlout
-physout -phyiout -matrix]
Clustalw prohibited options: [-infile
-pwmatrix -matrix -output -outfile -type
-interactive]
It is possible that other options other than
these that are prohibited could cause
problems, but the lists here represent only
those specific options added by this script
or those that directly conflict with those
added by this script.
-r, OPTIONAL [0] Treat the input AA sequence file as
--precomputed-aa- pre-aligned. If you already have an amino
alignment acid alignment and want to simply generate
nucleotide sequences from them that have as
much homology as possible, supply your AA
alignment file using -i and supply this
option to prevent the script from stripping
the alignment characters (e.g. gaps '-') and
re-aligning them.
-w, OPTIONAL If you do not want this script to construct
--precomputed- a matrix based on homology and codon usage
weight-matrix (using -c) to be used in aligning the amino
acid sequences, supply your preferred matrix
here. One reason to do this would be to
sacrifice DNA homology to get an alignment
that has more biological significance. The
matrix constructed by this script optimizes
for homology and ingores biological protein
similarity to promote as many crossovers as
possible in disparate sequences. A codon
usage table (-c) is still required in order
to construct sequences with optimized
homology while allowing homologous ties to
defer to the more used codons.
-e, OPTIONAL Aligned nucleotide sequence file. The
--evaluate-nt-aln- sequence generated by other tools can be
file,--nt-file supplied for evaluation using the Crossover
Metrics generated at the end of the script.
Only one such file can be supplied, but it
may contain any number of aligned sequences.
--stretch-reporting- OPTIONAL [5,11] After nucleotide sequence
min construction (see -o) or the submission of a
nucleotide alignment (see -e), metrics are
reported to gauge the effectiveness of the
alignment and codon usage to produce
sequences that are likely to be involved in
crossover events. Each possible pair of
sequences will report the number of
alignment positions involved in a contiguous
stretch of identity that is equal to or
larger than the number(s) supplied to this
option. For example, if this minimum value
is 5 and a pair of sequences each has an
aligned segment of ATTGCTA (and no other
stretches of identity that are 5 or longer),
the number reported for 'Contig Ident >= 5'
will be 7. The cutoff must be an unsigned
integer.
--usage-reporting- OPTIONAL [0.1] After nucleotide sequence construction
max (see -o) or the submission of a nucleotide
alignment (see -e), metrics are reported to
gauge the effectiveness of the alignment and
codon usage to produce sequences that are
likely to be involved in crossover events.
Generally, codons more highly used are
utilized, but lesser used codons can get
used if another codon pair between aligned
amino acids has more homology. Each
sequence will have reported the number of
rare codons that were included for each pair
of sequences. You can set the max cutoff
using this option. Multiple space-delimited
values can be supplied. Any unsigned number
is acceptable.
-d,--segment-file, OPTIONAL To align segments of each sequence (e.g.
--domain-file structural/secondary domains or any regions
that should be aligned together)
independently and stitch the global
alignment back together, you can provide a
segment file containing pairs of coordinates
that should be aligned sequentially within
each sequence.
A companion script (stretchAlign.pl) has
been provided in this package that can
generate a segment file that forces
codonHomologizer to align the largest number
of stretches of identity possible. It
generates a file for every pair of
sequences.
-n, OPTIONAL [0] If codon usage is not important to you,
--ignore-codon-usage you can ignore usage by supplying this
option. Any usage values in the codon usage
file (see -c) will be ignored and the weight
matrix used in the alignment step and the
sequence construction will only be based on
homology (and homology placement/flexibility)
. Note, the hierarchy of factors is:
homology, homology flexibility, and usage.
See --help --extended for more details.
--dir,--outdir OPTIONAL [none] All outfiles will be put in this
directory.
--outfile, OPTIONAL [stdout] Output file(s) - a named outfile
--output-file that is a mutually exclusive alternative to
supplying an outfile suffix. Space
separated.
--help OPTIONAL [1] Print general info and file format
descriptions. Includes advanced usage
examples with --extended.
--dry-run OPTIONAL Run without generating output files.
--verbose OPTIONAL Verbose mode/level. (e.g. --verbose 2)
--quiet OPTIONAL Quiet mode.
--overwrite OPTIONAL Overwrite existing output files. By
default, existing output files will not be
over-written. Supply twice to safely*
remove pre-existing output directories (see
--outdir). Mutually exclusive with
--skip-existing and --append.
*Will not remove a directory containing
manually touched files.
--skip-existing OPTIONAL Skip existing output files. Mutually
exclusive with --overwrite and --append.
--append OPTIONAL Append to existing output files. Mutually
exclusive with --overwrite and
--skip-existing.
--force OPTIONAL Prevent script-exit upon critical error and
continue processing. Supply twice to
additionally prevent skipping the processing
of input files that cause errors. Use this
option with extreme caution. This option
will not over-ride over-write protection.
See also --overwrite or --skip-existing.
--header,--noheader OPTIONAL [On] Print commented script version, date,
and command line call to each output file.
--debug OPTIONAL Debug mode/level. (e.g. --debug --debug)
Values less than 0 debug the template code
that was used to create this script.
--error-type-limit OPTIONAL [5] Limits each type of error/warning to
this number of outputs. Intended to
declutter output. Note, a summary of
warning/error types is printed when the
script finishes, if one occurred or if in
verbose mode. 0 = no limit. See also
--quiet.
--version OPTIONAL Print version info. Includes template
version with --extended.
--save-as-default OPTIONAL Save the command line arguments. Saved
defaults are printed at the bottom of this
usage output and used in every subsequent
call of this script. Supplying this flag
replaces current defaults with all options
that are provided with this flag. Values
are stored in [$defdir].
See --help --extended for changing script
behavior when no arguments are supplied on
the command line.
--pipeline-mode OPTIONAL Supply this flag to include the script name
in errors, warnings, and debug messages. If
not supplied, the script will try to
determine if it is running within a series
of piped commands or as a part of a parent
script.
--extended OPTIONAL Print extended usage/help/version/header
(and errors/warnings where noted). Supply
alone for extended usage. Includes extended
version in output file headers.
Incompatible with --noheader. See --help &
--version.
The amino acid sequence file can be in fasta or fastq format. There must be 2 or more sequences. This is an unaligned sequence file. Any alignment characters present such as gap characters ('-') or spaces will be removed. It may be upper or lower case. Codons for any ambiguous nucleotides will be generated as NNN.
A tab- or space- delimited file of codon usage scores. The first column is the single-character amino acid code, the second column is the codon (3 nucleotides), and the third column us the codon's usage score. The score can be any integer or fraction. The scores will be normalized, so the score column does not need to sum to any value. Additional columns are optional and ignored. Not all codons are required to be present. All amino acids and the stop character ('*') are required to be present. To exclude a codon from incorporation into all recoded sequences, you can either remove that row from the file or comment it out by inserting a '#' at the beginning of the line. Empty lines are permissible. Example:
G GGG 0.11 Gly
G GGA 0.21 Gly
G GGT 0.49 Gly
G GGC 0.19 Gly
E GAG 0.29 Glu
E GAA 0.71 Glu
D GAT 0.65 Asp
D GAC 0.35 Asp
...
The format of the weight matrix file is the same as is used by sequence alignment tools such as clustalw and muscle. There is a row and column for every single-letter aminco acid code and each cell contains a score representing how good of a match the pair of amino acids is (i.e. how similar they are). The last row and column (represented by '*') contains the minimum score. Examples of these matrices can be found (at the time of this writing) here:
ftp://ftp.ncbi.nih.gov/blast/matrices/
Example (BLOSUM62):
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
The aligned nucleotide sequence file can be in fasta or fastq format. There must be 2 or more aligned sequences of the same length and they must encode a protein (i.e. the first character must be in the first 5'->3' frame). It may be upper or lower case.
A segment file is a tab-delimited file consisting of a series of amino acid coordinates (starting from the first amino acid as coordinate 1) in each protein. Each row of the file is for a different sequence in the input sequence file (see -i). The first column is the sequence ID (which must match all the IDs in the sequence file from the first non-white-space character after the defline character to the next white-space character (or the end of the line)). E.g. a defline of ">id1 my protein" would match a value in the first column of the segment file of "id1". See the usage for option -d for details on how the coordinates are treated.
A segment file contains coordinate pairs for each protein that are aligned with one-another independently. There must be the same number of coordinates on each row. An even or odd number of coordinates are acceptable, but they are treated in pairs, meaning each odd and even columned coordinate is treated as an inclusive pair such that the AAs corresponding to the pair are aligned with that segment [inclusive]. In the event that there are an odd number of coordinate columns, the coordinate in the last column is paired with the last coordinate in the protein.
Gaps between pairs are also aligned in an exclusive coordinate fashion, but if one protein contains a gap and another has no gap between pairs, e.g. 1-10 and 5-15 (where no characters in the first sequence are aligned with the first 4 characters of the second sequence), gap characters are artificially inserted. If the first coordinate (1) or last coordinate of a protein are not included, the first and last segments are aligned such that the end coordinate is inclusive and the inner coordinate is exclusive.
Example:
Mj 24 119 160 256 352 482 487 711
Pf 1 100 152 269 360 544 552 769
Sp 1 148 220 342 405 488 500 796
Hs 38 176 230 371 419 514 516 817
Coordinates aligned:
Mj 1-23 24-119 120-159 160-256 257-351 352-482 483-486 487-711 712-end
Pf * 1-100 101-151 152-269 270-359 360-544 545-551 552-769 770-end
Sp * 1-148 149-219 220-342 343-404 405-488 489-499 500-796 797-end
Hs 1-37 38-176 177-229 230-371 372-418 419-514 515-515 516-817 818-end
- Sequence excluded from segment alignment and filled in with gap characters.
If a coordinate is larger than the protein provided, an error or warning will be printed. If the second coordinate in the last pair is larger than the protein, it will be automatically adjusted to the last possible protein coordinate with a warning. Any other coordinate out of range will generate a fatal error.
Coordinates on each row must be in sequential ascending order.
Fasta sequence file.
The format of the weight matrix file is the same as is used by sequence alignment tools such as clustalw and muscle. There is a row and column for every single-letter aminco acid code and each cell contains a score representing how good of a match the pair of amino acids is (i.e. how similar they are). The last row and column (represented by '*') contains the minimum score. Examples of these matrices can be found (at the time of this writing) here:
ftp://ftp.ncbi.nih.gov/blast/matrices/
Example (BLOSUM62):
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
* OUTPUT FORMAT: --aa-aln-suffix
The format output by the tool selected by -x/-y.
This script, given a set of very dissimilar amino acid sequences, will locate all (sequentially ordered) stretches of identical residues between every possible pair and output a table of coordinates. The coordinate table can then be supplied to the companion script "codonHomologizer.pl" to force the alignment to align these stretches of identity together. It will preferentially select a combination of stretches that will create the fewest gaps in the final alignment produced by codonHomologizer (ignoring end gaps).
A word of caution: This is an exhaustive & recursive search. There are combinations of parameters that can yield very long runtimes. The search can be very fast by narrowing the relative coordinate window in which to search for identity. A minimum of 3 idential AA residues is enforced, but you can use --force to over-ride the restriction.
This script finds identical stretches of the requested length only, however if a stretch of identity is some multiple of the requested stretch size, the resulting coordinates in the output table will be merged.
A hash is used in the search. The search is case-insensitive.
This script was created for very dissimilar sequences because existing alignment tools do not align to optimize stretches of identity, yet stretches of identity are required for crossover events to occur.^ If your sequences are very similar, this script is unnecessary.
^ https://www.ncbi.nlm.nih.gov/pubmed/27671930
stretchAlign.pl -i "input file(s)" [options]
-i REQUIRED [stdin if present] Amino acid sequence file.
-s OPTIONAL [3] Stretch size.
--help OPTIONAL Print general info and file formats.
--extended OPTIONAL Print extended usage.
stretchAlign.pl -i "input file(s)" [-s,-g,-l,-r,-m,-t,-p,-f,-o,-x,-e,--dir,--outfile,--help,--dry-run]
stretchAlign.pl -i "input file(s)" [-s,-g,-l,-r,-m,-t,-p,-f,-o,-x,-e,--dir,--outfile,--help,--dry-run] < input_file
-i,--aa-file,* REQUIRED [stdin if present] Input file(s). Space
separated, globs OK (e.g. $flag "*.input
[A-Z].{?,??}.inp"). See --help for file
format. May be supplied on standard in.
When standard input detected and -i is given
only 1 argument, it will be used as a file
name stub for appending outfile suffixes.
See --extended --help for advanced usage
examples.
*No flag required.
-s,--stretch-size OPTIONAL [3] This is the length with which to look
for perfect matches between pairs of
sequences. Only a series of matches of this
length (or multiples of this length) will be
found. The script must be re-run to look
for multiple lengths.
-g, OPTIONAL [auto*] Matches between sequences are only
--offset-max-global found if they occur within this relative
coordinate range. E.g. If this value is set
to N, a match at coordinate 100 in sequence
1 will only be reported if it occurs
(/starts) in sequence 2 between coordinates
100-N and 100+N.However, matches are forced
to be sequential. In other words, once a
match is located, the next match is
restricted to everything to the right in
each sequence. Must be greater than or
equal to --offset-max-local. *If the
difference between lengths of the sequences
is greater than [50], the default range is
the lesser of either the sequence size
difference or [100].
-l, OPTIONAL [auto*] Each sequential match between
--offset-max-local sequences are only allowed to be found if
they occur within this coordinate range
relative to the positions of the previous
match. E.g. If this value is set to N and
the previous match was at positions 100 and
200 for sequences 1 and 2 respectively, a
match at coordinate 1000 in sequence 1 will
only be reported if it occurs (/starts) in
sequence 2 between coordinates 1100-N and
1100+N.However, matches are forced to be
sequential. In other words, once a match is
located, the next match is restricted to
everything to the right in each sequence.
Must be less than or equal to
--offset-max-global. *The default value is
the value of --offset-max-global.
-r, OPTIONAL [1] The maximum proportion of anticipated
--gap-ratio-max- gaps to be created between a pair of
local stretches (see -s). In other words, this is
the difference in size of the sequences
between a pair of stretches over the longer
of the two. A value between 0 and 1 is
recommended, but it can be greater than 1.
See -m for an alternative gap control.
-m,--gap-max OPTIONAL [infinity] Stretch positions are restricted
to a range (see -g and -l), but many gaps
can be created in the resulting global
alignment after submitting to
codonHomologizer. While this maximum cannot
control preceisely how many gaps will be
used in the alignment, it can ensure that
the sum total of length differences between
the stretches (see -s) found is kept under
this maximum.
-t,--time-limit OPTIONAL [60] Report the best solution found for a
pair of sequences after searching for this
number of seconds. Every possible pair will
get this many seconds to find the largest
number of stretches that meet all the
requirements. See -p to reset this time
restriction upon finding a better solution.
Set to 0 to search all possible solutions
with no time limit.
-p, OPTIONAL [1] As long as the solution keeps improving,
--improvement-reset reset the --time-limit. Only finding
solutions with more stretches (see -s) will
reset the timer. Improved solutions with
fewer alignment gaps, but the same number of
stretches will not reset the timer.
-f,--aa-seq-format OPTIONAL [auto]{auto, fasta, fastq} Amino acid
sequence file format. Applies to files
submitted using -i.
-o,--outfile-suffix OPTIONAL [.seg] The extension of the output
tab-delimited coordinate file of matching
stretches (to be supplied to
codonHomologizer.pl using the -d option).
An outfile for every pair of sequences in -i
will be created using a numeric ID per pair
and prepended to this suffix (e.g. .1.seg,
.2.seg,...).
-x,--aa-pair-suffix OPTIONAL [.faa] Every pair of sequences used to find
matching stretches (see -i and -s) generates
a stretch coordinate file (see -o). If
there are more than 2 sequences in the
supplied sequence file (see -i), a
corresponding fasta file (per pair) will be
generated with this extension in order to be
supplied with the stretch coordinate file to
codonHomologizer using codonHomologizer's -i
option. An outfile for every pair of
sequences will be created using a numeric ID
per pair and prepended to this suffix (e.g.
.1$aa_suffix, .2$aa_suffix,...). The pair
ID will correspond to the pair ID of the
--outfile-suffix.
-e, OPTIONAL [0] Pad the coordinates of the found
--expand-coords-by stretches by this amount. This can help
coerce the alignment algorithm used in
codonHomologizer to abutt non-matching amino
acids with the stretch so that selected
codons could potentially lengthen the
stretch of homology by a nucleotide or 2.
--dir,--outdir OPTIONAL [none] All outfiles will be put in this
directory.
--outfile, OPTIONAL [stdout] Output file(s) - a named outfile
--output-file that is a mutually exclusive alternative to
supplying an outfile suffix. Space
separated.
--help OPTIONAL [1] Print general info and file format
descriptions. Includes advanced usage
examples with --extended.
--dry-run OPTIONAL Run without generating output files.
--verbose OPTIONAL Verbose mode/level. (e.g. --verbose 2)
--quiet OPTIONAL Quiet mode.
--overwrite OPTIONAL Overwrite existing output files. By
default, existing output files will not be
over-written. Supply twice to safely*
remove pre-existing output directories (see
--outdir). Mutually exclusive with
--skip-existing and --append.
*Will not remove a directory containing
manually touched files.
--skip-existing OPTIONAL Skip existing output files. Mutually
exclusive with --overwrite and --append.
--append OPTIONAL Append to existing output files. Mutually
exclusive with --overwrite and
--skip-existing.
--force OPTIONAL Prevent script-exit upon critical error and
continue processing. Supply twice to
additionally prevent skipping the processing
of input files that cause errors. Use this
option with extreme caution. This option
will not over-ride over-write protection.
See also --overwrite or --skip-existing.
--header,--noheader OPTIONAL [On] Print commented script version, date,
and command line call to each output file.
--debug OPTIONAL Debug mode/level. (e.g. --debug --debug)
Values less than 0 debug the template code
that was used to create this script.
--error-type-limit OPTIONAL [5] Limits each type of error/warning to
this number of outputs. Intended to
declutter output. Note, a summary of
warning/error types is printed when the
script finishes, if one occurred or if in
verbose mode. 0 = no limit. See also
--quiet.
--version OPTIONAL Print version info. Includes template
version with --extended.
--save-as-default OPTIONAL Save the command line arguments. Saved
defaults are printed at the bottom of this
usage output and used in every subsequent
call of this script. Supplying this flag
replaces current defaults with all options
that are provided with this flag. Values
are stored in [$defdir].
See --help --extended for changing script
behavior when no arguments are supplied on
the command line.
--pipeline-mode OPTIONAL Supply this flag to include the script name
in errors, warnings, and debug messages. If
not supplied, the script will try to
determine if it is running within a series
of piped commands or as a part of a parent
script.
--extended OPTIONAL Print extended usage/help/version/header
(and errors/warnings where noted). Supply
alone for extended usage. Includes extended
version in output file headers.
Incompatible with --noheader. See --help &
--version.
The amino acid sequence file can be in fasta or fastq format. There must be 2 or more sequences. This is an unaligned sequence file. Any alignment characters present such as gap characters ('-') or spaces will be removed. It may be upper or lower case.
Tab-delimited text file. The first column is the sequence ID, as parsed from the deflines of -i.
Fasta format.