Mapper: a fast, accurate aligner for genomic sequences
Latest release version can be downloaded here: https://github.com/mathjeff/Mapper/releases/download/1.1.0-beta12/mapper-1.1.0-beta12.jar
Contact:
Dr. Anni Zhang, MIT, [email protected]
Usage: java -jar mapper.jar [--out-vcf <out.vcf>] [--out-sam <out.sam>] [--out-refs-map-count <counts.txt>] [--out-unaligned <unaligned.fastq>] --reference <ref.fasta> --queries <queries.fastq> [options]
java -jar mapper.jar [--out-vcf <out.vcf>] [--out-sam <out.sam>] [--out-refs-map-count <counts.txt>] [--out-unaligned <unaligned.fastq>] --reference <ref.fasta> --paired-queries [--spacing ]<queries.fastq> <queries2.fastq> [options]
Aligns genomic sequences quickly and accurately using relatively high amounts of memory
INPUT:
--reference <file> the reference to use. Should be in .fasta/.fa/.fna or .fastq/.fq format or a .gz of one of those formats.
May be specified multiple times for multiple reference genomes
--queries <file> the reads to align to the reference. Should be in .fastq or .fasta format.
May be specified multiple times for multiple query files
--paired-queries <file1> <file2> [--spacing <mean> <distancePerPenalty>] Illumina-style paired-end reads to align to the reference. For each read <r1> in <file1> and the corresponding <r2> in <file2>, the alignment position of <r1> is expected to be slightly before the alignment position of the reverse complement of <r2>.
Each file should be in .fastq/.fasta format
May be specified multiple times for multiple query files
--spacing <expected> <distancePerPenalty> (default: 100 50)
Any query alignment whose inner distance deviates from <expected> has an additional penalty added.
That additional penalty equals (the difference between the actual distance and <expected>) divided by <distancePerPenalty>, unless the two query sequence alignments would overlap, in which case the additional penalty is 0.
--infer-ancestors
Requests that Mapper look for parts of the genome that likely shared a common ancestor in the past, and will lower the penalty of an alignment that mismatches the given reference but matches the inferred common ancestor.
--no-infer-ancestors
Disables ancestor inference.
--split-queries-past-size <size> Any queries longer than <size> will be split into smaller queries.
THIS OPTION IS A TEMPORARY EXPERIMENT FOR LONG READS TO DETECT REARRANGEMENTS AND IMPROVE PERFORMANCE.
--no-gapmers
When Mapper attempts to identify locations at which the query might align to the reference, Mapper first splits the query into smaller pieces and looks for an exact match for each piece.
By default, these pieces contain noncontiguous basepairs and might look like XXXXXXXX____XXXX.
This flag makes these pieces be contiguous instead to look more like XXXXXXXXXXXX.
THIS OPTION IS A TEMPORARY EXPERIMENT FOR TESTING THE PERFORMANCE OF GAPMERS.
OUTPUT FORMATS:
Summary by position
--out-vcf <file> output file to generate containing a description of mutation counts by position
--vcf-exclude-non-mutations if set, the output vcf file will exclude positions where no mutations were detected
--distinguish-query-ends <fraction> (default 0.1) In the output vcf file, we separately display which queries aligned at each position with <fraction> of the end of the query and which didn't.
Summary by genome
--out-refs-map-count <file> the output file to summarize the number of reads mapped to each combination of references
Raw output
--out-sam <file> the output file in SAM format
If <file> is '-', the SAM output will be written to stdout instead
--out-unaligned <file> output file containing unaligned reads. Must have a .fasta or .fastq extension
--no-output if no output is requested, skip writing output rather than throwing an error
Debugging
-v, --verbose output diagnostic information
--verbose-alignment output verbose information about the alignment process
This is more than what is output by --verbose
--verbose-reference output verbose information about the process of analyzing the reference
-vv implies --verbose-alignment and --verbose-reference
--verbosity-auto set verbosity flags dynamically and automatically based on the data and alignment
--out-ancestor <file> file to write our inferred ancestors. See also --no-infer-ancestors
Multiple output formats may be specified during a single run; for example:
--out-sam out.sam --out-unaligned out.fastq
ALIGNMENT PARAMETERS:
--max-penalty <fraction> (default 0.1) for a match to be reported, its penalty must be no larger than this value times its length
Setting this closer to 0 will run more quickly
--max-penalty-span <extraPenalty> (default --snp-penalty / 2) After Mapper finds an alignment having a certain penalty, Mapper will also look for and report alignments having penalties no more than <extraPenalty> additional penalty.
To only report alignments having the minimum penalty, set this to 0.
Computing the penalty of a match:
--snp-penalty <penalty> (default 1) the penalty of a point mutation
--ambiguity-penalty <penalty> (default --max-penalty) the penalty of a fully ambiguous position.
This penalty is applied in the case of an unknown basepair in the reference ('N').
--new-indel-penalty <penalty> (default 1.5) the penalty of a new insertion or deletion of length 0
--extend-indel-penalty <penalty> (default 0.5) the penalty of an extension to an existing insertion or deletion
--additional-extend-insertion-penalty <penalty> (default <ambiguity penalty>) additional penalty of an extension to an existing insertion
--max-num-matches <count> (default unlimited) the maximum number of positions on the reference that any query may match.
Any query that appears to match more than this many positions on the reference will be reported as unmatched.
OTHER:
Memory usage: to control the amount of memory that Java makes available to Mapper, give the appropriate arguments to Java:
-Xmx<amount> set <amount> as the maximum amount of memory to use.
-Xms<amount> set <amount> as the initial amount of memory to use.
For example, to start with 200 megabytes and increase up to 4 gigabytes as needed, do this
java -Xms200m -Xmx4g -jar mapper.jar <other mapper arguments>
--num-threads <count> number of threads to use at once for processing. Higher values will run more quickly on a system that has that many CPUs available.
To make changes to Mapper, see DEVELOPING.md