forked from freebayes/freebayes
-
Notifications
You must be signed in to change notification settings - Fork 0
Bayesian haplotype-based polymorphism discovery and genotyping.
License
angadps/freebayes
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
== FREEBAYES == Overview: FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. It uses short-read alignments (BAM files with Phred+33 encoded quality scores) for any number of individuals from a population and a reference genome to determine the most-likely combination of genotypes for the population at each position in a reference genome (FASTA). It reports positions which it finds to be more likely polymorphic than monomorphic in a standard variant interchange format (VCF). It can also use an input set of variants (VCF) as a source of prior information, and a copy number variant map (BED) to define non-uniform ploidy variation across the samples under analysis. Documentation and citation: See http://arxiv.org/abs/1207.3907 for an overview of the statistical models used in FreeBayes. We ask that you cite this paper if you use FreeBayes in work that leads to publication. Obtaining: To download FreeBayes, please use git to download the most recent development tree. Currently, the tree is hosted on github, and can be obtained via: % git clone --recursive git://github.com/ekg/freebayes.git Note the use of --recursive. This is required, as the project contains some nested git submodules for external repositories. If you encounter issues with the development HEAD, or simply wish to obtain the most recent stable revision (0.9.9) then use: % git checkout c993c5c07e7673 Compilation: FreeBayes requires g++ and the standard C and C++ development libraries. Additionally, cmake is required for building the BamTools API. % make && sudo make install Will build and install the executable freebayes to /usr/local/bin, as well as the utilities bamfiltertech and bamleftalign. Usage: In its simplest operation, it requires only two inputs: a FASTA reference sequence, and a BAM-format alignment file sorted by reference position. For instance: % freebayes --fasta-reference h.sapiens.fasta NA20504.bam Will produce a VCF (Variant Call Format [1]) file on standard out describing all SNPs, INDELs, MNPs, and Complex events between the reference and the alignments in NA20504.bam. In order to produce correct output, the reference supplied must be the reference to which NA20504.bam was aligned. Users may specify any number of BAM files on the command line. FreeBayes uses the BamTools API [2] to open and parse these files in parallel, virtually merging them at runtime into one logical file with a merged header. (When provided multiple input files, the input to freebayes is the same as the output of the bamtools merge utility.) For a description of available command-line options and their defaults, run: % freebayes --help Calling variants: FreeBayes a standard VCF 4.1 outut stream. This format is designed for the probabilistic description of allelic variants within a population of samples, but it is equally suited to describing the probability of variation in a single sample. Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth. By default, records are output even if they have very low probability of variation, in expectation that the VCF will be filtered using tools such as vcffilter in vcflib [6]. Calling variants in a population: FreeBayes is designed to be run on many individuals from the same population (e.g. many human samples) simultaneously. The algorithm exploits a neutral model of evolution and allele diffusion to impute most-confident genotypings across the entire population. In practice, the quality and confidence in the callset will increase if you run multiple samples simultaneously. If your study has multiple individuals, you should run freebayes against them at the same time. To call variants in a population of samples, each alignment must have a read group identifier attached to it (RG tag), and the header of the BAM file in which it resides must map the RG tags to sample names (SM). Furthermore, read group IDs must be unique across all the files used in the analysis. One read group cannot map to multiple samples. The reason this is required is that freebayes operates on a virtually merged BAM stream provided by the BamTools API. If merging the files in your analysis using bamtools merge would generate a file in which multiple samples map to the same RG, the files are not suitable for use in population calling, and they must be modified. Users may add RG tags to BAM files which were generated without this information by using bamaddrg [5]. If you have many files corresponding to many individuals, add a unique read group and sample name to each, and then open them all simultaneously with freebayes. The VCF output will have one column per sample in the input. Input filters: FreeBayes filters its input so as to ignore low-confidence alignments and alleles which are only supported by low-quality sequencing observations (see --min-mapping-quality and --min-base-quality). It also will only evaluate a position if at least one read has mapping quality of --min-supporting-mapping-quality and one allele has quality of at least --min-supporting-base-quality. All these quality filters are set to sensible defaults, but may be turned off by specifying --no-filters on the command line, Reads with more than a fixed number of high-quality mismatches can be excluded by specifying --read-mismatch-limit. As a guard against spurious variation caused by sequencing artifacts, positions are skipped when no more than --min-alternate-count or --min-alternate-fraction non-clonal observations of an alternate are found in one sample. In data with high rates of insertion and deletion errors, you can use --indel-exclusion-window to exclude bases from reads within a number of bases of a putative indel allele. (This behavior is incompatible with indel detection.) Stream processing: FreeBayes can read BAM from standard input (--stdin) instead of directly from files. This allows the application of any number of streaming BAM filters and calibrators to its input. Two filters are of particular interest: 1) base alignment quality (BAQ) adjustment, a quality adjustment which applies a hidden markov model of read genesis to each alignment independently. This is currently implemented by samtools calmd. (See Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids by Durbin et. al. for more details.) 2) read-independent left realignment of indels. Aligners may position gaps in reads inconsistently despite the fact that the indels represent identical variation. By realigning insertions and deletions as far left as they will go without introducing mismatches between read placement and reference, we can homogenize the input to deal with the most common classes of indels. For example, you can apply BAQ adjustment region 10:6000..7000 to a set of BAM files using this pattern: % bamtools filter -region 10:6000..7000 -in NA20504.bam -in NA20507.bam \ | samtools calmd -EAru - reference.fasta \ | freebayes --stdin -f reference.fasta Using this pattern, you can filter out reads with certain criteria using bamtools filter without having to modify the input BAM file. You can also use the bamtools API to write your own custom filters in C++. An example filter is bamfiltertech (src/bamfiltertech.cpp), which is provided here to filter out technologies which have characteristic errors which may frustrate certain types of variant detection. INDELs: FreeBayes has been tested as an indel caller as part of the 1000 Genomes Project. In principle, any gapped aligner which is sensitive to indels will produce satisfactory results. Indels are called by default, but they may be ignored by using the --no-indels flag. Due to potential ambiguity, indels are not parsed when they overlap the beginning or end of alignment boundaries. When calling indels, it is important to homogenize the positional distribution of insertions and deletions in the input by using left realignment. This can be done by using the --left-align-indels flag, or in a streaming fashion as in this indel detection example: % bamtools merge -region 10:6000..7000 -in NA20504.bam -in NA20507.bam \ | bamleftalign -f reference.fasta \ | samtools fillmd -Aru - reference.fasta \ | freebayes --stdin \ --region 10:6000..7000 \ -f reference.fasta (Note that BAQ is applied after realignment. Also note the --indels flag, which is required for indel detection. Also, when supplied a --region specifier but reading from stdin, freebayes will not report sites outside of the target region, which may occur as bamtools merge emits all reads which are partially overlapping the target region.) Left realignment will place all indels in homopolymer and microsatellite repeats at the same position, provided that doing so does not introduce mismatches between the read and reference other than the indel. Indel detection using left realignment is not perfect, as some classes of indels, such as non-homologous insertions in repetitive sequence, are not presently handled. However, this method computationally inexpensive and handles the most common classes of alignment inconsistency. Pooled datasets: While FreeBayes always 'pools' sequence information from many individuals to improve the accuracy of small variant calls, it can also operate on 'pooled sequencing' datasets, in which DNA from many individuals is simultaneously sequenced and the original sample which originates a specific observation cannot be determined. To run on pooled data, set the --pooled flag (which turns off the prior component derived from the probability of a specific distribution of heterozygotes and homozygotes given the allele frequency), and set --ploidy to the number of total copies of the genome in each pooled sample. For example, if you have 10 individuals in a pool, and each individual is diploid, you would set --ploidy 20. Users should be aware that the current implementation may have poor performance with ploidy set very high (e.g. 40). Input variants: FreeBayes can use a BGZIP-compressed and tabix-indexed [7] VCF file as input, specified using the --variant-input parameter. Alleles, their frequencies, and genotype likelihoods can all be used as out-of-band input to the algorithm. This allows the use of information from population-specific panels during the detection of variants within a single individual. Directed genotyping can be enabled with the addition of the --use-only-input-alleles flag. When this flag is set, the model will only evaluate the input alleles, and the output VCF will contain exactly the same set of alleles that were input and genotypes for all the samples in the input alignments. The input data is handled differently depending on what information is included in the VCF. There are three modes that are enabled on the basis of the data available in the input VCF: 1) Sites and alleles only: If the input file has only a set of sites and alleles, the input variants serve as a set of hints which indicate that a variant may be known or expected at a given loci. An allele specified in the input will be considered in the Bayesian model even if there is not enough read evidence to pass the algorithm's input filters. However, the output will only contain those alleles which have enough support in the data to pass the --pvar threshold and, 2) Allele frequencies: If the input file has allele frequency information via the AF INFO field tag, then the allele frequency of each allele will be used as a prior in the Bayesian model. or 3) Samples and genotype likelihoods: If samples are provided, then their listed genotype likelihoods (GL) are used and the samples are incorporated into the model. This allows the imputation of a single, or small set of samples against a much larger panel without requiring the use of the full alignment data from that panel. Mode (1) is required for (2) or (3), but (3) supercedes (2). To enable (1) or (2), remove all of the sample-specific columns from the VCF. Bugs: Please report bugs using the built-in bug reporting feature in github or by sending the author an email. Mailing list: A summary of each commit will be posted to [email protected]. Please use this mailing list for public discussion of freebayes and variant detection issues. If you have a general question, this is the best place to pose it, as it may be helpful to other users. Author: Erik Garrison <[email protected]> Marth Lab [3], Boston College License: MIT References: [1] http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0 [2] http://sourceforge.net/projects/bamtools/ [3] http://bioinformatics.bc.edu/marthlab/Main_Page [4] http://bioinformatics.bc.edu/marthlab/Mosaik [5] https://github.com/ekg/bamaddrg [6] https://github.com/ekg/vcflib [7] http://samtools.sourceforge.net/tabix.shtml
About
Bayesian haplotype-based polymorphism discovery and genotyping.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published