Skip to content

angadps/freebayes

 
 

Repository files navigation

== FREEBAYES ==

Overview:

FreeBayes is a Bayesian genetic variant detector designed to find small
polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels
(insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex
events (composite insertion and substitution events) smaller than the length of
a short-read sequencing alignment.  It uses short-read alignments (BAM files
with Phred+33 encoded quality scores) for any number of individuals from a
population and a reference genome to determine the most-likely combination of
genotypes for the population at each position in a reference genome (FASTA).  It
reports positions which it finds to be more likely polymorphic than monomorphic
in a standard variant interchange format (VCF).  It can also use an input set of
variants (VCF) as a source of prior information, and a copy number variant map
(BED) to define non-uniform ploidy variation across the samples under analysis.


Documentation and citation:

See http://arxiv.org/abs/1207.3907 for an overview of the statistical models
used in FreeBayes.  We ask that you cite this paper if you use FreeBayes in
work that leads to publication.


Obtaining:

To download FreeBayes, please use git to download the most recent development
tree.  Currently, the tree is hosted on github, and can be obtained via:

    % git clone --recursive git://github.com/ekg/freebayes.git

Note the use of --recursive.  This is required, as the project contains some
nested git submodules for external repositories.

If you encounter issues with the development HEAD, or simply wish to obtain the
most recent stable revision (0.9.9) then use:

    % git checkout c993c5c07e7673

Compilation:

FreeBayes requires g++ and the standard C and C++ development libraries.
Additionally, cmake is required for building the BamTools API.

    % make && sudo make install

Will build and install the executable freebayes to /usr/local/bin, as well as
the utilities bamfiltertech and bamleftalign.


Usage:

In its simplest operation, it requires only two inputs: a FASTA reference
sequence, and a BAM-format alignment file sorted by reference position.  For
instance:

    % freebayes --fasta-reference h.sapiens.fasta NA20504.bam

Will produce a VCF (Variant Call Format [1]) file on standard out describing
all SNPs, INDELs, MNPs, and Complex events between the reference and the
alignments in NA20504.bam.  In order to produce correct output, the reference
supplied must be the reference to which NA20504.bam was aligned.

Users may specify any number of BAM files on the command line.  FreeBayes uses
the BamTools API [2] to open and parse these files in parallel, virtually
merging them at runtime into one logical file with a merged header.  (When
provided multiple input files, the input to freebayes is the same as the output
of the bamtools merge utility.)

For a description of available command-line options and their defaults, run:

    % freebayes --help


Calling variants:

FreeBayes a standard VCF 4.1 outut stream.  This format is designed for the
probabilistic description of allelic variants within a population of samples,
but it is equally suited to describing the probability of variation in a single
sample.

Of primary interest to most users is the QUAL field, which estimates the
probability that there is a polymorphism at the loci described by the record.
In freebayes, this value can be understood as 1 - P(locus is homozygous given
the data).  It is recommended that users use this value to filter their
results, rather than accepting anything output by freebayes as ground truth.

By default, records are output even if they have very low probability of
variation, in expectation that the VCF will be filtered using tools such as
vcffilter in vcflib [6].


Calling variants in a population:

FreeBayes is designed to be run on many individuals from the same population
(e.g. many human samples) simultaneously.  The algorithm exploits a neutral
model of evolution and allele diffusion to impute most-confident genotypings
across the entire population.  In practice, the quality and confidence in the
callset will increase if you run multiple samples simultaneously.  If your
study has multiple individuals, you should run freebayes against them at the
same time.

To call variants in a population of samples, each alignment must have a read
group identifier attached to it (RG tag), and the header of the BAM file in
which it resides must map the RG tags to sample names (SM).  Furthermore, read
group IDs must be unique across all the files used in the analysis.  One read
group cannot map to multiple samples.  The reason this is required is that
freebayes operates on a virtually merged BAM stream provided by the BamTools
API.  If merging the files in your analysis using bamtools merge would generate
a file in which multiple samples map to the same RG, the files are not suitable
for use in population calling, and they must be modified.

Users may add RG tags to BAM files which were generated without this
information by using bamaddrg [5].  If you have many files corresponding to
many individuals, add a unique read group and sample name to each, and then
open them all simultaneously with freebayes.  The VCF output will have one
column per sample in the input.


Input filters:

FreeBayes filters its input so as to ignore low-confidence alignments and
alleles which are only supported by low-quality sequencing observations (see
--min-mapping-quality and --min-base-quality).  It also will only evaluate a
position if at least one read has mapping quality of
--min-supporting-mapping-quality and one allele has quality of at least
--min-supporting-base-quality.  All these quality filters are set to sensible
defaults, but may be turned off by specifying --no-filters on the command line,

Reads with more than a fixed number of high-quality mismatches can be excluded
by specifying --read-mismatch-limit.

As a guard against spurious variation caused by sequencing artifacts, positions
are skipped when no more than --min-alternate-count or --min-alternate-fraction
non-clonal observations of an alternate are found in one sample.

In data with high rates of insertion and deletion errors, you can use
--indel-exclusion-window to exclude bases from reads within a number of bases
of a putative indel allele.  (This behavior is incompatible with indel
detection.)


Stream processing:

FreeBayes can read BAM from standard input (--stdin) instead of directly from
files.  This allows the application of any number of streaming BAM filters and
calibrators to its input.  Two filters are of particular interest:

    1) base alignment quality (BAQ) adjustment, a quality adjustment
    which applies a hidden markov model of read genesis to each alignment
    independently.  This is currently implemented by samtools calmd.  (See
    Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic
    Acids by Durbin et. al. for more details.)

    2) read-independent left realignment of indels.  Aligners may position gaps
    in reads inconsistently despite the fact that the indels represent identical
    variation.  By realigning insertions and deletions as far left as they will go
    without introducing mismatches between read placement and reference, we can
    homogenize the input to deal with the most common classes of indels.

For example, you can apply BAQ adjustment region 10:6000..7000 to a set of BAM
files using this pattern:

    % bamtools filter -region 10:6000..7000 -in NA20504.bam -in NA20507.bam \
       | samtools calmd -EAru - reference.fasta \
       | freebayes --stdin -f reference.fasta

Using this pattern, you can filter out reads with certain criteria using
bamtools filter without having to modify the input BAM file.  You can also use
the bamtools API to write your own custom filters in C++.  An example filter is
bamfiltertech (src/bamfiltertech.cpp), which is provided here to filter out
technologies which have characteristic errors which may frustrate certain types
of variant detection.


INDELs:

FreeBayes has been tested as an indel caller as part of the 1000 Genomes
Project.  In principle, any gapped aligner which is sensitive to indels will
produce satisfactory results.  Indels are called by default, but they may be
ignored by using the --no-indels flag.  Due to potential ambiguity, indels are
not parsed when they overlap the beginning or end of alignment boundaries.

When calling indels, it is important to homogenize the positional distribution
of insertions and deletions in the input by using left realignment.  This can
be done by using the --left-align-indels flag, or in a streaming fashion as in
this indel detection example:

    % bamtools merge -region 10:6000..7000 -in NA20504.bam -in NA20507.bam \
       | bamleftalign -f reference.fasta \
       | samtools fillmd -Aru - reference.fasta \
       | freebayes --stdin \
                   --region 10:6000..7000 \
                   -f reference.fasta

(Note that BAQ is applied after realignment.  Also note the --indels flag,
which is required for indel detection.  Also, when supplied a --region
specifier but reading from stdin, freebayes will not report sites outside of
the target region, which may occur as bamtools merge emits all reads which are
partially overlapping the target region.)

Left realignment will place all indels in homopolymer and microsatellite
repeats at the same position, provided that doing so does not introduce
mismatches between the read and reference other than the indel.  Indel
detection using left realignment is not perfect, as some classes of indels,
such as non-homologous insertions in repetitive sequence, are not presently
handled.  However, this method computationally inexpensive and handles the most
common classes of alignment inconsistency.


Pooled datasets:

While FreeBayes always 'pools' sequence information from many individuals to
improve the accuracy of small variant calls, it can also operate on 'pooled
sequencing' datasets, in which DNA from many individuals is simultaneously
sequenced and the original sample which originates a specific observation
cannot be determined.  To run on pooled data, set the --pooled flag (which
turns off the prior component derived from the probability of a specific
distribution of heterozygotes and homozygotes given the allele frequency), and
set --ploidy to the number of total copies of the genome in each pooled sample.
For example, if you have 10 individuals in a pool, and each individual is
diploid, you would set --ploidy 20.  Users should be aware that the current
implementation may have poor performance with ploidy set very high (e.g. 40).


Input variants:

FreeBayes can use a BGZIP-compressed and tabix-indexed [7] VCF file as input,
specified using the --variant-input parameter.  Alleles, their frequencies, and
genotype likelihoods can all be used as out-of-band input to the algorithm.
This allows the use of information from population-specific panels during the
detection of variants within a single individual.

Directed genotyping can be enabled with the addition of the
--use-only-input-alleles flag.  When this flag is set, the model will only
evaluate the input alleles, and the output VCF will contain exactly the same set
of alleles that were input and genotypes for all the samples in the input
alignments.

The input data is handled differently depending on what information is included
in the VCF.  There are three modes that are enabled on the basis of the data
available in the input VCF:

   1) Sites and alleles only: If the input file has only a set of sites and
   alleles, the input variants serve as a set of hints which indicate that a
   variant may be known or expected at a given loci.  An allele specified in the
   input will be considered in the Bayesian model even if there is not enough
   read evidence to pass the algorithm's input filters.  However, the output
   will only contain those alleles which have enough support in the data to pass
   the --pvar threshold

   and,

   2) Allele frequencies: If the input file has allele frequency information via
   the AF INFO field tag, then the allele frequency of each allele will be
   used as a prior in the Bayesian model.

   or

   3) Samples and genotype likelihoods: If samples are provided, then their
   listed genotype likelihoods (GL) are used and the samples are incorporated
   into the model.  This allows the imputation of a single, or small set of
   samples against a much larger panel without requiring the use of the full
   alignment data from that panel.

Mode (1) is required for (2) or (3), but (3) supercedes (2).  To enable (1) or
(2), remove all of the sample-specific columns from the VCF.


Bugs:

Please report bugs using the built-in bug reporting feature in github or by
sending the author an email.


Mailing list:

A summary of each commit will be posted to [email protected].  Please
use this mailing list for public discussion of freebayes and variant detection
issues.  If you have a general question, this is the best place to pose it, as
it may be helpful to other users.


Author: Erik Garrison <[email protected]>
        Marth Lab [3], Boston College

License: MIT

References:

[1] http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0
[2] http://sourceforge.net/projects/bamtools/
[3] http://bioinformatics.bc.edu/marthlab/Main_Page
[4] http://bioinformatics.bc.edu/marthlab/Mosaik
[5] https://github.com/ekg/bamaddrg
[6] https://github.com/ekg/vcflib
[7] http://samtools.sourceforge.net/tabix.shtml

About

Bayesian haplotype-based polymorphism discovery and genotyping.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published