Releases: broadinstitute/gatk

4.1.4.0

08 Oct 19:40
0a50314

Download release: gatk-4.1.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.4.0 release:

  • Major improvements and fixes to Mutect2, including more intelligent handling of paired reads during genotyping and better filtering.

  • Important bug fixes to HaplotypeCaller, the joint calling pipeline, and Funcotator

  • Beta support for building/testing on Java 11 (#6119) (#6145)

    • We encourage you to try this out and give us feedback!

Full list of changes:

  • New Tools

    • AlleleFrequencyQC: a QC tool that uses VariantEval to bin variants in 1000 Genomes by allele frequency. For each bin, we compare the expected allele frequency from 1000 Genomes with the observed allele frequency in the input VCF. This was designed with arrays in mind, as a way to discover potential bugs in our pipeline. (#6039)
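
The binning idea behind AlleleFrequencyQC can be illustrated with a minimal sketch. This is not the tool's implementation: the function name, the bin count, and the input format (pairs of expected and observed allele frequency) are assumptions made for illustration only.

```python
from collections import defaultdict


def compare_af_by_bin(variants, n_bins=10):
    """Bin (expected_af, observed_af) pairs by expected AF and average each bin.

    Hypothetical helper: returns {bin_index: (mean_expected, mean_observed, count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for expected, observed in variants:
        # An AF of exactly 1.0 is clamped into the top bin.
        idx = min(int(expected * n_bins), n_bins - 1)
        sums[idx][0] += expected
        sums[idx][1] += observed
        sums[idx][2] += 1
    return {
        idx: (exp_sum / count, obs_sum / count, count)
        for idx, (exp_sum, obs_sum, count) in sums.items()
    }


if __name__ == "__main__":
    pairs = [(0.05, 0.06), (0.07, 0.05), (0.55, 0.40)]
    for idx, (mean_exp, mean_obs, count) in sorted(compare_af_by_bin(pairs).items()):
        print(f"bin {idx}: expected={mean_exp:.3f} observed={mean_obs:.3f} n={count}")
```

A large gap between the expected and observed means in a bin is the kind of signal such a QC step would flag for follow-up.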
  • Mutect2

    • Mutect2 genotyping now forces paired reads to support the same haplotype (#5831)
    • New FilterAlignmentArtifacts now realigns a locally-assembled unitig of all variant read pairs (#6143)
    • Fixed a Mutect2 bug that overfiltered by one variant (#6101)
    • Fixed a small gene panel edge case for CalculateContamination (#6137)
    • Fixed a small gene panel edge case in orientation bias filter (#6141)
    • Unified the NIO and non-NIO M2 WDLs (call-caching will now work on Terra) (#6108)
    • Updated Mutect2 pon WDL to WDL 1.0 (#6187)
    • Removed Oncotator from the M2 WDL (Funcotator is still there) (#6144)
    • Fixed an issue in the M2 WDL that could cause the Funcotate task to be ignored by tools such as dxWDL (#6077)
    • Some miscellaneous code refactoring/improvements (#6184) (#6136) (#6107) (#6159)
  • HaplotypeCaller

    • HaplotypeCaller now force-calls like Mutect2: the -genotyping-mode GENOTYPE_GIVEN_ALLELES argument is gone (now you only need to specify --alleles force-calls.vcf) and alleles are now force-called in addition to any other alleles (#6090)
    • Renamed --output-mode EMIT_ALL_SITES to --output-mode EMIT_ALL_ACTIVE_SITES, and clarified the documentation for the argument (#6181)
    • Fixed a rare bug in the genotyping engine where it could emit untrimmed alleles for SNP sites (#6044)
    • Fixed some sources of non-determinism in the HaplotypeCaller that in rare cases could cause the output to vary slightly given the same inputs (#6195) (#6104)
    • Deleted the old exact AF calculation model (#6099)
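
For scripts that still pass the removed or renamed HaplotypeCaller arguments, a small migration helper along the following lines may be useful. It is a sketch only: the function name is hypothetical, and accepting both the single-dash and double-dash spellings of the genotyping-mode flag is an assumption.

```python
def migrate_haplotypecaller_args(args):
    """Rewrite a pre-4.1.4.0 HaplotypeCaller argument list for 4.1.4.0.

    Hypothetical helper: drops the removed GENOTYPE_GIVEN_ALLELES mode flag
    and renames the EMIT_ALL_SITES output mode.
    """
    migrated = []
    i = 0
    while i < len(args):
        arg = args[i]
        nxt = args[i + 1] if i + 1 < len(args) else None
        if arg in ("-genotyping-mode", "--genotyping-mode") and nxt == "GENOTYPE_GIVEN_ALLELES":
            # Drop the flag and its value; --alleles alone now requests force-calling.
            i += 2
            continue
        if arg == "--output-mode" and nxt == "EMIT_ALL_SITES":
            migrated += [arg, "EMIT_ALL_ACTIVE_SITES"]
            i += 2
            continue
        migrated.append(arg)
        i += 1
    return migrated


if __name__ == "__main__":
    old = [
        "HaplotypeCaller",
        "--alleles", "force-calls.vcf",
        "--genotyping-mode", "GENOTYPE_GIVEN_ALLELES",
        "--output-mode", "EMIT_ALL_SITES",
    ]
    print(" ".join(migrate_haplotypecaller_args(old)))
```

Note that the behavior also changed: the given alleles are now force-called in addition to any other alleles, so output may differ even after the arguments are migrated.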
  • Joint Calling

    • Fixed a regression in GATK 4.1.3.0 that caused us to not emit the AS_QD annotation when running a joint calling pipeline with CombineGVCFs (GenomicsDB was unaffected) (#6168)
    • Fixed allele-specific annotation array length issues when alleles are subset in tools such as GenotypeGVCFs (#6079)
    • Changed AS_RankSum outputs to "." for missing values rather than "nul" (#6079)
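
Downstream parsers that read AS_RankSum values from VCFs written by both older and newer releases may need to accept both spellings of a missing value. The sketch below shows one way to do that; the function name and the comma separator are assumptions, so adjust the separator to match how the annotation is delimited in your files.

```python
def parse_as_ranksum(value, separator=","):
    """Parse a per-allele rank-sum string, mapping missing markers to None.

    Hypothetical helper: treats ".", "nul", and empty tokens as missing.
    """
    missing = {".", "nul", ""}
    return [None if token in missing else float(token) for token in value.split(separator)]


if __name__ == "__main__":
    print(parse_as_ranksum("0.5,."))
    print(parse_as_ranksum("nul,-1.2"))
```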
  • Funcotator

    • Fixed a bug that caused Funcotator to output fields in the wrong order in some cases when writing a VCF (#6178)
      • Specifically, Funcotator would output funcotation fields in the wrong order when there was more than 1 site in a VCF data source with the exact same position and alleles and it matched one of the variants being annotated
  • Mitochondrial pipeline

    • Renamed the output vcf with the name of the sample and supplied a default value for autosomal_median_coverage (meaning you'll now get the NuMT filter even if you don't provide the actual autosomal coverage) (#6160)
  • Miscellaneous Changes

    • Beta support for building/testing on Java 11 (#6119) (#6145)
    • UpdateVCFSequenceDictionary now supports replacing an invalid sequence dictionary in a VCF (#6140)
    • CountFalsePositives now requires an intervals file (#6120)
    • AnalyzeSaturationMutagenesis: use supplementary alignments to identify large deletions (#6092)
    • AnalyzeSaturationMutagenesis: an insert at the start codon is not in the ORF (#6121)
    • Added a check for null sequence dictionaries in the dictionary validation code (#6147)
    • Update SV Spark pipeline example shell scripts saving results to GCS (#6114)
    • Update public key for installing R in docker (#6116)
    • Log exceptions during deletion on JVM exit instead of throwing (#6125)
    • Don't fail the build if we're in a git worktree folder (#6169)
    • Free a bit of memory for the test suite by disabling mysql and postgres on travis (#6085)
    • Delete bogus index files for queryname sorted CRAMs. (#6149)
    • Cleanup GenomicsDB debugging test output (#6089)
  • Documentation

    • Fixed mitochondria mode documentation in FilterMutectCalls (#6174)
  • Dependencies

    • Updated HTSJDK to 2.20.3 (#6126)
    • Updated Picard to 2.21.1 (#6205)
    • Updated google-cloud-nio to 0.107.0 (#6042)
    • Updated Gradle to 5.6 (#6106)

4.1.3.0

09 Aug 19:20

Download release: gatk-4.1.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.3.0 release:

  • GnarlyGenotyper, a new beta joint genotyping tool which, along with ReblockGVCF, forms part of a forthcoming more scalable version of our joint genotyping pipeline that we call the "GATK Biggest Practices" pipeline
  • FuncotateSegments, a new beta companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
  • GenomicsDBImport now has the ability to incrementally update an existing GenomicsDB workspace
  • Several important bug fixes to HaplotypeCaller and Mutect2

Compatibility notes:

  • GermlineCNVCaller models built in cohort mode with previous releases are no longer compatible. Users should rebuild these models with this release before running GermlineCNVCaller in case mode. See the CNV Tools section below for more details.

Full list of changes:

  • New Tools

    • GnarlyGenotyper (beta tool) (#4947) (#6075)

      • The GnarlyGenotyper is designed to perform joint genotyping on cohorts of at least tens of thousands of samples called with HaplotypeCaller and post-processed with ReblockGVCF, producing a multi-sample callset in a highly scalable manner.
      • Caveats:
        • GnarlyGenotyper is intended to be used with GVCFs for which low quality variants have already been removed, derived from post-processing HaplotypeCaller GVCFs with ReblockGVCF. See the "Biggest Practices" usage example in the ReblockGVCF docs for details.
        • GnarlyGenotyper does not subset alternate alleles and can return some highly multi-allelic sites. PLs will not be output for sites with more than 6 alts to save space.
        • GnarlyGenotyper assumes all diploid genotypes
      • Annotations:
        • To generate all the annotations necessary for VQSR, input variants to the GnarlyGenotyper must include the QUALapprox and VarDP annotations along with the latest RAW_MQandDP annotation.
        • If allele-specific annotations are present, they will be used appropriately and a new AS_AltDP annotation giving the total depth across samples for each alternate allele will be added.
      • A GATK "Biggest Practices" pipeline including the GnarlyGenotyper is forthcoming pending some fixes improving on the above caveats.
    • FuncotateSegments (beta tool) (#5941)

      • A companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
      • The Somatic CNV pipeline can optionally run this tool for functional annotation
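
To see why GnarlyGenotyper omits PLs at highly multi-allelic sites, it helps to count how many PL values a genotype field needs. The sketch below uses the standard count of unordered genotypes for a given ploidy and number of alleles; the function name is hypothetical, and the default ploidy of 2 follows the caveat above that all genotypes are assumed diploid.

```python
from math import comb


def num_pl_values(num_alt_alleles, ploidy=2):
    """Number of PL entries per sample: one per unordered genotype.

    With N alleles (reference plus alts) and ploidy P, there are
    C(N + P - 1, P) possible genotypes.
    """
    n_alleles = num_alt_alleles + 1  # include the reference allele
    return comb(n_alleles + ploidy - 1, ploidy)


if __name__ == "__main__":
    for alts in (1, 2, 6, 7, 10):
        print(f"{alts} alt alleles: {num_pl_values(alts)} PL values per sample")
```

The count grows quadratically with the number of alleles for diploid samples, and it is stored for every sample at the site, which is where the space saving comes from.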
  • HaplotypeCaller/Mutect2

    • Fixed a regression in HaplotypeCaller/Mutect2 that caused some variants to be lost at sites with high complexity (#5952)
    • Fixed a GGA (GENOTYPE_GIVEN_ALLELES) mode bug in HaplotypeCaller/Mutect2 where added alleles' cigars could have soft clips (#6047)
      • This bug would manifest as a "Cigar cannot be null" error
    • Fixed a bug where cached indel informativeness values could be incorrectly applied to the wrong sites in HaplotypeCaller/Mutect2 (#5911)
    • Fixed an edge case in HaplotypeCaller/Mutect2 where dangling end merging creates cycles (#5960)
    • Added hidden arguments to the assembly engine to track found haplotype counts and kmers used (#6049)
    • Fixed a bug in CalculateContamination when contamination is indistinguishable from zero (#5971)
    • Fixed a bug where normal p value argument in FilterMutectCalls was declared static (#5982)
  • CNV Tools

    • Added FuncotateSegments as an option to the Somatic CNV WDL (#5967)
    • Added QC metrics to the Germline CNV workflow (#6017)
    • Enabled GC-bias correction by default in CNV workflows (#5966)
    • Added denoised coverage file concatenation output to gCNV postprocessor (#5823) Note: The addition of this feature breaks compatibility with gCNV cohort-mode models built with previous releases.
    • Changed cr.igv.seg output of ModelSegments to give log2 Segment_Mean. (#5976)
    • Fixed CNV plotting script to allow spaces in input filenames. (#5983)
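
For readers comparing cr.igv.seg files across releases, the relationship between a linear copy ratio and a log2 Segment_Mean is sketched below. That the earlier output was a linear copy ratio is an assumption here, and the function name is hypothetical.

```python
import math


def to_log2_copy_ratio(copy_ratio):
    """Convert a linear copy ratio to log2 space (1.0 maps to 0.0)."""
    if copy_ratio <= 0:
        raise ValueError("copy ratio must be positive to take a logarithm")
    return math.log2(copy_ratio)


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        print(f"copy ratio {ratio} -> log2 {to_log2_copy_ratio(ratio)}")
```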
  • GenomicsDBImport

    • Added support for making incremental updates to existing workspaces (#5970)
      • This can be done using the new --genomicsdb-update-workspace-path argument
    • Fixed a crash in GenomicsDBImport on queries at positions inside deletions (#5899)
    • Treat AS_QUALapprox and AS_VarDP strings as array of int vectors (#5933)
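
An incremental update might be scripted roughly as follows. This is a sketch under assumptions: the workspace and GVCF paths are placeholders, the helper name is hypothetical, and any other arguments your setup requires are omitted. Only --genomicsdb-update-workspace-path comes from the notes above.

```python
import shlex


def genomicsdb_update_command(workspace, gvcfs):
    """Build a GenomicsDBImport command that adds GVCFs to an existing workspace."""
    cmd = ["gatk", "GenomicsDBImport", "--genomicsdb-update-workspace-path", workspace]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per new sample GVCF
    return cmd


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    command = genomicsdb_update_command("my_workspace", ["sample3.g.vcf.gz", "sample4.g.vcf.gz"])
    print(shlex.join(command))
```

Because an update modifies the workspace in place, keeping a backup of the existing workspace before running it is a sensible precaution.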
  • Mitochondrial Calling Pipeline

    • Added NIO support and updated to WDL 1.0 (#6074)
  • Spark Tools

    • Removed the beta label from many simple Spark tools (#5991)
    • Bug fix for reading references from GCS on Spark (#6070)
    • Eliminated an unnecessary sort step in HaplotypeCallerSpark (#5909)
    • Fixed BaseRecalibratorSpark failure on a cluster due to system classloader issue (#5979)
    • Added a WDL for ReadsPipelineSpark (#5904)
    • Added a command-line argument to toggle using NIO on reading for Spark (#6010)
    • Added advanced arguments to MarkDuplicatesSpark to allow non-queryname sorted inputs when specifying multiple input bams and to treat unsorted inputs as queryGroup-sorted (#5974)
    • Clarified the behavior of MarkDuplicatesSpark when given multiple input bams, and improved the sorting behavior if given a mix of queryname-sorted and query-grouped bams (#5901)
    • Changed spark.yarn.executor.memoryOverhead to spark.executor.memoryOverhead as promoted by Spark 2.3 (#6032)
    • Handle newly-added arguments in ApplyBQSRUniqueArgumentCollection (#5949)
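
If you keep your own Spark configuration alongside GATK, the memoryOverhead rename can be applied with a helper like the one below. It is an illustrative sketch: the function name is hypothetical, and keeping an explicitly set new key in preference to the old one is a design assumption.

```python
OLD_KEY = "spark.yarn.executor.memoryOverhead"
NEW_KEY = "spark.executor.memoryOverhead"


def migrate_memory_overhead(conf):
    """Return a copy of a Spark conf dict that uses the newer property name."""
    updated = dict(conf)
    if OLD_KEY in updated:
        old_value = updated.pop(OLD_KEY)
        # Do not overwrite a value already set under the new name.
        updated.setdefault(NEW_KEY, old_value)
    return updated


if __name__ == "__main__":
    print(migrate_memory_overhead({OLD_KEY: "600", "spark.executor.cores": "4"}))
```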
  • Miscellaneous Changes

    • Added a new BaseQualityHistogram variant annotation to generate base quality histograms (#5986)
    • Added a new SoftClippedReadFilter that can filter out reads where the ratio of soft-clipped bases to total bases exceeds some given value (#5995)
    • Fixed a serious bug in ValidateVariants where the tool would silently do no validation in the default case when a DBSNP file was not provided (#5984)
    • Fixed a "Record covers a position previously traversed" error in ValidateVariants for GVCFs with multiple contigs (#6028)
    • The RMSMappingQuality annotation now requires the --allow-old-rms-mapping-quality-annotation-data argument to run with GVCFs created by older versions of the GATK (#6060)
    • Added a simple TSV/CSV/XSV writer with cloud write support as an alternative to TableWriter (#5930)
    • Funcotator: added Funcotator stand-alone WDL to supported area (#5999)
    • Extracted the GenotypeGVCFs engine into publicly accessible class/function (#6004)
    • Refactored VariantEval methods to allow subclasses to override (#5998)
    • AnalyzeSaturationMutagenesis: arbitrarily choose 1 read for disjoint pairs, dump rejected reads, and various other improvements (#5926) (#6043)
    • Normalized some AssemblyRegion args in HaplotypeCallerSpark (#5977)
    • Don't redundantly delete temporary directories in RScriptExecutor (#5894)
    • Treat all source files as UTF-8 for java, javadoc (#5946)
    • Updated an out-of-date argument name in an error message for the CycleCovariate
    • Changed an error about "duplicate feature inputs" to be a UserException (#5951)
    • Got rid of ExpandingArrayList in favor of ArrayList (#6069)
    • Disabled Codecov for now on travis due to spurious errors (#6052)
    • Lowered the Xms value in the test JVM (#6087)
    • Updated the travis installed R version to 3.2.5, matching our base docker image (#6073)
    • Fixed an erroneous warning about GCS test configuration (#5987)
    • Added a code of conduct (#6036)
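
The ratio that the new SoftClippedReadFilter is described as using (soft-clipped bases to total bases) can be computed from a CIGAR string, as in the sketch below. This is not the filter's implementation: counting "total bases" as all read-consuming CIGAR operations and treating a ratio strictly above the threshold as exceeding it are both assumptions, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume bases of the read sequence.
READ_CONSUMING = set("MIS=X")


def soft_clipped_ratio(cigar):
    """Fraction of a read's bases that are soft-clipped, given its CIGAR string."""
    soft = total = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in READ_CONSUMING:
            total += int(length)
        if op == "S":
            soft += int(length)
    return soft / total if total else 0.0


def exceeds_soft_clip_threshold(cigar, threshold):
    """True when the soft-clipped fraction is strictly above the threshold."""
    return soft_clipped_ratio(cigar) > threshold


if __name__ == "__main__":
    for cigar in ("50M", "10S90M", "20S60M20S"):
        print(cigar, soft_clipped_ratio(cigar), exceeds_soft_clip_threshold(cigar, 0.25))
```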
  • Documentation

    • FilterVariantTranches documentation fix and improvement (#5837)
    • Updated FilterMutectCalls usage examples (#5890)
    • Added --max-mnp-distance 0 to usage example in CreateSomaticPanelOfNormals docs (#5972)
    • Updated the MarkDuplicatesSpark documentation to no longer contain a misleading usage example (#5938)
    • Added a clarification to the README to warn users to set their Gradle JVM properly in Intellij after setup (#6066)
    • Added links to download Java 8 to the README (#6025)
    • Remove non-ascii chars from javadoc (#5936)
  • Dependencies

    • Updated HTSJDK to 2.20.1 (#6083)
    • Updated Picard to 2.20.5 (#6083)
    • Updated Disq to 0.3.3 (#6083)
    • Updated Spark to 2.4.3 (#5990)
    • Updated Gradle to 5.4.1 (#6007)
    • Updated GenomicsDB to 1.1.0.1 (#5970)

4.1.2.0

23 Apr 17:33
fb16ae1

Download release: gatk-4.1.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.2.0 release:

  • Two new tools, MethylationTypeCaller and AnalyzeSaturationMutagenesis (see below for descriptions)
  • Significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller
  • Fixed a serious bug in Funcotator that could cause END positions to be wrong for some deletions in MAF output
  • Significant updates to the mitochondrial calling pipeline

Full list of changes:

  • New Tools

    • MethylationTypeCaller (#5762)
      • Identifies methylated bases from bisulfite sequencing data. Given a bisulfite sequenced, methylation-aware aligned BAM and a reference, it outputs methylation-site coverage to a specified output vcf file.
    • AnalyzeSaturationMutagenesis (#5803) (#5883)
      • Processes reads from a saturation mutagenesis experiment, an experiment that systematically perturbs a mini-gene to ascertain which amino-acid variations are tolerable at each codon of the open reading frame. Its main job is to discover variations from wild-type sequence among the reads, and to summarize the variations observed.
  • Mutect2

    • Made significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller (#5874). These improvements are described in more detail in #5857
    • CalculateContamination now works much better for very small gene panels (#5873)
    • We now correctly handle inputs with 100% contamination in Mutect2 filtering (#5853)
    • Mutect2 now uses natural logarithms internally (#5858). This does not change any outputs.
    • Minor update to the Mutect2 PON WDL (#5859)
  • Funcotator

    • Fixed a serious bug that could cause END positions to be wrong for some deletions in MAF output (#5876)
    • The tool now throws a user error for an AD field with only 1 value in MAF mode (#5860)
    • Added a new filter to FilterFuncotations. For two autosomal recessive genes, MUTYH and ATP7B, homozygous variants and compound heterozygous variants will be tagged and added to the output vcf. (#5843)
  • Mitochondrial Calling Pipeline

    • Updated the pipeline for the new Mutect2 filtering scheme and pulled filtering after the liftover and recombining of the VCF. (#5847)
    • Made the subsetting of the WGS bam fast by using PrintReads over just chrM instead of traversing the whole bam for NuMT mates. (#5847)
    • Moved polymorphic NuMTs based on autosomal coverage to a filter (it was an annotation before) (#5847)
    • Added an option to hard filter by VAF (#5847)
    • Bug fix for large input files to the mitochondrial pipeline (we now include the size of the input BAM/CRAM when calculating disk size, when necessary) (#5861)
  • Structural Variation Calling Pipeline

    • Bug fix to QNameFinder to handle reads with negative unclipped starts (#5864)
  • Miscellaneous Changes

    • Added a --min-fragment-length argument to the FragmentLengthReadFilter (#5886)
    • Added a --spark-verbosity argument to control verbosity of Spark-generated logs (#5825)
    • Added a new WalkerBase abstract class to be used for all built-in walkers (#4964)
    • Exposed transient attributes in the GATKRead API (#5664)
    • Convert more code to use GATKPathSpecifier (#5870) (#5832). This also fixes an InvalidPathException on Windows machines.
    • Fixes to the test suite related to the recent introduction of a codec for Picard interval lists (#5879)
    • Eliminated an error message during the Docker build in Travis logs by creating a directory before copying to it. (#5878)
  • Documentation

    • Updated the Mutect2 WDL README with Funcotator information (#5892)
    • Updated a usage example for CreateHadoopBamSplittingIndex (#5898)

4.1.1.0

28 Mar 23:06
ea3032d

Highlights of the 4.1.1.0 release:

  • A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
  • Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
  • A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
  • Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
  • Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.

Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes:

  • HaplotypeCaller

    • Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
      • This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
    • Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
    • Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
    • Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
    • HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
    • Fixed a rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
  • Mutect2

    • Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
      • FilterMutectCalls automatically determines the optimal threshold.
      • The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
      • Includes a rewrite of Mutect2 documentation -- better organization and now includes command line examples in addition to math.
    • Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
      • This especially improves indel sensitivity.
    • Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
    • New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
      • Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
    • Rewrote Mutect2 active region likelihood as special case of full somatic likelihoods model, which reduces runtime by 5% (#5814)
    • Funcotator updates in Mutect2 WDL (#5742) (#5735)
    • Prune assembly graph before checking for cycles (#5562)
    • Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
    • Added CRAM support to the Mutect2 WDL (#5668)
    • Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
    • Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
    • Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
    • Correct some Mutect2 VCF header lines (#5792)
    • Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
    • Output sample names in Mutect2 PON header (#5733)
    • Avoid error due to finite precision error in Mutect2 PON creation (#5797)
    • Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
    • Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
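
The idea of applying a single threshold to an overall error probability, with the threshold chosen automatically, can be illustrated with a small sketch. The notes above do not say which objective FilterMutectCalls optimizes; maximizing the expected F-score is an assumption made here for illustration, and the function name and inputs are hypothetical.

```python
def best_error_threshold(error_probs):
    """Pick the error-probability cutoff that maximizes the expected F-score.

    Calls with error probability <= cutoff pass. Each call contributes
    (1 - p) expected true variants and p expected artifacts.
    Returns (cutoff, expected_f_score), or (None, 0.0) for empty input.
    """
    probs = sorted(error_probs)
    if not probs:
        return None, 0.0
    total_true = sum(1.0 - p for p in probs)
    best_threshold, best_f = None, -1.0
    tp = fp = 0.0
    for p in probs:
        tp += 1.0 - p  # expected true positives among passing calls
        fp += p        # expected false positives among passing calls
        fn = total_true - tp  # expected true variants that get filtered
        denom = 2 * tp + fp + fn
        f_score = 2 * tp / denom if denom else 0.0
        if f_score > best_f:
            best_threshold, best_f = p, f_score
    return best_threshold, best_f


if __name__ == "__main__":
    cutoff, f_score = best_error_threshold([0.01, 0.02, 0.9, 0.95])
    print(f"cutoff={cutoff} expected F-score={f_score:.3f}")
```

With one cutoff on a combined probability, there is a single precision/sensitivity trade-off to tune rather than a separate threshold per filter.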
  • CNNScoreVariants

    • We now use the latest Intel-optimized tensorflow (#5725)
      • This speeds up the 2D CNN by roughly 2X in our tests!
    • FilterVariantTranches is out of beta (#5628)
    • Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
      • We now make sure that the GATK tool Python package is present before executing streaming Python commands.
    • Extensive updates to the CNN WDLs (#5251)
  • Mitochondrial Calling Pipeline

    • Added an option to recover all dangling branches, on by default for MT calling (#5693)
      • Fixes a large number of missed calls
    • Use adaptive pruning in the mitochondria pipeline (#5669)
    • Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
    • Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
    • Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
    • Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
    • Added the rest of the mitochondria joint-calling pipeline (#5673)
      • Merging and genotyping "somatic" GVCFs from Mutect2
    • Added a read filter for unmapped reads and their mates (#5826)
    • Refactored the MT WDL to make validations easier (#5708)
    • Updated a variable name in MT WDL to match gatk-workflows version (#5694)
  • GenotypeGVCFs

    • Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
    • Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
      • GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
      • Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
  • Funcotator

    • Non-locatable data sources can create funcotations again (#5774)
      • Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
    • Fixed handling of symbolic alleles when determining best transcript for GencodeFuncotation creation. (#5834)
    • FilterFuncotations: support for multi-allelic variants (#5588)
    • FilterFuncotations: support for gnomAD for allele frequency in ClinVarFilter and LofFilter, with a new argument telling it which dataset of gnomAD or ExAC to use (#5691)
    • Added # as a character to be sanitized by VCFOutputRenderer (#5817)
    • Added in Markdown files for Funcotator forum posts (#5630)
    • Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
  • CNV Tools

    • Improved memory usage in gCNV (#5781)
    • Improved memory requirements of CollectReadCounts (#5715)
    • Added some fixes for minor CNV issues (#5699)
    • Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
    • Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
  • Miscellaneous Changes

    • SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • VariantEval bug fix: don't require the output file to already exist (#5681)
    • Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
    • GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
    • GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
    • VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
    • CalculateGenotypePosteriors: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
    • Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
    • ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
    • Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
    • ReblockGVCF: fix sites with zero AD to actually use SITE-level DP value as intended (#5835)
    • Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
    • Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
    • Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
    • IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
    • PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
    • Added a new read filter: IntervalOverlapReadFilter (#5656)
    • Add NIO Path support to TableReader and TableWriter (#5785)
    • Replaced IntervalsSkipList with OverlapDetector (#4154)
    • Removed some unused arguments in VCF merging code (#5745)
    • Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
    • Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
    • Removed accidental uses of log4j v1 (#5682)
    • Improvements to Spark evaluation scripts (#5815)
    • Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
  • Documentation

    • Improved the documentation for the StrandOddsRatio annotation (#5703)
    • Fixed the descriptions of some HaplotypeCaller arguments (#5658)
    • Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
    • Corrected javadoc for the InbreedingCoeff annotation (#5768)
    • CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
    • Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
    • Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
    • Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
    • Trimmed overly-long tool...

4.1.0.0

30 Jan 03:38

It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!

To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.

Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), click here instead.

Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/

Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):


  • Next-Gen VQSR Replacement For Single-Sample

    • New suite of tools CNNScoreVariants, CNNVariantTrain, CNNVariantWriteTensors, and FilterVariantTranches
    • CNNScoreVariants is now out of beta and ready for production use
    • Performs variant training and scoring using a convolutional neural network.
    • Single-sample only
    • Produces better results than the legacy VariantRecalibrator (VQSR) and results comparable to or better than those of third-party tools like DeepVariant
    • Sophisticated 2D model that uses the reads
  • Major HaplotypeCaller Improvements

    • Now genotypes and outputs spanning deletions
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Important fix to the reference confidence calculation upstream of indels
    • New HaplotypeCaller priors for variant sites and homRef blocks
      • Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
      • Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
  • Major Mutect2 Improvements

    • Mutect2 is now out of beta
    • Support for multi-sample calling
    • Lots of support for high-depth calling such as cfDNA, UMIs, mitochondria, including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Added a genotype given alleles (GGA) mode
    • New STR indel error model that improves sensitivity and precision in STR (short-tandem repeat) contexts
    • Many new/improved filters to reduce false positives (e.g., FilterAlignmentArtifacts)
    • Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
    • New probabilistic orientation bias tool
    • Got rid of many questionable indels showing up in bamout of Mutect2 and the HaplotypeCaller
    • Big improvements to CalculateContamination, especially when tumor has lots of CNVs
    • NIO support in Mutect2 WDL
    • Significant speed improvements
    • Improved allele fraction estimation
    • Initial GVCF output support
  • Mitochondrial Calling

    • Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria.
  • New allele frequency / qual score model

    • Is now the default in HaplotypeCaller and GenotypeGVCFs
    • Optimized for greater speed, should resolve many GenotypeGVCFs memory issues
    • Rare numerical finite precision issues in the allele-specific qual have been resolved
  • Major Improvements to the CNV (Copy Number Variation) tools

    • The CNV tools are now out of beta.
      • This includes the tools: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
    • Completed the GermlineCNVCaller (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
    • Major changes include the addition of new tools (PostprocessGermlineCNVCalls, FilterIntervals, and CollectReadCounts, which replaces CollectFragmentCounts), as well as improvements to existing tools (notably, AnnotateIntervals).
    • Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the ModelSegments somatic CNV pipeline, and CRAM support for all CNV WDLs.
    • Developed tools and WDLs for tagging and filtering of germline events in the ModelSegments somatic CNV pipeline.
  • Funcotator Official Release

    • Funcotator is now out of beta
    • Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than other well-known functional annotation tools.
    • Some new features include:
      • MAF output support
      • NIO support for datasources
      • gnomAD support
      • dbSNP support
      • Support for Mitochondrial amino acid sequence/protein change strings
      • 5'/3' flank support
      • Major performance improvements due to added caching
      • Added ALL mode for transcript selection (--transcript-selection-mode ALL) which will output full annotation fields for all transcripts
    • Created a new FuncotatorDataSourceDownloader tool to download data sources
    • Added an experimental FilterFuncotations tool
  • MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates

    • MarkDuplicatesSpark is now out of beta
    • Rewritten version of the tool matches Picard MarkDuplicates output and has greatly improved performance and scalability
    • Supports multiple BAM inputs
    • Indexes BAM outputs on-the-fly in parallel on a cluster
  • Additional Tools Ported from GATK3

    • Ported VariantAnnotator
    • Ported VariantEval
    • Ported FastaAlternateReferenceMaker and FastaReferenceMaker
    • Ported LeftAlignAndTrimVariants
    • Restored GenotypeGVCFs --include-non-variant-sites argument
  • Major Improvements to the SV (Structural Variation) Tools

    • Improvements to collection and calling of events based on discordant read pair evidence.
    • A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
    • Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
    • A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
    • A machine learning (XGBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
  • Spark Improvements

    • New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
    • HaplotypeCallerSpark now has a "strict mode" that closely matches the regular HaplotypeCaller
    • Created RevertSamSpark, a parallelized Spark version of Picard's RevertSam tool
    • Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
  • GenomicsDB Improvements

    • Allele-specific annotation support
    • Multi-interval support (with some performance caveats)
    • Support for sites-only queries
    • Support for returning the GT field in queries
    • New protobuf-based API to allow configuration without editing JSON files
    • Added in machinery to allow per-annotation combine operations to be specified
    • Allow HDFS and GCS URIs to be passed to GenomicsDB
    • Migrated from com.intel.genomicsdb to org.genomicsdb
  • "Goodies" Worth Mentioning

    • Added fasta.gz support to the -R/--reference argument in walker tools
    • SelectVariants can now drop specific annotation fields from the output vcf
    • CalculateGenotypePosteriors now supports indels
    • New tool ReblockGVCF to merge reference blocks in single-sample GVCFs for smaller file sizes
    • Improved MQ calculation accuracy, especially at sites with many uninformative reads; concomitant with new annotation tag and format
    • The -L argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
    • Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument
    • Added GCS (Google Cloud Storage) output (-O) support to more tools
    • Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
    • A significantly (~33%) smaller GATK docker image
    • Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
      • Affects command-line interface for VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
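
For scripts written against the old syntax, the change amounts to moving the tag from the value onto the argument name. The following is a minimal sketch of that rewrite, not part of GATK: the function name `migrate_tagged_args`, the caller-supplied set of tagged argument names, and the example tokens are all assumptions made for illustration.

```python
def migrate_tagged_args(argv, tagged_args):
    """Rewrite old-style '--arg tag:value' pairs as '--arg:tag value'.

    argv        -- list of command-line tokens
    tagged_args -- names of arguments that accept a tag (an assumption
                   supplied by the caller, e.g. {"--resource"})
    """
    out = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in tagged_args and i + 1 < len(argv):
            # Split on the first ':' only, so tagged values that are URIs survive.
            tag, sep, value = argv[i + 1].partition(":")
            # Leave untagged values alone, including bare URIs such as gs://...
            if sep and value and not value.startswith("//"):
                out.extend([f"{token}:{tag}", value])
            else:
                out.extend([token, argv[i + 1]])
            i += 2
        else:
            out.append(token)
            i += 1
    return out


# Illustrative tokens only; the file names are hypothetical.
old = ["VariantRecalibrator", "--resource", "hapmap,known=false:hapmap.vcf", "-O", "out.recal"]
print(" ".join(migrate_tagged_args(old, {"--resource"})))
```

The sketch only touches arguments named in `tagged_args`, so other values that happen to contain a colon are left unchanged.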

Changes between versions 4.0.12.0 and 4.1.0.0 only:


  • Many tools are now out of beta and ready for production use!
    • `CNNScor...

4.0.12.0

17 Dec 18:47
02682d5

Highlights of this release include support for outputting phased variants in HaplotypeCaller/Mutect2, restoring the --include-non-variant-sites argument to GenotypeGVCFs, a port of the GATK3 tool VariantEval, a new library (Disq, https://github.com/disq-bio/disq) for working with BAM/CRAM/VCF/etc. formats on Spark, and GCS (Google Cloud Storage) support in Funcotator.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • HaplotypeCaller/Mutect2

    • Output VCF spec-compliant phased variants in HaplotypeCaller and Mutect2
    • Added an experimental adaptive pruning option for local assembly (#5473)
    • Improved implementation of allele-specific new qual (#5460)
    • Use cigar complexity to break ties in uninformative reads' best haplotypes (#5359)
    • Improved handling of regions that are too short after trimming in HaplotypeCaller and in Mutect2 (Closes issue #5079)
    • Optimization in CigarUtils to shortcut to M-only CIGAR when provably optimal (#5466)
    • Changed SUPPORTED_ALLELES_TAG from SA to XA (#5418)
  • HaplotypeCaller

    • Fixed a bug in GGA mode caused by split multiallelic sites with genotypes (#5365)
    • The debug command line argument is now passed correctly in HaplotypeCaller (fixed issue #4943) (#5455)
  • Mutect2

    • Big improvements to CalculateContamination's model for determining hom alt sites (#5413)
    • Reduce false negatives from mapping quality filter on long indels in Mutect2 (#5497)
    • Added a mismatch ratio option in realignment filter (#5501)
    • Made Mutect2 read position filter default much less stringent (#5487)
    • Fixed M2 bug for germline resources with AF=. (#5442)
    • Fix read position annotation bug in M2 filter (#5495)
    • Cleaner Mutect2 VCF fields (#5510)
    • Moved PerAlleleAnnotations to the INFO field (#5518)
    • Removed unnecessary inheritance of M2 filtering arguments collection (#5498)
  • GenotypeGVCFs

    • Restored the --include-non-variant-sites argument from GATK3 to GenotypeGVCFs (#5219)
  • Ported the GATK3 tool VariantEval to GATK4 (#5043)

  • Replaced the Hadoop-BAM library with the newly-developed Disq library (https://github.com/disq-bio/disq) for efficiently working with BAM/CRAM/VCF/etc. formats on Spark (#5138)

    • Improves Spark performance across-the-board, and fixes many edge-case bugs in Hadoop-BAM
  • Funcotator

    • Added GCS support to Funcotator data sources, so that data sources can now be accessed directly from GCS buckets (#5425)
    • Added support for annotating 5'/3' flanks (#5403)
    • Funcotator now creates default annotations for difficult variants. (#5374)
    • Funcotator can now create annotations for symbolic alleles and masked alleles (#5406)
    • Funcotator can now match between hg19 and b37 data sources. (#5491)
    • Added in regression tests and fixes for correctness of many annotations (#5302)
    • Now DE_NOVO_START_IN_FRAME and DE_NOVO_START_OUT_FRAME are correct. (#5357)
    • Added cDNA Strings for Intronic Variants (#5321)
    • VCF data sources create an ID field for the ID of the variant used for the annotation (#5327)
    • Funcotator now computes MT protein changes. (#5361)
    • Funcotator now correctly populates transcript position. (#5380)
    • Added a script that can create data sources from BED files. (#5438)
    • Updated testing Gencode data sources to fully exercise test data set (#5423)
    • Moved validation test data out of large files area. (#5381)
    • Updated top-level class documentation for Funcotator. (#4655)
    • Added scripts to liftover gnomAD. Also bugfixes for Funcotator NIO. (#5514)
  • HaplotypeCallerSpark

    • Added a "strict mode" that allows HaplotypeCallerSpark to closely match the output of the regular HaplotypeCaller (#5416)
    • Now extends AssemblyRegionWalkerSpark (#5386)
  • MarkDuplicatesSpark: Added a few of the remaining unimplemented useful features from Picard (#5377)

  • CNV workflows

    • Changed FilterIntervals to operate on the intersection of intervals in all inputs. (#5408)
    • Fixed RAM usage parameter error in combine_tracks.wdl (#5358)
    • Various other improvements to combine_tracks.wdl (#5384)
    • Fixed gCNV WDL broken by Cromwell update on FireCloud. (#5407)
    • Replaced bash script in gCNV ScatterIntervals task with updated version of IntervalListTools. (#5414)
  • CNNScoreVariants

    • Check for and require hardware AVX support (#5291)
  • Changed SelectVariants so that it can handle multiple rsIDs separated by ';' in a VCF file (#5464)
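
The VCF ID column may hold several identifiers joined by ';', so matching has to consider each one separately. A minimal sketch of that check follows; it is not the SelectVariants implementation, and the function name `matches_any_id` and the example IDs are hypothetical.

```python
def matches_any_id(vcf_id_field, keep_ids):
    """Return True if any ID in a ';'-separated VCF ID field is in keep_ids."""
    if vcf_id_field == ".":  # '.' marks a missing ID in VCF
        return False
    return any(rs_id in keep_ids for rs_id in vcf_id_field.split(";"))


# Hypothetical IDs for illustration.
keep = {"rs123"}
print(matches_any_id("rs999;rs123", keep))
print(matches_any_id("rs999", keep))
```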

  • Miscellaneous Changes

    • Added setIsUnplaced() to the GATKRead API to distinguish reads with no mapping information (#5320)
    • Fixed an integer overflow bug in the RMSMappingQuality annotation (#5435)
    • Fixed floating-point bug in MannWhitneyU on some JVMs. (#5371)
    • Standardized the output argument for LeftAlignIndels (#5474)
    • SplitIntervals now produces an .interval_list file (#5392)
    • Fixed a bug with GATK_GCS_STAGING in the GATK launcher script #1338 (#5452)
    • Added ExampleReadWalkerWithVariantsSpark.java and tests (#5289)
    • Add description getter and javadoc in GATKReportTable (#5443)
    • Fixed message in GATKAnnotationPluginDescription (#5444)
    • Replaced some uses of PrintWriter (#5461)
    • Refactor GVCFWriter to allow push/pull iteration. (#5311)
    • Add scripts/dataproc-cluster-ui to release bundle. (#5401)
    • Marked VariantAnnotator as a @DocumentedFeature (#5480)
    • Removed obsolete intel conda environment references. (#5482)
    • Deleted the CountSet class (#5467)
    • Test framework: disabled gcloud login on travis for non-cloud non-wdl tests (#5335)
    • Updated Spark scripts to reflect changes from #5386 and #5127. (#5415)
    • Fixed jexl logging and updated VariantFiltration doc. (#5422)
    • Fixed some dead links in the README (#5405)
  • Dependencies

    • Updated htsjdk to 2.18.1 (#5486)
    • Updated Picard to 2.18.16. (#5412)
    • Updated Intel-GKL dependency to 8.6 (#5463)

4.0.11.0

23 Oct 20:22
4.0.11.0

A release which includes major improvements to Mitochondrial calling in Mutect2 as well as bug fixes and improvements:

As always, a docker image is available here: https://hub.docker.com/r/broadinstitute/gatk/

Mutect2 and HaplotypeCaller changes:

  • Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria. A best practices WDL for calling mitochondrial variants on WGS data will be available in the future. (#5193)

  • Strand based annotations will use both reads in an overlapping read pair (#5286)

  • Realignment filter annotates the VCF with passing and failing read counts (#5328)

  • New filters and annotation to support blood biopsy that count and filter based on N's at variant sites (#5317)

  • Fixed bug for M2 GGA alleles with zero coverage (#5303)

  • Fixed error in genotype given alleles mode when input alleles have genotypes (#5341) #5336

  • Add new annotations to bamout to make understanding calls easier (#5215)

  • Fixed a typo.

CNV Pipeline:

  • Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. (#5307) closes #2992 #4558

Spark:

  • Removed WellformedReadFilter from CountReadsSpark (#5329)
  • Support fasta.gz in GATKSparkTool (#5290) closes #5258

Other:

  • CNN variant: update models, validate scores, clean up training (#5175)
  • combine_tracks.wdl supports GISTIC2 conversion (and bugfix) (#5287) closes #5284 #5283
  • Handle normal reads in validation sample in BasicSomaticValidator (#5322)

GenomicsDB:

  • Allow HDFS and GCS URIs to be passed to GenomicsDB (#5197)

SelectVariants:

  • Enable SelectVariants to drop specific annotation fields from output vcf. (#5254) closes #5235

SplitNCigarReads:

  • Added defensive check to OverhangFixingManager splices for non-reference spanning reads (#5298) closes #5293
  • Fixed SplitNCigarReads ArrayIndexOutOfBounds error for reads with long deletions (#5285) closes #5230

Testing:

  • Added a toggle to update the expected outputs in HaplotypeCallerIntegrationTest (#5324)
  • Added a new servicekey.json for travis (#5308) closes #5305
  • Added full-sized B37 and HG38 references to our large test data (#5309) closes #5111
  • Added in new data sources for funcotator testing. (#5296)

4.0.10.1

09 Oct 19:29
4dd7ba8

This is a small release that improves the calculation of the MQ (mapping quality) annotation, which provides an estimate of the overall mapping quality of reads supporting a variant call. It also introduces a number of experimental improvements to the CNV workflows, as well as a bug fix to LocusWalkerSpark.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Improve MQ calculation accuracy (#4969)

    • Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes.
    • Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.
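
To see why the tuple representation helps, here is a minimal sketch of how a final MQ could be derived from raw (sumSquaredMQs, totalDepth) values. It assumes MQ is the root mean square over totalDepth reads and that raw values from several samples are merged by element-wise addition; the function names and numbers are illustrative, not GATK code.

```python
import math


def combine_raw_mq(raw_values):
    """Sum per-sample (sumSquaredMQs, totalDepth) tuples element-wise."""
    sum_sq = sum(s for s, _ in raw_values)
    depth = sum(d for _, d in raw_values)
    return sum_sq, depth


def rms_mq(sum_squared_mqs, total_depth):
    """Root-mean-square mapping quality; NaN when there are no reads."""
    if total_depth == 0:
        return float("nan")
    return math.sqrt(sum_squared_mqs / total_depth)


# Sample A: 10 reads at MQ 60; sample B: 30 reads at MQ 20 (made-up numbers).
sample_a = (10 * 60 ** 2, 10)
sample_b = (30 * 20 ** 2, 30)
combined = combine_raw_mq([sample_a, sample_b])
print(combined, rms_mq(*combined))
```

Because the depth travels with the sum of squares, the combined value is weighted by the number of reads that actually contributed, rather than by a depth taken from elsewhere in the record.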
  • Updated SimpleGermlineTagger and the experimental somatic CNV post-processing workflow with several changes that improve precision and expand the possible evaluations of GATK CNV (#5252)

    • New script combine_tracks.wdl for post-processing somatic CNV calls. This WDL performs two operations:
      • Increases precision by removing:
        • germline segments. As a result, the WDL requires the matched normal segments.
        • Areas of common germline activity or error from other cancer studies.
      • Converts the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
        • This is not a trivial conversion, since each segment must be called as balanced or not (that is, whether MAF = 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
        • For more information about AllelicCapSeg and ABSOLUTE, see:
          • Carter et al. Absolute quantification of somatic DNA alterations in human cancer, Nat Biotechnol. 2012 May; 30(5): 413–421
          • https://software.broadinstitute.org/cancer/cga/absolute
          • Brastianos, P.K., Carter S.L., et al. Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets (2015) Cancer Discovery PMID:26410082
    • Changes to GATK tools to support the above:
      • SimpleGermlineTagger now uses reciprocal overlap in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
      • Added tool MergeAnnotatedRegionsByAnnotation. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.
    • New scripts multi_combine_tracks.wdl and aggregate_combine_tracks.wdl which run combine_tracks.wdl on multiple pairs and combine the results into one seg file for easy consumption by IGV.
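
The reciprocal overlap criterion mentioned for SimpleGermlineTagger can be sketched as follows. This is only an illustration of the general technique with 1-based inclusive coordinates: the function names and the 0.75 threshold are assumptions, and the breakpoint-matching criterion the tool also applies is not shown.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two fractions of each segment covered by the other."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(overlap / len_a, overlap / len_b)


def is_possible_germline(tumor_seg, normal_seg, min_reciprocal_overlap=0.75):
    """Flag a tumor segment whose reciprocal overlap with a normal segment is high.

    The default threshold is a hypothetical value chosen for this example.
    """
    return reciprocal_overlap(*tumor_seg, *normal_seg) >= min_reciprocal_overlap


print(reciprocal_overlap(1, 100, 51, 150))
print(is_possible_germline((1, 100), (1, 110)))
```

Taking the minimum of the two fractions means a small segment lying inside a much larger one scores low, which is what distinguishes reciprocal overlap from a one-sided overlap test.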
  • LocusWalkerSpark: fix issue where intervals with no reads were being dropped (#5222)

    • This fixes the bug reported in #3823
  • Added SparkTestUtils.roundTripThroughJavaSerialization() method for better serialization testing on Spark (#5257)

  • Build system: set the same compiler flags for all gradle JavaCompile tasks (#5256)

4.0.10.0

03 Oct 22:36

Highlights of this release include a new tool ReblockGVCF, a bug fix for a crash in Mutect2, and a more efficient distribution mechanism for the reference and VCFs in Spark tools.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes in this release:

  • Added a new experimental tool ReblockGVCF (#4940)

    • A tool to merge reference blocks in single-sample GVCFs for smaller file sizes
  • Mutect2:

    • Fixed a bug in the PalindromeArtifactClipReadTransformer (#5241)
      • This filter would crash with an out-of-bounds error for fragment lengths and/or mate start positions that went off the end of a contig.
    • Changed the way the log10AlleleFractions are calculated in SomaticLikelihoodsEngine: now we use the mean of the posterior of the allele fractions. (#5231)
    • Reword comments in Mutect2 WDL to not refer to the old orientation bias filter as deprecated. (#5196)
    • Cited CGA in Mutect docs (#5228)
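
As a rough illustration of the log10AlleleFractions change, the sketch below takes the log10 of the posterior mean allele fractions. It assumes the posterior is a Dirichlet distribution described by one parameter per allele, which these notes do not state; the function name and the example parameters are hypothetical.

```python
import math


def log10_mean_allele_fractions(dirichlet_params):
    """log10 of the posterior mean fraction of each allele.

    For a Dirichlet posterior with parameters a_1..a_k (an assumption made
    here), the mean fraction of allele i is a_i / sum(a).
    """
    total = sum(dirichlet_params)
    return [math.log10(a / total) for a in dirichlet_params]


# Hypothetical posterior parameters for a ref allele and one alt allele.
print(log10_mean_allele_fractions([9.0, 1.0]))
```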
  • HaplotypeCaller: Allow MNP calling in GVCF mode with stern warnings about not trying joint-genotyping from the resulting GVCFs. (#5182)

    • HaplotypeCaller will now allow you to output MNPs in GVCF mode with a warning. However, since joint genotyping of MNPs is unsupported, CombineGVCFs and GenomicsDBImport will now refuse to process GVCFs containing MNPs.
  • GATK Spark tools:

    • Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes (#5127) (#5221)
      • This improves the performance of Spark tools that take a reference and/or VCF as side inputs, as the new distribution mechanism doesn't load the entire contents of the files into memory like broadcast did.
      • As a side effect of this change, support for 2bit references has been removed from tools that were migrated to the new distribution mechanism (in particular, BaseRecalibratorSpark and HaplotypeCallerSpark).
      • The CNV Spark tools have not yet been migrated, and still support 2bit references for now.
    • Bug fix: ensure that intervals with no reads are not dropped by the SparkSharder (#5248)
  • Funcotator:

    • Added command line exclusion lists, so that users can prune fields from the output. (#5226)
    • Added Funcotator excluded fields option explicitly to the M2 WDLs. (#5242)
  • Fix a multithreaded race condition in GenotypeLikelihoodCalculators by synchronizing updates of shared genotype likelihood tables. (#5071)

    • This bug affected HaplotypeCallerSpark, but not the regular HaplotypeCaller
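
The general pattern behind this kind of fix, guarding updates to a shared lazily filled table with a lock, can be sketched as below. This is not the GATK code (which is Java): the class `SharedLikelihoodTables`, its `builds` counter, and the placeholder table contents are all hypothetical.

```python
import math
import threading


class SharedLikelihoodTables:
    """Hypothetical cache of per-(ploidy, allele count) tables shared by threads."""

    def __init__(self):
        self._tables = {}
        self._lock = threading.Lock()
        self.builds = 0  # how many times a table was actually built

    def get(self, ploidy, allele_count):
        key = (ploidy, allele_count)
        # Check-and-insert happens under one lock, so two threads can never
        # both see the key as missing and build the same table twice.
        with self._lock:
            if key not in self._tables:
                self.builds += 1
                self._tables[key] = self._build(ploidy, allele_count)
            return self._tables[key]

    @staticmethod
    def _build(ploidy, allele_count):
        # Placeholder payload: the number of unordered genotypes,
        # C(ploidy + allele_count - 1, ploidy).
        return math.comb(ploidy + allele_count - 1, ploidy)


tables = SharedLikelihoodTables()
threads = [threading.Thread(target=tables.get, args=(2, 3)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(tables.builds, tables.get(2, 3))
```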
  • GenomicsDB: added in machinery to allow per-annotation combine operations to be specified (#4993)

  • GATK Engine: Hooked up CountingVariantFilter to VariantWalkers (#4954)

  • StreamingPythonScriptExecutor: added a new message to the StreamingProcessController ack FIFO protocol to allow additional message detail to be passed as part of a negative ack. (#5170)

    • This improves exception message propagation for fatal errors when running Python tools.
  • gCNV WDLs:

    • Tar calls from all samples. (#5225)
      • This fixes an issue where the gCNV WGS cohort germline WDL was outputting vcf files with names that do not correspond to the actual samples inside the files.
    • Added multi-sample functionality to gCNV case mode WDL, and added a wrapper for gCNV case mode WDL to help optimize cloud computation cost. Also optimized how data is sent to postprocessing task in gCNV WDLs. (#5176)
  • gCNV kernel: Enforced ViterbiSegmentationEngine to analyze single samples only (#5176)

  • Added a dataproc-cluster-ui script to easily open the Spark UI on dataproc clusters (#5188)

  • Fixed pom issues that prevented publishing to maven central (#5224)

  • Added tabix to the docker base image (#5247)

4.0.9.0

20 Sep 17:11
6e352bb

Highlighting this release are some important fixes and improvements to the HaplotypeCaller, in particular support for genotyping spanning deletions and a fix to the reference confidence calculation around indels. This release also brings support for "Requester Pays" GCS (Google Cloud Storage) buckets, fasta.gz support to the -R/--reference argument, a port of LeftAlignAndTrimVariants from GATK3, a new tool FuncotatorDataSourceDownloader to download Funcotator datasources, and bug fixes to Mutect2, VariantRecalibrator, and SelectVariants.

As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/

  • HaplotypeCaller

    • Fixed the reference confidence calculation upstream of indels (#5172)
      • Improve hom-ref GQs near indels in GVCFs. Also consider bases on either side of indels informative if local assembly has been performed.
      • The previous behavior generated some PL=0,0,0 no-calls because the CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel reference confidence model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.
    • Make HaplotypeCaller genotype and output spanning deletions (#4963)
      • Modifies HaplotypeCaller so that it can output and genotype spanning deletion alleles represented by the * allele.
      • Fixes #2960
      • Previously, the output of HaplotypeCaller would not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflow GenotypeGVCFs adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.
    • Simplify HaplotypeBAMWriter code. #944 (#5122)
  • Mutect2

    • Mutect2 now emits DP values in the FORMAT field (#5185)
    • Add --get-af-from-ad option to recalculate the allele fraction based on AD instead of the Bayesian estimate (#5118)
      • Recommended for mitochondrial applications
    • Fixed a StringIndexOutOfBoundsException crash in the ReferenceBases annotation when a variant is within 10 base pairs of the end of a chromosome (#5151)
    • Restore base quality filter code that got removed unintentionally in #4895. (#5123)
    • Remove extra space in the MutectVersion header line (previously was Mutect Version) (#5184)
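
For the --get-af-from-ad option above, a minimal sketch of an allele fraction computed from allelic depths is shown here. It assumes the fraction of each alt allele is its depth divided by the total depth across all alleles; the exact formula GATK uses is not given in these notes, and the function name is hypothetical.

```python
def allele_fractions_from_ad(ad):
    """ad: [ref_depth, alt1_depth, ...]; returns one fraction per alt allele."""
    total = sum(ad)
    if total == 0:
        # No informative reads: report 0.0 rather than dividing by zero.
        return [0.0 for _ in ad[1:]]
    return [alt / total for alt in ad[1:]]


# Made-up depths: 90 ref reads, 10 alt reads.
print(allele_fractions_from_ad([90, 10]))
```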
  • Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument (#5140)

  • Added fasta.gz support to the -R/--reference argument in walker tools (#5120)

  • Added GCS/NIO support to the --tmp-dir argument (#4469)

  • Upgraded google-cloud-java to the official 0.62.0 release, and moved off our custom fork of the library. This release includes the retry for transient 502 errors that we added to our fork in GATK 4.0.8.0 (#5194) (#5135)

  • Ported the LeftAlignAndTrimVariants tool from GATK3 (#5144)

  • VariantRecalibrator: the serialized model now sets annotation order (#3655)

    • This addresses a problem where serialized GMMs for VQSR assumed that the annotation order would be the same between the commands that generated them and the commands that used them. VQSR no longer depends on the commandline order of the annotations.
  • SelectVariants: Drop sites with the * allele as the only ALT when running with --exclude-non-variants (#5129)

  • Funcotator:

    • Created a new FuncotatorDataSourceDownloader tool to download data sources. (#5150)
    • Add an experimental FilterFuncotations tool (#4991)
    • Updated COSMIC to annotate protein change strings with their counts. (#5181)
    • Fix INDEL start/stop position and alleles for VCF gencode output. (#5131)
    • Get datasource version from a manifest file instead of the README (#5149)
    • Extract a new FuncotatorEngine to make it easier to write additional tools in the future that leverage Funcotator's annotation engine (#5134)
    • Handle character encoding error cases. (#5124)
  • CNNScoreVariants:

    • Add WDLs and JSONs to run CNNScoreVariants in a single-sample workflow (#4774)
    • Added --python-profile argument to enable Python profiling. (#4953)
  • CNV tools:

    • Produce an IGV-compatible seg file alongside the copy ratio calls in CallCopyRatioSegments (#5115)
    • Added optional mappability and segmental-duplication annotation to AnnotateIntervals. (#5162)
    • Improvements and refactoring of the Nucleotide class (#4846)
  • SV tools:

    • Bug fix to read name mangling in ExtractOriginalAlignmentRecordsByNameSpark (#5107)
    • Added an InsertSizeDistribution class to represent expected insert-size distribution (normal and log-normal distributed) parameterized by insert size mean and stddev (#4827)
    • Added documentation clarification and additional validation to SVInterval (#5157)
    • Test and utils clean up (#5116)
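
Parameterizing a log-normal distribution by the mean and standard deviation of the insert size itself requires converting to the parameters of the underlying normal. The sketch below shows the standard conversion; it assumes the mean and stddev refer to the insert size rather than its logarithm, and it is not taken from the InsertSizeDistribution class.

```python
import math


def lognormal_params(mean, stddev):
    """Return (mu, sigma) of ln(X) for a log-normal X with the given mean and stddev."""
    sigma_sq = math.log(1.0 + (stddev / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


# Made-up library: mean insert size 300, standard deviation 50.
mu, sigma = lognormal_params(300.0, 50.0)

# Round trip back to the mean and stddev of X as a sanity check.
mean_back = math.exp(mu + sigma ** 2 / 2.0)
std_back = math.sqrt((math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2))
print(mu, sigma, mean_back, std_back)
```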
  • MarkDuplicatesSpark:

    • Switched MarkDuplicatesSpark tile-parsing code to use shorts in order to match Picard (#5165)
    • Added better error messages around missing read groups in MarkDuplicatesSpark (#5177)
  • Clone read base qualities rather than reference them directly in the read clipper code to prevent unsafe array operations (#4926)

  • Fix three bugs in the AlignmentUtils class (#3494)

    • The treatment of D-over-D in function applyCigarToCigar() was backward.
    • In function createReadAlignedToRef() the read start position passed to the leftAlignIndel() call was incorrect if the haplotype has an indel relative to reference.
    • When the leftAlignIndel() call drops any leading D operator in the result cigar, the read start position needs to be adjusted accordingly.
  • Test infrastructure improvements:

    • Split out gatk-testUtils as a separate artifact in our build system (#5112)
    • Skip push builds if there is a pull request (cuts down on total number of travis builds by about half) (#5156)
    • We now share the test settings between the main build and the docker tests (#5155)
  • Documented use of --temp-dir with GenomicsDBImport. (#5047)

  • Deleted obsolete experimental tool MarkDuplicatesGATK in favor of MarkDuplicatesSpark (#5166)

  • Deleted obsolete experimental tool BaseRecalibratorSparkSharded (#5192)

  • Upgraded htsjdk to version 2.16.1 (#5168)

  • Upgraded Picard to version 2.18.13. (#5173)