Releases: broadinstitute/viral-ngs
v1.19.3
New:
- WDL workflow for Genbank submission [#797, #793, #800]
- metagenomics.py taxlevel_summary to tabulate taxonomic abundance data from multiple Kraken-format summary files. [#792]
- WDL workflow spikein to report spike-ins [#796]
- new contigs WDL workflow that runs depletion, SPAdes, and has a placeholder for future contig-based taxonomic classification steps to be added later [#796]
- sample name is now reported in Krona reports as the "root"-level name [#785]
Changed:
- WDL workflows for depletion now default to use an hg19 BWA database instead of an hg19 bmtagger database [#796]
- WDL assembly.scaffold: change name of final output file from {sample_name}.scaffold.fasta to {sample_name}.scaffolded_imputed.fasta [#796]
Fixed:
- Fixes and updates to tbl_transfer_prealigned --oob_clip behavior. [#807]
- Fixes version mismatch for tbl2asn spec [#787]
- Addresses edge cases of tbl2asn usage [#797]
- In Snakemake workflow, skip blank lines in samples-*.txt files [#794]
- bugfix ncbi annotation step in Snakemake workflow for Genbank submission prep [#786]
Added/Upgraded:
v1.19.2
New:
- illumina.py illumina_demux now supports simplex runs: non-multiplexed flowcells that do not have a SampleSheet and have no B entries in their read_structure [#776, #759]. Support has also been added for iSeq/FireFly-style flowcell IDs. Simplex-basecalling functionality is not yet tested at the pipeline level (Snakemake or WDL).
Changed:
- Filename cleanups: more assembly WDL tasks now strip off suffixes left by previous steps (e.g. "taxfilt" or "assembly1-trinity") before appending their new suffixes, resulting in cleaner filenames throughout the assembly workflow. The final refine_2x_and_plot step now creates its final output filename based on the input reads bam instead of the input fasta (as in the pre-WDL dx workflows) and calls the final assembly samplename.fasta (instead of samplename.taxfilt.assembly1-trinity.scaffold.refine2.fasta). [#783]
- Final assembly output file variable is now final_assembly_fasta instead of refine2_assembly_fasta. [#783]
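The suffix cleanup described above can be sketched roughly as follows. This is only an illustration in Python, not the code used by the WDL tasks: the function names `clean_basename` and `final_assembly_name` and the `KNOWN_SUFFIXES` list are hypothetical, with suffix values taken from the filenames quoted in these notes.

```python
import os

# Hypothetical list of suffixes left behind by earlier steps; the values are
# taken from the example filenames in the release notes.
KNOWN_SUFFIXES = ("taxfilt", "assembly1-trinity", "scaffold", "refine2")


def clean_basename(path, extension):
    """Return the file's base name with `extension` and known step suffixes removed."""
    name = os.path.basename(path)
    if extension and name.endswith(extension):
        name = name[: -len(extension)]
    # Peel suffixes off the right-hand end until none of the known ones match.
    stripped = True
    while stripped:
        stripped = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith("." + suffix):
                name = name[: -(len(suffix) + 1)]
                stripped = True
    return name


def final_assembly_name(reads_bam):
    """Name the final assembly after the input reads bam (assumed naming rule)."""
    return clean_basename(reads_bam, ".bam") + ".fasta"
```

Deriving the name from the reads bam rather than the input fasta means the result does not depend on which intermediate steps produced the fasta.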
Fixed:
- The WDL refine_2x_and_plot combined task (introduced in the last month as an optimization to reduce instance spin up and staging time) was computing the assembly_length and assembly_length_unambiguous numbers incorrectly: it was computing on the input fasta (from scaffolding), not the output / final assembly. It was also accidentally invoking the plot_coverage python command twice. [#783]
- bmtagger in deplete_human was randomly memory-starved and would fail with "sacrifice child" errors. We have increased bmtagger (srprism)'s RAM limit in the WDL invocation to 90% RAM, up from 50% (since it never runs simultaneously with Java or anything else). [#781, #780]
- Allow Picard SamToFastq more RAM during rmdup_mvicuna_bam [#771]
- bugfix in taxon_filter.py deplete_human backwards compatibility wrapper around deplete [#775]
- bugfix argparse setup for standalone command taxon_filter.py deplete_bwa_bam (was referring to non-existent argument --JVMmemory) [#778]
- No longer silently swallow runtime errors during scaffolding due to a bug in how we parallelized it. [#772]
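The scaffolding fix above concerns a general pitfall of parallel code: an exception raised in a worker is stored on its future and goes unnoticed unless the result is requested. The sketch below illustrates this with Python's standard `concurrent.futures` module. It does not reproduce how viral-ngs parallelizes scaffolding; `scaffold_one` and `scaffold_all` are hypothetical stand-ins.

```python
import concurrent.futures


def scaffold_one(reference):
    """Hypothetical stand-in for scaffolding against a single reference."""
    if reference == "bad_ref":
        raise RuntimeError("scaffolding failed for " + reference)
    return reference + ".scaffolded"


def scaffold_all(references):
    """Run scaffold_one on every reference in parallel and surface any failure."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(scaffold_one, ref) for ref in references]
        # Calling result() re-raises an exception stored on the future.
        # Skipping this call is one way errors end up silently swallowed.
        return [future.result() for future in futures]
```

Collecting every `result()` keeps the output in input order and guarantees that a failed worker stops the run instead of producing a partial result.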
Added/upgraded:
- upgrade from mummer3 to mummer4, which fixes a few odd bugs, speeds up scaffolding, increases the genome size we can scaffold to, while producing otherwise identical output. [#772, #677]
- bump Picard from 2.13 to 2.17.5 [#774]
- bump Cromwell from v29 to v30.2 [#773]
- bump dxWDL from 0.58.1 to 0.59 [#782]
- bump wdltool from 0.14 to cromwell/womtool 30.2 [#782]
- WDL pipeline: uses GNU Parallel to parallelize up to nproc invocations of single-threaded Krona in the kraken WDL task. Addresses the observation that the Krona for loop was sometimes taking several times longer than Kraken itself. Also parallelizes the single-threaded tar/gzip calls after Krona. [#781]
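The WDL task itself uses GNU Parallel; the same idea of replacing a sequential loop with up to one concurrent process per core can be sketched in Python for comparison. This is an analog under stated assumptions, not the pipeline's code: `run_in_parallel` is a hypothetical helper, and the demo commands are placeholders rather than real Krona invocations.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands, max_workers=None):
    """Run each command (an argv list) concurrently, by default one per CPU core."""
    workers = max_workers or os.cpu_count() or 1

    def run(argv):
        # check=True raises CalledProcessError if the command exits non-zero.
        return subprocess.run(argv, check=True, capture_output=True, text=True).stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order and re-raises worker errors.
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Placeholder commands; a real run would list one single-threaded
    # command per input file instead.
    demo = [[sys.executable, "-c", "print({})".format(i)] for i in range(4)]
    print(run_in_parallel(demo))
```

Threads are sufficient here because each worker only waits on an external process, so the Python interpreter lock is not a bottleneck.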
v1.19.1
New:
- [#762] The taxon_filter.py file now has a new command, deplete_bwa_bam, which uses bwa for depletion of sequence data provided in *.fasta format or pre-indexed bwa database format.
- [#762] bwa-based depletion is now available as an option in taxon_filter.py deplete via the --bwaDbs argument
Changed:
- The taxon_filter.py deplete_human command is now deprecated in favor of taxon_filter.py deplete. The deplete_human command will remain for the time being for compatibility.
- [#755, #766] Add a new align_and_plot workflow in WDL, Cromwell, DNAnexus.
Fixed:
- [#761] Fix a tar extraction bug when running within the Docker container as root
- [#765] Fix TruSight illumina indexes
- [#760] Prevent ambiguous contig alignment during scaffolding from causing hard failures (warn and proceed with remaining contigs)
- [#741] When scaffolding against multiple reference genomes, allow some to fail, as long as some succeed
Added/Upgraded:
- [#752, #750, ] DNAnexus workflows now include defaulted file parameters for various databases
- [#751] MVicuna duplicate removal is now parallelized if multiple read groups exist in the input BAM
- [#756, #767] Docker viral-baseimage upgraded from 0.1.6 (zesty) to 0.1.8 (artful) with fixes for Spectre and Meltdown
Documentation:
v1.19.0
This is a release with many changes, including new WDL pipelines, a distribution of viral-ngs on DNAnexus that will be updated in sync with the latest version of viral-ngs, the ability to provide multiple references for scaffolding, and several critical bug fixes. With this release, the Docker image for viral-ngs moves from Docker Hub to quay.io/broadinstitute/viral-ngs.
New:
- WDL pipelines have been added, inspired by the previous DNAnexus implementation of viral-ngs. The WDL files currently reside within the pipes/WDL/ directory of viral-ngs. The pipelines can be executed locally or in the Google cloud via cromwell (on bioconda), or via the public distribution available on DNAnexus.
  - WDL workflows are tested locally on Travis via Cromwell
  - WDL workflows are compiled for DNAnexus via dxWDL, and tested on DNAnexus
- a simple form of reference selection via assembly.py::order_and_orient. Scaffolding is now performed using several references (in parallel); the one that yields the most non-N bases is chosen to be used for the scaffolded genome. For the positional argument, inReference, multiple FASTA files may now be provided, each containing one reference genome. Alternatively, multiple references may be given by specifying a single filename, and giving the number of reference segments with the --nGenomeSegments parameter. If multiple references are given, they must all contain the same number of segments listed in the same order.
  - This has been included in the new WDL pipelines
- New kraken execution strategy to process multiple inputs in one run
- taxon_filter.py changes to deplete_bmtagger_bam and deplete_blastn_bam: can now accept blast/bmtagger databases as .tar.gz, .tar.lz4, .tar.bz2 bundles and also as unindexed fasta files (that will be indexed on the fly)
- new internal function util.file.extract_tarball exposed on the CLI as read_utils.py::extract_tarball. Accepts stdin piped input.
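The reference-selection rule described in this list (keep the scaffold with the most non-N bases) can be illustrated with a short sketch. It is not the code in assembly.py::order_and_orient: the function names and the dictionary input are hypothetical, and other IUPAC ambiguity codes are counted as non-N here, which is an assumption.

```python
def count_unambiguous(sequences):
    """Count bases other than N (case-insensitive) across a list of sequences."""
    return sum(len(seq) - seq.upper().count("N") for seq in sequences)


def pick_best_scaffold(scaffolds):
    """Given {reference_name: [segment sequences]}, return the name whose
    scaffold has the most non-N bases. Ties go to the first name encountered."""
    if not scaffolds:
        raise ValueError("no scaffolds to choose from")
    return max(scaffolds, key=lambda name: count_unambiguous(scaffolds[name]))
```

For example, a candidate whose segments are `["ACGTACNN", "ACGT"]` has 10 non-N bases and would be preferred over one with segments `["ACGTNNNN", "NNAC"]`, which has 6.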
Changed:
- various and extensive changes to how the viral-ngs Docker image is prepared and distributed:
  - Note: The Docker image is now available from quay.io/broadinstitute/viral-ngs, which is faster for staging than Docker Hub
  - the Docker image build process no longer relies on the easy-deploy-viral-ngs.sh script
- --threads argparse option now common and available across viral-ngs commands
- optimizations in illumina.py::illumina_demux
- illumina.py::common_barcodes execution time has been reduced
- in easy-deploy-viral-ngs.sh, some messages have been moved from stdout to stderr
- taxon_filter.py: clean up and optimization around blastn-based read depletion
- various development-related changes including:
  - travis cleanup re: pip package installs, conditionals, build matrix
  - Docker deployment bugfixes
Fixed:
- prevent reports.py::plot_coverage from removing the bam file provided as input if it is already sorted and dupe removal is not being performed. In such cases the input bam is used directly and is now preserved.
- diamond tests for accession taxonomy fixed: subprocess.PIPE replaced with named pipes to prevent deadlocks
- taxon_filter.py::bmtagger_build_db default value for word_size is now 18, not 8
- fixes the use of fasta databases for taxon_filter.py::deplete_bmtagger_bam and deplete_human
Added/Upgraded:
- pysam 0.12.0.1 -> 0.13.0
- samtools 1.5 -> 1.6
- kraken 0.10.6_fork3 -> 1.0.0_fork3
- lz4-bin 131 added as requirement
- pigz 2.3.4 added as requirement
- lbzip2 2.5 added as requirement
v1.18.2
Changed:
- Demultiplexing from Illumina basecalls is now more permissive of varying input directory
- [dev-related] conda package and Docker image are now built on each branch commit
Fixed:
- [dev-related] package and docker build now optimized for more rapid build+test
- --notemp added to Snakemake call script to support usage of --immediate-submit as required by newer Snakemake versions
- Snakemake pipeline demux fixed
- -l h_rt=hh:mm:ss spec now consistently using =
v1.18.1
v1.18.0
New:
- The Snakemake pipeline can now source database files from S3, GS, or SFTP if given protocol-prefixed paths (s3://, gs://, sftp://) and if the system is preconfigured with credentials.
- The config.yaml file has been changed to include s3://* paths for pre-built databases, rather than Broad Institute-specific paths (and files listed are live and available for all!)
- Kraken is now enabled on OSX, though significant RAM is required to use it
- The reports.py::align_and_plot_coverage and read_utils.py::align_and_fix functions now expose an optional argument, --minScoreToFilter. When using bwa, this adds an option to calculate an alignment score for each query by summing the scores across the query's alignments, and to keep only the queries whose score is at least the specified threshold.
- sample sheets can now be specified in *.csv.gz format
- For debugging or more bespoke analysis, temp files can now be kept more easily by setting the VIRAL_NGS_TMP_DIRKEEP environment variable
- The cd-hit-dup tool has been added as an alternative to mvicuna for removing duplicate reads, via a new CLI function read_utils.py::rmdup_cdhit_bam. Note that this is not currently used in the pipeline by default.
- The Gap2Seq tool has been added for filling gaps between contigs. It is exposed via the new CLI command: assembly.py::gapfill_gap2seq. Note that this is not currently used in the pipeline by default.
- The Spades assembler has been added as an alternative to Trinity for de novo assembly. Note that this is not currently used in the pipeline by default.
- Expose blastn --chunkSize in taxon_filter.deplete_human.
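The --minScoreToFilter rule described in this list (sum each query's alignment scores, keep queries whose total meets the threshold) can be sketched as below. This is an illustration only, not the viral-ngs implementation: the function name and the `(query_name, score)` input are hypothetical, and how the scores are read from bwa output is not stated in these notes.

```python
from collections import defaultdict


def filter_by_total_score(alignments, min_score):
    """Keep alignments of queries whose summed alignment score is >= min_score.

    `alignments` is a list of (query_name, score) pairs; a query may appear
    more than once if it has several alignments.
    """
    totals = defaultdict(int)
    for query, score in alignments:
        totals[query] += score
    return [(query, score) for query, score in alignments if totals[query] >= min_score]
```

With a threshold of 50, a query with two alignments scoring 30 and 25 is kept (total 55), while a query with a single alignment scoring 40 is dropped.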
Changed:
- metagenomics rules in the Snakemake pipeline now break out kraken files as separate targets
- improvements to speed of automated tests
- The source and binaries for mvicuna and v-phaser2 have been removed from this repository since they now reside in their own repositories
- viral-ngs is no longer tested against or distributed for Python 3.4, from this release forward. This should not impact users since the package is typically installed in an isolated conda environment with Python 3.5 or 2.7.
- The Snakemake rules and cluster-submitter have been updated to reflect changes to the UGER cluster system at the Broad Institute, which now requires that -l h_rt hh:mm:ss be passed to schedule max runtime for each job
- performance improvements to lastal filtering
- lastal database is now built automatically if supplied pre-built
- SPAdes wrapper more resilient to empty fastq inputs
- Reimplement samtools.filterByCigarString using pysam instead of samtools
- Kraken on OSX now exists on broad-viral: enable it in OSX git hooks and turn on all tests
- Remove lastal optional outputs from taxon_filter.deplete_human
Fixed:
- In the Snakemake pipeline, code that reads sample sheets and barcode files is now more tolerant of different formats, including files formatted with Windows-style newlines (\r\n for Windows vs. \n for Linux/Unix/macOS)
- fixed handling of empty subtrees when importing *.yaml files within *.yaml config files (for config includes/composition)
- fixed other edge cases related to config imports
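As an illustration of the newline tolerance mentioned in this list, a sample list can be read in a way that does not care which line-ending convention the file uses. This is a minimal sketch, not the Snakemake pipeline's code; `read_sample_names` and `read_sample_file` are hypothetical helpers, and UTF-8 encoding is assumed.

```python
def read_sample_names(text):
    """Return the non-empty, whitespace-trimmed lines of `text`.

    Works for LF, CRLF, and CR line endings alike.
    """
    # str.splitlines() splits on all common newline conventions.
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_sample_file(path):
    """Read a sample list from `path`, assuming UTF-8 encoded text."""
    # Binary mode avoids any platform-specific newline translation.
    with open(path, "rb") as handle:
        return read_sample_names(handle.read().decode("utf-8"))
```

Dropping empty lines in the same pass also covers files that end with trailing blank lines.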
Upgraded:
- last 719 -> 876
- Update samtools to 1.5
- Update pysam to 0.12.0.1
v1.17.3
v1.17.2
Changed:
- Improved HiSeq X / HiSeq 4000 compatibility: broken symlinks are now removed from Illumina lane directories if present. This is helpful for HiSeq-X/4000 systems, which write out a single s.locs file for cluster locations rather than per-tile location files. Picard's CheckIlluminaDirectory can create symlinks that take the place of per-tile *.locs files; however, these links can break when runs are moved between systems. The change in this release of viral-ngs allows broken links to be removed and corrected.
- CheckIlluminaDirectory is now called on each call to illumina_demux to check run directories for validity prior to demultiplexing
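Detecting and removing broken symlinks, as described above, can be sketched with the Python standard library. This is a generic illustration rather than the viral-ngs code: `remove_broken_symlinks` is a hypothetical helper, and it only removes dangling links; recreating them is left to a later step.

```python
import os


def remove_broken_symlinks(directory):
    """Delete symlinks under `directory` whose targets no longer exist.

    Returns the sorted list of removed paths.
    """
    removed = []
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            path = os.path.join(root, name)
            # islink() is true for any symlink; exists() follows the link and
            # is false when the target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Links whose targets still resolve are left untouched, so the function is safe to run on a run directory that has not been moved.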