Skip to content

Running individual tools

sprokopec edited this page Jun 13, 2024 · 12 revisions

Running individual steps

Sometimes, you may only want to run one or a few steps, rather than the full pipeline (ie, alignment), or you may already have BAMs (aligned elsewhere) and want to run a specific variant calling tool (ie, mutect2).

Note: to process BAMs produced elsewhere, you MUST have the identical reference used for alignment OR be prepared to subset each BAM to contigs present in your desired reference file (ie, you can not process BAMs aligned by TGL using the hg38 reference on H4H [igenome-human/hg38]!!)

In all cases, tools will write individual commands to file: /path/to/output/directory/TOOL/logs/run_tool_step_sample/script.sh

Pre-processing steps

run BWA to align to a reference genome

perl bwa.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/fastq_dna_config.yaml \
-o /path/to/output/directory/BWA \
-b /path/to/output/bam.yaml \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

once BWA has finished, run GATK indel realignment and base quality score recalibration

perl gatk.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/bwa_bam_config.yaml \
-o /path/to/output/directory/GATK \
-b /path/to/output/bam.yaml \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

QC steps

once GATK has finished, get BAM QC metrics, including coverage, contamination estimates and callable bases

perl contest.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/BAMQC/ContEst \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

perl get_sequencing_metrics.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/BAMQC/SequenceMetrics \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

perl get_coverage.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/BAMQC/Coverage \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Variant calling steps

Germline SNV/INDELs

run GATK's HaplotypeCaller to produce gvcfs


perl haplotype_caller.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/HaplotypeCaller \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

once HaplotypeCaller has finished, run variant recalibration


perl genotype_gvcfs.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/HaplotypeCaller \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

once recalibration is complete, annotate and filter using CPSR


perl annotate_germline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-i /path/to/output/directory/HaplotypeCaller/cohort \
-o /path/to/output/directory/HaplotypeCaller \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Somatic Variants

run GATK's MuTect (v1) to produce somatic SNV calls

# Create a panel of normals (germline calls + sequencing artefacts)
# will only run if normal samples are available
perl mutect.pl \
--create-panel-of-normals \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/MuTect \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# Generate somatic SNV calls
# can be run on T/N pairs or tumour-only samples (using panel of normals)
perl mutect.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/MuTect \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here: can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run GATK's MuTect2 to produce somatic SNV calls

# Create a panel of normals (germline calls + sequencing artefacts)
# will only run if normal samples are available
perl mutect2.pl \
--create-panel-of-normals \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/MuTect2 \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# Generate somatic SNV and INDEL calls
# can be run on T/N pairs or tumour-only samples (using panel of normals)
perl mutect2.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/MuTect2 \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here: can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

A few notes about Mutect2: Some samples (WGS and some WXS samples) will have exceptionally long run times. One solution is to run Mutect2 per-chromosome, however this alters the statistical models applied and thus may produce a different set of variants. Currently, stringent filtering plus the final ensemble approach typically make these differences irrelevant, but it is something to be aware of.

run SomaticSniper to produce SNV calls

# will only run on T/N pairs
perl somaticsniper.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/SomaticSniper \
--pon /path/to/panel_of_normals.vcf { optional; can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

SomaticSniper will ONLY run on tumour samples with a matched normal, and will ONLY produce somatic SNV calls (no panel of normals will be generated for this caller). If you wish to perform additional germline filtering, you may provide a panel of normals developed elsewhere.

run VarDict to produce variant calls

# for WXS or smaller, targeted-panel datasets
# can be run on T/N pairs or tumour-only samples (using panel of normals)
perl vardict.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/VarDict \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here; can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# or, for WGS, will split by chromosome
# can be run on T/N pairs or tumour-only samples (using panel of normals)
perl vardict_wgs.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/VarDict \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here; can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run Manta to find candidate Indels (required for Strelka somatic variant detection)

# run Manta
perl manta.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/Manta \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# run Strelka to create a panel of normals (germline calls)
# will only run if normal samples are available
perl strelka.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/Strelka \
--create-panel-of-normals \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# run Strelka to produce somatic variant calls
# can be run on T/N pairs or tumour-only samples (using panel of normals)
perl strelka.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-m /path/to/output/directory/Manta \
-o /path/to/output/directory/Strelka \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here: can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run VarScan to produce SNV and CNA calls

# Run T/N pairs to generate CNA calls, plus germline and somatic SNV and INDEL calls.
# This will create a panel of normals from the germline calls (only if not provided) and 
# finally, will run T-only samples with germline filtering using the panel of normals
perl varscan.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/VarScan \
--pon /path/to/panel_of_normals.vcf { optional if not using the one created here: can also be specified in dna_pipeline_config.yaml if created elsewhere } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# Use VarScan output to run Sequenza with gamma tuning for optimized SCNA calls
# will only run on T/N pairs
perl run_sequenza_with_optimal_gamma.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/VarScan \
-c slurm \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run GATK:CNV to produce CNA calls

# Generate somatic CNV calls
# can be run on T/N pairs or tumour-only samples BUT requires at least 1 normal to estimate baseline distribution
perl gatk_cnv.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/GATK_CNV \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# Generate germline CNA calls
perl gatk_gcnv.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/GATK_CNV \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

GATK:CNV will ONLY run if at least one normal sample is provided, but will then run all provided samples (T/N and tumour-only).

run Delly to produce SV calls

# Generate somatic SV calls
# can be run on T/N pairs or tumour-only samples
perl delly.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/Delly \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

# Generate germline SV calls
perl delly.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/Delly \
--germline \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Note: germline SV calling by Delly has not been thoroughly tested! Please report any issues encountered.

run NovoBreak to produce somatic SV calls

# will only run on T/N pairs
perl novobreak.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/NovoBreak \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

NovoBreak will ONLY run on tumour samples with a matched normal.

run Pindel to produce somatic INDEL and SV calls

# can be run on T/N pairs or tumour-only samples
perl pindel.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/Pindel \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Pindel is VERY resource intensive. For this reason, WGS and WXS projects are split by chromosomes (and in some cases by chromosome arms), and we will skip all telomere and centromere regions.

run SViCT to produce somatic SV calls

# for targeted-ctDNA cohorts only
perl svict.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/bam_config.yaml \
-o /path/to/output/directory/SViCT \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run ASCAT to produce tumour fraction and CNA calls

# will only run on T/N pairs
perl ascat.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/ASCAT \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

ASCAT will ONLY run on tumour samples with a matched normal.

run IchorCNA to produce tumour fraction and CNA calls


perl ichor_cna.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/IchorCNA \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

run panelCN.mops to produce somatic CNA calls

# only runs on targeted-panels (seq_type: targeted)
perl panelcn_mops.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/panelCNmops \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

panelCN.mops will ONLY run if at least one normal sample is provided, but will then run all provided samples (T/N and tumour-only).

run Mavis to annotate Delly and Manta SV calls

# can be run on T/N pairs or tumour-only samples
perl mavis.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_gatk_bam_config.yaml \
-o /path/to/output/directory/Mavis \
--manta /path/to/strelka/directory \
--delly /path/to/delly/directory \
--novobreak /path/to/novobreak/directory \
--pindel /path/to/pindel/directory \
--svict /path/to/svict/directory \
--rna /path/to/gatk_rnaseq_bam_config.yaml { optional if pughlab_rnaseq_pipeline.pl was run previously } \
--arriba /path/to/arriba/directory { optional if pughlab_rnaseq_pipeline.pl was run previously } \
--starfusion /path/to/starfusion/directory { optional if pughlab_rnaseq_pipeline.pl was run previously } \
--fusioncatcher /path/to/fusioncatcher/directory { optional if pughlab_rnaseq_pipeline.pl was run previously } \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Mavis will also accept fusion calls generated from RNA-Seq data (using STAR-Fusion or FusionCatcher) but the input data must be in the same format as would be output by pughlab_rnaseq_pipeline.pl

Other tools

run MSI-Sensor

# can be run on T/N pairs or tumour-only samples BUT requires at least 1 normal to estimate baseline distribution
perl msi_sensor.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
-o /path/to/output/directory/MSI \
-c slurm \
--remove \
--dry-run { if this is a dry-run } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

Summarize

summarize output and create final report

perl pughlab_pipeline_auto_report.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/gatk_bam_config.yaml \
--create_report \
-c slurm \
--dry-run { if this is a dry-run; NOTE that this will fail if the above pipeline has not completed } \
--no-wait { if not a dry-run and you don't want to wait around for it to finish }

This report generator is still a work in progress and may not run correctly if only a subset of tools were run!