Skip to content

Workflow (running a pipeline)

sprokopec edited this page Sep 17, 2021 · 12 revisions

Running a pipeline

  1. Download the latest version of the pipeline
cd /cluster/home/username/git/
git clone https://github.com/pughlab/pipeline-suite/
  1. Run FASTQC to verify fastq quality
module load perl

perl ~/git/pipeline-suite/collect_fastqc_metrics.pl \
-d /path/to/fastq_config.yaml \
-t /path/to/fastqc_tool_config.yaml \
-c slurm \
{optional: --rna, --dry-run }

Be sure to run FASTQC to verify fastq quality prior to running downstream pipelines. In particular, ensure read length is consistent, GC content is similar (typically between 40-60%) and files are unique (no duplicated md5sums). Check here for example output.

  1. Prepare interval files (ie, for WXS)

For WXS or targeted-sequencing panels, a bed file containing target regions should be provided (listing at minimum: chromosome, start and end positions). Variant calling pipelines MuTect and Mutect2 will add 100bp of padding to each region provided. For consistency, this padding must be manually added prior to variant calling with other tools (ie, Strelka, SomaticSniper, VarDict and VarScan). This function will additionally create a bgzipped version of the padded interval file required by Strelka.

module load perl

perl ~/git/pipeline-suite/scripts/format_interval_bed.pl \
-b /path/to/base/intervals.bed \
-r /path/to/reference.fa

Make sure you have write permissions on the directory containing the intervals bed file as this will write output files to the same directory as the original bed file! If you don't have write permissions, create a symlink in a folder you do have permission for (ie, the project directory) and use that.

  1. Run DNA (or RNA) pipeline
module load perl

# DNA
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--preprocessing \
--variant_calling \
--summarize \
-c slurm \
--remove \
--dry-run { optional }

# RNA
perl pughlab_rnaseq_pipeline.pl \
-t /path/to/rna_pipeline_config.yaml \
-d /path/to/fastq_rna_config.yaml\
--summarize \
-c slurm \
--remove \

This will generate the directory structure in the output directory (provided in /path/to/dna_pipeline_config.yaml or /path/to/rna_pipeline_config.yaml), including a "logs/run_{D|R}NA_pipeline_TIMESTAMP/" directory containing a file "run_{D|R}NASeq_pipeline.log" which lists the individual tool commands; these can be run separately if "--dry-run" was set, or in the event of a failure at any stage and you don't need to re-run the entire thing (Note: doing so would not regenerate files that already exist).

Resuming a run:

If the initial run is unsuccessful or incomplete, check the logs to identify the problem - it is most likely due to insufficient memory or runtime allocation. In this case, update the necessary parameters for the affected stage in the dna_pipeline_config.yaml or rna_pipeline_config.yaml .

Now, rerun individual tool commands (as found in /path/to/output/directory/logs/TIMESTAMP/run_{D|R}NASeq_pipeline.log).

DNA Pipeline Notes:

The following tools will produce a Panel of Normals if normal samples are provided:

  • only if requested (--create-panel-of-normals): MuTect (v1), MuTect2, Strelka
  • by default, unless an external panel is provided (--pon, or in dna_pipeline_config.yaml): VarDict, VarScan

The following tools will only run on T/N pairs:

  • ContEst (Note, ContEst is and older tool and contamination estimates for both T/N and tumour-only samples are provided by get_sequencing_metrics.pl)
  • SomaticSniper, Sequenza, NovoBreak (set run: N for each of these tools in dna_pipeline_config.yaml if running a tumour-only cohort)

The following tools require at least 1 normal sample to estimate baseline distribution (set run: N for each of these tool in dna_pipeline_config.yaml if running a tumour-only cohort):

  • GATK:CNV, MSI_Sensor

The following tools will run on tumour-only samples:

  • requires a Panel of Normals: _MuTect (v1), MuTect2 _
  • with or without a Panel of Normals (but with is preferred): Strelka, VarDict, VarScan (SNV/INDELs only)
  • no restrictions: HaplotypeCaller/GenotypeGVCFs/CPSR, Manta, Delly
Clone this wiki locally