Skip to content

Workflow (running a pipeline)

sprokopec edited this page Jun 13, 2024 · 12 revisions

Running a pipeline

1. Download the latest version of the pipeline

cd /cluster/home/username/git/
git clone https://github.com/pughlab/pipeline-suite/

2. Run FASTQC to verify fastq quality

module load perl

perl ~/git/pipeline-suite/collect_fastqc_metrics.pl \
-d /path/to/fastq_config.yaml \
-t /path/to/fastqc_tool_config.yaml \
-c slurm \
{optional: --rna, --dry-run }

Be sure to run FASTQC to verify fastq quality prior to running downstream pipelines. In particular, ensure read length is consistent, GC content is similar (typically between 40-60%) and files are unique (no duplicated md5sums). Check here for example output.

3. Prepare interval files (ie, for WXS)

For WXS or targeted-sequencing panels, a bed file containing target regions should be provided (listing at minimum: chromosome, start and end positions). Variant calling pipelines MuTect and Mutect2 will add 100bp of padding to each region provided. For consistency, this padding must be manually added prior to variant calling with other tools (ie, Strelka, SomaticSniper, VarDict and VarScan). This function will additionally create a bgzipped version of the padded interval file required by Strelka.

module load perl

perl ~/git/pipeline-suite/scripts/format_intervals_bed.pl \
-b /path/to/base/intervals.bed \
-r /path/to/reference.fa

Make sure you have write permissions on the directory containing the intervals bed file as this will write output files to the same directory as the original bed file! If you don't have write permissions, create a symlink in a folder you do have permission for (ie, the project directory) and use that.

4. Run DNA (or RNA) pipeline

I suggest always first running the pipeline with the --dry-run flag. This will generate the directory structure and output the individual run scripts so that you can verify that a) that you have permissions to write here and b) that there are no errors/missing values/typos in your config files. Once everything looks set, run the pipeline again without --dry-run.

module load perl

# Run the DNA-Seq pipeline as a full block
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--preprocessing \
--qc \
--variant_calling \
--summarize \
--create_report \
-c slurm \
--remove \
--dry-run { optional }

# Run the DNA-Seq pipeline as separate stages (ie, preprocessing first, then once that's complete, run the QC and variant calling stages)
perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--preprocessing \
-c slurm \
--remove \
--dry-run { optional }

perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_gatk_bam_config.yaml \
--qc \
-c slurm \
--remove \
--dry-run { optional }

perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_gatk_bam_config.yaml \
--variant_calling \
-c slurm \
--remove \
--dry-run { optional }

perl ~/git/pipeline-suite/pughlab_dnaseq_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_gatk_bam_config.yaml \
--summarize \
-c slurm \
--remove \
--dry-run { optional }

# Run the RNA-Seq pipeline as a full block
perl pughlab_rnaseq_pipeline.pl \
-t /path/to/rna_pipeline_config.yaml \
-d /path/to/fastq_rna_config.yaml\
--summarize \
-c slurm \
--remove \
--dry-run { optional }

# Run the EM-Seq pipeline as a full block
perl pughlab_rnaseq_pipeline.pl \
-t /path/to/emseq_pipeline_config.yaml \
-d /path/to/fastq_ems_config.yaml\
--fastq_prep \
--alignment \
--qc \
--analysis \
-c slurm \
--remove \
--dry-run { optional }

This will generate the directory structure in the output directory (provided in your tool config), including a "logs/run_{D|R}NA_pipeline_TIMESTAMP/" directory containing a file "run_{D|R}NASeq_pipeline.log" which lists the individual tool commands; these can be run separately if "--dry-run" was set, or in the event of a failure at any stage and you don't need to re-run the entire thing (Note: doing so should not regenerate files that already exist).

Resuming a run:

If the initial run is unsuccessful or incomplete, check the logs to identify the problem - it is most likely due to insufficient memory or runtime allocation. In this case, update the necessary parameters for the affected stage in the tool config (dna_pipeline_config.yaml or rna_pipeline_config.yaml).

Now, rerun individual tool commands (as found in /path/to/output/directory/logs/TIMESTAMP/run_{D|R}NASeq_pipeline.log).

Additional Notes:

The following tools will produce a Panel of Normals if normal samples are provided:

  • only if requested (--create-panel-of-normals): MuTect (v1), MuTect2, Strelka
  • by default, unless an external panel is provided using --pon, or in dna_pipeline_config.yaml: VarDict, VarScan
  • SomaticSniper and Pindel will not output germline variants and will therefore not produce a panel of normals

The following tools will only run on T/N pairs:

  • ContEst (Note: ContEst is an older tool and has been mostly replaced by GATK's CalculateContaminatio (part of get_sequencing_metrics.pl). However, ContEst is still useful to estimate contamination at the lane level)
  • SomaticSniper, Sequenza, NovoBreak (if running a tumour-only cohort you can set run: N for each of these tools in dna_pipeline_config.yaml, however not doing so should cause any failures)

The following tools require at least 1 normal sample to estimate baseline distribution (set run: N for each of these tool in dna_pipeline_config.yaml if running a tumour-only cohort):

  • GATK:CNV, MSI_Sensor, panelCN.mops

The following tools will run on tumour-only samples:

  • requires a Panel of Normals: MuTect (v1), MuTect2
  • with or without a Panel of Normals (but with is preferred): Strelka, VarDict, VarScan (SNV/INDELs only)
  • no restrictions: HaplotypeCaller/GenotypeGVCFs/CPSR, Manta, Delly, Pindel, SViCT, IchorCNA

Other restrictions:

  • SViCT will only run on targeted-sequencing from ctDNA.