Input files

sprokopec edited this page Jun 13, 2024 · 6 revisions

Pipeline Configuration

The pipeline-suite runs using parameters provided in YAML format. There are two types of required configuration files:

Tool configs:

  • these specify common parameters, including:
    • project name
    • sequencing type (wgs, exome, rna or targeted), sequencing centre and platform
    • is_ctdna (true or false)
    • hpc_group: the group to use for job submission (i.e., the group passed to Slurm's -A argument)
    • ref_type (i.e., hg19 or hg38)
    • path to desired output directory (will be created if this is the initial run)
    • paths to tool-specific reference files/directories
    • desired versions of tools
  • and, for each tool, memory and run time parameters for each step

Examples: dna_pipeline_config.yaml and rna_pipeline_config.yaml
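
As a rough illustration of the common parameters listed above, the following Python sketch checks a tool config that has already been loaded into a dictionary. The key names is_ctdna, hpc_group and ref_type come from the list above; seq_type is a hypothetical key name chosen for this example (check the example configs for the real one), and the function is not part of pipeline-suite.

```python
# Sketch only: validates a few of the common tool-config parameters.
# 'seq_type' is a hypothetical key name; the allowed values come from the list above.
ALLOWED_SEQ_TYPES = {"wgs", "exome", "rna", "targeted"}
ALLOWED_REF_TYPES = {"hg19", "hg38"}


def check_common_params(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    if config.get("seq_type") not in ALLOWED_SEQ_TYPES:
        problems.append("seq_type must be one of: " + ", ".join(sorted(ALLOWED_SEQ_TYPES)))
    if config.get("ref_type") not in ALLOWED_REF_TYPES:
        problems.append("ref_type must be one of: " + ", ".join(sorted(ALLOWED_REF_TYPES)))
    if not isinstance(config.get("is_ctdna"), bool):
        problems.append("is_ctdna must be true or false")
    if not config.get("hpc_group"):
        problems.append("hpc_group is required for job submission")
    return problems


# 'mygroup' is a placeholder, not a real HPC group.
example = {"seq_type": "exome", "ref_type": "hg38", "is_ctdna": False, "hpc_group": "mygroup"}
print(check_common_params(example))  # prints []
```

Running a check like this before submitting jobs can catch a mistyped value early, but the pipeline's own checks remain the authority on what is valid.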

Data configs:

There are three ways to produce the data config file:

  1. use create_data_yaml__tgl.R (if filenames follow a specific format, i.e., the one used by TGL)
  2. use create_fastq_yaml.pl (if filenames are variable but contain the sample ID)
  3. write the file yourself using the format described below (can be tedious!)

1) using create_data_yaml__tgl.R

Fastq files from TGL are expected to be named like this example: TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz

The script splits the filename to extract the patient ID, sample ID, sample type, library ID and lane ID:

  • patient ID: TGL01_0001
  • sample ID: TGL01_0001_Ov_P
  • sample type: P [where R is interpreted to be reference and anything else is tumour]
  • library ID: TGL01_0001_Ov_P_PE_420_EX
  • lane ID: 210220_A00469_0160_BAVYWVDTXY_1
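
The split described above can be sketched in Python as follows. This is not the code in create_data_yaml__tgl.R; the field positions are inferred from the single example filename, and the function name and dictionary keys are hypothetical.

```python
import os


def parse_tgl_fastq_name(path):
    """Split a TGL-style fastq filename into the IDs described above."""
    # Assumption: underscore-separated fields sit at the same positions
    # as in the example filename (14 fields in total).
    fields = os.path.basename(path).split("_")
    if len(fields) < 14:
        raise ValueError("unexpected TGL filename: " + path)
    sample_type = fields[3]
    return {
        "patient_id": "_".join(fields[0:2]),
        "sample_id": "_".join(fields[0:4]),
        # 'R' marks the reference (normal); anything else is treated as tumour
        "type": "normal" if sample_type == "R" else "tumour",
        "library_id": "_".join(fields[0:7]),
        "lane_id": "_".join(fields[7:12]),
    }


name = "TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz"
info = parse_tgl_fastq_name(name)
print(info["patient_id"], info["sample_id"], info["type"])
print(info["library_id"], info["lane_id"])
```

Filenames that deviate from this layout (for example, an extra underscore in the patient ID) would be split incorrectly, which is one reason option 2 exists.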

The same script can also create a YAML for existing BAM files, which should be named like this example: TGL01_0001_Ov_P_WG_EXTERNALID.filter.deduped.recalibrated.bam

  • patient ID: TGL01_0001
  • sample ID: TGL01_0001_Ov_P
  • sample type: P [where R is interpreted to be reference and anything else is tumour]

Create the YAML file:

module load R

Rscript /path/to/pipeline_suite/scripts/create_data_yaml__tgl.R \
-d /path/to/fastq/directory \
-o /path/to/output_fastq_config.yaml \
-t fastq # or bam

2) using create_fastq_yaml.pl

Note: create_fastq_yaml.pl requires all fastq files to be in a single, flat directory (/path/to/all/fastq/*.fastq.gz) and will attempt to link fastqs to samples using the Sample.ID values provided in sample_info.txt.

Prepare sample_info.txt, a tab-separated text file (with a header) listing Patient ID, Sample ID and Sample Type:

Patient.ID Sample.ID Type
SMP-001 SMP-001-T tumour
SMP-001 SMP-001-N normal
SMP-002 SMP-002-T1 tumour
SMP-002 SMP-002-T2 tumour
SMP-002 SMP-002-N normal
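
To check that a sample_info.txt file parses as intended before running the script, a small Python sketch like the one below can help. It assumes the three column names shown above and tab separators; the function name is hypothetical and the demo reads from an in-memory string rather than a real file.

```python
import csv
import io


def read_sample_info(handle):
    """Group samples by patient: {patient_id: {sample_id: type}}."""
    patients = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        patients.setdefault(row["Patient.ID"], {})[row["Sample.ID"]] = row["Type"]
    return patients


# In practice, pass open("/path/to/sample_info.txt", newline="") instead.
demo = io.StringIO(
    "Patient.ID\tSample.ID\tType\n"
    "SMP-001\tSMP-001-T\ttumour\n"
    "SMP-001\tSMP-001-N\tnormal\n"
)
print(read_sample_info(demo))
```

A KeyError here usually means the header is missing or the file is space-separated rather than tab-separated.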

For example, suppose a fastq file follows the TGL naming format: TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz

The corresponding (tab-separated) entry in sample_info.txt would then be: TGL01_0001 TGL01_0001_Ov_P tumour
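
The exact matching rule used by create_fastq_yaml.pl is not described here. As a sketch of the general idea (filenames that contain the sample ID), the Python below picks the longest Sample.ID found in a filename, so that an ID such as SMP-002-T does not capture files belonging to SMP-002-T1. The longest-match rule, the function name and the filenames are assumptions made for illustration.

```python
def match_sample(filename, sample_ids):
    """Return the longest Sample.ID contained in the filename, or None."""
    hits = [sid for sid in sample_ids if sid in filename]
    return max(hits, key=len) if hits else None


# Hypothetical sample IDs and filenames, for illustration only.
sample_ids = ["SMP-002-T", "SMP-002-T1", "SMP-002-N"]
for fastq in ["SMP-002-T1_L001_R1.fastq.gz", "SMP-002-N_L001_R1.fastq.gz", "unknown_R1.fastq.gz"]:
    print(fastq, "->", match_sample(fastq, sample_ids))
```

Choosing sample IDs that are not substrings of one another avoids this ambiguity altogether.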

Create the YAML file:

module load perl

perl /path/to/pipeline_suite/scripts/create_fastq_yaml.pl \
-d /path/to/fastq/directory \
-o /path/to/output_fastq_config.yaml \
-i /path/to/sample_info.txt \
-t dna # or rna

Example output:

---
TGL01_0001:
    TGL01_0001_Ov_P:
        type: tumour
        libraries:
            TGL01_0001_Ov_P_PE_420_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0001_Ov_R:
        type: normal
        libraries:
            TGL01_0001_Ov_R_PE_420_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_R_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_R_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
TGL01_0002:
    TGL01_0002_Ov_P:
        type: tumour
        libraries:
            TGL01_0002_Ov_P_PE_465_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_P_PE_465_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_P_PE_465_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0002_Ov_M:
        type: tumour
        libraries:
            TGL01_0002_Ov_M_PE_845_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
                    210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0002_Ov_R:
        type: normal
        libraries:
            TGL01_0002_Ov_R_PE_333_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_R_PE_333_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_R_PE_333_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
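
For option 3 (writing the file yourself), the nesting shown above (patient, sample, type and libraries, runlanes, fastq) can be sketched in Python as below. Both function names are hypothetical, the fastq paths are placeholders, and the tiny emitter only handles nested dictionaries of plain strings; a real YAML library would be the safer choice for anything beyond a quick check.

```python
def add_fastq_pair(config, patient, sample, sample_type, library, runlane, r1, r2):
    """Insert one paired-end runlane into the nested data-config dictionary."""
    sample_entry = config.setdefault(patient, {}).setdefault(
        sample, {"type": sample_type, "libraries": {}}
    )
    lanes = sample_entry["libraries"].setdefault(library, {"runlanes": {}})["runlanes"]
    lanes[runlane] = {"fastq": {"R1": r1, "R2": r2}}
    return config


def to_yaml_lines(node, indent=0):
    """Render nested dicts of strings as indented 'key: value' lines."""
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            lines.extend(to_yaml_lines(value, indent + 4))
        else:
            lines.append(" " * indent + f"{key}: {value}")
    return lines


config = {}
add_fastq_pair(
    config, "TGL01_0001", "TGL01_0001_Ov_P", "tumour",
    "TGL01_0001_Ov_P_PE_420_EX",
    "210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT",
    "/path/to/sample_R1.fastq.gz", "/path/to/sample_R2.fastq.gz",  # placeholder paths
)
print("---")
print("\n".join(to_yaml_lines(config)))
```

Calling add_fastq_pair again with the same library but a different runlane adds a second lane under that library, as in the TGL01_0002_Ov_M entry above.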

Note: neither script handles single-end (SE) data. To provide SE reads to the aligner, write the data config by hand in this format:

---
SMP-003:
    SMP-003-T:
        type: tumour
        libraries:
            LIBRARY_NAME_SE:
                runlanes:
                    LANE_NAME:
                        fastq:
                            SE: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/SAMPLE_ID_LANE_NAME.SE.fastq.gz
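
When mixing hand-written SE entries with generated paired-end entries, a quick consistency check can be useful. The Python sketch below walks a data config that has already been loaded into nested dictionaries and reports runlanes whose fastq keys are neither R1 plus R2 nor SE alone. Treating any other combination as an error is an assumption based on the two formats shown on this page; the function name and demo paths are hypothetical.

```python
def check_runlanes(config):
    """Return (sample, library, runlane) tuples with unexpected fastq keys."""
    bad = []
    for patient, samples in config.items():
        for sample, entry in samples.items():
            for library, lib in entry["libraries"].items():
                for runlane, lane in lib["runlanes"].items():
                    keys = set(lane["fastq"])
                    # Assumption: only paired (R1 + R2) or single-end (SE) is valid.
                    if keys not in ({"R1", "R2"}, {"SE"}):
                        bad.append((sample, library, runlane))
    return bad


demo = {
    "SMP-003": {
        "SMP-003-T": {
            "type": "tumour",
            "libraries": {
                "LIBRARY_NAME_SE": {
                    "runlanes": {
                        "LANE_NAME": {"fastq": {"SE": "/path/to/reads.SE.fastq.gz"}},
                        "LANE_BROKEN": {"fastq": {"R1": "/path/to/reads_R1.fastq.gz"}},
                    }
                }
            },
        }
    }
}
print(check_runlanes(demo))
```

Here only LANE_BROKEN is reported, because it lists R1 without a matching R2.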