Input files
The pipeline-suite runs using parameters provided in YAML format. There are two types of required configuration files:
- tool configs (examples: dna_pipeline_config.yaml and rna_pipeline_config.yaml); these specify common parameters, including:
  - project name
  - sequencing type (wgs, exome, rna or targeted), sequencing centre and platform
  - is_ctdna (true or false)
  - hpc_group: the group to use for job submission (i.e., the slurm -A group argument)
  - ref_type (i.e., hg19 or hg38)
  - path to desired output directory (will be created if this is the initial run)
  - paths to tool-specific reference files/directories
  - desired versions of tools
  - and, for each tool, memory and run time parameters for each step
- data configs:
  - fastq_config.yaml, generated by create_data_yaml__tgl.R or create_fastq_yaml.pl (see below)
  - bam_config.yaml, generated by any tool which outputs BAMs that are required for downstream steps
You have 3 options to produce the data config file:
- use create_data_yaml__tgl.R (if filenames are in a specific format [ie, from TGL])
- use create_fastq_yaml.pl (if filenames are variable but contain the sample ID)
- write the file yourself using the format described below (can be tedious!)
If your fastq files come from TGL, their names should follow this format:
TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
create_data_yaml__tgl.R splits the filename to extract the patient ID, sample ID, sample type, library ID and lane ID:
- patient ID: TGL01_0001
- sample ID: TGL01_0001_Ov_P
- sample type: P [where R is interpreted to be reference and anything else is tumour]
- library ID: TGL01_0001_Ov_P_PE_420_EX
- lane ID: 210220_A00469_0160_BAVYWVDTXY_1
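The split described above can be sketched in a few lines of Python. This is only an illustration: the function name parse_tgl_fastq_name and the field positions are assumptions inferred from the single example filename, not the actual logic of create_data_yaml__tgl.R. The reference sample is labelled "normal" here because that is the label used in the example YAML further down.

```python
from typing import Dict


def parse_tgl_fastq_name(filename: str) -> Dict[str, str]:
    """Split a TGL-style fastq filename into its ID components.

    Field positions are inferred from the example filename only;
    they are an assumption, not the logic of create_data_yaml__tgl.R.
    """
    fields = filename.split("_")
    if len(fields) < 13:
        raise ValueError(f"unexpected filename format: {filename}")
    type_code = fields[3]
    return {
        "patient_id": "_".join(fields[:2]),
        "sample_id": "_".join(fields[:4]),
        # R marks the reference sample; any other code is treated as tumour
        "sample_type": "normal" if type_code == "R" else "tumour",
        "library_id": "_".join(fields[:7]),
        "lane_id": "_".join(fields[7:12]),
    }


example = (
    "TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_"
    "CTGATCGT-GCGCATAT_R1.fastq.gz"
)
parsed = parse_tgl_fastq_name(example)
for key, value in parsed.items():
    print(f"{key}: {value}")
```

Filenames that do not have enough underscore-separated fields raise an error rather than producing partial IDs, which is usually the safer behaviour when building a data config.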
The same script can also create a YAML for BAM files, if you already have these, named like this:
TGL01_0001_Ov_P_WG_EXTERNALID.filter.deduped.recalibrated.bam
- patient ID: TGL01_0001
- sample ID: TGL01_0001_Ov_P
- sample type: P [where R is interpreted to be reference and anything else is tumour]
Create the YAML file:
module load R
Rscript /path/to/pipeline_suite/scripts/create_data_yaml__tgl.R \
    -d /path/to/fastq/directory \
    -o /path/to/output_fastq_config.yaml \
    -t fastq    # or: -t bam
Note: create_fastq_yaml.pl requires all fastq files to be in a single, flat directory (/path/to/all/fastq/*.fastq.gz) and will attempt to link fastqs to samples by the Sample.ID provided by sample_info.txt.
Prepare sample_info.txt - this is a tab-separated text file (with a header) listing Patient ID, Sample ID and Sample Type:
Patient.ID Sample.ID Type
SMP-001 SMP-001-T tumour
SMP-001 SMP-001-N normal
SMP-002 SMP-002-T1 tumour
SMP-002 SMP-002-T2 tumour
SMP-002 SMP-002-N normal
If your fastq files come from TGL, their names should follow this format:
TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
Thus, the entry in sample_info.txt would look like this:
TGL01_0001 TGL01_0001_Ov_P tumour
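To make the linking step concrete, here is a minimal Python sketch that assigns fastq filenames to the Sample.ID values listed in sample_info.txt. It assumes a plain substring match on the filename; the real matching rules of create_fastq_yaml.pl may differ. The function name link_fastqs and the fastq filenames are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List

# Same layout as the sample_info.txt example above (tab-separated, with header).
SAMPLE_INFO = (
    "Patient.ID\tSample.ID\tType\n"
    "SMP-001\tSMP-001-T\ttumour\n"
    "SMP-001\tSMP-001-N\tnormal\n"
)


def link_fastqs(sample_info: str, fastq_names: Iterable[str]) -> Dict[str, List[str]]:
    """Assign each fastq to the sample whose Sample.ID occurs in its name."""
    rows = list(csv.DictReader(io.StringIO(sample_info), delimiter="\t"))
    linked: Dict[str, List[str]] = {row["Sample.ID"]: [] for row in rows}
    # Try longer IDs first so a short ID cannot shadow a longer one it prefixes.
    ids_longest_first = sorted(linked, key=len, reverse=True)
    for name in sorted(fastq_names):
        for sample_id in ids_longest_first:
            if sample_id in name:
                linked[sample_id].append(name)
                break
    return linked


# Hypothetical filenames in a single, flat directory.
fastqs = [
    "SMP-001-T_L001_R2.fastq.gz",
    "SMP-001-T_L001_R1.fastq.gz",
    "SMP-001-N_L001_R1.fastq.gz",
    "SMP-001-N_L001_R2.fastq.gz",
    "unrelated_L001_R1.fastq.gz",
]
linked = link_fastqs(SAMPLE_INFO, fastqs)
for sample_id, names in linked.items():
    print(sample_id, names)
```

Files whose names contain none of the listed sample IDs are simply left unassigned, so it is worth checking that every sample ends up with at least one R1/R2 pair.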
Create the YAML file:
module load perl
perl /path/to/pipeline_suite/scripts/create_fastq_yaml.pl \
    -d /path/to/fastq/directory \
    -o /path/to/output_fastq_config.yaml \
    -i /path/to/sample_info.txt \
    -t dna    # or: -t rna
The resulting fastq_config.yaml follows this format:
---
TGL01_0001:
    TGL01_0001_Ov_P:
        type: tumour
        libraries:
            TGL01_0001_Ov_P_PE_420_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_P_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0001_Ov_R:
        type: normal
        libraries:
            TGL01_0001_Ov_R_PE_420_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_R_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0001_Ov_R_PE_420_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
TGL01_0002:
    TGL01_0002_Ov_P:
        type: tumour
        libraries:
            TGL01_0002_Ov_P_PE_465_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_P_PE_465_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_P_PE_465_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0002_Ov_M:
        type: tumour
        libraries:
            TGL01_0002_Ov_M_PE_845_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
                    210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_M_PE_845_EX_210220_A00469_0160_BAVYWVDTXY_2_CTGATCGT-GCGCATAT_R2.fastq.gz
    TGL01_0002_Ov_R:
        type: normal
        libraries:
            TGL01_0002_Ov_R_PE_333_EX:
                runlanes:
                    210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT:
                        fastq:
                            R1: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_R_PE_333_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R1.fastq.gz
                            R2: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/TGL01_0002_Ov_R_PE_333_EX_210220_A00469_0160_BAVYWVDTXY_1_CTGATCGT-GCGCATAT_R2.fastq.gz
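If you write the data config yourself, the nesting (patient, sample, libraries, runlanes, fastq) is easy to get wrong by hand. The following Python sketch builds that structure and prints it as indented YAML. The helper names add_fastq_pair and to_yaml, the IDs and the paths are hypothetical, and the tiny emitter only handles nested mappings of plain strings, which is all this format appears to need.

```python
from typing import Dict, List


def add_fastq_pair(config: Dict, patient: str, sample: str, sample_type: str,
                   library: str, lane: str, r1: str, r2: str) -> None:
    """Insert one paired-end runlane into the nested config mapping."""
    sample_node = config.setdefault(patient, {}).setdefault(
        sample, {"type": sample_type, "libraries": {}}
    )
    library_node = sample_node["libraries"].setdefault(library, {"runlanes": {}})
    library_node["runlanes"][lane] = {"fastq": {"R1": r1, "R2": r2}}


def to_yaml(node: Dict, indent: int = 0) -> List[str]:
    """Render nested dicts of strings as indented YAML lines (no quoting)."""
    lines: List[str] = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 4))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


config: Dict = {}
add_fastq_pair(
    config, "SMP-001", "SMP-001-T", "tumour", "LIB1", "LANE1",
    "/path/to/fastq/directory/SMP-001-T_LANE1_R1.fastq.gz",
    "/path/to/fastq/directory/SMP-001-T_LANE1_R2.fastq.gz",
)
add_fastq_pair(
    config, "SMP-001", "SMP-001-T", "tumour", "LIB1", "LANE2",
    "/path/to/fastq/directory/SMP-001-T_LANE2_R1.fastq.gz",
    "/path/to/fastq/directory/SMP-001-T_LANE2_R2.fastq.gz",
)
lines = to_yaml(config)
print("---")
print("\n".join(lines))
```

Adding a second lane for the same library extends the existing runlanes mapping instead of creating a new library entry, which mirrors the two-lane TGL01_0002_Ov_M example.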
Note: these scripts do not handle single-end (SE) data! To provide SE reads to the aligner, write the entry yourself using this format:
---
SMP-003:
    SMP-003-T:
        type: tumour
        libraries:
            LIBRARY_NAME_SE:
                runlanes:
                    LANE_NAME:
                        fastq:
                            SE: /cluster/projects/pughlab/data/PROJECTNAME/EXOME/SAMPLE_ID_LANE_NAME.SE.fastq.gz
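Hand-written entries can be sanity-checked before a run. The sketch below walks a config that has already been loaded into nested Python dicts and reports any runlane whose fastq block is neither an R1/R2 pair nor a single SE entry. That these are the only two valid shapes is an assumption based on the examples shown here, and check_runlanes is a hypothetical helper, not part of the pipeline-suite.

```python
from typing import Dict, List


def check_runlanes(config: Dict) -> List[str]:
    """Return a message for every runlane with an unexpected set of fastq keys."""
    problems: List[str] = []
    for patient, samples in config.items():
        for sample, info in samples.items():
            for library, lib in info["libraries"].items():
                for lane, entry in lib["runlanes"].items():
                    keys = set(entry["fastq"])
                    # Assumed valid shapes: paired-end (R1 + R2) or single-end (SE)
                    if keys not in ({"R1", "R2"}, {"SE"}):
                        problems.append(
                            f"{patient}/{sample}/{library}/{lane}: "
                            f"unexpected fastq keys {sorted(keys)}"
                        )
    return problems


# The single-end example above, expressed as nested dicts.
se_config = {
    "SMP-003": {
        "SMP-003-T": {
            "type": "tumour",
            "libraries": {
                "LIBRARY_NAME_SE": {
                    "runlanes": {
                        "LANE_NAME": {
                            "fastq": {
                                "SE": "/cluster/projects/pughlab/data/PROJECTNAME/EXOME/SAMPLE_ID_LANE_NAME.SE.fastq.gz"
                            }
                        }
                    }
                }
            },
        }
    }
}
print(check_runlanes(se_config))
```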