Tutorial to run on NIG Super-Computer (Japanese).
- Python3 and pip (required to install cwltool by
pip install cwltool
) - cwltool
- samtools, awk, perl (required to create a depth file for Tasuke+)
- Docker
If you want to run the pipeline on NIG-SuperCompurer, use singularity instead of Docker, and node.js is required additionally (See below). - 20GB memory
Execute the commands below to download reference files.
$ ./scripts/download_reference_files.sh
Genomic Fasta, Annotation GTF, and Protein Fasta will be downloaded to reference
directory.
Shown below is the commands to run the pipeline from test
directory. You can invoke the pipeline from any directory by changing the file paths appropriately.
- Help
cwltool ../workflows/rapdb-pipeline.cwl -h
- Run
All the output files will be generated into the directory specified with--outdir
. If not specified, files will be generated into the current working directory.or specifying acwltool --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
job.yaml
filecwltool --outdir test_out ../workflows/rapdb-pipeline.cwl rapdb-pipeline.test.job.yaml
- Run using singularity (NIG-SC)
Use--singularity
. Singularity image files will be generated into the current directory. It is recommended to run the pipeline in the same working directory to avoid generating redundant image files. "node.js" is also required. See this for the detailed tutorial in Japanese.cwltool --singularity --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
- For debugging (Use
--cachedir
to keep cache files and to resume the job)cwltool --cachedir test_cache --outdir test_out ../workflows/rapdb-pipeline.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
- Run without making depth data for Tasuke+
If the depth file for Tasuke+ is not required, userapdb-pipeline_wo_tasuke.cwl
.orcwltool --outdir test_out ../workflows/rapdb-pipeline_wo_tasuke.cwl --fastq1 read1.fq.gz --fastq2 read2.fq.gz --outprefix SAMDxxxxxxx --threads 2 --reference ../reference/IRGSP-1.0_genome_M_C_unanchored.fa --ref_gtf ../reference/RAP-DB_MSU_concatenated_for_snpEff.gtf --ref_protein ../reference/RAP-DB_MSU_concatenated_protein.fa
cwltool --outdir test_out ../workflows/rapdb-pipeline_wo_tasuke.cwl rapdb-pipeline.test.job.yaml
Name | Description | Mandatory / (default) |
---|---|---|
--fastq1 | FASTQ file for the forward read | YES |
--fastq2 | FASTQ file for the reverse read | YES |
--reference | FASTA file for the reference genome | YES |
--ref_gtf | Reference annotation in GTF for snpEff | YES |
--ref_protein | Reference protein FASTA for snpEff | YES |
--outprefix | Prefix for the output file | (out) |
--filter-expression | Filtering condition for GATK-VariantFiltration | ("QD < 5.0 || FS > 50.0 || SOR > 3.0 || MQ < 50.0 || MQRankSum < -2.5 || ReadPosRankSum < -1.0 || ReadPosRankSum > 3.5") |
--threads | Number of threads for parallel processing | (1) |
Reference files (genomic FASTA, protein FASTA, GTF) can be retrieved using download_reference_files.sh
.
See Analysis workflow for detection of genome-wide variations in TASUKE+ of RAP-DB
Files in the brackets ([ ]) will be output in the result directory.
- Read preprocessing (read_preprocessing.cwl)
1.1 FASTQ stats for raw reads (seqkit stats) [stats report]
1.2 Quality check for raw reads (FastQC) [Report HTML]
1.3 Adapter trimming and read QC (Trimmomatic) [summary]
1.4 FASTQ stats for preprocessed reads (seqkit stats) [stats report]
1.5 Read quality check (FastQC) [Report HTML] - Read mapping and BAM conversion (fastq2bam.cwl)
2.1 Read mapping (BWA)
2.2 Convert SAM to BAM (samtools)
2.3 Sort BAM (samtools)
2.4 Create unmapped BAM (Picard FastqToSam)
2.5 Merge mapped and unmapped BAM (Picard Merge)
2.6 Remove duplicated reads (Picard MarkDuplicate) [BAM, metrics file]
2.7 Create BAM index (samtools index) [BAM index (.bai)]
2.8 Statistics for de-duplicated BAM (samtools stats) [stats.txt]
2.9 Extract unmapped reads from BAM (samtools, picard) [FASTQ] - Variant calling, genotyping, filtering (bam2vcf.cwl)
3.1 Variant calling (GATK HaplotypeCaller) [gVCF, tbi]
3.2 Genotyping (GATK GenotypeGVCFs)
3.3 Filtering (GATK VariantFiltration)
Filtering condition: "QD < 5.0 || FS > 50.0 || SOR > 3.0 || MQ < 50.0 || MQRankSum < -2.5 || ReadPosRankSum < -1.0 || ReadPosRankSum > 3.5"
3.4 Variant selection for SNP and INDEL (GATK SelectVariants) [VCF, tbi] - Variant Annotation (snpeff_all.cwl)
4.1 Build SnpEff database from reference files (SnpEff build)
4.2 Annotate variants (SnpEff) [VCF, tbi, summary, effected genes] - Make depth data for TASUKE+ (bam2tasuke.cwl) [Tasuke+ depth file, average depth]
File name | Description | Step |
---|---|---|
{outprefix}_read-stats-raw.tsv | Stats file for raw FASTQ files | 1.1 |
{fastq1/2}_fastqc.html | FastQC report for raw reads | 1.2 |
{outprefix}.trimmomatic.summary.txt | Summary of Trimmomatic result | 1.3 |
{outprefix}_read-stats.tsv | Stats file for preprocessed FASTQ files | 1.4 |
{fastq1/2}.trimmomatic-pe_fastqc.html | FastQC report for reads processed using Trimmomatoc | 1.5 |
{outprefix}.rmdup.bam | Alignment result in BAM format, after de-duplication | 2.6 |
{outprefix}.rmdup.bam.bai | BAM index for the file above | 2.7 |
{outprefix}.rmdup.metrics | Metrics file for BAM de-duplication | 2.6 |
{outprefix}.rmdup.bam.stats.txt | Statistics for de-duplicated BAM (generated using samtools-depth, in-house script) | 2.8 |
{outprefix}.unmapped-read.{r1/r2}.fastq.gz | Reads not mapped to the reference genome | 2.9 |
variants_{outprefix}.g.vcf.gz | gVCF file generated from GATK-HapolotypeCaller | 3.1 |
variants_{outprefix}.g.vcf.gz.tbi | Tab index for the file above | 3.1 |
variants_{outprefix}.varonly.vcf.gz | VCF file genotyped, filtered, and selected for variants | 3.4 |
variants_{outprefix}.varonly.vcf.gz.tbi | Tab index for the file above | 3.4 |
variants_{outprefix}.snpEff.vcf.gz | VCF file annotated using snpEFff | 4.2 |
variants_{outprefix}.snpEff.vcf.gz.tbi | Tab index for the file above | 4.2 |
snpEff_genes_{outprefix}.txt | snpEff result for effected genes | 4.2 |
snpEff_summary_{outprefix}.html | Summary of snpEff | 4.2 |
{outprefix}_tasuke_depth.tsv.gz | Depth file for Tasuke+ | 5 |
{outprefix}.rmdup.bam.avg-depth.txt | Average depth file for de-duplicated BAM (generated using samtools-depth) | 5 |
- Create depth file for Tasuku+
tasuku_bamtodepth.cwl
now runs locally without using docker. Make sure that Samtools must be installed and be in thePATH
environmental variable.