This repository contains the source code of the Nextflow pipelines for sequencing analyses, as well as a Jupyter Notebook implementing a machine-learning model for Candida auris clade detection.
The following dependencies are required before running the workflows:
- Nextflow
- Docker
- All software used in this paper is packaged in Docker images. Please refer to the docker directory to build the containers required by the workflows.
candida_auris_isolate_illumina.nf
The workflow processes Illumina reads in the following steps:
- Preprocess the reads by trimming adapters and quality filtering using fastp
- Calculate read statistics using FastQC and SeqKit
- Perform read-based species identification using Mash
- Check sequencing depth using our custom script
- Perform genome assembly using SPAdes and filter short contigs using SeqKit
- Evaluate assembly quality using QUAST
- Perform contig-based species identification using Mash
- Perform variant calling using SnpEff and Snippy
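The depth check in the steps above is done by a custom script that is not reproduced in this README. As a rough illustration only, mean sequencing depth is commonly estimated as total sequenced bases divided by genome size. The function names and the `min_depth` cutoff below are assumptions made for this sketch, not values taken from the repository.

```python
def estimated_depth(total_bases: int, genome_size: int) -> float:
    """Return the mean sequencing depth: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def passes_depth_check(total_bases: int, genome_size: int,
                       min_depth: float = 30.0) -> bool:
    """Return True when the estimated depth reaches min_depth.

    min_depth is a hypothetical cutoff chosen for illustration; the
    pipeline's own script may use a different rule.
    """
    return estimated_depth(total_bases, genome_size) >= min_depth


# Example: 500 Mb of reads against the 12.5 Mb genome size passed as
# --genome_size in the example commands of this README.
depth = estimated_depth(500_000_000, 12_500_000)
print(f"{depth:.1f}x")  # 40.0x
```

The total base count would typically come from the read statistics step.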
candida_auris_isolate_nanopore.nf
The workflow processes Nanopore reads in the following steps:
- Preprocess the reads by quality filtering using chopper
- Calculate read statistics using NanoStat and SeqKit
- Check sequencing depth using our custom script
- Perform genome assembly and draft consensus genome construction using Flye
- Perform genome polishing using Medaka and filter short contigs using SeqKit
- Evaluate assembly quality using QUAST with the --nanopore option
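The workflows use SeqKit to filter short contigs. To show what that step does, here is a minimal plain-Python sketch of the same idea. The helper names and the `min_length` value of 1000 are assumptions made for this illustration; the threshold actually used by the pipeline is defined in the workflow scripts.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a mapping of contig name to sequence."""
    parts: Dict[str, list] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def filter_short_contigs(records: Dict[str, str],
                         min_length: int = 1000) -> Dict[str, str]:
    """Keep only contigs whose length is at least min_length (hypothetical default)."""
    return {n: s for n, s in records.items() if len(s) >= min_length}


# Example: one 1,200 bp contig and one 400 bp contig.
example = [">contig_1", "ACGT" * 300, ">contig_2", "ACGT" * 100]
kept = filter_short_contigs(parse_fasta(example), min_length=1000)
print(sorted(kept))  # ['contig_1']
```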
candida_auris_isolate_illumina_polish.nf
The workflow polishes the Nanopore-based genome assembly with Illumina short reads to improve its quality.
The commands below are examples of how to run each workflow.
candida_auris_isolate_illumina.nf
Note that the Illumina paired-end reads were retrieved from the SRA database using the --sra_accession option, which is a custom option implemented in our script.
For the workflow configuration, please refer to nextflow.config
# =================================
# candida_auris_isolate_illumina.nf
# The Nextflow script for Illumina
# =================================
nextflow candida_auris_isolate_illumina.nf \
-with-report /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.report.html \
-with-trace /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.trace.txt \
-with-timeline /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.timeline.html \
-w /data/sgh_candida_auris/20230630_illumina_analysis/nextflow_work \
--outdir /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249 \
--sample_id SRR24877249 \
--genome_size 12500000 \
--sra_accession SRR24877249
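When many SRA accessions need to be processed, the command above can be generated per accession instead of being edited by hand. The following is a minimal sketch under stated assumptions: `build_illumina_command` is a hypothetical helper that is not part of this repository, and it only assembles the argument list from the options shown in the example above. It does not launch Nextflow.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def build_illumina_command(accession: str, base_dir: str,
                           genome_size: int = 12_500_000) -> List[str]:
    """Assemble the Illumina workflow command for one SRA accession.

    Mirrors the layout of the example command: reports and results go to
    <base_dir>/<accession>, and the work directory is shared under <base_dir>.
    """
    base = PurePosixPath(base_dir)
    out = base / accession
    return [
        "nextflow", "candida_auris_isolate_illumina.nf",
        "-with-report", str(out / f"{accession}.nextflow.report.html"),
        "-with-trace", str(out / f"{accession}.nextflow.trace.txt"),
        "-with-timeline", str(out / f"{accession}.nextflow.timeline.html"),
        "-w", str(base / "nextflow_work"),
        "--outdir", str(out),
        "--sample_id", accession,
        "--genome_size", str(genome_size),
        "--sra_accession", accession,
    ]


# Print the command for the accession used in the example above.
cmd = build_illumina_command(
    "SRR24877249", "/data/sgh_candida_auris/20230630_illumina_analysis"
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` once the printed command has been checked against the intended paths.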
candida_auris_isolate_nanopore.nf
For the workflow configuration, please refer to nextflow.config
# =================================
# candida_auris_isolate_nanopore.nf
# The Nextflow script for Nanopore
# =================================
nextflow candida_auris_isolate_nanopore.nf \
-with-report /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_report.html \
-with-trace /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_trace.txt \
-with-timeline /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_timeline.html \
-w /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/nextflow_work \
--outdir /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17 \
--sample_id N00521_BC17 \
--nanopore_reads /home/ubuntu/data/sgh_candida_auris/N00521_BC17.fastq.gz
candida_auris_isolate_illumina_polish.nf
For the workflow configuration, please refer to nextflow.config
# =================================
# candida_auris_isolate_illumina_polish.nf
# The Nextflow script for Hybrid genome assembly
# =================================
nextflow candida_auris_isolate_illumina_polish.nf \
-with-report /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_report.html \
-with-trace /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_trace.txt \
-with-timeline /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_timeline.html \
-w /data/sgh_candida_auris/20230619_illumina_polish/nextflow_work \
--outdir /data/sgh_candida_auris/20230619_illumina_polish/F01567 \
--sample_id F01567 \
--genome_size 12500000 \
--nanopore_reads /data/sgh_candida_auris/20230604_analysis_results/N00466_BC05/N00466_BC05_for_downstream.fastq \
--nanopore_contigs /data/sgh_candida_auris/20230604_analysis_results/N00466_BC05/N00466_BC05.nanopore.flye.medaka_x2.fasta \
--illumina_reads_1 /data/sgh_candida_auris/illumina_fastq/WMB1897_DKDL220004524-1a-AK17215-AK4966_HHJJGCCX2_L1_1.fq.gz \
--illumina_reads_2 /data/sgh_candida_auris/illumina_fastq/WMB1897_DKDL220004524-1a-AK17215-AK4966_HHJJGCCX2_L1_2.fq.gz
Illumina and Nanopore sequencing data of three Clade VI isolates have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject accession number PRJNA1000034.
For the large-cohort analysis, Illumina data for a total of 4,475 unique NCBI accession IDs were retrieved from the SRA database.
As a proof of concept, we provide a Jupyter notebook, machine_learning_for_clade_detection.ipynb, for the automatic detection of new Candida auris clades. This approach can potentially enhance genomic surveillance through the early identification and investigation of outlier genomes.
In short, Bayesian logistic regression models were trained on SNP distances and previously reported clade information to learn a threshold for predicting whether a pair of genomes belongs to the same clade. The threshold was then used to determine the relationships between genome pairs (edges) in a graph, which captured clusters (connected components) representing existing and potential new clades. The analysis included 3,651 publicly available whole-genome sequences and three Clade VI genomes (PRJNA1000034), of which 1,132 (31%) had previously reported clade information. (A large SNP distance matrix file is available upon request.) At each time point, a graph was generated in which nodes represent genomes and edges link pairs of genomes predicted to belong to the same clade. The number of clusters (connected components) in the graph represents the total number of clades predicted to be present in the dataset. An overview of our machine learning approach for detecting potential new Candida auris clades is depicted in the figure below.
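The thresholding and clustering idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's implementation. The logistic coefficients are made-up point values rather than a fitted Bayesian posterior, the toy distances are invented, and `decision_threshold` and `predict_clusters` are hypothetical names. For a logistic model of the form sigmoid(intercept + slope * d), the predicted probability equals 0.5 at d = -intercept / slope, which serves as the SNP distance threshold here.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def decision_threshold(intercept: float, slope: float) -> float:
    """SNP distance at which sigmoid(intercept + slope * d) equals 0.5."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return -intercept / slope


def predict_clusters(genomes: List[str],
                     distances: Dict[FrozenSet[str], float],
                     threshold: float) -> List[List[str]]:
    """Link genome pairs at or below the threshold and return connected components."""
    parent = {g: g for g in genomes}

    def find(g: str) -> str:
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        if distances[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)  # add an edge between a and b

    clusters: Dict[str, List[str]] = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())


# Toy data: two close pairs and one outlier genome (all values invented).
genomes = ["A", "B", "C", "D", "E"]
distances = {frozenset(p): 40_000 for p in combinations(genomes, 2)}
distances[frozenset(("A", "B"))] = 50
distances[frozenset(("C", "D"))] = 80

threshold = decision_threshold(intercept=5.0, slope=-0.001)  # about 5,000 SNPs
clusters = predict_clusters(genomes, distances, threshold)
print(len(clusters), clusters)  # 3 [['A', 'B'], ['C', 'D'], ['E']]
```

In this toy example, genome E forms its own connected component, which is the kind of outlier cluster that would be flagged as a potential new clade.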