WGS analysis for Candida auris

This repository contains the source code of the Nextflow pipelines for sequencing analyses, as well as a Jupyter Notebook implementing a machine-learning model for Candida auris clade detection.

Prerequisite

The following dependencies are required before running the workflows:

Nextflow
Docker
All the software used in this paper is packaged with Docker. Please refer to the docker directory to build the containers required by the workflows.

Workflow description

candida_auris_isolate_illumina.nf
The workflow processes Illumina reads in the following steps:
1. Preprocess the reads by adapter trimming and quality filtering using fastp
2. Calculate read statistics using FastQC and SeqKit
3. Perform read-based species identification using Mash
4. Check sequencing depth using our custom script
5. Perform genome assembly using SPAdes and filter out short contigs using SeqKit
6. Perform assembly QC evaluation using QUAST
7. Perform contig-based species identification using Mash
8. Perform variant calling and annotation using Snippy and SnpEff
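The depth check in step 4 is done by a custom script that is not shown here. As a rough illustration only, depth is commonly estimated as total sequenced bases divided by genome size. In the sketch below, the function names and the 30x cutoff are hypothetical and are not taken from the pipeline; the genome size matches the --genome_size value used in the example commands.

```python
def estimate_depth(total_bases: int, genome_size: int) -> float:
    """Estimate mean sequencing depth as total bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of bases")
    return total_bases / genome_size


def passes_depth_check(total_bases: int, genome_size: int,
                       min_depth: float = 30.0) -> bool:
    """Return True if the estimated depth reaches min_depth.

    The default of 30x is an illustrative assumption, not the
    threshold used by the pipeline's own script.
    """
    return estimate_depth(total_bases, genome_size) >= min_depth


if __name__ == "__main__":
    # 625 Mb of reads against a 12.5 Mb genome
    depth = estimate_depth(625_000_000, 12_500_000)
    print(f"Estimated depth: {depth:.1f}x")
    print("Pass" if passes_depth_check(625_000_000, 12_500_000) else "Fail")
```

The total base count could be taken from the read statistics computed in step 2.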

candida_auris_isolate_nanopore.nf
The workflow processes Nanopore reads in the following steps:
1. Preprocess the reads by quality filtering using chopper
2. Calculate read statistics using NanoStat and SeqKit
3. Check sequencing depth using our custom script
4. Perform genome assembly and draft consensus genome construction using Flye
5. Perform genome polishing using Medaka and filter out short contigs using SeqKit
6. Perform assembly QC evaluation using QUAST with the --nanopore option

candida_auris_isolate_illumina_polish.nf
The workflow polishes the Nanopore-based genome assembly with Illumina short reads to improve its quality, in the following steps:
1. Index the long-read assembly, then align the short reads to it using BWA
2. From the alignment in step 1, generate a sorted and indexed BAM file using SAMtools
3. Polish the long-read assembly using Pilon, with the sorted BAM file and its index as input
4. Perform assembly QC evaluation using QUAST with the --nanopore option
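The first three polishing steps can be sketched as a short sequence of shell commands. The helper below only builds the command strings and does not run them. The function name, file names, thread count and the assumption that a `pilon` wrapper script is on the PATH are all illustrative; the actual invocations are defined in the Nextflow modules.

```python
import shlex


def polish_commands(contigs: str, reads_1: str, reads_2: str,
                    prefix: str, threads: int = 4) -> list[str]:
    """Build illustrative BWA / SAMtools / Pilon commands for short-read polishing."""
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    bam = f"{prefix}.sorted.bam"
    return [
        # Step 1: index the long-read assembly, then align the short reads to it
        f"bwa index {q(contigs)}",
        # Steps 1-2: pipe the alignment into samtools to get a sorted BAM file
        f"bwa mem -t {threads} {q(contigs)} {q(reads_1)} {q(reads_2)}"
        f" | samtools sort -o {q(bam)} -",
        # Step 2: index the sorted BAM file
        f"samtools index {q(bam)}",
        # Step 3: polish the assembly with the aligned short reads
        f"pilon --genome {q(contigs)} --frags {q(bam)} --output {q(prefix)}",
    ]


if __name__ == "__main__":
    for command in polish_commands("contigs.fasta", "reads_1.fq.gz",
                                   "reads_2.fq.gz", "polished"):
        print(command)
```

Building the commands as data makes them easy to inspect before they are handed to a shell or a workflow engine.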

Running the workflow

Example commands for running each workflow are given below.

candida_auris_isolate_illumina.nf
Note that in this example the Illumina paired-end reads are retrieved from the SRA database using the custom --sra_accession option implemented in our script.
For the workflow configuration, please refer to nextflow.config

# =================================
# candida_auris_isolate_illumina.nf
# The Nextflow script for Illumina
# =================================

nextflow candida_auris_isolate_illumina.nf \
-with-report /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.report.html \
-with-trace /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.trace.txt \
-with-timeline /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249/SRR24877249.nextflow.timeline.html \
-w /data/sgh_candida_auris/20230630_illumina_analysis/nextflow_work \
--outdir /data/sgh_candida_auris/20230630_illumina_analysis/SRR24877249 \
--sample_id SRR24877249 \
--genome_size 12500000 \
--sra_accession SRR24877249

candida_auris_isolate_nanopore.nf
For the workflow configuration, please refer to nextflow.config

# =================================
# candida_auris_isolate_nanopore.nf
# The Nextflow script for Nanopore
# =================================

nextflow candida_auris_isolate_nanopore.nf \
-with-report /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_report.html \
-with-trace /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_trace.txt \
-with-timeline /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17/nextflow_timeline.html \
-w /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/nextflow_work \
--outdir /home/ubuntu/data/sgh_candida_auris/20230604_analysis_results/N00521_BC17 \
--sample_id N00521_BC17 \
--nanopore_reads /home/ubuntu/data/sgh_candida_auris/N00521_BC17.fastq.gz

candida_auris_isolate_illumina_polish.nf
For the workflow configuration, please refer to nextflow.config

# =================================
# candida_auris_isolate_illumina_polish.nf
# The Nextflow script for Hybrid genome assembly
# =================================

nextflow candida_auris_isolate_illumina_polish.nf \
-with-report /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_report.html \
-with-trace /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_trace.txt \
-with-timeline /data/sgh_candida_auris/20230619_illumina_polish/F01567/nextflow_timeline.html \
-w /data/sgh_candida_auris/20230619_illumina_polish/nextflow_work \
--outdir /data/sgh_candida_auris/20230619_illumina_polish/F01567 \
--sample_id F01567 \
--genome_size 12500000 \
--nanopore_reads /data/sgh_candida_auris/20230604_analysis_results/N00466_BC05/N00466_BC05_for_downstream.fastq \
--nanopore_contigs /data/sgh_candida_auris/20230604_analysis_results/N00466_BC05/N00466_BC05.nanopore.flye.medaka_x2.fasta \
--illumina_reads_1 /data/sgh_candida_auris/illumina_fastq/WMB1897_DKDL220004524-1a-AK17215-AK4966_HHJJGCCX2_L1_1.fq.gz \
--illumina_reads_2 /data/sgh_candida_auris/illumina_fastq/WMB1897_DKDL220004524-1a-AK17215-AK4966_HHJJGCCX2_L1_2.fq.gz

Data availability

Illumina and Nanopore sequencing data of three Clade VI isolates have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject accession number PRJNA1000034.

For the large-cohort analysis, Illumina data for a total of 4,475 unique NCBI accession IDs were retrieved from the Sequence Read Archive (SRA).

Machine-learning models

As a proof of concept, we provide a Jupyter notebook, candida_auris_machine_learning_for_clade_detection.ipynb, for the automatic detection of new Candida auris clades. This approach can potentially enhance genomic surveillance through the early identification and investigation of outlier genomes.

In short, Bayesian logistic regression models were trained on SNP distances and previously reported clade information to learn a threshold for predicting whether a pair of genomes belongs to the same clade. The threshold was then used to determine the relationships between genome pairs (edges) in a graph, whose clusters (connected components) represent existing and potential new clades. The analysis was based on 3,651 publicly available WGS datasets and three Clade VI genomes (PRJNA1000034), of which 1,132 (31%) had previously reported clade information. (A large SNP distance matrix file is available upon request.) At each time point, a graph was generated in which nodes represent genomes and edges link pairs of genomes predicted to belong to the same clade. The number of clusters (connected components) in the graph represents the total number of clades predicted to be present in the dataset. An overview of our machine-learning approach for detecting potential new Candida auris clades is depicted in the accompanying figure.
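The graph step described above can be illustrated with a minimal sketch: link two genomes when their SNP distance is at or below a threshold, then count connected components. This is not the notebook's code. The function name, the toy distances and the threshold of 100 SNPs are hypothetical, and treating "distance at or below the threshold" as "same clade" is an assumption about how the learned threshold is applied.

```python
def predict_clades(genomes, snp_distance, threshold):
    """Group genomes into clusters (connected components).

    genomes: iterable of genome identifiers (graph nodes).
    snp_distance: dict mapping (genome_a, genome_b) pairs to SNP distances.
    threshold: pairs at or below this distance are linked by an edge
               (an assumed decision rule, see the note above).
    """
    # Build an undirected adjacency map from the thresholded distances
    adjacency = {g: set() for g in genomes}
    for (a, b), distance in snp_distance.items():
        if distance <= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Depth-first search to collect each connected component
    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    # Toy data: A, B and C are close to each other, D is a distant outlier
    toy_distances = {
        ("A", "B"): 50, ("B", "C"): 80, ("A", "C"): 120,
        ("A", "D"): 40000, ("B", "D"): 41000, ("C", "D"): 39000,
    }
    clusters = predict_clades(["A", "B", "C", "D"], toy_distances, threshold=100)
    print(f"{len(clusters)} predicted clades:", [sorted(c) for c in clusters])
```

In the toy data, A and C are not linked directly but fall in the same cluster through B, while D forms its own single-genome cluster, which is how an outlier genome would surface as a candidate new clade.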
