
brooklabteam/mNGS-human-fever

mNGS-human-Fever

This repo walks through some simple analyses of the CZID sequencing data from the undiagnosed human fever project uploaded here. You can start by playing around on CZID to get a feel for the data that are available. Then, you can move analyses over here into R.

CZID Quality Control

First, on CZID, try highlighting all the samples and clicking 'Download' in the upper right-hand corner of the sample list. You will see several types of files available for download. Start with the 'Samples Overview' csv file; we have stored a copy here in the 'data' subfolder under the name "gce_sample_summary.csv".

In the "R-code" subfolder, you will find a script "process_plot_summary.R" that walks through how to visualize this output. The resulting plots start with "QC_" and can be found in the "figures" folder.
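As a rough illustration of the kind of calculation behind the read-filtering plots, here is a minimal sketch in Python (the repo's own script is in R). The inline data and the column names `sample_name`, `total_reads`, `reads_after_qc`, and `nonhost_reads` are assumptions made up for this example; check the header of the real csv and adjust accordingly.

```python
import csv
import io

# Hypothetical stand-in for a few rows of a samples overview export.
# Column names and values are assumptions for illustration only.
EXAMPLE_CSV = """sample_name,total_reads,reads_after_qc,nonhost_reads
sample_A,1000000,800000,20000
sample_B,500000,450000,90000
"""


def qc_proportions(csv_text):
    """Return, per sample, the fraction of total reads left after each step."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row["total_reads"])
        out[row["sample_name"]] = {
            "after_qc": int(row["reads_after_qc"]) / total,
            "nonhost": int(row["nonhost_reads"]) / total,
        }
    return out


proportions = qc_proportions(EXAMPLE_CSV)
for name, props in proportions.items():
    print(name, props)
```

To run this on a real file, the same function can be given the file contents, e.g. `qc_proportions(open(path).read())`, once the column names match.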

Here are a few examples of the plots produced, starting with the proportion of reads cleared by each filtration step in CZID:

QC metrics by sample type:

CZID Pathogen Prevalence and Heatmaps

On CZID, there is also the option to download "Sample Metadata", which is the metadata that users uploaded along with the sequences. Depending on who did the upload, however, these data are often incomplete. Here, we will supply our own metadata file instead; you can find it in the "data" subfolder under the name "GCE-human-metadat.csv".

Also on CZID, if you want to download the pathogen hits associated with each sample, try selecting all the samples, building a 'heatmap', and then downloading the corresponding csv file from the heatmap. You have the option to download the heatmap either before or after filtering. A few subsets of these heatmaps are referenced in the 'GCE_human_analyses.R' script. I downloaded them as separate subsets to make it easier to add sample type and higher-level taxonomic groupings (CZID only downloads the pathogen name, for example).
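To make the idea of joining pathogen hits to metadata concrete, here is a minimal Python sketch (the repo's own analysis is in R) that computes prevalence per sample type. The sample names, sample types, and hits are made-up example data, and the long format of one (sample, taxon) pair per hit is an assumption; a real heatmap csv would first need reshaping into this form.

```python
from collections import defaultdict

# Hypothetical metadata: sample name -> sample type (made-up values).
metadata = {"s1": "serum", "s2": "serum", "s3": "swab", "s4": "swab"}

# Hypothetical long-format hits: one (sample name, taxon name) pair per hit.
hits = [
    ("s1", "HCoV-HKU1"),
    ("s3", "HCoV-HKU1"),
    ("s4", "HCoV-HKU1"),
    ("s4", "Metapneumovirus"),
]


def prevalence_by_type(metadata, hits):
    """Fraction of samples of each type with at least one hit to each taxon."""
    n_per_type = defaultdict(int)
    for sample_type in metadata.values():
        n_per_type[sample_type] += 1

    # Collect positive samples in sets so duplicate hits are counted once.
    positives = defaultdict(set)
    for sample, taxon in hits:
        if sample in metadata:  # ignore hits with no metadata row
            positives[(metadata[sample], taxon)].add(sample)

    return {
        key: len(samples) / n_per_type[key[0]]
        for key, samples in positives.items()
    }


prevalence = prevalence_by_type(metadata, hits)
for (sample_type, taxon), value in sorted(prevalence.items()):
    print(sample_type, taxon, value)
```

Dividing by the number of samples in the metadata, rather than by the number of samples with any hit, keeps samples with no detected pathogens in the denominator.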

Among other things, the script shows you how to make heatmaps like the following:

And how to summarise the data like so:

CZID Genome Analyses

Finally, it is possible to download genomic data for a single sample (or several samples simultaneously), as well as the heatmap metrics. Take a look at this sample as an example. If you click on "Metapneumovirus", you will see the coverage visualization plot below and the two contigs that were constructed. You can download those directly as .fasta files by clicking on the cloud icon to the right.

You can also download all of the non-host contigs from this sample (meaning all of the contigs that were assembled de novo in CZID after all of the filtration steps) by clicking on the "Download" button with the cloud in the upper right of the screen.

These contigs can then be BLASTed (either manually or via a command line script) against NCBI. Occasionally, CZID assembles a contig correctly but gets the BLAST link wrong, so a manual BLAST against a curated reference database can be helpful. Gwen used this project to identify contigs that were hits to bat CoVs in her paper, following this pipeline here.

Additionally, you can download the unmapped non-host reads (meaning all reads that passed the filters but were not assembled into contigs) by clicking on the "View Results Folder" line from the "Download" button and scrolling to the very bottom. These two nonhost fastq files are Illumina paired-end reads.

You can use these to do your own de novo assembly, or to map reads to a contig to see the coverage depth (i.e. support) for a particular genome, essentially recreating the coverage plot above. I walk through a couple of examples of these paired coverage plots using Christian's script, comparing coverage from MSSPE and mNGS, in the script '' located in the R-scripts folder. The resulting plots can be found in the "figures" sub-folder. This script compares the HCoV-HKU1 coverage from mNGS (see here and click on the 'Consensus Genome' tab) vs. that with MSSPE (see here).

This highlights one other feature of CZID: you can build a 'consensus genome', where CZID maps all raw reads to the closest hit in GenBank and tries to build a consensus genome. You will see this as an option when you upload new samples. You can download the data needed to produce the plots above by clicking the 'Download All' button (top right with cloud) on each sample page in CZID.

Clicking on the 'Download All' option downloads a zipped file that contains a read depth plot as well as the non-host reads (fq.gz files) and several output files. The 'samtools_depth.txt' file gives you the read depth across the entire genome. I included these two samtools files and the two .tsv report files for mNGS and MSSPE in the 'data' subfolder. See the '' script, which processes these into a coverage plot.
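For orientation, `samtools depth` output is a tab-separated table with three columns: reference name, 1-based position, and depth. By default, positions with zero depth are omitted. The Python sketch below (the repo's own processing is in R) parses such text and averages depth in fixed windows, which is a common way to smooth a coverage plot. The reference name, genome length, and depths are made-up example values.

```python
# Made-up example of samtools depth output: reference, position, depth.
# Position 3 is absent, which means zero depth there.
EXAMPLE_DEPTH = "ref1\t1\t4\nref1\t2\t6\nref1\t4\t10\n"


def read_depth(text):
    """Parse samtools depth text into a {position: depth} dictionary."""
    depth = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _ref, pos, d = line.split("\t")
        depth[int(pos)] = int(d)
    return depth


def windowed_mean(depth, genome_length, window):
    """Mean depth in consecutive windows; missing positions count as zero."""
    means = []
    for start in range(1, genome_length + 1, window):
        end = min(start + window - 1, genome_length)
        total = sum(depth.get(p, 0) for p in range(start, end + 1))
        means.append((start, total / (end - start + 1)))
    return means


depth = read_depth(EXAMPLE_DEPTH)
windows = windowed_mean(depth, genome_length=4, window=2)
print(windows)
```

The genome length has to be supplied separately because the depth file alone does not say where the reference ends when the trailing positions have no coverage.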

The interesting thing about the plot above is that, judging by raw read depth alone, MSSPE looks like it is doing great. But when you plot reads per million, mNGS actually has higher reads per million in many places in the genome; the MSSPE run simply produced A LOT more reads (this makes sense, as it was run on a NovaSeq while the mNGS was run on a NextSeq). However, we can say that coverage has greater breadth in the case of MSSPE, meaning no dropouts (regions with no resolution) across the genome. This value is reported in the .tsv report files as "Coverage >= 1x (%)": you'll see it is 99% for MSSPE (meaning nearly every base pair has support at 1X or higher), while for mNGS it is only 1.71%.
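The two quantities contrasted here, depth per million reads and breadth of coverage, can be sketched in a few lines of Python (the repo's own scripts are in R). All numbers below are made up to mimic the pattern described, a deep run with many total reads versus a shallower run with fewer; they are not the values from the actual samples, and the total read counts would have to come from each run's own summary.

```python
def depth_per_million(depth, total_reads):
    """Scale per-position depth to depth per million sequenced reads."""
    return {pos: d * 1_000_000 / total_reads for pos, d in depth.items()}


def breadth(depth, genome_length, min_depth=1):
    """Percent of positions covered at min_depth or higher."""
    covered = sum(
        1 for p in range(1, genome_length + 1) if depth.get(p, 0) >= min_depth
    )
    return 100 * covered / genome_length


# Made-up toy genome of length 4.
GENOME_LENGTH = 4
mngs_depth = {1: 5, 2: 5}                 # shallow, with dropouts at 3 and 4
msspe_depth = {1: 50, 2: 50, 3: 50, 4: 50}  # deep and complete

# Made-up total read counts per run.
mngs_rpm = depth_per_million(mngs_depth, total_reads=1_000_000)
msspe_rpm = depth_per_million(msspe_depth, total_reads=100_000_000)

print("mNGS depth per million:", mngs_rpm)
print("MSSPE depth per million:", msspe_rpm)
print("mNGS breadth (%):", breadth(mngs_depth, GENOME_LENGTH))
print("MSSPE breadth (%):", breadth(msspe_depth, GENOME_LENGTH))
```

In this toy case the raw depth is ten times higher for the deep run, yet its depth per million is lower, while its breadth is complete. That is the same qualitative pattern described above.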

Finally, once you have genomes in hand, the Brook lab has lots of great resources for how to build Bayesian timetrees (e.g. here) or maximum likelihood phylogenies (e.g. here). There are also good how-to scripts for these in the Mada-Bat-CoV repo.

It's also worth taking a glance through the SARS-CoV-2 genome curation repo to get an idea of other secrets hidden in CZID! There are some scripts in here that automate the download of the samtools files to produce read coverage plots across many sample types.
