-
Notifications
You must be signed in to change notification settings - Fork 4
Week 5 tests
- This is where we left off in the previous week:
Due to lack of knowledge of what should go into the facet_wrap, we were not able to solve that particular BV issue
- After realizing that that was an issue with how the plot y and x axes were being labelled, we used the content in the metadata to label the axes
- We tested the whole pipeline with the Dog data, stingless bee data, and nf-core/ampliseq data and ran it successfully until the end.
- A slight change in the row.names of the metadata object(samdata) was made by specifying it to pick the variable holding the sample basename(either samples_F or samples_R) which should also be the SampleID names in the metadata file, instead of keying in the SampleID names as we had done last week:
Initially:
samdata <- read.table("set5_meta.txt",header = TRUE,row.names = "sample")
Last week:
sample <- c("Dog1", "Dog2", "Dog3", "Dog8", "Dog9", "Dog10", "Dog15", "Dog16", "Dog17", "Dog22", "Dog23", "Dog24", "Dog29")
samdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/practice.dataset1.meta.txt",header = TRUE,row.names = sample)
Changed to:
samdata <- read.table("/home/nelly-wambui/yosef-metadata/yosef-metadata.csv", header = TRUE, sep=",", row.names = samples_F)
- Also, a separator was added in the "samdata" object as shown in the above illustration
- We made grammatical changes and an arrangement change too:
- Grammatical
Initially: rooting a the tree Changed to: rooting the tree
- Arrangement
Initially Arrangement error: Rooting of the tree was before the creation of the phyloseq object and was therefore bringing an error of not finding the phyloseq object Changed to: We moved the following part to go under the creation of the phyloseq object: #rooting a the tree set.seed(711) phy_tree(phyloseq_object) <- root(phy_tree(phyloseq_object), sample(taxa_names(phyloseq_object), 1), resolve.root = TRUE) is.rooted(phy_tree(phyloseq_object)) #Taxonomy plots. #Phylum phylum_plot <- plot_bar(phyloseq_object, x="sample", fill="Phylum") + facet_wrap(~BV, scales="free_x") #Genus top50 <- names(sort(taxa_sums(phyloseq_object), decreasing=TRUE))[1:50] ps.top50 <- transform_sample_counts(phyloseq_object, function(OTU) OTU/sum(OTU)) ps.top50 <- prune_taxa(top50, ps.top50) Genus_plot <- plot_bar(ps.top50, x="sample", fill="Genus") + facet_wrap(~BV, scales="free_x")
- Grammatical
- Changes that should be made to be specific to one's data are in the following steps:
- Creating a phyloseq object: provide metadata file path(note separator in the file and update accordingly)
- Taxonomy plots: in the phylum and genus plots make changes in x, fill and the first content of the facet wrap according to the metadata
- Visualising alpha diversity: in diversitybyinflammation and top30plot bars make changes in x and colour according to the metadata
- Beta diversity: change the PCoa plot colour and shape according to the metadata
-
The following are the results we obtained after running the pipeline with the stingless bee data
-
Some of the results had commands for saving them in rds and png formats in a the current working directory while some, as shown below, did not have commands for saving them:
-
Quality profile plots that will inform the trimming are saved as "rawplot.pdf"
-
Stingless bee data forward reads quality profile plots
-
Stingless bee data reverse reads quality profile plots
-
Specific forward reads sample quality plot
-
Specific reverse reads sample quality plot
-
Random forward reads samples quality plots
-
Random reverse reads samples quality plots
-
-
Quality profile after trimming and filtering
-
Filtered forward reads
-
Filtered reverse reads
-
Random forward reads samples after trimming and filtering
-
Random reverse reads samples after trimming and filtering
-
-
Filtered sequences are stored in a gzipped format in the working directory
-
An error model for both forward and reverse reads is generated and saved in the current working directory as "error_forward.rds" "error_reverse.rds"
-
Error plots:
-
Forward reads error plots
-
Reverse reads error plots
-
-
The sequence table created is saved in the current working directory as "read-count-tracking.tsv"
-
The assigned taxonomy with the silva database is saved as "taxa_silva_taxonomy.tsv" and with rdp database as "taxa_rdp_taxonomy.tsv"
-
Taxonomy plots include:
-
Phylum plot
-
Genus plot
-
-
The phyloseq object is saved as "phyloseq_object.rds"
-
Alpha diversity plot is saved as "divbyinf.png":
- Alpha diversity plot
-
Richness bar plot for top 30 ASVS:
-Richness bar plot
-
Beta diversity plot is saved as "PCoa.rds":
- Beta diversity PCoa plot PCoa plot
-
Rarefaction:
- Rarefaction curve
-
Vertical line at the fewest sequences in any sample:
- Vertical line
-
We ran all the analysis of all the 56 samples of stingless bee.
nextflow run nf-core/ampliseq -profile singularity --input "/data/yosef/stinglessbee/test_data" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GACTACHVGGGTATCTAATCC" --metadata "final.tsv" --picrust --email "[email protected]" -resume
. ├── cutadapt ├── dada2 ├── fastqc ├── multiqc ├── overall_summary.tsv ├── picrust ├── pipeline_info └── qiime2
function description X10K X11K X12K X13K X14K X15K X16K X17K X18K X19K X1K X20K X21K X22K X23K X24K X25K X26K X27K X28K X29K X2K X30K X31K X32K X33K X34K X35K X36K X37K X38K X39K X3K X40K X41K X42K X43K X44K X45K X46K X47K X48K X49K X4K X50K X51K X52K X53K X54K X55K X5K X6K X7K X8K X9K EC:1.1.1.1 Alcohol dehydrogenase 146203.22 257164.69 301016.71 259887.76 307200.8 296558.24 242832.02 294386.9 301062.66000000003 279653.43999999994 180106.68 293277.66000000003 122888.99 136866.24999999994 68278.97 42908.930000000015 410851.6 367521.77 498615.6 198306.04 310085.68 244578.91 192504.48 245845.7 243906.04 226665.22 300611.17 228634.15 253331.21 295092.44 325457.36 257567.75 199450.96999999997 298710.53 251076.7 177375.74 149260.03 268108.38 212558.4 357379.21 300098.08 263985.25 351999.25 148684.54 337689.75 359992.59 299050.73 287365.51 390123.75 388865.16 127793.21999999994 242069.16999999995 183991.06 244399.98 184928.25000000003 EC:1.1.1.100 3-oxoacyl-[acyl-carrier-protein] reductase 151265.20000000007 137003.89 117449.35 113144.62 172872.5 133824.77 188419.14 217771.76 103540.5 281782.69 177652.6 298584.46 129340.25 159438.03999999998 99532.46000000002 125947.62 161346.69 146352.16000000006 210635.61 130408.04 157556.19999999998 205612.9 180787.49 170110.55 170633.19 149550.3 172061.83000000002 173980.81 128955.37 136691.93 152881.22 129109.5 175521.89999999994 149842.31999999998 162689.8 113702.84 96965.83 157133.58 130784.13000000002 264684.67 220979.15 203626.25 282236.5 144730.15 250320.32000000004 242824.48 244164.09 246770.64 273479.75 303928.83 136967.28000000006 249149.53 215836.33000000007 248255.38999999996 200057.79 EC:1.1.1.102 3-dehydrosphinganine reductase 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 EC:1.1.1.103 L-threonine 3-dehydrogenase 789.2 13938.93 71.53 49.05 34.1 136.93 501.5100000000001 153.2 69.5 92.93 4448.36 717.04 40.5 66.57 143.98999999999995 410.6 111.0 263.54 60.0 2.25 43.0 1005.12 75.0 64.0 7.0 8.86 30.0 10.5 118.4 37.14 159.5 3.5 1149.62 99.21 669.5 2520.34 143.3 37.0 2409.98 16.5 73.83 14.0 60.0 731.93 64.17 7.0 4.29 16.0 99.75 19.33 985.28 953.33 2647.050000000001 2675.96 1428.33 EC:1.1.1.105 All-trans-retinol dehydrogenase (NAD(+)) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 47.0 0.0 0.0 35.0 0.0 15.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 18.0 0.0 0.0 0.0 0.0 EC:1.1.1.107 Pyridoxal 4-dehydrogenase 0.0 2.0 0.0 0.0 0.0 1.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 EC:1.1.1.108 Carnitine 3-dehydrogenase 344.75 45.0 151.0 68.5 25.0 89.55 371.0 167.0 32.0 200.0 1692.5 444.0 36.75 56.5 149.25 265.0 82.0 137.5 0.0 0.0 24.0 1253.75 43.0 31.0 0.0 24.0 33.5 19.0 66.75 61.25 20.0 38.0 1548.75 21.0 68.0 16.0 25.0 0.75 19.5 39.75 28.5 22.75 0.0 644.25 0.83 35.0 61.0 10.0 0.0 0.0 712.75 103.61 3386.25 310.25 131.25 EC:1.1.1.11 D-arabinitol 4-dehydrogenase 702.91 12637.43 30.63 508.38 21.0 214.1 157.9 27.0 422.0 48.0 6639.980000000001 417.37 497.14 1496.14 1704.16 1711.86 96.15 205.52 74.67 84.57 66.0 918.27 362.0 136.0 415.0 22.43 87.28999999999998 50.71 109.0 42.14 182.0 0.0 558.69 30.0 94.0 1456.13 271.63 100.5 398.15 195.0 1166.71 325.0 1994.0 361.03 670.0 7.0 7.29 80.0 409.75 208.0 181.51 554.22 1722.4100000000005 2207.9 2405.33 EC:1.1.1.122 D-threo-aldose 1-dehydrogenase 392.25 625.0 38.0 220.5 47.0 1279.42 98.0 95.5 67.0 81.0 1835.89 1153.0 36.75 39.5 65.42 101.0 41.0 90.0 0.0 0.0 9.0 1397.42 0.0 43.0 0.0 0.0 0.0 8.0 39.75 60.25 10.0 12.0 1652.08 35.0 1117.0 1194.0 620.0 166.75 347.5 34.75 0.0 2.0 0.0 795.58 7.0 3.0 3.29 19.0 62.0 0.0 996.08 343.75 8691.14 110.75 264.03000000000003
nextflow run nf-core/ampliseq -name SeroFluidMetagenomic5 -profile singularity --input "data" --FW_primer "GGAAGTAAAAGTCGTAACAAGG" --RV_primer "GCTGCGTTCTTCATCGATGC" --metadata "metadata.tsv" --picrust --email "[email protected]" -resume --illumina_pe_its
. ├── cutadapt ├── dada2 ├── fastqc ├── multiqc ├── overall_summary.tsv ├── pipeline_info └── qiime2
- Picrust did not run on ITS
- ITS is not a marker gene
nextflow run nf-core/ampliseq -profile singularity --input "data" --FW_primer "AGAGTTTGATCCTGGCTCAG" --RV_primer "GGTTACCTTGTTACGACTT" --metadata "metadata.tsv" --pacbio --email "[email protected]" --extension '/*_R_001.fastq.gz' --picrust -resume
. ├── cutadapt ├── dada2 ├── fastqc ├── multiqc ├── overall_summary.tsv ├── picrust ├── pipeline_info └── qiime2
nextflow run nf-core/ampliseq -profile singularity --input "data" --FW_primer "CAGCMGCCGCGGTAATAC" --RV_primer "CCGTCAATTCMTTTGAGTTT" --metadata "metadata.tsv" --iontorrent --picrust --email "[email protected]" --extension "/*_R_001.fastq.gz" -resume
. ├── cutadapt ├── dada2 ├── fastqc ├── multiqc ├── overall_summary.tsv ├── picrust ├── pipeline_info └── qiime2
We run analysis from 18S single end datasets, but ran into an error
Command error:
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
Calls: learnErrors -> dada -> getErrors
Execution halted
We created a test dataset for MBBU 16S Accreditation for end users to use to verify that the pipeline works on their machine before running their own data. A configuration file "test.config" specifies the parameters of the test datasets
Test.config file
params{ outdir = 'Test-Results' primers = "../test_data/primers.txt" metadata = "../test_data/metadata.tsv" reads = "../test_data/data/*_R{1,2}_001.fastq.gz" tracedir = "${params.outdir}/pipeline_info" }
The command for running test data is
nextflow -C test.config run main.nf
. ├── Rdb ├── artifacts ├── chimeras ├── dereps ├── fastqc_post ├── fastqc_raw ├── filter ├── merge ├── multiqc_postfastqc ├── multiqc_raw ├── orient ├── otus ├── pipeline_info ├── trimming └── visualization
The running parameters which include the path for primers, metadata and reads are specified in the nextflow.config file. We tested the use of flags for specifying these prameters rather than editing the configuration file.
- --primers "path/to/primers.txt"
- --metadata "path/to/metadata.tsv"
- --reads "path/to/reads/*_R{1,2}_001.fastq" We tested the used of flags on the stingless bee microbiome data
nextflow run main.nf --primers primers.txt --metadata metadata.tsv --reads microbiome-data/*_R{1,2}.fastq.gz
. ├── Rdb ├── artifacts ├── chimeras ├── dereps ├── fastqc_post ├── fastqc_raw ├── filter ├── merge ├── multiqc_postfastqc ├── multiqc_raw ├── orient ├── otus ├── pipeline_info ├── trimming └── visualization
We improved the Documentation for the MBBU 16S Accreditation Qiime2, to include a brief of what the pipeline is about and the step by step instruction of how to get started. We will send a PR soon!\
- start of README.md *
This pipeline is for analysis of 16s rRNA
This pipeline performs the following in order:
- Quality check of raw reads
- Trimming of adapters from reads
- Merging/Stitching
- Filtering and Primer removal
- Orientation
- Dereplication
- Chimera detection
- Clustering OTUs
- Phylogeny
- Taxonomy
- Alpha diversity and Beta diversity
- Install
Nextflow
(>21.11.3
) - Install
Trimmomatic
(>V0.39
),qiime2
(>V2020.6
),multiqc
(>V1.4
),vsearch
(>V2.16.0
) - Download the pipeline into your directory with the command
git clone [email protected]:mbbu/16S_Accreditation.git
- Test on the given test data with the command
nextflow -C test.config run main.nf
- Start running your own analysis
nextflow run main.nf --primers <path/to/primers.txt> --metadata <path/to/metadata.tsv> --reads <path/to/reads/*_R{1,2}.fastq.gz>
Check this contributors page for collaborators. All are listed in the report, including those who did not commit to GitHub.
- End of README *
- Nextflow script for MBBU 16S Accreditation Dada2
- Yosef's pipeline
- Test pipeline on cloud
- Final report