Week 5 tests

MBBU/16S-Accreditation DADA2 pipeline

Week 4 Issue

This is where we left off in the previous week:

Due to lack of knowledge of what should go into the facet_wrap, we were not able to solve that particular BV issue

After realizing that that was an issue with how the plot y and x axes were being labelled, we used the content in the metadata to label the axes

Testing the pipeline

We tested the whole pipeline with the Dog data, stingless bee data, and nf-core/ampliseq data and ran it successfully until the end.
A slight change in the row.names of the metadata object(samdata) was made by specifying it to pick the variable holding the sample basename(either samples_F or samples_R) which should also be the SampleID names in the metadata file, instead of keying in the SampleID names as we had done last week:

Initially:
samdata <- read.table("set5_meta.txt",header = TRUE,row.names = "sample")

Last week:
sample <- c("Dog1", "Dog2", "Dog3", "Dog8", "Dog9", "Dog10", "Dog15", "Dog16", "Dog17", "Dog22", "Dog23", "Dog24", "Dog29")
samdata <- read.table("/node/cohort4/nelly/16s-rRNA-Project/practice.dataset1.meta.txt",header = TRUE,row.names = sample)

Changed to:
samdata <- read.table("/home/nelly-wambui/yosef-metadata/yosef-metadata.csv", header = TRUE, sep=",", row.names = samples_F)

Also, a separator was added in the "samdata" object as shown in the above illustration

We made grammatical changes and an arrangement change too:

Grammatical

Initially:
rooting a the tree

Changed to:
rooting the tree

Arrangement

Initially
Arrangement error:
Rooting of the tree was before the creation of the phyloseq object and was therefore bringing an error of not finding the phyloseq object

Changed to:
We moved the following part to go under the creation of the phyloseq object:
#rooting a the tree
set.seed(711)
phy_tree(phyloseq_object) <- root(phy_tree(phyloseq_object), sample(taxa_names(phyloseq_object), 1), resolve.root = TRUE)
is.rooted(phy_tree(phyloseq_object))


#Taxonomy plots.
#Phylum
phylum_plot <-  plot_bar(phyloseq_object, x="sample", fill="Phylum") + facet_wrap(~BV, scales="free_x")

#Genus
top50 <- names(sort(taxa_sums(phyloseq_object), decreasing=TRUE))[1:50]
ps.top50 <- transform_sample_counts(phyloseq_object, function(OTU) OTU/sum(OTU))
ps.top50 <- prune_taxa(top50, ps.top50)
Genus_plot <-  plot_bar(ps.top50, x="sample", fill="Genus") + facet_wrap(~BV, scales="free_x")

Continuation of changes from those indicated in week 4:

Changes that should be made to be specific to one's data are in the following steps:
- Creating a phyloseq object: provide metadata file path(note separator in the file and update accordingly)
- Taxonomy plots: in the phylum and genus plots make changes in x, fill and the first content of the facet wrap according to the metadata
- Visualising alpha diversity: in diversitybyinflammation and top30plot bars make changes in x and colour according to the metadata
- Beta diversity: change the PCoa plot colour and shape according to the metadata

Output

The following are the results we obtained after running the pipeline with the stingless bee data
Some of the results had commands for saving them in rds and png formats in a the current working directory while some, as shown below, did not have commands for saving them:
- Quality profile plots that will inform the trimming are saved as "rawplot.pdf"
  - Stingless bee data forward reads quality profile plots
  - Stingless bee data reverse reads quality profile plots
  - Specific forward reads sample quality plot
  - Specific reverse reads sample quality plot
  - Random forward reads samples quality plots
  - Random reverse reads samples quality plots
- Quality profile after trimming and filtering
  - Filtered forward reads
  - Filtered reverse reads
  - Random forward reads samples after trimming and filtering
  - Random reverse reads samples after trimming and filtering
- Filtered sequences are stored in a gzipped format in the working directory
- An error model for both forward and reverse reads is generated and saved in the current working directory as "error_forward.rds" "error_reverse.rds"
- Error plots:
  - Forward reads error plots
  - Reverse reads error plots
- The sequence table created is saved in the current working directory as "read-count-tracking.tsv"
- The assigned taxonomy with the silva database is saved as "taxa_silva_taxonomy.tsv" and with rdp database as "taxa_rdp_taxonomy.tsv"
- Taxonomy plots include:
  - Phylum plot
  - Genus plot
- The phyloseq object is saved as "phyloseq_object.rds"
- Alpha diversity plot is saved as "divbyinf.png":
  - Alpha diversity plot
- Richness bar plot for top 30 ASVS:
  
  -Richness bar plot
- Beta diversity plot is saved as "PCoa.rds":
  - Beta diversity PCoa plot PCoa plot
- Rarefaction:
  - Rarefaction curve
- Vertical line at the fewest sequences in any sample:
  - Vertical line

nf-core/ampliseq

Stingless bee microbiome

We ran all the analysis of all the 56 samples of stingless bee.

Command

 nextflow run nf-core/ampliseq -profile singularity --input "/data/yosef/stinglessbee/test_data" --FW_primer "CCTACGGGNGGCWGCAG" --RV_primer "GACTACHVGGGTATCTAATCC" --metadata "final.tsv" --picrust --email "[email protected]" -resume

Output

Overall

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── picrust
├── pipeline_info
└── qiime2

Alpha rarefaction

Barplot

Alpha diversity

Beta diversity

Functional analysis results

function	description	X10K	X11K	X12K	X13K	X14K	X15K	X16K	X17K	X18K	X19K	X1K	X20K	X21K	X22K	X23K	X24K	X25K	X26K	X27K	X28K	X29K	X2K	X30K	X31K	X32K	X33K	X34K	X35K	X36K	X37K	X38K	X39K	X3K	X40K	X41K	X42K	X43K	X44K	X45K	X46K	X47K	X48K	X49K	X4K	X50K	X51K	X52K	X53K	X54K	X55K	X5K	X6K	X7K	X8K	X9K
EC:1.1.1.1	Alcohol dehydrogenase	146203.22	257164.69	301016.71	259887.76	307200.8	296558.24	242832.02	294386.9	301062.66000000003	279653.43999999994	180106.68	293277.66000000003	122888.99	136866.24999999994	68278.97	42908.930000000015	410851.6	367521.77	498615.6	198306.04	310085.68	244578.91	192504.48	245845.7	243906.04	226665.22	300611.17	228634.15	253331.21	295092.44	325457.36	257567.75	199450.96999999997	298710.53	251076.7	177375.74	149260.03	268108.38	212558.4	357379.21	300098.08	263985.25	351999.25	148684.54	337689.75	359992.59	299050.73	287365.51	390123.75	388865.16	127793.21999999994	242069.16999999995	183991.06	244399.98	184928.25000000003
EC:1.1.1.100	3-oxoacyl-[acyl-carrier-protein] reductase	151265.20000000007	137003.89	117449.35	113144.62	172872.5	133824.77	188419.14	217771.76	103540.5	281782.69	177652.6	298584.46	129340.25	159438.03999999998	99532.46000000002	125947.62	161346.69	146352.16000000006	210635.61	130408.04	157556.19999999998	205612.9	180787.49	170110.55	170633.19	149550.3	172061.83000000002	173980.81	128955.37	136691.93	152881.22	129109.5	175521.89999999994	149842.31999999998	162689.8	113702.84	96965.83	157133.58	130784.13000000002	264684.67	220979.15	203626.25	282236.5	144730.15	250320.32000000004	242824.48	244164.09	246770.64	273479.75	303928.83	136967.28000000006	249149.53	215836.33000000007	248255.38999999996	200057.79
EC:1.1.1.102	3-dehydrosphinganine reductase	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	11.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	3.0	0.0	0.0	0.0	0.0	0.0	2.5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
EC:1.1.1.103	L-threonine 3-dehydrogenase	789.2	13938.93	71.53	49.05	34.1	136.93	501.5100000000001	153.2	69.5	92.93	4448.36	717.04	40.5	66.57	143.98999999999995	410.6	111.0	263.54	60.0	2.25	43.0	1005.12	75.0	64.0	7.0	8.86	30.0	10.5	118.4	37.14	159.5	3.5	1149.62	99.21	669.5	2520.34	143.3	37.0	2409.98	16.5	73.83	14.0	60.0	731.93	64.17	7.0	4.29	16.0	99.75	19.33	985.28	953.33	2647.050000000001	2675.96	1428.33
EC:1.1.1.105	All-trans-retinol dehydrogenase (NAD(+))	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	47.0	0.0	0.0	35.0	0.0	15.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0	0.0	18.0	0.0	0.0	0.0	0.0
EC:1.1.1.107	Pyridoxal 4-dehydrogenase	0.0	2.0	0.0	0.0	0.0	1.5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
EC:1.1.1.108	Carnitine 3-dehydrogenase	344.75	45.0	151.0	68.5	25.0	89.55	371.0	167.0	32.0	200.0	1692.5	444.0	36.75	56.5	149.25	265.0	82.0	137.5	0.0	0.0	24.0	1253.75	43.0	31.0	0.0	24.0	33.5	19.0	66.75	61.25	20.0	38.0	1548.75	21.0	68.0	16.0	25.0	0.75	19.5	39.75	28.5	22.75	0.0	644.25	0.83	35.0	61.0	10.0	0.0	0.0	712.75	103.61	3386.25	310.25	131.25
EC:1.1.1.11	D-arabinitol 4-dehydrogenase	702.91	12637.43	30.63	508.38	21.0	214.1	157.9	27.0	422.0	48.0	6639.980000000001	417.37	497.14	1496.14	1704.16	1711.86	96.15	205.52	74.67	84.57	66.0	918.27	362.0	136.0	415.0	22.43	87.28999999999998	50.71	109.0	42.14	182.0	0.0	558.69	30.0	94.0	1456.13	271.63	100.5	398.15	195.0	1166.71	325.0	1994.0	361.03	670.0	7.0	7.29	80.0	409.75	208.0	181.51	554.22	1722.4100000000005	2207.9	2405.33
EC:1.1.1.122	D-threo-aldose 1-dehydrogenase	392.25	625.0	38.0	220.5	47.0	1279.42	98.0	95.5	67.0	81.0	1835.89	1153.0	36.75	39.5	65.42	101.0	41.0	90.0	0.0	0.0	9.0	1397.42	0.0	43.0	0.0	0.0	0.0	8.0	39.75	60.25	10.0	12.0	1652.08	35.0	1117.0	1194.0	620.0	166.75	347.5	34.75	0.0	2.0	0.0	795.58	7.0	3.0	3.29	19.0	62.0	0.0	996.08	343.75	8691.14	110.75	264.03000000000003

ITS dataset

Command

nextflow run nf-core/ampliseq -name SeroFluidMetagenomic5 -profile singularity --input "data" --FW_primer "GGAAGTAAAAGTCGTAACAAGG" --RV_primer "GCTGCGTTCTTCATCGATGC" --metadata "metadata.tsv" --picrust --email "[email protected]" -resume --illumina_pe_its

Output

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── pipeline_info
└── qiime2

Picrust did not run on ITS
ITS is not a marker gene

PacBio dataset

Command

nextflow run nf-core/ampliseq -profile singularity --input "data" --FW_primer "AGAGTTTGATCCTGGCTCAG" --RV_primer "GGTTACCTTGTTACGACTT" --metadata "metadata.tsv" --pacbio --email "[email protected]" --extension '/*_R_001.fastq.gz' --picrust -resume

Output

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── picrust
├── pipeline_info
└── qiime2

IonTorrent dataset

Command

nextflow run nf-core/ampliseq -profile singularity --input "data" --FW_primer "CAGCMGCCGCGGTAATAC" --RV_primer "CCGTCAATTCMTTTGAGTTT" --metadata "metadata.tsv" --iontorrent --picrust --email "[email protected]" --extension "/*_R_001.fastq.gz" -resume

Output

.
├── cutadapt
├── dada2
├── fastqc
├── multiqc
├── overall_summary.tsv
├── picrust
├── pipeline_info
└── qiime2

18s RNA data

We run analysis from 18S single end datasets, but ran into an error

Command error:
  Error rates could not be estimated (this is usually because of very few reads).
  Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
  Calls: learnErrors -> dada -> getErrors
  Execution halted

MBBU 16S Accreditation Qiime

Test data

We created a test dataset for MBBU 16S Accreditation for end users to use to verify that the pipeline works on their machine before running their own data. A configuration file "test.config" specifies the parameters of the test datasets

Test.config file

params{
	outdir = 'Test-Results'
	primers = "../test_data/primers.txt"
	metadata = "../test_data/metadata.tsv"
	reads = "../test_data/data/*_R{1,2}_001.fastq.gz"
	tracedir = "${params.outdir}/pipeline_info"
}

The command for running test data is

nextflow -C test.config run main.nf

Test dataset output

.
├── Rdb
├── artifacts
├── chimeras
├── dereps
├── fastqc_post
├── fastqc_raw
├── filter
├── merge
├── multiqc_postfastqc
├── multiqc_raw
├── orient
├── otus
├── pipeline_info
├── trimming
└── visualization

Flags

The running parameters which include the path for primers, metadata and reads are specified in the nextflow.config file. We tested the use of flags for specifying these prameters rather than editing the configuration file.

--primers "path/to/primers.txt"
--metadata "path/to/metadata.tsv"
--reads "path/to/reads/*_R{1,2}_001.fastq" We tested the used of flags on the stingless bee microbiome data

Command

 nextflow run main.nf --primers primers.txt --metadata metadata.tsv --reads microbiome-data/*_R{1,2}.fastq.gz

Output

.
├── Rdb
├── artifacts
├── chimeras
├── dereps
├── fastqc_post
├── fastqc_raw
├── filter
├── merge
├── multiqc_postfastqc
├── multiqc_raw
├── orient
├── otus
├── pipeline_info
├── trimming
└── visualization

Documentation

We improved the Documentation for the MBBU 16S Accreditation Qiime2, to include a brief of what the pipeline is about and the step by step instruction of how to get started. We will send a PR soon!\

start of README.md *

Introduction

This pipeline is for analysis of 16s rRNA

Pipeline Summary

This pipeline performs the following in order:

Quality check of raw reads
Trimming of adapters from reads
Merging/Stitching
Filtering and Primer removal
Orientation
Dereplication
Chimera detection
Clustering OTUs
Phylogeny
Taxonomy
Alpha diversity and Beta diversity

Quick Start

Install Nextflow(>21.11.3)
Install Trimmomatic(>V0.39), qiime2(>V2020.6), multiqc(>V1.4), vsearch(>V2.16.0)

Download the pipeline into your directory with the command

git clone [email protected]:mbbu/16S_Accreditation.git

Test on the given test data with the command
```
nextflow -C test.config run main.nf
```

Start running your own analysis

nextflow run main.nf --primers <path/to/primers.txt> --metadata <path/to/metadata.tsv> --reads <path/to/reads/*_R{1,2}.fastq.gz>

Credits

Check this contributors page for collaborators. All are listed in the report, including those who did not commit to GitHub.

End of README *

Remaining Tasks

Nextflow script for MBBU 16S Accreditation Dada2
Yosef's pipeline
Test pipeline on cloud
Final report

Week 5 tests

MBBU/16S-Accreditation DADA2 pipeline

Week 4 Issue

Testing the pipeline

Continuation of changes from those indicated in week 4:

Output

nf-core/ampliseq

Stingless bee microbiome

Command

Output

Overall

Alpha rarefaction

Barplot

Alpha diversity

Beta diversity

Functional analysis results

ITS dataset

Command

Output

PacBio dataset

Command

Output

IonTorrent dataset

Command

Output

18s RNA data

MBBU 16S Accreditation Qiime

Test data

Test dataset output

Flags

Command

Output

Documentation

Introduction

Pipeline Summary

Quick Start

Credits

Remaining Tasks

Clone this wiki locally