DADA2 Documentation
We use the DADA2 software to assign taxonomy to amplicon sequences. However, some processing is necessary before we can assign taxonomy. A brief overview of the process is as follows (a sketch of the core DADA2 calls follows this list):
- Determine quality trimming length
- Filter and trim for quality and length
- Dereplicate
- Learn error rates
- Sample inference
- Merge paired reads
- Remove chimeras
- Filter merged reads by size
- Assign Taxonomy
- Create and Save Output Files
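For orientation, the core DADA2 calls behind the denoising steps above look roughly like the following. This is a minimal sketch following the DADA2 tutorial; the object names (filtFs, filtRs, etc.) are illustrative, and our pipeline wraps these calls in the project-specific logic documented below.
### Core DADA2 steps (sketch; object names are illustrative) ---------------------
library(dada2)
errF <- learnErrors(filtFs, multithread = TRUE)          # learn error rates (forward)
errR <- learnErrors(filtRs, multithread = TRUE)          # learn error rates (reverse)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)   # sample inference (forward)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)   # sample inference (reverse)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)    # merge paired reads
seqtab <- makeSequenceTable(mergers)                     # ASV table (samples x sequences)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE) # remove chimeras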
A lot of our DADA2 script is pulled from DADA2's Tutorial, which you can check out for further explanation. However, we slightly modified and extended the existing tools to further automate and refine the process for our project's needs. Below is documentation for the code that deviates from DADA2's tutorial.
To determine the quality trimming length, we use a sliding window algorithm that decides where to truncate reads based on quality score. With Illumina reads, quality tends to drop off towards the end of the read. This is especially true of longer reads (300+ bp).
In a typical DADA2 workflow, you would look at the quality profile plots to determine by eye where to trim. Ideally, you want to strike a balance between keeping enough of each sequence (so that paired reads can still merge) and trimming off low-quality bases. As you might guess, there is a lot of debate about how to choose these parameters.
To aid in reproducibility and automation, we created a simple algorithm that decides where to trim forward and reverse reads. Below is the code:
n <- 500000      # number of reads qa() samples from each fastq file
window_size <- 10

# Forward reads: find a truncation position for each sample
trimsF <- c()
for(f in fnFs[!is.na(fnFs)]) {
  srqa <- qa(f, n = n)                 # ShortRead::qa() - per-cycle quality summary (slowest step)
  df <- srqa[["perCycle"]]$quality
  means <- rowsum(df$Score * df$Count, df$Cycle) / rowsum(df$Count, df$Cycle) # mean quality at each cycle
  window_values <- c()
  for(j in 1:(length(means) - window_size)){
    window_values[j] <- mean(means[j:(j + window_size)]) # mean quality score within each window
  }
  # trim at the first window whose mean quality dips below the forward threshold from primer_data.csv
  where_to_cut <- min(which(window_values < primer.data$F_qual[i]))
  trimsF[f] <- where_to_cut
}
where_trim_all_Fs <- median(trimsF) # median of all trims - used as truncLen for forward reads

# Reverse reads: same procedure with the reverse-read quality threshold
trimsR <- c()
for(r in fnRs[!is.na(fnRs)]) {
  srqa <- qa(r, n = n)
  df <- srqa[["perCycle"]]$quality
  means <- rowsum(df$Score * df$Count, df$Cycle) / rowsum(df$Count, df$Cycle)
  window_values <- c()
  for(j in 1:(length(means) - window_size)){
    window_values[j] <- mean(means[j:(j + window_size)])
  }
  where_to_cut <- min(which(window_values < primer.data$R_qual[i]))
  trimsR[r] <- where_to_cut
}
where_trim_all_Rs <- median(trimsR)

# Cap at max_trim if the obtained length is greater than max_trim
if(where_trim_all_Fs > primer.data$max_trim[i]){
  where_trim_all_Fs <- primer.data$max_trim[i]
}
if(where_trim_all_Rs > primer.data$max_trim[i]){
  where_trim_all_Rs <- primer.data$max_trim[i]
}
What this does is:
- Determine the quality at each base - The qa() function calculates the quality score at each position of the sequence for each read.
- Find the average quality for each base - rowsum() is then used to find the average quality at each position across all reads.
- Use a sliding window to determine where to trim - A window of size 10 is moved along the vector of average qualities, and the mean quality score within each window is calculated. Once that mean quality drops below a threshold (supplied by primer_data.csv), we take the first position of that window and add it to a vector called trimsF/trimsR.
- Repeat with all samples - This sliding window calculation is repeated for each sample, and each sample's trim position is appended to trimsF/trimsR.
- Use the median to determine where to trim all samples - We then take the median of trimsF/trimsR and use it as the input to truncLen (see the sketch below), which trims the right end of reads based on quality. Note that this calculation is done separately for forward and reverse reads, as reverse reads tend to have lower overall quality and an earlier quality dropoff.
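As a rough sketch of how these truncation lengths feed into DADA2's filterAndTrim() step (our actual call may use different filtering parameters; filtFs and filtRs here are the output paths for the filtered fastq files):
### Filter and trim using the computed truncation lengths (sketch) ---------------
filter.out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                            truncLen = c(where_trim_all_Fs, where_trim_all_Rs),
                            maxN = 0, maxEE = c(2, 2), truncQ = 2, rm.phix = TRUE,
                            compress = TRUE, multithread = TRUE)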
After merging reads, there will sometimes be sequences that are longer or shorter than your target sequence length. This can happen for many reasons: sample contamination, low-quality reads, etc. For example, when we were processing our data, we found bacterial sequences that were being amplified by primers designed to amplify fish DNA. Because these bacterial sequences were 100 bp longer than our target sequence, we were able to filter them out at this step.
We use the following code to filter reads by size:
### Filter by Size ---------------------------------------------------------------
indexes.to.keep <- which((nchar(colnames(seqtab.nochim)) <= primer.data$max_amplicon_length[i]) &
                           (nchar(colnames(seqtab.nochim)) >= primer.data$min_amplicon_length[i]))
cleaned.seqtab.nochim <- seqtab.nochim[, indexes.to.keep]
filteredout.seqtab.nochim <- seqtab.nochim[, -indexes.to.keep] # ASVs outside the length bounds
write.csv(filteredout.seqtab.nochim, paste0(output_location, "/logs/", "filtered_out_asv.csv"))
The upper and lower bounds of the filtering are provided by primer_data.csv. The filtered-out ASVs are written to filtered_out_asv.csv in the logs directory.
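For reference, these are the primer_data.csv columns referenced throughout this page, shown with placeholder values (your file may contain additional columns, and these numbers are illustrative, not our actual settings):
locus_shorthand,F_qual,R_qual,max_trim,min_amplicon_length,max_amplicon_length
locus1,30,25,250,150,220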
To aid in keeping track of sequences in multi-sample experiments, we apply a hashing algorithm to the sequences DADA2 finds. Hashing "translates" each DNA sequence into a shorter, fixed-length string of letters and numbers. These hashes are shorter and easier to store. To learn more about hashing algorithms, check out this link.
Here is the code that implements the hashing on the ASVs DADA2 found:
### Create Hashing ---------------------------------------------------------------
# requires: digest, purrr, dplyr, tidyr, readr, tibble, seqinr (write.fasta)
# define output files
conv_file <- file.path(output_location, "csv_output", paste0(run_name, "_", primer.data$locus_shorthand[i], "_hash_key.csv"))
conv_file.fasta <- file.path(output_location, "csv_output", paste0(run_name, "_", primer.data$locus_shorthand[i], "_hash_key.fasta"))
ASV_file <- file.path(output_location, "csv_output", paste0(run_name, "_", primer.data$locus_shorthand[i], "_ASV_table.csv"))
taxonomy_file <- file.path(output_location, "csv_output", paste0(run_name, "_", primer.data$locus_shorthand[i], "_taxonomy_output.csv"))
bootstrap_file <- file.path(output_location, "logs", paste0(run_name, "_", primer.data$locus_shorthand[i], "_tax_bootstrap.csv"))

# create ASV table and hash key
print(paste0("creating ASV table and hash key...", Sys.time()))
seqtab.nochim.df <- as.data.frame(cleaned.seqtab.nochim)

# hash each ASV sequence with SHA-1 and build the hash <-> sequence key
Hashes <- map_chr(colnames(seqtab.nochim.df), ~ digest(.x, algo = "sha1", serialize = FALSE, skip = "auto"))
conv_table <- tibble(Hash = Hashes,
                     Sequence = colnames(seqtab.nochim.df))
write_csv(conv_table, conv_file)                        # write hash key into a csv
write.fasta(sequences = as.list(conv_table$Sequence),   # write hash key into a fasta
            names = as.list(conv_table$Hash),
            file.out = conv_file.fasta)

# build the long-format ASV table: one row per sample x hash with read counts
sample.df <- tibble::rownames_to_column(seqtab.nochim.df, "Sample_name")
sample.df <- data.frame(append(sample.df, c(Label = primer.data$locus_shorthand[i]), after = 1))
current_asv <- bind_cols(sample.df %>%
                           dplyr::select(Sample_name, Label),
                         seqtab.nochim.df)
current_asv <- current_asv %>%
  pivot_longer(cols = c(-Sample_name, -Label),
               names_to = "Sequence",
               values_to = "nReads") %>%
  filter(nReads > 0)
current_asv <- merge(current_asv, conv_table, by = "Sequence") %>%
  select(-Sequence) %>%
  relocate(Hash, .after = Label)
write_csv(current_asv, ASV_file)                        # write asv table into a csv
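As a quick illustration of what the hashing produces (the sequence below is just an example):
library(digest)
# hash the raw character string, not an R serialization of it
digest("TTAGATACCCCACTATGC", algo = "sha1", serialize = FALSE)
# returns a 40-character hexadecimal string; the length is the same for any input sequence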
We also create a known-sequence database for each primer. These databases (.csv files) are located in the muri_metabarcoding/metadata/known_hashes directory. Every time we assign taxonomy and find a new ASV whose assignment has 100% bootstrap confidence, we add that hash and its taxonomic assignments to that primer's known_hashes database. Assigning taxonomy is the most computationally heavy part of the pipeline, and these known hashes let us avoid repeating that work. Before assigning taxonomy, we first check whether any of the new ASVs are already in the known_hashes database; if so, we use the stored taxonomic assignment instead of assigning taxonomy again with DADA2 (a sketch of this lookup is shown below).