
Dada2 Documentation


Quality Control, Filtering, & Taxonomic Assignment

We use the DADA2 software to assign taxonomy to amplicon sequences. However, some processing is necessary before taxonomy can be assigned. A brief overview of the process is as follows:

  • Determine quality trimming length
  • Filter and trim for quality and length
  • Dereplicate
  • Learn error rates
  • Sample inference
  • Merge paired reads
  • Remove chimeras
  • Filter merged reads by size
  • Assign Taxonomy
  • Create and Save Output Files

Much of our DADA2 script is adapted from DADA2's Tutorial, which you can check out for further explanation. However, we modified and extended the existing tools to further automate and refine the process for our project's needs. Below is documentation for the code that deviates from DADA2's tutorial.

Determine quality trimming length

We use a sliding window algorithm to determine where to truncate reads based on quality score. With Illumina reads, quality tends to drop off toward the end of the read, especially for longer reads (300+ bp).

[Quality profile plots for forward and reverse reads]

In a typical DADA2 workflow, you would examine these quality plots and decide by eye where to trim. Ideally, you want to strike a balance between keeping enough of each sequence (so that paired reads can still be merged) and trimming off low-quality bases. As you might guess, there is plenty of debate about how to choose these parameters.
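
For reference, the manual inspection step from the DADA2 tutorial looks like this (fnFs and fnRs are assumed to hold the forward and reverse fastq file paths):

    library(dada2)
    # Visualize per-cycle quality for a couple of samples, as in the DADA2 tutorial
    plotQualityProfile(fnFs[1:2])   # forward reads
    plotQualityProfile(fnRs[1:2])   # reverse reads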

To aid reproducibility and automation, we created a simple algorithm that decides where to trim forward and reverse reads. Below is the code:

library(ShortRead) # provides qa()

n <- 500000       # number of reads qa() samples from each fastq file
window_size <- 10 # width of the sliding quality window

trimsF <- c()
for(f in fnFs[!is.na(fnFs)]) {
  srqa <- qa(f, n = n)
  df <- srqa[["perCycle"]]$quality # per-cycle quality summary; this is the slowest step
  means <- rowsum(df$Score * df$Count, df$Cycle) / rowsum(df$Count, df$Cycle) # mean quality at each cycle
  window_values <- c()
  for(j in 1:(length(means) - window_size)){
    window_values[j] <- mean(means[j:(j + window_size)]) # mean quality score within each window
  }
  where_to_cut <- min(which(window_values < primer.data$F_qual[i])) # first window whose mean quality dips below the forward threshold
  trimsF[f] <- where_to_cut
}
where_trim_all_Fs <- median(trimsF) # median of all trims - used as truncLen for forward reads

trimsR <- c()
for(r in fnRs[!is.na(fnRs)]) {
  srqa <- qa(r, n = n)
  df <- srqa[["perCycle"]]$quality
  means <- rowsum(df$Score * df$Count, df$Cycle) / rowsum(df$Count, df$Cycle)
  window_values <- c()
  for(j in 1:(length(means) - window_size)){
    window_values[j] <- mean(means[j:(j + window_size)]) # mean quality score within each window
  }
  where_to_cut <- min(which(window_values < primer.data$R_qual[i])) # first window whose mean quality dips below the reverse threshold
  trimsR[r] <- where_to_cut
}
where_trim_all_Rs <- median(trimsR) # median of all trims - used as truncLen for reverse reads

# Cap at max_trim if the obtained length is greater than max_trim
if(where_trim_all_Fs > primer.data$max_trim[i]){
  where_trim_all_Fs <- primer.data$max_trim[i]
}
if(where_trim_all_Rs > primer.data$max_trim[i]){
  where_trim_all_Rs <- primer.data$max_trim[i]
}

What this code does:

  • Determine the quality at each base - The qa() function calculates per-cycle quality statistics for each read.
  • Find the average quality at each position - rowsum() is then used to compute the mean quality score at each cycle across all reads.
  • Use a sliding window to determine where to trim - A window of size 10 slides along the vector of mean qualities. Once the mean quality within a window drops below the threshold supplied by primer_data.csv, the first position of that window is recorded as the cut point.
  • Repeat for all samples - The sliding window calculation is run on each sample, and each cut point is appended to trimsF/trimsR.
  • Use the median to determine where to trim all samples - The median of trimsF/trimsR is passed to truncLen to trim the right end of reads based on quality (see the sketch after this list). Note that forward and reverse reads are handled separately, since reverse reads tend to have lower overall quality and an earlier quality drop-off.
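
As a minimal sketch of how these values are used downstream, they would typically be passed to DADA2's filterAndTrim(); the output paths filtFs/filtRs and the other filtering parameters shown here are illustrative assumptions, not necessarily the settings our pipeline uses:

    # Sketch only: feed the data-driven truncation lengths into DADA2's filtering step.
    # filtFs/filtRs (output paths) and the maxEE/truncQ values are illustrative assumptions.
    out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                         truncLen = c(where_trim_all_Fs, where_trim_all_Rs),
                         maxN = 0, maxEE = c(2, 2), truncQ = 2,
                         rm.phix = TRUE, compress = TRUE, multithread = TRUE)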

Filter Merged Reads by Size

After merging reads, there will sometimes be sequences that are longer or shorter than your target sequence length. This can happen for many reasons: sample contamination, low-quality reads, etc. For example, when we were processing our data, we found bacterial sequences being amplified by primers designed for fish DNA. Because these sequences were 100 bp longer than our target sequence, we were able to filter them out at this step.

We use this code to filter out reads by size:

### Filter by Size ---------------------------------------------------------------
    indexes.to.keep <- which((nchar(colnames(seqtab.nochim)) <= primer.data$max_amplicon_length[i]) &
                             (nchar(colnames(seqtab.nochim)) >= primer.data$min_amplicon_length[i]))
    cleaned.seqtab.nochim <- seqtab.nochim[, indexes.to.keep]
    filteredout.seqtab.nochim <- seqtab.nochim[, -indexes.to.keep, drop = FALSE] # columns outside the size window
    write.csv(filteredout.seqtab.nochim, paste0(output_location, "/logs/", "filtered_out_asv.csv"))

The upper and lower bounds of the filter are provided by primer_data.csv. The filtered-out ASVs are written to filtered_out_asv.csv in the logs directory.
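
If you want to sanity-check those bounds before filtering, a quick look at the merged-sequence length distribution can help (a minimal sketch, not part of the pipeline itself):

    # Optional sanity check: length distribution of merged sequences, and the
    # number of ASVs the size filter removes.
    table(nchar(colnames(seqtab.nochim)))
    ncol(seqtab.nochim) - length(indexes.to.keep)   # ASVs outside the size window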

Sequence Hashing & Known Sequence Database

To help keep track of sequences across multi-sample experiments, we apply a hashing algorithm to the sequences DADA2 finds. Hashing "translates" each DNA sequence into a short, fixed-length string of letters and numbers, which is easier to store and compare. To learn more about hashing algorithms, check out this link.
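
As a small illustration (the sequence below is arbitrary), the digest() function from the digest package, which the code below relies on, maps any sequence to a fixed-length SHA-1 string; hashing the same sequence always returns the same hash:

    library(digest)
    # Returns a 40-character hexadecimal SHA-1 digest; identical input always
    # yields an identical hash, so the hash can stand in for the full sequence.
    digest("TTAGATACCCCACTATGC", algo = "sha1", serialize = FALSE)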

Here is the code that implements hashing on the ASVs DADA2 finds:

### Create Hashing  ---------------------------------------------------------------
    # packages used below: digest, purrr, tibble, readr, seqinr (write.fasta), dplyr, tidyr
    
    # define output files
    conv_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.csv"))
    conv_file.fasta <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_hash_key.fasta"))
    ASV_file <-  file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_ASV_table.csv"))
    taxonomy_file <- file.path(output_location,"csv_output",paste0(run_name,"_",primer.data$locus_shorthand[i],"_taxonomy_output.csv"))
    bootstrap_file <- file.path(output_location,"logs",paste0(run_name,"_",primer.data$locus_shorthand[i],"_tax_bootstrap.csv"))
    
    # create ASV table and hash key 
    print(paste0("creating ASV table and hash key...", Sys.time()))
    seqtab.nochim.df <- as.data.frame(cleaned.seqtab.nochim)
    Hashes <- map_chr(colnames(seqtab.nochim.df), ~ digest(.x, algo = "sha1", serialize = F, skip = "auto")) # SHA-1 hash of each ASV sequence
    conv_table <- tibble(Hash = Hashes,
                         Sequence = colnames(seqtab.nochim.df))
    
    write_csv(conv_table, conv_file) # write hash key into a csv
    write.fasta(sequences = as.list(conv_table$Sequence), # write hash key into a fasta
                names     = as.list(conv_table$Hash),
                file.out = conv_file.fasta)
    sample.df <- tibble::rownames_to_column(seqtab.nochim.df, "Sample_name")
    sample.df <- data.frame(append(sample.df, c(Label = primer.data$locus_shorthand[i]), after = 1))
    current_asv <- bind_cols(sample.df %>%
                               dplyr::select(Sample_name, Label),
                             seqtab.nochim.df)
    current_asv <- current_asv %>%
      pivot_longer(cols = c(-Sample_name, -Label),
                   names_to = "Sequence",
                   values_to = "nReads") %>%
      filter(nReads > 0)
    current_asv <- merge(current_asv, conv_table, by = "Sequence") %>%
      select(-Sequence) %>%
      relocate(Hash, .after = Label)
  
    write_csv(current_asv, ASV_file) # write asv table into a csv

We also create a known-sequence database for each primer. These databases (.csv files) are located in the muri_metabarcoding/metadata/known_hashes directory. Every time we assign taxonomy and find a new ASV whose assignment has 100% bootstrap confidence, we add that hash and its taxonomic assignments to that primer's known_hashes database. Assigning taxonomy is the most computationally expensive part of the pipeline, and these known hashes let us avoid redundant work. Before assigning taxonomy, we first check whether any of the new ASVs are already in the known_hashes database; if so, we reuse that taxonomic assignment instead of assigning taxonomy again with DADA2.
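
A minimal sketch of that lookup, assuming a known-hashes file with a Hash column plus taxonomy columns (the file name and column names here are hypothetical, not the pipeline's exact schema):

    # Hypothetical sketch of the known-hash lookup; the file path and column names are assumptions.
    known <- readr::read_csv(file.path("metadata", "known_hashes",
                                       paste0(primer.data$locus_shorthand[i], "_known_hashes.csv")))
    already_known <- conv_table$Hash %in% known$Hash
    # Only sequences whose hashes are new need to go through DADA2's assignTaxonomy()
    seqs_to_classify <- conv_table$Sequence[!already_known]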