Cluster raw Illumina 16S rRNA amplicon data to generate OTUs.
Lederhosen is free and open source under the MIT open source license
-
Obtain & Install UCLUST
-
Obtain & Install BLAT
-
Get a copy of TaxCollector
-
Install Lederhosen by typing:
sudo gem install lederhosen
-
Check installation by typing
lederhosen
. You should see some help text.
- Sequence trimming (paired-end Illumina).
- K-mer filtering.
- Clustering w/ UCLUST.
- UCLUST output filtering.
- Separation of representative reads.
- Separation of all reads belonging to each cluster.
- Identification of clusters using TaxCollector.
- Generation of OTU abundancy matrices.
Lederhosen is just a convenient wrapper for UCLUST and BLAT with some scripts for quality filtering, de-noising of data as well as creation of nice tables. It is similar to QIIME but meant for paired-end Illumina data rather than single-end 454. The basic lederhosen pipeline consists of: trimming, joining, sorting, filtering, clustering, more filtering, and output generation (OTU tables, representative reads, reads by cluster, and taxonomic descriptions for clusters). See the example pipeline in pipeline.sh
.
Lederhosen is invoked by typing lederhosen [TASK]
Trim (Illumina) reads using quality scores. Output will be a directory of fasta files. Reads can optionally be gzipped.
lederhosen trim --reads_dir=reads/*.txt --out_dir=trimmed/
Join paired reads from all samples end-to-end. This method enables the use of uclust with paired-end data. Output will be a single fasta file.
lederhosen join --trimmed=trimmed/*.fasta --output=joined.fasta
If your reads are not paired, then you do not need to do this step. Instead, concatenate all of the trimmed reads files.
cat trimmed/*.fasta > joined.fasta
Sort reads by length. This is a requirement for uclust's single-linkage clustering algorithim.
lederhosen sort --input=joined.fasta --output=sorted.fasta
K-mer abundance noise filtering. This step is experimental and optional. It may reduce the time it takes to perform the clustering.
lederhosen k_filter --input=joined.fasta --output=filtered.fasta --k=10 --cutoff=50
Cluster reads using UCLUST. Output is a uc file.
lederhosen cluster --input=sorted.fasta --identity=0.80 --output=clusters.uc
Filter UC file removing singleton clusters or clusters that are only present in a few samples. This greatly reduces the noise of the data without removing many of the reads.
lederhosen uc_filter --input=clusters.uc --output=clusters.uc.filtered --reads=50 --samples=10
Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
lederhosen otu_table --clusters=clusters.uc --output=otu_prefix.csv
Get representative reads for each cluster. Output is a single fasta file.
lederhosen rep_reads --clusters=clusters.uc --joined=joined.fasta --output=representative_reads.fasta
Get all reads belonging to each cluster. Output is a directory containing a fasta file for each cluster. The fasta file contains the joined reads.
lederhosen split --clusters=clusters.uc --reads=joined.fasta --min-clst-size=100
Identify clusters in a database using the representative reads. This is a simple wrapper for BLAT. The output is a tab-delimited file similar to a BLAST output file. For this step you need to have BLAT installed and also a TaxCollector database.
lederhosen name --reps=representative_reads.fasta --database taxcollector.fa --output blast_like_output.txt
Add phylogenetic classification of clusters to OTU abundance file.
lederhosen add_names --blat=blat_output.txt --level=taxonomic_level --table=otu_file.csv --output=named_out_file.csv
Where taxonomic_level
can be: kingdom, domain, phylum, class, order, family, genus or species. This method only works with a TaxCollector database.
Squish an OTU abundance file by column name (phylogenetic description)
lederhosen squish --csv-file=named_out_file.csv --output=squished_named_out_file.csv