Version 2.0.8
A pipeline for profiling TE-derived small RNAs.
Created by Wen-Wei Liao, Kat O'Neill & Molly Gale Hammell, March 2017
Contact: [email protected]
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
$ git clone https://github.com/mhammell-laboratory/TEsmall.git
$ cd TEsmall
$ conda env create -f environment.yaml -n TEsmall
$ conda activate TEsmall
$ python setup.py install
- Before executing TEsmall, make sure you have activated the environment:
$ conda activate TEsmall
- For example, to run TEsmall on two FASTQ files, Parental_1.fastq.gz and DroKO_1.fastq.gz:
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -l Parental DroKO
- When the run has finished, deactivate the environment:
$ conda deactivate
- To change the directory where TEsmall downloads and reads the genome files it uses for annotation, set the --dbfolder parameter at runtime:
$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -g hg19 -l Parental DroKO --dbfolder /path/to/another/folder/
The files used by TEsmall will be downloaded to, and read from, the genomes folder inside /path/to/another/folder/. The default location is $HOME/TEsmall_db/
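For scripted runs, the command lines above can be assembled programmatically. The following is a minimal sketch in Python, not part of TEsmall itself: the helper names `build_tesmall_command` and `genomes_dir` are hypothetical, and only the flags documented in this README (-f, -l, -g, --dbfolder) are used. The `genomes_dir` helper simply mirrors the folder layout described above (a "genomes" folder inside the database folder) and does not reflect TEsmall's internal code.

```python
import os
from typing import Dict, List, Optional


def build_tesmall_command(
    samples: Dict[str, str],
    genome: Optional[str] = None,
    dbfolder: Optional[str] = None,
) -> List[str]:
    """Build a TEsmall argument list from a {label: fastq_path} mapping.

    Hypothetical helper: using a dict keeps one label per FASTQ file and
    guarantees that labels are unique, as the -l option requires.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    labels = list(samples)  # insertion order is preserved
    fastqs = [samples[label] for label in labels]
    cmd = ["TEsmall", "-f", *fastqs]
    if genome is not None:
        cmd += ["-g", genome]
    cmd += ["-l", *labels]
    if dbfolder is not None:
        cmd += ["--dbfolder", dbfolder]
    return cmd


def genomes_dir(dbfolder: Optional[str] = None) -> str:
    """Return the 'genomes' folder for a given (or the default) database folder."""
    base = dbfolder or os.path.join(os.path.expanduser("~"), "TEsmall_db")
    return os.path.join(base, "genomes")


samples = {"Parental": "Parental_1.fastq.gz", "DroKO": "DroKO_1.fastq.gz"}
cmd = build_tesmall_command(samples, genome="hg19", dbfolder="/path/to/another/folder/")
print(" ".join(cmd))
print(genomes_dir("/path/to/another/folder/"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a shell in which the TEsmall environment is already activated.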
$ TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
[--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
[-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]
optional arguments:
-h, --help show this help message and exit
-a STR, --adapter STR
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
-m INT, --minlen INT Discard trimmed reads that are shorter than INT. Reads
that are too short even before adapter removal are
also discarded. (default: 16)
-M INT, --maxlen INT Discard trimmed reads that are longer than INT. Reads
that are too long even before adapter removal are also
discarded. (default: 36)
-g STR, --genome STR Version of reference genome (default: hg38)
--maxaln INT Suppress all alignments for a particular read if more
than INT reportable alignments exist for it. (default:
100)
--mismatch INT Report alignments with at most INT mismatches.
(default: 0)
-o STR [STR ...], --order STR [STR ...]
Annotation priority. (default: structural_RNA miRNA
hairpin exon TE intron piRNA_cluster)
-p INT, --parallel INT
Parallel execute by INT CPUs. (default: 1)
-f STR [STR ...], --fastq STR [STR ...]
Input in FASTQ format. Compressed input is supported
and auto-detected from the filename extension (.gz).
-l STR [STR ...], --label STR [STR ...]
Unique label for each sample.
--dbfolder STR Custom location of TEsmall database folder (containing the "genomes" folder).
DEFAULT: $HOME/TEsmall_db/
--verbose INT Set verbose level.
0: only show critical message
1: show additional warning message
2: show process information
3: show debug messages.
DEFAULT: 2
-v, --version show program's version number and exit
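To make the -m/--minlen and -M/--maxlen thresholds concrete, here is a small illustrative sketch in Python. It is not TEsmall's implementation; the function names `keep_read` and `filter_reads` are hypothetical. It only shows which read lengths survive under the documented defaults: reads shorter than 16 or longer than 36 are discarded, so both bounds are inclusive.

```python
from typing import Iterable, List


def keep_read(sequence: str, minlen: int = 16, maxlen: int = 36) -> bool:
    """Return True if the read length lies within [minlen, maxlen]."""
    return minlen <= len(sequence) <= maxlen


def filter_reads(
    sequences: Iterable[str], minlen: int = 16, maxlen: int = 36
) -> List[str]:
    """Keep only the reads that pass the length thresholds."""
    return [s for s in sequences if keep_read(s, minlen, maxlen)]


# Four synthetic reads of length 15, 16, 36 and 37.
reads = ["A" * 15, "A" * 16, "A" * 36, "A" * 37]
kept = filter_reads(reads)
print([len(r) for r in kept])  # [16, 36]
```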
Here are some brief explanations of the output files generated by TEsmall
count_summary.txt - This is the file containing the combined count table
of all libraries processed by TEsmall. This is typically
the file you want to use for differential analysis.
report.html - HTML report of QC and annotation statistics
The following files are generated for each library and are named using the label supplied with the -l, --label parameter.
[label].trimmed1.fastq - FASTQ file after 3' adapter trimming
[label].cutadapt1.log - Cutadapt log from 3' adapter trimming
[label].trimmed2.fastq - FASTQ file after 3' & 5' adapter trimming
[label].cutadapt2.log - Cutadapt log from 5' adapter trimming
[label].bam - BAM output for reads that aligned to rRNA (in older versions)
[label].rRNA.bam - BAM output for reads that aligned to rRNA
[label].rRNA.log - Bowtie log for rRNA mapping
[label].rm_rRNA.fastq - FASTQ file depleted of rRNA reads; used for subsequent analysis
[label].log - Bowtie log for genome alignment (in older versions)
[label].genome.log - Bowtie log for genome alignment
[label].unaligned.fastq - FASTQ containing reads that failed to align to genome
[label].exceeded.fastq - FASTQ containing reads that aligned too many times to genome
[label].rinfo - Length & alignment counts for each aligned read (in older versions)
[label].aligned.rinfo - Length & alignment counts for each aligned read
[label].multi.bam - BAM output for reads aligned to genome (in older versions)
[label].genome.bam - BAM output for reads aligned to genome
[label].cca.fa - FASTA file containing aligned reads terminating with CCA, with CCA tail cleaved
[label].tRNA.bam - BAM output for CCA-trimmed reads that aligned to tRNA
[label].3trf.log - Bowtie log for CCA-trimmed reads aligning to tRNA (in older versions)
[label].tRNA.log - Bowtie log for CCA-trimmed reads aligning to tRNA
[label].unaligned.cca.fa - FASTA file containing CCA-trimmed reads that failed to align
[label].trna_for_intersect.bam - BAM file of CCA-trimmed reads that aligned to tRNA, converted to genomic coordinates
[label].3trf_free.bam - BAM file of reads aligned to genome that are not tRF
[label].3trf.bam - BAM file of reads aligned to genome that are tRF
[label].anno - Annotation of aligned reads that are not tRF
[label].3trf.struc.mapper.anno - tRF that annotated to structural RNA (e.g. tRNA)
[label].3trf.TE.mapper.anno - tRF that annotated to TE
[label].comp - Length distribution of reads based on annotation (in older versions)
[label].anno.rlen.info - Length distribution of reads based on annotation
[label].bedgraph - BEDgraph of annotated reads weighted by EM
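Several entries above list a file name used in older versions alongside its current equivalent. When comparing results from runs made with different versions, a small lookup can translate the older names. The sketch below is in Python and the helper `current_name` is hypothetical; the mapping is taken only from the pairs listed above, and the assumption that each older file corresponds one-to-one to the current file with the same description is an inference from this list.

```python
# Older suffix -> current suffix, as paired in the file list above.
LEGACY_SUFFIXES = {
    ".bam": ".rRNA.bam",
    ".log": ".genome.log",
    ".rinfo": ".aligned.rinfo",
    ".multi.bam": ".genome.bam",
    ".3trf.log": ".tRNA.log",
    ".comp": ".anno.rlen.info",
}


def current_name(filename: str, label: str) -> str:
    """Translate an older per-library file name to its current name.

    The suffix after the label must match exactly, so files that already
    use a current name (e.g. '<label>.rRNA.bam') are returned unchanged.
    """
    prefix = label + "."
    if not filename.startswith(prefix):
        return filename  # not a per-library file for this label
    suffix = filename[len(label):]
    return label + LEGACY_SUFFIXES.get(suffix, suffix)


print(current_name("Parental.multi.bam", "Parental"))  # Parental.genome.bam
print(current_name("Parental.rRNA.bam", "Parental"))   # Parental.rRNA.bam
```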
TEsmall is part of the TEToolkit suite.
TEsmall is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with TEsmall. If not, see <https://www.gnu.org/licenses/>.
If using the software in a publication, please cite the following:
O'Neill K, Liao WW, Patel A, Hammell MG. (2018) TEsmall Identifies Small RNAs Associated With Targeted Inhibitor Resistance in Melanoma. Front Genet. Oct 5;9:461.