Trinity is a de novo RNA-seq assembler, written by Brian Haas and collaborators, that can handle fairly large datasets. It also has a mode called Genome Guided Trinity, which works by first aligning reads to the genome, binning the genome into segments, and assembling only the reads within each region. This helps cut down on paralogous genes being co-assembled, and it can be more accurate and a little faster because the problem is subdivided into smaller portions for the assembler.
Nearly all the info you need for running Trinity is on the website maintained by the developers, so it will not be repeated here; instead, this page emphasizes some points of clarification and
The tutorial developed by Running Trinity
These data should work just fine out of the gate. Remember that some datasets will be paired-end and some single-end.
Trinity expects data to
Fix read names ...
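The note above is truncated, so the following is only a sketch under an assumption: that it refers to the `/1` and `/2` suffixes Trinity has traditionally expected on paired read names. The function name `add_read_suffix` and the file names are hypothetical, and the input is assumed to be plain four-line FASTQ.

```shell
# Hypothetical helper: append a mate suffix (/1 or /2) to every FASTQ header.
# Reads FASTQ on stdin, writes FASTQ on stdout.
# Assumes four-line records (header, sequence, '+', qualities).
add_read_suffix() {
    awk -v suffix="$1" 'NR % 4 == 1 {
        # Keep only the first whitespace-delimited token of the header.
        split($0, parts, " ")
        name = parts[1]
        # Leave names that already end in /1 or /2 untouched.
        if (name !~ /\/[12]$/) name = name suffix
        print name
        next
    }
    { print }'
}

# Small demonstration on a single record.
printf '@read1 1:N:0:ACGT\nACGT\n+\nIIII\n' | add_read_suffix /1
```

On real data this could be used as `zcat sample_R1.fastq.gz | add_read_suffix /1 | gzip > sample_1.fq.gz` (placeholder file names), with `/2` for the second mate file.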
One critical aspect is knowing whether the data are strand-specific and, if so, whether the reads are organized as RF or FR. This can be guessed, but it requires a reference genome. The Trinity documentation discusses this; without a genome, it may require you to run a whole assembly iteration with Trinity and then return and re-map the reads against the resulting transcript assembly.
Several tools can help infer this, though generally only if you also have a genome to align the reads to.
- RSeQC (RNA-seq Quality Control Package) has a tool called infer_experiment.py
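As a rough sketch of how the RSeQC result might be turned into a Trinity setting: for paired-end data, `infer_experiment.py` reports the fraction of reads explained by `"1++,1--,2+-,2-+"` (first read in the sense orientation) and by `"1+-,1-+,2++,2--"` (first read antisense). The helper name `guess_ss_lib_type` and the 0.8 cutoff are assumptions made for illustration, not part of RSeQC or Trinity.

```shell
# Typical RSeQC call (genes.bed and aligned.bam are placeholder names):
#   infer_experiment.py -r genes.bed -i aligned.bam

# Hypothetical helper: map the two reported fractions to a Trinity library type.
#   $1 = fraction explained by "1++,1--,2+-,2-+"
#   $2 = fraction explained by "1+-,1-+,2++,2--"
# The 0.8 threshold is an arbitrary choice for this sketch.
guess_ss_lib_type() {
    awk -v fr="$1" -v rf="$2" 'BEGIN {
        if (fr >= 0.8)      print "FR"
        else if (rf >= 0.8) print "RF"
        else                print "unstranded"
    }'
}

# Example with made-up fractions resembling a dUTP-style library.
guess_ss_lib_type 0.03 0.95
```

If the result is `FR` or `RF`, it can be passed to Trinity as `--SS_lib_type`; for unstranded data the option is simply left out.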
Read about Genome Guided mode, which improves the accuracy of assembled transcripts by first aligning reads to a genome assembly and then building clusters of aligned reads. To run this you will need to have first aligned the reads to the genome.
The result of this run is a Trinity-GG.fasta file instead of Trinity.fasta.
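A minimal sketch of a genome-guided run, assuming the reads have already been aligned with a splice-aware aligner into `aligned.bam`. Trinity expects a coordinate-sorted BAM, so the sketch sorts first. The file names, the 10000 bp maximum intron, and the memory and CPU values are placeholders; the `run` wrapper is a hypothetical dry-run helper that only prints the commands unless `DRY_RUN=0` is set.

```shell
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

BAM=rnaseq.coordSorted.bam   # placeholder output name

# Coordinate-sort the alignments (aligned.bam is a placeholder input).
run samtools sort -@ 8 -o "$BAM" aligned.bam

# Genome-guided assembly; choose a max intron size that suits the organism.
run Trinity --genome_guided_bam "$BAM" \
    --genome_guided_max_intron 10000 \
    --max_memory 50G --CPU 8
```

Running with `DRY_RUN=0` executes the commands for real. With default settings the Trinity-GG.fasta file is normally written inside the Trinity output directory.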
For large read sets (e.g. 200M reads or more) you will want to assign a lot of memory. The maximum memory that can be requested on the intel queue is 500 GB, and the highmem queue can request up to 1 TB (1000 GB).
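The sketch below shows one way a large job might be requested, assuming the cluster scheduler is SLURM (the text above only names the queues, so this is an assumption). The read file names, the CPU count, and the 10% memory headroom are illustrative choices; `trinity_mem` is a hypothetical helper that converts the allocation reported by SLURM (in MB) into a value for Trinity's `--max_memory`.

```shell
#!/bin/bash
#SBATCH -p highmem
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=500G

# Hypothetical helper: convert an allocation in MB to a Trinity --max_memory
# value in GB, keeping roughly 10% back as headroom (an arbitrary margin).
trinity_mem() {
    echo "$(( $1 * 9 / 10 / 1024 ))G"
}

# SLURM exports these inside a job; the fallbacks allow testing outside one.
ALLOC_MB=${SLURM_MEM_PER_NODE:-512000}
CPUS=${SLURM_CPUS_PER_TASK:-32}
MAX_MEM=$(trinity_mem "$ALLOC_MB")

# Printed rather than executed; drop the leading 'echo' to run the assembly.
echo Trinity --seqType fq \
    --left reads_1.fq.gz --right reads_2.fq.gz \
    --max_memory "$MAX_MEM" --CPU "$CPUS"
```

Keeping `--max_memory` a little below the scheduler allocation is a design choice meant to reduce the chance of the job being killed for exceeding its memory request.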
The tool TransDecoder (also written by Brian Haas) can be used to infer open reading frames (ORFs) from a transcript assembly.
module load transdecoder
TransDecoder.LongOrfs -t Trinity.fasta
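The `LongOrfs` step is usually followed by `TransDecoder.Predict` on the same input, which writes the final peptide, CDS, and annotation files. The sketch below assumes `Trinity.fasta` is in the working directory and that the module has been loaded; `count_fasta_records` is a hypothetical helper for a quick check of how many ORFs were predicted.

```shell
TRANSCRIPTS=Trinity.fasta

# Hypothetical helper: count the records in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Only run when TransDecoder is on the PATH and the assembly exists.
if command -v TransDecoder.Predict >/dev/null 2>&1 && [ -s "$TRANSCRIPTS" ]; then
    # Step 1: extract long open reading frames.
    TransDecoder.LongOrfs -t "$TRANSCRIPTS"
    # Step 2: predict the likely coding regions among them.
    TransDecoder.Predict -t "$TRANSCRIPTS"
    # Report how many peptides were predicted.
    count_fasta_records "$TRANSCRIPTS.transdecoder.pep"
else
    echo "TransDecoder or $TRANSCRIPTS not available; skipping" >&2
fi
```

The predicted peptides are expected in `Trinity.fasta.transdecoder.pep`, alongside matching `.cds`, `.gff3`, and `.bed` files.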