EMIRGE: Expectation-Maximization Iterative Reconstruction of Genes from the Environment
Copyright (C) 2010-2012 Christopher S. Miller ([email protected])
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>
https://github.com/csmiller/EMIRGE
EMIRGE reconstructs full length ribosomal genes from short read
sequencing data. In the process, it also provides estimates of the
sequences' abundances.
EMIRGE uses a modification of the EM algorithm to iterate between
estimating the expected value of the abundance of all SSU sequences
present in a sample and estimating the probabilities for each read
that a specific sequence generated that read. At the end of each
iteration, those probabilities are used to re-calculate (correct) a
consensus sequence for each reference SSU sequence, and the mapping is
repeated, followed by the estimations of probabilities. The
iterations should usually stop when the reference sequences no longer
change from one iteration to the next. Practically, 40-80 iterations is
usually sufficient for many samples. Right now EMIRGE uses Bowtie
alignments internally, though in theory a different mapper could be
used.
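
The abundance part of the iteration described above can be illustrated
with a small, self-contained sketch. This is NOT the EMIRGE
implementation: it assumes the per-read likelihoods are already known
and fixed, and it omits the consensus correction and re-mapping steps
entirely. The function name em_abundances and the toy likelihood
table are hypothetical and exist only for illustration.

```python
def em_abundances(likelihoods, n_iter=50):
    """Toy EM: estimate sequence abundances from fixed read likelihoods.

    likelihoods[r][s] is an assumed value for P(read r | sequence s).
    Every read must have a nonzero likelihood for at least one sequence.
    """
    n_reads = len(likelihoods)
    n_seqs = len(likelihoods[0])
    # Start from a uniform abundance estimate.
    priors = [1.0 / n_seqs] * n_seqs
    for _ in range(n_iter):
        # E-step: probability that each sequence generated each read.
        posteriors = []
        for row in likelihoods:
            weighted = [p * l for p, l in zip(priors, row)]
            total = sum(weighted)
            posteriors.append([w / total for w in weighted])
        # M-step: new abundance is the mean posterior over all reads.
        priors = [
            sum(post[s] for post in posteriors) / n_reads
            for s in range(n_seqs)
        ]
    return priors


# Hypothetical likelihoods for 4 reads against 2 candidate sequences.
toy = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.5, 0.5],
    [0.1, 0.9],
]
abund = em_abundances(toy)
print(abund)
```

In this toy input most reads favor the first sequence, so its
estimated abundance ends up larger than that of the second.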
EMIRGE was designed for Illumina reads in FASTQ format, from pipeline
version >= 1.3.
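
As background on the FASTQ format mentioned above: Illumina pipeline
versions 1.3 through 1.7 store quality scores as ASCII characters with
an offset of 64 (Phred+64), while Sanger-style files and later
Illumina versions use an offset of 33. The short sketch below only
shows how such a quality string decodes. The helper name
decode_qualities is hypothetical, and nothing here describes how
EMIRGE itself reads qualities.

```python
def decode_qualities(quality_string, offset=64):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=64 matches Illumina pipeline 1.3-1.7 encoding;
    use offset=33 for Sanger-style (Phred+33) files.
    """
    return [ord(ch) - offset for ch in quality_string]


# 'h' is ASCII 104 and 'B' is ASCII 66, so with offset 64:
print(decode_qualities("hhhB"))             # [40, 40, 40, 2]
# 'I' is ASCII 73, so with offset 33:
print(decode_qualities("III", offset=33))   # [40, 40, 40]
```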
There are two versions of EMIRGE:
1. emirge.py -- this version was designed for metagenomic data
2. emirge_amplicon.py -- this version was designed for rRNA amplicon data, and can handle up to a few million reads where the entire sequencing allocation is devoted to a single gene. In theory it could also be used for RNASeq data where rRNA makes up a large percentage of the reads. There is a publication that has been submitted describing this application of EMIRGE.
CITATION
------------------------------
If you use EMIRGE in your work, please cite:
Miller, C.S., B. J. Baker, B. C. Thomas, S. W. Singer and J. F. Banfield (2011). "EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data." Genome Biology 12(5): R44.
DEPENDENCIES
------------------------------
EMIRGE expects the following programs to be installed and available in your path:
-python (tested with version 2.6), with the following packages installed:
    -BioPython
    -Cython
    -pysam
    -scipy / numpy
-usearch (www.drive5.com/usearch/ -- tested with usearch version 6.0.203; versions earlier than this are incompatible).
-samtools (http://samtools.sourceforge.net/ -- tested with version 0.1.18)
-bowtie (http://bowtie-bio.sourceforge.net/index.shtml -- tested with version 0.12.7 and 0.12.8)
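
Before building, it can be handy to check that the dependencies listed
above are visible. The following is a minimal sketch, not part of
EMIRGE. The executable names (in particular "usearch", whose
downloaded binary often carries a version-specific file name) and the
importable module names are assumptions and may need adjusting for
your system. Run it with the same python interpreter you intend to use
for EMIRGE.

```python
import os

# Assumed executable and module names; adjust to match your installation.
PROGRAMS = ["usearch", "samtools", "bowtie", "bowtie-build"]
MODULES = ["Bio", "Cython", "pysam", "scipy", "numpy"]


def find_program(name):
    """Return the full path of an executable found on PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def has_module(name):
    """Return True if the named module can be imported."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


def missing_dependencies(programs=PROGRAMS, modules=MODULES):
    """Return (missing_programs, missing_modules) as two lists."""
    missing_programs = [p for p in programs if find_program(p) is None]
    missing_modules = [m for m in modules if not has_module(m)]
    return missing_programs, missing_modules


if __name__ == "__main__":
    progs, mods = missing_dependencies()
    print("Missing programs: %s" % (", ".join(progs) or "none"))
    print("Missing python modules: %s" % (", ".join(mods) or "none"))
```

The check only confirms that something with the expected name can be
found. It does not verify version numbers.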
INSTALLATION
------------------------------
After installing the dependencies listed above, type the following to build EMIRGE:
$ python setup.py build
To install (you may skip straight to this step), type the following as root, or with sudo:
$ python setup.py install
You can also type the following for more options:
$ python setup.py --help install
For example, to install to a location in your home directory where you
have permission to write, you might type something like:
$ python setup.py install --prefix=$HOME/software
HELP
------------------------------
There is a Google group (similar to a mailing list) for asking questions
about EMIRGE:
https://groups.google.com/group/emirge-users
Also, there is some additional information (including a Frequently
Asked Questions section) on the GitHub wiki:
https://github.com/csmiller/EMIRGE/wiki
Although I encourage use of the Google group due to the increased volume
of support emails, please feel free to contact me directly
([email protected]) with any problems, bug reports, or questions.
At the moment, there is no manual, though running the following is helpful:
emirge.py --help
EMIRGE OUTPUT
------------------------------
Once an EMIRGE run is completed, run emirge_rename_fasta.py on the
final iteration directory, for example:
emirge_rename_fasta.py iter.40 > renamed.fasta
Also see:
emirge_rename_fasta.py --help
Running emirge_rename_fasta.py will provide you with a fasta file with
EMIRGE output. Dissecting a single example header:
>3326|AF427479.1.1480_m01 Prior=0.000367 Length=1480 NormPrior=0.000414
 1    2              3    4              5           6
1. The internal EMIRGE ID -- unique for each sequence
2. The accession number of the starting candidate sequence
3. An optional suffix indicating that this sequence was split out from another due to evidence in the mapping reads of 2 or more "strains."
4. The Prior, or abundance estimate (used in original publication)
5. The length of the sequence
6. The length-normalized abundance estimate (anecdotally, this is sometimes more accurate if there are lots of different sequence lengths)
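
For downstream scripting, the six fields described above can be pulled
out of a header line with a regular expression. This is a sketch based
only on the single example header shown here. In particular, the
assumption that the optional suffix always looks like "_m" followed by
digits is a guess from that example, and the function name
parse_emirge_header is hypothetical.

```python
import re

# Pattern inferred from the one example header above (an assumption).
HEADER_RE = re.compile(
    r"^>?(?P<emirge_id>\d+)\|"
    r"(?P<accession>\S+?)(?P<suffix>_m\d+)?"
    r"\s+Prior=(?P<prior>\S+)"
    r"\s+Length=(?P<length>\d+)"
    r"\s+NormPrior=(?P<norm_prior>\S+)\s*$"
)


def parse_emirge_header(line):
    """Split a renamed EMIRGE fasta header into its six fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized header: %r" % line)
    return {
        "emirge_id": int(match.group("emirge_id")),
        "accession": match.group("accession"),
        "suffix": match.group("suffix"),  # None when there is no suffix
        "prior": float(match.group("prior")),
        "length": int(match.group("length")),
        "norm_prior": float(match.group("norm_prior")),
    }


example = (">3326|AF427479.1.1480_m01 Prior=0.000367 "
           "Length=1480 NormPrior=0.000414")
fields = parse_emirge_header(example)
print(fields["accession"], fields["suffix"], fields["norm_prior"])
```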
CANDIDATE SSU DATABASE
------------------------------
SSU_candidate_db.fasta.gz is included with this distribution. This was
made using Silva release SSURef_108_NR (http://www.arb-silva.de/). Sequences
were clustered using uclust at 97% sequence identity, short and long
sequences were removed, and non-standard characters were changed to be
within {ACTG} (using utils/fix_nonstandard_chars.py).
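
To make the character clean-up concrete, here is one possible way to
force a sequence into the {ACTG} alphabet. It is an illustration only
and does NOT reproduce utils/fix_nonstandard_chars.py, whose actual
replacement rule is not described here. The choices below
(upper-casing, mapping U to T, and replacing anything else with a
randomly chosen base) are all assumptions, and the function name
standardize is hypothetical.

```python
import random

STANDARD = "ACGT"


def standardize(sequence, rng=None):
    """Return the sequence with every character forced into {A, C, G, T}.

    Assumed rules (not taken from EMIRGE): upper-case the input,
    map U to T, and replace any other character with a random base.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the result repeatable
    cleaned = []
    for base in sequence.upper():
        if base in STANDARD:
            cleaned.append(base)
        elif base == "U":
            cleaned.append("T")
        else:
            cleaned.append(rng.choice(STANDARD))
    return "".join(cleaned)


result = standardize("ACGUNNRYacgt")
print(result)
```

For real data, use the script shipped with the distribution.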
You can use any reference SSU database with EMIRGE, though this one is
recommended. No matter your choice, you should run
utils/fix_nonstandard_chars.py on your fasta file. You will also need
to first build a bowtie index, with something like:
$ bowtie-build SSU_candidate_db.fasta SSU_candidate_db_btindex
You might also consider changing the offrate (see
http://bowtie-bio.sourceforge.net/manual.shtml).
OTHER
------------------------------
** A note about single-end sequencing:
EMIRGE was designed for and tested on paired-end sequencing reads.
However, you can now use EMIRGE on single-end reads as well: simply
omit the -2 parameter. Although I have done some basic testing on
single-end reads, runs with single reads have NOT been as extensively
tested as runs with paired reads. Please let me know how it works for
you if you try EMIRGE with single-end reads.