forked from dib-lab/khmer
-
Notifications
You must be signed in to change notification settings - Fork 0
/
CITATION
210 lines (180 loc) · 9.4 KB
/
CITATION
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
.. vim: set filetype=rst
.. If you update this file then you may need to update the citations in
scripts/galaxy/macro.xml and khmer/khmer_args.py as well
Citations
---------
Software Citation
^^^^^^^^^^^^^^^^^
If you use the khmer software, you must cite:
Crusoe et al., The khmer software package: enabling efficient sequence
analysis. 2014. http://dx.doi.org/10.6084/m9.figshare.979190
.. code-block:: tex
@article{khmer2014,
author = "Crusoe, Michael and Edvenson, Greg and Fish, Jordan and Howe,
Adina and McDonald, Eric and Nahum, Joshua and Nanlohy, Kaben and
Ortiz-Zuazaga, Humberto and Pell, Jason and Simpson, Jared and Scott, Camille
and Srinivasan, Ramakrishnan Rajaram and Zhang, Qingpeng and Brown, C. Titus",
title = "The khmer software package: enabling efficient sequence
analysis",
year = "2014",
month = "04",
publisher = "Figshare",
url = "http://dx.doi.org/10.6084/m9.figshare.979190"
}
If you use any of our published scientific methods, you should *also*
cite the relevant paper(s), as directed below. Additionally some scripts use
the `SeqAn library <http://www.seqan.de>`_ for read parsing: the full citation
for that library is also included below.
To see a quick summary of papers for a given script just run it without using
any command line arguments.
Graph partitioning and/or compressible graph representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The load-graph.py, partition-graph.py, find-knots.py, load-graph.py,
and partition-graph.py scripts are part of the compressible graph
representation and partitioning algorithms described in:
Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT.
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7.
http://dx.doi.org/10.1073/pnas.1121464109.
PMID: 22847406
.. code-block:: tex
@article{Pell2012,
author = "Pell, Jason and Hintze, Arend and Canino-Koning, Rosangela and
Howe, Adina and Tiedje, James M. and Brown, C. Titus",
title = "Scaling metagenome sequence assembly with probabilistic de Bruijn
graphs",
volume = "109",
number = "33",
pages = "13272-13277",
year = "2012",
doi = "10.1073/pnas.1121464109",
abstract ="Deep sequencing has enabled the investigation of a wide range of
environmental microbial ecosystems, but the high memory requirements for de
novo assembly of short-read shotgun sequencing data from these complex
populations are an increasingly large practical barrier. Here we introduce a
memory-efficient graph representation with which we can analyze the k-mer
connectivity of metagenomic samples. The graph representation is based on a
probabilistic data structure, a Bloom filter, that allows us to efficiently
store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We
show that this data structure accurately represents DNA assembly graphs in low
memory. We apply this data structure to the problem of partitioning assembly
graphs into components as a prelude to assembly, and show that this reduces the
overall memory requirements for de novo assembly of metagenomes. On one soil
metagenome assembly, this approach achieves a nearly 40-fold decrease in the
maximum memory requirements for assembly. This probabilistic graph
representation is a significant theoretical advance in storing assembly graphs
and also yields immediate leverage on metagenomic assembly.",
URL = "http://www.pnas.org/content/109/33/13272.abstract",
eprint = "http://www.pnas.org/content/109/33/13272.full.pdf+html",
journal = "Proceedings of the National Academy of Sciences"
}
Digital normalization
^^^^^^^^^^^^^^^^^^^^^
The normalize-by-median.py and count-median.py scripts are part of
the digital normalization algorithm, described in:
A Reference-Free Algorithm for Computational Normalization of
Shotgun Sequencing Data
Brown CT, Howe AC, Zhang Q, Pyrkosz AB, Brom TH
arXiv:1203.4802 [q-bio.GN]
http://arxiv.org/abs/1203.4802
.. code-block:: tex
@unpublished{diginorm,
author = "C. Titus Brown and Adina Howe and Qingpeng Zhang and Alexis B.
Pyrkosz and Timothy H. Brom",
title = "A Reference-Free Algorithm for Computational Normalization of
Shotgun Sequencing Data",
year = "2012",
eprint = "arXiv:1203.4802",
url = "http://arxiv.org/abs/1203.4802",
}
K-mer counting
^^^^^^^^^^^^^^
The abundance-dist.py, filter-abund.py, and load-into-counting.py scripts
implement the probabilistic k-mer counting described in:
These Are Not the K-mers You Are Looking For: Efficient Online K-mer
Counting Using a Probabilistic Data Structure
Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT.
http://dx.doi.org/10.1371/journal.pone.0101271
.. code-block:: tex
@article{khmer-counting,
author = "Zhang, Qingpeng AND Pell, Jason AND Canino-Koning, Rosangela
AND Howe, Adina Chuang AND Brown, C. Titus",
journal = "PLoS ONE",
publisher = "Public Library of Science",
title = "These Are Not the K-mers You Are Looking For: Efficient Online
K-mer Counting Using a Probabilistic Data Structure",
year = "2014",
month = "07",
volume = "9",
url = "http://dx.doi.org/10.1371%2Fjournal.pone.0101271",
pages = "e101271",
abstract = "<p>K-mer abundance analysis is widely used for many purposes in
nucleotide sequence analysis, including data preprocessing for de novo
assembly, repeat detection, and sequencing coverage estimation. We present the
khmer software package for fast and memory efficient <italic>online</italic>
counting of k-mers in sequencing data sets. Unlike previous methods based on
data structures such as hash tables, suffix arrays, and trie structures, khmer
relies entirely on a simple probabilistic data structure, a Count-Min Sketch.
The Count-Min Sketch permits online updating and retrieval of k-mer counts in
memory which is necessary to support online k-mer analysis algorithms. On
sparse data sets this data structure is considerably more memory efficient than
any exact data structure. In exchange, the use of a Count-Min Sketch introduces
a systematic overcount for k-mers; moreover, only the counts, and not the
k-mers, are stored. Here we analyze the speed, the memory usage, and the
miscount rate of khmer for generating k-mer frequency distributions and
retrieving k-mer counts for individual k-mers. We also compare the performance
of khmer to several other k-mer counting packages, including Tallymer,
Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the
effectiveness of profiling sequencing error, k-mer abundance trimming, and
digital normalization of reads in the context of high khmer false positive
rates. khmer is implemented in C++ wrapped in a Python interface, offers a
tested and robust API, and is freely available under the BSD license at
github.com/ged-lab/khmer.</p>",
number = "7",
doi = "10.1371/journal.pone.0101271"
}
FASTA and FASTQ reading
^^^^^^^^^^^^^^^^^^^^^^^
Several scripts use the SeqAn library for FASTQ and FASTA reading as described
in:
SeqAn An efficient, generic C++ library for sequence analysis
Döring A, Weese D, Rausch T, Reinert K.
http://dx.doi.org/10.1186/1471-2105-9-11
.. code-block:: tex
@Article{SeqAn,
AUTHOR = {Doring, Andreas and Weese, David and Rausch, Tobias and Reinert,
Knut},
TITLE = {SeqAn An efficient, generic C++ library for sequence analysis},
JOURNAL = {BMC Bioinformatics},
VOLUME = {9},
YEAR = {2008},
NUMBER = {1},
PAGES = {11},
URL = {http://www.biomedcentral.com/1471-2105/9/11},
DOI = {10.1186/1471-2105-9-11},
PubMedID = {18184432},
ISSN = {1471-2105},
ABSTRACT = {BACKGROUND: The use of novel algorithmic techniques is pivotal
to many important problems in life science. For example the sequencing of
the human genome [1] would not have been possible without advanced assembly
algorithms. However, owing to the high speed of technological progress and
the urgent need for bioinformatics tools, there is a widening gap between
state-of-the-art algorithmic techniques and the actual algorithmic
components of tools that are in widespread use. RESULTS: To remedy this
trend we propose the use of SeqAn, a library of efficient data types and
algorithms for sequence analysis in computational biology. SeqAn comprises
implementations of existing, practical state-of-the-art algorithmic
components to provide a sound basis for algorithm testing and development.
In this paper we describe the design and content of SeqAn and demonstrate
its use by giving two examples. In the first example we show an application
of SeqAn as an experimental platform by comparing different exact string
matching algorithms. The second example is a simple version of the well-
known MUMmer tool rewritten in SeqAn. Results indicate that our
implementation is very efficient and versatile to use. CONCLUSION: We
anticipate that SeqAn greatly simplifies the rapid development of new
bioinformatics tools by providing a collection of readily usable, well-
designed algorithmic components which are fundamental for the field of
sequence analysis. This leverages not only the implementation of new
algorithms, but also enables a sound analysis and comparison of existing
algorithms.},
}