<!DOCTYPE html>
<html lang="en" data-content_root="./">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>5. Building a custom database — ProPhyle 0.3.3.2 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d10597a4" />
<link rel="stylesheet" type="text/css" href="_static/sphinx13.css?v=ec60a4c3" />
<script src="_static/documentation_options.js?v=e7e3125d"></script>
<script src="_static/doctools.js?v=9a2dae69"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<link rel="search" type="application/opensearchdescription+xml"
title="Search within ProPhyle 0.3.3.2 documentation"
href="_static/opensearch.xml"/>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="6. Classifying reads" href="classify.html" />
<link rel="prev" title="4. Building a standard database" href="standard_db.html" />
<link href='https://fonts.googleapis.com/css?family=Open+Sans:300,400,700'
rel='stylesheet' type='text/css' />
<style type="text/css">
table.right { float: right; margin-left: 20px; }
table.right td { border: 1px solid #ccc; }
</style>
<script type="text/javascript">
// intelligent scrolling of the sidebar content
$(window).scroll(function() {
var sb = $('.sphinxsidebarwrapper');
var win = $(window);
var sbh = sb.height();
var offset = $('.sphinxsidebar').position()['top'];
var wintop = win.scrollTop();
var winbot = wintop + win.innerHeight();
var curtop = sb.position()['top'];
var curbot = curtop + sbh;
// does sidebar fit in window?
if (sbh < win.innerHeight()) {
// yes: easy case -- always keep at the top
sb.css('top', $u.min([$u.max([0, wintop - offset - 10]),
$(document).height() - sbh - 200]));
} else {
// no: only scroll if top/bottom edge of sidebar is at
// top/bottom edge of window
if (curtop > wintop && curbot > winbot) {
sb.css('top', $u.max([wintop - offset - 10, 0]));
} else if (curtop < wintop && curbot < winbot) {
sb.css('top', $u.min([winbot - sbh - offset - 20,
$(document).height() - sbh - 200]));
}
}
});
</script>
</head><body>
<div class="pageheader">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="install.html">Get it</a></li>
<li><a href="contents.html">Docs</a></li>
<li><a href="http://github.com/prophyle/prophyle">Extend/Develop</a></li>
</ul>
<div>
<a href="index.html">
<!--
<img src="_static/sphinxheader.png" alt="SPHINX" />
-->
<div style="color:white;font-size:240%">ProPhyle</div>
<div style="color:white;font-size:140%">DNA sequence classification</div>
</a>
</div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="classify.html" title="6. Classifying reads"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="standard_db.html" title="4. Building a standard database"
accesskey="P">previous</a> |</li>
<li><a href="index.html">ProPhyle home</a> |</li>
<li><a href="contents.html">Documentation</a> »</li>
<li class="nav-item nav-item-this"><a href=""><span class="section-number">5. </span>Building a custom database</a></li>
</ul>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="Main">
<div class="sphinxsidebarwrapper">
<div>
<h3><a href="contents.html">Table of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">5. Building a custom database</a><ul>
<li><a class="reference internal" href="#building-a-custom-database-with-user-provided-data">5.1. Building a custom database with user-provided data</a></li>
<li><a class="reference internal" href="#building-a-custom-database-with-refseq-genomes-and-the-ncbi-taxonomy">5.2. Building a custom database with RefSeq genomes and the NCBI taxonomy</a><ul>
<li><a class="reference internal" href="#download-sequences-from-ncbi">Download sequences from NCBI</a></li>
<li><a class="reference internal" href="#build-a-tree">Build a tree</a></li>
<li><a class="reference internal" href="#build-the-index">Build the index</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="standard_db.html"
title="previous chapter"><span class="section-number">4. </span>Building a standard database</a></p>
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="classify.html"
title="next chapter"><span class="section-number">6. </span>Classifying reads</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/custom_db.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<search id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
<input type="submit" value="Go" />
</form>
</div>
</search>
<script>document.getElementById('searchbox').style.display = "block"</script>
</div>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="building-a-custom-database">
<span id="custom-db"></span><h1><span class="section-number">5. </span>Building a custom database<a class="headerlink" href="#building-a-custom-database" title="Link to this heading">¶</a></h1>
<nav class="contents local" id="contents">
<ul class="simple">
<li><p><a class="reference internal" href="#building-a-custom-database-with-user-provided-data" id="id2">Building a custom database with user-provided data</a></p></li>
<li><p><a class="reference internal" href="#building-a-custom-database-with-refseq-genomes-and-the-ncbi-taxonomy" id="id3">Building a custom database with RefSeq genomes and the NCBI taxonomy</a></p>
<ul>
<li><p><a class="reference internal" href="#download-sequences-from-ncbi" id="id4">Download sequences from NCBI</a></p></li>
<li><p><a class="reference internal" href="#build-a-tree" id="id5">Build a tree</a></p></li>
<li><p><a class="reference internal" href="#build-the-index" id="id6">Build the index</a></p></li>
</ul>
</li>
</ul>
</nav>
<section id="building-a-custom-database-with-user-provided-data">
<h2><span class="section-number">5.1. </span>Building a custom database with user-provided data<a class="headerlink" href="#building-a-custom-database-with-user-provided-data" title="Link to this heading">¶</a></h2>
<ul>
<li><p>Put all your genomes in a separate directory.</p></li>
<li><p>Create a phylogenetic or taxonomic tree in the Newick format,
with leaves named after the FASTA files (e.g., a leaf named <cite>genome1</cite> corresponds to
<cite>genome1.fa</cite>). Ensure that every leaf has a corresponding FASTA file.</p></li>
<li><dl>
<dt>Build a ProPhyle index with</dt><dd><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>prophyle<span class="w"> </span>index<span class="w"> </span>-A<span class="w"> </span>-k<span class="w"> </span>&lt;kmer_length&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-g<span class="w"> </span><span class="o">[</span>&lt;dir_with_genomes&gt;<span class="o">]</span><span class="w"> </span>&lt;tree_1.nw&gt;<span class="w"> </span>&lt;index_dir&gt;
</pre></div>
</div>
</dd>
</dl>
<p>The <cite>-A</cite> parameter will ensure that your tree gets autocompleted, which includes creating
names for internal nodes.</p>
</li>
</ul>
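<p>Before indexing, it can help to verify that every leaf has its FASTA file. The following is a minimal sketch, not part of ProPhyle: the function names and the <cite>.fa</cite> suffix are assumptions, and the regular expression only handles unquoted Newick labels in a tree with at least one internal node.</p>

```python
import re
import tempfile
from pathlib import Path

# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF_RE = re.compile(r"[(,]\s*([^\s(),:;\[\]]+)")


def leaf_names(newick):
    """Return the leaf labels of a Newick string (unquoted labels only)."""
    return LEAF_RE.findall(newick)


def missing_fastas(newick, genome_dir, suffix=".fa"):
    """Return the leaves that have no matching FASTA file in genome_dir."""
    genome_dir = Path(genome_dir)
    return [name for name in leaf_names(newick)
            if not (genome_dir / (name + suffix)).is_file()]


if __name__ == "__main__":
    # Hypothetical tree with three leaves and one named internal node.
    tree = "((genome1:0.1,genome2:0.2)n1:0.3,genome3:0.4);"
    with tempfile.TemporaryDirectory() as tmp:
        # Create FASTA files for two of the three leaves only.
        for name in ("genome1", "genome2"):
            (Path(tmp) / (name + ".fa")).write_text(">seq\nACGT\n")
        print(missing_fastas(tree, tmp))  # genome3 has no FASTA file
```

<p>An empty result means that every leaf is covered. Trees with quoted labels or other Newick extensions would need a proper parser such as ETE Toolkit.</p>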
</section>
<section id="building-a-custom-database-with-refseq-genomes-and-the-ncbi-taxonomy">
<h2><span class="section-number">5.2. </span>Building a custom database with RefSeq genomes and the NCBI taxonomy<a class="headerlink" href="#building-a-custom-database-with-refseq-genomes-and-the-ncbi-taxonomy" title="Link to this heading">¶</a></h2>
<section id="download-sequences-from-ncbi">
<h3>Download sequences from NCBI<a class="headerlink" href="#download-sequences-from-ncbi" title="Link to this heading">¶</a></h3>
<ul class="simple">
<li><p>Create a subdirectory in ProPhyle’s main directory, and change the working directory to the newly created one.</p></li>
<li><p>Download an <code class="docutils literal notranslate"><span class="pre">assembly_summary.txt</span></code> file from NCBI’s FTP server. For RefSeq’s bacterial genomes, follow this <a class="reference external" href="ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt">link</a>. More information about these files can be found <a class="reference external" href="ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt">here</a>.</p></li>
<li><p>Select the genomes of interest. The following command selects all complete genomes of the latest assembly version; add further conditions to the <code class="docutils literal notranslate"><span class="pre">awk</span></code> expression to select a subset. Fields 1, 6 and 20 contain the accession number, the taxonomic identifier and the FTP directory of each genome, respectively (please refer to <a class="reference external" href="https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq">NCBI’s genomes download FAQ</a> for further information). Since strain taxids are deprecated, we are working on a solution to create artificial nodes for each genome. For the moment, you can associate the sequences with the species node by selecting field 7 (the species taxid) instead of field 6, as the command below does. <code class="docutils literal notranslate"><span class="pre">acc2taxid.tsv</span></code> is therefore a tab-separated file that maps each accession number to a taxid.</p></li>
</ul>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>awk<span class="w"> </span>-F<span class="w"> </span><span class="s2">"\t"</span><span class="w"> </span>-v<span class="w"> </span><span class="nv">OFS</span><span class="o">=</span><span class="s2">"\t"</span><span class="w"> </span><span class="s1">'$12=="Complete Genome" && $11=="latest"\</span>
<span class="s1"> {split($1, a, "."); print a[1], $7, $20}'</span><span class="w"> </span>assembly_summary.txt<span class="w"> </span>>ftpselection.tsv
cut<span class="w"> </span>-f<span class="w"> </span><span class="m">3</span><span class="w"> </span>ftpselection.tsv<span class="w"> </span><span class="p">|</span><span class="w"> </span>awk<span class="w"> </span><span class="s1">'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}\</span>
<span class="s1"> {ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}'</span><span class="w"> </span>>ftpfilepaths.tsv
cut<span class="w"> </span>-f<span class="w"> </span><span class="m">1</span>,2<span class="w"> </span>ftpselection.tsv<span class="w"> </span>>acc2taxid.tsv
</pre></div>
</div>
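<p>To make the field handling of the commands above easier to follow, here is a rough Python equivalent. It is only a sketch: the function name is hypothetical, and the sample row is an illustrative stand-in in which all columns not used by the selection are left empty.</p>

```python
def select_complete_latest(lines):
    """Yield (accession, species_taxid, file_url) for complete, latest genomes."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # awk fields are 1-based: $11 is f[10], $12 is f[11], and so on.
        if len(f) >= 20 and f[11] == "Complete Genome" and f[10] == "latest":
            accession = f[0].split(".")[0]   # drop the version suffix
            ftp_dir = f[19]
            parts = ftp_dir.split("/")
            if len(parts) >= 10:
                asm = parts[9]               # awk's $10 when FS is "/"
                yield accession, f[6], ftp_dir + "/" + asm + "_genomic.fna.gz"


if __name__ == "__main__":
    # Illustrative row with 20 tab-separated columns.
    row = [""] * 20
    row[0] = "GCF_000005845.2"      # assembly accession
    row[5] = "511145"               # taxid
    row[6] = "562"                  # species taxid
    row[10] = "latest"              # version status
    row[11] = "Complete Genome"     # assembly level
    row[19] = ("https://ftp.ncbi.nlm.nih.gov/genomes/all/"
               "GCF/000/005/845/GCF_000005845.2_ASM584v2")
    for record in select_complete_latest(["\t".join(row) + "\n"]):
        print("\t".join(record))
```

<p>The first two values of each record correspond to a line of <code class="docutils literal notranslate"><span class="pre">acc2taxid.tsv</span></code>, and the third to a line of <code class="docutils literal notranslate"><span class="pre">ftpfilepaths.tsv</span></code>.</p>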
<ul class="simple">
<li><p>Download the selected sequences using <code class="docutils literal notranslate"><span class="pre">parallel</span> <span class="pre">--gnu</span> <span class="pre">-j</span> <span class="pre">32</span> <span class="pre">-a</span> <span class="pre">ftpfilepaths.tsv</span> <span class="pre">wget</span></code>. Adjust the number of jobs (the <code class="docutils literal notranslate"><span class="pre">-j</span></code> option) to the number of available cores and the available bandwidth.</p></li>
<li><p>Extract them using <code class="docutils literal notranslate"><span class="pre">parallel</span> <span class="pre">--gnu</span> <span class="pre">-j</span> <span class="pre">32</span> <span class="pre">-a</span> <span class="pre">ftpfilepaths.tsv</span> <span class="pre">gunzip</span> <span class="pre">{/}</span></code></p></li>
<li><p>Create a <a class="reference external" href="http://www.htslib.org/doc/faidx.html">fasta index</a> for each file using <code class="docutils literal notranslate"><span class="pre">find</span> <span class="pre">.</span> <span class="pre">-name</span> <span class="pre">'*.fna'</span> <span class="pre">|</span> <span class="pre">parallel</span> <span class="pre">-j</span> <span class="pre">32</span> <span class="pre">samtools</span> <span class="pre">faidx</span> <span class="pre">{}</span></code></p></li>
</ul>
</section>
<section id="build-a-tree">
<h3>Build a tree<a class="headerlink" href="#build-a-tree" title="Link to this heading">¶</a></h3>
<p>Build a taxonomic tree for the downloaded sequences using:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>prophyle_ncbi_tree.py<span class="w"> </span>&lt;library_name&gt;<span class="w"> </span>&lt;library_main_dir&gt;<span class="se">\</span>
<span class="w"> </span>&lt;output_file&gt;<span class="w"> </span>acc2taxid.tsv<span class="w"> </span>-l<span class="w"> </span>&lt;log_file&gt;
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">library_name</span></code> is the name of the subdirectory where the sequences are downloaded (e.g., bacteria, archaea or viral for the standard databases, or any name of your choice); it is prepended to the filenames in the tree. <code class="docutils literal notranslate"><span class="pre">library_main_dir</span></code> is the directory containing all downloaded sequences, which is usually the parent of <code class="docutils literal notranslate"><span class="pre">library_name</span></code>. Pass this directory to <code class="docutils literal notranslate"><span class="pre">prophyle</span> <span class="pre">index</span></code> through the <code class="docutils literal notranslate"><span class="pre">-g</span></code> parameter at indexing time, unless the tree is placed in that same directory.
Taxonomic identifiers are first assigned to the sequences; the tree is then
built using <a class="reference external" href="http://etetoolkit.org/">ETE Toolkit</a> and saved in Newick format
1. The required node attributes are:</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">name</span></code>: unique node name (for RefSeq DB: the taxid of the node)</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">path</span></code>: paths of the sequences’ fasta files, separated by @ (relative paths from ProPhyle’s home directory)</p></li>
</ul>
<p>Optional attributes are <code class="docutils literal notranslate"><span class="pre">taxid</span></code>, <code class="docutils literal notranslate"><span class="pre">sci_name</span></code>, <code class="docutils literal notranslate"><span class="pre">named_lineage</span></code>, <code class="docutils literal notranslate"><span class="pre">lineage</span></code> and <code class="docutils literal notranslate"><span class="pre">rank</span></code> (more info
<a class="reference external" href="http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#automatic-tree-annotation-using-ncbi-taxonomy">here</a>).
Any tree can be adapted to these requirements, either with the ETE library or by
modifying the <code class="docutils literal notranslate"><span class="pre">prophyle_ncbi_tree.py</span></code> script.</p>
</section>
<section id="build-the-index">
<h3>Build the index<a class="headerlink" href="#build-the-index" title="Link to this heading">¶</a></h3>
<p>Run the standard command to build a ProPhyle index:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>prophyle<span class="w"> </span>index<span class="w"> </span>-k<span class="w"> </span>&lt;kmer_length&gt;<span class="w"> </span><span class="o">[</span>-g<span class="w"> </span>&lt;library_main_dir&gt;<span class="o">]</span><span class="w"> </span>&lt;tree_1.nw&gt;<span class="w"> </span><span class="o">[</span>&lt;tree_2.nw&gt;<span class="w"> </span>...<span class="o">]</span><span class="w"> </span>&lt;index_dir&gt;
</pre></div>
</div>
</section>
</section>
</section>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="classify.html" title="6. Classifying reads"
>next</a> |</li>
<li class="right" >
<a href="standard_db.html" title="4. Building a standard database"
>previous</a> |</li>
<li><a href="index.html">ProPhyle home</a> |</li>
<li><a href="contents.html">Documentation</a> »</li>
<li class="nav-item nav-item-this"><a href=""><span class="section-number">5. </span>Building a custom database</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
© Copyright 2015-2024, Karel Břinda, Kamil Salikhov, Simone Pignotti, Gregory Kucherov.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 8.0.2.
</div>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-112241191-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-112241191-1');
</script>
</body>
</html>