<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body>
<h1>miRDeep2 documentation</h1>
<h3>What is miRDeep2</h3>
miRDeep2 is a software package for identifying novel and known miRNAs in deep sequencing data. It can also be used for miRNA expression profiling across samples. Finally, a new module for preprocessing raw Illumina sequencing data produces files
for downstream analysis with the miRDeep2 or quantifier module. Colorspace sequencing data is currently not supported by the preprocessing module, but support is planned. Preprocessing is performed with the
<a href="#mapper">mapper.pl</a> script. Quantification and expression profiling is done by the
<a href="#quantifier">quantifier.pl</a> script. miRNA identification is done by the
<a href="#mirdeep2">miRDeep2.pl</a> script.
<h2><a href="#File-formats">File formats</a></h2>
<h2>Script Reference:</h2>
<a name="mirdeep2"></a>
<h4>name:</h4>
<a href="#top">miRDeep2.pl</a>
<h4>description:</h4>
Wrapper script for the miRDeep2 program package. It runs all scripts of the miRDeep2 package that are needed to detect microRNAs in deep sequencing data.
<h4>input:</h4>
A fasta file with deep sequencing reads, a fasta file of the corresponding genome, a file of reads mapped to the genome in miRDeep2 arf format, an optional fasta file with known miRNAs of the species being analyzed and an optional fasta file of known miRNAs of related
species.
<h4>output:</h4>
A spreadsheet and an html file with an overview of all miRNAs detected in the deep sequencing input data.
<h4>options:</h4><pre>
-a int minimum read stack height that triggers analysis. Using this option disables
automatic estimation of the optimal value.
-b int minimum score cut-off for predicted novel miRNAs to be displayed in the overview
table. This score cut-off is by default 0.
-c disable randfold analysis
-g int maximum number of precursors to analyze when automatic excision gearing is used.
default=50000, if set to -1 all precursors will be analyzed
-t species species being analyzed - this is used to link to the appropriate UCSC browser
-u output list of UCSC browser species that are supported and exit
-v remove directory with temporary files
-s file File with known miRBase star sequences
-z tag Additional tag appended to current time stamp
-P use this switch if mature_ref_miRNAs contain miRBase v18 identifiers (5p and 3p) instead of previous ids from v17
</pre>
<h4>Examples: </h4>
The miRDeep2 module identifies known and novel miRNAs in deep sequencing data. The output of the mapper module can be directly plugged into the miRDeep2 module.
<h4>Example use 1:</h4>
The user wishes to identify miRNAs in mouse deep sequencing data, using default options. The miRBase_mmu_v14.fa file contains all miRBase mature mouse miRNAs, while the miRBase_rno_v14.fa file contains all the miRBase mature rat miRNAs. The '2>' redirects all
progress output (standard error) to the report.log file.
<pre>miRDeep2.pl reads_collapsed.fa genome.fa reads_collapsed_vs_genome.arf miRBase_mmu_v14.fa miRBase_rno_v14.fa precursors_ref_this_species.fa -t Mouse 2>report.log</pre>
This command will generate a directory with pdfs showing
the structures, read signatures and score breakdowns of novel and known miRNAs in the data; an html webpage that links to all results generated (result.html); a copy of the novel and known miRNAs contained in the webpage in text format, which allows easy
parsing (result.csv); a copy of the performance survey contained in the webpage in text format (survey.csv); and a copy of the miRNA read signatures contained in the pdfs in text format (output.mrd). The ids in the files miRBase_mmu_v14.fa and precursors_ref_this_species.fa
need to be consistent with each other. This is usually not a problem if both files were downloaded from miRBase. Otherwise the quantifier may fail to produce results.
<h4>Example use 2:</h4>
As in example use 1, except that the user has already run quantifier.pl and wants to use this output to get information on the miRNAs not detected by miRDeep2 included in the html webpage. miRBase.mrd is a file generated by quantifier.pl:
<pre>miRDeep2.pl reads_collapsed.fa genome.fa reads_collapsed_vs_genome.arf miRBase_mmu_v14.fa miRBase_rno_v14.fa -t Mouse -q miRBase.mrd 2>report.log</pre>
This command will generate the same type of files as example use 1 above.
<h4>Example use 3:</h4>
The user wishes to identify miRNAs in deep sequencing data from an animal with no related species in miRBase:
<pre>miRDeep2.pl reads_collapsed.fa genome.fa reads_collapsed_vs_genome.arf none none none 2>report.log</pre>
This command will generate the same type of files as example use 1 above. Note that in practice it will always improve miRDeep2 performance if miRNAs from some related species are input, even if that species is not closely related.
<h4>
<pre>
<a name="mapper"></a>
</pre>
</h4>
<h4>name:</h4>
<a href="#top">mapper.pl</a>
<h4>description:</h4>
Processes reads and/or maps them to the reference genome.
<h4>input:</h4>
Default input is a file in fasta, seq.txt or qseq.txt format. More input can be given depending on the options used.
<h4>output:</h4>
The output depends on the options used (see below). Either a fasta file with processed reads or an arf file with mapped reads, or both, are output.
<h4>options:</h4><pre>
Read input file:
-a input file is seq.txt format
-b input file is qseq.txt format
-c input file is fasta format
-e input file is fastq format
-d input file is a config file (see miRDeep2 documentation).
options -a, -b or -c must be given with option -d.
Preprocessing/mapping:
-g three-letter prefix for reads (by default 'seq')
-h parse to fasta format
-i convert rna to dna alphabet (to map against genome)
-j remove all entries that have a sequence that contains letters
other than a,c,g,t,u,n,A,C,G,T,U,N
-k seq clip 3' adapter sequence
-l int discard reads shorter than int nts
-m collapse reads
-p genome map to genome (must be indexed by bowtie-build). The 'genome'
string must be the prefix of the bowtie index. For instance, if
the first indexed file is called 'h_sapiens_37_asm.1.ebwt' then
the prefix is 'h_sapiens_37_asm'.
-q map with one mismatch in the seed (mapping takes longer)
-r int a read is allowed to map up to this number of positions in the genome
default is 5
Output files:
-s file print processed reads to this file
-t file print read mappings to this file
Other:
-u do not remove directory with temporary files
-v outputs progress report
-n overwrite existing files
</pre>
<h4>Examples:</h4>
The mapper module is designed as a tool to process deep sequencing reads and/or map them to the reference genome. The module works in sequence space, and can process or map data that is in sequence fasta format. A number of the functions of the mapper module
are implemented specifically with Solexa/Illumina data in mind. For an example of how to post-process mappings in color space, see example use 5.
<h4>Example use 1:</h4>
The user wishes to parse a file in qseq.txt format to fasta format, convert from RNA to DNA alphabet, remove entries with non-canonical letters (letters other than a,c,g,t,u,n,A,C,G,T,U,N), clip adapters, discard reads shorter than 18 nts and collapse the reads:
<pre>mapper.pl reads_qseq.txt -b -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -s reads_collapsed.fa
</pre>
<h4>Example use 2:</h4>
The user wishes to map a fasta file against the reference genome. The genome has already been indexed by bowtie-build. The first of the indexed files is named genome.1.ebwt:
<pre>mapper.pl reads_collapsed.fa -c -p genome -t reads_collapsed_vs_genome.arf
</pre>
<h4>Example use 3:</h4>
The user wishes to process the reads as in example use 1 and map the reads as in example use 2 in a single step, while observing the progress:
<pre>mapper.pl reads_qseq.txt -b -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -p genome -s reads_collapsed.fa -t reads_collapsed_vs_genome.arf -v
</pre>
<h4>Example use 4:</h4>
The user wishes to parse a GEO file to fasta format and process it as in example use 1. The GEO file is in tabular format, with the first column showing the sequence and the second column showing the read counts:
<pre>geo2fasta.pl GSM.txt > reads.fa
mapper.pl reads.fa -c -h -i -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -s reads_collapsed.fa
</pre>
<h4>Example use 5 (experimental):</h4>
The user has already removed 3' adapters in color space and has mapped the reads against the genome using bwa/bowtie resulting in a sam file. Note that each genome locus to which a read was aligned has to occur in its own line. Otherwise only the first genome locus of each line will be taken! The mapping output file is named mapped.sam. The user wishes to generate the files 'reads_collapsed.fa' and 'reads_collapsed_vs_genome.arf' as input to miRDeep2:
<p></p>
<pre>perl sam_reads_collapse.pl mapped.sam reads_collapsed.fa</pre>
<pre>perl bwa_sam_converter.pl -i mapped.sam -t read_1_to_1.txt -o reads_collapsed_vs_genome.arf</pre>
<p></p>
If read ids are already collapsed and in correct miRDeep2 format (eg. ">ABC_1_x10", see file formats) then the sam file just needs to be converted:
<p></p>
<pre>perl bwa_sam_converter.pl -i mapped.sam -o reads_collapsed_vs_genome.arf</pre>
<h4>Example use 6:</h4>
The user has sequencing data from different samples e.g. different cell-types. A config.txt file has to be created in which each line designates file locations and a unique 3 letter code. For instance:
<pre>sequencing_data_sample1.fa sd1
sequencing_data_sample2.fa sd2
sequencing_data_sample3.fa sd3
</pre>
The user then wishes to pool these files and use the generated files reads.fa and reads_vs_genome.arf for the miRDeep2 analysis.
<pre>mapper.pl config.txt -d -c -i -j -l 18 -m -p genome_index -s reads.fa -t reads_vs_genome.arf
</pre>
Since reads_vs_genome.arf still contains the 3 letter code for each read mapped to the genome, the user can later dissect the contribution of the different samples to a predicted or known miRNA. The codes can also be used, for example, to define 'high confidence' predictions by filtering the results for miRNAs that have sequencing evidence from at least two samples.
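The two-sample filter described above can be sketched in Python. This is only an illustration and not part of the miRDeep2 package: it assumes read ids that start with the 3 letter sample code followed by an underscore (e.g. sd1_12_x5), and the function names and the reads_by_mirna mapping are hypothetical.

```python
def sample_codes(read_ids):
    """Return the set of sample codes found in a list of read ids."""
    # Assumption: the sample code is the part of the id before the first '_'.
    return {read_id.split("_", 1)[0] for read_id in read_ids}


def high_confidence(reads_by_mirna, min_samples=2):
    """Return the miRNAs supported by reads from at least min_samples samples.

    reads_by_mirna is a hypothetical dict: miRNA id -> list of supporting read ids.
    """
    return sorted(
        mirna
        for mirna, read_ids in reads_by_mirna.items()
        if len(sample_codes(read_ids)) >= min_samples
    )
```

How the supporting read ids are collected for each miRNA (for instance from the output.mrd file) is left open here.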
<h4>
<pre>
<a name="quantifier"></a>
</pre>
</h4>
<h4>name:</h4>
<a href="#top">quantifier.pl</a>
<h4>description:</h4>
The module maps the deep sequencing reads to predefined miRNA precursors and thereby determines the expression of the corresponding miRNAs. First, the predefined mature miRNA sequences are mapped to the predefined precursors. Optionally, predefined star sequences
can be mapped to the precursors too. This determines the positions of the mature and star sequences in the precursors. Second, the deep sequencing reads are mapped to the precursors. The number of reads falling into an interval from 2 nt upstream to 5 nt downstream of the
mature/star sequence is determined.
<h4>input:</h4>
A fasta file with precursor sequences, a fasta file with mature miRNA sequences, a fasta file with deep sequencing reads and optionally a fasta file with star sequences and the 3 letter code of the species of interest
<h4>output:</h4><pre>
A tab separated file called miRNAs_expressed_all_samples.csv with miRNA identifiers and their read counts, a signature file called miRBase.mrd, a file called expression.html that gives an overview of all miRNAs in the input data and a directory called pdfs that contains
for each miRNA a pdf file showing its signature and structure.</pre>
<h4>options:</h4><pre>
[mandatory parameters]
-u list all values allowed for the species parameter that have an entry at UCSC
-p precursor.fa miRNA precursor sequences from miRBase
-m mature.fa miRNA sequences from miRBase
-r reads.fa your read sequences
[optional parameters]
-c [file] config.txt file with different sample ids... or just the one sample id
-s [star.fa] optional star sequences from miRBase
-t [species] e.g. Mouse or mmu
if not searching in a specific species all species in your files will be analyzed
else only the species in your dataset is considered
-y [time] optional otherwise its generating a new one
-d if parameter given pdfs will not be generated, otherwise pdfs will be generated
-o if parameter is given reads were not sorted by sample in pdf file, default is sorting
-k also considers precursor-mature mappings that have different ids, eg let7c
would be allowed to map to pre-let7a
-n do not do file conversion again
-x do not do mapping against precursor again
-g [int] number of allowed mismatches when mapping reads to precursors, default 1
-e [int] number of nucleotides upstream of the mature sequence to consider, default 2
-f [int] number of nucleotides downstream of the mature sequence to consider, default 5
-j do not create an output.mrd file and pdfs if specified
-w considers the whole precursor as the 'mature sequence'
-U discard all read multimapper
</pre>
<h4>example usage:</h4>
To quantify C. elegans miRNAs, run the command:
<pre>quantifier.pl -p precursors.fa -m mature.fa -r reads.fa -s star.fa -y now -t cel
</pre>
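The counting window used by the quantifier (by default 2 nt upstream and 5 nt downstream of the mature/star sequence, see options -e and -f) can be illustrated with a short Python sketch. This is not the quantifier.pl implementation. It assumes 1-based inclusive precursor coordinates, that a read is counted only if it lies completely inside the extended interval, and that collapsed read ids carry the '_x' count suffix; all function names are hypothetical.

```python
import re

# Collapsed read ids end in '_x<count>', e.g. 'seq_1_x10'.
COUNT_SUFFIX = re.compile(r"_x(\d+)$")


def read_count(read_id):
    """Number of reads a collapsed id represents (1 if there is no suffix)."""
    match = COUNT_SUFFIX.search(read_id)
    return int(match.group(1)) if match else 1


def count_in_window(mature_start, mature_end, mappings, upstream=2, downstream=5):
    """Sum the read counts of mappings inside the extended mature interval.

    mappings: iterable of (read_id, start, end) in precursor coordinates.
    Assumption: a read counts only if it is fully contained in the window.
    """
    window_start = mature_start - upstream
    window_end = mature_end + downstream
    return sum(
        read_count(read_id)
        for read_id, start, end in mappings
        if start >= window_start and end <= window_end
    )
```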
<h4></h4>
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
make_html.pl
<h4>description:</h4>
It creates a file called result.html that gives an overview of the miRNAs detected by miRDeep2 (known and novel ones). The html file lists each detected miRNA and provides, among other things, information on its miRDeep2 score, the reads mapped to its mature, loop and star
sequence, and the mature, star and consensus precursor sequences themselves. It also provides links to BLAST, BLAT, miRBase (for miRBase miRNAs) and to a pdf file that shows the signature and structure.
<h4>input:</h4>
A miRDeep2 output.mrd file, a miRDeep2 survey.csv file
<h4>output:</h4>
A result.html file with an entry for each provisional miRNA that contains information about its assigned id, miRDeep2 score, estimated probability that the miRNA candidate is a true positive, rfam alert, total read count, mature read count, loop read count,
star read count, significant randfold p-value, miRBase miRNA, example miRBase miRNA with the same seed, BLAT, BLAST, consensus mature sequence, consensus star sequence and consensus precursor sequence. Furthermore, the miRBase miRNAs present in the input
data but not scored by miRDeep2 are listed. A directory called pdfs that contains, for each provisional miRNA id, a pdf with its signature and structure. A file called result.csv (when option -c is used) that contains the same entries as the html file.
<h4>options:</h4><pre>
-v int only output hairpins with score above int
-c also create overview in excel format.
-k file supply file with known miRNAs
-s file supply survey file if score cutoff is used to get information about how big is the confidence of resulting reads
-f file miRDeep2 output mrd file
-e report complete survey file
-g report survey for current score cutoff
-r file Rfam file to check for already reported small RNA sequences
-q file miRBase.mrd file produced by quantifier module
-x file signature.arf file with mapped reads to precursors
-t org specify the organism from which your sequencing data was obtained
-u print all available UCSC input organisms
-d do not generate pdfs
-y timestamp
-z switch is automatically used when script is called by quantifier.pl
-o print reads in pdf signature sorted by their 3 letter code in front of their identifier
-a print genomic coordinates of mature sequence
-b supply confidence file
</pre>
<h4>example usage:</h4>
<pre>perl make_html.pl -f miRDeep_outfile -s survey.csv -c -e -y 123456789</pre>
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
clip_adapters.pl
<h4>description:</h4>
Removes 3' end adapters from deep sequenced small RNAs. The script searches for occurrences of the first six nucleotides of the adapter in the read sequence, starting after position 18 in the read sequence (so the shortest clipped read will be 18 nts). If no
match to the first six nts of the adapter is identified in a read, the 3' end of the read is searched for shorter matches to the first 5 to 1 nts of the adapter.
<h4>input:</h4>
A fasta file with the deep sequencing reads and the adapter sequence (both in RNA or DNA alphabet).
<h4>output:</h4>
A fasta file with the clipped reads. Fasta ids are retained. If no matches to the adapter prefixes are identified in a given read, the unclipped read is output.
<h4>options:</h4>
-
<h4>example usage:</h4>
<pre>clip_adapters.pl reads.fa TCGTATGCCGTCTTCTGCTTGT > reads_clipped.fa</pre>
<h4>notes:</h4>
It is possible to clip adapters using more sophisticated methods. Users are encouraged to test other methods with the miRDeep2 modules.
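The clipping rule described for this script can be sketched in Python as follows. This is an illustration of the described behaviour and not a port of clip_adapters.pl: the function name is hypothetical, and it assumes that read and adapter use the same alphabet and case and that the longest matching prefix is tried first at the 3' end.

```python
def clip_adapter(read, adapter, min_length=18, seed_length=6):
    """Clip a 3' adapter from a read, following the rule described above."""
    seed = adapter[:seed_length]
    # Search for the first six adapter nts, starting after position 18,
    # so that a clipped read is never shorter than min_length.
    position = read.find(seed, min_length)
    if position != -1:
        return read[:position]
    # Otherwise look for shorter adapter prefixes (5 down to 1 nts)
    # at the very 3' end of the read.
    for prefix_length in range(seed_length - 1, 0, -1):
        if read.endswith(adapter[:prefix_length]):
            return read[:-prefix_length]
    # No match at all: the unclipped read is output.
    return read
```

For example, with the adapter TCGTATGCCGTCTTCTGCTTGT, a read consisting of 20 A's followed by TCGTATGCCG is clipped back to the 20 A's.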
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
collapse_reads.pl
<h4>description:</h4>
Collapses reads in the fasta file to ensure that each sequence only occurs once. To indicate how many reads the sequence represents, a suffix is added to each fasta identifier. E.g. a sequence that represents ten reads in the data will have the '_x10'
suffix added to the identifier.
<h4>input:</h4>
A fasta file, either in standard format or in the collapsed suffix format.
<h4>output:</h4>
A fasta file in the collapsed suffix format.
<h4>options:</h4><pre>
-a outputs progress
</pre>
<h4>example usage:</h4>
<pre>collapse_reads.pl reads.fa > reads_collapsed</pre>
<h4>notes:</h4>
Since the script reads all fasta entries into a hash using the sequence as key, it can potentially use more than 3 GB memory when collapsing very big datasets, >50 million reads. In this case, the user can partition the reads (for instance based on the 5' nucleotide),
collapse separately and concatenate.
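The collapsing step can be sketched in Python. This is a minimal illustration and not the collapse_reads.pl code: the 'seq' prefix, the running index in the new ids and the ordering by read count are assumptions made for the example. Like the script, it keeps one dictionary entry per distinct sequence, which is why memory use grows with the number of distinct sequences.

```python
import re
from collections import Counter

# Suffix added to collapsed ids, e.g. '_x10' for ten reads.
COUNT_SUFFIX = re.compile(r"_x(\d+)$")


def collapse_reads(entries, prefix="seq"):
    """Collapse (identifier, sequence) pairs so that each sequence occurs once.

    Identifiers that already carry an '_x<count>' suffix contribute that count;
    all other identifiers count as a single read.
    """
    counts = Counter()
    for identifier, sequence in entries:
        match = COUNT_SUFFIX.search(identifier)
        counts[sequence] += int(match.group(1)) if match else 1
    # Assumption: output is ordered by descending read count.
    return [
        (f"{prefix}_{index}_x{count}", sequence)
        for index, (sequence, count) in enumerate(counts.most_common())
    ]
```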
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
excise_precursors_iterative.pl
<h4>description:</h4>
This script is a wrapper for excise_precursors.pl, which it calls one or more times, incrementing the height of the read stack required for initiating excision until the number of excised precursors falls below a given threshold.
<h4>input:</h4>
The reference genome in fasta format, the mapped reads in .arf format, a file to which the excised precursors will be written and the maximal number of precursors that should be reported.
<h4>output:</h4>
The excised precursors in fasta format.
<h4>options:</h4><pre>
-a Output progress to screen
</pre>
<h4>example usage:</h4>
<pre>excise_precursors_iterative.pl genome.fa reads_vs_genome.arf potential_precursors.fa 50000 -a</pre>
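The iteration performed by this wrapper can be sketched in Python. The sketch is hypothetical: excise stands for any callable that returns the precursors excised at a given read stack height (in the package this role is played by excise_precursors.pl), and the starting height and the safety limit are assumptions.

```python
def excise_iteratively(excise, max_precursors, start_height=1, max_height=1000):
    """Raise the read stack height until fewer than max_precursors are excised.

    excise: hypothetical callable, height -> list of excised precursors.
    Returns the height that was finally used and the precursors it produced.
    """
    for height in range(start_height, max_height + 1):
        precursors = excise(height)
        if len(precursors) < max_precursors:
            return height, precursors
    raise RuntimeError("precursor count never fell below the threshold")
```

Because a higher stack height is a stricter requirement, the number of excised precursors can only stay the same or shrink from one round to the next.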
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
excise_precursors.pl
<h4>description:</h4>
Excises precursors from the genome using the mapped reads as guidelines.
<h4>input:</h4>
The reference genome in fasta format and the mapped reads in .arf format.
<h4>output:</h4>
The excised precursors in fasta format.
<h4>options:</h4><pre>
-a integer Only excise if the highest local read stack is 'integer' reads high (default 2).
-b Output progress to screen
</pre>
<h4>example usage:</h4>
<pre>excise_precursors.pl genome.fa reads_vs_genome.arf -b
</pre>
<pre>
</pre>
<h4>name:</h4>
fastaparse.pl
<h4>description:</h4>
Performs simple filtering of entries in a fasta file.
<h4>input:</h4>
A fasta file
<h4>output:</h4>
A filtered fasta file
<h4>options:</h4><pre>
-a int only output entries where the sequence is minimum int nts long
-b remove all entries that have a sequence that contains letters other than
a,c,g,t,u,n,A,C,G,T,U,N.
-s output progress
</pre>
<h4>example usage:</h4>
<pre>fastaparse.pl reads.fa -a 18 -s > reads_no_short.fa</pre>
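The two filters offered by this script (options -a and -b) can be sketched in Python. This is an illustration only; the function name and the representation of fasta entries as (identifier, sequence) pairs are assumptions.

```python
# Letters accepted by the -b filter described above.
ALLOWED_LETTERS = set("acgtunACGTUN")


def filter_fasta(entries, min_length=0, canonical_only=False):
    """Filter (identifier, sequence) pairs by length and alphabet."""
    kept = []
    for identifier, sequence in entries:
        if len(sequence) < min_length:
            continue  # corresponds to option -a
        if canonical_only and not set(sequence) <= ALLOWED_LETTERS:
            continue  # corresponds to option -b
        kept.append((identifier, sequence))
    return kept
```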
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
fastaselect.pl
<h4>description:</h4>
This script only prints out the fasta entries that match an id in the id file.
<h4>input:</h4>
A fastafile and a file with ids, one id per line.
<h4>output:</h4>
A fastafile containing the fasta entries that match an id.
<h4>options:</h4><pre>
-a only print entries whose id is not present in the id file.</pre>
<h4>example usage:</h4>
<pre>fastaselect.pl reads.fa reads_select.ids > reads_select.fa</pre>
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
find_read_count.pl
<h4>description:</h4>
Scans a file searching for the suffixes that are generated by collapse_reads.pl (e.g. '_x10'). It sums up the integer values in the suffixes and outputs the sum. If a given id occurs multiple times in the file, it will multi-count the integer value of the id.
It will also only count the first integer occurrence in a given line.
<h4>input:</h4>
Any file containing the suffixes that are generated by collapse_reads.pl. This will typically be a fasta file or a list of ids.
<h4>output:</h4>
The sum of integer values (the total read count).
<h4>options:</h4><pre>
-</pre>
<h4>example usage:</h4>
<pre>find_read_count.pl reads_collapsed.fa</pre>
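The counting rule described above (only the first suffix on each line is used, and repeated ids are counted again) can be sketched in Python. The function name is hypothetical and the sketch is not the find_read_count.pl code.

```python
import re

# Suffix generated by collapse_reads.pl, e.g. '_x10'.
COUNT_SUFFIX = re.compile(r"_x(\d+)")


def total_read_count(lines):
    """Sum the first '_x<integer>' suffix found on each line."""
    total = 0
    for line in lines:
        match = COUNT_SUFFIX.search(line)  # first occurrence in the line only
        if match:
            total += int(match.group(1))
    return total
```

Applied to the lines of a collapsed fasta file, only the identifier lines contribute, since sequence lines contain no suffix.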
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
geo2fasta.pl
<h4>description:</h4>
Parses GSM format files into fasta format.
<h4>input:</h4>
GSM files in tabular format. The first column should be sequences and the second column the number of times the sequence occurs in the data.
<h4>output:</h4>
A fasta file, one sequence per line (the sequences are expanded).
<h4>options:</h4><pre>
-</pre>
<h4>example usage:</h4>
<pre>geo2fasta.pl GSM.txt > reads.fa</pre>
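The expansion of a tabular GSM file into fasta entries can be sketched in Python. This is an illustration and not the geo2fasta.pl code: the 'seq' prefix and running number used for the identifiers, and the skipping of lines whose second column is not an integer, are assumptions.

```python
def geo_to_fasta(lines, prefix="seq"):
    """Expand 'sequence<whitespace>count' rows into one fasta entry per read."""
    records = []
    index = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skip blank lines and header rows
        sequence, count = fields[0], int(fields[1])
        for _ in range(count):
            records.append(f">{prefix}_{index}\n{sequence}")
            index += 1
    return "\n".join(records)
```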
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
illumina_to_fasta.pl
<h4>description:</h4>
Parses seq.txt or qseq.txt output from the Solexa/Illumina platform to fasta format.
<h4>input:</h4>
A seq.txt or qseq.txt file. By default seq.txt.
<h4>output:</h4>
A fasta file, one entry for each line of seq.txt. The entries are named 'seq' plus a running number that is incremented by one for each entry. Any '.' characters in the seq.txt file are substituted with an 'N'.
<h4>options:</h4><pre>
-a format is qseq.txt</pre>
<h4>example usage:</h4>
<pre>illumina_to_fasta.pl s_1.qseq.txt -a > reads.fa</pre>
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
miRDeep2_core_algorithm.pl
<h4>description:</h4>
For each potential miRNA precursor input, the miRDeep2 core algorithm either discards it or assigns it a log-odds score that reflects the probability that the precursor is a genuine miRNA.
<h4>input:</h4>
Default input is an arf file with the read signatures and an RNAfold output file with the structures of the potential miRNA precursors.
<h4>output:</h4>
A .mrd file with all potential miRNA precursors that are scored.
<h4>options:</h4><pre>
-h print this usage
-s fasta file with reference mature miRNAs from one or more related species
-t print filtered
-u limited output (only ids)
-v cut-off (default 1)
-x sensitive option for Sanger sequences
-y file with randfold p-values
-z consider Drosha processing
</pre>
<h4>example usage:</h4>
<pre>miRDeep2_core_algorithm.pl signature.arf potential_precursors.str -s miRBase_related_species.fa -y potential_precursors.rand > output.mrd</pre>
<h4>notes:</h4>
The -z option has not been thoroughly tested.
<h4>
<pre>
</pre>
</h4>
<h4>name:</h4>
parse_mappings.pl
<h4>description:</h4>
Performs simple filtering of entries in an .arf file.
<h4>input:</h4>
Default input is an .arf file.
<h4>output:</h4>
A filtered .arf file.
<h4>options:</h4><pre>
-a int Discard mappings of edit distance higher than this
-b int Discard mappings of read queries shorter than this
-c int Discard mappings of read queries longer than this
-d file Discard read queries not in this file
-e file Discard read queries in this file
-f file Discard reference dbs not in this file
-g file Discard reference dbs in this file
-h Discard remaining suboptimal mappings
-i int Discard remaining suboptimal mappings and discard any reads that have more remaining mappings
than this
-j Remove any unmatched nts in the very 3' end
-k Output progress to standard output
</pre>
<h4>example usage:</h4>
<pre>parse_mappings.pl reads_vs_genome.arf -a 0 -b 18 -c 25 -i 5 > reads_vs_genome_parsed.arf</pre>
<pre>
</pre>
<h4>name:</h4>
perform_controls.pl
<h4>description:</h4>
Performs a designated number of rounds of permuted controls (for details, see Friedlaender et al., Nature Biotechnology, 2008).
<h4>input:</h4>
The permutation controls estimate the number of false positives produced by a miRDeep2_core_algorithm.pl run. The input to perform_controls.pl should be a file containing the exact command line used to initiate the miRDeep2_core_algorithm.pl run, the structure
file input to miRDeep2_core_algorithm.pl and the desired rounds of controls.
<h4>output:</h4>
A file in .mrd format. The output of each control run is separated by a line 'permutation integer'. The mean number of entries output by the control runs gives an estimate of the false positives produced. The further contents (besides the number of entries)
of the .mrd output by perform_controls.pl are not biologically meaningful.
<h4>options:</h4><pre>
-a Output progress to screen
</pre>
<h4>example usage:</h4>
<pre>perform_controls.pl line potential_precursors.str 100 > output_controls.mrd</pre>
<pre>
</pre>
<h4>name:</h4>
permute_structure.pl
<h4>description:</h4>
In a file output by RNAfold, each entry can be partitioned into an 'id' part and an 'other' part, consisting of the dot-bracket structure, sequence, mfe etc. This script reads all 'id' parts into a hash and pairs them with 'other' parts from random entries.
This is used by the perform_controls.pl script.
<h4>input:</h4>
An RNAfold output file.
<h4>output:</h4>
An RNAfold output file with ids moved to random entries.
<h4>options:</h4><pre>
-
</pre>
<h4>example usage:</h4>
<pre>permute_structure.pl potential_precursors.str > potential_precursors_permuted.str</pre>
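The permutation described above can be sketched in Python. This is an illustration only: entries are represented as (id, other) pairs, parsing of the RNAfold file is left out, and the function name and the seed parameter are hypothetical.

```python
import random


def permute_structure(entries, seed=None):
    """Pair every id with the 'other' part of a randomly chosen entry.

    entries: list of (identifier, other) pairs, where 'other' holds the
    sequence, dot-bracket structure, mfe etc. of one RNAfold entry.
    """
    rng = random.Random(seed)
    identifiers = [identifier for identifier, _ in entries]
    others = [other for _, other in entries]
    rng.shuffle(others)  # each 'other' part is reused exactly once
    return list(zip(identifiers, others))
```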
<pre>
</pre>
<h4>name:</h4>
prepare_signature.pl
<h4>description:</h4>
Prepares the signature file to be input to the miRDeep2_core_algorithm.pl script.
<h4>input:</h4>
A fasta file with deep sequencing reads and a fasta file with precursors.
<h4>output:</h4>
A signature file in .arf format.
<h4>options:</h4><pre>
-a file Fasta file with the sequences of known mature miRNAs for the species. These sequences will not
influence the miRDeep scoring, but will subsequently make it easy to estimate sensitivity of the run.
-b Output progress to screen
</pre>
<h4>example usage:</h4>
<pre>prepare_signature.pl reads_collapsed.fa potential_precursors.fa -a miRBase_this_species.fa > signature.arf</pre>
<pre>
</pre>
<h4>name:</h4>
rna2dna.pl
<h4>description:</h4>
Substitutes 'u's and 'U's with 'T's. This is useful since bowtie does not match 'U's to 'T's.
<h4>input:</h4>
A fasta file.
<h4>output:</h4>
A substituted fasta file.
<h4>options:</h4><pre>
-</pre>
<h4>example usage:</h4>
<pre>rna2dna.pl reads_RNA_alphabet.fa > reads_DNA_alphabet.fa</pre>
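The substitution can be sketched in Python. This is an illustration and not the rna2dna.pl code; leaving identifier lines (those starting with '>') untouched is an assumption made for the example.

```python
# Both 'u' and 'U' are replaced by 'T', as described above.
U_TO_T = str.maketrans("uU", "TT")


def rna_to_dna(fasta_text):
    """Replace u/U with T in the sequence lines of a fasta text."""
    converted = []
    for line in fasta_text.splitlines():
        # Assumption: identifier lines are kept unchanged.
        converted.append(line if line.startswith(">") else line.translate(U_TO_T))
    return "\n".join(converted)
```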
<pre>
</pre>
<h4>name:</h4>
select_for_randfold.pl
<h4>description:</h4>
This script identifies potential precursors whose structure is basically consistent with Dicer recognition. Since running randfold is time-consuming, it is practical only to estimate p-values for those potential precursors that actually fold into hairpin structures.
<h4>input:</h4>
An arf file with the read signatures and an RNAfold output file with the structures of the potential miRNA precursors.
<h4>output:</h4>
A list of ids, separated by newlines.
<h4>options:</h4><pre>
-</pre>
<h4>example usage:</h4>
<pre>select_for_randfold.pl signature.arf potential_precursors.str > potential_precursors_for_randfold.ids</pre>
<pre>
</pre>
<h4>name:</h4>
survey.pl
<h4>description:</h4>
Surveys miRDeep2 performance at score cut-offs from -10 to 10.
<h4>input:</h4>
Default input is a .mrd file output by the miRDeep2_core_algorithm.pl script.
<h4>output:</h4>
A .csv file with performance statistics.
<h4>options:</h4><pre>
-a file file outputted by controls
-b file mature miRNA fasta reference file for the species
-c file signature file
-d int read stack height necessary for triggering excision
</pre>
<h4>example usage:</h4>
<pre>survey.pl output.mrd -a output_controls.mrd -b miRBase_this_species.fa -c signature.arf -d 2 > survey.csv</pre>
<pre>
</pre>
<h4>name:</h4>
convert_bowtie_output.pl
<h4>description:</h4>
It converts a bowtie 'bwt' mapping file to a mirdeep 'arf' file.
<h4>input:</h4>
A file in 'bwt' format.
<h4>output:</h4>
A file in mirdeep 'arf' format.
<h4>options:</h4><pre>
-
notes:
-
</pre>
<pre>
</pre>
<h4>name:</h4>
bwa_sam_converter.pl
<h4>description:</h4>
It converts a bwa 'sam' mapping file to a mirdeep 'arf' file.
<h4>input:</h4>
A bwa created file in 'sam' format.
<h4>output:</h4>
A file in mirdeep 'arf' format.
<h4>options:</h4><pre>
-
notes:
-
</pre>
<pre>
</pre>
<h4>name:</h4>
samFLAGinfo.pl
<h4>description:</h4>
It gives information about the bwa FLAG in a bwa created mapping file in 'sam' format.
<h4>input:</h4>
A FLAG number created by bwa.
<h4>output:</h4>
Information about the alignment created by bwa.
<h4>options:</h4><pre>
-
notes:
-
</pre>
<pre>
</pre>
<h4>name:</h4>
clip_adapters.pl
<h4>description:</h4>
Removes 3' end adaptors from deep sequenced small RNAs. The script searches for occurrences of the six first nucleotides of the adapter in the read sequence, starting after position 18 in the read sequence (so the shortest clipped read will be 18 nts). If no
matches to the first six nts of the adapter are identified in a read, the 3' end of the read is searched for shorter matches to the 5 to 1 first nts of the adapter.
<h4>input:</h4>
A fasta file with the deep sequencing reads and the adapter sequence (both in RNA or DNA alphabet).
<h4>output:</h4>
A fasta file with the clipped reads. Fasta ids are retained. If no matches to the adapter prefixes are identified in a given read, the unclipped read is output.
<h4>options:</h4><pre>
-
</pre>
<h4>example usage:</h4>
<pre>clip_adapters.pl reads.fa TCGTATGCCGTCTTCTGCTTGT > reads_clipped.fa</pre>
<h4>notes:</h4>
It is possible to clip adapters using more sophisticated methods. Users are encouraged to test other methods with the miRDeep2 modules.
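The two-step search described for clip_adapters.pl can be sketched in Python as follows. This is a minimal illustration written from the description above, not a port of the script. It assumes that the read and the adapter use the same alphabet and case, that the first occurrence of the six-nt prefix is used when there are several, and that the fallback search at the 3' end is not restricted to reads longer than 18 nts. The name clip_read and its parameters are hypothetical.

```python
def clip_read(read, adapter, min_len=18, seed_len=6):
    """Clip a 3' adapter from one read using the two-step prefix search."""
    seed = adapter[:seed_len]

    # Step 1: look for the first six adapter nts, starting after position 18,
    # so that a clipped read is never shorter than 18 nts.
    hit = read.find(seed, min_len)
    if hit != -1:
        return read[:hit]

    # Step 2: look for progressively shorter adapter prefixes (5 down to 1 nt)
    # at the very 3' end of the read.
    for k in range(seed_len - 1, 0, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]

    # No match to any adapter prefix: the unclipped read is returned.
    return read


if __name__ == "__main__":
    adapter = "TCGTATGCCGTCTTCTGCTTGT"
    print(clip_read("ACGTACGTACGTACGTACGTAC" + "TCGTATGCC", adapter))
```

Note that in step 2 a single-nt match is accepted, so a read that merely ends in the first nt of the adapter loses its last base under this reading of the description.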
<pre>
</pre>
<h4>name:</h4>
sanity_check_genome.pl
<h4>description:</h4>
It checks the supplied genome fasta file for correctness. Identifier lines must not contain whitespace and must be unique. Sequence lines must not contain characters other than A, C, G, T, N, a, c, g, t or n.
<h4>input:</h4>
A genome file in fasta format
<h4>output:</h4>
-
<h4>options:</h4><pre>
-
notes:
-
</pre>
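The rules listed for the genome file can be expressed as a short Python check. This is a sketch under assumptions: the function name check_genome_fasta, the wording of the messages and the choice to collect all problems instead of stopping at the first one are illustrative and not taken from sanity_check_genome.pl.

```python
import re

# Sequence lines may only contain these characters (empty lines are tolerated here).
VALID_SEQ = re.compile(r"[ACGTNacgtn]*")


def check_genome_fasta(lines):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            identifier = line[1:]
            if re.search(r"\s", identifier):
                problems.append(f"line {number}: identifier contains whitespace")
            if identifier in seen_ids:
                problems.append(f"line {number}: duplicate identifier")
            seen_ids.add(identifier)
        elif not VALID_SEQ.fullmatch(line):
            problems.append(f"line {number}: invalid sequence character")
    return problems


if __name__ == "__main__":
    # Hypothetical usage on an in-memory example.
    print(check_genome_fasta([">chr1\n", "ACGTNacgtn\n"]))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large genome does not have to be read into memory at once.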
<pre>
</pre>
<h4>name:</h4>
sanity_check_mapping_file.pl
<h4>description:</h4>
It checks the supplied mapping file for correctness. Each line in the file must be in the arf format.
<h4>input:</h4>
A mapping file in arf format
<h4>output:</h4>
-
<h4>options:</h4><pre>
-
notes:
-
</pre>
<pre>
</pre>
<h4>name:</h4>
sanity_check_mature_ref.pl
<h4>description:</h4>
It checks the supplied mature miRNA fasta file for correctness. Identifier lines must not contain whitespace and must be unique. Sequence lines must not contain characters other than A, C, G, T, N, a, c, g, t or n.
<h4>input:</h4>
A mature miRNA file in fasta format
<h4>output:</h4>
-
<h4>options:</h4><pre>
-
notes:
-
</pre>
<pre>
</pre>
<h4>name:</h4>
sanity_check_reads_ready.pl
<h4>description:</h4>
It checks the supplied reads file for correctness. Each identifier line must have the format 'name_uniqueNumber_xNumber', e.g. >xyz_1_x20. See also the file format_descriptions.txt for more detailed information.
<h4>input:</h4>
A reads file in fasta format
<h4>output:</h4>
-
<h4>options:</h4><pre>
-
notes:
-</pre>
<pre>
</pre>
<a name="File-formats"></a>
<h3><a href="#top">File Formats</a></h3>
<h4>.fa </h4>
The fasta files that contain sequencing reads used by miRDeep2 are ordinary fasta files with a predefined identifier format. The identifier comprises three values separated by underscores. The first value is a three-letter code intended as a tag for the sample a read comes from. The second value is a running number that ensures identifiers are uniquely assigned to sequences from the same sample. The third value starts with an 'x' followed by an integer that indicates how many times the read sequence occurs in the sample. The sequence in a fasta file supplied to miRDeep2 must not contain characters other than A, C, G, T and N. If the id line or the sequence line does not follow these conventions, miRDeep2 will abort with a warning
message. Example entry from a fasta file that can be supplied to miRDeep2:
<pre>>PAN_123456_x969696
ATACAATCTACTGTCTTTCCT
</pre>
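The identifier convention above can be checked and decomposed with a regular expression. The Python sketch below is illustrative only: the function name parse_read_entry is hypothetical, the pattern reads "three letter code" as exactly three ASCII letters, and only uppercase sequence characters are accepted, which may be stricter or looser than what miRDeep2 itself enforces.

```python
import re

# ">" + three-letter sample tag + "_" + running number + "_x" + read count
READ_ID = re.compile(r"^>([A-Za-z]{3})_(\d+)_x(\d+)$")
READ_SEQ = re.compile(r"^[ACGTN]+$")


def parse_read_entry(id_line, seq_line):
    """Return (sample_tag, running_number, read_count, sequence) or raise ValueError."""
    match = READ_ID.match(id_line.strip())
    if match is None:
        raise ValueError(f"malformed identifier line: {id_line!r}")
    sequence = seq_line.strip()
    if READ_SEQ.match(sequence) is None:
        raise ValueError(f"invalid sequence line: {seq_line!r}")
    tag, number, count = match.groups()
    return tag, int(number), int(count), sequence


if __name__ == "__main__":
    print(parse_read_entry(">PAN_123456_x969696", "ATACAATCTACTGTCTTTCCT"))
```

For the example entry, the sample tag is PAN, the running number is 123456 and the read count is 969696.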
<h4>.arf</h4>
The arf format is a miRDeep2-specific file format generated and processed by miRDeep2. It contains information about reads mapped to a reference genome. Each line in such a file contains 13 columns. Example line:
<pre>#1 2 3 4 5 6 7 8 9 10 11 12 13
PAN_123456_x969696 21 1 21 ATACAATCTACTGTCTTTCCT chr22 21 46508682 46508702 ATACAATCTACTGTCTTTCCT + 0 mmmmmmmmmmmmmmmmmmmmm
</pre>
<ol>
<li>read identifier</li>
<li>length of the read sequence</li>
<li>start position of the mapped part within the read sequence</li>
<li>end position of the mapped part within the read sequence</li>
<li>read sequence</li>
<li>identifier of the genome part to which the read is mapped. This is either a scaffold id or a chromosome name</li>
<li>length of the genome sequence the read is mapped to</li>
<li>start position in the genome where the read is mapped</li>
<li>end position in the genome where the read is mapped</li>
<li>genome sequence to which the read is mapped</li>
<li>genome strand information. Plus means the read is aligned to the sense strand of the genome. Minus means it is aligned to the antisense strand of the genome.</li>
<li>number of mismatches in the read mapping</li>
<li>edit string that indicates matches by lowercase 'm' and mismatches by uppercase 'M'</li>
</ol>
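To make the column layout concrete, the following Python sketch splits one arf line into the 13 named fields listed above. It is an illustration under assumptions: the names ArfRecord and parse_arf_line and the field names are hypothetical, and the line is split on any whitespace, which works as long as no field contains a space.

```python
from typing import NamedTuple


class ArfRecord(NamedTuple):
    read_id: str
    read_length: int
    read_start: int
    read_end: int
    read_seq: str
    genome_id: str
    genome_length: int
    genome_start: int
    genome_end: int
    genome_seq: str
    strand: str
    mismatches: int
    edit_string: str


# Zero-based indices of the columns that hold integers.
INT_COLUMNS = {1, 2, 3, 6, 7, 8, 11}


def parse_arf_line(line):
    """Split one arf line into its 13 columns and convert the numeric ones."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, found {len(fields)}")
    values = [int(v) if i in INT_COLUMNS else v for i, v in enumerate(fields)]
    return ArfRecord(*values)
```

Applied to the example line, the record has read_id PAN_123456_x969696, genome_id chr22, genome_start 46508682, genome_end 46508702 and strand "+". The edit string has one character per mapped read position, so its length here is 21.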
</body>
</html>