-
Notifications
You must be signed in to change notification settings - Fork 16
/
TODO
274 lines (217 loc) · 12.6 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
TODO list
---------
GENERAL
- Get rid of all 'R CMD check' warnings.
- Re-enable the unit tests that were disabled and add tests for all the
functionalities that are still not covered. There is a lot of them!
Right now we use the RUnit framework for the unit tests but feel free
to switch to testthat or any other framework of your choice.
- An R problem that might be worth reporting: Loading a serialized
XStringViews object without prior loading the Biostrings package seems
to have problems: the object loads, but then calling show() on the object
dispatches on the show() method for List objects (defined in S4Vectors)
instead of the method for XStringViews objects.
BASIC CONTAINERS
- Make sure that validObject() on an XStringSet object is actually performing
an extensive validation of the object with no possibility for false
positives (i.e. objects that are actually invalid get caught). This might
involve looking at validity methods for XStringSet parent/ancestor
classes like XRawList and/or XVectorList (both defined in XVector).
- DNA/RNA/AAStringSet constructor functions:
o DNA/RNA/AAStringSet(x, start=c(1,5), end=c(3,7)) when x is a single
string (character or XString). The result should be a DNA/RNA/AAStringSet
object of the length of 'start' (or 'end').
o Low priority: Make sure that DNA/RNA/AAStringSet(x) takes the shortest
possible path to return 'x' when 'x' is already a DNA/RNA/AAStringSet
_instance_ (i.e. when class(x) == "DNAStringSet" or "RNAStringSet"
or "AAStringSet"). Right now it seems that it goes thru some unnecessary
operations. Does it slow it down in a significant way though? I don't
know and some benchmarking would be nice to have. It's just not a
satisfying situation from a theoretical point of view when a no-op
wastes CPU cycles in performing pointless calculations.
- Maybe define a DNAorRNA virtual class (or NucleotideString) that would be
the common parent of the DNAString and RNAString classes. Advantage of this
is that it allows some code simplification by defining stuff shared by
the DNAString and RNAString classes at the DNAorRNA level. For example,
the two following methods:
setMethod("alphabetFrequency", "DNAString", ...)
setMethod("alphabetFrequency", "RNAString", ...)
can be replaced with a single method:
setMethod("alphabetFrequency", "NucleotideString", ...)
Also this:
if (is(x@subject, "DNAString") || is(x@subject, "RNAString")) ...
can be replaced with:
if (is(x@subject, "NucleotideString")) ...
Etc...
- Improve the MIndex container. It needs to support storage of the nb of
mismatches for each hit. No need to store the locations of those mismatches
though: this would require a huge amount of memory and there are other ways
to retrieve these locations on user demand so maybe it's not worth it.
BASIC OPERATIONS
- Several problems with comparison operators (==, !=, <=, >=, < and >)
when at least one of the two operands is an XString derivative.
Issue https://github.com/Bioconductor/Biostrings/issues/51 is just the
tip of the iceberg :-(
For example:
> DNAString("A") != "A"
Error: C stack usage 7969796 is too close to the limit
> DNAString("A") <= "A"
Error: C stack usage 7969828 is too close to the limit
> DNAString("A") >= "A"
Error in order(...) : unimplemented type 'list' in 'orderVector1'
> DNAString("A") < "A"
Error in order(...) : unimplemented type 'list' in 'orderVector1'
> DNAString("A") > "A"
Error: C stack usage 7969796 is too close to the limit
> DNAString("AAA") == "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to DNAString didn't preserve its length
> DNAString("AAA") <= "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to DNAString didn't preserve its length
> BString("AAA") <= "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to BString didn't preserve its length
> DNAString("A") > DNAString("A")
Error: C stack usage 7969796 is too close to the limit
> DNAString("C") > DNA_ALPHABET
Error in .charToXString(seqtype, x, start, end, width) :
input must be a single non-NA string
etc...
All of these should work and behave the same way as when replacing the
XString derivative with a character vector of length 1.
There will be many 'x <op> y' cases to test:
o 'x' is a B/DNA/RNA/AAString and 'y' is a character vector of
arbitrary length;
o 'x' is a character vector of arbitrary length and 'y' is a
B/DNA/RNA/AAString;
o both 'x' and 'y' are XString derivatives of the same type or not;
o <op> is any of the 6 comparison operators.
So there's a little bit of a combinatorial explosion here but we should
be able to come up with a solid set of unit tests that will cover it all.
UTILITIES
- Move lcprefix()/lcsuffix() out of pmatchPattern.R to a file of their own.
(This stuff needs to belong to the UTILITIES component of the package, not
to the STRING ALIGNMENT component.)
- Add patternFrequency() methods for XStringSet and XStringViews objects.
- Maybe add a new generic that combines a subject and an IRanges (or
IRanges-like) object to return an XStringViews object. The IRanges-like
object could be MIndex object and the method for it would do as if it had
received unlist(MIndex). Currently my problem is that I can't come up with
a good name for such generic :-/ Maybe I could just use views(subject, x) for
this (dispatch would be on x, not subject). And the current views function
could be renamed (or maybe it's not needed at all, maybe a fancy
new("IRanges", ...) could replace it).
STRING MATCHING
- Revisit matchLRPatterns() semantic and improve its performance.
The current behaviour is to return *all* L/R match pairs that satisfy
the search criteria. This leads to poor performance, and, maybe more
importantly, it tends to return redundant match pairs (e.g. 2 match
pairs can overlap, or one can be within the limits of the other).
Maybe, by default, a better semantic would be one similar to what
matchProbePair() does, that is, only match pairs that don't contain
another match pair are returned (the assumption here is that those are
the most relevant match pairs). Also, should L/R match pairs where the
left and right matches overlap be accepted?
- When algo="auto", use "naive_inexact" instead of "shift-or" when max.mismatch
is high (e.g. >= 8, finding the exact cut-value requires some testing).
Correct Robert's RBioinf book reporting that the performance decrease
significantly when max.mismatch becomes to large. Should not be the case
anymore.
- Add some convenience function (e.g. a wrapper to .valid.algos()) to let the
curious user know which algos are available/used for a given search problem.
- PDict()/matchPDict()/countPDict()/whichPDict():
o PDict(), matchPDict(): Support fuzzy matching with variable width
dictionaries.
The trick used internally that consists in splitting the patterns into N+1
Trusted Bands (where N is max.mismatch) could be restricted to a "not
really trusted band" specified by the user (not necessarily a prefix), and
then brute force could be used on the head and tail of this "not really
trusted band". The user would also need a way to specify 2 max.mismatch
values: one for the "not really trusted band" and one for the head/tail.
From a user point of view, it could look something like this:
# Right now the following is not allowed (you cannnot specify both
# 'max.mismatch' and 'tb.end'). Also the names of the
# tb.start/tb.end/tb.width args would need to change because it's
# not about the Trusted Band anymore (strictly speaking):
pdict <- PDict(dict, max.mismatch=2, tb.end=min(width(dict)))
# Then to allow up to 2 mismatches on the "not really trusted band"
# and up to 1 mismatch on the tail ('pdict' has no head):
mi <- matchPDict(pdict, subject, max.mismatch=c(2, 1))
The notion of Trusted Band as it is defined right now would not need
to be exposed anymore and would become an entirely internal thing.
From a user point of view it would be replaced by this more general kind
of constant-width band where a small number of mismatches is allowed.
I need a better name than "not really trusted band" for it.
o Document the "allow mismacthes anywhere in the patterns" feature (activated
via the 'max.mismatch' argument of PDict()).
o Support IUPAC ambiguity letters in the DNAStringSet object passed to
PDict().
o Harris suggestion: treat a max.mismatch value that is strictly between 0
and 1 like an error rate (so that the actual max number of mismatches
adjust to the length of each pattern).
o Patrick's suggestion: give the user the option to make matchPDict() return
directly the coverage of the hits (apparently a common use case). That
would avoid the overhead of storing the hits in the (generally big) MIndex
object first.
Maybe put this in a separate function e.g. coveragePDict().
o _match_tbACtree2() doesn't need to walk until the end of the subject: it
could stop when the number of remaining chars to read is < to the
difference between the depth of the AC tree (i.e. the width of the
Trusted Band) and the current depth. This should speed up matchPDict()
(and family) substantially when the length of the subject is very small.
A typical use case where this could be of great benefit is when finding
the neighbors of a given pattern with e.g.
whichPDict(pdict, pdict[[99]], max.mismatch=2).
o Implement the skip.invalid.patterns arg in PDict() (so the user can build
a PDict object from Solexa data that would contain Ns if he wants, reads
with an N would just be skipped).
o Implement "duplicated" and "patternFrequency" methods for PDict objects
with a head or a tail. Add 'as.prob' arg (default FALSE) to
patternFrequency() like for alphabetFrequency().
o extractAllMatches() fails on a very big MIndex object (can't allocate
vector of size 5.2Gb).
o C code improvement: no need to use temporary storage for 'dups_buf' and
'match_count' in match_pdict.c, store directly in the returned
INTEGER vector.
o MIndex objects: at some point the user will want to be able to combine
"compatible" MIndex objects. 2 MIndex objects are "compatible" if they
are describing 2 set of matches coming from the same original dict and on
the same target (subject). In practice, it will be enough that they have
the same index i.e. they have the same pids() or, if the pids() is NULL,
they have the same length.
Then methods like "union", "rangesect", "setdiff", "setequal", etc...
could be defined. The set operation would be performed between the 2
subsets of matches of each input pattern. Of course, 2 matches are
considered equal if their start/end are the same.
o Make reverseComplement() work on a PDict object.
- Maybe add a Biostrings.Rnw vignette with a short overview of the string
matching/aligning capabilities and a "how to choose the right string
matching/aligning function" diagram.
LOW PRIORITY
------------
- Maybe add a no.match.length argument to gregexpr2() that is FALSE by
default so gregexpr(pattern, text, fixed=TRUE) is more interchangeable
with gregexpr2(pattern, text) and use no.match.length=TRUE in matchPattern's
internal code.
- Look at apse.c in R/src/main for Levenshtein. It's coming from
http://search.cpan.org/dist/String-Approx/
and is what Perl and Python are using.
It should stop and return an error code after some max.distance has been
reached. We definitely want to be able to do this if we're going to use it
on the millions of elements of the head and tail of a TB_PDict object.
- Maybe: add a specific containers for results returned by matchPattern
(and family). Would derive from the XStringViews class with at least one
additional slot, the @call slot (of type "language"), that would contain
the value of match.call(), so that one knows what parameters were supplied
for a given matching task.
- Fix pb with length(x) <- 2 screwing up x if it's an XString or XStringViews
object (prevent people from doing this by defining a length() setter that
retuns an error).
- Still have to think about it (Robert suggestion): make [[ work on "out
of limits" views with a warning (we keep issuing an error only when the
view is all blank).
- Point raised in https://github.com/Bioconductor/Biostrings/issues/93 about
having a sequence to letter conversion for AAStrings (e.g., MetThyGly -> MTG).
Seems to be relatively low interest, but maybe it could be implemeneted as
a helper function in the future.