fix: Enforcing character set for AAString objects #97
Also has some other fixes for various minor things:
|
I'm going to modify the codec values to allow for bit-wise comparisons so we can also support fixed pattern matching as requested in #34. Only 4 ambiguity codes exist in the set, so it should be possible to do in fewer than 16 bits. Edit: this doesn't seem to be the case. Trying to think of a solution that would support bitwise comparisons as in Doing this in fewer bits is possible, but tricky. A bytewise match scheme means that we would need values such that any two non-ambiguous codes have an overlap of zero. Maybe there's a trick that could be done to transform lower dimensional space into higher prior to comparison in This runs into an additional problem: the lookup table is only 256x256 bits. This means we can't use numbers larger than 8 bits without completely overhauling the lookup tables (which I'm very hesitant to do). The simplest solution may be to just create a fifth lookup table in |
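To make the bit-count problem above concrete, here is a minimal Python sketch of a one-bit-per-letter scheme. It is not Biostrings code: the code values, the `matches` helper, and the choice of the 20 standard amino acids are assumptions made for illustration. The ambiguity codes B (D or N), Z (E or Q), J (I or L), and X (any residue) are built by OR-ing their members.

```python
# Sketch only: bitmask codes in the style used for nucleotides, then the
# same idea applied naively to amino acids. All values are illustrative.

# One bit per unambiguous base; an ambiguity code is the OR of its members.
DNA_CODES = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
DNA_CODES["R"] = DNA_CODES["A"] | DNA_CODES["G"]  # purine: A or G
DNA_CODES["N"] = 0b1111                           # any base


def matches(code_a: int, code_b: int) -> bool:
    """Two letters are compatible when their bitmasks share a set bit."""
    return (code_a & code_b) != 0


# Assumption: the 20 standard amino acids, one bit each.
AA_STANDARD = "ACDEFGHIKLMNPQRSTVWY"
AA_CODES = {aa: 1 << i for i, aa in enumerate(AA_STANDARD)}
AA_CODES["B"] = AA_CODES["D"] | AA_CODES["N"]  # Asx
AA_CODES["Z"] = AA_CODES["E"] | AA_CODES["Q"]  # Glx
AA_CODES["J"] = AA_CODES["I"] | AA_CODES["L"]  # Xle
AA_CODES["X"] = (1 << len(AA_STANDARD)) - 1    # any standard residue

# Width of the widest code: one bit per unambiguous residue.
bits_needed = max(AA_CODES.values()).bit_length()
```

Under these assumptions `bits_needed` is 20, so a naive one-bit-per-residue scheme fits neither in a byte nor in 16 bits, which is consistent with the difficulty described in the comment. Any scheme that squeezes into 8 bits would have to give up the simple "non-zero AND means match" rule.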
Update on the above--this doesn't seem to be as trivial as I had initially expected; I'll work on it for a separate PR. Including it with this would be too much for a single submission. The current codec values are fine; if they need to be changed, it can be done in the future. |
Thanks @ahl27 The question is: do we really want to encode the letters of an AAString object at construction time? What's the benefit? It seems to me that we could simply use the codec facilities to check that the input letters belong to Also I suspect that there are many places in Biostrings code (and probably in packages that depend on Biostrings) that just assume no internal encoding of the AA letters, so changing this would probably mean breaking/fixing a lot of code. For example:
Unfortunately, a major problem for embarking on serious refactoring of the package is that it totally lacks good unit tests. I'm not proud of that 😞 That's actually something that should be put near the top of the list of things to improve. Other than that, please add Thanks again! |
That's a fair point, I hadn't thought about that. Would it be sufficient to just "encode" the values as their ASCII equivalents? e.g. encode I suppose the benefit of encode/decode would be consistency with DNA/RNA, but I'm realizing that the main advantage of the RNA/DNA schema is that you get the benefit of bitwise comparisons for ambiguity codes. Re: C code, I'll make those adjustments tomorrow! Thanks for the explanation, that definitely makes sense. I also have a fix for #34 that I can PR after this feature is completed. |
One consideration is that using encode/decode may make it easier to resolve #93 by just using an alternate encode function to map the input into the standard AAString format, although there’d be a multicharacter issue so I’m not sure. |
Yes, that's what I meant by not encoding the incoming letters. The codec would be something like this:
That's for sure one of them.
Another reason for encoding RNA/DNA sequences was to be able to switch back and forth between RNA and DNA without having to re-encode anything. Combine that with the fact that Biostrings objects use pointers to share their sequence data, and you get a coercion between RNA and DNA that doesn't need to copy the sequence data at all. This means that it's virtually costless whatever the size of the object. Merely for the fun of it though, as I don't know of any application that benefits from that! Finally, I vaguely remember that maybe this encoding scheme was also somewhat driven by the requirements of the shift-or algorithm (one of the algos supported by
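The copy-free RNA/DNA coercion described above can be pictured with a small Python sketch. This is an illustration of the idea only, not how Biostrings is implemented: the `SeqView` class, the `as_rna` helper, and the code values are hypothetical. Both views keep a reference to the same encoded buffer and differ only in the table used to decode it.

```python
# Sketch only: two views over one encoded buffer, differing in decode table.
DNA_DECODE = {1: "A", 2: "C", 4: "G", 8: "T"}
RNA_DECODE = {1: "A", 2: "C", 4: "G", 8: "U"}  # same code 8, different letter
DNA_ENCODE = {letter: code for code, letter in DNA_DECODE.items()}


class SeqView:
    """Hypothetical wrapper: a reference to encoded bytes plus a decode table."""

    def __init__(self, data: bytes, decode_table: dict):
        self.data = data  # stored by reference, never copied here
        self.decode_table = decode_table

    def __str__(self) -> str:
        return "".join(self.decode_table[b] for b in self.data)


def as_rna(dna: SeqView) -> SeqView:
    """Coerce by swapping the decode table while reusing the same buffer."""
    return SeqView(dna.data, RNA_DECODE)


encoded = bytes(DNA_ENCODE[c] for c in "GATTACA")
dna = SeqView(encoded, DNA_DECODE)
rna = as_rna(dna)
```

Here `rna.data is dna.data` holds, so the cost of the coercion does not depend on sequence length; only decoding for display touches the individual bytes.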
Excellent, thanks again! BTW I'll be offline starting tomorrow (Wed) until the end of the week, chaperoning for a 3-day field trip to a remote area with no internet and no cell reception. So I won't be able to look again at this until next week. |
Ah okay, all that makes sense! This is teaching me a lot haha, I appreciate the detailed explanation. I’ll make that change and then keep working on some of the other open issues; no worries on responding, enjoy the field trip! |
Fixes are pushed for these!
I've left in the decode logic; I felt like it made more sense to have both for consistency in the code, and I was having trouble figuring out how to selectively disable This is not completely backwards compatible; the following breaks:
Obviously users shouldn't be manually changing the classes, but this would be an issue if someone had previously saved an Older saved objects can always be forward converted with a small helper function like:
I'm not sure, let me know what you'd like to see for the package. |
Thanks @ahl27
Sounds good.
We're going to take that risk, hoping that not too many saved AAString/AAStringSet objects contain letters not in BTW Bioconductor defines the We should probably have
BTW I'm hesitant to merge this PR for the 3.17 release, timing is not ideal. Hope you don't mind if we aim for the beginning of the BioC 3.18 devel cycle for the merge. Thanks again. |
Thanks for the feedback! That all makes sense. I'll add in updated methods for
Absolutely, sounds great! |
This has been updated with the requested changes, as well as a small adjustment to the corresponding alphabetFrequency
hasOnlyBaseLetters
updateObject
Valid input:
Invalid input:
I'm having issues with displaying the correct characters--I'm guessing this has something to do with how characters are stored internally. However, the error message is still a lot more informative than before. |
Thanks. I don't think |
Oops, thanks for the catch. I'm actually not sure how that ended up in the output--the current version is working as you mentioned:
I'm not sure how I managed to get the |
Fixed! Thanks, Erik will be happy to hear that haha |
Back to this after a long pause. Sorry for the slow response. Here's some feedback about the 3 new
Thanks, |
Ah okay, thanks for all the feedback! Very helpful for getting familiar with the underlying Biostrings code; I'll make these changes first thing tomorrow. |
All good. Thanks! |
Adds a new codec for AAString objects to enforce the character set, resolving bugs #84 and #10. Lowercase characters are correctly converted to uppercase, and characters not in AA_ALPHABET (including multibyte characters) throw errors.
Let me know if the coding style looks okay; RStudio has been doing some funky stuff auto-reformatting all my tabs on any file I open (and ignoring my requests to use 4-length tabs!) but I think I've fixed it for future PRs.
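The behavior described in this PR (letters stored as their own ASCII values, lowercase folded to uppercase, anything else rejected) can be sketched in Python as a 256-entry lookup table. This is a hedged illustration rather than the PR's actual R/C code: `build_encode_table` and `encode_aa` are hypothetical names, and the 30-letter alphabet string is an assumption about the contents of `AA_ALPHABET`.

```python
# Sketch only: an "identity" codec that validates instead of re-encoding.
# Assumption: a 30-letter amino acid alphabet standing in for AA_ALPHABET.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWYUOBJZX*-+."


def build_encode_table(alphabet: str) -> list:
    """Return a 256-entry table: input byte -> stored byte, or -1 if invalid."""
    table = [-1] * 256
    for letter in alphabet:
        table[ord(letter)] = ord(letter)          # identity: stored as ASCII
        table[ord(letter.lower())] = ord(letter)  # lowercase maps to uppercase
    return table


ENCODE_TABLE = build_encode_table(AA_ALPHABET)


def encode_aa(text: str) -> bytes:
    """Validate and normalize a sequence; raise ValueError on a bad byte."""
    raw = text.encode("utf-8")  # multibyte characters become bytes >= 0x80
    out = bytearray()
    for pos, byte in enumerate(raw):
        code = ENCODE_TABLE[byte]
        if code < 0:
            raise ValueError(
                f"byte 0x{byte:02x} at offset {pos} is not in the alphabet"
            )
        out.append(code)
    return bytes(out)
```

For example, `encode_aa("makv")` returns `b"MAKV"`, while input containing a digit or a non-ASCII character raises `ValueError`. Because valid letters are stored as their own ASCII values, code that assumes no internal encoding of amino acid letters would keep working under this kind of scheme.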