AAString - Amino acid code enforced? #10
Hi Felix, Sorry for not responding earlier. You're right, the A simple way to enforce a given alphabet without using any of the infrastructure defined in Biostrings is by doing something like this:
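A minimal base-R sketch of this kind of check might look as follows. The helper name `enforce_alphabet` and the error wording are assumptions made for illustration; they are not part of Biostrings.

```r
# Hypothetical helper (not a Biostrings function): stop on the first
# letter of `x` that does not belong to `alphabet`.
enforce_alphabet <- function(x, alphabet) {
    stopifnot(is.character(x), length(x) == 1L, !is.na(x))
    chars <- strsplit(x, "", fixed = TRUE)[[1L]]   # one element per letter
    bad <- which(!(chars %in% alphabet))
    if (length(bad) > 0L) {
        stop("invalid letter '", chars[bad[1L]],
             "' at position ", bad[1L])
    }
    invisible(x)
}

# The 20 standard amino acid letters, spelled out so the sketch is self-contained
aa_letters <- strsplit("ACDEFGHIKLMNPQRSTVWY", "", fixed = TRUE)[[1L]]

enforce_alphabet("MKVLAA", aa_letters)     # passes silently
# enforce_alphabet("MKXLAA", aa_letters)   # would stop: 'X' at position 3
```

Splitting the whole string into single letters is simple but allocates one R string per letter, which is the memory cost the next paragraph refers to.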
However taking advantage of the Biostrings infrastructure will allow you to implement a much more efficient solution and to reduce memory footprint significantly. For example, the following function:
is much faster:
and produces an informative error message on the first invalid input letter:
After you've defined your own XString extension, you can use the above function to quickly construct instances of this new class and enforce the alphabet:
Note that the
To make it work, you need to make
Then:
Now
Hope this helps, |
Hi Herve, Thanks for the examples. This is in part what I was looking for. I looked at the functions defined for the XString subclasses and I think I have all the things covered such as Also, I will have a look at the behavior of I don’t know if the single integer is a prerequisite for the C backend, but with this check for an atomic vector in This is actually what I am interested in, since the alphabets I want to use are a bit more complex. I also don't necessarily need to encode them via a lookup table, but the alphabet check would be nice. I will test again via your examples and report back. Felix |
Hi Herve, I am already stuck with the first example. I also have the greek letter "α" in my alphabet, which gets returned by safeExplode as "a".
It seems to me that I don't understand the full set of capabilities of R regarding string encoding and behavior. The Rmd HTML output also looks different from the console output (in the first line it is shown as "α"). If the letter alpha trips me up, then I don't think it is wise to ask for features/changes in Biostrings at this point in time. I need to figure out the general R behavior before I can do anything useful with this. Sorry for taking your time regarding this case. However, I hope the alphabet checking in general might be implemented in the next version. Maybe I can use it then. Thanks for all the great features in this and all the other packages. Felix |
Hi Felix,
Biostrings only supports single-byte characters. When multi-byte characters are present in the input of the
H. |
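Because only single-byte characters are supported, it can help to screen an input for multi-byte letters before handing it to Biostrings. The following is a minimal base-R sketch; the helper name `is_multibyte` is an assumption made for illustration and is not part of Biostrings.

```r
# Hypothetical helper: flag each character of `x` that needs more than
# one byte in UTF-8.
is_multibyte <- function(x) {
    chars <- strsplit(enc2utf8(x), "", fixed = TRUE)[[1L]]
    out <- nchar(chars, type = "bytes") > 1L
    names(out) <- chars
    out
}

alpha <- "\u03b1"        # Greek small letter alpha, written as an escape
charToRaw(alpha)         # two bytes in UTF-8 (ce b1)
is_multibyte(paste0("AC", alpha))
```

Writing the letter as a `\u` escape keeps the example independent of the encoding of the script file itself.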
@hpages, I am running R version 3.5.0 with Bioconductor version '3.7'. I see above that this was slated to have been addressed in BioC 3.8. Did that actually happen? I see no mention of it in the release news (https://bioconductor.org/news/bioc_3_8_release/#news-from-new-and-existing-software-packages). Assuming it did not, I think this issue should be re-opened until it is addressed/fixed. Ultimately, at least the documentation should be fixed. It currently states unambiguously and incorrectly that "the AAString container can only store a string based on the Amino Acid alphabet". I believed the documentation and coded something as though it were TRUE. The fact that it was not true led me on a wild goose chase. I recommend sparing "the next guy" this pain. In the meantime, looking for the best workaround for finding the indices of AAStringSet members which violate AA_STANDARD, I'm doing this:
Am I missing an optimization? |
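For readers following along, the task described here (finding the indices of sequences that contain letters outside the standard amino acid alphabet) can be sketched in base R on a plain character vector. The helper name `which_invalid` is hypothetical, and the alphabet is spelled out locally rather than taken from the `AA_STANDARD` constant so that the sketch runs without Biostrings; with the package loaded, `as.character()` on an AAStringSet gives a character vector of this shape.

```r
# Hypothetical helper: indices of the strings in `seqs` that contain at
# least one letter outside `alphabet`.
which_invalid <- function(seqs, alphabet) {
    has_invalid <- vapply(
        strsplit(seqs, "", fixed = TRUE),
        function(chars) !all(chars %in% alphabet),
        logical(1L)
    )
    which(has_invalid)
}

# Local stand-in for the 20 standard amino acid letters
aa_standard <- strsplit("ACDEFGHIKLMNPQRSTVWY", "", fixed = TRUE)[[1L]]

seqs <- c("MKVLAA", "MKXLAA", "ACDE", "AC*E")
which_invalid(seqs, aa_standard)   # 2 4
```

This is a straightforward but not an optimized approach; it explodes every sequence into single letters, so it will be slow on large sets.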
Sometimes with a hammer everything looks like a nail but
gets a LogicalList with the same geometry as |
@malcook Sorry this didn't happen in BioC 3.8. I guess other priorities got in the way. I agree this issue should stay open in the meantime. FWIW an even more efficient solution is to use the
But yes, this needs to go to Biostrings. |
@mtmorgan - your hammer does not quite work in my hands because for starters the first assignment generates an error - you apparently cannot unlist an AAStringSet:
however, we can
at which point, we can proceed as you suggest, and, as well, most usefully to me:
|
@malcook Note that you can I realize now that the
This should be a lot more efficient than trying to validate the AAStringSet object a posteriori with |
Fixed in PR #97 |
Thanks @ahl27 for solving this. |
I wrote on the BioC devel list a few days ago about a problem/behavior I encounter with the AAString class. I haven't received a reply, so I am adding this as an issue, because in my opinion it is one. Thanks for having a look at it.
Below are the contents of the mail describing the issue:
I tried the following code, which should according to ?AAString not work, since ÜÖÄ are not part of any AA code.
I don’t have access right now to the devel version of Biostrings, but I checked out the current code in the GitHub repo and its recent changes. I am pretty sure that this behavior is also in the current devel branch. Can someone confirm this?
My current interest is in using the XString classes and methods for an additional biological string representation. The initial question was: how can I restrict this to a certain character set if the characters are not saved byte-encoded? The latter option is not available to me, since characters like '«' or '=' result in a two-byte code using the charToRaw function. This trips up the build of the internal lookup tables, which are passed down to the C backend.
Therefore I looked into how this is done for an AAString as opposed to a BString. I discovered that it currently isn't. I also looked into the current 2.47.12 repo, which as far as I can tell does not use the AMINO_ACID_CODE constant in the creation of an AAString object.
So my questions are:
Thanks in advance for any help and suggestions.
Best regards,
Felix
PS: regarding the second question: One could change „as.integer(charToRaw(paste(letters, collapse="")))“ to „lapply(lapply(letters,charToRaw),as.integer)“ in .letterAsByteVal, but in any case it will not be atomic anymore, which I think is required for it to be accepted by the C backend. I didn’t test it.
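The difference between the two expressions in the PS can be seen in a small base-R sketch. The function name `byte_vals` and the example alphabets are assumptions made for illustration; this is not the actual `.letterAsByteVal` implementation.

```r
# One integer byte value per letter, which only lines up with the
# alphabet when every letter is a single byte.
byte_vals <- function(alphabet_letters) {
    as.integer(charToRaw(paste(alphabet_letters, collapse = "")))
}

ascii_letters <- c("A", "C", "G", "T")
mixed_letters <- c("A", "\u03b1")   # alpha takes two bytes in UTF-8

length(byte_vals(ascii_letters)) == length(ascii_letters)   # TRUE
length(byte_vals(mixed_letters)) == length(mixed_letters)   # FALSE: 3 bytes, 2 letters

# The per-letter variant keeps the letters apart, but the result is a
# list of integer vectors rather than an atomic vector.
per_letter <- lapply(lapply(mixed_letters, charToRaw), as.integer)
is.atomic(per_letter)                                        # FALSE
```

With a multi-byte letter in the alphabet, the flat version no longer has one entry per letter, and the per-letter version no longer has the atomic shape that a byte-indexed lookup table would need.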