Amino Acid sequence to letter function #93

Closed
LiNk-NY opened this issue Feb 20, 2023 · 10 comments

@LiNk-NY
Contributor

LiNk-NY commented Feb 20, 2023

Hi Hervé, @hpages
Is there a function that takes a string, e.g., "MetThrGly", and converts it to "MTG"?
If not and within scope, I can work on implementing one using AMINO_ACID_CODE.
Best,
Marcel

@LiNk-NY LiNk-NY self-assigned this Feb 20, 2023
@hpages
Contributor

hpages commented Feb 24, 2023

Hmm.. interesting! I've never seen amino acid sequences in that format. Out of curiosity, may I ask how/where people retrieve amino acid sequences that are in the "MetThrGly" format?

@LiNk-NY
Contributor Author

LiNk-NY commented Feb 27, 2023

I don't have a good answer.

Perhaps Laurent @lgatto or Johannes @jorainer can provide some insight?

FWIW, this type of functionality is available on webpages and even in MATLAB:
https://www.mathworks.com/help/bioinfo/ref/aminolookup.html

I don't think it would hurt to include it out of convenience given that AMINO_ACID_CODE is in the package.

@lgatto

lgatto commented Feb 27, 2023

I don't know of any such functionality and I have never had a need for it. AA codes and other info are available from PSMatch::getAminoAcids(). PSMatch is a package that deals with peptide spectrum matches, i.e. peptide/protein identification from mass spectrometry experiments.

@hpages
Contributor

hpages commented Feb 27, 2023

Good to know about PSMatch.

I'd rather have a good use case before adding something like this to Biostrings, or at least have some user request it. I agree that in theory it doesn't hurt to have it, but still, I'm not a big fan of adding functionality that nobody is going to use.

Anyways, since I just spent some time playing with this a little, I'll put what I came up with here, for the record:

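# To run this outside the package: library(Biostrings); library(S4Vectors)
# (isSingleString(), isTRUEorFALSE() and wmsg() are S4Vectors helpers)
.prepare_invalid_abbrev3_fancy_msg <- function(x30, bad_idx, n=5L)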
{   
    nbad <- length(bad_idx)
    idx <- head(bad_idx, n=n)
    bad_abbrev3 <- x30[idx]
    details <- paste0("\"", bad_abbrev3, "\" at position ", idx, collapse=", ")
    if (nbad > n)            
        details <- paste0(details, " etc... (", nbad - n, " more)")
    paste0("input contains invalid three-letter abbreviation(s): ", details)
}

makeAAStringFromAbbrev3Seq <- function(x, ignore.case=FALSE)
{
    if (!isSingleString(x))
        stop(wmsg("'x' must be a single string"))
    if (nchar(x) %% 3L != 0L)
        stop(wmsg("number of characters in input must be a multiple of 3"))
    if (!isTRUEorFALSE(ignore.case))
        stop(wmsg("'ignore.case' must be TRUE or FALSE"))
    x <- BString(x)  
    x3 <- x30 <- as.character(successiveViews(x, rep.int(3L, length(x) %/% 3L)))
    ALL_ABBREV3 <- c(AMINO_ACID_CODE, `*`="END", `-`="GAP")
    if (ignore.case) {
        x3 <- tolower(x3)
        ALL_ABBREV3 <- tolower(ALL_ABBREV3)
    }
    m <- match(x3, ALL_ABBREV3)
    bad_idx <- which(is.na(m))
    if (length(bad_idx) != 0L) 
        stop(wmsg(.prepare_invalid_abbrev3_fancy_msg(x30, bad_idx)))
    codes <- names(ALL_ABBREV3)[m]
    AAString(paste(codes, collapse=""))
}

makeAAStringFromAbbrev3Seq("MetTrpLysGlnAlaGluAspIleArgAspIleTyrAspPhe")
# 14-letter AAString object
# seq: MWKQAEDIRDIYDF

Thanks guys.

@LiNk-NY
Contributor Author

LiNk-NY commented Apr 25, 2023

Thanks for your work on this. Any updates for this issue based on #97? From what I read, it should be easier to implement with the encoding framework.

@ahl27
Collaborator

ahl27 commented Apr 28, 2023

I'm not sure. It's quite a bit more work than I had initially expected: XStringSets assume single-byte character input, so we'd have to rewrite quite a bit of code to support multi-character input values. I'm not sure I'll be able to get to this in the near future, as there are other Biostrings issues that are higher priority at the moment on top of my research.

I'd echo Hervé's point that the functionality doesn't seem to be requested by users aside from just having it to have it. If you have a use case that it would be relevant for please let me know, or if you have an implementation feel free to open a PR.

End-users can already get this functionality with something simple like:

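# Assume that CONVERSION_STRING is a named character vector
# like c("M","T","G",...) with names c("met","thr","gly",...)

convertAA <- function(aastr){
    # Insert a space after every three letters, then split on spaces
    # to get one lowercase three-letter abbreviation per element.
    abbrev3 <- strsplit(gsub('([a-z]{3})', '\\1 ', tolower(aastr)), ' ')[[1]]
    converted <- CONVERSION_STRING[abbrev3]
    AAString(paste(converted, collapse=''))
}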

Hervé's function is definitely a lot safer with regard to error checking.

Implementing it in a robust and clean way within Biostrings would be a lot harder; it would likely require a custom method, since these characters all map to amino acids on their own (e.g., AAString("MetThrGly") == AAString("METTHRGLY") == AAString("metthrgly")).
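
For readers who want to try the preprocessing approach without Biostrings installed, here is a minimal base R sketch. It is not code from the package: the lookup table is hard-coded for the 20 standard amino acids only (with Biostrings loaded it could presumably be built from AMINO_ACID_CODE instead), `convert_abbrev3` is a hypothetical name, and the function returns a plain character string rather than an AAString.

```r
# Three-letter -> one-letter lookup for the 20 standard amino acids.
# Hard-coded so the sketch runs without Biostrings (assumption: only
# the standard residues are needed; no Sec/Pyl/ambiguity codes).
one_letter <- c("A","R","N","D","C","Q","E","G","H","I",
                "L","K","M","F","P","S","T","W","Y","V")
three_letter <- c("Ala","Arg","Asn","Asp","Cys","Gln","Glu","Gly","His","Ile",
                  "Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val")
CONVERSION_STRING <- setNames(one_letter, tolower(three_letter))

# Hypothetical helper: case-insensitive conversion with basic validation.
convert_abbrev3 <- function(aastr) {
    stopifnot(is.character(aastr), length(aastr) == 1L, !is.na(aastr))
    n <- nchar(aastr)
    if (n %% 3L != 0L)
        stop("number of characters in input must be a multiple of 3")
    if (n == 0L)
        return("")
    # Cut the input into consecutive 3-character chunks.
    starts <- seq.int(1L, n, by = 3L)
    chunks <- substring(aastr, starts, starts + 2L)
    # Unknown abbreviations come back as NA from the named lookup.
    codes <- CONVERSION_STRING[tolower(chunks)]
    bad <- which(is.na(codes))
    if (length(bad) != 0L)
        stop("invalid three-letter abbreviation(s): ",
             paste0("\"", chunks[bad], "\" at position ", bad, collapse = ", "))
    paste(codes, collapse = "")
}

convert_abbrev3("MetThrGly")
# "MTG"
```

Wrapping the result in AAString() would give the same kind of object as the snippets above; the explicit checks are there because a bare lookup silently turns unknown abbreviations into NA.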

@ahl27
Collaborator

ahl27 commented Apr 28, 2023

On second thought, it could be pretty simple to just add an optional argument like useThreeLetterCodes=FALSE to the AAString method and then, if it is TRUE, call a preprocessing function like the one above (or Hervé's better implementation) to reformat the string from three-letter codes to single-letter codes.

At that point though, I guess the question is if people are actually doing that, and if so, if that functionality is needed in Biostrings or if end-users can just preprocess it themselves.
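
As a rough illustration of that idea, the sketch below puts the optional flag on a stand-alone wrapper. Everything in it is an assumption for illustration: `make_aa_string` and `ABBREV3_TO_CODE` are hypothetical names, the table covers only the 20 standard amino acids, and the function returns a plain character string where real code would call AAString() on the result. It does not modify or reflect the actual AAString constructor.

```r
# Hypothetical lookup table (lowercase three-letter -> one-letter code).
ABBREV3_TO_CODE <- c(ala="A", arg="R", asn="N", asp="D", cys="C",
                     gln="Q", glu="E", gly="G", his="H", ile="I",
                     leu="L", lys="K", met="M", phe="F", pro="P",
                     ser="S", thr="T", trp="W", tyr="Y", val="V")

# Hypothetical wrapper; not part of Biostrings.
make_aa_string <- function(x, useThreeLetterCodes = FALSE) {
    stopifnot(is.character(x), length(x) == 1L, !is.na(x))
    if (useThreeLetterCodes) {
        # Preprocess: split into chunks of up to 3 characters and
        # translate each chunk; a short or unknown chunk yields NA.
        chunks <- regmatches(x, gregexpr(".{1,3}", x))[[1]]
        codes <- ABBREV3_TO_CODE[tolower(chunks)]
        if (anyNA(codes))
            stop("input contains invalid three-letter abbreviation(s)")
        x <- paste(codes, collapse = "")
    }
    x  # real code would return AAString(x) here
}

make_aa_string("MetThrGly")                              # "MetThrGly"
make_aa_string("MetThrGly", useThreeLetterCodes = TRUE)  # "MTG"
```

The default leaves the input untouched, so existing callers would see no change in behavior; only callers who opt in get the three-letter interpretation.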

@ahl27
Collaborator

ahl27 commented Jun 6, 2024

Sorry for the slow follow up--I think for now I'm going to leave this as unimplemented. I'm not sure it makes sense to change the constructor AAString method to have an additional argument for this case. If people are interested in this functionality I can add it to my backlog to address, but for now it's unplanned. I'll keep the issue open in case other people have further thoughts.

@LiNk-NY
Contributor Author

LiNk-NY commented Jun 6, 2024

Thanks for following up. I'm okay with leaving it unimplemented since there are no follow ups from the community.

@ahl27
Collaborator

ahl27 commented Jun 7, 2024

I added this to the TODO file so I don't forget about it in the future--I'll look into revisiting this when I have more bandwidth and the higher priority tasks are cleared up.

@ahl27 ahl27 closed this as not planned Jun 7, 2024

4 participants