Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alias not mapped if gene is correct #14

Open
Aggarwal-Ayush opened this issue Jan 20, 2021 · 6 comments
Open

Alias not mapped if gene is correct #14

Aggarwal-Ayush opened this issue Jan 20, 2021 · 6 comments

Comments

@Aggarwal-Ayush
Copy link
Contributor

Example

> library(HGNChelper)

> hgnc.table[hgnc.table$Symbol == "AAVS1", ]
   Symbol Approved.Symbol chromosome
1:  AAVS1           AAVS1         19
2:  AAVS1        PPP1R12C         19

> checkGeneSymbols("AAVS1")
Maps last updated on: Mon Sep 28 18:31:21 2020
      x Approved Suggested.Symbol
1 AAVS1     TRUE            AAVS1

> mouse.table[mouse.table$Symbol %in% c("Cpamd8", "Mug2"), ]
      Symbol Approved.Symbol
6156    Mug2          Cpamd8
6157  Cpamd8            Mug2
78973   Mug2            Mug2

> checkGeneSymbols(c("Cpamd8", "Mug2"), species = "mouse")
Maps last updated on: Mon Sep 28 18:31:21 2020
       x Approved Suggested.Symbol
1 Cpamd8     TRUE           Cpamd8
2   Mug2     TRUE             Mug2

According to me the desired behaviour for a unit test should be this:

library(testthat)
library(HGNChelper)
res <- checkGeneSymbols("AAVS1")
expect_identical(res$Approved, TRUE)
expect_identical(res$Suggested.Symbol, "AAVS1 /// PPP1R12C")

res <- checkGeneSymbols(c("Cpamd8", "Mug2"), species = "mouse")
expect_identical(res$Approved, c(TRUE, TRUE))
expect_identical(res$Suggested.Symbol, c("Cpamd8 /// Mug2", "Mug2 /// Cpamd8"))

I can apply the fix for it if you think this issue needs fixing. We'll also need to change some unit test for this to apply.

@lwaldron
Copy link
Member

If I understand correctly, AAVS1 is a virus integration site within the PPP1R12C gene, and Cpamd8/Mug2 are both synonyms for a protein-coding gene. In the AAVS1/PPP1R12C case they are different things, and don't need to be "corrected" to show both. In the second case, my reading of https://www.alliancegenome.org/gene/MGI:99836 is that both are correct, so expansion would be informational but not necessary.

If I'm right about these cases, I would be inclined to keep the current default behavior. But you could add a flag like expand.ambiguous (default FALSE) that would do the expansion you're talking about when set to TRUE (using the word "ambiguous" rather than "synonym" or "equivalent" since we don't have the information in the maps to know why the mapping is present in both directions. Does this make sense @Aggarwal-Ayush @shbrief ?

@shbrief
Copy link
Contributor

shbrief commented Jan 20, 2021

@lwaldron Yes, it makes sense. I also prefer the current behavior - as far as the input is an approved symbol, providing additional aliases seems to be an unnecessary complication. It would be fine to add a expand.ambiguous flag with default FALSE.

@Aggarwal-Ayush
Copy link
Contributor Author

Yes, this makes sense for the cases that I have presented. I was just worried about other such cases if present and also as the second reviewer mentioned, it could be that an older dataset is referring to an alias of the gene which is now also an approved symbol. In any case, expand.ambigous should solve this issue.

@Rendan86
Copy link

Rendan86 commented Feb 5, 2021

Hi, I am quite the n00b around here. So first of all, please accept my apologies if I am writing in the wrong place.
I just found this package and I am so very happy, as I have been struggling with Symbol standardization.

I don´t think I should open a new issue here, as my question is related to this current one.
I have gone through all the documentation but I still find myself clueless...
For Symbols in "x" that HGNC does not have a current "Approved" symbol, and more than one are suggested (maybe previous or alias according to HGNC database), how would you suggest to proceed so I end with a 1:1 symbol output that I can use to normalize nomenclature through several datasets?

Thanks in advance!

EDIT: After working a little more with the package, I may have found a real issue. When a list of symbols contains 2 entries for "2-Mar", they can stand for MARC2 or MARCH2, which are currently identified with an approved symbol in HGNC as MTARC2 and MARCHF2, respectively. However, $findExcelGeneSymbols() maps both only as MTARC2

EDIT2: After further inspection, I found that actually the output gene list of findExcelGeneSymbols() only changes the first "2-Mar" to MTARC2, the other one remains "2-Mar"....

@lwaldron
Copy link
Member

lwaldron commented Feb 8, 2021 via email

@lwaldron
Copy link
Member

I see what you mean @Rendan86 :

> checkGeneSymbols("2-Mar")
Maps last updated on: Thu Mar 25 08:36:49 2021
      x Approved   Suggested.Symbol
1 2-Mar    FALSE MARCHF2 /// MTARC2
Warning messages:
1: In checkGeneSymbols("2-Mar") :
  Human gene symbols should be all upper-case except for the 'orf' in open reading frames. The case of some letters was corrected.
2: In checkGeneSymbols("2-Mar") : x contains non-approved gene symbols
> findExcelGeneSymbols("2-Mar")
[1] "MTARC2"
Warning message:
In findExcelGeneSymbols("2-Mar") :
  Transmogrified gene symbols found.  Returning the following corrections: 2-Mar to MTARC2
> 

findExcelGeneSymbols does not do a check against all aliases, and I actually think that it no longer serves a useful purpose in the package. I would recommend using checkGeneSymbols instead, and I'll deprecate findExcelGeneSymbols unless I hear complaints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants