-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix: Synonym Sync: Remove recapitalization of acronyms #724
base: bugfix-syn-sync-dupe-rows-684
Are you sure you want to change the base?
Bug fix: Synonym Sync: Remove recapitalization of acronyms #724
Conversation
Fixed bug where this operation was being done for -updated and -confirmed, but it should only be done for -added.
9a5a058
to
b8c0679
Compare
I need to come back to this. Might just want to get rid of this code here and introduce a warning elsewhere if Mondo acronyms are lowercase. |
Can you create a ticket and explain what is happening and what the current consequences are and which of these consequences this PR fixes? Your comment:
makes it sound like there is more to think through here. |
- Bug fix: I don't think we should have ever had this in the first place. I don't see that it actually fixes anything; only should cause bugs.
# - Acronyms: Use source case | ||
# This operation prevents capitalization from being lost, as sometimes Mondo has used lowercase. However, | ||
# ignore cases where 'synonym_case_source' is multi-value. | ||
df['synonym'] = df.apply(lambda row: row['synonym_case_source'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General explanation & review
I don't think this code makes sense, and should be removed. I only see it as being able to cause bugs.
This operation prevents capitalization from being lost, as sometimes Mondo has used lowercase.
In other words, changes the capitalization of synonym
away from what is in Mondo, using instead what is in the source.
Reasons why I think this code was problematic, by template:
-confirmed
: If the synonym
spelling is different between mondo-edit
and what is shown in one of these rows, then the operation where we update mondo-edit
to update source evidence for that synonym should fail.
-updated
: If the synonym
spelling is different between mondo-edit
and what is shown in one of these rows, then the operation where we update mondo-edit
to update scope and source evidence for that synonym should fail.
-added
: Doesn't make sense to have this code run for -added
. There would have been no Mondo variation spelling for the given synonym anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twhetzel RE: your comment 12 min ago. I had to step out, but I just got back, and here is the explanation. This should be sufficient but if you still want a ticket, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joeflack4 where are examples where the original code has caused an issue or where this code is "fixing" things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The immediate bug in #720 that I noticed as a result of this looked like this. Here were all the cases:
(If the table below is hard to read, can also view: example.csv)
mondo_id | synonym_scope_source | synonym | synonym_case_mondo | synonym_case_diff_mondo | synonym_case_mondo_is_many | synonym_case_source | synonym_case_diff_source | synonym_case_source_is_many | source_id | synonym_type | case |
---|---|---|---|---|---|---|---|---|---|---|---|
MONDO:0003210 | oio:hasExactSynonym | ICC|iCC | ICC | FALSE | ICC|iCC | ICC|iCC | TRUE | NCIT:C35417 | http://purl.obolibrary.org/obo/mondo#ABBREVIATION | confirmed | |
MONDO:0013730 | oio:hasExactSynonym | GVHD|GvHD | GVHD | FALSE | GVHD|GvHD | GVHD|GvHD | TRUE | NCIT:C3063 | http://purl.obolibrary.org/obo/mondo#ABBREVIATION | confirmed | |
MONDO:0009348 | oio:hasExactSynonym | CHL|cHL | Chl | Chl | FALSE | CHL|cHL | CHL|cHL | TRUE | NCIT:C7164 | http://purl.obolibrary.org/obo/mondo#ABBREVIATION | updated |
You can see that synonym
in these rows was set to the same value as synonym_case_diff_source
. Previous to #720, this would have resulted in the following changes for each of the rows shown above:
Before:
ICC -> iCC
GVHD -> GvHD
Chl -> CHL
Chl -> cHL
"Chl" would've had 2 rows because before #720, there were these duplicate rows.
But now, since I'm combining rows when the source has multiple capitalizations for a single synonym, it was setting synonym
to these values:
After:
ICC -> ICC|iCC
GVHD -> GVHD|GvHD
Chl -> CHL|cHL
Obviously, "after" is a bug. None of those |
-delimited strings should themselves be set to the value of synonym
.
But even "before" was bad. We shouldn't be changing the value of the synonym
field at all, I don't think. And even if we had the good intention of addressing situations where Mondo has a lowercase abbreviation, none of these cases in "before" are instances of that either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for how this code was behaving before #720, here are some examples from -confirmed
:
mondo_id | synonym_scope_source | synonym | synonym_case_diff_mondo | synonym_case_diff_source | source_id |
---|---|---|---|---|---|
MONDO:0000923 | oio:hasExactSynonym | PIE | pie | PIE | NCIT:C34571 |
MONDO:0002429 | oio:hasExactSynonym | IIP | IIp | IIP | NCIT:C35714 |
MONDO:0004991 | oio:hasExactSynonym | BAC | bac | BAC | NCIT:C2923 |
The synonym_case_diff_mondo
row represents the capitalization of the synonym as it actually exists in Mondo. You can see that for these rows, the code detected that these were acronyms, and therefore it did it's thing where it used the source capitalization instead. But that can't be good. With the capitalization changed, we won't be able to properly run the -confirmed
template on these synonyms, because the capitalization won't match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE: GVHD
This one looks correct, at least per how we have designed the synonym sync thus far. The "primary key" that is joined between Mondo and the source is a full lowercasing of the synonym. So in this case, the PK is gvhd
. That is how these Orphanet and DO terms are getting linked up here.
Edit: Note: A consequence of joining on the lowercasing of synonyms is that when these matches occur, entries will appear in -confirmed
or -updated
, whereas they otherwise would have shown up in -added
as a new synonym.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE: PIE
Hmm, it seems OLS is behind. In mondo-edit
, only "pie" exists but "PIE" does not exist for MONDO:0000923.
If both did exist, then the bug I am describing above, would have manifested as 2 rows where synonym
would have been "PIE" instead of 1 row with "PIE" and 1 row with "pie".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE: GVHD
For synonyms that are not abbreviations I think lowercasing the "primary key" is ok, but I am not yet convinced that is the correct approach for synonyms that are abbreviations since this is adding provenance for an entity/triple (synonym) that does not exist. I would like to get feedback from @matentzn and/or @sabrinatoro on this to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RE: PIE
Ahh. This is missing from my mondo-edit
because it was recently added. But the reason it's missing from v9 of the google sheet is because v9 comes from the "mini build", which is ran just to ensure the code is running, but doesn't refresh the sources. I ran a couple builds on this since and would have uploaded the output, but I got the empty synonym_sync_combined_cases.robot.tsv
issue (#691), so I didn't bother. I just finagled it into creating this file though, and I do now see it now.
I'll upload this new output as v9.2.
Incidentally I was just to / will begin working on #691 momentarily!
RE: GVHD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments inline.
Overview
Fixed bug where this operation was being done for -updated and -confirmed, but it should only be done for -added.Removed this recapitalization.
Pre-merge checklist
Documentation
Was the documentation added/updated under
docs/
?QC
Was the full pipeline run before submitting this PR using
sh run.sh make build-mondo-ingest
on this branch (afterdocker pull obolibrary/odkfull:dev
), and no errors occurred?New Packages
Were any new Python packages added?
Were any other non-Python packages added?
PR Review and Conversations Resolved
Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?