Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: Synonym Sync: Remove recapitalization of acronyms #724

Open
wants to merge 3 commits into
base: bugfix-syn-sync-dupe-rows-684
Choose a base branch
from

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Dec 17, 2024

Overview

Fixed bug where this operation was being done for -updated and -confirmed, but it should only be done for -added.
Removed this recapitalization.

Pre-merge checklist

Documentation

Was the documentation added/updated under docs/?

  • Yes
  • No, updates to the docs were not necessary after careful consideration

QC

Was the full pipeline run before submitting this PR using sh run.sh make build-mondo-ingest on this branch (after
docker pull obolibrary/odkfull:dev), and no errors occurred?

  • Yes
  • No, there are no functional (code-related) changes to the pipeline in the PR, so no re-run is necessary

New Packages

Were any new Python packages added?

Were any other non-Python packages added?

PR Review and Conversations Resolved

Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?

  • Yes

Fixed bug where this operation was being done for -updated and -confirmed, but it should only be done for -added.
@joeflack4 joeflack4 force-pushed the bugfix-syn-sync-acronym-recap branch from 9a5a058 to b8c0679 Compare December 17, 2024 21:33
@joeflack4 joeflack4 changed the title Bug fix: Synonym Sync: Recapitalization of acronyms Bug fix: Synonym Sync: Accidental recapitalization of acronyms Dec 17, 2024
@joeflack4 joeflack4 self-assigned this Dec 17, 2024
@joeflack4
Copy link
Contributor Author

I need to come back to this. Might just want to get rid of this code here and introduce a warning elsewhere if Mondo acronyms are lowercase.

@joeflack4 joeflack4 changed the title Bug fix: Synonym Sync: Accidental recapitalization of acronyms Bug fix: Synonym Sync: Remove recapitalization of acronyms Dec 17, 2024
@joeflack4 joeflack4 added the bug Something isn't working label Dec 17, 2024
@twhetzel
Copy link
Contributor

Can you create a ticket and explain what is happening and what the current consequences are and which of these consequences this PR fixes?

Your comment:

I need to come back to this. Might just want to get rid of this code here and introduce a warning elsewhere if Mondo acronyms are lowercase.

makes it sound like there is more to think through here.

- Bug fix: I don't think we should have ever had this in the first place. I don't see that it actually fixes anything; only should cause bugs.
@joeflack4 joeflack4 mentioned this pull request Dec 17, 2024
9 tasks
# - Acronyms: Use source case
# This operation prevents capitalization from being lost, as sometimes Mondo has used lowercase. However,
# ignore cases where 'synonym_case_source' is multi-value.
df['synonym'] = df.apply(lambda row: row['synonym_case_source']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General explanation & review

I don't think this code makes sense, and should be removed. I only see it as being able to cause bugs.

This operation prevents capitalization from being lost, as sometimes Mondo has used lowercase.

In other words, changes the capitalization of synonym away from what is in Mondo, using instead what is in the source.

Reasons why I think this code was problematic, by template:
-confirmed: If the synonym spelling is different between mondo-edit and what is shown in one of these rows, then the operation where we update mondo-edit to update source evidence for that synonym should fail.

-updated: If the synonym spelling is different between mondo-edit and what is shown in one of these rows, then the operation where we update mondo-edit to update scope and source evidence for that synonym should fail.

-added: Doesn't make sense to have this code run for -added. There would have been no Mondo variation spelling for the given synonym anyway.

Copy link
Contributor Author

@joeflack4 joeflack4 Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twhetzel RE: your comment 12 min ago. I had to step out, but I just got back, and here is the explanation. This should be sufficient but if you still want a ticket, let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joeflack4 where are examples where the original code has caused an issue or where this code is "fixing" things?

Copy link
Contributor Author

@joeflack4 joeflack4 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The immediate bug in #720 that I noticed as a result of this looked like this. Here were all the cases:

(If the table below is hard to read, can also view: example.csv)

mondo_id synonym_scope_source synonym synonym_case_mondo synonym_case_diff_mondo synonym_case_mondo_is_many synonym_case_source synonym_case_diff_source synonym_case_source_is_many source_id synonym_type case
MONDO:0003210 oio:hasExactSynonym ICC|iCC ICC FALSE ICC|iCC ICC|iCC TRUE NCIT:C35417 http://purl.obolibrary.org/obo/mondo#ABBREVIATION confirmed
MONDO:0013730 oio:hasExactSynonym GVHD|GvHD GVHD FALSE GVHD|GvHD GVHD|GvHD TRUE NCIT:C3063 http://purl.obolibrary.org/obo/mondo#ABBREVIATION confirmed
MONDO:0009348 oio:hasExactSynonym CHL|cHL Chl Chl FALSE CHL|cHL CHL|cHL TRUE NCIT:C7164 http://purl.obolibrary.org/obo/mondo#ABBREVIATION updated

You can see that synonym in these rows was set to the same value as synonym_case_diff_source. Previous to #720, this would have resulted in the following changes for each of the rows shown above:

Before:
ICC -> iCC
GVHD -> GvHD
Chl -> CHL
Chl -> cHL

"Chl" would've had 2 rows because before #720, there were these duplicate rows.

But now, since I'm combining rows when the source has multiple capitalizations for a single synonym, it was setting synonym to these values:

After:
ICC -> ICC|iCC
GVHD -> GVHD|GvHD
Chl -> CHL|cHL

Obviously, "after" is a bug. None of those |-delimited strings should themselves be set to the value of synonym.

But even "before" was bad. We shouldn't be changing the value of the synonym field at all, I don't think. And even if we had the good intention of addressing situations where Mondo has a lowercase abbreviation, none of these cases in "before" are instances of that either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for how this code was behaving before #720, here are some examples from -confirmed:

mondo_id synonym_scope_source synonym synonym_case_diff_mondo synonym_case_diff_source source_id
MONDO:0000923 oio:hasExactSynonym PIE pie PIE NCIT:C34571
MONDO:0002429 oio:hasExactSynonym IIP IIp IIP NCIT:C35714
MONDO:0004991 oio:hasExactSynonym BAC bac BAC NCIT:C2923

The synonym_case_diff_mondo row represents the capitalization of the synonym as it actually exists in Mondo. You can see that for these rows, the code detected that these were acronyms, and therefore it did it's thing where it used the source capitalization instead. But that can't be good. With the capitalization changed, we won't be able to properly run the -confirmed template on these synonyms, because the capitalization won't match.

Copy link
Contributor Author

@joeflack4 joeflack4 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE: GVHD
This one looks correct, at least per how we have designed the synonym sync thus far. The "primary key" that is joined between Mondo and the source is a full lowercasing of the synonym. So in this case, the PK is gvhd. That is how these Orphanet and DO terms are getting linked up here.

Edit: Note: A consequence of joining on the lowercasing of synonyms is that when these matches occur, entries will appear in -confirmed or -updated, whereas they otherwise would have shown up in -added as a new synonym.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE: PIE
Hmm, it seems OLS is behind. In mondo-edit, only "pie" exists but "PIE" does not exist for MONDO:0000923.

If both did exist, then the bug I am describing above, would have manifested as 2 rows where synonym would have been "PIE" instead of 1 row with "PIE" and 1 row with "pie".

Copy link
Contributor

@twhetzel twhetzel Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE: GVHD

For synonyms that are not abbreviations I think lowercasing the "primary key" is ok, but I am not yet convinced that is the correct approach for synonyms that are abbreviations since this is adding provenance for an entity/triple (synonym) that does not exist. I would like to get feedback from @matentzn and/or @sabrinatoro on this to confirm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE: PIE

Here is a screenshot from the latest mondo-edit:
Screenshot 2024-12-18 at 10 59 05 AM

Copy link
Contributor Author

@joeflack4 joeflack4 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RE: PIE
Ahh. This is missing from my mondo-edit because it was recently added. But the reason it's missing from v9 of the google sheet is because v9 comes from the "mini build", which is ran just to ensure the code is running, but doesn't refresh the sources. I ran a couple builds on this since and would have uploaded the output, but I got the empty synonym_sync_combined_cases.robot.tsv issue (#691), so I didn't bother. I just finagled it into creating this file though, and I do now see it now.

Screenshot 2024-12-18 at 3 11 29 PM

I'll upload this new output as v9.2.

Incidentally I was just to / will begin working on #691 momentarily!


RE: GVHD

Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments inline.

@joeflack4 joeflack4 changed the base branch from bugfix-syn-sync-dupe-rows-684 to develop December 19, 2024 22:48
@joeflack4 joeflack4 changed the base branch from develop to bugfix-syn-sync-dupe-rows-684 December 19, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants