Bug fix: Build error due to breaking data model change #27

joeflack4 · 2024-05-19T19:53:56Z

Changes

Bug fix: Field rename
- MedGen source made breaking data model change; a field was renamed. Updated code to match.

General
- Bug fixes: for 'make all', which we're not currently using. But the goal still exists and, if run, was bugging out because the dependency pipeline was broken. Certain necessary goals were commented out, and some references were incorrect.
- Update: Some files in the release/ dir were not actually release files, so they were moved to output/. Code in various files updated for this.
- Update: Added and removed some comments and todos.
- Update: Some goal prereqs to make it dependencies clearer
- Rename: goal 'sssom' to 'sssom-validate'
- Update: goal 'sssom' is now a different goal and there for convenience

joeflack4 · 2024-05-19T20:40:21Z

src/utils.py

-    df = pd.read_csv(inpath, sep='|').rename(columns={'#CUI': 'xref_id'})
+    df = pd.read_csv(inpath, sep='|').rename(columns={
+        '#CUI_or_CN_id': 'xref_id',
+        '#CUI': 'xref_id'  # 2024/05/19: MedGen Renamed to "#CUI_or_CN_id". Leaving this in case they change back.


The bug fix

MedGen renamed a field and the last build failed, so I needed to update the code to match.

@twhetzel After fixing, I ran a new release off this branch and it passed, so we should be good to merge this if you approve.

I agree with the change on Line 46 as the MedGen format or data model as you mention has changed.

nit: I would not have left Line 47 in and would prefer to review the new MedGen data format/model if it changes again in the future. However, you do have the reason for this nicely commented and so far the remaining code would be ok (albeit possibly redundant) if a CUI column were added back and only included CUIs.

Actually, you bring up a good point. What if they do add a CUI column and it's not the same as it was before; like it only includes CUIs and not CNs, as it was doing before? Well, then me leaving this line in here would create problems. So I just removed it now!

would prefer to review the new MedGen data format/model if it changes again in the future.

If you would like I could upload a copy of the mapping file for you to take a look at, but I didn't keep the old version from before the column name changed.

src/utils.py

joeflack4 · 2024-05-19T20:48:03Z

makefile

+output/medgen.obographs.json: output/medgen-disease-extract.owl | output/
+	robot convert -i $< -o $@
+
+output/medgen.sssom.tsv: output/medgen.obographs.json | output/


All other, superfluous changes

@twhetzel The essential change in this PR is basically a one-liner. But I wanna explain why there's a lot in the diff.

As I was working on the bug fix I was getting other errors, so I started to fix those as well. But then I realized that these errors are actually non-issues because they were happening when I was running make all locally. However, for this repo, the build runs off of make minimal, which has no errors now that I've fixed that bug. So, in any case, these fixes and optimizations I've made will be good if we ever go back to having a use for make all. I suppose I should have split those into a separate PR, but I'm just realizing that now, sorry. If you'd like I can still do that.

Thanks for the explanation.

Do all the make goals still work as expected?

Yep! As mentioned in my meeting, make minimal, which is for the build, works. The latest release that's currently up was created by running make minimal from this branch.

As I was explaining in the meeting, I also fixed parts of make all before remembering that I didn't need to do that. make all has been broken for a while and will continue to be broken until it ever might be needed. Though I think I really ought to do #28 (quick change) to make this clearer / not cause undue worry.

twhetzel

Look ok. Please do review the comments.

FYI - I am using nit as a small issue noticed in the code review.

.github/workflows/buid_and_release.yml

twhetzel · 2024-05-26T00:34:47Z

src/create_sssom.py

@@ -54,8 +56,7 @@ def run(input_mappings: str = INPUT_MAPPINGS, input_sssom_config: str = INPUT_CO
    #  move the col removals below (umls) to above
    # - add mapping_justification
    df_hpo_mesh['mapping_justification'] = 'semapv:ManualMappingCuration'
-    write_sssom(df_hpo_mesh, input_sssom_config,
-                OUTPUT_FILE_HPO_MESH.replace('.sssom.tsv', '-non-matches-included.sssom.tsv'))
+    write_sssom(df_hpo_mesh, input_sssom_config, OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES)


Is the file hpo-mesh_non-matches-included.sssom.tsv still saved to the same location so that any downstream processes that need this file can still access it?

And was the change in the filename from ...mesh-non... (with a hyphen) to ...mesh_non... (with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.

nit: There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES so the readability is a bit difficult. Since write_sssom expects outpath: Union[Path, str] you may be able to to use Path in something like full_path = Path(base_dir) / sub_dir1 / sub_dir2 / filename. However, as long as the file is still saved in the same place or any downstream processes know the new location, it is ok to leave this as-is.

Is the file hpo-mesh_non-matches-included.sssom.tsv still saved to the same location so that any downstream processes that need this file can still access it?
It is saved at the same location, yes! I just decided to make a variable representing its path, as opposed to the hacky way I was creating the filename/path beforehand.

FYI though, it is not actually a release artefact, and it is also not used anywhere right now. If I remember correctly, Nico asked me to filter those non-matches out of the original release artefact hpo-mesh.sssom.tsv, and I think he gave me the option to save them if I needed. IIRC, I think I wanted to upload them to a google sheet as part of some evaluation process, I don't really remember. So if we want, we probably don't even need to save them, but it's not doing any harm either. The file isn't released and it isn't committed either.

And was the change in the filename from ...mesh-non... (with a hyphen) to ...mesh_non... (with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.
Yeah, it was intentional! I figured using a _ would help separate out the two parts of this filename, the hpo-mesh, which is the primary identifier for the matching set, and non-matches-included, which is more of a qualifier.

Also I checked and I did indeed rename the path consistently across the codebase (only 2 locations).

There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES so the readability is a bit difficult.

Maybe you mean how I have this dir output/ and I also have output/release/. Yeah, this is a new pattern I'm trying. I also have in some other repo I think a setup like io/input/ and io/output/.... I've been wanting to explore these minor tweaks and see which pattern I end up liking better.

something like full_path = Path(base_dir...

Yeah, I have also been inconsistent in how I'm doing these paths in various repos / places. I am trying to always do the type declaration Union[str, Path] though, I think that's good. But I should probably use Path instead of str whenever possible, too.

twhetzel · 2024-05-26T00:41:21Z

makefile

+output/medgen.obographs.json: output/medgen-disease-extract.owl | output/
+	robot convert -i $< -o $@
+
+output/medgen.sssom.tsv: output/medgen.obographs.json | output/


Do all the make goals still work as expected?

- MedGen source made breaking data model change; a field was renamed. Updated code to match. General - Bug fixes: for 'make all', which we're not currently using. But the goal still exists and, if run, was bugging out because the dependency pipeline was broken. Certain necessary goals were commented out, and some references were incorrect. - Update: Some files in the release/ dir were not actually release files, so they were moved to output/. Code in various files updated for this. - Update: Added and removed some comments and todos. - Update: Some goal prereqs to make it dependencies clearer - Rename: goal 'sssom' to 'sssom-validate' - Update: goal 'sssom' is now a different goal and there for convenience

joeflack4 marked this pull request as draft May 19, 2024 19:54

joeflack4 self-assigned this May 19, 2024

joeflack4 added bug Something isn't working medgen labels May 19, 2024

joeflack4 force-pushed the build-errors branch from 392c4cb to b663b0d Compare May 19, 2024 20:39

joeflack4 changed the title ~~Bug fix: Build error(s)~~ Bug fix: Build error due to breaking data model change May 19, 2024

joeflack4 marked this pull request as ready for review May 19, 2024 20:39

joeflack4 requested a review from twhetzel May 19, 2024 20:39

joeflack4 commented May 19, 2024

View reviewed changes

joeflack4 force-pushed the build-errors branch 2 times, most recently from a508bc9 to a5fcbdb Compare May 19, 2024 21:19

twhetzel approved these changes May 26, 2024

View reviewed changes

joeflack4 force-pushed the build-errors branch from a5fcbdb to ac66788 Compare May 26, 2024 19:57

joeflack4 merged commit 3d1779c into main May 29, 2024

joeflack4 deleted the build-errors branch May 29, 2024 21:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fix: Build error due to breaking data model change #27

Bug fix: Build error due to breaking data model change #27

joeflack4 commented May 19, 2024 •

edited

Loading

joeflack4 May 19, 2024 •

edited

Loading

twhetzel May 25, 2024 •

edited

Loading

joeflack4 May 26, 2024

joeflack4 May 19, 2024

twhetzel May 25, 2024

twhetzel May 26, 2024

joeflack4 May 26, 2024

twhetzel left a comment

twhetzel May 26, 2024

joeflack4 May 26, 2024 •

edited

Loading

twhetzel May 26, 2024

Bug fix: Build error due to breaking data model change #27

Bug fix: Build error due to breaking data model change #27

Conversation

joeflack4 commented May 19, 2024 • edited Loading

Changes

joeflack4 May 19, 2024 • edited Loading

Choose a reason for hiding this comment

The bug fix

twhetzel May 25, 2024 • edited Loading

Choose a reason for hiding this comment

joeflack4 May 26, 2024

Choose a reason for hiding this comment

joeflack4 May 19, 2024

Choose a reason for hiding this comment

All other, superfluous changes

twhetzel May 25, 2024

Choose a reason for hiding this comment

twhetzel May 26, 2024

Choose a reason for hiding this comment

joeflack4 May 26, 2024

Choose a reason for hiding this comment

twhetzel left a comment

Choose a reason for hiding this comment

twhetzel May 26, 2024

Choose a reason for hiding this comment

joeflack4 May 26, 2024 • edited Loading

Choose a reason for hiding this comment

twhetzel May 26, 2024

Choose a reason for hiding this comment

joeflack4 commented May 19, 2024 •

edited

Loading

joeflack4 May 19, 2024 •

edited

Loading

twhetzel May 25, 2024 •

edited

Loading

joeflack4 May 26, 2024 •

edited

Loading