-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix: Build error due to breaking data model change #27
Conversation
src/utils.py
Outdated
df = pd.read_csv(inpath, sep='|').rename(columns={'#CUI': 'xref_id'}) | ||
df = pd.read_csv(inpath, sep='|').rename(columns={ | ||
'#CUI_or_CN_id': 'xref_id', | ||
'#CUI': 'xref_id' # 2024/05/19: MedGen Renamed to "#CUI_or_CN_id". Leaving this in case they change back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug fix
MedGen renamed a field and the last build failed, so I needed to update the code to match.
@twhetzel After fixing, I ran a new release off this branch and it passed, so we should be good to merge this if you approve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the change on Line 46 as the MedGen format or data model as you mention has changed.
nit: I would not have left Line 47 in and would prefer to review the new MedGen data format/model if it changes again in the future. However, you do have the reason for this nicely commented and so far the remaining code would be ok (albeit possibly redundant) if a CUI
column were added back and only included CUIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, you bring up a good point. What if they do add a CUI
column and it's not the same as it was before; like it only includes CUIs and not CNs, as it was doing before? Well, then me leaving this line in here would create problems. So I just removed it now!
would prefer to review the new MedGen data format/model if it changes again in the future.
If you would like I could upload a copy of the mapping file for you to take a look at, but I didn't keep the old version from before the column name changed.
output/medgen.obographs.json: output/medgen-disease-extract.owl | output/ | ||
robot convert -i $< -o $@ | ||
|
||
output/medgen.sssom.tsv: output/medgen.obographs.json | output/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All other, superfluous changes
@twhetzel The essential change in this PR is basically a one-liner. But I wanna explain why there's a lot in the diff.
As I was working on the bug fix I was getting other errors, so I started to fix those as well. But then I realized that these errors are actually non-issues because they were happening when I was running make all
locally. However, for this repo, the build runs off of make minimal
, which has no errors now that I've fixed that bug. So, in any case, these fixes and optimizations I've made will be good if we ever go back to having a use for make all
. I suppose I should have split those into a separate PR, but I'm just realizing that now, sorry. If you'd like I can still do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all the make
goals still work as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! As mentioned in my meeting, make minimal
, which is for the build, works. The latest release that's currently up was created by running make minimal
from this branch.
As I was explaining in the meeting, I also fixed parts of make all
before remembering that I didn't need to do that. make all
has been broken for a while and will continue to be broken until it ever might be needed. Though I think I really ought to do #28 (quick change) to make this clearer / not cause undue worry.
a508bc9
to
a5fcbdb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look ok. Please do review the comments.
FYI - I am using nit as a small issue noticed in the code review.
@@ -54,8 +56,7 @@ def run(input_mappings: str = INPUT_MAPPINGS, input_sssom_config: str = INPUT_CO | |||
# move the col removals below (umls) to above | |||
# - add mapping_justification | |||
df_hpo_mesh['mapping_justification'] = 'semapv:ManualMappingCuration' | |||
write_sssom(df_hpo_mesh, input_sssom_config, | |||
OUTPUT_FILE_HPO_MESH.replace('.sssom.tsv', '-non-matches-included.sssom.tsv')) | |||
write_sssom(df_hpo_mesh, input_sssom_config, OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the file hpo-mesh_non-matches-included.sssom.tsv
still saved to the same location so that any downstream processes that need this file can still access it?
And was the change in the filename from ...mesh-non...
(with a hyphen) to ...mesh_non...
(with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.
nit: There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES
so the readability is a bit difficult. Since write_sssom
expects outpath: Union[Path, str]
you may be able to to use Path
in something like full_path = Path(base_dir) / sub_dir1 / sub_dir2 / filename
. However, as long as the file is still saved in the same place or any downstream processes know the new location, it is ok to leave this as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the file hpo-mesh_non-matches-included.sssom.tsv still saved to the same location so that any downstream processes that need this file can still access it?
It is saved at the same location, yes! I just decided to make a variable representing its path, as opposed to the hacky way I was creating the filename/path beforehand.
FYI though, it is not actually a release artefact, and it is also not used anywhere right now. If I remember correctly, Nico asked me to filter those non-matches out of the original release artefact hpo-mesh.sssom.tsv
, and I think he gave me the option to save them if I needed. IIRC, I think I wanted to upload them to a google sheet as part of some evaluation process, I don't really remember. So if we want, we probably don't even need to save them, but it's not doing any harm either. The file isn't released and it isn't committed either.
And was the change in the filename from ...mesh-non... (with a hyphen) to ...mesh_non... (with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.
Yeah, it was intentional! I figured using a_
would help separate out the two parts of this filename, thehpo-mesh
, which is the primary identifier for the matching set, andnon-matches-included
, which is more of a qualifier.
Also I checked and I did indeed rename the path consistently across the codebase (only 2 locations).
There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES so the readability is a bit difficult.
Maybe you mean how I have this dir output/
and I also have output/release/
. Yeah, this is a new pattern I'm trying. I also have in some other repo I think a setup like io/input/
and io/output/...
. I've been wanting to explore these minor tweaks and see which pattern I end up liking better.
something like
full_path = Path(base_dir...
Yeah, I have also been inconsistent in how I'm doing these paths in various repos / places. I am trying to always do the type declaration Union[str, Path]
though, I think that's good. But I should probably use Path
instead of str
whenever possible, too.
output/medgen.obographs.json: output/medgen-disease-extract.owl | output/ | ||
robot convert -i $< -o $@ | ||
|
||
output/medgen.sssom.tsv: output/medgen.obographs.json | output/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all the make
goals still work as expected?
- MedGen source made breaking data model change; a field was renamed. Updated code to match. General - Bug fixes: for 'make all', which we're not currently using. But the goal still exists and, if run, was bugging out because the dependency pipeline was broken. Certain necessary goals were commented out, and some references were incorrect. - Update: Some files in the release/ dir were not actually release files, so they were moved to output/. Code in various files updated for this. - Update: Added and removed some comments and todos. - Update: Some goal prereqs to make it dependencies clearer - Rename: goal 'sssom' to 'sssom-validate' - Update: goal 'sssom' is now a different goal and there for convenience
Changes