HGNC robot template #113
Successful release
https://github.com/monarch-initiative/omim/releases/tag/2024-06-06
.gitignore
Outdated
@@ -34,4 +34,4 @@ omim.json
 mondo_exactmatch_omim.sssom.tsv
 mondo_exactmatch_omimps.sssom.tsv
 omim.owl
-mondo_genes.csv
+mondo_genes.robot.tsv
Rename: mondo_genes.csv --> mondo-omim-genes.robot.tsv throughout the code base.
makefile
Outdated
@@ -35,8 +35,18 @@ omim.owl: omim.ttl mondo_exactmatch_omim.sssom.owl mondo_exactmatch_omimps.sssom
 query --update sparql/hgnc_links.ru \
 convert -f ofn -o $@

-mondo_genes.csv: omim.owl
+mondo_genes.robot.tsv: omim.owl
Update: Output in TSV now instead of CSV
- ROBOT automatically does this based on the file extension
Now renamed to mondo-omim-genes.robot.tsv
makefile
Outdated
# Insert the source_code column as the second to last column
awk 'BEGIN {FS=OFS="\t"} {if (NR==1) {$$(NF+1)=$$(NF); $$(NF-1)="?source_code";} else {$$(NF+1)=$$(NF); $$(NF-1)="MONDO:OMIM";}} 1' $@ > temp_file && mv temp_file $@
# Remove the first character of each field in the header
awk 'BEGIN {FS=OFS="\t"} NR==1 {for (i=1; i<=NF; i++) $$i=substr($$i, 2)} {print}' $@ > temp_file && mv temp_file $@
Update: Remove the first character, a question mark (?), from each field in the header. This is an artefact of the SPARQL query.
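For comparison with the awk step above, here is a minimal pandas sketch of the same header cleanup. The column names and sample row are hypothetical stand-ins for the SPARQL output, not the real query result. Unlike `substr($$i, 2)`, this version only strips a character when it actually is a `?`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the SPARQL query output;
# the real column names in this repository may differ.
raw = "?mondo_id\t?hgnc_id\t?omim_id\nMONDO:0000001\tHGNC:1\tOMIM:100000\n"
df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)


def strip_sparql_prefix(columns):
    """Drop a single leading '?' from each column name; leave other names untouched."""
    return [c[1:] if c.startswith("?") else c for c in columns]


df.columns = strip_sparql_prefix(df.columns)
```

Checking for the `?` first means a column added later without the prefix would not lose its first letter.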
makefile
Outdated
awk 'BEGIN {FS=OFS="\t"} NR>1 {gsub(/^<|>$$/, "", $$1); gsub(/^<|>$$/, "", $$2); gsub(/^<|>$$/, "", $$5)} {print}' $@ > temp_file && mv temp_file $@
# Insert ROBOT subheader
robot_subheader="ID\tSC 'has material basis in germline mutation in' some %\t>A oboInOwl:source\t>A oboInOwl:source\t" && \
sed 1a"$$robot_subheader" $@ > temp_file && mv temp_file $@
Add: ROBOT subheader
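A rough pandas equivalent of the `sed 1a` step might look like the sketch below. The column names and the mapping of template strings to columns are assumptions for illustration; only the template strings themselves are taken from the makefile line above.

```python
import pandas as pd

# Hypothetical mapping of column name -> ROBOT template string.
# The strings come from the makefile; which column each belongs to is assumed.
ROBOT_SUBHEADER = {
    "mondo_id": "ID",
    "hgnc_id": "SC 'has material basis in germline mutation in' some %",
    "omim_id": ">A oboInOwl:source",
}


def add_robot_subheader(df, subheader):
    """Return a copy of df with the ROBOT template strings as the first data row."""
    row = pd.DataFrame([{col: subheader.get(col, "") for col in df.columns}])
    return pd.concat([row, df], ignore_index=True)


df = pd.DataFrame(
    {"mondo_id": ["MONDO:0000001"], "hgnc_id": ["HGNC:1"], "omim_id": ["OMIM:100000"]}
)
template = add_robot_subheader(df, ROBOT_SUBHEADER)
# template.to_csv("out.robot.tsv", sep="\t", index=False) would then write the file.
```

Keying the subheader by column name rather than by position keeps it aligned if columns are later reordered or dropped.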
hgnc_id: SC 'has material basis in germline mutation in' some %
What Nico wrote in the issue was a placeholder for the actual thing. I went and looked through some examples we had of this pattern, SC '<PROPERTY>' some %, and also found the correct string representation, 'has material basis in germline mutation in'. I'm basing it off of several different locations in mondo where I saw this: '%s and ''has material basis in germline mutation in'' some %s'
- Rename: mondo_genes.csv --> mondo_genes.robot.tsv
- Update: Change from CSV to TSV
- Update: Set a ROBOT sub-header
- Update: remove < > around URIs
- Update: remove ?'s at start of col names
- Update: insert source_code col, w/ values: MONDO:OMIM

General:
- Add: run.sh: For running ODK. And updated README.md w/ docs about that.
- Update: README.md: Put some less important stuff in <details>
I am wary of the extreme use of awk, but as long as it is dockerized... I would advise caution on this and focus on building mondolib
I am also wary of the extreme use of awk and would prefer to find another option.
Rewrite implementation: awk --> pandas
- @joeflack4 refactor
Nico:
I am wary of the extreme use of awk, but as long as it is dockerized... I would advise caution on this and focus on building mondolib
I am also wary of the extreme use of awk and would prefer to find another option.
Haha, this is funny, because I feel the same way. I thought for some reason you guys would probably prefer a ShellScript solution to pandas, but that was also when I thought I only needed to do 2 manipulations, but it turned out to be 4.
After I wrote that, I sent this to my friend who heavily uses awk and sed, who I've been trying to get to use pandas. Not sure if you guys are familiar with this meme, lol:
It should be an easy rewrite into pandas, so I'll do that.
Thanks, I appreciate your efforts here, but also think something more readable and more easily portable to a common solution in mondolib eventually will be helpful longer term :)
Done! Please take a look at the new Python file and refactored make goal.
I also added column sorting. Forgot to do that before, and it's not entirely unimportant.
I re-ran the goal and the output is the same as what I've attached to the release, the only difference being the sorting. I'll update that file shortly.
RE: mondolib refactor: I'm sure there's some kind of ROBOT-template-fu that we could move over there, but I'm not sure yet what that would be. I write a lot of code that looks similar, but the ROBOT templates and the modifications I do to create them vary quite a bit.
5c18c12 to 264181a
- Update: Refactor method to do this from ShellScript / awk to Python / pandas.
- Update: Now sorts columns

General:
- Update: .gitignore: Simplified ignores for files at root.
- Add: Utility function to handle < > around URIs
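As an illustration of the kind of utility the last item describes, here is a minimal sketch. The function name, column names, and sample URI are hypothetical and are not taken from the actual Python file in this PR. It mirrors the earlier awk pattern `/^<|>$$/` by removing a leading `<` and a trailing `>` independently.

```python
import pandas as pd


def strip_angle_brackets(value):
    """Remove one leading '<' and one trailing '>' from a string; pass non-strings through."""
    if not isinstance(value, str):
        return value
    if value.startswith("<"):
        value = value[1:]
    if value.endswith(">"):
        value = value[:-1]
    return value


# Hypothetical example frame; the real template has different columns.
df = pd.DataFrame(
    {"mondo_id": ["<http://purl.obolibrary.org/obo/MONDO_0000001>"], "label": ["example"]}
)
df["mondo_id"] = df["mondo_id"].map(strip_angle_brackets)
```

Passing non-string values through unchanged keeps the helper safe to map over columns that contain missing values.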
Fine other than the open question about the source column, which will be sorted Monday.
- Delete: source_code column (w/ values: MONDO:OMIM)
- Bug fix: No longer adding exact match gene annotations if >1 gene associated with MIM.
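One way the ">1 gene per MIM" filter could be expressed in pandas is sketched below. This is an assumption about the approach, not the code in this PR; the fix may well live in the SPARQL query or elsewhere, and the column names and IDs are made up.

```python
import pandas as pd

# Hypothetical columns and IDs, for illustration only.
df = pd.DataFrame(
    {
        "omim_id": ["OMIM:100000", "OMIM:200000", "OMIM:200000"],
        "hgnc_id": ["HGNC:1", "HGNC:2", "HGNC:3"],
    }
)

# Count distinct genes per MIM and broadcast the count back onto each row.
genes_per_mim = df.groupby("omim_id")["hgnc_id"].transform("nunique")

# Keep only MIMs that are associated with exactly one gene.
single_gene = df[genes_per_mim == 1].reset_index(drop=True)
```

Using `nunique` rather than a plain row count means duplicated rows for the same MIM and gene pair would not cause that MIM to be dropped.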
Addresses sub-tasks in:
Related:
Overview
Update mondo_genes.csv to be a proper ROBOT template: mondo-omim-genes.robot.tsv
Changes
HGNC ROBOT template
General:
CC: @souzadevinicius Thought this would be a good one for you to review