Add QC check for multiple gene associations #8328
base: master
Just wanna confirm my understanding. I'm missing some background knowledge. Since there is a filter for hgnc/ncbigene, I'm just wondering how the OMIM disease-gene associations make their way into Mondo. Because the

You yourself have implemented this mapping: https://github.com/monarch-initiative/omim/releases/download/2024-11-10/mondo-omim-genes.robot.tsv!

Ah OK, I remember this. In my false memory these were something else, not "has material basis in germline mutation in". It's a bit wonky because for the MIM-MIM "has material basis in germline mutation in", it's done in pure Python, but the HGNC/NCBI-Gene ones are done via SPARQL.

Gene associations are added into Mondo from both the OMIM gene pipeline (Nico referred to) and from curators directly. When gene associations are added by curators, they can be added for both human diseases and non-human diseases in Mondo. The non-human diseases will have identifiers from NCBI Gene. Any of these curator added gene associations that are not found in OMIM, but are requested by other collaborators will have a different
twhetzel marked this conversation as resolved.

@@ -0,0 +1,35 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

# Find classes with mismatched gene identifiers added as equivalentTo and subClassOf

SELECT DISTINCT ?entity ?label ?equivGeneIdentifier ?subClassGeneIdentifier
WHERE {
  ?entity rdf:type owl:Class ;
          rdfs:label ?label .

  # Equivalent class restriction
  ?entity owl:equivalentClass ?equivClass .
  ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?equivComponent .

  ?equivComponent rdf:type owl:Restriction ;
                  owl:onProperty obo:RO_0004003 ;
                  owl:someValuesFrom ?equivGeneIdentifier .

  # subClassOf restriction
  ?entity rdfs:subClassOf ?subClassRestriction .
  ?subClassRestriction rdf:type owl:Restriction ;
                       owl:onProperty obo:RO_0004003 ;
                       owl:someValuesFrom ?subClassGeneIdentifier .

  # Filter for gene identifiers with HGNC or NCBIGene prefixes
  FILTER(STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/ncbigene/"))

  FILTER(STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/ncbigene/"))

  # Filter for cases where the gene identifiers do not match
  FILTER(?equivGeneIdentifier != ?subClassGeneIdentifier)
}

@@ -0,0 +1,34 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

# Get classes that have more than 1 gene association (either subClassOf or equivalentClass) with RO:0004003 property

SELECT DISTINCT ?entity ?label (GROUP_CONCAT(DISTINCT ?geneIdentifier; separator=", ") AS ?geneIdentifiers)
WHERE {
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction ;
            rdfs:label ?label .

    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass ;
            rdfs:label ?label .

    ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?component .

    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }
}

My feeling is that in a UNION query, the bindings are renewed, so ?entity in the first union does not have to be bound to the same ?entity in the second union query. To mitigate this, you can move the ?label outside the UNION, like so:

Now the entity is bound outside the union, and inside they will bind to the same. Could be wrong though, just tickling the back of my head?

I had these as separate query files before, one qc query to check if there is more than one gene association stated using the equivalentTo restriction and one qc query to check if there is more than one gene association stated as a subClassOf relationship, so I'm not trying to check if the same entity is used in both. After having each query work separately, I thought you might ask me to combine them so I did... but they can also easily be separate qc queries.

I think one query is good, you just need to add an external clause like

GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
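
The snippet that accompanied the review suggestion about moving ?label outside the UNION did not survive on this page. The following is only a sketch of what that restructuring could look like, not the reviewer's original text. It assumes the intent is to state the shared rdfs:label pattern once, outside both branches, so that ?entity is bound before the UNION is evaluated. Wrapping the gene IRI in STR() inside GROUP_CONCAT is an extra assumption added here to make the IRI-to-string conversion explicit.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?entity ?label
       (GROUP_CONCAT(DISTINCT STR(?geneIdentifier); separator=", ") AS ?geneIdentifiers)
WHERE {
  # Shared pattern, stated once outside the UNION
  ?entity rdfs:label ?label .

  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }
}
GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
```

Because the grouping is by ?entity, genes found through either branch are counted together for the same class. If the two kinds of association should be checked independently, as described in the reply above, keeping two separate queries may be the simpler option.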
Very nice test. I like it a lot, hopefully it is not too slow, but looks ok, given that it failed after the usual 10 minutes.
The only minor feedback I have is that, to make things easier in the future, I now tend to use only
SELECT DISTINCT ?entity ?property ?value
This way, we can easily integrate the test into a dashboard and ROBOT report if we so wisheth. However, I don't know if that will ever happen, and granted, listing all the properties the way you have it is more readable. So I leave this choice in your hands.
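
For illustration, the multiple-gene check could be reshaped into the ?entity ?property ?value form mentioned above roughly as follows. This is a sketch under assumptions, not part of the PR: binding ?property through VALUES and packing the concatenated gene identifiers into ?value are choices made here for the example, and the label column is dropped to fit the three-column shape.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT ?entity ?property
       (GROUP_CONCAT(DISTINCT STR(?geneIdentifier); separator=", ") AS ?value)
WHERE {
  # The property under test, exposed as a result column
  VALUES ?property { obo:RO_0004003 }

  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty ?property ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty ?property ;
               owl:someValuesFrom ?geneIdentifier .
  }
}
GROUP BY ?entity ?property
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
```

The trade-off is the one already noted in the comment: the three-column shape is easier to plug into generic reporting, while the original column names are easier to read in a standalone QC report.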