Add QC check for multiple gene associations #8328
Very nice test, I like it a lot. Hopefully it is not too slow; it looks OK, given that it failed after the usual 10 minutes.
My only minor feedback: to make things easier in the future, I now tend to use only SELECT DISTINCT ?entity ?property ?value. That way we can easily integrate the test into a dashboard and the ROBOT report if we so wish. However, I don't know if that will ever happen, and granted, listing all the properties the way you have it is more readable. So I leave this choice in your hands.
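For reference, a rough sketch of what the three-column shape could look like for this check. This is only an illustration, not the query in the PR: exposing the relation through a `VALUES` clause as `?property` and concatenating the genes into `?value` are assumptions about how the columns would be filled, and `STR()` is added so that `GROUP_CONCAT` receives strings rather than IRIs.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?entity ?property (GROUP_CONCAT(DISTINCT STR(?geneIdentifier); separator=", ") AS ?value)
WHERE {
  # Assumption: the relation under test is surfaced as the ?property column
  VALUES ?property { obo:RO_0004003 }
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty ?property ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass (owl:intersectionOf/rdf:rest*/rdf:first) ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty ?property ;
               owl:someValuesFrom ?geneIdentifier .
  }
}
GROUP BY ?entity ?property
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
```

The trade-off is the one already mentioned: the output is uniform across checks, but the offending genes end up packed into a single `?value` cell.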
owl:onProperty obo:RO_0004003 ;
owl:someValuesFrom ?geneIdentifier .
}
}
My feeling is that in a UNION query the bindings are renewed for each branch, so ?entity in the first branch does not have to be bound to the same value as ?entity in the second branch. To mitigate this, you can move the ?label pattern outside the UNION, like so:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

SELECT DISTINCT ?entity ?label (GROUP_CONCAT(DISTINCT ?geneIdentifier; separator=", ") AS ?geneIdentifiers)
WHERE {
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass (owl:intersectionOf/rdf:rest*/rdf:first) ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }
  # Label pattern shared by both branches, joined on ?entity
  # (as written it is required, so entities without an rdfs:label are not reported)
  ?entity rdfs:label ?label .
}
GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
Now ?entity is also bound outside the UNION, so whatever each branch binds is joined against that same value.
I could be wrong though, it's just something tickling the back of my head.
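To make the binding question concrete, here is a tiny self-contained query that needs no data, since everything comes from inline `VALUES`. The `urn:ex:` IRIs, the `?source` column and the labels are made up purely for illustration.

```sparql
SELECT ?entity ?source ?label
WHERE {
  {
    # Stand-in for the subClassOf branch
    VALUES (?entity ?source) { (<urn:ex:A> "subClassOf") }
  }
  UNION
  {
    # Stand-in for the equivalentClass branch
    VALUES (?entity ?source) { (<urn:ex:B> "equivalentClass") }
  }
  # Stand-in for the label pattern outside the UNION, joined on ?entity
  VALUES (?entity ?label) { (<urn:ex:A> "disease A") (<urn:ex:B> "disease B") }
}
```

Each branch contributes its own solutions, and the pattern outside the UNION is joined to each solution on `?entity`, so this returns two rows: `urn:ex:A` with "disease A" and `urn:ex:B` with "disease B". The two branches are never forced to agree with each other on `?entity`; grouping by `?entity` afterwards is what brings the genes from both branches together.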
I had these as separate query files before: one QC query to check whether more than one gene association is stated using an equivalentTo restriction, and one QC query to check whether more than one gene association is stated as a subClassOf relationship. So I'm not trying to check whether the same entity is used in both. Once each query worked separately, I thought you might ask me to combine them, so I did. They can just as easily remain separate QC queries.
I think one query is good. You just need to add an external clause like ?entity rdfs:label ?label . outside the UNION, so that it is bound first and ?entity is always the same!
Just want to confirm my understanding, as I'm missing some background knowledge.
Since there is a filter for hgnc/ncbigene, I'm wondering how the OMIM disease-gene associations make their way into Mondo, because omim.owl creates them as MIM-MIM associations. I guess the gene MIM is then mapped to hgnc or ncbigene somewhere.
You yourself have implemented this mapping: https://github.com/monarch-initiative/omim/releases/download/2024-11-10/mondo-omim-genes.robot.tsv!
Ah OK, I remember this. In my false memory these were something else, not 'has material basis in germline mutation in'.
It's a bit wonky: the MIM-MIM 'has material basis in germline mutation in' associations are generated in pure Python, but the HGNC/NCBI Gene ones are generated via SPARQL.
Gene associations are added to Mondo both from the OMIM gene pipeline (the one Nico referred to) and directly by curators. Curators can add gene associations for both human and non-human diseases in Mondo; the non-human diseases will have identifiers from NCBI Gene. Any curator-added gene associations that are not found in OMIM but are requested by other collaborators will have a different source for the gene association, so the OMIM gene pipeline will not change them when it is run.
closes #8316
This PR adds a QC check that raises an error if any Mondo class has more than one gene association, added either as an equivalentTo restriction or as a subClassOf axiom. It is OK if the class has one of each, as long as it's the same gene (I can add that QC check).
This PR is related to point 7 in the OMIM pipeline PR.
NOTE: Based on #8276, the classes MONDO:0011974 'retinitis pigmentosa 7' and MONDO:0100050 'Usher syndrome, type 1D/F' should be excluded from this QC check.
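One possible way to express that exclusion, sketched here as a `FILTER` on the combined query. This is an assumption about how the exclusion could be implemented, not necessarily what the query file in this PR does, and the hgnc/ncbigene filter is left out. The two IRIs are the standard OBO PURL forms of the CURIEs above.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?entity (GROUP_CONCAT(DISTINCT STR(?geneIdentifier); separator=", ") AS ?geneIdentifiers)
WHERE {
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass (owl:intersectionOf/rdf:rest*/rdf:first) ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }
  # Assumption: the known exceptions are hard-coded in the query itself
  FILTER (?entity NOT IN (obo:MONDO_0011974, obo:MONDO_0100050))
}
GROUP BY ?entity
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
```

Because the count is over distinct genes per `?entity` across both branches, a class with one subClassOf and one equivalentTo association to the same gene is not reported.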
The qc failures in this PR are handled in #8330 (edits on a newer version of mondo-edit.obo).