Add QC check for multiple gene associations #8328

Open
wants to merge 5 commits into
base: master
35 changes: 35 additions & 0 deletions src/sparql/qc/mondo/qc-gene-identifier-mismatch.sparql
Member

Very nice test, I like it a lot. Hopefully it is not too slow, but it looks OK, given that it failed after the usual 10 minutes.

The only minor feedback I have: to make things easier in the future, I now tend to use only SELECT DISTINCT ?entity ?property ?value. That way we can easily integrate the test into a dashboard and the ROBOT report if we so wish. However, I don't know if that will ever happen, and granted, listing everything the way you have it is more readable. So I leave this choice in your hands.
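
For illustration, the mismatch query in this file could be reshaped into that three-column form roughly as follows. This is only a sketch: binding ?property to RO:0004003 and packing both gene identifiers into a single ?value string are assumptions about what a report would want, not something decided in this PR, and the rdfs:label requirement is dropped because the label is no longer selected.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

# Sketch only: same mismatch check, reported as ?entity ?property ?value
SELECT DISTINCT ?entity ?property ?value
WHERE {
  ?entity rdf:type owl:Class ;
          owl:equivalentClass ?equivClass ;
          rdfs:subClassOf ?subClassRestriction .

  # Gene asserted inside the equivalentClass intersection
  ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?equivComponent .
  ?equivComponent rdf:type owl:Restriction ;
                  owl:onProperty obo:RO_0004003 ;
                  owl:someValuesFrom ?equivGeneIdentifier .

  # Gene asserted as a subClassOf restriction
  ?subClassRestriction rdf:type owl:Restriction ;
                       owl:onProperty obo:RO_0004003 ;
                       owl:someValuesFrom ?subClassGeneIdentifier .

  FILTER(STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/ncbigene/"))
  FILTER(STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/ncbigene/"))
  FILTER(?equivGeneIdentifier != ?subClassGeneIdentifier)

  # Assumed reporting shape: the property is fixed, the value names both genes
  BIND(obo:RO_0004003 AS ?property)
  BIND(CONCAT("equivalentClass: ", STR(?equivGeneIdentifier),
              "; subClassOf: ", STR(?subClassGeneIdentifier)) AS ?value)
}
```

The trade-off is the one noted above: the three-column result is easier to feed into generic tooling, while the four named columns are easier to read by hand.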

Collaborator

Just wanna confirm my understanding. I'm missing some background knowledge.

Since there is a filter for hgnc/ncbigene, I'm wondering how the OMIM disease-gene associations make their way into Mondo, because omim.owl creates them as MIM-MIM associations. I guess the gene MIM is then mapped to HGNC or NCBIGene somewhere.

Member

Collaborator

@joeflack4 joeflack4 Nov 14, 2024

Ah OK, I remember this. In my faulty memory these were something else, not "has material basis in germline mutation in".

It's a bit wonky because the MIM-MIM "has material basis in germline mutation in" associations are created in pure Python, but the HGNC/NCBI Gene ones are done via SPARQL.

Collaborator Author

Gene associations are added into Mondo both by the OMIM gene pipeline (which Nico referred to) and directly by curators. Curators can add gene associations for both human and non-human diseases in Mondo; the non-human diseases will have identifiers from NCBI Gene. Any curator-added gene associations that are not found in OMIM but are requested by other collaborators will have a different source for the gene association, so the OMIM gene pipeline will not change them when it runs.

@@ -0,0 +1,35 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

# Find classes with mismatched gene identifiers added as equivalentTo and subClassOf

SELECT DISTINCT ?entity ?label ?equivGeneIdentifier ?subClassGeneIdentifier
WHERE {
  ?entity rdf:type owl:Class ;
          rdfs:label ?label .

  # Equivalent class restriction
  ?entity owl:equivalentClass ?equivClass .
  ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?equivComponent .

  ?equivComponent rdf:type owl:Restriction ;
                  owl:onProperty obo:RO_0004003 ;
                  owl:someValuesFrom ?equivGeneIdentifier .

  # subClassOf restriction
  ?entity rdfs:subClassOf ?subClassRestriction .
  ?subClassRestriction rdf:type owl:Restriction ;
                       owl:onProperty obo:RO_0004003 ;
                       owl:someValuesFrom ?subClassGeneIdentifier .

  # Filter for gene identifiers with HGNC or NCBIGene prefixes
  FILTER(STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?equivGeneIdentifier), "http://identifiers.org/ncbigene/"))
  FILTER(STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/hgnc/") ||
         STRSTARTS(STR(?subClassGeneIdentifier), "http://identifiers.org/ncbigene/"))

  # Filter for cases where the gene identifiers do not match
  FILTER(?equivGeneIdentifier != ?subClassGeneIdentifier)
}
34 changes: 34 additions & 0 deletions src/sparql/qc/mondo/qc-multiple-gene-associations.sparql
@@ -0,0 +1,34 @@
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

# Get classes that have more than 1 gene association (either subClassOf or equivalentClass) with RO:0004003 property

SELECT DISTINCT ?entity ?label (GROUP_CONCAT(DISTINCT ?geneIdentifier; separator=", ") AS ?geneIdentifiers)
WHERE {
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction ;
            rdfs:label ?label .

    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass ;
            rdfs:label ?label .

    ?equivClass owl:intersectionOf/rdf:rest*/rdf:first ?component .

    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }
}
Member

My feeling is that in a UNION query each branch is evaluated with fresh bindings, so ?entity in the first branch does not have to be bound to the same value as ?entity in the second branch. To mitigate this, you can move the ?label pattern outside the UNION, like so:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

SELECT DISTINCT ?entity ?label (GROUP_CONCAT(DISTINCT ?geneIdentifier; separator=", ") AS ?geneIdentifiers)
WHERE {
  {
    # subClassOf association
    ?entity rdfs:subClassOf ?restriction .
    ?restriction rdf:type owl:Restriction ;
                 owl:onProperty obo:RO_0004003 ;
                 owl:someValuesFrom ?geneIdentifier .
  }
  UNION
  {
    # equivalentClass association
    ?entity owl:equivalentClass ?equivClass .
    ?equivClass (owl:intersectionOf/rdf:rest*/rdf:first) ?component .
    ?component rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
  }

  # Bind the label once, outside the UNION (required pattern, not OPTIONAL)
  ?entity rdfs:label ?label .
}
GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)

Now ?entity is bound outside the UNION, so both branches bind to the same entity.

I could be wrong though; this is just something tickling the back of my head.

Collaborator Author

I had these as separate query files before: one QC query to check whether there is more than one gene association stated using the equivalentTo restriction, and one to check whether there is more than one gene association stated as a subClassOf relationship. So I'm not trying to check whether the same entity is used in both. After getting each query to work separately, I thought you might ask me to combine them, so I did... but they can also easily be separate QC queries.
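
For reference, the subClassOf-only variant described here might look something like the sketch below. It is reconstructed from the combined query in this PR, not copied from the earlier separate files, and the STR() call inside GROUP_CONCAT is an added assumption so that IRIs are concatenated as strings.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

# Sketch only: classes with more than one RO:0004003 gene association
# asserted through subClassOf restrictions
SELECT ?entity ?label (GROUP_CONCAT(DISTINCT STR(?geneIdentifier); separator=", ") AS ?geneIdentifiers)
WHERE {
  ?entity rdfs:subClassOf ?restriction ;
          rdfs:label ?label .

  ?restriction rdf:type owl:Restriction ;
               owl:onProperty obo:RO_0004003 ;
               owl:someValuesFrom ?geneIdentifier .
}
GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)
```

Keeping the two checks separate reports subClassOf and equivalentClass duplicates independently, whereas the combined UNION query also counts a class that has one gene in each form.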

Member

I think one query is good; you just need to add a clause like ?entity rdfs:label ?label . outside the UNION, so that it is bound first and ?entity is always the same!

GROUP BY ?entity ?label
HAVING (COUNT(DISTINCT ?geneIdentifier) > 1)