[biomed nl] update embedding index with linked props examples (#4052)
chejennifer authored Mar 21, 2024
1 parent ec3160c commit 11178ed
Showing 6 changed files with 106 additions and 47 deletions.
2 changes: 1 addition & 1 deletion deploy/nl/embeddings.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
 medium_ft: embeddings_medium_2024_03_14_16_38_53.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
 sdg_ft: embeddings_sdg_2023_12_26_10_03_03.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
 undata_ft: embeddings_undata_2024_03_20_11_01_12.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
-bio_ft: embeddings_bio_2024_03_04_10_28_51.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
+bio_ft: embeddings_bio_2024_03_19_16_39_03.ft_final_v20230717230459.all-MiniLM-L6-v2.csv
Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@
 "PROP": [
 "phylum",
 "chemblID",
+"virusGenus",
 "geneID",
 "geneticVariantFunctionalCategory",
 "typeOfGene",
@@ -28,15 +29,13 @@
 "hg19GenomicLocation",
 "strandOrientation",
 "alleleOrigin",
-"referenceSNPClusterID<-GeneticVariantGeneAssociation->geneSymbol",
+"<-referenceSNPClusterID{typeOf:GeneticVariantGeneAssociation}->geneSymbol",
 "ofVirusSpecies",
 "antigenType",
 "chromosomeSize",
 "ncbiTaxonID",
 "simplifiedMolecularInputLineEntrySystem",
-"ncbiProteinAccessionNumber",
-"virusHost",
-"imageUrl"
+"ncbiProteinAccessionNumber"
 ]
 }
 }
12 changes: 12 additions & 0 deletions tools/nl/embeddings/data/curated_input/bio/README.md
@@ -0,0 +1,12 @@
+# Curated Input for Bio index
+
+This index has properties used by biomedical entities and follows the format of [relation expressions](https://docs.datacommons.org/api/rest/v2#relation-expressions). Properties can be structured like:
+
+- `prop`: this can match to either in or out properties
+- e.g., `virusHost` which will match both the 'in' and 'out' values for the property virusHost
+- `->prop`: this matches to an 'out' property
+- e.g., `->phylum` which will match the 'out' values for the property phylum
+- `<-prop`: this matches to an 'in' property
+- e.g., `<-virusGenus` which will match the 'in' values for the property virusGenus
+- `<-prop1{typeOf:X}->prop2`: in this case, we will get all the 'in' values for prop1 that are of type X & then from those values, get all the 'out' values for prop2
+- e.g., `<-geneID{typeOf:DiseaseGeneAssociation}->diseaseOntologyID` which will first get all the DiseaseGeneAssociations that are 'in' values for the property geneID and then get all the 'out' values for the property diseaseOntologyID for those DiseaseGeneAssociations.
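The four expression shapes listed in the README can be told apart mechanically. The following is a minimal sketch, not code from this repository: `Hop` and `parse_relation_expression` are hypothetical names, and the parser assumes only the forms shown above (an optional `<-` or `->` arrow, a property name, and an optional `{typeOf:X}` filter per hop).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One hop: optional arrow, property name, optional {typeOf:X} filter.
# This grammar is an assumption based only on the README examples.
_HOP = re.compile(r"(<-|->)?([A-Za-z0-9_]+)(?:\{typeOf:([A-Za-z0-9_]+)\})?")


@dataclass(frozen=True)
class Hop:
    direction: str  # "in", "out", or "both" when no arrow is given
    prop: str
    type_of: Optional[str] = None


def parse_relation_expression(expr: str) -> List[Hop]:
    """Split an expression such as '<-p1{typeOf:X}->p2' into its hops."""
    hops: List[Hop] = []
    pos = 0
    while pos < len(expr):
        match = _HOP.match(expr, pos)
        if match is None:
            raise ValueError(f"cannot parse {expr!r} at offset {pos}")
        arrow, prop, type_of = match.groups()
        direction = {"<-": "in", "->": "out", None: "both"}[arrow]
        hops.append(Hop(direction, prop, type_of))
        pos = match.end()
    if not hops:
        raise ValueError("empty expression")
    return hops


if __name__ == "__main__":
    for example in (
        "virusHost",
        "->phylum",
        "<-virusGenus",
        "<-geneID{typeOf:DiseaseGeneAssociation}->diseaseOntologyID",
    ):
        print(example, parse_relation_expression(example))
```

Representing each hop separately keeps the two-step form readable: the first hop carries the type filter and the second hop names the property whose 'out' values are returned.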
83 changes: 43 additions & 40 deletions tools/nl/embeddings/data/curated_input/bio/sheets_svs.csv
@@ -1,45 +1,48 @@
 dcid,Name,Description,Override_Alternatives,Curated_Alternatives
-ofVirusSpecies,ofVirusSpecies,"The species of a virus isolate",,
-virusHost,virusHost,"A specific organism or taxonomic group of organisms that are susceptible to be infected by a virus",,"host of a virus"
-ncbiTaxonID,ncbiTaxonID,"NCBI Taxonomy database identifier",,
-diseaseName,diseaseName,"preferred disease name for the concept specified by disease identifiers",,"The name of the disease"
-observedAllele,observedAllele,"The sequences of the observed alleles from rs-fasta files.",,
-referenceAlleleNCBI,referenceAlleleNCBI,"Reference genomic sequence from dbSNP",,"reference allele"
+ofVirusSpecies,ofVirusSpecies,The species of a virus isolate,,
+virusHost,virusHost,A specific organism or taxonomic group of organisms that are susceptible to be infected by a virus,,host of a virus
+ncbiTaxonID,ncbiTaxonID,NCBI Taxonomy database identifier,,
+diseaseName,diseaseName,preferred disease name for the concept specified by disease identifiers,,The name of the disease
+observedAllele,observedAllele,The sequences of the observed alleles from rs-fasta files.,,
+referenceAlleleNCBI,referenceAlleleNCBI,Reference genomic sequence from dbSNP,,reference allele
 class,class,,,
 phylum,phylum,,,
-geneticVariantFunctionalCategory,geneticVariantFunctionalCategory,"Functional category of the genetic variant",,
-hg19GenomicPosition,hg19GenomicPosition,"The genomic position of a genetic variant using the hg19 assembly",,
-hg19GenomicLocation,hg19GenomicLocation,"The genomic location of a genetic variant using the hg19 assembly",,
-hg38GenomicPosition,hg38GenomicPosition,"The genomic position of a genetic variant using the hg38 assembly",,
-hg38GenomicLocation,hg38GenomicLocation,"The genomic location of a genetic variant using the hg38 assembly",,
-hasRNATranscript,hasRNATranscript,"Recorded transcript",,"RNA transcript that a gene has"
-strandOrientation,strandOrientation,"The strand on which a given annotation is located",,"The orientation of the strand on which an annotation is located"
-typeOfGene,typeOfGene,"The type of gene",,
-omimID,omimID,"OMIM database identifier",,
+geneticVariantFunctionalCategory,geneticVariantFunctionalCategory,Functional category of the genetic variant,,
+hg19GenomicPosition,hg19GenomicPosition,The genomic position of a genetic variant using the hg19 assembly,,
+hg19GenomicLocation,hg19GenomicLocation,The genomic location of a genetic variant using the hg19 assembly,,
+hg38GenomicPosition,hg38GenomicPosition,The genomic position of a genetic variant using the hg38 assembly,,
+hg38GenomicLocation,hg38GenomicLocation,The genomic location of a genetic variant using the hg38 assembly,,
+hasRNATranscript,hasRNATranscript,Recorded transcript,,RNA transcript that a gene has
+strandOrientation,strandOrientation,The strand on which a given annotation is located,,The orientation of the strand on which an annotation is located
+typeOfGene,typeOfGene,The type of gene,,
+omimID,omimID,OMIM database identifier,,
 icd10CMCode,icd10CMCode,"The disease diagnosis code for version 10 of the International Classification of Diseases (ICD), Clinical Modification",,
-subClassificationOf,subClassificationOf,"subclassification of",,
-snomedCT,snomedCT,"Systematiized Nomenclature of Medicine (SNOMED) clinical terms (CT) code",,
-unifiedMedicalLanguageSystemConceptUniqueIdentifier,unifiedMedicalLanguageSystemConceptUniqueIdentifier,"Unified Medical Language System (UMLS) Concept Unique Identifier (CUI)",, "UMLS CUI"
-specializationOf,specializationOf,"specialization of",,
-chemblID,chemblID,"ChEMBL identifier",,
-simplifiedMolecularInputLineEntrySystem,simplifiedMolecularInputLineEntrySystem,"Simplified Molecular Input Line Entry System (SMILE)",,
-medicalSubjectHeadingSupplementaryRecordID,medicalSubjectHeadingSupplementaryRecordID,"A unique ID for a Medical Subject Heading supplementary record",,"An ID for a Medical Subject Heading supplementary record;MeSH supplementary record ID"
-medicalSubjectHeadingDescriptorID,medicalSubjectHeadingDescriptorID,"A unique ID for a Medical Subject Heading Descriptor record",,"An ID for a Medical Subject Heading descriptor record;MeSH descriptor record ID"
+subClassificationOf,subClassificationOf,subclassification of,,
+snomedCT,snomedCT,Systematiized Nomenclature of Medicine (SNOMED) clinical terms (CT) code,,
+unifiedMedicalLanguageSystemConceptUniqueIdentifier,unifiedMedicalLanguageSystemConceptUniqueIdentifier,Unified Medical Language System (UMLS) Concept Unique Identifier (CUI),," ""UMLS CUI"""
+specializationOf,specializationOf,specialization of,,
+chemblID,chemblID,ChEMBL identifier,,
+simplifiedMolecularInputLineEntrySystem,simplifiedMolecularInputLineEntrySystem,Simplified Molecular Input Line Entry System (SMILE),,
+medicalSubjectHeadingSupplementaryRecordID,medicalSubjectHeadingSupplementaryRecordID,A unique ID for a Medical Subject Heading supplementary record,,An ID for a Medical Subject Heading supplementary record;MeSH supplementary record ID
+medicalSubjectHeadingDescriptorID,medicalSubjectHeadingDescriptorID,A unique ID for a Medical Subject Heading Descriptor record,,An ID for a Medical Subject Heading descriptor record;MeSH descriptor record ID
 activeIngredient,activeIngredient,"component that provides pharmacological activity or other direct effect in the diagnosis, cure, mitigation, treatment, or prevention of disease, or to affect the structure or any function of the body of man or animals",,
-administrationRoute,administrationRoute,"The method by which a drug is administered",,
-dosageForm,dosageForm,"physical form in which a drug is produced and dispensed",,
-antibodyType,antibodyType,"type of antibody",,
-antigenType,antigenType,"type of antigen",,
-chromosomeSize,chromosomeSize,"number of nucleotides in a chromosome",,"Size of chromosome"
-ensemblID,ensemblID,"Ensembl ID",,
-fullName,fullName,"full name of the gene",,
-geneID,geneID,"gene id",,
-ncbiProteinAccessionNumber,ncbiProteinAccessionNumber,"NCBI protein accession number",,
-alleleOrigin,alleleOrigin,"Variant allele origin",,"Origin of variant allele"
-alleleType,alleleType,"The allele of a genetic variant observed within a population",,"Type of allele"
-ncbiDNASequenceName,ncbiDNASequenceName,"NCBI defined segment of DNA sequence name",,"Name used by NIH NCBI to refer to a segment of DNA sequence"
-imageUrl,imageUrl,"url to an image of what the biological specimen looks like",,"what the entity looks like"
-genomicCoordinates,genomicCoordinates,"genomic coordinates",,
-availableStrength,availableStrength,"dose approved for a drug",,
-referenceSNPClusterID<-GeneticVariantGeneAssociation->geneSymbol,GeneticVariantGeneAssociation,"Association between a genetic variant and a gene",,"Gene associated with a genetic variant;genetic variant associated with a gene"
-diseaseOntologyID<-DiseaseGeneAssociation->geneID,DiseaseGeneAssociation,"Association of a disease and a gene",,
+administrationRoute,administrationRoute,The method by which a drug is administered,,
+dosageForm,dosageForm,physical form in which a drug is produced and dispensed,,
+antibodyType,antibodyType,type of antibody,,
+antigenType,antigenType,type of antigen,,
+chromosomeSize,chromosomeSize,number of nucleotides in a chromosome,,Size of chromosome
+ensemblID,ensemblID,Ensembl ID,,
+fullName,fullName,full name of the gene,,
+geneID,geneID,gene id,,
+ncbiProteinAccessionNumber,ncbiProteinAccessionNumber,NCBI protein accession number,,
+alleleOrigin,alleleOrigin,Variant allele origin,,Origin of variant allele
+alleleType,alleleType,The allele of a genetic variant observed within a population,,Type of allele
+ncbiDNASequenceName,ncbiDNASequenceName,NCBI defined segment of DNA sequence name,,Name used by NIH NCBI to refer to a segment of DNA sequence
+imageUrl,imageUrl,url to an image of what the biological specimen looks like,,what the entity looks like
+genomicCoordinates,genomicCoordinates,genomic coordinates,,
+availableStrength,availableStrength,dose approved for a drug,,
+<-referenceSNPClusterID{typeOf:GeneticVariantGeneAssociation}->geneSymbol,GeneticVariantGeneAssociation,Gene associated with a genetic variant,,
+<-geneSymbol{typeOf:GeneticVariantGeneAssociation}->referenceSNPClusterID,GeneticVariantGeneAssociation,genetic variant associated with a gene,,
+<-diseaseOntologyID{typeOf:DiseaseGeneAssociation}->geneID,DiseaseGeneAssociation,Gene associated with a disease,,
+<-geneID{typeOf:DiseaseGeneAssociation}->diseaseOntologyID,DiseaseGeneAssociation,Disease associated with a gene,,
+virusGenus,virusGenus,genus of a virus species,,
2 changes: 2 additions & 0 deletions tools/nl/embeddings/data/preindex/bio/duplicate_names.csv
@@ -1,4 +1,6 @@
 PreferredSV,DroppedSV,DuplicateName
+<-referenceSNPClusterID{typeOf:GeneticVariantGeneAssociation}->geneSymbol,<-geneSymbol{typeOf:GeneticVariantGeneAssociation}->referenceSNPClusterID,GeneticVariantGeneAssociation
+<-diseaseOntologyID{typeOf:DiseaseGeneAssociation}->geneID,<-geneID{typeOf:DiseaseGeneAssociation}->diseaseOntologyID,DiseaseGeneAssociation
 Amount_EconomicActivity_GrossDomesticProduction_Nominal,dc/topic/GDP,GDP
 dc/topic/Mortality,dc/topic/WHOMortality,Mortality
 dc/topic/EconomicActivity,dc/topic/GlobalEconomicActivity,Economic Activity
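The two association rows in duplicate_names.csv are consistent with a first-row-wins rule over the `Name` column of sheets_svs.csv: both directions of each association share one name, and the expression listed first is the one kept. The sketch below illustrates that rule only as an assumption drawn from the data in this commit; `find_duplicate_names` is a hypothetical helper, not the repository's preindex code.

```python
import csv
import io
from typing import Dict, List, Tuple


def find_duplicate_names(csv_text: str) -> List[Tuple[str, str, str]]:
    """Return (preferred_dcid, dropped_dcid, name) for each repeated Name.

    Assumption: the first row carrying a given Name is preferred and every
    later row with the same Name is dropped.
    """
    first_seen: Dict[str, str] = {}
    duplicates: List[Tuple[str, str, str]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dcid, name = row["dcid"], row["Name"]
        if name in first_seen:
            duplicates.append((first_seen[name], dcid, name))
        else:
            first_seen[name] = dcid
    return duplicates


# Three rows copied from sheets_svs.csv in this commit.
SAMPLE = """dcid,Name,Description,Override_Alternatives,Curated_Alternatives
geneID,geneID,gene id,,
<-diseaseOntologyID{typeOf:DiseaseGeneAssociation}->geneID,DiseaseGeneAssociation,Gene associated with a disease,,
<-geneID{typeOf:DiseaseGeneAssociation}->diseaseOntologyID,DiseaseGeneAssociation,Disease associated with a gene,,
"""

if __name__ == "__main__":
    for preferred, dropped, name in find_duplicate_names(SAMPLE):
        print(f"{preferred},{dropped},{name}")
```

On the sample rows the helper reports the `<-diseaseOntologyID{...}->geneID` expression as preferred and the reverse direction as dropped, matching the DiseaseGeneAssociation row in duplicate_names.csv.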