Best way to apply on large quantities of documents? #352
I have a corpus of ~75,000 abstracts that I want to make a KG out of using OntoGPT. After 4 hours, it had only gotten through 50 documents -- not super promising!
I took a look through the docs to see if there was a parallelization option, but didn't find anything -- is there a better way to run OntoGPT over tons of documents besides making a bunch of separate small directories and submitting a bunch of different jobs?
If you have a thought about where in the code it would make sense to add parallelization capabilities, I'm happy to take a shot at opening a PR!
So I did some digging, and it looks like OpenAI natively supports batching: all you have to do is pass a list of prompts to the completions endpoint. I took a look through the code to see if I could figure out where to change things to allow for batching. The OpenAI client's complete function only takes a single prompt, but I think you could just pass a list of prompts without changing that function itself -- besides having to change how the response is processed. I then looked to see where that function is called, to see where you'd have to change the input. In the complete function in the CLI module, it also only has access to one doc at a time. I'm having some trouble following the code past that point, though.
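For reference, here's a minimal sketch of the kind of batching I mean, assuming the OpenAI Python SDK and its completions endpoint (which accepts a list of prompts in a single request); the model name and prompts below are just placeholders:

```python
# Minimal sketch of native prompt batching with the OpenAI SDK (v1+).
# Model name and prompts are placeholders, not OntoGPT's actual calls.
from openai import OpenAI

client = OpenAI()
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=["Extract entities from document 1 ...",
            "Extract entities from document 2 ..."],
)
# Each returned choice carries the index of the prompt it answers.
for choice in sorted(response.choices, key=lambda c: c.index):
    print(choice.index, choice.text)
```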
Sure! Batching would be a useful feature to support. Before adding it, though: is there additional slowdown elsewhere in the system, like in the grounding steps?
That is a fantastic question -- I hadn't thought about it, but in another NER algorithm I used a while ago, the classification was super fast but the grounding took a prohibitively long time, so I ended up doing all the classification first and then all the grounding at once to speed it up (as opposed to on a document-by-document basis). Is there a quick way for me to check whether it's the grounding or OpenAI before I go ahead and try to implement batching, or should I just go through the code and add timing?
Yes - the easiest way is to repeat exactly the same extract command, since OntoGPT will cache the OpenAI results.
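If you want hard numbers, a rough version of that check might look like this (the template and input file names are placeholders, and this assumes the standard extract invocation):

```python
# Rough timing check: the first run pays for the LLM call plus grounding; the
# second hits OntoGPT's LLM cache, so it is approximately grounding-only.
import subprocess
import time

cmd = ["ontogpt", "extract", "-t", "my_template", "-i", "abstract.txt"]
for label in ("cold run (LLM + grounding)", "warm run (~grounding only)"):
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"{label}: {time.perf_counter() - start:.1f} s")
```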
Ok so it definitely looks like it's the grounding; here are the times from running it on a single doc:
The output for this doc looks like:
EDIT: Let me know what your thoughts are about the best way to tackle this! I'm also getting an error when running it.
Wow, that time is much longer than I would expect. It may be because NCBITaxon and PR are two very large ontologies, and they're used as annotators in multiple classes in your schema. A quick check for this may be to temporarily disable annotation for a specific class, or change the domain of the corresponding slot to string so it doesn't ground -- then run again and see if the process is faster. There are some ways to further tune the schema, plus alternative strategies for annotation, and most of them come down to using smaller sets of potential identifiers (e.g., you may be able to use a slim version of NCBITaxon if you're primarily expecting taxons to be grasses).
The other thing is probably another issue, maybe one that happens when the output stream isn't the expected type.
Also depends on how you're grounding: semsql, bioportal, Gilda…
I believe there may be an embarrassingly simple optimization for grounding -- the naive strategy may be reindexing on each invocation.
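Roughly this shape: build the adapter (and whatever index it needs) once and reuse it, rather than rebuilding per call. A sketch assuming an oaklib-style backend; get_cached_adapter is illustrative, not existing OntoGPT code:

```python
# Sketch of "index once, reuse everywhere": the adapter for a given selector
# is constructed on first use and memoized for every later grounding call.
from functools import lru_cache

from oaklib import get_adapter

@lru_cache(maxsize=None)
def get_cached_adapter(selector: str):
    # Hypothetical helper; selector is an OAK string like "sqlite:obo:mesh".
    return get_adapter(selector)
```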
I'll give this a shot when I get the chance!
This is also a great suggestion; I'm expecting them to be basically all plant species. I haven't taken an in-depth look at the code for grounding yet -- is there a way to save grounding for last and then only do it on unique entities? In the past with other algorithms, I've disabled grounding until I've gotten all the entities, then removed any duplicates to reduce the number of times the grounding process has to run.
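Something like this pattern is what I have in mind, where ground_mention stands in for whatever grounding call is actually used (just a sketch of the idea, not OntoGPT code):

```python
# Classify everything first, then ground each unique mention exactly once.
# ground_mention is a hypothetical stand-in for the real grounding call.
def ground_unique(all_mentions, ground_mention):
    cache = {}
    for mention in all_mentions:
        if mention not in cache:  # duplicates are free
            cache[mention] = ground_mention(mention)
    return cache

# Afterwards, each document's annotations are filled in by looking up its
# mentions in the cache instead of re-running the grounder per document.
```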
@cmungall what grounding approach am I using if my annotators are all prefixed with sqlite:obo:? Also, could you elaborate on the reindexing point?
Annotators prefixed with sqlite:obo: use the Semantic SQL builds of the corresponding ontologies, accessed through OAK. As for the other grounding strategies, they're described in the OAK docs.
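For a concrete sense of what a sqlite:obo: annotator resolves to under the hood, here's a minimal OAK sketch (assumes oaklib is installed; the first call downloads the prebuilt database):

```python
# Search a Semantic SQL ontology the same way an annotator would.
from oaklib import get_adapter

adapter = get_adapter("sqlite:obo:mesh")
for curie in adapter.basic_search("Headache"):
    print(curie, adapter.label(curie))
```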
Ok so I did a closer read of the paper, as well as looking at the implementation of grounding, and took a cursory skim through the semantic-sql docs.
In this case, since they're already faster and provide the ontologies I want, I'm planning to stick with the Semantic SQL annotators. I've got a few clarifying questions that I couldn't find answers for.
In any case, I'd super appreciate some practical pointers on where to begin optimizing, whether that be changes to the code (is there a place for parallelization here?) or semantic-sql-related ontology changes.
EDIT: Forgot to mention that I made an equivalent schema with no annotators, and it was many, many times faster, but as noted in the paper, it has absolutely atrocious grounding performance. So it's definitely the grounding side of things causing the issue.
Nope! The grounding process does not involve the LLM at all and the LLM is never aware of the parameters you use for grounding unless you explicitly tell it (e.g., if the description for a class says "these should be Gene Ontology terms" or something). This is intentional since LLMs are prone to hallucinating completely nonexistent IDs and misrepresenting the connections between terms and IDs.
I'm not certain if this will provide major performance boosts, but it's worth a try. For example:

Disease:
  is_a: NamedEntity
  annotations:
    annotators: "sqlite:obo:mesh, sqlite:obo:mondo, sqlite:obo:hp, sqlite:obo:ncit, sqlite:obo:doid, bioportal:meddra"
    prompt.examples: cardiac asystole, COVID-19, Headache, cancer
  # For the purposes of evaluating against BC5CDR, we force normalization to MESH
  id_prefixes:
    - MESH
  slot_usage:
    id:
      pattern: "^MESH:[CD][0-9]{6}$"
      values_from:
        - MeshDiseaseIdentifier

enums:
  ...
  MeshDiseaseIdentifier:
    reachable_from:
      source_ontology: obo:mesh
      source_nodes:
        - MESH:D001423 ## Bacterial Infections and Mycoses
        - MESH:D001523 ## Mental Disorders
        - MESH:D002318 ## Cardiovascular Diseases
        - MESH:D002943 ## Circulatory and Respiratory Physiological Phenomena
        - MESH:D004066 ## Digestive System Diseases
        - MESH:D004700 ## Endocrine System Diseases
        - MESH:D005128 ## Eye Diseases
        - MESH:D005261 ## Female Urogenital Diseases and Pregnancy Complications
        - MESH:D006425 ## Hemic and Lymphatic Diseases
        - MESH:D007154 ## Immune System Diseases
        - MESH:D007280 ## Disorders of Environmental Origin
        - MESH:D009057 ## Stomatognathic Diseases
        - MESH:D009140 ## Musculoskeletal Diseases
        - MESH:D009358 ## Congenital, Hereditary, and Neonatal Diseases and Abnormalities
        - MESH:D009369 ## Neoplasms
        - MESH:D009422 ## Nervous System Diseases
        - MESH:D009750 ## Nutritional and Metabolic Diseases
        - MESH:D009784 ## Occupational Diseases
        - MESH:D010038 ## Otorhinolaryngologic Diseases
        - MESH:D010272 ## Parasitic Diseases
        - MESH:D012140 ## Respiratory Tract Diseases
        - MESH:D013568 ## Pathological Conditions, Signs and Symptoms
        - MESH:D014777 ## Virus Diseases
        - MESH:D014947 ## Wounds and Injuries
        - MESH:D017437 ## Skin and Connective Tissue Diseases
        - MESH:D052801 ## Male Urogenital Diseases
        - MESH:D064419 ## Chemically-Induced Disorders

There are some good examples in the cell_type and gocam templates, too. I suspect the size of the NCBITaxon annotator isn't helping.
Please post your full schema and we'll see if there are some other areas to optimize.
This is great, thank you! I'll hold off on adding specific identifiers to the schema until I've exhausted other options since it may not be a huge boost anyway. I'll try paring down the NCBI taxonomy early next week and let you know how it goes! This is my current schema:
I'm not actually sure that I need gilda in the annotator lists. Any other suggestions appreciated!
I tried just running the verbatim version of the slim NCBI taxonomy.
Anecdotally, too, it looks like the performance is basically the same for extracting the species in the example abstract I've been using. While that is great news, I do still have an issue related to timing: even running at 1 min per abstract, it would take 55 days to run this corpus. Having looked at the grounding code, I feel like there is definitely a way to speed it up internally, in terms of parallelization. Have you all thought about parallelizing this section of code and decided against it because of some kind of barrier, or is it an open problem that I could try my hand at a PR for?
EDIT: I realize that I haven't tried paring down the other large annotators yet.
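To make the question concrete, this is roughly the shape I'd try: one adapter per worker process, with unique mentions fanned out across the pool. Just a sketch with illustrative names, not OntoGPT code:

```python
# Sketch of parallel grounding: each worker builds its own OAK adapter once,
# then unique mention strings are distributed across the process pool.
from concurrent.futures import ProcessPoolExecutor

from oaklib import get_adapter

_adapter = None

def _init_worker(selector):
    global _adapter
    _adapter = get_adapter(selector)  # expensive; done once per worker

def _ground_one(label):
    # Return (label, first matching CURIE or None).
    matches = list(_adapter.basic_search(label))
    return label, matches[0] if matches else None

def ground_parallel(labels, selector="sqlite:obo:ncbitaxon", workers=4):
    with ProcessPoolExecutor(max_workers=workers,
                             initializer=_init_worker,
                             initargs=(selector,)) as pool:
        return dict(pool.map(_ground_one, set(labels)))
```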
Great! Thanks for providing the schema. I'm 110% sure there are ways to speed up the grounding, so if you feel inspired, a PR is welcome! Using a slim version of NCBITaxon in its OBO form may not be the fastest option -- in that case you would want to get it in OWL format and then convert it to a semantic-sql database (like here: https://github.com/INCATools/semantic-sql?tab=readme-ov-file#creating-a-sqlite-database-from-an-owl-file). PR and CHEBI are also large ontologies, so there may be some speedup to be had in using other or smaller versions of those annotators.
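If it helps, scripting the conversion is just a thin wrapper around the semsql CLI described in that README (the taxslim.owl filename here is a placeholder, assumed to be in the working directory):

```python
# Build taxslim.db from taxslim.owl via the semantic-sql CLI, per the linked
# README: "semsql make foo.db" expects foo.owl alongside it.
import subprocess

subprocess.run(["semsql", "make", "taxslim.db"], check=True)
```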
It definitely got faster after turning the slim taxonomy into an sqlite database! It cut about 10 seconds off. ChEBI has a Lite version of the ontology, which I made into an sqlite database and used. However, there was no speedup -- I think "Lite" in the ChEBI case refers to the data associated with each instance in the ontology, but there are just as many terms. I think I'm going to turn my attention towards optimizing the code itself as opposed to the databases I'm using for normalization -- it seems generally advantageous to do that in any case. Thanks again for all your help, I'll open a PR when I have something to show!
Hi all, just wanted to update on this and ask some questions.
Before spending time optimizing, I decided to make sure that switching to the slim ontology didn't affect performance too badly. Unfortunately, on a sample of 1,000 docs from my dataset, switching to slim results in a loss of about 50% of groundings, as well as dropping 20% of entities entirely. So for my use case, optimizing performance with slim ontologies doesn't seem to be sufficient. I noticed that you opened #363, which might help, but since optimizing the schema with the slim taxonomy helped so drastically in terms of time, I'm not sure that I'd ever be able to use the full taxonomy, which may be a dealbreaker for being able to use the tool. So for the moment, I'm going to hold off on doing any optimization of the grounding code itself.
I also noticed while quantifying the outputs of the two graphs that the relation extraction performance, regardless of which taxonomy DB I used, is absolutely abysmal. I don't have a gold standard for this dataset, but anecdotally speaking, for a dataset of 1,000 docs, only ~700 relations were extracted. I added specific prompts to each relation in the schema before running this analysis, so I'm not sure what else I can do to get better relation extraction performance. Wondering if you have any thoughts -- I looked for similar issues but didn't find any that specifically talked about engineering the relation prompts within the schema, so let me know if I should open a separate issue for this.
Hi @serenalotreck - thanks for your patience, and thanks for looking into some areas for performance improvements! As for relation extraction results, a couple things you could try:
Triple:
  abstract: true
  description: Abstract parent for Relation Extraction tasks
  is_a: CompoundExpression
  attributes:
    subject:
      range: NamedEntity
    predicate:
      range: RelationshipType
    object:
      range: NamedEntity
    qualifier:
      range: string
      description: >-
        A qualifier for the statements, e.g. "NOT" for negation
    subject_qualifier:
      range: NamedEntity
      description: >-
        An optional qualifier or modifier for the subject of the statement,
        e.g. "high dose" or "intravenously administered"
    object_qualifier:
      range: NamedEntity
      description: >-
        An optional qualifier or modifier for the object of the statement,
        e.g. "severe" or "with additional complications"

Here's an example from another template -- first the slot definition:

medications:
  description: >-
    A semicolon-separated list of the patient's medications.
    This should include the medication name, dosage, frequency,
    and route of administration. Relevant acronyms: PO: per os/by mouth,
    PRN: pro re nata/as needed. 'Not provided' if not provided.
  range: DrugTherapy
  multivalued: true

Then this is the entity definition:

DrugTherapy:
  is_a: CompoundExpression
  annotations:
    owl: IntersectionOf
  attributes:
    drug:
      description: >-
        The name of a specific drug for a patient's preventative
        or therapeutic treatment.
      range: Drug
    amount:
      description: >-
        The quantity or dosage of the drug, if provided.
        May include a frequency.
        N/A if not provided.
      range: QuantitativeValueWithFrequency
    dosage_by_unit:
      description: >-
        The unit of a patient's properties used to determine drug
        dosage. Often "kilogram". N/A if not provided.
      range: Unit
    duration:
      description: >-
        The duration of the drug therapy, if provided.
        N/A if not provided.
      range: QuantitativeValue
    route_of_administration:
      description: >-
        The route of administration for the drug therapy, if provided.
        N/A if not provided.
      range: string

To be fair, these details are usually provided adjacent to each other, unlike many relations like protein-protein interactions (except in ideal cases like "protein A interacts with protein B"). But this kind of prompt engineering appears to help with relation extraction.