Create a chunk summary and discard irrelevant information. #71
Ok, so I ran a test to check whether a single-pass pre-evaluation is possible with an 8b model. The resulting fingerprint will be used to determine how much work (i.e. tokens) should be spent extracting data from the document and how many tokens should be produced (in short, it chooses the pipeline effort and output size). I think this could be used to parse at least the first page of a document, more if it fits in context. This is the prompt: prompt
The output of llama8b:
{
"Author": ["Jay Shah"],
"ConceptsExplained": ["Antimicrobial Resistance (AMR)", "Antibiotic Use", "Global Economic Impact", "Surgical Site Infections", "Methicillin-Resistant Staphylococcus aureus (MRSA)", "Tuberculosis (TB)", "Malaria", "HIV", "Escherichia coli", "Klebsiella pneumoniae"],
"Context": ["Global Health", "Economics", "Medicine", "Infectious Diseases"],
"Credibility": 8,
"DepthOfKnowledge": 9,
"DomainOfKnowledge": ["Microbiology", "Epidemiology", "Economics", "Public Health"],
"KnowledgeType": ["Scientific Research", "Review Article"],
"Language": ["English"],
"OverallMessage": ["Antimicrobial Resistance is a growing global health concern with significant economic and human implications"],
"Publisher": ["Centre for Disease Control and Prevention", "The Economist", "KPMG LLP"],
"Source": ["Pre-print article", "Journal articles", "Reports"],
"Summary": ["The article discusses the growing concern of Antimicrobial Resistance (AMR) and its impact on global health and economy. The authors highlight the need for educational, political, economic, medical, and social factors to address this issue."],
"TechnicalTerms": ["Antimicrobial Resistance (AMR)", "Antibiotic Use", "Surgical Site Infections", "Methicillin-Resistant Staphylococcus aureus (MRSA)", "Tuberculosis (TB)", "Malaria", "HIV", "Escherichia coli", "Klebsiella pneumoniae"],
"TypeOfMaterial": ["Research Article", "Report"],
"Usefulness": 9
}
EDIT: I wanted to try 405b as well:
{
"Author": ["Jay Shah"],
"ConceptsExplained": [
"Antimicrobial resistance (AMR) and its impacts",
"Factors contributing to AMR",
"Global economic impacts of AMR",
"Challenges in developing new antibiotics"
],
"Context": ["Global health issue"],
"Credibility": 8,
"DepthOfKnowledge": 7,
"DomainOfKnowledge": ["Healthcare", "Economics"],
"KnowledgeType": ["Factual", "Analytical"],
"Language": ["English"],
"OverallMessage": ["AMR is a serious global health threat with significant economic impacts that requires coordinated efforts to address."],
"Publisher": [],
"Source": "Centre for Disease Control and Prevention",
"Summary": ["This article discusses the growing threat of antimicrobial resistance (AMR), its causes, potential economic impacts, and challenges in developing new antibiotics. It highlights the need for global efforts to combat AMR."],
"TechnicalTerms": [
"Antimicrobial resistance (AMR)",
"Staphylococcus aureus",
"Escherichia coli",
"Klebsiella pneumoniae",
"Antibiotic consumption"
],
"TypeOfMaterial": "Article",
"Usefulness": 8
}
This shows that a single API call can generate many different elements that would otherwise be spread over many API calls. Since it outputs only JSON, it is fairly cheap token-wise. I'm pretty impressed with the 8b model! I think it's almost better than 405b for some keys. I find that running multiple calls to 8b with slightly different prompts (and smaller chunks) is more efficient than a single prompt to a larger model.
Data extraction works better with smaller chunks and small LLMs. 3b will need to be evaluated as a MICRO model and 1.5b as a NANO model. What kind of results can be expected from them (for data extraction at least)? They are cheap to run tokens through, be it locally or remotely.
Technical term extraction would work better on smaller chunks than on a whole article like this, but it still did a good job. Perhaps a dictionary-based comparison of the JSON output against all terms found in the text using regex, followed by asking the LLM about the missing terms, would be a better approach than polling it multiple times on small chunks.
I forgot to specify that definitions should only be looked up in the text, but oh well:
llama8b technical terms:
{
"terms": [
{
"term": "Antimicrobial Resistance (AMR)",
"definition": "A phenomenon where bacteria and other pathogens develop the ability to evade or defeat the drugs designed to eliminate them."
},
{
"term": "Antimicrobial use (AMU)",
"definition": "The use of antimicrobial agents, such as antibiotics, to treat or prevent infections."
},
{
"term": "Broad-spectrum antibiotics",
"definition": "Antibiotics that are effective against a wide range of bacteria, including both Gram-positive and Gram-negative bacteria."
},
{
"term": "Gram-positive bacteria",
"definition": "Bacteria that have a thick peptidoglycan cell wall and a positive charge, examples include Staphylococcus aureus and Enterococcus faecalis."
},
{
"term": "Gram-negative bacteria",
"definition": "Bacteria that have a thinner peptidoglycan cell wall and a negative charge, examples include Escherichia coli and Pseudomonas aeruginosa."
},
{
"term": "Methicillin-resistant Staphylococcus aureus (MRSA)",
"definition": "A type of Staphylococcus aureus that is resistant to methicillin, a type of beta-lactam antibiotic."
},
{
"term": "Carbapenem-resistant Enterobacteriaceae (CRE)",
"definition": "A type of bacteria that is resistant to carbapenem, a broad-spectrum antibiotic."
},
{
"term": "Extensively drug-resistant tuberculosis (XDR-TB)",
"definition": "A type of tuberculosis that is resistant to multiple drugs."
},
{
"term": "Intrinsic resistance",
"definition": "A natural property of bacteria that makes them resistant to certain antibiotics due to the structure of their cell wall or other bacterial components."
},
{
"term": "Acquired resistance",
"definition": "Resistance that develops in bacteria due to exposure to an antibiotic, which can occur through genetic mutations or changes in gene expression."
},
{
"term": "Beta-lactam antibiotics",
"definition": "A class of antibiotics that interfere with the bacterial cell wall synthesis, examples include penicillin and cephalosporins."
},
{
"term": "Quinolone antibiotics",
"definition": "A class of antibiotics that inhibit DNA replication in bacteria, examples include ciprofloxacin and levofloxacin."
},
{
"term": "Aminoglycoside antibiotics",
"definition": "A class of antibiotics that inhibit protein synthesis in bacteria, examples include gentamicin and tobramycin."
},
{
"term": "Macrolide antibiotics",
"definition": "A class of antibiotics that inhibit protein synthesis in bacteria, examples include azithromycin and clarithromycin."
},
{
"term": "Fluoroquinolone antibiotics",
"definition": "A class of antibiotics that inhibit DNA replication in bacteria, examples include ciprofloxacin and levofloxacin."
},
{
"term": "Gut microbiome",
"definition": "The collection of microorganisms that live in the human gut, which play a crucial role in digestion and immune system function."
},
{
"term": "Probiotics",
"definition": "Live microorganisms that are similar to the beneficial bacteria found in the human gut, which can help to restore balance to the gut microbiome."
},
{
"term": "Synthetic biology",
"definition": "The design and construction of new biological systems, such as microbes, to perform specific functions, such as the production of biofuels or other chemicals."
},
{
"term": "Precision medicine",
"definition": "An approach to healthcare that involves tailoring medical treatment to the individual needs of each patient, based on their unique genetic and environmental characteristics."
},
{
"term": "Pharmacogenomics",
"definition": "The study of how people's genetic differences affect their response to different medications."
},
{
"term": "Patent expiration",
"definition": "The end of the time period during which a company has exclusive rights to produce and sell a particular product, such as an antibiotic."
},
{
"term": "Patent pool",
"definition": "A mechanism that allows multiple companies to work together to develop and share intellectual property, such as patents, related to a particular product or technology."
},
{
"term": "WTO",
"definition": "The World Trade Organization, an international organization that promotes free trade and sets rules for international trade."
},
{
"term": "TRIPS",
"definition": "The Agreement on Trade-Related Aspects of Intellectual Property Rights, an international agreement that sets rules for intellectual property protection and enforcement."
},
{
"term": "WHO",
"definition": "The World Health Organization, an international organization that promotes global health and sets standards for healthcare."
},
{
"term": "Pneumonia",
"definition": "An infection of the lungs that can be caused by bacteria, viruses, or other microorganisms."
},
{
"term": "Resistance genes",
"definition": "Genes that confer resistance to antibiotics, which can be transmitted horizontally between bacteria."
"
},
{
"term": "Horizontal gene transfer",
"definition": "The transfer of genes between bacteria through mechanisms such as conjugation, transformation, or transduction."
},
{
"term": "Transformation",
"definition": "The direct uptake and incorporation of free DNA molecules from the environment into a bacterial cell."
},
{
"term": "Transduction",
"definition": "The transfer of DNA from one bacterium to another through a viral vector."
},
{
"term": "Conjugation",
"definition": "The transfer of DNA between bacteria through direct cell-to-cell contact."
}
]
}
llama405b technical terms:
{
"technical_terms": [
{
"term": "antimicrobial resistance (AMR)",
"definition": "a phenomenon wherein bacteria and other pathogens develop the ability to evade or defeat the drugs designed to eliminate them"
},
{
"term": "resistant infections",
"definition": "infections caused by pathogens that have developed resistance to one or more antimicrobial agents"
},
{
"term": "post-operative infections",
"definition": "infections that occur after a surgical procedure"
},
{
"term": "extensively drug-resistant tuberculosis (XDR-TB)",
"definition": "a rare disease that can become more common and pose a substantial mortality threat, characterized by resistance to at least four of the core anti-TB drugs"
},
{
"term": "carbapenem-resistant Enterobacteriaceae (CRE)",
"definition": "bacteria that are resistant to carbapenems, which are broad-spectrum antibiotics that are often reserved as a last resort for treating resistant infections"
},
{
"term": "methicillin-resistant Staphylococcus aureus (MRSA)",
"definition": "a rare disease that can become more common and pose a substantial mortality threat, caused by a strain of Staphylococcus aureus resistant to methicillin and other beta-lactam antibiotics"
},
{
"term": "broad-spectrum antibiotics",
"definition": "antibiotics that are effective against a wide range of bacteria"
},
{
"term": "antibiotic prophylaxis",
"definition": "the preventive use of antibiotics to reduce the risk of infection"
},
{
"term": "intrapartum antibiotic prophylaxis",
"definition": "the administration of antibiotics during labor to prevent sepsis in infants"
},
{
"term": "intestinal microbiomes",
"definition": "the community of microorganisms living in the gut"
},
{
"term": "antimicrobial resistance genes",
"definition": "genes that confer resistance to antimicrobial agents"
},
{
"term": "prospective controlled cohort study",
"definition": "a type of study that follows a group of people over time and compares those exposed to a particular factor with those who are not"
},
{
"term": "low-and-middle income countries (LMIC)",
"definition": "countries with a lower economic status based on their Gross National Income (GNI) per capita"
},
{
"term": "antibiotic use (AMU)",
"definition": "the consumption or utilization of antibiotic medications"
},
{
"term": "intensive livestock production",
"definition": "a system of raising animals in which large numbers of livestock are kept in confined spaces, often with the use of antibiotics to promote growth and prevent disease"
},
{
"term": "World Organization for Animal Health (WOAH)",
"definition": "an intergovernmental organization responsible for improving animal health worldwide"
},
{
"term": "agri-food systems",
"definition": "the interconnected processes involved in the production, processing, distribution, and consumption of food"
},
{
"term": "medically important antimicrobials",
"definition": "antimicrobial agents that are critical for human medicine and should be used judiciously in animals to minimize the risk of resistance development"
},
{
"term": "human-animal interface",
"definition": "the points of contact between humans and animals, where pathogens can be transmitted between species"
},
{
"term": "point-of-care diagnostic kits",
"definition": "devices that allow for rapid testing and diagnosis at the point of patient care, without the need for specialized laboratory equipment"
},
{
"term": " GLOBAL GDP",
"definition": "the total value of all goods and services produced worldwide in a given year"
},
{
"term": "reserve antibiotics",
"definition": "antibiotics that are held in reserve and used only for the most resistant infections, to slow the development of resistance"
}
]
}
EDIT: Just noticed there is an error in the llama8b JSON for technical terms extraction, an extra
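The dictionary-based comparison suggested earlier in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the term dictionary, the function names (`terms_in_text`, `missing_terms`, `follow_up_prompt`) and the follow-up prompt wording are all hypothetical, and the only thing taken from the outputs above is the `TechnicalTerms` key.

```python
import json
import re


def normalize(term):
    """Lowercase a term and drop a trailing acronym such as ' (AMR)'."""
    return re.sub(r"\s*\([^)]*\)", "", term).strip().lower()


def terms_in_text(text, dictionary):
    """Return the dictionary terms that literally occur in the text."""
    found = set()
    for term in dictionary:
        # Lookarounds keep 'TB' from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term.lower())
    return found


def missing_terms(text, llm_json, dictionary):
    """Terms present in the text but absent from the LLM's JSON output."""
    extracted = {normalize(t) for t in json.loads(llm_json)["TechnicalTerms"]}
    return sorted(terms_in_text(text, dictionary) - extracted)


def follow_up_prompt(missing):
    """Hypothetical second-pass prompt asking only about the missed terms."""
    return ("Define the following terms using only the text provided: "
            + ", ".join(missing))


if __name__ == "__main__":
    text = ("Antimicrobial resistance (AMR) is rising. "
            "Broad-spectrum antibiotics drive MRSA spread.")
    dictionary = ["antimicrobial resistance", "broad-spectrum antibiotics",
                  "MRSA", "malaria"]
    llm_json = '{"TechnicalTerms": ["Antimicrobial Resistance (AMR)"]}'
    print(follow_up_prompt(missing_terms(text, llm_json, dictionary)))
```

With this approach the LLM is only called a second time when the regex pass finds terms it skipped, instead of being polled once per small chunk. Note that `json.loads` raises on malformed output, so a failure here would also flag a broken response from the model.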
Hey, thanks for making this issue, looks like you're doing some really creative stuff! Very glad that someone's making good use of the boilerplate :) If you have any specific questions as you develop it please let me know, and when you're done I would not be against merging it into the main project as an official pipeline option if you're open to it. Keep me posted! This looks really cool.
How would it be possible to generate a summary from a chunk? With a prompt named summary_gen.yaml?
EDIT: Since I had trouble getting my head around the code in ./original, I started from scratch with BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE, so I kind of figured out how to do this on my own. I will close this in a bit when I get it working; for now, I am sharing some of my work on abstracting the pipeline, abusing the prompts and optimizing the quality of the output while minimizing token use.
I have had some passable results by telling it to identify the main theme of the chunk and the domain of knowledge it is about, and to discard any information unrelated to them, such as advertising or irrelevant mixed-in text. That would help generate good content for continuous training. Different levels of shrinking could be tried: long summary, summary, short summary.
"Summarize the following text by keeping only what is consistent with the main idea, theme, or key points. Remove anything that is not relevant or seems off-topic."
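As a rough sketch of how the three summary levels could be attached to that prompt: the level names and the base prompt come from this issue, while the word-budget percentages, the `LEVELS` table and the `build_summary_prompt` helper are assumptions made up for illustration and would need tuning against real chunks.

```python
BASE_PROMPT = (
    "Summarize the following text by keeping only what is consistent with "
    "the main idea, theme, or key points. Remove anything that is not "
    "relevant or seems off-topic."
)

# Hypothetical word budgets, as a percentage of the chunk's word count.
LEVELS = {"long summary": 50, "summary": 25, "short summary": 10}


def build_summary_prompt(chunk, level="summary"):
    """Build a prompt asking for one of the three summary sizes."""
    percent = LEVELS[level]  # raises KeyError for an unknown level
    word_count = len(chunk.split())
    target_words = max(1, word_count * percent // 100)
    return (
        f"{BASE_PROMPT}\n"
        f"Write a {level} of at most {target_words} words.\n\n"
        f"Text:\n{chunk}"
    )


if __name__ == "__main__":
    chunk = "Antimicrobial resistance is a growing global health concern. " * 20
    for level in LEVELS:
        print(build_summary_prompt(chunk, level).splitlines()[1])
```

Tying the budget to the chunk length rather than a fixed number keeps the three levels comparable across chunks of different sizes.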
Also, if this works well, it could be possible to preprocess the chunk for later extraction of QA data, validating it against the original chunk. Small 1.5b and 3b models could be used to pump out quick and cheap Q-A pairs that could be classified, verified, grouped by something common, and reworded into complex Q-A pairs that convey more information.
I would have to get it done and compare the normal pipeline with a summary-based pipeline to see if there is any difference in dataset quality or in the speed of generating it.
Thanks