A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Yunfei Xie*, Juncheng Wu*, Haoqin Tu*, Siwei Yang*, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou
* Equal technical contribution
1 UC Santa Cruz, 2 University of Edinburgh, 3 National Institutes of Health
- [September 24, 2024] Our arXiv paper is released.
Figure 1: Overall results of o1 and 4 other strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over both closed- and open-source models.
Figure 2: Average accuracy of o1 and 4 other strong LLMs. o1 achieves the highest average accuracy, 73.3%, across 19 medical datasets.
Figure 3: Our evaluation pipeline comprises different evaluation (a) aspects containing various tasks. We collect multiple (b) datasets for each task, combining them with various (c) prompt strategies to evaluate the latest (d) language models. We leverage a comprehensive set of (e) evaluations to present a holistic view of model progress in the medical domain.
Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model results marked with * are taken from Wu et al. (2024) for reference. We also report the average score (Average) for each metric.
Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. o1 results are highlighted with a gray background. We also report the average score (Average) for each metric.
Table 3: Accuracy of models on the multilingual task, XMedBench (Wang et al., 2024).
Table 4: Accuracy of LLMs on two agentic benchmarks.
Table 5: Accuracy results of model outputs with/without CoT prompting on 5 knowledge QA datasets.
Figure 4: Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process compared to GPT-4.
Figure 5: Comparison of the answers from o1 and GPT-4 for a case from the Chinese dataset AI Hospital, along with its English translation. o1 offers a more precise diagnosis and practical treatment suggestions compared to GPT-4.
To set up the evaluation framework, clone our repository and run the setup script:
git clone https://github.com/UCSC-VLAA/o1_eval.git
cd o1_eval
bash setup.sh
Create a `.env` file in the root directory and add your OpenAI API credentials:
OPENAI_ORGANIZATION_ID=...
OPENAI_API_KEY=...
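For reference, these credentials can be loaded from the `.env` file before calling the OpenAI API. The snippet below is a minimal sketch, assuming the `python-dotenv` and `openai` (>=1.0) packages are installed; the released evaluation scripts may read these variables differently.

```python
# sketch: load OpenAI credentials from .env and issue a quick test request
# assumes `pip install python-dotenv openai` (openai>=1.0)
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY and OPENAI_ORGANIZATION_ID from .env

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    organization=os.environ.get("OPENAI_ORGANIZATION_ID"),
)

# sanity check that the credentials work
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(response.choices[0].message.content)
```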
We include the prompts and inquiries used in our paper. The datasets are detailed below, except for LancetQA and NEJMQA, which are omitted due to copyright restrictions.
The `eval_bash` directory contains an evaluation script for each dataset. Simply run the scripts to perform the evaluations:
bash eval_bash/eval_dataset_name/eval_script.sh
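If you want to evaluate several datasets in one go, a small driver like the following can loop over the per-dataset script folders. This is a hypothetical convenience sketch, assuming the `eval_bash/<dataset_name>/eval_script.sh` layout shown above; it is not part of the released scripts.

```python
# sketch: run every per-dataset evaluation script under eval_bash/
# assumes the eval_bash/<dataset_name>/eval_script.sh layout described above
import subprocess
from pathlib import Path

EVAL_ROOT = Path("eval_bash")

for script in sorted(EVAL_ROOT.glob("*/eval_script.sh")):
    print(f"Running {script} ...")
    # stop on the first failing dataset so errors are not silently skipped
    subprocess.run(["bash", str(script)], check=True)
```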
- Clone the AgentClinic repository:
  git clone https://github.com/SamuelSchmidgall/AgentClinic/
- Follow the installation instructions in the repository's `README.md`.
- To run evaluations, execute the following bash command with the specified parameters:
  python agentclinic.py --doctor_llm o1-preview \
      --patient_llm o1-preview --inf_type llm \
      --agent_dataset dataset --doctor_image_request False \
      --num_scenarios 220 \
      --total_inferences 20 --openai_client
  - `--agent_dataset`: You can choose between `MedQA` or `NEJM_Ext`.
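To evaluate both dataset options without editing the command by hand, something like the following could wrap the call. This is a minimal sketch, assuming `agentclinic.py` accepts the flags exactly as in the command above.

```python
# sketch: run AgentClinic with o1-preview on both supported datasets
# assumes agentclinic.py accepts the flags exactly as in the command above
import subprocess

COMMON_ARGS = [
    "--doctor_llm", "o1-preview",
    "--patient_llm", "o1-preview",
    "--inf_type", "llm",
    "--doctor_image_request", "False",
    "--num_scenarios", "220",
    "--total_inferences", "20",
    "--openai_client",
]

for dataset in ["MedQA", "NEJM_Ext"]:
    print(f"=== AgentClinic on {dataset} ===")
    subprocess.run(
        ["python", "agentclinic.py", "--agent_dataset", dataset, *COMMON_ARGS],
        check=True,
    )
```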
- Clone the AI Hospital repository:
  git clone https://github.com/LibertFan/AI_Hospital
- Follow the installation instructions in the repository's `README.md`.
- To run evaluations, execute the following bash command with the specified parameters:
  python run.py --patient_database ./data/patients_sample_200.json \
      --doctor_openai_api_key $OPENAI_API_KEY \
      --doctor Agent.Doctor.GPT --doctor_openai_model_name o1-preview \
      --patient Agent.Patient.GPT --patient_openai_model_name gpt-3 \
      --reporter Agent.Reporter.GPT --reporter_openai_model_name gpt-3 \
      --save_path outputs/dialog_history_iiyi/dialog_history_gpto1_200.jsonl \
      --max_conversation_turn 10 --max_workers 2 --parallel
- Note: We evaluated only the first 200 records from AI Hospital due to cost constraints.
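The 200-record subset referenced by `--patient_database` can be produced from the full AI Hospital patient file. The snippet below is a hedged sketch that assumes the database is a JSON array of patient records and that the source path is `data/patients.json` (both assumptions); check the repository's data format before using it.

```python
# sketch: slice the first 200 patient records into patients_sample_200.json
# assumes the AI Hospital patient database is a JSON array of records
import json
from pathlib import Path

SOURCE = Path("data/patients.json")        # hypothetical path to the full database
TARGET = Path("data/patients_sample_200.json")

records = json.loads(SOURCE.read_text(encoding="utf-8"))
TARGET.write_text(
    json.dumps(records[:200], ensure_ascii=False, indent=2),
    encoding="utf-8",
)
print(f"Wrote {min(len(records), 200)} records to {TARGET}")
```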
Our evaluation framework is built entirely on OpenAI Evals, which provides a framework for evaluating large language models (LLMs) or systems built using LLMs. It offers an existing registry of evaluations to test different dimensions of OpenAI models, as well as the ability to write your own custom evaluations for the use cases you care about. You can also use your own data to build private evaluations that represent common LLM patterns in your workflow without exposing any of that data publicly.
For detailed instructions on creating and running custom evaluations, please refer to the OpenAI Evals documentation.
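As a rough illustration of what a private evaluation might look like, the snippet below writes a samples file in the JSONL format that OpenAI Evals' basic match-style evals consume (a chat-formatted `input` plus an `ideal` answer). The question is a placeholder of our own, and the registry YAML entry that points at the file still needs to be added as described in the Evals documentation.

```python
# sketch: build a samples.jsonl file for a match-style OpenAI Evals evaluation
# each line pairs a chat-formatted prompt ("input") with the expected answer ("ideal")
import json
from pathlib import Path

samples = [  # placeholder item; replace with your own medical QA pairs
    {
        "input": [
            {"role": "system", "content": "Answer with a single letter (A-D)."},
            {"role": "user", "content": "Which vitamin deficiency causes scurvy?\n"
                                        "A. Vitamin A\nB. Vitamin B12\nC. Vitamin C\nD. Vitamin D"},
        ],
        "ideal": "C",
    },
]

out = Path("custom_eval/samples.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
print(f"Wrote {len(samples)} samples to {out}")
```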
This work is partially supported by the OpenAI Researcher Access Program and Microsoft Accelerate Foundation Models Research Program. Q.J. is supported by the NIH Intramural Research Program, National Library of Medicine. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
If you find this work useful for your research and applications, please cite using this BibTeX:
@misc{xie2024preliminarystudyo1medicine,
title={A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?},
author={Yunfei Xie and Juncheng Wu and Haoqin Tu and Siwei Yang and Bingchen Zhao and Yongshuo Zong and Qiao Jin and Cihang Xie and Yuyin Zhou},
year={2024},
eprint={2409.15277},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15277},
}