Benchmark, Toolbox, and Reflection-based Method for Clinical Agent


REFLECTOOL: Towards Reflection-Aware Tool-Augmented Clinical Agents

This is the official repository for "REFLECTOOL: Towards Reflection-Aware Tool-Augmented Clinical Agents"

[Webpage] [Paper] [Huggingface Dataset] [Leaderboard]

🔦 About

ClinicalAgent Bench

Although existing clinical agents handle diverse signal interactions well, each is tailored to a single clinical scenario and therefore generalizes poorly to broader applications. To evaluate clinical agents holistically, we propose ClinicalAgent Bench, a comprehensive medical agent benchmark comprising 18 tasks across five key realistic clinical dimensions.

ReflecTool

Building on this benchmark, we introduce ReflecTool, a novel framework that excels at utilizing domain-specific tools in two stages. ReflecTool searches its long-term memory, built during an optimization stage, for supportive successful demonstrations that guide the tool selection strategy; a verifier then improves the tool usage according to tool-wise experience via two verification methods, Iterative Refinement and Candidate Selection.
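As a rough illustration of the two verification strategies, the sketch below shows their control flow in the abstract. This is not the repository's implementation; `propose`, `refine_once`, and `verify` are hypothetical stand-ins for the agent's action generator and the tool-wise verifier.

```python
from typing import Callable


def candidate_selection(propose: Callable[[], str],
                        verify: Callable[[str], float],
                        k: int = 2) -> str:
    """Sample k candidate actions and keep the one the verifier scores highest."""
    candidates = [propose() for _ in range(k)]
    return max(candidates, key=verify)


def iterative_refinement(action: str,
                         refine_once: Callable[[str], str],
                         verify: Callable[[str], float],
                         k: int = 2) -> str:
    """Refine the action up to k times, keeping a revision only if it scores better."""
    for _ in range(k):
        revised = refine_once(action)
        if verify(revised) > verify(action):
            action = revised
    return action
```

Candidate Selection trades extra generation calls for a single verifier pass per candidate, while Iterative Refinement spends its budget revising one trajectory.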

πŸ† LeaderBoard

(The five dimension groups are Knowledge & Reasoning, Multimodal, Numerical Analysis, Data Understanding, and Trustworthiness; each group ends with its Avg. column. "-" indicates no result.)

| Methods | Type | Total | MedQA | MMLU | BioASQ | PubMedQA | Avg. | VQARAD | SLAKE | OmniMedQA | Avg. | MedCalc | EHRSQL | MIMIC-III | eICU | Avg. | MedMentions | emrQA | LongHealthQA | Avg. | MedHalt-Rht | MedVQA-Halt | EHR-Halt | LongHalt | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedLlama3-8b | LLM | 25.14 | 58.13 | 72.82 | 66.18 | 42.40 | 59.88 | - | - | - | - | 22.45 | 7.92 | 8.65 | 7.40 | 11.61 | 22.93 | 23.10 | - | 23.02 | 9.90 | - | 2.23 | - | 6.07 |
| Qwen2-7B | LLM | 38.01 | 54.04 | 69.51 | 71.68 | 48.20 | 60.86 | - | - | - | - | 14.42 | 18.09 | 19.70 | 21.73 | 18.49 | 18.38 | 39.86 | 75.25 | 44.50 | 14.11 | - | 3.97 | 66.50 | 28.19 |
| Llama3-8b | LLM | 35.62 | 56.32 | 70.34 | 72.82 | 53.80 | 63.32 | - | - | - | - | 28.08 | 17.02 | 12.53 | 21.83 | 19.87 | 28.54 | 41.92 | - | 35.23 | 29.22 | - | 18.90 | - | 24.06 |
| Llama3.1-8b | LLM | 42.46 | 65.20 | 76.58 | 74.92 | 53.00 | 67.43 | - | - | - | - | 37.44 | 11.35 | 17.42 | 22.07 | 22.07 | 32.08 | 42.41 | 74.25 | 49.58 | 27.78 | - | 3.97 | 60.50 | 30.75 |
| Qwen2-72B* | LLM | 48.76 | 71.25 | 84.48 | 82.85 | 53.00 | 72.90 | - | - | - | - | 32.19 | 23.98 | 33.95 | 34.15 | 31.07 | 29.20 | 42.89 | 79.75 | 50.61 | 31.56 | - | 31.30 | 58.50 | 40.45 |
| Llama3.1-70b* | LLM | 47.59 | 79.58 | 88.15 | 82.52 | 57.40 | 76.91 | - | - | - | - | 48.52 | 16.49 | 25.44 | 26.47 | 29.23 | 25.71 | 31.69 | 80.00 | 45.80 | 28.22 | - | 22.48 | 64.50 | 38.40 |
| GPT-3.5-turbo | LLM | 31.31 | 58.68 | 69.88 | 75.40 | 50.60 | 63.64 | - | - | - | - | 20.53 | 17.57 | 24.31 | 14.30 | 19.18 | 26.88 | 21.64 | - | 24.26 | 9.78 | - | 26.55 | - | 18.17 |
| MiniCPM-V-2.6 | MLLM | 29.28 | 46.58 | 61.16 | 70.23 | 47.20 | 56.29 | 48.78 | 47.12 | 73.70 | 56.53 | 13.28 | 1.61 | 1.63 | 1.88 | 4.60 | 18.92 | 17.42 | 5.25 | 13.86 | 12.44 | 36.89 | 8.91 | 2.25 | 15.12 |
| InternVL-Chat-V1.5 | MLLM | 37.02 | 50.82 | 65.56 | 64.89 | 30.40 | 52.92 | 49.67 | 41.47 | 68.50 | 53.21 | 18.91 | 17.99 | 18.05 | 20.83 | 18.95 | 26.47 | 42.53 | * | 34.50 | 25.78 | 50.78 | 0.00 | * | 25.52 |
| HuatuoGPT-Vision-7B | MLLM | 41.11 | 50.43 | 66.12 | 73.30 | 54.00 | 60.96 | 53.65 | 52.97 | 92.10 | 66.24 | 13.56 | 4.39 | 9.27 | 9.76 | 9.25 | 16.74 | 38.44 | 73.00 | 42.73 | 14.33 | 23.44 | 2.42 | 65.25 | 26.36 |
| HuatuoGPT-Vision-34B | MLLM | 41.31 | 54.83 | 72.36 | 73.79 | 48.00 | 62.25 | 56.76 | 53.72 | 91.50 | 67.33 | 25.79 | 8.14 | 9.15 | 9.79 | 13.22 | 16.34 | 41.68 | * | 29.01 | 28.44 | 43.77 | 32.07 | * | 34.76 |
| GPT-4o-mini | MLLM | 52.65 | 76.90 | 85.67 | 82.84 | 49.20 | 73.65 | 50.47 | 46.47 | 59.20 | 52.05 | 50.43 | 21.73 | 28.07 | 18.19 | 29.61 | 31.66 | 40.00 | 78.50 | 50.05 | 62.11 | 53.78 | 45.74 | 70.00 | 57.91 |
| COT Qwen2-7b | Agent | 39.94 | 52.47 | 69.97 | 72.21 | 41.00 | 58.91 | - | - | - | - | 19.10 | 16.06 | 23.81 | 20.95 | 19.98 | 22.83 | 19.92 | 65.75 | 36.17 | 45.56 | - | 31.49 | 57.00 | 44.68 |
| ReAct Qwen2-7b | Agent | 42.73 | 51.61 | 67.68 | 80.24 | 48.60 | 62.03 | 35.92 | 39.59 | 72.90 | 49.47 | 18.62 | 18.52 | 25.06 | 34.00 | 24.05 | 22.19 | 24.04 | 41.50 | 29.24 | 49.89 | 35.33 | 61.49 | 48.75 | 48.87 |
| CRITIC Qwen2-7b | Agent | 43.97 | 52.87 | 58.68 | 71.68 | 43.20 | 56.61 | 48.12 | 42.70 | 70.80 | 53.87 | 13.09 | 23.55 | 28.20 | 33.12 | 24.49 | 25.47 | 36.33 | 50.25 | 37.35 | 30.44 | 28.33 | 57.64 | 73.75 | 47.54 |
| Reflexion Qwen2-7b | Agent | 45.25 | 51.78 | 66.48 | 74.60 | 50.80 | 60.92 | 45.68 | 47.97 | 77.20 | 56.95 | 13.37 | 17.56 | 22.16 | 30.23 | 20.83 | 30.30 | 28.92 | 53.00 | 37.41 | 50.55 | 36.33 | 62.91 | 50.75 | 50.14 |
| ReflecTool (Iterative Refinement, k=2) Qwen2-7b | Agent | 49.37 | 50.12 | 65.47 | 76.37 | 63.20 | 63.79 | 53.88 | 45.71 | 82.90 | 60.83 | 24.68 | 24.20 | 16.92 | 22.08 | 21.97 | 43.69 | 50.27 | 61.00 | 51.65 | 54.44 | 37.99 | 56.73 | 45.25 | 48.60 |
| ReflecTool (Candidates Selection, k=2) Qwen2-7b | Agent | 49.08 | 50.81 | 64.84 | 72.98 | 62.60 | 62.81 | 56.76 | 49.76 | 79.20 | 61.91 | 24.64 | 29.87 | 25.13 | 27.48 | 26.78 | 59.21 | 43.40 | 54.00 | 52.20 | 53.44 | 34.21 | 55.97 | 23.25 | 41.72 |
| COT Qwen2-72b | Agent | 50.58 | 72.51 | 85.45 | 81.07 | 37.40 | 69.11 | - | - | - | - | 29.89 | 18.73 | 23.55 | 25.72 | 24.47 | 24.43 | 52.09 | 81.00 | 52.51 | 57.11 | - | 40.79 | 70.75 | 56.22 |
| ReAct Qwen2-72b | Agent | 53.31 | 72.43 | 85.67 | 85.38 | 62.40 | 76.47 | 50.09 | 46.01 | 73.00 | 56.37 | 26.46 | 35.97 | 31.45 | 31.87 | 31.44 | 42.11 | 55.02 | 62.75 | 53.29 | 56.89 | 24.75 | 55.78 | 58.50 | 48.98 |
| CRITIC Qwen2-72b | Agent | 52.35 | 71.85 | 85.31 | 87.86 | 51.00 | 74.01 | 40.58 | 50.80 | 73.50 | 54.96 | 22.44 | 35.48 | 32.47 | 33.28 | 30.92 | 33.60 | 56.36 | 75.50 | 55.15 | 51.77 | 24.22 | 52.03 | 58.75 | 46.69 |
| Reflexion Qwen2-72b | Agent | 56.37 | 70.78 | 84.30 | 87.06 | 65.00 | 76.79 | 54.99 | 50.05 | 77.80 | 60.95 | 22.45 | 40.14 | 31.94 | 33.42 | 31.99 | 52.37 | 58.73 | 64.00 | 58.37 | 59.00 | 33.00 | 59.00 | 64.00 | 53.75 |
| ReflecTool (Iterative Refinement, k=2) Qwen2-72b | Agent | 59.43 | 73.30 | 84.11 | 84.63 | 65.20 | 76.81 | 57.21 | 48.82 | 85.20 | 63.74 | 36.01 | 47.43 | 33.96 | 36.39 | 38.45 | 54.57 | 67.96 | 68.00 | 63.51 | 59.55 | 38.21 | 57.82 | 63.00 | 54.65 |
| ReflecTool (Candidates Selection, k=2) Qwen2-72b | Agent | 59.66 | 71.37 | 84.00 | 83.50 | 66.20 | 76.27 | 57.66 | 48.54 | 81.90 | 62.70 | 36.77 | 49.89 | 31.20 | 34.38 | 38.06 | 60.51 | 66.61 | 66.50 | 64.54 | 60.77 | 39.63 | 59.78 | 66.75 | 56.73 |

💫 News

  • 🔥 [2024/10/24] We release the ClinicalAgent Bench, comprising 18 tasks across five capability dimensions!
  • 🔥 [2024/10/24] We release the code implementation of ReflecTool!

💡 Installation

Environment

conda create -n reflectool python=3.10.8
conda activate reflectool
# install torch
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
# install java
conda install -c conda-forge openjdk=21.0.4

Dependency

git clone https://github.com/BlueZeros/ReflecTool.git
cd ReflecTool
pip install -e .

ToolBox

You can find details of the tool installation here.

📔 Dataset

Download the ClinicalAgentBench dataset from here and put it into the temp folder. ClinicalAgentBench has the structure below:

ReflecTool/
β”œβ”€β”€ ClinicalAgentBench/
β”‚   β”œβ”€β”€ ablations/ # subset used for analysis
β”‚   β”œβ”€β”€ train/ # dataset used for optimization stage
β”‚   β”œβ”€β”€ test/ # dataset used for evaluation
β”‚   β”œβ”€β”€ memory/ # few-shot samples, long-term memory and tool experience
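After downloading, a small helper along these lines (hypothetical, not shipped with the repo) can sanity-check that the expected sub-folders from the tree above are in place:

```python
from pathlib import Path

# Sub-folders expected under ClinicalAgentBench/, per the layout above.
EXPECTED = ["ablations", "train", "test", "memory"]


def missing_subdirs(root: str) -> list:
    """Return the expected ClinicalAgentBench sub-folders absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_dir()]
```

For example, `missing_subdirs("./ClinicalAgentBench")` should return an empty list once the dataset is unpacked correctly.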

🔬 Supported Models

The supported model list can be found in model_config. Add the corresponding model folder path before using a model. The types of supported models are shown below:

  • Local Models: Hugging Face models loaded with AutoModelForCausalLM. vLLM acceleration is also supported for these models (used automatically if vllm is installed).
    • Llama3
    • Llama3.1
    • Qwen1.5
    • Qwen2
    • Qwen2.5
  • OpenAI Models: OpenAI Model list.
    • GPT-3.5-turbo
    • GPT-4o-mini
    • GPT-4o
    • Note: you need to set the environment variable OPENAI_API_KEY={your_openai_api_key}.
  • Multimodal LLMs: HuatuoGPT-Vision, MiniCPM, and InternVL.
    • HuatuoGPT-Vision
    • MiniCPM-V-2.6
    • InternVL-Chat-V1.5
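The three model categories above imply a simple dispatch: API-key check for OpenAI models, a multimodal path for vision models, and local loading for the rest. The helper below is a hypothetical illustration of that routing (`resolve_backend` and the prefix lists are not part of the package):

```python
import os

OPENAI_MODELS = {"gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"}
MULTIMODAL_PREFIXES = ("huatuogpt-vision", "minicpm", "internvl")
LOCAL_PREFIXES = ("llama3", "qwen")  # covers Llama3/3.1 and Qwen1.5/2/2.5 variants


def resolve_backend(model: str) -> str:
    """Map a model name to the backend category described in the README."""
    name = model.lower()
    if name in OPENAI_MODELS:
        if "OPENAI_API_KEY" not in os.environ:
            raise EnvironmentError("Set OPENAI_API_KEY before using OpenAI models.")
        return "openai"
    if any(name.startswith(p) for p in MULTIMODAL_PREFIXES):
        return "multimodal"
    if any(name.startswith(p) for p in LOCAL_PREFIXES):
        return "local"
    raise ValueError(f"Unsupported model: {model}")
```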

πŸ“ Evaluation

Baselines

Models
DATA_PATH=./ClinicalAgentBench
TASK_PATH=test

OUTPUT_PATH=./results/${TASK_PATH}
MEMORY_PATH=${DATA_PATH}/memory

DATASET=medqa
MODEL=qwen2-7b

EXP_NAME=${MODEL}
mkdir -p ${OUTPUT_PATH}/${DATASET}/${EXP_NAME}

python run.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK_PATH} \
    --test-split ${DATASET} \
    --exp-name ${EXP_NAME} \
    --model ${MODEL} \
    --log-print \
    --prompt-debug \
    --resume
Reflexion Agent
DATA_PATH=./ClinicalAgentBench
TASK_PATH=test

OUTPUT_PATH=./results/${TASK_PATH}
MEMORY_PATH=${DATA_PATH}/memory

DATASET="medqa"
MODEL=qwen2-7b
ACTIONS="all_wo_mm"
FEWSHOT=1

EXP_NAME=reflexion_${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}
mkdir -p ${OUTPUT_PATH}/${DATASET}/${EXP_NAME}

python run.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK_PATH} \
    --test-split ${DATASET} \
    --exp-name ${EXP_NAME} \
    --agent "reflexion" \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --memory-path ${MEMORY_PATH} \
    --memory-type "reflexion_standard" \
    --few-shot ${FEWSHOT} \
    --resume
CRITIC Agent
DATA_PATH=./ClinicalAgentBench
TASK_PATH=test

OUTPUT_PATH=./results/${TASK_PATH}
MEMORY_PATH=${DATA_PATH}/memory

DATASET="medqa"
MODEL=qwen2-72b-int4

ACTIONS="all_wo_mm"
FEWSHOT=1

EXP_NAME=critic_${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}
mkdir -p ${OUTPUT_PATH}/${DATASET}/${EXP_NAME}

python run.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK_PATH} \
    --test-split ${DATASET} \
    --exp-name ${EXP_NAME} \
    --agent "critic" \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --memory-path ${MEMORY_PATH} \
    --memory-type "critic_standard" \
    --few-shot ${FEWSHOT} \
    --resume \
    --log-print \
    --prompt-debug

ReflecTool

Optimization

Optimization on the medqa dataset (Knowledge task example)
DATA_PATH=./ClinicalAgentBench
TASK_PATH=train

OUTPUT_PATH=./results/${TASK_PATH}
MEMORY_PATH=${DATA_PATH}/my_memory

domain=medqa
MODEL=qwen2-7b
ACTIONS="all_wo_mm" # tool type; note that the `all` and `mm` action sets require 2 GPUs
FEWSHOT=0 # number of few-shot demonstrations

EXP_NAME=train-${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}
mkdir -p ${OUTPUT_PATH}/${domain}/${EXP_NAME}

python train.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK_PATH} \
    --test-split ${domain} \
    --exp-name ${EXP_NAME} \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --memory-path ${MEMORY_PATH} \
    --max-exec-steps 15 \
    --few-shot ${FEWSHOT} \
    --write-memory \
    --memory-type task \
    --log-print \
    --resume
Optimization on the slake dataset (MultiModal task example)
DATA_PATH=./ClinicalAgentBench
TASK_PATH=train

OUTPUT_PATH=./results/${TASK_PATH}
MEMORY_PATH=${DATA_PATH}/my_memory

domain=slake
MODEL=qwen2-7b
ACTIONS="all" # tool type; note that the `all` and `mm` action sets require 2 GPUs
FEWSHOT=0 # number of few-shot demonstrations

EXP_NAME=train-${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}
mkdir -p ${OUTPUT_PATH}/${domain}/${EXP_NAME}

python train.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK_PATH} \
    --test-split ${domain} \
    --exp-name ${EXP_NAME} \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --memory-path ${MEMORY_PATH} \
    --max-exec-steps 15 \
    --few-shot ${FEWSHOT} \
    --write-memory \
    --memory-type task \
    --log-print \
    --resume
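Conceptually, the `--write-memory` flag in the optimization scripts accumulates successful trajectories into per-task long-term memory for later retrieval. A simplified sketch of that bookkeeping is below; the file layout and field names are illustrative only, not the repository's actual memory format:

```python
import json
from pathlib import Path


def write_memory(memory_dir: str, task: str, question: str,
                 trajectory: list, success: bool) -> None:
    """Append a solved example to a per-task long-term memory file."""
    if not success:
        return  # only successful demonstrations are stored as guidance
    path = Path(memory_dir) / f"{task}.json"
    records = json.loads(path.read_text()) if path.exists() else []
    records.append({"question": question, "trajectory": trajectory})
    path.write_text(json.dumps(records, indent=2))
```

At inference time, the agent would then search these records for demonstrations similar to the current question to guide tool selection.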

Inference

Evaluation on medqa dataset with ReflecTool (Iterative Refinement)
DATA_PATH=./ClinicalAgentBench
TASK=test

OUTPUT_PATH=./results/${TASK}

DATASET=medqa
MODEL=qwen2-7b
ACTIONS="all_wo_mm" # tool type; note that the `all` and `mm` action sets require 2 GPUs
FEWSHOT=1 # number of few-shot demonstrations
MEMORY_PATH=${DATA_PATH}/memory/task/long_term_memory/${DATASET}

EXP_NAME=reflectool_refine-${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}-task_memory
mkdir -p ${OUTPUT_PATH}/${TASK}/${DATASET}/${EXP_NAME}

python run.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK} \
    --test-split ${DATASET} \
    --exp-name ${EXP_NAME} \
    --agent "reflectool" \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --action-guide-path ${DATA_PATH}/memory/task/tool_experience/${DATASET}.json \
    --memory-path ${MEMORY_PATH} \
    --force-action \
    --action-search "refine" \
    --few-shot ${FEWSHOT} \
    --log-print \
    --prompt-debug \
    --resume
Evaluation on medqa dataset with ReflecTool (Candidate Selection)
DATA_PATH=./ClinicalAgentBench
TASK=test

OUTPUT_PATH=./results/${TASK}

DATASET=medqa
MODEL=qwen2-7b
ACTIONS="all_wo_mm" # tool type; note that the `all` and `mm` action sets require 2 GPUs
FEWSHOT=1 # number of few-shot demonstrations
MEMORY_PATH=${DATA_PATH}/memory/task/long_term_memory/${DATASET}

EXP_NAME=reflectool_select-${MODEL}-few_shot_${FEWSHOT}-${ACTIONS}-task_memory
mkdir -p ${OUTPUT_PATH}/${TASK}/${DATASET}/${EXP_NAME}

python run.py \
    --data-path ${DATA_PATH} \
    --output-path ${OUTPUT_PATH} \
    --task-name ${TASK} \
    --test-split ${DATASET} \
    --exp-name ${EXP_NAME} \
    --agent "reflectool" \
    --model ${MODEL} \
    --actions ${ACTIONS} \
    --action-guide-path ${DATA_PATH}/memory/task/tool_experience/${DATASET}.json \
    --memory-path ${MEMORY_PATH} \
    --force-action \
    --action-search "select" \
    --few-shot ${FEWSHOT} \
    --log-print \
    --prompt-debug \
    --resume
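When sweeping several datasets or models, the `run.py` invocations above can be assembled programmatically. The flags in this sketch simply mirror the evaluation scripts in this section; `run.py`'s actual CLI remains the authoritative reference:

```python
def build_run_cmd(data_path: str, output_path: str, task: str, dataset: str,
                  model: str, actions: str, search: str, few_shot: int = 1) -> list:
    """Assemble the argv for a ReflecTool evaluation run (flags mirror the README scripts)."""
    exp_name = f"reflectool_{search}-{model}-few_shot_{few_shot}-{actions}-task_memory"
    return [
        "python", "run.py",
        "--data-path", data_path,
        "--output-path", output_path,
        "--task-name", task,
        "--test-split", dataset,
        "--exp-name", exp_name,
        "--agent", "reflectool",
        "--model", model,
        "--actions", actions,
        "--action-guide-path", f"{data_path}/memory/task/tool_experience/{dataset}.json",
        "--memory-path", f"{data_path}/memory/task/long_term_memory/{dataset}",
        "--force-action",
        "--action-search", search,  # "refine" or "select"
        "--few-shot", str(few_shot),
        "--resume",
    ]
```

The returned list can be passed to `subprocess.run` once per dataset in a sweep.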

📚 Quick Start

ReAct Agent

  • Question Answering
from reflectool.agents.TaskAgent import TaskAgent
from reflectool.commons.TaskPackage import TaskPackage
from reflectool.utilities import parse_args


args = parse_args()
args.model = "gpt-4o-mini"
Agent = TaskAgent(args)

task = TaskPackage(
    inputs="Can you explain fever to me?",
    instruction="",
)
print(Agent(task))
  • Visual Question Answering
from reflectool.agents.TaskAgent import TaskAgent
from reflectool.commons.TaskPackage import TaskPackage
from reflectool.utilities import parse_args


args = parse_args()
args.model = "gpt-4o-mini"
Agent = TaskAgent(args)

task = TaskPackage(
    inputs="Can you describe the image?",
    instruction="",
    multimodal_inputs={
      "image": "{path to the image}",
    }
)
print(Agent(task))

ReflecTool Agent

  • Question Answering
from reflectool.agents.ReflecToolAgent import ReflecToolAgent
from reflectool.commons.TaskPackage import TaskPackage
from reflectool.utilities import parse_args


args = parse_args()
args.model = "gpt-4o-mini"
args.action_search = "refine" # verification method: "refine" or "select"
Agent = ReflecToolAgent(args)

task = TaskPackage(
    inputs="Can you explain fever to me?",
    instruction="",
)
print(Agent(task))

🪶 Acknowledgements

Thanks to the codebases we built upon.

Citation

@misc{liao2024reflectoolreflectionawaretoolaugmentedclinical,
      title={ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents}, 
      author={Yusheng Liao and Shuyang Jiang and Yanfeng Wang and Yu Wang},
      year={2024},
      eprint={2410.17657},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.17657}, 
}
