Add openfunctionv2 model inference script and fix minor bug #360

Open · wants to merge 5 commits into main

Conversation

@JasonZhu1313 (Contributor) commented Apr 15, 2024

Summary

This PR introduces a new model handler, openfunctions_handler.py, to run inference on the open-source model gorilla-llm/gorilla-openfunctions-v2 and reproduce its results on the leaderboard.

Issue: #352

Changes

  • Merged the input data into a single file, since that is what the evaluation script expects; the data is under /gorilla/berkeley-function-call-leaderboard/data/BFCL/questions.json
  • Fixed a minor bug in utils.py: the return of the converted function object was mistakenly placed inside the for loop; it should sit outside the loop (a minimal sketch of this fix follows this list)
  • Added a new OpenfunctionsHandler to the handler map
  • Added openfunctions_handler.py with a prompt template and decoding step compatible with the openfunctions-v2 model
  • Added a simple handler_runner.py to run inference and save the results for evaluation
  • Added instructions to readme.md
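
For reference, a minimal sketch of the kind of fix described in the utils.py bullet above; the function and field names are illustrative, not the actual BFCL code:

# Hypothetical sketch of the utils.py fix; names are illustrative.
def convert_functions(raw_functions):
    converted = []
    for func in raw_functions:
        converted.append({
            "name": func["name"],
            "parameters": func.get("parameters", {}),
        })
        # Bug: a return placed here exits after the first iteration,
        # so only the first function is ever converted.
    # Fix: return after the loop so every function is included.
    return converted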

Test

  1. Generate the inference result with vLLM:
python model_handler/handler_runner.py --data-path /home/jobuser/gorilla/berkeley-function-call-leaderboard/data/gorilla_openfunctions_v1_test_all.json --model-name gorilla-llm/gorilla-openfunctions-v2 --model-path {PATH_TO_MODEL}/gorilla-openfunctions-v2/

Result: https://www.toptal.com/developers/paste-gd/GLTJTe4l
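
For context, a rough sketch of what a handler_runner-style vLLM batch-inference step might look like; the prompt construction, data layout, and output schema here are assumptions, not the actual handler_runner.py:

# Rough sketch of a vLLM batch-inference runner (assumptions: JSON-lines
# input with a "question" field, plain-text prompts, JSON-lines output).
import json
from vllm import LLM, SamplingParams

def run_inference(data_path, model_path, out_path):
    with open(data_path) as f:
        questions = [json.loads(line) for line in f]

    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.0, max_tokens=512)

    # The real handler would prepend the function definitions using the
    # openfunctions-v2 prompt template before generating.
    prompts = [q["question"] for q in questions]
    outputs = llm.generate(prompts, params)

    with open(out_path, "w") as f:
        for q, out in zip(questions, outputs):
            f.write(json.dumps({"id": q.get("id"), "result": out.outputs[0].text}) + "\n")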

  2. Generate the eval result with eval_runner

Generate the evaluation for AST-based metrics:

python {PATH_TO_REPO}/gorilla/berkeley-function-call-leaderboard/eval_checker/eval_runner.py --model gorilla-llm/gorilla-openfunctions-v2 --skip-api-sanity-check --test-category simple sql relevance parallel_multiple_function parallel_function multiple_function

🦍 Model: gorilla-llm_gorilla-openfunctions-v2
2024-04-15 18:37:42,909 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
🔍 Running test: parallel_multiple_function
✅ Test completed: parallel_multiple_function. 🎯 Accuracy: 0.695
2024-04-15 18:37:43,761 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
2024-04-15 18:37:43,774 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: parallel_function
✅ Test completed: parallel_function. 🎯 Accuracy: 0.845
2024-04-15 18:37:43,813 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: simple
(raylet) /home/jobuser/build/openconnect-lib-core-image/environments/satellites/python/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:56: UserWarning: gpustat package is not installed. GPU monitoring is not available. To have full functionality of the dashboard please install pip install ray[default].)
(raylet) warnings.warn(
✅ Test completed: simple. 🎯 Accuracy: 0.845
2024-04-15 18:37:43,854 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: relevance
✅ Test completed: relevance. 🎯 Accuracy: 0.6875
2024-04-15 18:37:43,876 INFO worker.py:1454 -- Calling ray.init() again after it has already been called.
🔍 Running test: multiple_function
✅ Test completed: multiple_function. 🎯 Accuracy: 0.935

  3. Leaderboard scores

Rank,Overall Acc,Model,Model Link,Organization,License,AST Summary,Exec Summary,Simple Function AST,Python Simple Function AST,Java Simple Function AST,JavaScript Simple Function AST,Multiple Functions AST,Parallel Functions AST,Parallel Multiple AST,Simple Function Exec,Python Simple Function Exec,REST Simple Function Exec,Multiple Functions Exec,Parallel Functions Exec,Parallel Multiple Exec,Relevance Detection,Cost ($ Per 1k Function Calls),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s)
1,80.48%,Gorilla-OpenFunctions-v2 (FC) from HuggingFace,https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2,Gorilla LLM,Apache 2.0,83.00%,0.00%,84.50%,84.50%,0.00%,0.00%,93.50%,84.50%,69.50%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,68.75%,N/A,N/A,N/A,N/A
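
As a sanity check, the AST Summary column above is consistent with an unweighted mean of the four AST category accuracies from the run (this assumes a simple average; the leaderboard's actual aggregation may differ):

# Unweighted mean of the four AST category accuracies reported above.
ast_scores = [0.845, 0.935, 0.845, 0.695]  # simple, multiple, parallel, parallel_multiple
print(sum(ast_scores) / len(ast_scores))   # 0.83 -> matches the 83.00% AST Summary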

@Fanjia-Yan (Contributor) commented

Hi Jason, at the outset, thank you so much for taking the time to review and modify our codebase. We appreciate your feedback and are actively working on verifying the result and merging the code.

Here is a list of actionable items we are going to take in the next few days:

  1. We have a script eval_data_compilation.py that compiles all the data pulled from Hugging Face into a single file for vLLM batch inference. We are open to this change, as it allows users to set up the inference pipeline easily using the Hugging Face models (a rough sketch of this kind of compilation step follows this list). Here is what we plan to do:
    • Remove all data from the ./data folder and put the 2000-line data file under HuggingFace.
    • Modify apply_function_credential_config.py such that it will also apply credentials to the new data file.
  2. We welcome the idea of handler_runner.py. It is a simpler interface for locally deploying models to run our evaluation. We would like to have a single reference point for model result generation; in other words, users should only need to call openfunctions_evaluation.py to accomplish data generation. Here is what we can do:
    • We will merge the handler_runner.py content into openfunctions_evaluation.py, so that
      python model_handler/handler_runner.py --data-path /home/jobuser/gorilla/berkeley-function-call-leaderboard/data/gorilla_openfunctions_v1_test_all.json --model-name gorilla-llm/gorilla-openfunctions-v2 --model-path {PATH_TO_MODEL}/gorilla-openfunctions-v2/
      becomes
      python model_handler/openfunctions_evaluation.py --model gorilla-llm/gorilla-openfunctions-v2
  3. In eval_runner_helper.py, the model_name → model_name_escaped change will introduce inconsistencies. The MODEL_METADATA_MAPPING substitutes '_' with '/' in the model name and maps each raw model name to the one displayed on the website. We are going to revert that change before merging.
  4. Regarding your inference result, we will run a local evaluation based on your modifications and check if our result matches. We will also check if the current handler matches what we have in our backend. We will respond to that soon with more information.
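
As mentioned in item 1, a rough sketch of the kind of per-category compilation step eval_data_compilation.py performs; the glob pattern and output path are assumptions, not the actual script:

# Rough sketch: merge per-category question files into one JSON-lines file.
# The file pattern and output path are assumptions, not the actual
# eval_data_compilation.py.
import glob

def compile_questions(data_dir, out_path):
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(f"{data_dir}/gorilla_openfunctions_v1_test_*.json")):
            with open(path) as f:
                for line in f:  # each source file is JSON lines
                    if line.strip():
                        out.write(line if line.endswith("\n") else line + "\n")

compile_questions("./data", "./data/BFCL/questions.json")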

Again, thank you very much for improving our code quality and spotting issues. Looking forward to collaborating on this.

BFCL Team
