[BFCL] How to evaluate bitsandbytes 4bit and 8bit quantized models? #759
Comments
Hi @abdul-456, the current pipeline doesn't support quantized models, but there is a fairly straightforward patch you can apply to make BFCL work with them. It should look like this after the change:
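As a rough illustration of the idea (assuming the handler builds the `vllm serve` command as an argument list before launching it as a subprocess; the function and default arguments below are placeholders, not BFCL's exact code):

```python
# Illustrative sketch of the patched server launch in base_oss_handler.py.
# The real handler's structure may differ; the point is simply to append the
# two bitsandbytes flags to the `vllm serve` command it already builds.
import subprocess

def launch_vllm_server(model_path, num_gpus=1, gpu_memory_utilization=0.9):
    cmd = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(num_gpus),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        # Added for quantized checkpoints:
        "--quantization", "bitsandbytes",
        "--load-format", "bitsandbytes",
    ]
    return subprocess.Popen(cmd)
```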
And thank you for bringing this up. We will add a feature to support passing additional arguments to the vLLM server.
Hi @HuanzhiMao,

Thank you for your response and for the guidance on modifying base_oss_handler.py to support quantized models in BFCL. I have incorporated the suggested change by adding the --quantization bitsandbytes and --load-format bitsandbytes flags to the vLLM server command within the handler.

However, I am now encountering an issue when evaluating my model abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit with BFCL and the vLLM backend.

Error:
Traceback (most recent call last):

The vLLM server fails to start when invoked through BFCL. This is puzzling because serving the model directly with vLLM using the same bitsandbytes quantization flags works without any issues:

vllm serve abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit

but the failure appears when I run BFCL with the updated base_oss_handler.py as per your suggestion:

bfcl generate

Do you have any insights into why this might be happening? Is there an additional step or configuration needed within BFCL to properly support bitsandbytes-quantized models, especially for the Qwen architecture? Alternatively, is there a way to bypass or modify this check within vLLM when it is invoked by BFCL?

PS: This fix works for a Llama 3.1 8B 4-bit (bitsandbytes) quantized fine-tuned model but fails on the Qwen quantized fine-tuned model.

Best regards,
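One standalone way to narrow this down (not part of BFCL) is to load the same checkpoint through vLLM's Python API with the same bitsandbytes options; if this also fails, the problem lies in vLLM's Qwen + bitsandbytes path rather than in the modified handler. The small max_model_len below is only to keep the check quick:

```python
# Standalone diagnostic: load the quantized Qwen checkpoint directly via
# vLLM's Python API using the same options as the bitsandbytes CLI flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    gpu_memory_utilization=0.95,
    max_model_len=8192,  # small context, just for a quick load test
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```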
I am encountering an issue with evaluating Bitsandbytes 4-bit and 8-bit quantized models on the Berkeley Function Call Leaderboard (BFCL). I have successfully quantized my models using Bitsandbytes and am serving them locally using vLLM with the following command:
vllm serve quantized_model_hf_repo \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --gpu-memory-utilization 0.95 \
  --max-model-len 100000
While the models are served and accessible locally, I am unsure how to proceed with evaluating these quantized models on the BFCL leaderboard.
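As a quick sanity check that the served model is reachable (assuming vllm serve is exposing its default OpenAI-compatible endpoint at http://localhost:8000), something like the following lists the available model IDs:

```python
# List the models exposed by the local vLLM server.
# Assumes the default vllm serve address and port (http://localhost:8000).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)

print([m["id"] for m in payload["data"]])  # should include the quantized repo name
```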
Issue Details:
When I attempt to run the BFCL evaluation using the following command:
bfcl generate --model quantized_model_hf_repo \
  --test-category ast \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.95
I receive a KeyError related to some layer weights. This same issue was initially encountered when serving the model with vLLM. However, by adding the following flags to the vLLM serve command, I was able to serve the model successfully:
--quantization bitsandbytes --load-format bitsandbytes
Questions:
1. How can I evaluate these locally served, bitsandbytes-quantized models on the BFCL leaderboard?
2. Is there a supported way to pass the --quantization bitsandbytes and --load-format bitsandbytes flags through to the vLLM server that BFCL launches?