
[BFCL] How to evaluate bitsandbytes 4bit and 8bit quantized models? #759

Open
abdul-456 opened this issue Nov 14, 2024 · 2 comments


abdul-456 commented Nov 14, 2024

I am encountering an issue with evaluating Bitsandbytes 4-bit and 8-bit quantized models on the Berkeley Function Call Leaderboard (BFCL). I have successfully quantized my models using Bitsandbytes and am serving them locally using vLLM with the following command:

vllm serve quantized_model_hf_repo \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --gpu-memory-utilization 0.95 \
    --max-model-len 100000

While the models are served and accessible locally, I am unsure how to proceed with evaluating these quantized models on the BFCL leaderboard.

Issue Details:

When I attempt to run the BFCL evaluation using the following command:

bfcl generate --model quantized_model_hf_repo \
    --test-category ast \
    --backend vllm \
    --num-gpus 1 \
    --gpu-memory-utilization 0.95

I receive a KeyError related to some layer weights. This same issue was initially encountered when serving the model with vLLM. However, by adding the following flags to the vLLM serve command, I was able to serve the model successfully:

--quantization bitsandbytes --load-format bitsandbytes

Questions:

  1. Does BFCL support bitsandbytes quantized models?
  2. How can I modify the BFCL evaluation command or configuration to include the necessary quantization flags for vLLM?
@HuanzhiMao
Collaborator

Hi @abdul-456,

The current pipeline doesn't support quantized models, but there is a fairly straightforward patch you can apply to make BFCL work with them.
In base_oss_handler.py, this is the command we use to spin up the vLLM server (the SGLang path is similar). You can add your additional flags there.

After the change, it should look like this:

        if backend == "vllm":
            process = subprocess.Popen(
                [
                    "vllm",
                    "serve",
                    str(self.model_name_huggingface),
                    "--port",
                    str(VLLM_PORT),
                    "--dtype",
                    str(self.dtype),
                    "--tensor-parallel-size",
                    str(num_gpus),
                    "--gpu-memory-utilization",
                    str(gpu_memory_utilization),
                    "--trust-remote-code", 
                    "--quantization",
                    "bitsandbytes",
                    "--load-format",
                     "bitsandbytes",
                ],
                stdout=subprocess.PIPE,  # Capture stdout
                stderr=subprocess.PIPE,  # Capture stderr
                text=True,  # To get the output as text instead of bytes
            )

And thank you for bringing this up. We will add a feature to support additional arguments to the vllm server.
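
In the meantime, a possible stopgap (just a sketch, not something BFCL provides today) is to pull extra vLLM flags from an environment variable and append them to the subprocess command, so the handler doesn't need to be edited per model. The variable name VLLM_EXTRA_ARGS below is hypothetical:

    # Sketch only: VLLM_EXTRA_ARGS is a hypothetical variable, e.g.
    # VLLM_EXTRA_ARGS="--quantization bitsandbytes --load-format bitsandbytes"
    import os
    import shlex

    def build_vllm_command(model_name: str, port: int, num_gpus: int,
                           gpu_memory_utilization: float) -> list[str]:
        base = [
            "vllm",
            "serve",
            model_name,
            "--port",
            str(port),
            "--tensor-parallel-size",
            str(num_gpus),
            "--gpu-memory-utilization",
            str(gpu_memory_utilization),
            "--trust-remote-code",
        ]
        # Split the user-supplied flags the same way a shell would.
        extra_args = shlex.split(os.environ.get("VLLM_EXTRA_ARGS", ""))
        return base + extra_args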


abdul-456 commented Nov 15, 2024

Hi @HuanzhiMao,

Thank you for your response and for providing the guidance on modifying base_oss_handler.py to support quantized models in BFCL. I have incorporated the suggested changes by adding the --quantization bitsandbytes and --load-format bitsandbytes flags to the vLLM server command within the handler.

However, I'm now encountering an issue when trying to evaluate my model abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit using BFCL with the vLLM backend.

Error:

Traceback (most recent call last):
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
    self.driver_worker.load_model()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1058, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1148, in load_model
    self._load_weights(model_config, model)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1033, in _load_weights
    raise AttributeError(
AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet.

Traceback (most recent call last):
  File "/home/rapids/anaconda3/envs/BFCL/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
    args.dispatch_function(args)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
    uvloop.run(run_server(args))
  File "/home/rapids/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/rapids/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

The vLLM server fails to start when invoked through BFCL.

This is puzzling because serving the model directly with vLLM using the same Bitsandbytes quantization flags works without any issues:

vllm serve abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192

The BFCL command I run with the updated base_oss_handler.py, as per your suggestion, is:

bfcl generate \
    --model abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit \
    --test-category ast \
    --backend vllm \
    --num-gpus 1 \
    --gpu-memory-utilization 0.95

Do you have any insights into why this might be happening? Is there an additional step or configuration needed within BFCL to properly support Bitsandbytes quantized models, especially for the Qwen architecture? Alternatively, is there a way to bypass or modify this check within vLLM when used by BFCL?
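
One thing I still plan to check (a quick sketch, assuming vllm exposes a __version__ attribute) is whether the subprocess that BFCL spawns resolves the same vllm binary and version as the shell where the direct serve command succeeds, in case an environment mismatch explains the difference:

    # Sketch: compare the vllm binary/version visible to a Python subprocess
    # with the one used in the interactive shell.
    import shutil
    import subprocess
    import sys

    print("vllm binary:", shutil.which("vllm"))
    result = subprocess.run(
        [sys.executable, "-c", "import vllm; print(vllm.__version__)"],
        capture_output=True,
        text=True,
    )
    print("vllm version:", result.stdout.strip())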

PS: This fix works for Llama 3.1 8B 4-bit quantized (bitsandbytes) fine-tuned models but fails on Qwen quantized fine-tuned models.

Best regards,
