
[BFCL] How to evaluate bitsandbytes 4bit and 8bit quantized models? #759

Open
abdul-456 opened this issue Nov 14, 2024 · 2 comments


abdul-456 commented Nov 14, 2024

I am encountering an issue with evaluating Bitsandbytes 4-bit and 8-bit quantized models on the Berkeley Function Call Leaderboard (BFCL). I have successfully quantized my models using Bitsandbytes and am serving them locally using vLLM with the following command:

vllm serve quantized_model_hf_repo \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --gpu-memory-utilization 0.95 \
    --max-model-len 100000

While the models are served and accessible locally, I am unsure how to proceed with evaluating these quantized models on the BFCL leaderboard.

Issue Details:

When I attempt to run the BFCL evaluation using the following command:

bfcl generate --model quantized_model_hf_repo \
    --test-category ast \
    --backend vllm \
    --num-gpus 1 \
    --gpu-memory-utilization 0.95

I receive a KeyError related to some layer weights. This same issue was initially encountered when serving the model with vLLM. However, by adding the following flags to the vLLM serve command, I was able to serve the model successfully:

--quantization bitsandbytes --load-format bitsandbytes

Questions:

  1. Does BFCL support bitsandbytes quantized models?
  2. How can I modify the BFCL evaluation command or configuration to include the necessary quantization flags for vLLM?
@HuanzhiMao
Collaborator

Hi @abdul-456,

The current pipeline doesn't support quantized models, but there is a fairly straightforward patch you can apply to make BFCL work with them.
In base_oss_handler.py, this is the command we use to spin up the vLLM server (the SGLang path is similar). You can add your additional flags there.

After the change, it should look like this:

        if backend == "vllm":
            process = subprocess.Popen(
                [
                    "vllm",
                    "serve",
                    str(self.model_name_huggingface),
                    "--port",
                    str(VLLM_PORT),
                    "--dtype",
                    str(self.dtype),
                    "--tensor-parallel-size",
                    str(num_gpus),
                    "--gpu-memory-utilization",
                    str(gpu_memory_utilization),
                    "--trust-remote-code", 
                    "--quantization",
                    "bitsandbytes",
                    "--load-format",
                     "bitsandbytes",
                ],
                stdout=subprocess.PIPE,  # Capture stdout
                stderr=subprocess.PIPE,  # Capture stderr
                text=True,  # To get the output as text instead of bytes
            )

And thank you for bringing this up. We will add a feature to support additional arguments to the vllm server.
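
In the meantime, a possible stopgap (just a sketch, not something BFCL provides today) is to pull extra vLLM flags from an environment variable and append them to the subprocess command, so the handler doesn't need to be edited per model. The variable name VLLM_EXTRA_ARGS below is hypothetical:

    # Sketch only: VLLM_EXTRA_ARGS is a hypothetical variable, e.g.
    # VLLM_EXTRA_ARGS="--quantization bitsandbytes --load-format bitsandbytes"
    import os
    import shlex

    def build_vllm_command(model_name: str, port: int, num_gpus: int,
                           gpu_memory_utilization: float) -> list[str]:
        base = [
            "vllm",
            "serve",
            model_name,
            "--port",
            str(port),
            "--tensor-parallel-size",
            str(num_gpus),
            "--gpu-memory-utilization",
            str(gpu_memory_utilization),
            "--trust-remote-code",
        ]
        # Split the user-supplied flags the same way a shell would.
        extra_args = shlex.split(os.environ.get("VLLM_EXTRA_ARGS", ""))
        return base + extra_args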


abdul-456 commented Nov 15, 2024

Hi @HuanzhiMao,

Thank you for your response and for providing the guidance on modifying base_oss_handler.py to support quantized models in BFCL. I have incorporated the suggested changes by adding the --quantization bitsandbytes and --load-format bitsandbytes flags to the vLLM server command within the handler.

However, I'm now encountering an issue when trying to evaluate my model abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit using BFCL with the vLLM backend.

Error:

Traceback (most recent call last):
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
    self.driver_worker.load_model()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1058, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1148, in load_model
    self._load_weights(model_config, model)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1033, in _load_weights
    raise AttributeError(
AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet.

Traceback (most recent call last):
  File "/home/rapids/anaconda3/envs/BFCL/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
    args.dispatch_function(args)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
    uvloop.run(run_server(args))
  File "/home/rapids/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/rapids/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/rapids/anaconda3/envs/BFCL/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

The vLLM server fails to start when invoked through BFCL.

This is puzzling because serving the model directly with vLLM using the same Bitsandbytes quantization flags works without any issues:

vllm serve abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192

The BFCL command I run with the updated base_oss_handler.py, as per your suggestion, is:

bfcl generate \
    --model abdulmannan-01/qwen-2.5-7b-finetuned-for-json-generation-bnb-8bit \
    --test-category ast \
    --backend vllm \
    --num-gpus 1 \
    --gpu-memory-utilization 0.95

Do you have any insights into why this might be happening? Is there an additional step or configuration needed within BFCL to properly support Bitsandbytes quantized models, especially for the Qwen architecture? Alternatively, is there a way to bypass or modify this check within vLLM when used by BFCL?
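
One thing I still plan to check (a quick sketch, assuming vllm exposes a __version__ attribute) is whether the subprocess that BFCL spawns resolves the same vllm binary and version as the shell where the direct serve command succeeds, in case an environment mismatch explains the difference:

    # Sketch: compare the vllm binary/version visible to a Python subprocess
    # with the one used in the interactive shell.
    import shutil
    import subprocess
    import sys

    print("vllm binary:", shutil.which("vllm"))
    result = subprocess.run(
        [sys.executable, "-c", "import vllm; print(vllm.__version__)"],
        capture_output=True,
        text=True,
    )
    print("vllm version:", result.stdout.strip())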

PS: This fix works for Llama 3.1 8B 4-bit quantized (bitsandbytes) fine-tuned models but fails on Qwen quantized fine-tuned models.

Best regards,
