
Issue with Token Latency Reporting for Second Token in OpenVINO GenAI LLM Models #1277

Open
manojrashinkar opened this issue Nov 29, 2024 · 2 comments

manojrashinkar commented Nov 29, 2024

There has been a recent change in the OpenVINO GenAI Git repository. When I execute LLM models, I get a value for the first token latency, but the second token latency comes up as N/A. Is there a fix for this?

STEPS:

PATH: https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/llm_bench

  • Install dependencies from the requirements.txt file:

    pip install -r tools/llm_bench/requirements.txt

  • Download or place the model (e.g., llama-2-7b-chat) in the models/ directory.
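
    One common way to obtain the model in OpenVINO IR format is the optimum-intel exporter (a sketch; the Hugging Face model id and output path below are examples, and optimum-intel is assumed to be installed, e.g. via the requirements above):

      optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf models/llama-2-7b-chat/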

  • Prepare the prompt file (e.g., llama-2-7b-chat_l.jsonl) and place it in the prompts/ directory.
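
    For reference, the sample prompt files under tools/llm_bench/prompts hold one JSON object per line; a minimal sketch of that shape, assuming the prompt key used by the sample files (the prompt text here is illustrative):

      {"prompt": "What is OpenVINO?"}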

  • Run the benchmark with the following command:

    python benchmark.py -m <model_path> -d <device> -r <report_csv> -f <framework> -p <prompt_text> -n <num_iters>

  • Example command:

    python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2 -d GPU -r benchmark_results.csv -f openvino

  • After running the benchmark, the second token latency comes up as N/A. Is there any fix for this?

@Wan-Intel

I've created a virtual environment with Python 3.10.12, cloned the openvino.genai repository, and ran the following command:
python benchmark.py -m ~/waifook_temp/openvino_notebooks/notebooks/llm-chatbot/llama-2-chat-7b/FP16 -p "how are you" -n 2 -d CPU

I'm able to obtain second token latency. The result is shown as follows:
[screenshot: benchmark output showing first and second token latencies]

May I know which operating system you are using on your machine? Do you encounter the issue when using a prompt text instead of a prompt file? Could you please share the prompt file with us so we can replicate the issue on our end?

@manojrashinkar (Author)

Prompt-file : https://github.com/openvinotoolkit/openvino.genai/blob/master/tools/llm_bench/prompts/llama-2-7b-chat_l.jsonl
Sorry, I am running it on an Intel GPU:

    python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2 -d GPU -r benchmark_results.csv -f openvino
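
Since the issue reproduces on GPU but not on CPU, a quick check that the Intel GPU plugin is visible to OpenVINO may help narrow things down (a minimal sketch using the standard openvino Python API; the device list shown in the comment is an assumption for a working GPU setup):

    import openvino as ov

    core = ov.Core()
    # On a machine with a working Intel GPU driver, 'GPU' should appear
    # alongside 'CPU', e.g. ['CPU', 'GPU'].
    print(core.available_devices)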
