Intel(R) Core(TM) Ultra 5 125H NPU so slowly? #1084

mnlife · 2024-10-28T07:27:18Z

when I using Intel(R) Core(TM) Ultra 5 125H to test, npu is so slowly?

install npu driver follow this: https://github.com/intel/linux-npu-driver/blob/main/docs/overview.md

pip install optimum-intel nncf==2.11 onnx==1.16.1
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

this is benchmark result

(envv) mnlife@ubuntu:~/openvino.genai/samples/python/benchmark_genai$ python benchmark_genai.py -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d CPU -m /home/mnlife/models/TinyLlama
Load time: 620.00 ms
Generate time: 70.05 ± 0.00 ms
Tokenization time: 0.38 ± 0.00 ms
Detokenization time: 0.55 ± 0.00 ms
TTFT: 69.48 ± 0.00 ms
TPOT: 69.48 ± 0.00 ms
Throughput : 14.39 ± 0.00 tokens/s
(envv) mnlife@ubuntu:~/openvino.genai/samples/python/benchmark_genai$ python benchmark_genai.py -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU -m /home/mnlife/models/TinyLlama
Load time: 3325.00 ms
Generate time: 53633.71 ± 0.00 ms
Tokenization time: 0.36 ± 0.00 ms
Detokenization time: 0.50 ± 0.00 ms
TTFT: 4742.35 ± 0.00 ms
TPOT: 268.16 ± 317.23 ms
Throughput : 3.73 ± 4.41 tokens/s

The text was updated successfully, but these errors were encountered:

mnlife · 2024-10-28T07:35:59Z

very slow when I using npu

(envv) mnlife@ubuntu:~/models$ time python genai.py CPU
args: CPU
it is the color of the sky.



real	0m1.636s
user	0m3.346s
sys	0m1.512s
(envv) mnlife@ubuntu:~/models$ time python genai.py NPU
args: NPU
it is the color of the sky.



real	0m11.004s
user	1m36.290s
sys	0m5.256s
(envv) mnlife@ubuntu:~/models$ cat genai.py 
import openvino_genai as ov_genai
import sys
print(f"args: {sys.argv[1]}")
pipe = ov_genai.LLMPipeline("TinyLlama", sys.argv[1])
config = ov_genai.GenerationConfig()
config.max_new_tokens = 10
prompt = "The Sky is blue because"
res = pipe.generate(prompt, config)
print(res)
(envv) mnlife@ubuntu:~/models$

dmatveev · 2024-10-31T17:18:44Z

very slow when I using npu

(envv) mnlife@ubuntu:~/models$ time python genai.py CPU
args: CPU
it is the color of the sky.



real	0m1.636s
user	0m3.346s
sys	0m1.512s
(envv) mnlife@ubuntu:~/models$ time python genai.py NPU
args: NPU
it is the color of the sky.



real	0m11.004s
user	1m36.290s
sys	0m5.256s
(envv) mnlife@ubuntu:~/models$ cat genai.py 
import openvino_genai as ov_genai
import sys
print(f"args: {sys.argv[1]}")
pipe = ov_genai.LLMPipeline("TinyLlama", sys.argv[1])
config = ov_genai.GenerationConfig()
config.max_new_tokens = 10
prompt = "The Sky is blue because"
res = pipe.generate(prompt, config)
print(res)
(envv) mnlife@ubuntu:~/models$

Hi @mnlife , note - when you measure time end-to-end, you also include the model compile time on NPU - so it may show higher overall number than for CPU and GPU.

Please wait the new OpenVINO release and driver update, I hope @TolyaTalamanov will help you get better results.

mnlife · 2024-11-04T08:21:11Z

very slow when I using npu
(envv) mnlife@ubuntu:~/models$ time python genai.py CPU
args: CPU
it is the color of the sky.



real	0m1.636s
user	0m3.346s
sys	0m1.512s
(envv) mnlife@ubuntu:~/models$ time python genai.py NPU
args: NPU
it is the color of the sky.



real	0m11.004s
user	1m36.290s
sys	0m5.256s
(envv) mnlife@ubuntu:~/models$ cat genai.py 
import openvino_genai as ov_genai
import sys
print(f"args: {sys.argv[1]}")
pipe = ov_genai.LLMPipeline("TinyLlama", sys.argv[1])
config = ov_genai.GenerationConfig()
config.max_new_tokens = 10
prompt = "The Sky is blue because"
res = pipe.generate(prompt, config)
print(res)
(envv) mnlife@ubuntu:~/models$ 
Hi @mnlife , note - when you measure time end-to-end, you also include the model compile time on NPU - so it may show higher overall number than for CPU and GPU.

Please wait the new OpenVINO release and driver update, I hope @TolyaTalamanov will help you get better results.

Very much looking forward to your reply

junruizh2021 · 2024-11-06T02:36:01Z

@dmatveev Does performance of LLM on NPU rely on "remote tensors" feature? I also observed that the performance on NPU is worse than CPU.

mnlife · 2024-11-13T06:57:26Z

very slow when I using npu
(envv) mnlife@ubuntu:~/models$ time python genai.py CPU
args: CPU
it is the color of the sky.



real	0m1.636s
user	0m3.346s
sys	0m1.512s
(envv) mnlife@ubuntu:~/models$ time python genai.py NPU
args: NPU
it is the color of the sky.



real	0m11.004s
user	1m36.290s
sys	0m5.256s
(envv) mnlife@ubuntu:~/models$ cat genai.py 
import openvino_genai as ov_genai
import sys
print(f"args: {sys.argv[1]}")
pipe = ov_genai.LLMPipeline("TinyLlama", sys.argv[1])
config = ov_genai.GenerationConfig()
config.max_new_tokens = 10
prompt = "The Sky is blue because"
res = pipe.generate(prompt, config)
print(res)
(envv) mnlife@ubuntu:~/models$ 
Hi @mnlife , note - when you measure time end-to-end, you also include the model compile time on NPU - so it may show higher overall number than for CPU and GPU.

Please wait the new OpenVINO release and driver update, I hope @TolyaTalamanov will help you get better results.

Hi, Has this issue been resolved?

helena-intel · 2024-11-27T12:15:45Z

Hi @mnlife , OpenVINO 2024.5 was released recently with much improved performance for LLMs on NPU. Could you try again? Please follow the recommendations in the documentation https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html . In particular: update the NPU driver, make sure to export the model with symmetric quantization, and add do_sample=False to pipe.generate(). The documentation also has tips for model caching among other things.

This is an example for exporting the model:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0

Also note the comment above about including compilation time. To compare inference performance specifically, you can measure time before and after inference, e.g.

import time

start = time.perf_counter()
res = pipe.generate(prompt, config)
end = time.perf_counter()
print("Duration: ", end-start)

Using model caching as explained in the docs will speed up model compilation and therefore also improve overall duration of the script.

andrei-kochin assigned dmatveev and TolyaTalamanov Oct 28, 2024

andrei-kochin added the category: NPU label Oct 28, 2024

ilya-lavrenov added the category: LLM LLM pipeline (stateful, static) label Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intel(R) Core(TM) Ultra 5 125H NPU so slowly? #1084

Intel(R) Core(TM) Ultra 5 125H NPU so slowly? #1084

mnlife commented Oct 28, 2024 •

edited

Loading

mnlife commented Oct 28, 2024

dmatveev commented Oct 31, 2024

mnlife commented Nov 4, 2024

junruizh2021 commented Nov 6, 2024 •

edited

Loading

mnlife commented Nov 13, 2024

helena-intel commented Nov 27, 2024

Intel(R) Core(TM) Ultra 5 125H NPU so slowly? #1084

Intel(R) Core(TM) Ultra 5 125H NPU so slowly? #1084

Comments

mnlife commented Oct 28, 2024 • edited Loading

mnlife commented Oct 28, 2024

dmatveev commented Oct 31, 2024

mnlife commented Nov 4, 2024

junruizh2021 commented Nov 6, 2024 • edited Loading

mnlife commented Nov 13, 2024

helena-intel commented Nov 27, 2024

mnlife commented Oct 28, 2024 •

edited

Loading

junruizh2021 commented Nov 6, 2024 •

edited

Loading