
Intel(R) Core(TM) Ultra 5 125H NPU so slow? #1084

Open · mnlife opened this issue Oct 28, 2024 · 6 comments
Labels: category: LLM (LLM pipeline, stateful/static), category: NPU

Comments

mnlife commented Oct 28, 2024

When I test on an Intel(R) Core(TM) Ultra 5 125H, why is the NPU so slow?

I installed the NPU driver following this guide: https://github.com/intel/linux-npu-driver/blob/main/docs/overview.md

pip install optimum-intel nncf==2.11 onnx==1.16.1
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

These are the benchmark results:

(envv) mnlife@ubuntu:~/openvino.genai/samples/python/benchmark_genai$ python benchmark_genai.py -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d CPU -m /home/mnlife/models/TinyLlama
Load time: 620.00 ms
Generate time: 70.05 ± 0.00 ms
Tokenization time: 0.38 ± 0.00 ms
Detokenization time: 0.55 ± 0.00 ms
TTFT: 69.48 ± 0.00 ms
TPOT: 69.48 ± 0.00 ms
Throughput : 14.39 ± 0.00 tokens/s
(envv) mnlife@ubuntu:~/openvino.genai/samples/python/benchmark_genai$ python benchmark_genai.py -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU -m /home/mnlife/models/TinyLlama
Load time: 3325.00 ms
Generate time: 53633.71 ± 0.00 ms
Tokenization time: 0.36 ± 0.00 ms
Detokenization time: 0.50 ± 0.00 ms
TTFT: 4742.35 ± 0.00 ms
TPOT: 268.16 ± 317.23 ms
Throughput : 3.73 ± 4.41 tokens/s

mnlife commented Oct 28, 2024

It is very slow when I use the NPU:

(envv) mnlife@ubuntu:~/models$ time python genai.py CPU
args: CPU
it is the color of the sky.



real	0m1.636s
user	0m3.346s
sys	0m1.512s
(envv) mnlife@ubuntu:~/models$ time python genai.py NPU
args: NPU
it is the color of the sky.



real	0m11.004s
user	1m36.290s
sys	0m5.256s
(envv) mnlife@ubuntu:~/models$ cat genai.py 
import openvino_genai as ov_genai
import sys
print(f"args: {sys.argv[1]}")
pipe = ov_genai.LLMPipeline("TinyLlama", sys.argv[1])
config = ov_genai.GenerationConfig()
config.max_new_tokens = 10
prompt = "The Sky is blue because"
res = pipe.generate(prompt, config)
print(res)
(envv) mnlife@ubuntu:~/models$ 

dmatveev (Contributor) commented

> It is very slow when I use the NPU: […]

Hi @mnlife, note that when you measure time end-to-end, you also include the model compile time on NPU, so the overall number may be higher than for CPU and GPU.

Please wait for the new OpenVINO release and driver update; I hope @TolyaTalamanov will help you get better results.
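
For example, a minimal way to see that split (a sketch reusing the TinyLlama folder and settings from the script above) is to time the pipeline construction, which triggers compilation, separately from generation:

import time
import openvino_genai as ov_genai

# Pipeline construction compiles the model for the target device;
# on NPU this is where most of the end-to-end overhead goes.
start = time.perf_counter()
pipe = ov_genai.LLMPipeline("TinyLlama", "NPU")
print(f"Load + compile: {time.perf_counter() - start:.2f} s")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 10

# Generation alone is what TTFT/TPOT/throughput measure.
start = time.perf_counter()
res = pipe.generate("The Sky is blue because", config)
print(f"Generate: {time.perf_counter() - start:.2f} s")
print(res)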


mnlife commented Nov 4, 2024

> Please wait for the new OpenVINO release and driver update; I hope @TolyaTalamanov will help you get better results.

Very much looking forward to your reply

@ilya-lavrenov added the "category: LLM" (LLM pipeline, stateful/static) label on Nov 5, 2024

junruizh2021 commented Nov 6, 2024

@dmatveev Does the performance of LLMs on the NPU rely on the "remote tensors" feature? I have also observed that performance on the NPU is worse than on the CPU.


mnlife commented Nov 13, 2024

> Please wait for the new OpenVINO release and driver update; I hope @TolyaTalamanov will help you get better results.

Hi, has this issue been resolved?

helena-intel (Contributor) commented

Hi @mnlife, OpenVINO 2024.5 was released recently with much-improved performance for LLMs on NPU. Could you try again? Please follow the recommendations in the documentation: https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html. In particular: update the NPU driver, make sure to export the model with symmetric quantization, and add do_sample=False to pipe.generate(). The documentation also has tips for model caching, among other things.

Here is an example of exporting the model:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
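
For completeness, a minimal generation script with greedy decoding following these recommendations might look like this (a sketch; the model folder name matches the export command above):

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "NPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100
config.do_sample = False  # greedy decoding, as recommended for NPU

res = pipe.generate("Why is the Sun yellow?", config)
print(res)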

Also note the comment above about including compilation time. To compare inference performance specifically, you can measure the time before and after inference, e.g.:

import time

# Time only the generate() call, excluding model load/compile time.
start = time.perf_counter()
res = pipe.generate(prompt, config)
end = time.perf_counter()
print("Duration:", end - start)

Using model caching, as explained in the docs, will speed up model compilation and therefore reduce the overall runtime of the script.
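
As a rough sketch of what enabling caching might look like (the cache directory is arbitrary, and the exact property key is an assumption; check the NPU guide linked above for the key your release uses):

import openvino_genai as ov_genai

# "CACHE_DIR" and ".npucache" are assumptions based on the linked guide;
# the property name may differ between OpenVINO releases.
pipeline_config = {"CACHE_DIR": ".npucache"}
pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "NPU", pipeline_config)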
