
No significant improvement in token generation time in version 2.5.1 vs. 2.3.1 #758

Open
shira-g opened this issue Dec 26, 2024 · 3 comments

@shira-g

shira-g commented Dec 26, 2024

Describe the issue

Hi,

I am measuring the time to generate each token for the microsoft/Phi-3-mini-4k-instruct model.
I use a prompt size of 512 and a generation size of 128, running on XPU with fp16 model precision.

  • IPEX v2.3.1: average time per token ~94 ms
  • IPEX v2.5.1: average time per token ~90 ms

Should I expect a larger improvement in token generation time with IPEX v2.5.1?

@ZhaoqiongZ ZhaoqiongZ self-assigned this Dec 30, 2024
@ZhaoqiongZ ZhaoqiongZ added the XPU/GPU, Performance, and LLM labels Dec 30, 2024
@shira-g
Author

shira-g commented Dec 31, 2024

For the int4 Phi-3 model I see a larger improvement:

IPEX v2.3.1: average time per token ~50 ms
IPEX v2.5.1: average time per token ~40 ms

Is there a way to further improve the time per token?

@ZhaoqiongZ
Contributor

Hi @shira-g, have you tried WOQ (weight-only quantization) in the LLM inference guide?
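
For reference, a minimal weight-only quantization sketch. It uses the ipex.quantization WOQ qconfig helpers documented for ipex.llm.optimize; the XPU example in the LLM inference guide may use a different entry point, so treat the exact arguments as assumptions rather than the guide's exact recipe:

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()

# Weight-only quantization: int4 weights, fp16 compute.
# Assumption: this mirrors the ipex.quantization WOQ qconfig API; the XPU
# example in the guide may configure WOQ differently.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4,
)
model = ipex.llm.optimize(
    model,
    dtype=torch.float16,
    quantization_config=qconfig,
    device="xpu",
    inplace=True,
)

# Generation then proceeds exactly as with the fp16 model.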

@shira-g
Author

shira-g commented Jan 2, 2025

Yes, my code is based on the instructions here: https://github.com/intel/intel-extension-for-pytorch/tree/2bca09763eb6427c301bc4a57054e936abdf11e6/examples/gpu/llm/inference#learn-to-quantize-llm-and-run-inference

Here is my full code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex
import tqdm

from timed_text_streamer import TimedTextStreamer  # custom streamer that records per-token times

model_id = "microsoft/Phi-3-mini-4k-instruct"
device = "xpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to(device)
model = model.to(memory_format=torch.channels_last)
model = ipex.llm.optimize(model, inplace=True, dtype=torch.float16, device=device)

context = 512
new_tokens = 128
warmup = 3
# Synthetic prompt: random token ids with a full attention mask.
inputs = torch.randint(0, 30000, (1, context), device=device)
am = torch.ones_like(inputs, dtype=torch.bool, device=device)
streamer = TimedTextStreamer()

first_times = []
second_times = []

# Warmup
for _ in tqdm.tqdm(range(warmup), desc='Warmup', leave=False):
    _ = model.generate(inputs, max_new_tokens=new_tokens, attention_mask=am,
                       streamer=streamer, cache_implementation="static")

# Benchmark
iters = 5
for i in tqdm.tqdm(range(iters), desc='Benchmarking'):
    output = model.generate(inputs, max_new_tokens=new_tokens, attention_mask=am,
                            streamer=streamer, cache_implementation="static")
    first, *second = streamer.token_times
    first_times.append(first)
    second_times.append(second)

r = torch.tensor(first_times), torch.tensor(second_times)
avg_first_time = r[0].mean()
avg_second_time = r[1].mean(dim=0)
print("avg first: " + str(avg_first_time))
print("avg rest: " + str(avg_second_time))
