
No significant improvement in token generation time in version 2.5.1 vs. 2.3.1 #758

Open
shira-g opened this issue Dec 26, 2024 · 3 comments

@shira-g

shira-g commented Dec 26, 2024

Describe the issue

Hi,

I am measuring the time to generate each token for the microsoft/Phi-3-mini-4k-instruct model.
I use a prompt size of 512 and a generation size of 128, running on XPU with fp16 model precision.

  • IPEX v2.3.1: average time per token ~94 ms
  • IPEX v2.5.1: average time per token ~90 ms

Should I expect a larger improvement in token generation time with IPEX v2.5.1?

@ZhaoqiongZ ZhaoqiongZ self-assigned this Dec 30, 2024
@ZhaoqiongZ ZhaoqiongZ added the XPU/GPU, Performance, and LLM labels Dec 30, 2024
@shira-g
Author

shira-g commented Dec 31, 2024

For the int4 Phi-3 model I see a larger improvement:

IPEX v2.3.1: average time per token ~50 ms
IPEX v2.5.1: average time per token ~40 ms

Is there a way to further improve the time per token?

@ZhaoqiongZ
Contributor

Hi @shira-g, have you tried WOQ (weight-only quantization) in the LLM inference guide?
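
For reference, a minimal weight-only quantization sketch. It uses the ipex.quantization WOQ qconfig helpers documented for ipex.llm.optimize; the XPU example in the LLM inference guide may use a different entry point, so treat the exact arguments as assumptions rather than the guide's exact recipe:

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()

# Weight-only quantization: int4 weights, fp16 compute.
# Assumption: this mirrors the ipex.quantization WOQ qconfig API; the XPU
# example in the guide may configure WOQ differently.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4,
)
model = ipex.llm.optimize(
    model,
    dtype=torch.float16,
    quantization_config=qconfig,
    device="xpu",
    inplace=True,
)

# Generation then proceeds exactly as with the fp16 model.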

@shira-g
Author

shira-g commented Jan 2, 2025

Yes, my code is based on the instructions here: https://github.com/intel/intel-extension-for-pytorch/tree/2bca09763eb6427c301bc4a57054e936abdf11e6/examples/gpu/llm/inference#learn-to-quantize-llm-and-run-inference

Here is my full code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex
import tqdm

from timed_text_streamer import TimedTextStreamer  # custom streamer that records per-token times

model_id = "microsoft/Phi-3-mini-4k-instruct"
device = "xpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to(device)
model = model.to(memory_format=torch.channels_last)
model = ipex.llm.optimize(model, inplace=True, dtype=torch.float16, device=device)

context = 512
new_tokens = 128
warmup = 3
# Synthetic prompt: random token ids with a full attention mask.
inputs = torch.randint(0, 30000, (1, context), device=device)
am = torch.ones_like(inputs, dtype=torch.bool, device=device)
streamer = TimedTextStreamer()

first_times = []
second_times = []

# Warmup
for _ in tqdm.tqdm(range(warmup), desc='Warmup', leave=False):
    _ = model.generate(inputs, max_new_tokens=new_tokens, attention_mask=am,
                       streamer=streamer, cache_implementation="static")

# Benchmark
iters = 5
for i in tqdm.tqdm(range(iters), desc='Benchmarking'):
    output = model.generate(inputs, max_new_tokens=new_tokens, attention_mask=am,
                            streamer=streamer, cache_implementation="static")
    first, *second = streamer.token_times
    first_times.append(first)
    second_times.append(second)

r = torch.tensor(first_times), torch.tensor(second_times)
avg_first_time = r[0].mean()
avg_second_time = r[1].mean(dim=0)
print("avg first: " + str(avg_first_time))
print("avg rest: " + str(avg_second_time))
