Which quantization mode is faster at runtime, W8A8 or W4A16? #414

cocosci · 2023-11-16T21:01:01Z


Hi group,
Great to know that TRT-LLM supports INT4 for Llama now! However, it is weight-only: the weights are 4-bit while the activations stay in 16-bit. So here's the question: compared to INT8 SQ (SmoothQuant), where both W and A are 8-bit, which is faster at runtime? Please help me out :)

Poll: Which quantization is faster, W8A8 or W4A16? (8 votes)

INT4 (W4A16), because weights are 4-bit and moving data is time-consuming: 25%

INT8 (W8A8): 25%

Roughly the same: 0%

Hard to say, as it depends on the kernel size, batch_size, seq_len, etc.: 50%
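
To make the "it depends" option concrete, here is a rough roofline-style sketch for a single linear layer. It is a back-of-the-envelope estimate, not a benchmark. The hardware numbers (`FP16_TFLOPS`, `INT8_TOPS`, `MEM_BW_GBS`) are hypothetical placeholders rather than specs of any real GPU, and the function name is made up for illustration. It assumes W4A16 runs its matmul at FP16 throughput after dequantizing the weights, W8A8 runs on INT8 units, and input and output activations share the same bit width. It ignores dequantization cost, kernel efficiency, and everything outside the GEMM.

```python
# Hypothetical hardware numbers, placeholders only (not a real GPU spec).
FP16_TFLOPS = 300.0   # assumed peak FP16 throughput, used for W4A16
INT8_TOPS = 600.0     # assumed peak INT8 throughput, used for W8A8
MEM_BW_GBS = 2000.0   # assumed memory bandwidth in GB/s


def gemm_time_estimate(m, k, n, weight_bits, act_bits, peak_tops):
    """Roofline lower bound (seconds) for an (m x k) @ (k x n) matmul.

    m is the number of tokens processed at once (batch_size * seq_len
    during prefill, batch_size during decoding).
    """
    flops = 2.0 * m * k * n                        # multiply-accumulates
    weight_bytes = k * n * weight_bits / 8.0       # weights read once
    act_bytes = (m * k + m * n) * act_bits / 8.0   # input plus output
    compute_s = flops / (peak_tops * 1e12)
    memory_s = (weight_bytes + act_bytes) / (MEM_BW_GBS * 1e9)
    # Whichever resource is the bottleneck sets the lower bound.
    return max(compute_s, memory_s)


def compare(m, k=4096, n=4096):
    """Return (W4A16 estimate, W8A8 estimate) for m tokens."""
    w4a16 = gemm_time_estimate(m, k, n, 4, 16, FP16_TFLOPS)
    w8a8 = gemm_time_estimate(m, k, n, 8, 8, INT8_TOPS)
    return w4a16, w8a8


if __name__ == "__main__":
    for tokens in (1, 16, 256, 4096):
        w4, w8 = compare(tokens)
        print(f"m={tokens:5d}  W4A16={w4:.2e}s  W8A8={w8:.2e}s")
```

Under these assumptions, small `m` (decoding with a small batch) is bound by reading the weights, so the 4-bit weights come out ahead. Large `m` (long prefill or big batches) is bound by compute, so the INT8 matmul comes out ahead. Where the crossover falls on real hardware depends on the actual kernels and has to be measured.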

Replies: 0 comments
