
BF16 Compute DType on AVX512 ISA #308

Open
Alavandar08 opened this issue Jul 4, 2024 · 0 comments

Comments


Alavandar08 commented Jul 4, 2024

The Bestla README.md lists the supported weight-only quantization configurations - https://github.com/intel/neural-speed/blob/main/bestla/README.md#weight-only

Since Bestla supports the BF16 compute dtype, I quantized the model using quantize.py - https://github.com/intel/neural-speed/blob/main/scripts/quantize.py

Ex: python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype bf16

During inference, I noticed that the F32 APIs are triggered for both the FP32 and BF16 compute types.
One scenario is within QKV fusion, where BTLAGemmCompF32() is triggered for both F32 and BF16 -

void BTLAGemmCompF32(const int M, const int N, const int K, const float* A, const int lda,

Question 1: Can I use the Bestla/Neural Speed APIs with the BF16 compute dtype on the AVX512 ISA without falling back to F32, and which input dtypes does the API support?
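
For reference, here is a minimal standalone sketch (not taken from the Bestla/Neural Speed code base; the helper name cpu_has_avx512_bf16 is illustrative) that checks whether the CPU exposes the AVX512_BF16 extension. One possible reason for a fallback to F32 on an AVX512 machine is that this extension is missing; the feature bit is CPUID leaf 7, sub-leaf 1, EAX bit 5 per the Intel SDM.

```cpp
// Illustrative sketch only, not Bestla/Neural Speed code:
// query CPUID for the AVX512_BF16 feature bit
// (leaf 7, sub-leaf 1, EAX bit 5 in the Intel SDM).
#include <cpuid.h>
#include <cstdio>

static bool cpu_has_avx512_bf16() {
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // __get_cpuid_count returns 0 if the requested leaf is not supported.
  if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return false;
  return (eax >> 5) & 1u;  // EAX bit 5 = AVX512_BF16
}

int main() {
  std::printf("AVX512_BF16 supported: %s\n",
              cpu_has_avx512_bf16() ? "yes" : "no");
  return 0;
}
```

On a machine that reports "no" here, native BF16 dot-product instructions are not available, so I would expect a BF16 compute path to need either a software conversion or an F32 fallback; I would like to confirm what the library actually does in that case.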

Alavandar08 changed the title from "BF16 Compute DType understanding" to "BF16 Compute DType on AVX512 ISA" on Jul 22, 2024