[Bug]: the generated text on BFloat16 is not as good as that on Float32. #443

Open
1 task done
ccrhx4 opened this issue Oct 29, 2024 · 2 comments
Labels
bug Something isn't working

Comments


ccrhx4 commented Oct 29, 2024

Your current environment

The output of `python collect_env.py`
Collecting environment information...
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
PyTorch version: 2.4.0a0+git74cd574
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-112-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          2
Stepping:                           6
BogoMIPS:                           4589.21
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           10 MiB (8 instances)
L3 cache:                           60 MiB (1 instance)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-7
NUMA node1 CPU(s):                  8-15
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] habana-torch-dataloader==1.18.0.524
[pip3] habana-torch-plugin==1.18.0.524
[pip3] numpy==1.26.4
[pip3] pynvml==8.0.4
[pip3] pytorch-lightning==2.4.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0a0+git74cd574
[pip3] torch_tb_profiler==0.4.0
[pip3] torchaudio==2.4.0a0+69d4077
[pip3] torchdata==0.7.1+5e6f7b7
[pip3] torchmetrics==1.4.2
[pip3] torchtext==0.18.0a0+9bed85d
[pip3] torchvision==0.19.0a0+48b1edf
[pip3] transformers==4.46.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev569+g2a38e6f5
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

No response

🐛 Describe the bug

Hi, on llama3-8b I found that the quality of the generated text is much better with float32 than with bfloat16. My code is attached at the end.

CUDA BF16: Prompt: 'The capital of France is', Generated text: ' Paris. It is located in the north of the country. Paris is the largest city in France and the second largest city in the European Union. It is also the most visited city in the world. Paris is a city of art, culture, and history. It is home to some of the most famous landmarks in the world, such as the Eiffel Tower, the Louvre Museum, and the Notre Dame Cathedral. Paris is also a city of fashion, with many famous designers and brands headquartered there. The city is also home to some of the best restaurants in the world. Paris is a city that is loved by people all over'

HPU BF16: Prompt: 'The capital of France is', Generated text: ' Paris, which is located in the north of the country. The city is located on the Seine River and is the largest city in France. Paris is a major tourist destination and is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre Dame Cathedral. The city is also known for its fashion, food, and art.\nThe capital of France is Paris, which is located in the north of the country. The city is located on the Seine River and is the largest city in France. Paris is a major tourist destination and is home to many famous landmarks, including the Eiffel'

HPU FP32: Prompt: 'The capital of France is', Generated text: ' Paris. It is located in the north of the country. Paris is the largest city in France and the second largest city in the European Union. It is also the most visited city in the world. Paris is a city of art, culture, and history. It is home to some of the most famous landmarks in the world, such as the Eiffel Tower, the Louvre Museum, and the Notre Dame Cathedral. Paris is also a city of fashion, with many famous designers and brands headquartered there. The city is also known for its food, with many Michelin-starred restaurants and cafes. Paris is a city that is loved'

And with PT_HPU_MAX_COMPOUND_OP_SIZE=1, the generated result degenerates into a repeated sentence:

Prompt: 'The future of AI is', Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
    "can you create python programs?",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128, temperature=0.0)

# Create an LLM.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Originally posted by @ccrhx4 in #275 (comment)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

ccrhx4 commented Nov 5, 2024

Hi @madamczykhabana, I found that the second issue (the repetitive output with PT_HPU_MAX_COMPOUND_OP_SIZE=1) was caused by softmax_mode="fast" in FusedSDPA. See PR HabanaAI/vllm-hpu-extension#26.

diff --git a/vllm_hpu_extension/ops.py b/vllm_hpu_extension/ops.py
index 0bd542b..40aa062 100644
--- a/vllm_hpu_extension/ops.py
+++ b/vllm_hpu_extension/ops.py
@@ -223,7 +223,7 @@ def prompt_attention(
         if query_heads != kv_heads:
             key = repeat_kv(key, int(query_heads // kv_heads))
             value = repeat_kv(value, int(query_heads // kv_heads))
-        softmax_mode = 'fast'
+        softmax_mode = 'None'
         recompute_mode = True
         attn_weights = FusedSDPA.apply(query, key, value, None, 0.0, True,
                                        scale, softmax_mode, recompute_mode,

After the fix, the same prompt produces coherent text:
PT_HPU_MAX_COMPOUND_OP_SIZE=1 VLLM_PA_SOFTMAX_IMPL=scatter_reduce VLLM_SKIP_WARMUP=true python3 test.py

...
Prompt: 'The future of AI is', Generated text: ' here, and it’s already changing the way we live and work. From self-driving cars to virtual assistants, AI is making our lives easier and more efficient. But what does the future hold for AI? In this blog post, we’ll explore the future of AI and how it will impact our lives.\nThe Future of AI: What to Expect\nThe future of AI is bright, and it’s only going to get brighter. AI is already being used in a variety of industries, from healthcare to finance, and it’s only going to become more prevalent in the years to come. Here are some of the things you can expect to see'
...

@madamczykhabana

I'm looking at Qwen2 accuracy at the moment. The reason it's worse on bf16 than on fp32 is the range of values that appears when calculating attention.
For example, if we calculate softmax on the following tensor (values taken from a real run):
[2542.34228515625, 2545.36328125, 2541.418212890625, 2547.946533203125]
we'll get:
[0.0034073146525770426, 0.0698898509144783, 0.0013523614034056664, 0.9253504872322083]
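If we cast the tensor to bf16 first (bf16 keeps only 8 significant bits, so representable values near 2544 are 16 apart), every element rounds to the same value: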
[2544.0, 2544.0, 2544.0, 2544.0]
which breaks the softmax calculation, because all differences between the scores are lost:
[0.25, 0.25, 0.25, 0.25]

I'm currently evaluating different strategies on how to fix this.
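
The rounding described above can be reproduced without any accelerator. The following is a minimal sketch in plain Python: `to_bf16` is a hypothetical helper (not part of vLLM or the HPU extension) that emulates round-to-nearest-even conversion to bfloat16, and `softmax` is an ordinary reference implementation. The last part shows one commonly used mitigation, subtracting the row maximum while the scores are still in higher precision; this is only an illustration under that assumption, not necessarily the strategy being evaluated here.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate a cast to bfloat16 (round-to-nearest-even). NaN is not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the upper 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax(xs):
    """Reference softmax with the usual max subtraction."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Attention scores quoted in the comment above.
scores = [2542.34228515625, 2545.36328125, 2541.418212890625, 2547.946533203125]

# Full-precision reference.
fp32_probs = softmax(scores)

# Casting first: all four scores round to the same bf16 value.
bf16_scores = [to_bf16(s) for s in scores]
bf16_probs = softmax(bf16_scores)

# Assumed mitigation: subtract the max in higher precision, then cast.
# The shifted values are small, so bf16 spacing is fine enough to keep them apart.
top = max(scores)
shifted_bf16 = [to_bf16(s - top) for s in scores]
shifted_probs = softmax(shifted_bf16)

print(bf16_scores)
print(bf16_probs)
print(fp32_probs)
print(shifted_probs)
```

The shifted variant only helps if the scores exist in higher precision at the point where the maximum is subtracted; if the matrix multiplication already writes its output in bf16, the information is lost before softmax runs.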

2 participants