-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CPU] add arithmetic_mode impl for bf16_emitters #27737
base: master
Are you sure you want to change the base?
[CPU] add arithmetic_mode impl for bf16_emitters #27737
Conversation
Hi, @chenhu-wang and @dmitry-gorokhov : could you please help review this pr ? thanks |
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
@liubo-intel , You use fixed "saturation" mode on avx512_core_bf16 ISA, but on some ISA like avx2, fixed "truncation" mode is used. This is not aligned. |
8f98525
to
99e994a
Compare
99e994a
to
f0b3f10
Compare
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/plugin/x64/jit_bf16_emitters.hpp
Outdated
Show resolved
Hide resolved
@liubo-intel The positive overflow data is 0x7F7FXXXX. X could be any value. The convert should do smt as following function:
As we can see the saturation mode and truncation mode return same result for overflowed data. I think we can remove the explicit mode for simplicity, extend mode when it is really needed explicitly in future. |
Hi, @chenhu-wang : as we synced offline, 'vcvtneps2bf16' is not used for truncation mode because this instruction output of overflow data is 'inf' instead of truncation values. |
The input of vcvtneps2bf16 is bf16_max for overflow data after saturation. vcvtneps2bf16 to bf16_max is still bf16_max, not clear why inf is generated. How onnx truncation impl work, could you please elaborate more detail? |
Hi, @chenhu-wang : the failed CI case is an onnx backend testcase(the same as shown in commit f0b3f10 CI failure of this pr). I'm not sure whether is it worthy to investigate deeper this onnx backend impl or involve onnx team related colleagues, if the impact of current method on performance is within an acceptable range. Hi, @dmitry-gorokhov what's your suggestions about this? |
Hi, @chenhu-wang and @dmitry-gorokhov : I think I know the reason why onnx backend test fails if we use 'saturation' method(function) for truncation mode. this 'saturation' function will change the original (f32)input ['nan, -nan, inf, -inf'] values which is not expected. |
160c7b8
to
72444fb
Compare
h->uni_vpaddd(aux, in, aux); | ||
h->vfixupimmps(aux, in, table_val("selector"), 0); | ||
h->uni_vpaddd(aux, clamped, aux); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please double check we can remove h->vfixupimmps(aux, in, table_val("selector"), 0);
here. Lines above have possibility to spoil the inf/nan bit again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Vmm in = Vmm(in_vec_idxs[0]); | ||
Vmm clamped = Vmm(aux_vec_idxs[0]); | ||
saturate_input(clamped, in, "bf16_min", "bf16_max"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can consider reuse Vmm "in" with Vmm "clamped". Vmm "in" will store clamped data and save one aux. clamp does not break "keep_source_intact" as I understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saturate_input(in, in, "bf16_min", "bf16_max");
will make the wrong outputs, maybe because the temp data will need this original 'in' data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is:
saturate_input(clamped, in, "bf16_min", "bf16_max");
h->uni_vmovups(in, clamped);
clamped is just a aux vec, do not hold for long. Just a taste.
60ffe50
to
a5c2669
Compare
Hi @dmitry-gorokhov Could you please take a review? Thanks! |
a5c2669
to
fd3e1fb
Compare
…mitter and store_emitter
…turate_input function nan,inf process issue
ff1294d
to
b43892d
Compare
Hi, @chenhu-wang and @dmitry-gorokhov : since I found truncation impl instructions may affect the model's infer time in certain situations, how about we limit these truncation impl to |
…p acc fix and minimize the impact on performance
b43892d
to
d5c00a2
Compare
Details:
Tickets: