Only IvB, not Haswell, can eliminate movzx %dl, %eax. Either way it still takes a fused-domain uop,
so elimination only helps with latency (and frees an execution unit,
but that doesn't matter here because so many of our uops are loads).
using xmm instead of mmx: LUT is harder
punpckldq loads 64b instead of 32, and has the full SSE alignment requirements. (#GP fault on misaligned.)
movd, punpckldq: 2 uops (vs. 1 uop for punpck directly from RAM, but that can suffer from a cacheline split)
movd, palignr: 2 uops (palignr from mem is 2 by itself, so movd to a reg first doesn't cost extra)
after preparing 2 halves of value to xor with dest in 2 xmm regs, use movlhps or something to put low64 of one into high64 of the other (1 uop; sketched below)
or get results in low 16 of 64bit halves, and shift?
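A minimal sketch of the movlhps option (hypothetical reg assignment: the two 64b halves sit in the low qwords of %xmm0 and %xmm1, and the dest chunk was already loaded into %xmm2):
    movlhps %xmm1, %xmm0    # low 64b of %xmm1 -> high 64b of %xmm0: 1 uop (p5)
    pxor    %xmm2, %xmm0    # xor the combined 128b result with the dest chunk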
LUT loads:
movd: 1 uop, clears upper
punpckldq: 1 uop, shuffles previous contents over
PINSRW from RAM is 2 uops (1 load + 1 p15 on SnB, tput 0.5; just load + p5 on HWell, tput 1)
pmovzxwd / wq: 1 uop from RAM for xmm, 2 uops from RAM for ymm. Even pmovzxwq loads 2 src words, not just the low word
# pshufd (2 uops from ram) non-destructive shuffle, but can't combine data from multiple regs
PALIGNR -14(%rbp, %rax, 2)...: 2 uops from RAM (load + p5 on HWell: tput 1)
PALIGNR for misaligned? More useful when you actually want 128b unaligned loads, I think?
or use it to shift old contents up, shifting new 2 or 4 bytes in from RAM. (PALIGNR from RAM is 2 uops, or 1 from xmm)
use it on -2(%rbp, %rax, 2) to get the wanted table bytes, shifting old contents over? Check whether it's slow when the non-loaded part crosses a cache line
punpcklwd: 1 uop: interleave by words, not dwords. With zero-padded LUT, could save the shift by two bytes to get the LH[src1] to the right place.
# punpckh: puts data from high half of dest into low half.
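For concreteness, a sketch of chaining the reg-reg forms above to gather four LUT entries (hypothetical: 4-byte entries, table base in %rsi, byte indices already zero-extended in %rax/%rbx/%rcx/%rdx):
    movd       (%rsi,%rax,4), %xmm0   # entry A in low dword, upper 96b zeroed
    movd       (%rsi,%rbx,4), %xmm1   # entry B
    punpckldq  %xmm1, %xmm0           # xmm0 = [A, B, 0, 0]
    movd       (%rsi,%rcx,4), %xmm1   # entry C
    movd       (%rsi,%rdx,4), %xmm2   # entry D
    punpckldq  %xmm2, %xmm1           # xmm1 = [C, D, 0, 0]
    punpcklqdq %xmm1, %xmm0           # xmm0 = [A, B, C, D]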
SnB bottleneck analysis for GP load 8byte, shift $16, 2x movzx (24 uops):
p23: 2 src + 16 LUT loads = 18 loads: 9 cycles. p015: 22 uops: 7.33 cycles. issue: 40 uops total: 10 cycles.
So issue rate is still the bottleneck even with the lowest-total-uop version.
SnB including processing dest:
p23: 1 dest load + 1 dest store (needs AGU on SB/IvB) + 2 src + 16 LUT p23 loads: 10 cycles.
p015: 22 uops: 7.33 cycles.
issue rate: 42 uops total: 10.5 cycles. (not including any uops to combine the LUT lookups with xor)
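Spelling out the machine widths behind those divisions (SnB: 2 load ports (p23), 3 ports for ALU uops (p015), 4-wide issue):
    p23:   20 uops / 2 ports = 10   cycles
    p015:  22 uops / 3 ports =  7.33 cycles
    issue: 42 uops / 4 wide  = 10.5  cycles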
**Cost to get 8 * 2B words loaded from mem, and into 1B chunks for base+index*scale lookups:
note: don't use %rbp or %r13 as the base, http://www.x86-64.org/documentation/assembly.html says
(%rbp,index,scale) isn't encodable, only 0(%rbp,index,scale).
pextrb: 33 uops
16byte load: 1 uop
16 * pextrb: 16 * 2 uops. (no cheap movd for the first element, unlike the pextrw version)
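Sketch (SSE4.1 pextrb zero-extends into a GP reg, so no movzx needed afterwards; regs hypothetical):
    movdqu (%rdi), %xmm0      # 16-byte load of the src bytes: 1 uop
    pextrb $0, %xmm0, %eax    # 2 uops per extract; result already zero-extended
    pextrb $1, %xmm0, %ecx
    # ... pextrb $2 .. $15 likewise: 16 extracts total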
pextrw + movzx: 32 uops (8 of them zero-latency on IvB)
16byte load: 1 uop
1st word: movd: 1 uop
next 7 words: pextrw: 2 uops each (p0 p5): 14 uops (15 total with the movd)
2x movzx per word: 2 uops (1 of them zero-latency on Ivy Bridge): 16(8) uops
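Sketch of the first two words (regs hypothetical; real code would interleave the LUT loads before the index regs get overwritten):
    movdqu (%rdi), %xmm0      # 16-byte load: 8 src words
    movd   %xmm0, %edx        # word 0 in %dx (word 1 in the upper 16, unused here): 1 uop
    movzx  %dl, %eax          # split word 0 into byte indices
    movzx  %dh, %ecx
    pextrw $1, %xmm0, %edx    # word 1, zero-extended: 2 uops (p0 p5)
    movzx  %dl, %eax
    movzx  %dh, %ecx
    # ... pextrw $2 .. $7 likewise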
GP reg: shr $8 + movzx: 30 uops (14 of them zero-latency on IvB)
2x 8byte load: 2 uops
2x7 movzx %dl, %eax: 14 uops (0 latency on IvB)
2x7 shr $8: 14 uops
(last byte left zero-extended in %rdx, ready to use)
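The shr $8 pattern for one qword (src pointer %rdi hypothetical; the second qword runs as an independent chain):
    mov   (%rdi), %rdx    # 8-byte load
    movzx %dl, %eax       # byte 0 (zero-latency on IvB)
    shr   $8, %rdx        # expose byte 1 in %dl
    movzx %dl, %ecx       # byte 1
    shr   $8, %rdx
    # ... 5 more movzx + shr $8 pairs; after the 7th shr,
    # %rdx itself holds the last byte, already zero-extended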
GP reg: shr $16 + movzx %dl/%dh: 24 uops (8 of the 16 movzx are zero-latency on IvB). exec ports: 22 p015, 2 p23.
2x8byte load: 2 uops
2x2 movzx: 4
2xshr $16: 2
2x2 movzx: 4
2xshr $16: 2
2x2 movzx: 4
2xshr $16: 2
2x2 movzx: 4
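The %dl/%dh pattern for one qword (regs hypothetical; each pair of indices would feed two LUT loads before being overwritten):
    mov   (%rdi), %rdx    # 8-byte load: 4 words
    movzx %dl, %eax       # low byte of word 0 (zero-latency on IvB)
    movzx %dh, %ecx       # high byte of word 0 (a %dh source always needs an ALU uop)
    shr   $16, %rdx       # next word into %dx
    movzx %dl, %eax
    movzx %dh, %ecx
    # ... shr $16 + 2x movzx twice more for words 2 and 3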
GP reg: split dep chain: 26 uops. (bad plan, just do 2x8 in parallel)
2x 8byte load: 2 uops
2x movq: 2 uops (0 latency on IvB and Haswell)
2x shr $32: 2 uops
2x4 movzx: 8 uops
2x2 shr $16: 4 uops
2x4 movzx: 8 uops
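What that would look like, for reference (hypothetical regs; as noted above it doesn't beat just running the two qwords in parallel):
    mov  (%rdi), %rdx    # 8-byte load
    mov  %rdx, %rcx      # reg-reg copy: eliminated, 0 latency on IvB/Haswell
    shr  $32, %rcx       # high 32b becomes an independent dep chain
    # then the movzx %dl/%dh + shr $16 pattern runs on %rdx and %rcx in parallel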
load XMM -> move to GP: +2 uops. (-1 load port, +3 p015). Higher load->use latency.
+1 movdqa: 1 uop (load port)
-2 movq: -2 uops (load port)
+1 movq x,r: 1 uop (p015)
+1 pextrq $1, x, r: 2 uops (p0, p015) (or 2 uops for shift+movq)
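Sketch (regs hypothetical; aligned src assumed for movdqa):
    movdqa (%rdi), %xmm0      # one 16B load in place of two 8B GP loads
    movq   %xmm0, %rdx        # low qword -> GP: 1 uop (p015)
    pextrq $1, %xmm0, %rcx    # high qword -> GP: 2 uops (SSE4.1)
    # alternative 2-uop form: psrldq $8, %xmm0 ; movq %xmm0, %rcx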
load 1B at a time: 16 (but competes for the load ports with table lookups)
16 src + 16 LUT p23 loads: 32/2 = 16 cycles.
16x movzx: 16 uops
load 2B at a time: 24 (same uop count as the GP shifts, but bottlenecks the load ports and pays the 3-cycle load-use latency more of the time)
8x movzwl: 8 uops (load)
2x8 movzx: 16 uops (reg, reg)
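Sketch of one word (regs hypothetical):
    movzwl (%rdi), %edx    # 2-byte load, zero-extended into %edx
    movzx  %dl, %eax       # low-byte index (zero-latency on IvB)
    movzx  %dh, %ecx       # high-byte index
    # ... same triple for the other 7 words at 2(%rdi), 4(%rdi), ...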
get the upper 16 ready for use:
shr $16, %rdx: 3 (1 uop (p05); word in low16 + zeros (or + garbage if %rdx has high bytes))
bswap %edx: 3 (1 uop (p1); word in low16 + garbage)
shld $8, %eax, %edx; shld $8, %ebx, %edx: 4 (1 + 1 (+ 2 for clearing the dest regs))
(zeroing a reg takes a fused uop but no execution unit;
it still counts as one of the four uops per cycle from the uop cache)
shld $8, %eax, %edx; shr $24, %edx: 3 (1 + 1 (+ 1 for the xor to clear %eax))
given %edx with the word we want in %dx, and garbage in the upper 16, prepare 2 regs with 8b indices:
movzx %dl, %eax; movzx %dh, %ebx; 2 uops (0 latency for %dl on IvB)
given %edx with the word in %dx, zeros in upper 16:
movzx %dl, %eax; movzx %dh, %ebx; 2 uops (0 latency for %dl on IvB)
movzx %dl, %eax; shr $8, %edx; 2 uops (0 latency for %dl on IvB)
xor %ebx, %ebx; xchg %dh, %bl: 5 (1 (xor) + 3 (xchg) + 1 partial-reg merge)
lea (%dh), %bl; movzx %dl, %eax: 3 (1 + 1 + 1 partial reg (%bl))
lea (%dh), %ebx; movzx %dl, %eax: 2 (1 + 1) (hypothetical: 8-bit regs aren't encodable in addressing modes, so no such LEA exists)