Population count comparison for Xeon W-2104 CPU @ 3.20GHz

Generated on: 2019-12-08

CPU: Xeon W-2104 CPU @ 3.20GHz

Compiler: gcc version 8.1.0 (Ubuntu 8.1.0-5ubuntu1~16.04)

Instruction set: AVX512BW

Number of runs: 5

All times are given in seconds.

======================================  =================================================================
procedure                               description
======================================  =================================================================
lookup-8                                lookup in std::uint8_t[256] LUT
lookup-64                               lookup in std::uint64_t[256] LUT
bit-parallel                            naive bit parallel method
bit-parallel-optimized                  a bit better bit parallel
bit-parallel-mul                        bit-parallel with fewer instructions
bit-parallel32                          naive bit parallel method (32 bit)
bit-parallel-optimized32                a bit better bit parallel (32 bit)
harley-seal                             Harley-Seal popcount (4th iteration)
sse-bit-parallel                        SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original               SSE implementation of bit-parallel-optimized
sse-bit-parallel-better                 SSE implementation of bit-parallel with fewer instructions
sse-harley-seal                         SSE implementation of Harley-Seal
sse-lookup                              SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original                     SSSE3 variant using pshufb instruction
avx2-lookup                             AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original                    AVX2 variant using pshufb instruction
avx2-harley-seal                        AVX2 implementation of Harley-Seal
cpu                                     CPU instruction popcnt (64-bit variant)
sse-cpu                                 load data with SSE, then count bits using popcnt
avx2-cpu                                load data with AVX2, then count bits using popcnt
avx512-harley-seal                      AVX512 implementation of Harley-Seal
avx512bw-shuf                           AVX512BW implementation using the shuffle instruction
builtin-popcnt                          builtin for popcnt
builtin-popcnt32                        builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled                 unrolled builtin-popcnt
builtin-popcnt-unrolled32               unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata          unrolled builtin-popcnt avoiding false-dependency
builtin-popcnt-unrolled-errata-manual   unrolled builtin-popcnt avoiding false-dependency (assembly code)
builtin-popcnt-movdq                    builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled           builtin-popcnt-movdq unrolled
builtin-popcnt-movdq-unrolled_manual    builtin-popcnt-movdq unrolled (assembly code)
======================================  =================================================================
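
To make the scalar entries above more concrete, here is a minimal sketch of three of the techniques: a byte-wise table lookup (in the spirit of lookup-8), a bit-parallel count that finishes with a multiplication (in the spirit of bit-parallel-mul), and the compiler builtin (in the spirit of builtin-popcnt). The function names ``make_lut``, ``popcnt_lookup8``, ``popcnt_word_mul`` and ``popcnt_builtin`` are hypothetical and chosen for illustration; this is not the benchmarked code, and ``__builtin_popcountll`` assumes GCC or Clang.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: build a 256-entry table of per-byte bit counts.
static std::array<std::uint8_t, 256> make_lut() {
    std::array<std::uint8_t, 256> lut{};
    for (int i = 1; i < 256; ++i) {
        // bits(i) = bits(i / 2) + lowest bit of i
        lut[i] = static_cast<std::uint8_t>(lut[i / 2] + (i & 1));
    }
    return lut;
}

// Table lookup: one load from the table per input byte.
std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    static const std::array<std::uint8_t, 256> lut = make_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += lut[data[i]];
    }
    return total;
}

// Bit-parallel (SWAR) count of one 64-bit word. The per-byte sums are
// added together with a single multiplication instead of shifts and adds.
std::uint64_t popcnt_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
    return (x * 0x0101010101010101ULL) >> 56;                              // top byte = total
}

// Compiler builtin (GCC/Clang); compiles to popcnt when the target allows it.
std::uint64_t popcnt_builtin(std::uint64_t x) {
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
}
```

All three should agree on any input. They differ only in how much work is done per byte or per word, which is what the timings in this report compare.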
procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.19116 1.09751 1.05118 1.02817 1.68133 1.64420 1.62533 1.61539
lookup-64 1.16511 1.09198 1.05227 1.03253 1.69700 1.65446 1.63190 1.62115
bit-parallel 1.26917 1.14385 1.08548 1.05825 1.67150 1.66064 1.65513 1.66514
bit-parallel-optimized 0.90882 0.78403 0.73039 0.70171 1.09967 1.08818 1.08248 1.09080
bit-parallel-mul 0.75216 0.67385 0.64054 0.62674 0.99193 1.02415 1.00252 0.99264
bit-parallel32 1.81757 1.74194 1.71687 1.70441 2.71705 2.71217 2.73145 2.71965
bit-parallel-optimized32 1.40133 1.32995 1.29086 1.27213 2.02035 2.01282 2.02735 2.01658
harley-seal 1.01330 0.83291 0.50931 0.39572 0.53908 0.49207 0.46857 0.46524
sse-bit-parallel 2.00714 1.61196 1.29326 0.78731 0.90954 0.73791 0.64884 0.60464
sse-bit-parallel-original 1.21799 0.78844 0.58476 0.49625 0.72684 0.68865 0.67598 0.67192
sse-bit-parallel-better 1.64924 1.55477 0.93940 0.62167 0.73995 0.61430 0.55213 0.52073
sse-harley-seal 1.22938 0.78968 0.56908 0.27129 0.33717 0.28957 0.26622 0.25450
sse-lookup 0.50139 0.35814 0.24174 0.20464 0.31130 0.30164 0.29683 0.29483
sse-lookup-original 1.64531 0.95388 0.60114 0.43871 0.58401 0.53005 0.52198 0.49898
avx2-lookup 0.47421 0.30170 0.20555 0.14685 0.19706 0.16914 0.15924 0.15487
avx2-lookup-original 1.50636 0.88887 0.52544 0.55798 0.43556 0.36215 0.33397 0.32436
avx2-harley-seal 1.03406 0.58683 0.37282 0.26332 0.20285 0.15388 0.13064 0.11857
cpu 0.34469 0.23502 0.16451 0.13317 0.20055 0.20061 0.20058 0.20631
sse-cpu 1.72117 0.25092 0.21348 0.19193 0.29147 0.28359 0.27968 0.27772
avx2-cpu 2.80064 2.11516 0.28004 0.23874 0.35099 0.33524 0.32587 0.32200
avx512-harley-seal 3.94491 0.81799 0.46683 0.29606 0.33146 0.12317 0.08344 0.06327
avx512bw-shuf 2.10369 1.78203 1.01979 0.66235 0.63044 0.39010 0.22794 0.18797
builtin-popcnt 0.34478 0.29778 0.27428 0.26253 0.41058 0.44010 0.42057 0.41129
builtin-popcnt32 0.50161 0.50126 0.50292 0.50593 0.90008 0.87112 0.84699 0.84165
builtin-popcnt-unrolled 0.31336 0.25068 0.21934 0.20368 0.31310 0.30703 0.30345 0.30967
builtin-popcnt-unrolled32 0.43750 0.40176 0.32173 0.31296 0.48960 0.48478 0.50331 0.49151
builtin-popcnt-unrolled-errata 0.28202 0.20368 0.15673 0.13317 0.20368 0.20211 0.20133 0.20848
builtin-popcnt-unrolled-errata-manual 0.45735 0.30761 0.23215 0.19540 0.28193 0.26632 0.25850 0.26378
builtin-popcnt-movdq 0.21151 0.18420 0.17814 0.17976 0.29001 0.29146 0.30206 0.29324
builtin-popcnt-movdq-unrolled 0.32505 0.23502 0.18807 0.16586 0.24785 0.23762 0.23254 0.23797
builtin-popcnt-movdq-unrolled_manual 0.40737 0.25910 0.20416 0.18313 0.28186 0.26956 0.26400 0.28840
Input size: 32 B

procedure time [s] relative time (less is better)
lookup-8 1.19116 ███████████████
lookup-64 1.16511 ██████████████▊
bit-parallel 1.26917 ████████████████
bit-parallel-optimized 0.90882 ███████████▌
bit-parallel-mul 0.75216 █████████▌
bit-parallel32 1.81757 ███████████████████████
bit-parallel-optimized32 1.40133 █████████████████▊
harley-seal 1.01330 ████████████▊
sse-bit-parallel 2.00714 █████████████████████████▍
sse-bit-parallel-original 1.21799 ███████████████▍
sse-bit-parallel-better 1.64924 ████████████████████▉
sse-harley-seal 1.22938 ███████████████▌
sse-lookup 0.50139 ██████▎
sse-lookup-original 1.64531 ████████████████████▊
avx2-lookup 0.47421 ██████
avx2-lookup-original 1.50636 ███████████████████
avx2-harley-seal 1.03406 █████████████
cpu 0.34469 ████▎
sse-cpu 1.72117 █████████████████████▊
avx2-cpu 2.80064 ███████████████████████████████████▍
avx512-harley-seal 3.94491 ██████████████████████████████████████████████████
avx512bw-shuf 2.10369 ██████████████████████████▋
builtin-popcnt 0.34478 ████▎
builtin-popcnt32 0.50161 ██████▎
builtin-popcnt-unrolled 0.31336 ███▉
builtin-popcnt-unrolled32 0.43750 █████▌
builtin-popcnt-unrolled-errata 0.28202 ███▌
builtin-popcnt-unrolled-errata-manual 0.45735 █████▊
builtin-popcnt-movdq 0.21151 ██▋
builtin-popcnt-movdq-unrolled 0.32505 ████
builtin-popcnt-movdq-unrolled_manual 0.40737 █████▏
Input size: 64 B

procedure time [s] relative time (less is better)
lookup-8 1.09751 █████████████████████████▉
lookup-64 1.09198 █████████████████████████▊
bit-parallel 1.14385 ███████████████████████████
bit-parallel-optimized 0.78403 ██████████████████▌
bit-parallel-mul 0.67385 ███████████████▉
bit-parallel32 1.74194 █████████████████████████████████████████▏
bit-parallel-optimized32 1.32995 ███████████████████████████████▍
harley-seal 0.83291 ███████████████████▋
sse-bit-parallel 1.61196 ██████████████████████████████████████
sse-bit-parallel-original 0.78844 ██████████████████▋
sse-bit-parallel-better 1.55477 ████████████████████████████████████▊
sse-harley-seal 0.78968 ██████████████████▋
sse-lookup 0.35814 ████████▍
sse-lookup-original 0.95388 ██████████████████████▌
avx2-lookup 0.30170 ███████▏
avx2-lookup-original 0.88887 █████████████████████
avx2-harley-seal 0.58683 █████████████▊
cpu 0.23502 █████▌
sse-cpu 0.25092 █████▉
avx2-cpu 2.11516 ██████████████████████████████████████████████████
avx512-harley-seal 0.81799 ███████████████████▎
avx512bw-shuf 1.78203 ██████████████████████████████████████████▏
builtin-popcnt 0.29778 ███████
builtin-popcnt32 0.50126 ███████████▊
builtin-popcnt-unrolled 0.25068 █████▉
builtin-popcnt-unrolled32 0.40176 █████████▍
builtin-popcnt-unrolled-errata 0.20368 ████▊
builtin-popcnt-unrolled-errata-manual 0.30761 ███████▎
builtin-popcnt-movdq 0.18420 ████▎
builtin-popcnt-movdq-unrolled 0.23502 █████▌
builtin-popcnt-movdq-unrolled_manual 0.25910 ██████
Input size: 128 B

procedure time [s] relative time (less is better)
lookup-8 1.05118 ██████████████████████████████▌
lookup-64 1.05227 ██████████████████████████████▋
bit-parallel 1.08548 ███████████████████████████████▌
bit-parallel-optimized 0.73039 █████████████████████▎
bit-parallel-mul 0.64054 ██████████████████▋
bit-parallel32 1.71687 ██████████████████████████████████████████████████
bit-parallel-optimized32 1.29086 █████████████████████████████████████▌
harley-seal 0.50931 ██████████████▊
sse-bit-parallel 1.29326 █████████████████████████████████████▋
sse-bit-parallel-original 0.58476 █████████████████
sse-bit-parallel-better 0.93940 ███████████████████████████▎
sse-harley-seal 0.56908 ████████████████▌
sse-lookup 0.24174 ███████
sse-lookup-original 0.60114 █████████████████▌
avx2-lookup 0.20555 █████▉
avx2-lookup-original 0.52544 ███████████████▎
avx2-harley-seal 0.37282 ██████████▊
cpu 0.16451 ████▊
sse-cpu 0.21348 ██████▏
avx2-cpu 0.28004 ████████▏
avx512-harley-seal 0.46683 █████████████▌
avx512bw-shuf 1.01979 █████████████████████████████▋
builtin-popcnt 0.27428 ███████▉
builtin-popcnt32 0.50292 ██████████████▋
builtin-popcnt-unrolled 0.21934 ██████▍
builtin-popcnt-unrolled32 0.32173 █████████▎
builtin-popcnt-unrolled-errata 0.15673 ████▌
builtin-popcnt-unrolled-errata-manual 0.23215 ██████▊
builtin-popcnt-movdq 0.17814 █████▏
builtin-popcnt-movdq-unrolled 0.18807 █████▍
builtin-popcnt-movdq-unrolled_manual 0.20416 █████▉
Input size: 256 B

procedure time [s] relative time (less is better)
lookup-8 1.02817 ██████████████████████████████▏
lookup-64 1.03253 ██████████████████████████████▎
bit-parallel 1.05825 ███████████████████████████████
bit-parallel-optimized 0.70171 ████████████████████▌
bit-parallel-mul 0.62674 ██████████████████▍
bit-parallel32 1.70441 ██████████████████████████████████████████████████
bit-parallel-optimized32 1.27213 █████████████████████████████████████▎
harley-seal 0.39572 ███████████▌
sse-bit-parallel 0.78731 ███████████████████████
sse-bit-parallel-original 0.49625 ██████████████▌
sse-bit-parallel-better 0.62167 ██████████████████▏
sse-harley-seal 0.27129 ███████▉
sse-lookup 0.20464 ██████
sse-lookup-original 0.43871 ████████████▊
avx2-lookup 0.14685 ████▎
avx2-lookup-original 0.55798 ████████████████▎
avx2-harley-seal 0.26332 ███████▋
cpu 0.13317 ███▉
sse-cpu 0.19193 █████▋
avx2-cpu 0.23874 ███████
avx512-harley-seal 0.29606 ████████▋
avx512bw-shuf 0.66235 ███████████████████▍
builtin-popcnt 0.26253 ███████▋
builtin-popcnt32 0.50593 ██████████████▊
builtin-popcnt-unrolled 0.20368 █████▉
builtin-popcnt-unrolled32 0.31296 █████████▏
builtin-popcnt-unrolled-errata 0.13317 ███▉
builtin-popcnt-unrolled-errata-manual 0.19540 █████▋
builtin-popcnt-movdq 0.17976 █████▎
builtin-popcnt-movdq-unrolled 0.16586 ████▊
builtin-popcnt-movdq-unrolled_manual 0.18313 █████▎
Input size: 512 B

procedure time [s] relative time (less is better)
lookup-8 1.68133 ██████████████████████████████▉
lookup-64 1.69700 ███████████████████████████████▏
bit-parallel 1.67150 ██████████████████████████████▊
bit-parallel-optimized 1.09967 ████████████████████▏
bit-parallel-mul 0.99193 ██████████████████▎
bit-parallel32 2.71705 ██████████████████████████████████████████████████
bit-parallel-optimized32 2.02035 █████████████████████████████████████▏
harley-seal 0.53908 █████████▉
sse-bit-parallel 0.90954 ████████████████▋
sse-bit-parallel-original 0.72684 █████████████▍
sse-bit-parallel-better 0.73995 █████████████▌
sse-harley-seal 0.33717 ██████▏
sse-lookup 0.31130 █████▋
sse-lookup-original 0.58401 ██████████▋
avx2-lookup 0.19706 ███▋
avx2-lookup-original 0.43556 ████████
avx2-harley-seal 0.20285 ███▋
cpu 0.20055 ███▋
sse-cpu 0.29147 █████▎
avx2-cpu 0.35099 ██████▍
avx512-harley-seal 0.33146 ██████
avx512bw-shuf 0.63044 ███████████▌
builtin-popcnt 0.41058 ███████▌
builtin-popcnt32 0.90008 ████████████████▌
builtin-popcnt-unrolled 0.31310 █████▊
builtin-popcnt-unrolled32 0.48960 █████████
builtin-popcnt-unrolled-errata 0.20368 ███▋
builtin-popcnt-unrolled-errata-manual 0.28193 █████▏
builtin-popcnt-movdq 0.29001 █████▎
builtin-popcnt-movdq-unrolled 0.24785 ████▌
builtin-popcnt-movdq-unrolled_manual 0.28186 █████▏
Input size: 1024 B

procedure time [s] relative time (less is better)
lookup-8 1.64420 ██████████████████████████████▎
lookup-64 1.65446 ██████████████████████████████▌
bit-parallel 1.66064 ██████████████████████████████▌
bit-parallel-optimized 1.08818 ████████████████████
bit-parallel-mul 1.02415 ██████████████████▉
bit-parallel32 2.71217 ██████████████████████████████████████████████████
bit-parallel-optimized32 2.01282 █████████████████████████████████████
harley-seal 0.49207 █████████
sse-bit-parallel 0.73791 █████████████▌
sse-bit-parallel-original 0.68865 ████████████▋
sse-bit-parallel-better 0.61430 ███████████▎
sse-harley-seal 0.28957 █████▎
sse-lookup 0.30164 █████▌
sse-lookup-original 0.53005 █████████▊
avx2-lookup 0.16914 ███
avx2-lookup-original 0.36215 ██████▋
avx2-harley-seal 0.15388 ██▊
cpu 0.20061 ███▋
sse-cpu 0.28359 █████▏
avx2-cpu 0.33524 ██████▏
avx512-harley-seal 0.12317 ██▎
avx512bw-shuf 0.39010 ███████▏
builtin-popcnt 0.44010 ████████
builtin-popcnt32 0.87112 ████████████████
builtin-popcnt-unrolled 0.30703 █████▋
builtin-popcnt-unrolled32 0.48478 ████████▉
builtin-popcnt-unrolled-errata 0.20211 ███▋
builtin-popcnt-unrolled-errata-manual 0.26632 ████▉
builtin-popcnt-movdq 0.29146 █████▎
builtin-popcnt-movdq-unrolled 0.23762 ████▍
builtin-popcnt-movdq-unrolled_manual 0.26956 ████▉
Input size: 2048 B

procedure time [s] relative time (less is better)
lookup-8 1.62533 █████████████████████████████▊
lookup-64 1.63190 █████████████████████████████▊
bit-parallel 1.65513 ██████████████████████████████▎
bit-parallel-optimized 1.08248 ███████████████████▊
bit-parallel-mul 1.00252 ██████████████████▎
bit-parallel32 2.73145 ██████████████████████████████████████████████████
bit-parallel-optimized32 2.02735 █████████████████████████████████████
harley-seal 0.46857 ████████▌
sse-bit-parallel 0.64884 ███████████▉
sse-bit-parallel-original 0.67598 ████████████▎
sse-bit-parallel-better 0.55213 ██████████
sse-harley-seal 0.26622 ████▊
sse-lookup 0.29683 █████▍
sse-lookup-original 0.52198 █████████▌
avx2-lookup 0.15924 ██▉
avx2-lookup-original 0.33397 ██████
avx2-harley-seal 0.13064 ██▍
cpu 0.20058 ███▋
sse-cpu 0.27968 █████
avx2-cpu 0.32587 █████▉
avx512-harley-seal 0.08344 █▌
avx512bw-shuf 0.22794 ████▏
builtin-popcnt 0.42057 ███████▋
builtin-popcnt32 0.84699 ███████████████▌
builtin-popcnt-unrolled 0.30345 █████▌
builtin-popcnt-unrolled32 0.50331 █████████▏
builtin-popcnt-unrolled-errata 0.20133 ███▋
builtin-popcnt-unrolled-errata-manual 0.25850 ████▋
builtin-popcnt-movdq 0.30206 █████▌
builtin-popcnt-movdq-unrolled 0.23254 ████▎
builtin-popcnt-movdq-unrolled_manual 0.26400 ████▊
Input size: 4096 B

procedure time [s] relative time (less is better)
lookup-8 1.61539 █████████████████████████████▋
lookup-64 1.62115 █████████████████████████████▊
bit-parallel 1.66514 ██████████████████████████████▌
bit-parallel-optimized 1.09080 ████████████████████
bit-parallel-mul 0.99264 ██████████████████▏
bit-parallel32 2.71965 ██████████████████████████████████████████████████
bit-parallel-optimized32 2.01658 █████████████████████████████████████
harley-seal 0.46524 ████████▌
sse-bit-parallel 0.60464 ███████████
sse-bit-parallel-original 0.67192 ████████████▎
sse-bit-parallel-better 0.52073 █████████▌
sse-harley-seal 0.25450 ████▋
sse-lookup 0.29483 █████▍
sse-lookup-original 0.49898 █████████▏
avx2-lookup 0.15487 ██▊
avx2-lookup-original 0.32436 █████▉
avx2-harley-seal 0.11857 ██▏
cpu 0.20631 ███▊
sse-cpu 0.27772 █████
avx2-cpu 0.32200 █████▉
avx512-harley-seal 0.06327 █▏
avx512bw-shuf 0.18797 ███▍
builtin-popcnt 0.41129 ███████▌
builtin-popcnt32 0.84165 ███████████████▍
builtin-popcnt-unrolled 0.30967 █████▋
builtin-popcnt-unrolled32 0.49151 █████████
builtin-popcnt-unrolled-errata 0.20848 ███▊
builtin-popcnt-unrolled-errata-manual 0.26378 ████▊
builtin-popcnt-movdq 0.29324 █████▍
builtin-popcnt-movdq-unrolled 0.23797 ████▎
builtin-popcnt-movdq-unrolled_manual 0.28840 █████▎
Speedup relative to lookup-8 (more is better)

procedure 32 B 64 B 128 B 256 B 512 B 1024 B 2048 B 4096 B
lookup-8 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
lookup-64 1.02 1.01 1.00 1.00 0.99 0.99 1.00 1.00
bit-parallel 0.94 0.96 0.97 0.97 1.01 0.99 0.98 0.97
bit-parallel-optimized 1.31 1.40 1.44 1.47 1.53 1.51 1.50 1.48
bit-parallel-mul 1.58 1.63 1.64 1.64 1.70 1.61 1.62 1.63
bit-parallel32 0.66 0.63 0.61 0.60 0.62 0.61 0.60 0.59
bit-parallel-optimized32 0.85 0.83 0.81 0.81 0.83 0.82 0.80 0.80
harley-seal 1.18 1.32 2.06 2.60 3.12 3.34 3.47 3.47
sse-bit-parallel 0.59 0.68 0.81 1.31 1.85 2.23 2.50 2.67
sse-bit-parallel-original 0.98 1.39 1.80 2.07 2.31 2.39 2.40 2.40
sse-bit-parallel-better 0.72 0.71 1.12 1.65 2.27 2.68 2.94 3.10
sse-harley-seal 0.97 1.39 1.85 3.79 4.99 5.68 6.11 6.35
sse-lookup 2.38 3.06 4.35 5.02 5.40 5.45 5.48 5.48
sse-lookup-original 0.72 1.15 1.75 2.34 2.88 3.10 3.11 3.24
avx2-lookup 2.51 3.64 5.11 7.00 8.53 9.72 10.21 10.43
avx2-lookup-original 0.79 1.23 2.00 1.84 3.86 4.54 4.87 4.98
avx2-harley-seal 1.15 1.87 2.82 3.90 8.29 10.68 12.44 13.62
cpu 3.46 4.67 6.39 7.72 8.38 8.20 8.10 7.83
sse-cpu 0.69 4.37 4.92 5.36 5.77 5.80 5.81 5.82
avx2-cpu 0.43 0.52 3.75 4.31 4.79 4.90 4.99 5.02
avx512-harley-seal 0.30 1.34 2.25 3.47 5.07 13.35 19.48 25.53
avx512bw-shuf 0.57 0.62 1.03 1.55 2.67 4.21 7.13 8.59
builtin-popcnt 3.45 3.69 3.83 3.92 4.09 3.74 3.86 3.93
builtin-popcnt32 2.37 2.19 2.09 2.03 1.87 1.89 1.92 1.92
builtin-popcnt-unrolled 3.80 4.38 4.79 5.05 5.37 5.36 5.36 5.22
builtin-popcnt-unrolled32 2.72 2.73 3.27 3.29 3.43 3.39 3.23 3.29
builtin-popcnt-unrolled-errata 4.22 5.39 6.71 7.72 8.25 8.14 8.07 7.75
builtin-popcnt-unrolled-errata-manual 2.60 3.57 4.53 5.26 5.96 6.17 6.29 6.12
builtin-popcnt-movdq 5.63 5.96 5.90 5.72 5.80 5.64 5.38 5.51
builtin-popcnt-movdq-unrolled 3.66 4.67 5.59 6.20 6.78 6.92 6.99 6.79
builtin-popcnt-movdq-unrolled_manual 2.92 4.24 5.15 5.61 5.97 6.10 6.16 5.60
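
The ratios in the last table appear to be the lookup-8 time divided by each procedure's time at the same input size. For example, cpu at 32 B gives 1.19116 / 0.34469 ≈ 3.46. A minimal sketch of that calculation follows; the helper names ``speedup`` and ``round2`` are hypothetical and not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a procedure relative to a reference procedure (here: lookup-8).
// Both times are in seconds and measured on the same input size.
double speedup(double reference_time, double time) {
    assert(time > 0.0);
    return reference_time / time;
}

// Round to two decimal places, the precision used in the table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

With the timings reported above, ``round2(speedup(1.61539, 0.06327))`` reproduces the 25.53 listed for avx512-harley-seal at 4096 B.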

Download skylake-x-w-2104-gcc8.1.0.csv