Population count comparison for Haswell Core i7-4770 CPU @ 3.40GHz
Generated on: 2016-03-26
CPU: Haswell Core i7-4770 CPU @ 3.40GHz
Compiler: GCC 5.3.0 (Ubuntu)
Instruction set: AVX2
Number of runs: 5
All times are given in seconds.
procedure                              description
-------------------------------------  ------------------------------------------------------------------
lookup-8                               lookup in std::uint8_t[256] LUT
lookup-64                              lookup in std::uint64_t[256] LUT
bit-parallel                           naive bit-parallel method
bit-parallel-optimized                 slightly improved bit-parallel method
bit-parallel-mul                       bit-parallel with fewer instructions
harley-seal                            Harley-Seal popcount (4th iteration)
sse-bit-parallel                       SSE implementation of bit-parallel-optimized (unrolled)
sse-bit-parallel-original              SSE implementation of bit-parallel-optimized
sse-bit-parallel-better                SSE implementation of bit-parallel with fewer instructions
sse-harley-seal                        SSE implementation of Harley-Seal
sse-lookup                             SSSE3 variant using pshufb instruction (unrolled)
sse-lookup-original                    SSSE3 variant using pshufb instruction
avx2-lookup                            AVX2 variant using pshufb instruction (unrolled)
avx2-lookup-original                   AVX2 variant using pshufb instruction
avx2-harley-seal                       AVX2 implementation of Harley-Seal
cpu                                    CPU instruction popcnt (64-bit variant)
sse-cpu                                load data with SSE, then count bits using popcnt
avx2-cpu                               load data with AVX2, then count bits using popcnt
builtin-popcnt                         builtin for popcnt
builtin-popcnt32                       builtin for popcnt (32-bit variant)
builtin-popcnt-unrolled                unrolled builtin-popcnt
builtin-popcnt-unrolled32              unrolled builtin-popcnt32
builtin-popcnt-unrolled-errata         unrolled builtin-popcnt avoiding false dependency
builtin-popcnt-unrolled-errata-manual  unrolled builtin-popcnt avoiding false dependency (assembly code)
builtin-popcnt-movdq                   builtin-popcnt where data is loaded via SSE registers
builtin-popcnt-movdq-unrolled          unrolled builtin-popcnt-movdq
builtin-popcnt-movdq-unrolled_manual   unrolled builtin-popcnt-movdq (assembly code)
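For orientation, three of the scalar procedures listed above can be sketched as follows. This is a minimal illustration and not the benchmarked source: the function names (`popcnt_lookup8`, `popcnt_bit_parallel_mul`, `popcnt_builtin`) and the helper `make_lut` are hypothetical, and reading "mul" as the multiply-based horizontal sum is an assumption. `__builtin_popcountll` is the GCC/Clang builtin; it typically compiles to the `popcnt` instruction when the target allows it (for example with `-mpopcnt`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: builds a 256-entry table holding the bit count of
// every possible byte value (the kind of LUT "lookup-8" refers to).
static std::array<std::uint8_t, 256> make_lut() {
    std::array<std::uint8_t, 256> lut{};  // lut[0] == 0
    for (int i = 1; i < 256; ++i) {
        lut[i] = static_cast<std::uint8_t>((i & 1) + lut[i / 2]);
    }
    return lut;
}

// Sketch of a lookup-8 style procedure: one table lookup per input byte.
std::uint64_t popcnt_lookup8(const std::uint8_t* data, std::size_t n) {
    static const std::array<std::uint8_t, 256> lut = make_lut();
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += lut[data[i]];
    }
    return total;
}

// Bit-parallel count of one 64-bit word: sum adjacent 1-, 2- and 4-bit
// fields, then add up the eight byte counts with a single multiplication.
static std::uint64_t popcnt_word_mul(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

// Sketch of a bit-parallel procedure; assumes n is a multiple of 8.
std::uint64_t popcnt_bit_parallel_mul(const std::uint8_t* data, std::size_t n) {
    assert(n % 8 == 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof word);  // alignment-safe load
        total += popcnt_word_mul(word);
    }
    return total;
}

// Sketch of a builtin-popcnt style procedure; assumes n is a multiple of 8.
std::uint64_t popcnt_builtin(const std::uint8_t* data, std::size_t n) {
    assert(n % 8 == 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof word);
        total += static_cast<std::uint64_t>(__builtin_popcountll(word));
    }
    return total;
}
```

All three functions take a byte buffer and its length and return the total number of set bits, so they can be checked against each other on the same input.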
Time in seconds by input size (less is better)

procedure                                 32 B     64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B
-------------------------------------  -------  -------  -------  -------  -------  -------  -------  -------
lookup-8                               1.20459  1.10942  1.06966  1.11342  1.69944  1.66395  1.64353  1.63281
lookup-64                              1.17685  1.09910  1.06269  1.08992  1.67641  1.63699  1.61699  1.60908
bit-parallel                           1.32661  1.12067  1.05220  1.02585  1.62042  1.60970  1.60452  1.60648
bit-parallel-optimized                 1.03180  0.82544  0.73700  0.69278  1.07308  1.05540  1.04655  1.05100
bit-parallel-mul                       0.85492  0.72226  0.65594  0.62647  1.06513  1.02557  1.00872  0.99841
harley-seal                            1.03180  0.81070  0.50116  0.39429  0.54653  0.50303  0.48200  0.47094
sse-bit-parallel                       1.79140  1.69979  0.95726  0.63293  0.73317  0.60194  0.51672  0.49033
sse-bit-parallel-original              1.41642  0.82879  0.56012  0.42746  0.58246  0.57800  0.54149  0.53427
sse-bit-parallel-better                2.61391  2.03126  1.09129  0.66172  0.71430  0.56356  0.49055  0.45673
sse-harley-seal                        1.12661  0.71832  0.50877  0.22741  0.28450  0.24347  0.22265  0.21202
sse-lookup                             0.53064  0.33902  0.21373  0.18056  0.26689  0.25495  0.24996  0.24830
sse-lookup-original                    1.04292  0.63382  0.41272  0.30954  0.41567  0.41024  0.37114  0.35682
avx2-lookup                            0.50116  0.30954  0.20636  0.13575  0.17204  0.14997  0.13613  0.13426
avx2-lookup-original                   1.78312  0.93599  0.50484  0.36600  0.31277  0.24320  0.20630  0.19625
avx2-harley-seal                       1.04442  0.61678  0.38303  0.26795  0.19295  0.14620  0.12328  0.11145
cpu                                    0.29480  0.22110  0.16214  0.14003  0.20636  0.19751  0.20783  0.19857
sse-cpu                                2.18153  0.26444  0.20636  0.17741  0.26346  0.25142  0.26614  0.25413
avx2-cpu                               2.00157  1.85354  0.25814  0.21528  0.30251  0.28702  0.27929  0.27857
builtin-popcnt                         0.29480  0.26532  0.25058  0.24321  0.38956  0.33049  0.30460  0.29243
builtin-popcnt32                       0.46953  0.44765  0.62961  0.50848  0.79063  0.75806  0.72629  0.72995
builtin-popcnt-unrolled                0.26532  0.17688  0.14003  0.12163  0.19162  0.19015  0.21131  0.19759
builtin-popcnt-unrolled32              0.44381  0.37015  0.33226  0.31322  0.48938  0.48954  0.46232  0.44888
builtin-popcnt-unrolled-errata         0.29480  0.20636  0.16214  0.14003  0.20858  0.19923  0.20037  0.19425
builtin-popcnt-unrolled-errata-manual  0.32775  0.23250  0.17305  0.14913  0.22391  0.22265  0.21923  0.20444
builtin-popcnt-movdq                   0.20636  0.17988  0.19162  0.18425  0.29000  0.31599  0.30364  0.29296
builtin-popcnt-movdq-unrolled          0.32428  0.25638  0.19162  0.17164  0.25556  0.24600  0.24859  0.23759
builtin-popcnt-movdq-unrolled_manual   0.36384  0.26613  0.20974  0.18593  0.27761  0.27070  0.25343  0.24468
[Bar chart, 32 B input: time [s] and relative time (less is better) per procedure; the values are those of the 32 B column of the times table above.]
[Bar chart, 64 B input: time [s] and relative time (less is better) per procedure; the values are those of the 64 B column of the times table above.]
[Bar chart, 128 B input: time [s] and relative time (less is better) per procedure; the values are those of the 128 B column of the times table above.]
[Bar chart, 256 B input: time [s] and relative time (less is better) per procedure; the values are those of the 256 B column of the times table above.]
[Bar chart, 512 B input: time [s] and relative time (less is better) per procedure; the values are those of the 512 B column of the times table above.]
[Bar chart, 1024 B input: time [s] and relative time (less is better) per procedure; the values are those of the 1024 B column of the times table above.]
[Bar chart, 2048 B input: time [s] and relative time (less is better) per procedure; the values are those of the 2048 B column of the times table above.]
[Bar chart, 4096 B input: time [s] and relative time (less is better) per procedure; the values are those of the 4096 B column of the times table above.]
Speedup relative to lookup-8 by input size (higher is better)

procedure                                 32 B     64 B    128 B    256 B    512 B   1024 B   2048 B   4096 B
-------------------------------------  -------  -------  -------  -------  -------  -------  -------  -------
lookup-8                                  1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00
lookup-64                                 1.02     1.01     1.01     1.02     1.01     1.02     1.02     1.01
bit-parallel                              0.91     0.99     1.02     1.09     1.05     1.03     1.02     1.02
bit-parallel-optimized                    1.17     1.34     1.45     1.61     1.58     1.58     1.57     1.55
bit-parallel-mul                          1.41     1.54     1.63     1.78     1.60     1.62     1.63     1.64
harley-seal                               1.17     1.37     2.13     2.82     3.11     3.31     3.41     3.47
sse-bit-parallel                          0.67     0.65     1.12     1.76     2.32     2.76     3.18     3.33
sse-bit-parallel-original                 0.85     1.34     1.91     2.60     2.92     2.88     3.04     3.06
sse-bit-parallel-better                   0.46     0.55     0.98     1.68     2.38     2.95     3.35     3.58
sse-harley-seal                           1.07     1.54     2.10     4.90     5.97     6.83     7.38     7.70
sse-lookup                                2.27     3.27     5.00     6.17     6.37     6.53     6.58     6.58
sse-lookup-original                       1.16     1.75     2.59     3.60     4.09     4.06     4.43     4.58
avx2-lookup                               2.40     3.58     5.18     8.20     9.88    11.10    12.07    12.16
avx2-lookup-original                      0.68     1.19     2.12     3.04     5.43     6.84     7.97     8.32
avx2-harley-seal                          1.15     1.80     2.79     4.16     8.81    11.38    13.33    14.65
cpu                                       4.09     5.02     6.60     7.95     8.24     8.42     7.91     8.22
sse-cpu                                   0.55     4.20     5.18     6.28     6.45     6.62     6.18     6.43
avx2-cpu                                  0.60     0.60     4.14     5.17     5.62     5.80     5.88     5.86
builtin-popcnt                            4.09     4.18     4.27     4.58     4.36     5.03     5.40     5.58
builtin-popcnt32                          2.57     2.48     1.70     2.19     2.15     2.20     2.26     2.24
builtin-popcnt-unrolled                   4.54     6.27     7.64     9.15     8.87     8.75     7.78     8.26
builtin-popcnt-unrolled32                 2.71     3.00     3.22     3.55     3.47     3.40     3.55     3.64
builtin-popcnt-unrolled-errata            4.09     5.38     6.60     7.95     8.15     8.35     8.20     8.41
builtin-popcnt-unrolled-errata-manual     3.68     4.77     6.18     7.47     7.59     7.47     7.50     7.99
builtin-popcnt-movdq                      5.84     6.17     5.58     6.04     5.86     5.27     5.41     5.57
builtin-popcnt-movdq-unrolled             3.71     4.33     5.58     6.49     6.65     6.76     6.61     6.87
builtin-popcnt-movdq-unrolled_manual      3.31     4.17     5.10     5.99     6.12     6.15     6.49     6.67
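The figures in this last table are consistent with dividing the lookup-8 time by the time of each procedure at the same input size and rounding to two decimals. A minimal sketch of that arithmetic follows; the function names `speedup` and `round2` are hypothetical and are not taken from the benchmark's own tooling.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a procedure over a baseline measured on the same input size.
// Values above 1 mean the procedure is faster than the baseline.
double speedup(double baseline_seconds, double procedure_seconds) {
    assert(procedure_seconds > 0.0);
    return baseline_seconds / procedure_seconds;
}

// Rounds to two decimal places, matching the precision of the table.
double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For example, with the 32 B times of lookup-8 (1.20459 s) and cpu (0.29480 s), `round2(speedup(1.20459, 0.29480))` gives 4.09, which is the value listed for cpu at 32 B.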
Download haswell-i7-4770-gcc5.3.0-avx2.csv