spmv_becnhmarks

Luc Berger edited this page Apr 2, 2019 · 70 revisions

SpMV benchmarks

Sparse matrix-vector multiplication (SpMV) and its multivector variants are critical kernels in a wide range of linear algebra methods, where they are used for residual updates, matrix polynomial evaluation and projections.
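To make the two operations being benchmarked concrete, here is a minimal pure-Python sketch of SpMV and its multivector variant on a matrix stored in CSR (compressed sparse row) format. It is an illustration only, not the Kokkos-Kernels implementation; the function names `spmv_csr` and `spmv_mv_csr` and the small 3x3 matrix are made up for this example.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def spmv_mv_csr(row_ptr, col_idx, values, X):
    """Multivector variant: X[i][j] is entry i of right-hand side j."""
    num_rows = len(row_ptr) - 1
    num_vecs = len(X[0])
    Y = [[0.0] * num_vecs for _ in range(num_rows)]
    for row in range(num_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            a, col = values[k], col_idx[k]
            # Each matrix entry is read once and reused for every vector.
            for j in range(num_vecs):
                Y[row][j] += a * X[col][j]
    return Y


# Hypothetical 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]

y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
Y = spmv_mv_csr(row_ptr, col_idx, values, [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(y)
print(Y)
```

The multivector version reads each matrix entry once and applies it to all vectors, which is why it is benchmarked separately from repeated single-vector products.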

The tests are run using Kokkos-Kernels' performance tests: KokkosSparse_spmv.cpp for single vector and KokkosSparse_spmv_mv.cpp for single and multiple vectors.

OpenMP

KNL (Bowman)

Batch script used for job submission on KNL for single vector performance:

#!/bin/bash
#SBATCH -N 1
#SBATCH -p knl-alpha
#SBATCH --time=12:00:00

module load intel/compilers/18.2.199

cd $HOME/kokkoskernels_benchmark/kokkos-kernels/example/buildlib/perf_test

export OMP_PLACES=threads
export OMP_PROC_BIND=spread 

export OMP_NUM_THREADS=16
./KokkosSparse_spmv.exe -l 100 -s 27000  
./KokkosSparse_spmv.exe -l 100 -s 64000  
./KokkosSparse_spmv.exe -l 100 -s 128000 
./KokkosSparse_spmv.exe -l 100 -s 216000 
./KokkosSparse_spmv.exe -l 100 -s 512000 
./KokkosSparse_spmv.exe -l 100 -s 1000000

export OMP_NUM_THREADS=32
./KokkosSparse_spmv.exe -l 100 -s 27000  
./KokkosSparse_spmv.exe -l 100 -s 64000  
./KokkosSparse_spmv.exe -l 100 -s 128000 
./KokkosSparse_spmv.exe -l 100 -s 216000 
./KokkosSparse_spmv.exe -l 100 -s 512000 
./KokkosSparse_spmv.exe -l 100 -s 1000000

export OMP_NUM_THREADS=64
./KokkosSparse_spmv.exe -l 100 -s 27000  
./KokkosSparse_spmv.exe -l 100 -s 64000  
./KokkosSparse_spmv.exe -l 100 -s 128000 
./KokkosSparse_spmv.exe -l 100 -s 216000 
./KokkosSparse_spmv.exe -l 100 -s 512000 
./KokkosSparse_spmv.exe -l 100 -s 1000000

Table of results:

Matrix size 16 threads 32 threads 64 threads
27,000 0.154 0.084 0.051
64,000 0.310 0.155 0.091
128,000 0.664 0.311 0.158
216,000 1.081 0.570 0.329
512,000 2.661 1.409 0.932
1,000,000 5.320 2.841 1.915
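One way to read the table above is in terms of strong scaling relative to the 16-thread run. The sketch below copies the timings from the table and computes, for each matrix size, the speedup over 16 threads and the fraction of the ideal speedup achieved. The helper name `speedup_and_efficiency` is hypothetical, and because only ratios are used, the result does not depend on the time unit.

```python
# Timings copied from the table above: matrix size -> {threads: time}.
timings = {
    27_000: {16: 0.154, 32: 0.084, 64: 0.051},
    64_000: {16: 0.310, 32: 0.155, 64: 0.091},
    128_000: {16: 0.664, 32: 0.311, 64: 0.158},
    216_000: {16: 1.081, 32: 0.570, 64: 0.329},
    512_000: {16: 2.661, 32: 1.409, 64: 0.932},
    1_000_000: {16: 5.320, 32: 2.841, 64: 1.915},
}


def speedup_and_efficiency(times, base_threads=16):
    """Return {threads: (speedup, efficiency)} relative to base_threads."""
    base = times[base_threads]
    result = {}
    for threads, t in times.items():
        speedup = base / t
        ideal = threads / base_threads  # perfect scaling factor
        result[threads] = (speedup, speedup / ideal)
    return result


for size, times in timings.items():
    stats = speedup_and_efficiency(times)
    line = "  ".join(
        f"{thr} threads: x{s:.2f} ({e:.0%})" for thr, (s, e) in stats.items()
    )
    print(f"{size:>9,}  {line}")
```

For the 1,000,000-row case, for instance, the 64-thread run is about 2.78 times faster than the 16-thread run, roughly 69% of the ideal factor of 4.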

The batch scripts used to generate the multivector tables below can be found here: OMP_16, OMP_32 and OMP_64. A post-processing Python script, make_table.py, that formats the output data into tables can be found here. Note that make_table.py must be edited manually to process each output file.
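The contents of make_table.py are not reproduced on this page, so the following is only a sketch of what such a formatting step might look like. It assumes the timings have already been parsed into a dictionary and that the target is a Markdown pipe table; the function name `make_table`, its signature and the input layout are all assumptions made for illustration. The sample values are the first two rows of the first 16-thread table below.

```python
def make_table(results, vector_counts):
    """Format {matrix_size: [time per vector count]} as a Markdown table."""
    header = ["Matrix size"] + [f"{n} vectors" for n in vector_counts]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for size in sorted(results):
        times = results[size]
        if len(times) != len(vector_counts):
            raise ValueError(
                f"expected {len(vector_counts)} timings for size {size}"
            )
        cells = [str(size)] + [f"{t:.3f}" for t in times]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical input: timings already extracted from one output file.
results = {
    27000: [0.157, 0.212, 0.213, 0.253, 0.328, 0.621],
    64000: [0.335, 0.456, 0.541, 0.687, 1.037, 2.207],
}
table = make_table(results, [1, 2, 3, 4, 8, 16])
print(table)
```

Taking the output file name as a command-line argument rather than hard-coding it would remove the need to edit the script for each run.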

Tables of results:

OMP_NUM_THREADS=16

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.157 0.212 0.213 0.253 0.328 0.621
64000 0.335 0.456 0.541 0.687 1.037 2.207
128000 0.714 0.989 1.169 1.412 2.204 4.762
216000 1.184 1.672 1.886 2.247 3.129 5.825
512000 2.811 4.471 4.910 6.056 11.275 33.227
1000000 5.642 8.817 9.163 11.355 19.404 74.938
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.162 0.217 0.216 0.252 0.333 0.599
64000 0.333 0.455 0.543 0.675 0.998 2.218
128000 0.711 0.993 1.175 1.415 2.095 4.698
216000 1.187 1.680 1.900 2.237 3.079 5.857
512000 2.806 4.458 4.924 6.046 12.100 31.827
1000000 5.591 7.994 9.107 11.294 19.435 75.200

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.164 0.202 0.216 0.259 0.333 0.596
64000 0.328 0.481 0.566 0.692 1.008 2.210
128000 0.700 1.039 1.223 1.448 2.110 4.747
216000 1.159 1.766 1.919 2.288 3.173 5.811
512000 2.816 4.186 5.245 6.162 11.451 32.191
1000000 5.623 8.334 9.414 11.637 19.528 74.845
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.160 0.203 0.214 0.256 0.336 0.588
64000 0.336 0.475 0.565 0.706 1.007 2.204
128000 0.694 1.040 1.225 1.442 2.111 4.742
216000 1.145 1.773 1.918 2.293 3.175 5.785
512000 2.810 4.227 5.222 6.292 11.401 32.095
1000000 5.635 8.332 9.417 11.635 19.853 75.401

OMP_NUM_THREADS=32

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.079 0.102 0.113 0.130 0.167 0.286
64000 0.164 0.220 0.260 0.329 0.502 1.104
128000 0.334 0.529 0.579 0.704 1.133 2.409
216000 0.623 0.973 1.011 1.173 1.661 3.098
512000 1.490 2.084 2.557 3.190 5.723 18.003
1000000 2.962 4.291 4.774 6.688 15.861 43.831
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.079 0.118 0.113 0.133 0.166 0.312
64000 0.162 0.219 0.261 0.329 0.495 1.101
128000 0.338 0.482 0.575 0.707 1.135 2.571
216000 0.603 0.887 0.992 1.178 1.683 3.028
512000 1.503 2.320 2.560 3.180 6.163 17.075
1000000 2.938 4.657 5.017 6.023 11.376 46.943

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.089 0.108 0.113 0.133 0.179 0.289
64000 0.167 0.236 0.271 0.340 0.508 1.115
128000 0.332 0.511 0.609 0.726 1.105 2.405
216000 0.603 0.916 1.016 1.206 1.656 3.112
512000 1.492 2.281 2.718 3.210 5.812 17.056
1000000 2.952 4.280 5.060 6.168 12.575 47.401
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.088 0.108 0.113 0.135 0.168 0.292
64000 0.162 0.235 0.269 0.333 0.488 1.111
128000 0.326 0.504 0.610 0.723 1.098 2.404
216000 0.607 0.921 1.020 1.208 1.657 3.058
512000 1.492 2.210 2.721 3.297 5.816 17.007
1000000 2.939 4.296 5.175 6.489 15.315 45.148

OMP_NUM_THREADS=64

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.064 0.094 0.072 0.089 0.102 0.178
64000 0.091 0.138 0.144 0.177 0.259 0.651
128000 0.172 0.271 0.304 0.377 0.628 1.508
216000 0.319 0.515 0.533 0.643 1.041 2.102
512000 0.937 1.266 1.674 2.044 5.745 26.373
1000000 1.894 2.555 3.491 4.856 18.989 84.367
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.048 0.066 0.072 0.083 0.103 0.171
64000 0.090 0.137 0.138 0.176 0.265 0.671
128000 0.173 0.269 0.303 0.379 0.640 1.594
216000 0.332 0.523 0.542 0.665 0.962 2.406
512000 0.935 1.306 1.546 2.123 5.499 27.139
1000000 1.861 2.602 3.189 5.367 19.786 85.826

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.053 0.068 0.071 0.085 0.116 0.173
64000 0.096 0.128 0.145 0.177 0.268 0.619
128000 0.173 0.251 0.330 0.383 0.614 1.665
216000 0.340 0.507 0.541 0.700 1.126 2.362
512000 0.927 1.338 1.597 2.078 5.334 26.975
1000000 1.870 2.631 3.957 4.654 18.905 83.759
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.054 0.068 0.071 0.084 0.103 0.181
64000 0.090 0.124 0.142 0.185 0.258 0.635
128000 0.182 0.262 0.322 0.380 0.629 1.462
216000 0.344 0.488 0.571 0.691 1.034 2.690
512000 0.970 1.255 1.617 2.054 5.615 27.134
1000000 1.878 2.498 3.470 4.670 19.119 88.259

SkyLake (Blake)

OMP_NUM_THREADS=16

New algorithm (vectorize reduction loop): Layout Left

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.027 0.044 0.044 0.045 0.079 0.155
64000 0.055 0.097 0.102 0.114 0.377 0.896
128000 0.105 0.191 0.209 0.229 0.566 1.558
216000 0.174 0.316 0.358 0.368 0.867 2.071
512000 0.488 0.850 1.006 1.054 3.390 8.131
1000000 1.079 1.678 2.083 2.140 5.359 23.347
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.026 0.043 0.044 0.045 0.078 0.154
64000 0.054 0.095 0.101 0.106 0.372 0.703
128000 0.104 0.196 0.211 0.218 0.563 1.518
216000 0.174 0.316 0.358 0.370 0.851 2.083
512000 0.473 0.857 1.020 1.056 3.312 8.201
1000000 1.089 1.690 2.054 2.130 5.307 23.374

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.029 0.047 0.054 0.073 0.101 0.182
64000 0.052 0.097 0.119 0.177 0.367 0.724
128000 0.102 0.195 0.246 0.344 0.607 1.807
216000 0.167 0.332 0.431 0.567 0.981 2.097
512000 0.494 0.888 1.199 1.528 2.955 8.248
1000000 1.111 1.791 2.400 3.004 5.744 23.199
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.029 0.046 0.054 0.073 0.101 0.188
64000 0.052 0.097 0.119 0.163 0.284 0.832
128000 0.102 0.196 0.246 0.338 0.836 1.785
216000 0.167 0.338 0.431 0.569 0.935 2.118
512000 0.474 0.897 1.561 1.523 2.833 8.865
1000000 1.056 1.774 2.412 3.005 5.679 22.964

For comparison, the new implementation with Layout Right:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.049 0.037 0.046 0.048 0.072 0.062
64000 0.075 0.081 0.102 0.115 0.183 0.168
128000 0.147 0.164 0.225 0.239 0.390 0.359
216000 0.246 0.277 0.362 0.386 0.692 0.740
512000 0.692 0.713 0.971 1.129 1.995 3.474
1000000 1.362 1.636 2.116 2.373 5.209 12.492
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.035 0.037 0.046 0.050 0.072 0.065
64000 0.075 0.081 0.102 0.106 0.184 0.172
128000 0.147 0.162 0.225 0.223 0.390 0.354
216000 0.273 0.278 0.362 0.383 0.694 0.915
512000 0.655 0.726 0.973 1.430 2.142 3.614
1000000 1.357 1.602 2.115 2.344 5.693 12.489
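The difference between the two layouts is where the entries of a multivector sit in memory. In Kokkos, LayoutLeft makes the leftmost index contiguous (column-major) and LayoutRight makes the rightmost index contiguous (row-major). The sketch below shows the resulting offsets for one row of a small multivector. The helper names are hypothetical, and padding or custom strides are ignored.

```python
def offset_layout_left(i, j, num_rows, num_vecs):
    """Column-major: consecutive rows of one vector are adjacent in memory."""
    return i + j * num_rows


def offset_layout_right(i, j, num_rows, num_vecs):
    """Row-major: the entries of all vectors for one row are adjacent."""
    return i * num_vecs + j


num_rows, num_vecs = 4, 3
row = 2

# Offsets touched when reading row 2 across all vectors.
left = [offset_layout_left(row, j, num_rows, num_vecs) for j in range(num_vecs)]
right = [offset_layout_right(row, j, num_rows, num_vecs) for j in range(num_vecs)]

print(left)   # strided by num_rows
print(right)  # contiguous
```

With Layout Right, the inner loop over vectors for a given row reads contiguous memory. This may be part of why the 16-vector timings above are lower for Layout Right (for example about 12.5 versus about 23.3 at 1,000,000 rows), although the page does not state the cause.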

Power8 (White)

The Power8 node has two sockets with 8 cores per socket, so two tests are performed: one with 8 threads and one with 16 threads. Results are presented in the tables below.

OMP_NUM_THREADS=8

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.067 0.107 0.126 0.130 0.196 0.308
64000 0.151 0.238 0.289 0.298 0.525 0.973
128000 0.294 0.471 0.576 0.599 1.410 2.876
216000 0.492 0.816 1.185 1.010 2.125 4.438
512000 1.166 2.115 3.104 3.228 9.331 26.152
1000000 2.378 6.427 8.572 8.384 18.573 43.197
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.068 0.104 0.125 0.130 0.196 0.309
64000 0.151 0.238 0.291 0.300 0.473 1.355
128000 0.295 0.470 0.574 0.600 1.413 3.019
216000 0.492 0.877 0.982 1.022 2.102 4.562
512000 1.169 2.145 3.137 3.193 9.329 26.536
1000000 2.378 6.416 8.471 8.394 18.535 43.243

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.077 0.098 0.124 0.137 0.203 0.353
64000 0.150 0.223 0.287 0.317 0.522 1.146
128000 0.292 0.446 0.577 0.672 1.224 2.952
216000 0.476 0.765 1.003 1.161 2.111 4.458
512000 1.170 2.082 3.154 4.039 8.459 27.922
1000000 2.406 6.431 8.353 10.026 18.469 44.887
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.076 0.099 0.125 0.137 0.201 0.355
64000 0.150 0.225 0.288 0.315 0.529 1.291
128000 0.291 0.446 0.577 0.656 1.167 2.973
216000 0.476 0.750 0.981 1.171 2.109 4.427
512000 1.173 2.095 3.150 4.041 8.504 27.901
1000000 2.412 6.365 8.119 9.919 18.307 44.461

OMP_NUM_THREADS=16

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.042 0.061 0.069 0.073 0.111 0.162
64000 0.084 0.155 0.192 0.160 0.309 0.758
128000 0.157 0.275 0.300 0.309 0.784 1.672
216000 0.257 0.430 0.501 0.526 1.010 2.076
512000 0.584 1.076 1.453 1.535 4.939 14.283
1000000 1.219 2.980 4.100 4.116 9.496 23.458
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.042 0.069 0.069 0.073 0.106 0.181
64000 0.085 0.129 0.153 0.163 0.305 0.756
128000 0.157 0.245 0.300 0.312 0.828 1.267
216000 0.255 0.413 0.503 0.525 1.002 2.040
512000 0.584 1.094 1.501 1.494 3.818 13.125
1000000 1.216 3.028 4.145 4.085 9.546 21.227

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.047 0.059 0.072 0.078 0.112 0.192
64000 0.084 0.123 0.155 0.169 0.257 0.638
128000 0.156 0.244 0.297 0.344 0.600 1.470
216000 0.248 0.411 0.506 0.595 1.035 2.119
512000 0.589 1.056 1.493 1.833 4.053 14.436
1000000 1.229 3.004 3.913 4.851 9.462 22.884
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.047 0.058 0.072 0.078 0.112 0.193
64000 0.085 0.124 0.153 0.169 0.283 0.596
128000 0.155 0.264 0.298 0.353 0.599 1.463
216000 0.249 0.389 0.502 0.606 1.031 2.107
512000 0.587 1.033 1.480 1.828 4.070 14.099
1000000 1.235 2.968 4.038 4.973 9.465 22.105

Cuda

P100 (White)

First, the single vector algorithm:

Matrix size Time
27,000 0.028
64,000 0.070
128,000 0.118
216,000 0.178
512,000 0.405
1,000,000 0.786

Second, the multiple vector algorithm:

New algorithm:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.029 0.035 0.041 0.047 0.071 0.118
64000 0.072 0.069 0.084 0.097 0.160 0.289
128000 0.121 0.121 0.156 0.189 0.318 0.600
216000 0.180 0.203 0.271 0.330 0.563 1.079
512000 0.411 0.501 0.680 0.818 1.403 2.734
1000000 0.797 1.052 1.356 1.630 2.830 5.874
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.028 0.035 0.041 0.047 0.071 0.118
64000 0.072 0.069 0.083 0.097 0.160 0.289
128000 0.121 0.120 0.156 0.189 0.319 0.600
216000 0.180 0.203 0.270 0.330 0.563 1.079
512000 0.411 0.501 0.678 0.819 1.403 2.731
1000000 0.796 0.997 1.352 1.629 2.831 5.866

Reference implementation:

Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.028 0.033 0.040 0.048 0.075 0.132
64000 0.072 0.074 0.091 0.111 0.188 0.358
128000 0.120 0.130 0.166 0.207 0.362 0.705
216000 0.180 0.218 0.279 0.352 0.631 1.250
512000 0.411 0.506 0.661 0.845 1.637 3.164
1000000 0.795 1.003 1.309 1.703 3.116 6.706
Matrix size 1 vectors 2 vectors 3 vectors 4 vectors 8 vectors 16 vectors
27000 0.028 0.032 0.040 0.048 0.075 0.131
64000 0.072 0.071 0.091 0.111 0.187 0.356
128000 0.121 0.123 0.166 0.206 0.360 0.704
216000 0.180 0.207 0.278 0.351 0.627 1.248
512000 0.410 0.478 0.659 0.841 1.538 3.155
1000000 0.793 1.000 1.304 1.702 3.094 6.678

V100 (Lassen)
