Hint: ./build/bench_transpose [Exec]
Warmup: 0.9945 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations: 8.000 millions
Required bytes: 32.000 MB
Arithmetic intensity: 0.250 FLOP/byte
Laser ForEachStrided
Collected 250 samples in 0.500 seconds
Average time: 1.518 ms
Stddev time: 2.158 ms
Min time: 1.153 ms
Max time: 25.356 ms
Perf: 5.271 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose
Collected 250 samples in 0.266 seconds
Average time: 1.062 ms
Stddev time: 0.418 ms
Min time: 0.936 ms
Max time: 3.818 ms
Perf: 7.530 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose - input row iteration
Collected 250 samples in 0.400 seconds
Average time: 1.598 ms
Stddev time: 2.117 ms
Min time: 0.969 ms
Max time: 23.107 ms
Perf: 5.006 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP
Collected 250 samples in 0.411 seconds
Average time: 1.642 ms
Stddev time: 2.530 ms
Min time: 0.924 ms
Max time: 31.653 ms
Perf: 4.871 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP - input row iteration
Collected 250 samples in 0.445 seconds
Average time: 1.781 ms
Stddev time: 2.011 ms
Min time: 1.162 ms
Max time: 24.661 ms
Perf: 4.492 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking
Collected 250 samples in 0.068 seconds
Average time: 0.270 ms
Stddev time: 0.222 ms
Min time: 0.239 ms
Max time: 2.669 ms
Perf: 29.637 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking - input row iteration
Collected 250 samples in 0.179 seconds
Average time: 0.715 ms
Stddev time: 0.279 ms
Min time: 0.657 ms
Max time: 3.240 ms
Perf: 11.184 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling
Collected 250 samples in 0.066 seconds
Average time: 0.265 ms
Stddev time: 0.159 ms
Min time: 0.241 ms
Max time: 2.447 ms
Perf: 30.189 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling - input row iteration
Collected 250 samples in 0.056 seconds
Average time: 0.223 ms
Stddev time: 0.095 ms
Min time: 0.203 ms
Max time: 1.459 ms
Perf: 35.896 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking with Prefetch
Collected 250 samples in 0.069 seconds
Average time: 0.277 ms
Stddev time: 0.160 ms
Min time: 0.252 ms
Max time: 2.446 ms
Perf: 28.844 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling + Prefetch - input row iteration
Collected 250 samples in 0.175 seconds
Average time: 0.698 ms
Stddev time: 1.759 ms
Min time: 0.371 ms
Max time: 18.627 ms
Perf: 11.455 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Production implementation
Collected 250 samples in 0.144 seconds
Average time: 0.574 ms
Stddev time: 0.975 ms
Min time: 0.382 ms
Max time: 12.650 ms
Perf: 13.933 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
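The header figures and the `Perf` lines in this log can be cross-checked from the matrix shape alone. Below is a minimal sketch, assuming 4-byte (float32) elements and one memory operation per element, which is what the reported 32.000 MB and 8.000 million operations imply. The function name `transpose_metrics` is hypothetical and is not part of the benchmark.

```python
def transpose_metrics(m, n, avg_time_ms, bytes_per_element=4):
    """Recompute the log's header and Perf figures (float32 is an assumption)."""
    ops = m * n                               # one element moved per operation
    required_bytes = ops * bytes_per_element  # counted once, as the log does
    intensity = ops / required_bytes          # labelled "FLOP/byte" in the log
    gmemops = ops / (avg_time_ms * 1e-3) / 1e9
    return ops, required_bytes, intensity, gmemops


# Shape and average time taken from the "Laser ForEachStrided" entry above.
ops, nbytes, intensity, perf = transpose_metrics(4000, 2000, avg_time_ms=1.518)
print(f"{ops / 1e6:.3f} millions, {nbytes / 1e6:.3f} MB, "
      f"{intensity:.3f} FLOP/byte, {perf:.3f} GMEMOPs/s")
```

The recomputed throughput can differ from the logged value in the last digit, because the log prints the average time rounded to three decimals.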
Without OpenMP:
nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim
Single-threaded results:
Hint: ./build/bench_transpose [Exec]
Warmup: 0.9940 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations: 8.000 millions
Required bytes: 32.000 MB
Arithmetic intensity: 0.250 FLOP/byte
Laser ForEachStrided
Collected 250 samples in 9.080 seconds
Average time: 35.957 ms
Stddev time: 0.289 ms
Min time: 35.666 ms
Max time: 37.249 ms
Perf: 0.222 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose
Collected 250 samples in 8.580 seconds
Average time: 34.320 ms
Stddev time: 0.320 ms
Min time: 32.876 ms
Max time: 35.604 ms
Perf: 0.233 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose - input row iteration
Collected 250 samples in 8.637 seconds
Average time: 34.549 ms
Stddev time: 0.243 ms
Min time: 34.378 ms
Max time: 35.767 ms
Perf: 0.232 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP
Collected 250 samples in 8.674 seconds
Average time: 34.695 ms
Stddev time: 0.361 ms
Min time: 33.291 ms
Max time: 36.134 ms
Perf: 0.231 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP - input row iteration
Collected 250 samples in 8.694 seconds
Average time: 34.775 ms
Stddev time: 0.339 ms
Min time: 34.471 ms
Max time: 36.496 ms
Perf: 0.230 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking
Collected 250 samples in 2.383 seconds
Average time: 9.533 ms
Stddev time: 0.172 ms
Min time: 9.345 ms
Max time: 10.990 ms
Perf: 0.839 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking - input row iteration
Collected 250 samples in 4.512 seconds
Average time: 18.047 ms
Stddev time: 0.232 ms
Min time: 17.833 ms
Max time: 19.423 ms
Perf: 0.443 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling
Collected 250 samples in 3.625 seconds
Average time: 14.498 ms
Stddev time: 0.236 ms
Min time: 14.244 ms
Max time: 15.882 ms
Perf: 0.552 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling - input row iteration
Collected 250 samples in 2.491 seconds
Average time: 9.964 ms
Stddev time: 0.222 ms
Min time: 9.820 ms
Max time: 11.652 ms
Perf: 0.803 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking with Prefetch
Collected 250 samples in 2.583 seconds
Average time: 10.331 ms
Stddev time: 0.169 ms
Min time: 9.836 ms
Max time: 11.829 ms
Perf: 0.774 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.699 seconds
Average time: 10.796 ms
Stddev time: 0.216 ms
Min time: 10.669 ms
Max time: 12.463 ms
Perf: 0.741 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Production implementation
Collected 250 samples in 2.712 seconds
Average time: 10.849 ms
Stddev time: 0.181 ms
Min time: 10.708 ms
Max time: 12.350 ms
Perf: 0.737 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
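To see how each variant scales, the average times from the two runs can be turned into speedup ratios. This is a rough sketch: the numbers are copied from the logs above, and because the OpenMP run shows large standard deviations (maxima of 20 to 30 ms for several variants), the ratios should be read as approximate.

```python
# Average times in ms copied from the two runs: (with OpenMP, without OpenMP).
avg_ms = {
    "Laser ForEachStrided": (1.518, 35.957),
    "Naive transpose": (1.062, 34.320),
    "Naive transpose - input row iteration": (1.598, 34.549),
    "Collapsed OpenMP": (1.642, 34.695),
    "Collapsed OpenMP - input row iteration": (1.781, 34.775),
    "Cache blocking": (0.270, 9.533),
    "Cache blocking - input row iteration": (0.715, 18.047),
    "2D Tiling": (0.265, 14.498),
    "2D Tiling - input row iteration": (0.223, 9.964),
    "Cache blocking with Prefetch": (0.277, 10.331),
    "2D Tiling + Prefetch - input row iteration": (0.698, 10.796),
    "Production implementation": (0.574, 10.849),
}

# Speedup = single-threaded average / multithreaded average.
speedups = {name: single / multi for name, (multi, single) in avg_ms.items()}

# Print from worst-scaling to best-scaling variant.
for name, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name:45s} {s:6.1f}x")
```

Substituting the `Min time` values for the averages is a reasonable alternative when the outliers in the multithreaded run are a concern.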
Laurae2 changed the title from "Transpose does not scale with OpenMP" to "Transpose does not scale well with multithread" on Dec 26, 2018.
Using Dual Intel Xeon Gold 6154 on commit 990e59f.
Compilation flags used for the multithreaded results (with OpenMP):
nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -d:openmp -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim
Compilation flags used for the single-threaded results (without OpenMP):
nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim