Transpose does not scale well with multithread #13

Open
Laurae2 opened this issue Dec 26, 2018 · 0 comments

Laurae2 commented Dec 26, 2018

Using Dual Intel Xeon Gold 6154 on commit 990e59f.

Compilation flags used: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -d:openmp -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Multithreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9945 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte

Laser ForEachStrided
Collected 250 samples in 0.500 seconds
Average time: 1.518 ms
Stddev  time: 2.158 ms
Min     time: 1.153 ms
Max     time: 25.356 ms
Perf:         5.271 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose
Collected 250 samples in 0.266 seconds
Average time: 1.062 ms
Stddev  time: 0.418 ms
Min     time: 0.936 ms
Max     time: 3.818 ms
Perf:         7.530 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 0.400 seconds
Average time: 1.598 ms
Stddev  time: 2.117 ms
Min     time: 0.969 ms
Max     time: 23.107 ms
Perf:         5.006 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 0.411 seconds
Average time: 1.642 ms
Stddev  time: 2.530 ms
Min     time: 0.924 ms
Max     time: 31.653 ms
Perf:         4.871 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 0.445 seconds
Average time: 1.781 ms
Stddev  time: 2.011 ms
Min     time: 1.162 ms
Max     time: 24.661 ms
Perf:         4.492 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 0.068 seconds
Average time: 0.270 ms
Stddev  time: 0.222 ms
Min     time: 0.239 ms
Max     time: 2.669 ms
Perf:         29.637 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking - input row iteration
Collected 250 samples in 0.179 seconds
Average time: 0.715 ms
Stddev  time: 0.279 ms
Min     time: 0.657 ms
Max     time: 3.240 ms
Perf:         11.184 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 0.066 seconds
Average time: 0.265 ms
Stddev  time: 0.159 ms
Min     time: 0.241 ms
Max     time: 2.447 ms
Perf:         30.189 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 0.056 seconds
Average time: 0.223 ms
Stddev  time: 0.095 ms
Min     time: 0.203 ms
Max     time: 1.459 ms
Perf:         35.896 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 0.069 seconds
Average time: 0.277 ms
Stddev  time: 0.160 ms
Min     time: 0.252 ms
Max     time: 2.446 ms
Perf:         28.844 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 0.175 seconds
Average time: 0.698 ms
Stddev  time: 1.759 ms
Min     time: 0.371 ms
Max     time: 18.627 ms
Perf:         11.455 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 0.144 seconds
Average time: 0.574 ms
Stddev  time: 0.975 ms
Min     time: 0.382 ms
Max     time: 12.650 ms
Perf:         13.933 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
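
For readers checking the numbers, here is a minimal sketch of how the header figures and the `Perf` lines appear to relate. It is not taken from the benchmark source: the 4 bytes per element is an assumption inferred from the reported 32 MB for 8 million operations, and `gmemops_per_s` is a hypothetical helper name.

```python
# Shape taken from the benchmark header in the log above.
M, N = 4000, 2000
BYTES_PER_ELEMENT = 4  # assumption: inferred from 32 MB / 8 M operations

ops = M * N                            # one memory operation per element
bytes_moved = ops * BYTES_PER_ELEMENT  # matches the reported 32.000 MB
intensity = ops / bytes_moved          # matches the reported 0.250


def gmemops_per_s(avg_time_ms: float) -> float:
    """Hypothetical helper: throughput in GMEMOPs/s from an average time in ms."""
    return ops / (avg_time_ms * 1e-3) / 1e9


print(f"Required number of operations: {ops / 1e6:.3f} millions")
print(f"Required bytes:                {bytes_moved / 1e6:.3f} MB")
print(f"Arithmetic intensity:          {intensity:.3f}")
# Average time of the multithreaded "Naive transpose" run above.
print(f"Naive transpose:               {gmemops_per_s(1.062):.3f} GMEMOPs/s")
```

The result can differ from the logged `Perf` value in the last digit, because the log prints the average time rounded to three decimals.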

Without OpenMP: nim cpp --passC:"-D_GNU_SOURCE" --passL:"-lpthread" -r -d:release -o:build/bench_transpose benchmarks/transpose/transpose_bench.nim

Singlethreaded results:

Hint: ./build/bench_transpose  [Exec]
Warmup: 0.9940 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations:     8.000 millions
Required bytes:                   32.000 MB
Arithmetic intensity:              0.250 FLOP/byte

Laser ForEachStrided
Collected 250 samples in 9.080 seconds
Average time: 35.957 ms
Stddev  time: 0.289 ms
Min     time: 35.666 ms
Max     time: 37.249 ms
Perf:         0.222 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose
Collected 250 samples in 8.580 seconds
Average time: 34.320 ms
Stddev  time: 0.320 ms
Min     time: 32.876 ms
Max     time: 35.604 ms
Perf:         0.233 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Naive transpose - input row iteration
Collected 250 samples in 8.637 seconds
Average time: 34.549 ms
Stddev  time: 0.243 ms
Min     time: 34.378 ms
Max     time: 35.767 ms
Perf:         0.232 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP
Collected 250 samples in 8.674 seconds
Average time: 34.695 ms
Stddev  time: 0.361 ms
Min     time: 33.291 ms
Max     time: 36.134 ms
Perf:         0.231 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Collapsed OpenMP - input row iteration
Collected 250 samples in 8.694 seconds
Average time: 34.775 ms
Stddev  time: 0.339 ms
Min     time: 34.471 ms
Max     time: 36.496 ms
Perf:         0.230 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking
Collected 250 samples in 2.383 seconds
Average time: 9.533 ms
Stddev  time: 0.172 ms
Min     time: 9.345 ms
Max     time: 10.990 ms
Perf:         0.839 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking - input row iteration
Collected 250 samples in 4.512 seconds
Average time: 18.047 ms
Stddev  time: 0.232 ms
Min     time: 17.833 ms
Max     time: 19.423 ms
Perf:         0.443 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling
Collected 250 samples in 3.625 seconds
Average time: 14.498 ms
Stddev  time: 0.236 ms
Min     time: 14.244 ms
Max     time: 15.882 ms
Perf:         0.552 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling - input row iteration
Collected 250 samples in 2.491 seconds
Average time: 9.964 ms
Stddev  time: 0.222 ms
Min     time: 9.820 ms
Max     time: 11.652 ms
Perf:         0.803 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Cache blocking with Prefetch
Collected 250 samples in 2.583 seconds
Average time: 10.331 ms
Stddev  time: 0.169 ms
Min     time: 9.836 ms
Max     time: 11.829 ms
Perf:         0.774 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.699 seconds
Average time: 10.796 ms
Stddev  time: 0.216 ms
Min     time: 10.669 ms
Max     time: 12.463 ms
Perf:         0.741 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318

Production implementation
Collected 250 samples in 2.712 seconds
Average time: 10.849 ms
Stddev  time: 0.181 ms
Min     time: 10.708 ms
Max     time: 12.350 ms
Perf:         0.737 GMEMOPs/s

Display output[1] to make sure it's not optimized away
0.7808474898338318
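
To compare the two runs, the sketch below computes the speedup of each kernel as single-threaded average time divided by multithreaded average time, using only the `Average time` values from the logs above. The core count is an assumption (two sockets of 18 cores each for the Xeon Gold 6154); the number of threads OpenMP actually used is not shown in the logs.

```python
# Average times in ms, copied from the two logs above:
# name -> (single-threaded, multithreaded)
avg_ms = {
    "Laser ForEachStrided": (35.957, 1.518),
    "Naive transpose": (34.320, 1.062),
    "Naive transpose - input row iteration": (34.549, 1.598),
    "Collapsed OpenMP": (34.695, 1.642),
    "Collapsed OpenMP - input row iteration": (34.775, 1.781),
    "Cache blocking": (9.533, 0.270),
    "Cache blocking - input row iteration": (18.047, 0.715),
    "2D Tiling": (14.498, 0.265),
    "2D Tiling - input row iteration": (9.964, 0.223),
    "Cache blocking with Prefetch": (10.331, 0.277),
    "2D Tiling + Prefetch - input row iteration": (10.796, 0.698),
    "Production implementation": (10.849, 0.574),
}

PHYSICAL_CORES = 2 * 18  # assumption: dual socket, 18 cores per socket

# Speedup = single-threaded time / multithreaded time.
speedup = {name: single / multi for name, (single, multi) in avg_ms.items()}

# Print kernels from worst to best scaling.
for name, s in sorted(speedup.items(), key=lambda kv: kv[1]):
    efficiency = s / PHYSICAL_CORES
    print(f"{name:45s} {s:6.1f}x  ({efficiency:.0%} of {PHYSICAL_CORES} cores)")
```

The multithreaded averages have large standard deviations and outlier `Max` times, so the same ratio computed from the `Min time` lines may rank the kernels differently.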

Laurae2 changed the title from "Transpose does not scale with OpenMP" to "Transpose does not scale well with multithread" on Dec 26, 2018