-
Notifications
You must be signed in to change notification settings - Fork 27
[ODK] RNS Dixon parallelized benchmarks
A. Breust edited this page Jul 10, 2019
·
15 revisions
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (R) | REAL (R) |
---|---|---|---|---|---|---|
500 | 100 | 1 | 4 | BothParallel | 112.3800 | 31.3896 |
500 | 100 | 2 | 4 | BothParallel | 48.1640 | 13.2718 |
500 | 100 | 3 | 4 | BothParallel | 35.5600 | 10.2058 |
500 | 100 | 4 | 4 | BothParallel | 27.3280 | 7.9700 |
500 | 100 | 6 | 4 | BothParallel | 22.7920 | 6.9586 |
500 | 100 | 8 | 4 | BothParallel | 19.1320 | 6.1989 |
500 | 100 | 12 | 4 | BothParallel | 15.5600 | 5.4505 |
500 | 100 | 16 | 4 | BothParallel | 14.0360 | 5.0601 |
500 | 100 | 20 | 4 | BothParallel | 13.6080 | 5.2611 |
500 | 100 | 24 | 4 | BothParallel | 12.7600 | 5.0058 |
500 | 100 | 28 | 4 | BothParallel | 12.7240 | 4.9811 |
500 | 100 | 32 | 4 | BothParallel | 12.7200 | 4.9913 |
500 | 100 | 36 | 4 | BothParallel | 12.8000 | 5.3516 |
500 | 100 | 40 | 4 | BothParallel | 12.9360 | 5.3580 |
- Pourquoi le ralentissement (reproductible) à p = 20 ?
DIM | BSIZE | PRIMES | THRDS | R_TOT | R_INIT | R_LIFT | R_CRT_RAT |
---|---|---|---|---|---|---|---|
500 | 100 | 1 | 4 | 29.7002 | .1759 | 29.1154 | .4071 |
500 | 100 | 2 | 4 | 13.7751 | .1924 | 12.9467 | .6342 |
500 | 100 | 3 | 4 | 10.3130 | .2017 | 9.2142 | .8953 |
500 | 100 | 4 | 4 | 8.9845 | .2166 | 7.9184 | .8473 |
500 | 100 | 6 | 4 | 7.6334 | .2880 | 6.2613 | 1.0818 |
500 | 100 | 8 | 4 | 6.1640 | .3085 | 4.8471 | 1.0060 |
500 | 100 | 12 | 4 | 6.0381 | .4121 | 4.4215 | 1.2019 |
500 | 100 | 16 | 4 | 5.3471 | .4997 | 3.7399 | 1.1043 |
500 | 100 | 20 | 4 | 5.4405 | .6030 | 3.3833 | 1.4507 |
500 | 100 | 24 | 4 | 5.0931 | .6895 | 3.1115 | 1.2883 |
500 | 100 | 28 | 4 | 5.0154 | .7767 | 3.0238 | 1.2107 |
500 | 100 | 32 | 4 | 4.9846 | .8715 | 2.9518 | 1.1568 |
500 | 100 | 36 | 4 | 5.3823 | .9762 | 2.9195 | 1.4819 |
500 | 100 | 40 | 4 | 5.6153 | 1.0543 | 3.0158 | 1.5403 |
500 | 100 | 44 | 4 | 5.4004 | 1.1510 | 2.8320 | 1.4120 |
500 | 100 | 48 | 4 | 5.4033 | 1.2592 | 2.7961 | 1.3431 |
500 | 100 | 52 | 4 | 5.4471 | 1.3523 | 2.7674 | 1.3216 |
500 | 100 | 56 | 4 | 5.5458 | 1.4470 | 2.8158 | 1.2766 |
500 | 100 | 60 | 4 | 5.4927 | 1.5231 | 2.7218 | 1.2412 |
500 | 100 | 64 | 4 | 5.5324 | 1.6058 | 2.6910 | 1.2298 |
500 | 100 | 68 | 4 | 6.0481 | 1.7179 | 2.6463 | 1.6777 |
500 | 100 | 72 | 4 | 6.0070 | 1.8015 | 2.6233 | 1.5760 |
500 | 100 | 76 | 4 | 6.0631 | 1.9172 | 2.6064 | 1.5332 |
500 | 100 | 80 | 4 | 6.1072 | 2.0084 | 2.5915 | 1.5005 |
500 | 100 | 84 | 4 | 6.3278 | 2.1375 | 2.6368 | 1.5469 |
500 | 100 | 88 | 4 | 6.2854 | 2.1727 | 2.6138 | 1.4916 |
500 | 100 | 92 | 4 | 6.3047 | 2.3038 | 2.5624 | 1.4316 |
500 | 100 | 96 | 4 | 6.2986 | 2.3486 | 2.5501 | 1.3928 |
500 | 100 | 100 | 4 | 6.3985 | 2.4235 | 2.5477 | 1.4201 |
500 | 100 | 104 | 4 | 6.4521 | 2.5241 | 2.5341 | 1.3869 |
500 | 100 | 108 | 4 | 6.5842 | 2.6276 | 2.5588 | 1.3909 |
500 | 100 | 112 | 4 | 6.6160 | 2.7266 | 2.5416 | 1.3407 |
500 | 100 | 116 | 4 | 6.6710 | 2.8107 | 2.5459 | 1.3067 |
500 | 100 | 120 | 4 | 6.7433 | 2.9029 | 2.5369 | 1.2957 |
500 | 100 | 124 | 4 | 6.8303 | 2.9827 | 2.5515 | 1.2883 |
500 | 100 | 128 | 4 | 6.9664 | 3.0790 | 2.6000 | 1.2790 |
- DIM 100 BSIZE 100
- DIM 200 BSIZE 100
- DIM 300 BSIZE 100
- DIM 500 BSIZE 100
- DIM 100 BSIZE 1000
- Proposition : ???
These runs have been done with the following setup:
- Parallel init for the inverses precomputations.
- Parallel for the euclidian division and apply of inverses.
- Parallel FGEMM for residues update with runtime configurable ParSeqHelper::Compose
- Parallel and cache-friendly
R[j] <= R[j] / pj
- Matrix-based
fconvert_rns
andaddin(R, Q)
.
NOTE: Previous runs (2019-06-25) were done with ./test-solve-full, these are now done with ./benchmark-dense-solve. Each result is the median value over three iterations.
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (D/R) | REAL (D/R) |
---|---|---|---|---|---|---|
100 | 100 | 12 | 4 | ParallelRnsOnly | .1760/.3800 | .1729/.1308 |
100 | 100 | 12 | 2 | ParallelRnsOnly | .1640/.2720 | .1599/.1541 |
100 | 100 | 12 | 1 | ParallelRnsOnly | .1520/.2000 | .1551/.1970 |
500 | 100 | 12 | 4 | ParallelRnsOnly | 9.8840/17.1360 | 9.9488/5.8152 |
500 | 100 | 12 | 2 | ParallelRnsOnly | 10.8560/10.0880 | 10.8792/6.0366 |
500 | 100 | 12 | 1 | ParallelRnsOnly | 10.2360/7.4880 | 10.2764/7.5018 |
- Faire
omp_set_num_threads()
a un petit impact sur le Dixon séquentiel. - Dès n = 500, DixonRNS est plus performant que Dixon séquentiel, même sans threading.
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (D/R) | REAL (D/R) |
---|---|---|---|---|---|---|
500 | 100 | 1 | 4 | ParallelRnsOnly | 12.0240/120.2600 | 12.0441/32.0655 |
500 | 100 | 4 | 4 | ParallelRnsOnly | 10.3920/31.0160 | 10.4289/9.0184 |
500 | 100 | 8 | 4 | ParallelRnsOnly | 11.8560/20.5560 | 11.8805/6.3450 |
500 | 100 | 16 | 4 | ParallelRnsOnly | 11.0520/14.9640 | 11.0828/5.2746 |
500 | 100 | 32 | 4 | ParallelRnsOnly | 9.9400/13.8520 | 9.9576/5.7068 |
500 | 100 | 64 | 4 | ParallelRnsOnly | 10.9080/15.8160 | 10.9519/8.8123 |
- Le nombre de nombres premiers à utiliser dépend du problème. Il n'y a pas de valeur magique.
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (D/R) | REAL (D/R) |
---|---|---|---|---|---|---|
500 | 100 | 12 | 4 | ParallelRnsOnly | 11.0280/17.1680 | 11.0519/6.0090 |
500 | 100 | 12 | 4 | ParallelFgemmOnly | 10.8480/18.9560 | 10.8860/6.3640 |
500 | 100 | 12 | 4 | BothSequential | 10.8560/14.1320 | 10.8713/6.0654 |
500 | 100 | 12 | 4 | BothParallel | 10.7680/16.5080 | 10.8045/5.8532 |
500 | 100 | 12 | 8 | ParallelRnsOnly | 11.1320/13.3560 | 11.2400/6.0732 |
500 | 100 | 12 | 8 | ParallelFgemmOnly | 9.4920/10.2120 | 9.5009/6.3665 |
500 | 100 | 12 | 8 | BothSequential | 12.1280/10.1600 | 12.1606/6.3014 |
500 | 100 | 12 | 8 | BothParallel | 10.2440/13.5720 | 10.2528/6.2492 |
- Difficile de dire quoi que ce soit, il y a pas mal de variance en fonction des nombres premiers tirés.
DIM | BSIZE | PRIMES | THRDS | D_IT | D_LIFT | D_TOT | R_TOT | R_INIT | R_LIFT | R_CRT_RAT |
---|---|---|---|---|---|---|---|---|---|---|
500 | 100 | 12 | 4 | 4112 | 10.7063 | 10.8053 | 6.1891 | .4569 | 4.4476 | 1.2825 |
500 | 100 | 16 | 4 | 4058 | 11.6784 | 11.8271 | 5.6998 | .4931 | 4.0675 | 1.1370 |
500 | 100 | 32 | 4 | 4029 | 11.2154 | 11.3647 | 5.5958 | .8869 | 3.4842 | 1.2208 |
500 | 100 | 64 | 4 | 4039 | 11.2943 | 11.4433 | 5.7470 | 1.6033 | 2.8665 | 1.2717 |
500 | 100 | 128 | 4 | 4063 | 11.3224 | 11.4718 | 7.0930 | 3.0806 | 2.6821 | 1.3212 |
- Chercher R_INIT = R_LIFT n'est pas le meilleur. Mais sur une machine très parallèle, R_LIFT gagnera beaucoup à avoir beaucoup de nombres premiers.
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (D/R) | REAL (D/R) |
---|---|---|---|---|---|---|
500 | 100 | 12 | 4 | ParallelRnsOnly | 10.6400/17.1400 | 10.6651/5.9145 |
500 | 100 | 12 | 8 | ParallelRnsOnly | 10.3560/13.9040 | 10.3891/5.8261 |
- Pas de slow-down avec PALADIN. Merci Zhu!
DIM | BSIZE | PRIMES | THRDS | RNSFGEMM | USER (D/R) | REAL (D/R) |
---|---|---|---|---|---|---|
100 | 100 | 32 | 32 | ParallelRnsOnly | .3440/29.4120 | .2657/1.6594 |
500 | 100 | 32 | 32 | ParallelRnsOnly | 12.6920/198.9280 | 12.6396/22.8117 |
1000 | 100 | 32 | 32 | ParallelRnsOnly | 205.2720/575.9480 | 205.4940/94.2071 |
1000 | 100 | 256 | 32 | ParallelRnsOnly | /365.5400 | /84.3257 |
2000 | 100 | 256 | 32 | ParallelRnsOnly | 1259.9800/1763.0900 | 1262.3600/389.6000 |
These runs have been done with the following setup:
- Nothing within the Init is done in parallel.
-
COULD HAVE PARALLELIZATION ON THE COMPUTATION OF INVERSES
-
- Parallel for the euclidian division and apply of inverses.
- Parallel FGEMM for residues update with
FFLAS::ParSeqHelper::Compose<RNSParallel, FGEMMSequential>(4, 4)
-
COULD BE IMPROVED BY NOT HARDCODING ANYTHING
-
-
R[j] <= R[j] / pj
is not parallel-
FIND OUT WHY IT IS NOT WORKING SIMPLY
-
-
fconvert_rns
to getR
in ZZ is done on matrix.
Naming conventions:
-
PRIMES
: The number of primes used for RNS Dixon. Basically how much we can parallelized. -
D_LIFT
: Classic Dixon lifting + RatRecon -
D_IT
: Classic Dixon number of iterations -
R_IT
: RNS Dixon number of iterations -
R_INIT
: RNS Dixon precomputing inverses mod all pj -
R_LIFT
: RNS Dixon lifting accumulations -
R_CRTP
: RNS Dixon CRT all progress() -
R_CRT
: RNS Dixon CRT result() -
R_RAT
: RNS Dixon rational reconstruction
DIM | BITSIZE | PRIMES | D_IT | D_LIFT | R_IT | R_INIT | R_LIFT | R_CRTP | R_CRT | R_RAT |
---|---|---|---|---|---|---|---|---|---|---|
100 | 100 | 256 | 800 | .3171 | 4 | .6772 | .2350 | .0379 | .0128 | .0248 |
100 | 100 | 128 | 805 | .3123 | 8 | .3398 | .2805 | .0285 | .0128 | .0250 |
100 | 100 | 64 | 807 | .3143 | 16 | .1727 | .3388 | .0237 | .0128 | .0248 |
100 | 100 | 32 | 802 | .3106 | 31 | .0890 | .5142 | .0193 | .0125 | .0242 |
100 | 100 | 16 | 798 | .3036 | 61 | .0477 | .9056 | .0161 | .0116 | .0243 |
100 | 100 | 8 | 801 | .3153 | 121 | .0273 | 1.5501 | .0125 | .0094 | .0242 |
100 | 100 | 4 | 822 | .3083 | 242 | .0172 | 2.9304 | .0105 | .0060 | .0237 |
100 | 100 | 2 | 817 | .3133 | 489 | .0118 | 5.8470 | .0086 | .0001 | .0236 |
100 | 100 | 1 | 813 | .3108 | 978 | .0095 | 11.5726 | .0001 | .0001 | .0238 |
100 | 100 | 32 | 802 | .3106 | 31 | .0890 | .5142 | .0193 | .0125 | .0242 |
100 | 200 | 32 | 1575 | .6803 | 60 | .0915 | 1.0971 | .0460 | .0348 | .0751 |
200 | 100 | 32 | 1631 | 1.2162 | 61 | .3204 | 1.3498 | .0915 | .0721 | .1044 |
300 | 100 | 32 | 2421 | 3.0914 | 91 | .7353 | 2.6728 | .2338 | .1972 | .2580 |
400 | 100 | 32 | 3250 | 14.3534 | 122 | 2.0951 | 10.9242 | .6531 | .6323 | .5688 |
DIM | BITSIZE | PRIMES | D_IT | D_LIFT | R_IT | R_INIT | R_LIFT | R_CRTP | R_CRT | R_RAT |
---|---|---|---|---|---|---|---|---|---|---|
100 | 100 | 32 | 807 | .1426 | 31 | .1494 | .1103 | .0105 | .0066 | .0135 |
100 | 200 | 32 | 1571 | .3465 | 60 | .1532 | .3258 | .0249 | .0182 | .0408 |
200 | 100 | 32 | 1608 | .6997 | 61 | .4154 | .5350 | .0492 | .0371 | .0557 |
300 | 100 | 32 | 2452 | 1.8967 | 91 | .7970 | 1.2881 | .1215 | .1020 | .1354 |
400 | 100 | 32 | 3227 | 5.2680 | 122 | 1.3045 | 2.5307 | .2442 | .2108 | .2635 |
1000 | 100 | 32 | 8109 | 90.4061 | 307 | 7.5955 | 25.1093 | 2.1447 | 1.8424 | 2.1459 |
One D_LIFT
details:
DIM | BITSIZE | PRIMES | DIV and c = A^{-1} r | FGEMM R <= R - Ac | R <= R / p | CONVERT r <= Q + R |
---|---|---|---|---|---|---|
1000 | 100 | 32 | 0.0219879 | 0.0230379 | 0.014802 | 0.0106769 |