[ODK] Meeting 2018 10 17
A. Breust edited this page Nov 7, 2018
- For a 4k by 4k matrix on the Dahu cluster, 2 nodes vs. 4 nodes (32 cores/node): run times are 554 s vs. 375 s (speedup factor 1.48). These are single runs, not averages.
- On average, using 5 nodes (150 cores; max is 160), for a 4k by 4k matrix, the run time is 404 s with early termination.
- On average, using 1 node of Luke42 (40 cores), for a 4k by 4k matrix, the run time is 895 s with early termination.
- On average, using 1 node of Luke42 (40 cores), for a 4k by 4k matrix, the run time is 978 s without early termination.
- Practical issue: for a 10k by 10k matrix, there is not enough RAM to store both matrices (the GMP one and the mod p one) on each node.
For the `solve` computation only:
- 2.47 GFlops for the 4-node, 128-core, 4k by 4k run. Not too shabby, but not great either.
- 3.34 GFlops for the 2-node, 64-core, 4k by 4k run. Noticeably better.
| Matrix order | Cluster | Nb nodes | Nb cores | Early termination? | Run time (s) | GFlops (solve only) |
|---|---|---|---|---|---|---|
| 4000 | Dahu | 2 | 64 | No | 554 | 3.34 |
| 4000 | Dahu | 4 | 128 | No | 375 | 2.47 |
| 4000 | Dahu | 5 | 150 | Yes | 404 | |
| 4000 | Dahu | 2 | 60 | Yes | 643 | |
| 4000 | Luke42 | 1 | 40 | Yes | 895 | |
| 4000 | Luke42 | 1 | 40 | No | 978 | |
| 1000 | Luke42 | 1 | 40 | Yes | 9.14 | |
| 1000 | Luke42 | 1 | 40 | Yes | 7.83 | |
- Cost: $\frac{2}{3}n^3 \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23} + n \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23}$
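One possible reading of the cost bullet above, under assumptions not stated in the notes (primes of roughly 23 bits, and a Hadamard-style bound of $\frac{n}{2}(\log(n) + 2\log(\lVert A \rVert_{\infty}))$ bits on the output size): the common factor is the number of residues $k$, each residue costing one solve mod $p_i$, plus a recombination term.

```latex
% Assumed reading: k = number of ~23-bit prime residues needed to cover
% the bound on the size of the solution entries.
k \;\approx\; \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23}
\qquad\Longrightarrow\qquad
\mathrm{Cost} \;\approx\;
\underbrace{\frac{2}{3}n^{3}\,k}_{\text{one solve mod } p_i \text{ per residue}}
\;+\;
\underbrace{n\,k}_{\text{recombination of } n \text{ entries}}
```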
- Use the OMP code -> hybrid mode (MPI between nodes, OpenMP within a node).
- Or: send the matrix and the primes to each node, and use CRA-OMP. This is a new algorithm though, which will be time consuming to write.
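To make the CRA-OMP idea concrete, here is a minimal sketch of Chinese-remainder recombination in plain C++. All names (`modinv`, `crt`) and the tiny primes are illustrative, not the LinBox API; in the scheme above, the per-prime solves are the independent part that could run under an OpenMP parallel loop, while this incremental recombination is the sequential tail.

```cpp
#include <cassert>
#include <vector>

// Modular inverse of a modulo p (assumes gcd(a, p) == 1),
// via the extended Euclidean algorithm.
long long modinv(long long a, long long p) {
    long long old_r = a % p, r = p;
    long long old_s = 1, s = 0;
    while (r != 0) {
        long long q = old_r / r;
        long long tmp = old_r - q * r; old_r = r; r = tmp;
        tmp = old_s - q * s; old_s = s; s = tmp;
    }
    return ((old_s % p) + p) % p;
}

// Incremental CRT: combine residues res[i] = x mod p[i] into x mod prod(p).
// Computing the residues themselves (the solves mod p) is embarrassingly
// parallel; only this recombination loop is inherently sequential.
long long crt(const std::vector<long long>& res, const std::vector<long long>& p) {
    long long x = res[0] % p[0];
    long long m = p[0];
    for (std::size_t i = 1; i < res.size(); ++i) {
        // Solve x + m*t == res[i] (mod p[i]) for t.
        long long t = (((res[i] - x) % p[i] + p[i]) % p[i])
                      * modinv(m % p[i], p[i]) % p[i];
        x += m * t;
        m *= p[i];
    }
    return x;
}
```

The primes here are deliberately tiny to avoid overflow in `long long`; a real implementation would work with word-size primes and multiprecision accumulation.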
- Communication time seems negligible.
- It would be nice to time each step precisely and check the task time balance, using timestamps for instance.
- Of utmost importance: get timings on how long each task takes, and check whether it scales properly.
Run, for instance, a sequential run, a parallel run on one node, and a parallel run on 2 nodes, time each step accurately, and see how it balances and scales. It is necessary to make sure that the CRT and RR steps are not taking too long: they cannot yet be done in parallel, so they are a potential huge bottleneck for scaling. Ideally, the reduction mod p and the solve mod p should be the tasks that take the longest. Use small instances (200 by 200, for instance). For timers, `user_timer` gives the total time, while `real_timer` gives the actual time, waiting/halting times included. Using inputs with a large number of bits would be nice, but would have a huge impact on the mod p step.
- Run benchmarks on instances as large as possible, with the algorithm in full mode, using all 32 cores per node.
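A minimal sketch of the per-step timestamping suggested above, using standard `std::chrono` wall-clock timestamps. The `StepTimer` struct and phase names are hypothetical stand-ins; in practice LinBox's own `user_timer`/`real_timer` would be wrapped around each phase.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Illustrative per-step timer: records the elapsed wall-clock time of each
// named phase, so the balance between phases can be compared across runs.
struct StepTimer {
    using clock = std::chrono::steady_clock;
    std::vector<std::pair<std::string, double>> steps;  // (phase, seconds)
    clock::time_point last = clock::now();

    void mark(const std::string& name) {
        auto now = clock::now();
        steps.emplace_back(name, std::chrono::duration<double>(now - last).count());
        last = now;
    }
};

// Run the pipeline phases (dummy workloads here) and return per-phase timings.
std::vector<std::pair<std::string, double>> time_phases() {
    StepTimer timer;
    volatile double sink = 0;  // stand-in workload for each real phase
    for (const char* phase : {"reduction mod p", "solve mod p", "CRT", "RR"}) {
        for (int i = 0; i < 100000; ++i) sink += i;
        timer.mark(phase);
    }
    return timer.steps;
}
```

Comparing the returned durations between a sequential run, a 1-node parallel run, and a 2-node run would show directly whether the CRT and RR phases stay small as the parallel phases shrink.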
- Zhu: timings
- Alexis: scripts for plotting
- Automate benchmarks to measure precisely each step of the algorithm in full mode, on small instances (100 by 100). Plot number of nodes vs. time.
- Plot the sequential Dixon solver.
- Time LinBox vs. IML (authored by Storjohann, no longer maintained, used in Sage for integer solving).