[ODK] Meeting 2018 10 17
A. Breust edited this page Nov 7, 2018
- For a 4k by 4k matrix on the Dahu cluster, 2 nodes vs. 4 nodes (32 cores/node): run times are 554 s vs. 375 s (speedup factor 1.48). These are single runs, not averages.
- On average, using 5 nodes (150 cores; max is 160), for a 4k by 4k matrix, the run time is 404 s with early termination.
- On average, using 1 node of Luke42 (40 cores), for a 4k by 4k matrix, the run time is 895 s with early termination.
- On average, using 1 node of Luke42 (40 cores), for a 4k by 4k matrix, the run time is 978 s without early termination.
- Practical issue: for a 10k by 10k matrix, there is not enough RAM to store both matrices (the GMP one and the mod p one) on each node.
For the `solve` computation only:
- 2.47 GFlops for the 4-node, 128-core, 4k by 4k run. Not too shabby, but not great either.
- 3.34 GFlops for the 2-node, 64-core, 4k by 4k run. Noticeably better.
| Matrix order | Cluster | Nb nodes | Nb cores | Early termination? | Run time (s) | GFlops (solve only) |
|---|---|---|---|---|---|---|
| 4000 | Dahu | 2 | 64 | No | 554 | 3.34 |
| 4000 | Dahu | 4 | 128 | No | 375 | 2.47 |
| 4000 | Dahu | 5 | 150 | Yes | 404 | |
| 4000 | Dahu | 2 | 60 | Yes | 643 | |
| 4000 | Luke42 | 1 | 40 | Yes | 895 | |
| 4000 | Luke42 | 1 | 40 | No | 978 | |
| 1000 | Luke42 | 1 | 40 | Yes | 9.14 | |
| 1000 | Luke42 | 1 | 40 | Yes | 7.83 | |
- Cost: $\frac{2}{3}n^3 \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23} + n \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23}$
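One possible reading of the cost bullet above, under assumptions not stated in the notes (primes of roughly 23 bits, and a Hadamard-style bound of $\frac{n}{2}(\log(n) + 2\log(\lVert A \rVert_{\infty}))$ bits on the output size): the common factor is the number of residues $k$, each residue costing one solve mod $p_i$, plus a recombination term.

```latex
% Assumed reading: k = number of ~23-bit prime residues needed to cover
% the bound on the size of the solution entries.
k \;\approx\; \frac{\frac{n}{2}\left(\log(n) + 2\log(\lVert A \rVert_{\infty})\right)}{23}
\qquad\Longrightarrow\qquad
\mathrm{Cost} \;\approx\;
\underbrace{\frac{2}{3}n^{3}\,k}_{\text{one solve mod } p_i \text{ per residue}}
\;+\;
\underbrace{n\,k}_{\text{recombination of } n \text{ entries}}
```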
- Use the OMP code -> hybrid mode (MPI between nodes, OpenMP within a node).
- Or: send the matrix and the primes to each node, and use CRA-OMP. This is a new algorithm though, which will be time consuming to write.
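To make the CRA-OMP idea concrete, here is a minimal sketch of Chinese-remainder recombination in plain C++. All names (`modinv`, `crt`) and the tiny primes are illustrative, not the LinBox API; in the scheme above, the per-prime solves are the independent part that could run under an OpenMP parallel loop, while this incremental recombination is the sequential tail.

```cpp
#include <cassert>
#include <vector>

// Modular inverse of a modulo p (assumes gcd(a, p) == 1),
// via the extended Euclidean algorithm.
long long modinv(long long a, long long p) {
    long long old_r = a % p, r = p;
    long long old_s = 1, s = 0;
    while (r != 0) {
        long long q = old_r / r;
        long long tmp = old_r - q * r; old_r = r; r = tmp;
        tmp = old_s - q * s; old_s = s; s = tmp;
    }
    return ((old_s % p) + p) % p;
}

// Incremental CRT: combine residues res[i] = x mod p[i] into x mod prod(p).
// Computing the residues themselves (the solves mod p) is embarrassingly
// parallel; only this recombination loop is inherently sequential.
long long crt(const std::vector<long long>& res, const std::vector<long long>& p) {
    long long x = res[0] % p[0];
    long long m = p[0];
    for (std::size_t i = 1; i < res.size(); ++i) {
        // Solve x + m*t == res[i] (mod p[i]) for t.
        long long t = (((res[i] - x) % p[i] + p[i]) % p[i])
                      * modinv(m % p[i], p[i]) % p[i];
        x += m * t;
        m *= p[i];
    }
    return x;
}
```

The primes here are deliberately tiny to avoid overflow in `long long`; a real implementation would work with word-size primes and multiprecision accumulation.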
- Communication time seems negligible.
- It would be nice to time each step precisely and check the task time balance, using timestamps for instance.
- Of utmost importance: get timings on how long each task takes, and check whether it scales properly.
Run, for instance, a sequential run, a parallel run on one node, and a parallel run on 2 nodes, time each step accurately, and see how it balances and scales. It is necessary to make sure that the CRT and RR steps are not taking too long: they cannot yet be done in parallel, so they are a potential huge bottleneck for scaling. Ideally, the reduction mod p and the solve mod p should be the tasks that take the longest. Use small instances (200 by 200, for instance). For timers, `user_timer` gives the total time, while `real_timer` gives the actual time, waiting/halting times included. Using inputs with a large number of bits would be nice, but would have a huge impact on the mod p step.
- Run benchmarks on instances as large as possible, with the algorithm in full mode, using all 32 cores per node.
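A minimal sketch of the per-step timestamping suggested above, using standard `std::chrono` wall-clock timestamps. The `StepTimer` struct and phase names are hypothetical stand-ins; in practice LinBox's own `user_timer`/`real_timer` would be wrapped around each phase.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Illustrative per-step timer: records the elapsed wall-clock time of each
// named phase, so the balance between phases can be compared across runs.
struct StepTimer {
    using clock = std::chrono::steady_clock;
    std::vector<std::pair<std::string, double>> steps;  // (phase, seconds)
    clock::time_point last = clock::now();

    void mark(const std::string& name) {
        auto now = clock::now();
        steps.emplace_back(name, std::chrono::duration<double>(now - last).count());
        last = now;
    }
};

// Run the pipeline phases (dummy workloads here) and return per-phase timings.
std::vector<std::pair<std::string, double>> time_phases() {
    StepTimer timer;
    volatile double sink = 0;  // stand-in workload for each real phase
    for (const char* phase : {"reduction mod p", "solve mod p", "CRT", "RR"}) {
        for (int i = 0; i < 100000; ++i) sink += i;
        timer.mark(phase);
    }
    return timer.steps;
}
```

Comparing the returned durations between a sequential run, a 1-node parallel run, and a 2-node run would show directly whether the CRT and RR phases stay small as the parallel phases shrink.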
- Zhu: timings
- Alexis: scripts for plotting
- Automate benchmarks to measure precisely each step of the algorithm in full mode, on small instances (100 by 100). Plot number of nodes vs. time.
- Plot the sequential Dixon solver.
- Time LinBox vs. IML (authored by Storjohann, no longer maintained, used in Sage for integer solving).