[ODK] Meeting 2018 10 17

Minutes for the October 17, 2018 meeting

1. Benchmarks

  • For a 4k by 4k matrix on the Dahu cluster, 2 nodes vs. 4 nodes (32 cores/node): run times are 554 s vs. 375 s (speedup factor 1.48). These are single runs, not averages.
  • On average, using 5 nodes of Dahu (150 cores, out of a maximum of 160), a 4k by 4k matrix takes 404 s with early termination.
  • On average, using 1 node of Luke42 (40 cores), a 4k by 4k matrix takes 895 s with early termination.
  • On average, using 1 node of Luke42 (40 cores), a 4k by 4k matrix takes 978 s without early termination.
  • Practical issue: for a 10k by 10k matrix, there is not enough RAM to store both matrices (the GMP one and the mod p one) on each node.
  • For the solve computation only, in GFlops: 2.47 for the 4-node, 128-core run on the 4k by 4k matrix. Not too shabby, but not great either.
  • For the solve computation only, in GFlops: 3.34 for the 2-node, 64-core run on the 4k by 4k matrix. Noticeably better.
| Matrix order | Cluster | Nb nodes | Nb cores | Early termination? | Run time (s) | GFlops (solve only) |
|---|---|---|---|---|---|---|
| 4000 | Dahu | 2 | 64 | No | 554 | 3.34 |
| 4000 | Dahu | 4 | 128 | No | 375 | 2.47 |
| 4000 | Dahu | 5 | 150 | Yes | 404 | |
| 4000 | Dahu | 2 | 60 | Yes | 643 | |
| 4000 | Luke42 | 1 | 40 | Yes | 895 | |
| 4000 | Luke42 | 1 | 40 | No | 978 | |
| 1000 | Luke42 | 1 | 40 | Yes | 9.14 | |
| 1000 | Luke42 | 1 | 40 | Yes | 7.83 | |
  • Cost: $\frac{2}{3}n^3 \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\|A\|_{\infty})\right)}{23} + n \cdot \frac{\frac{n}{2}\left(\log(n) + 2\log(\|A\|_{\infty})\right)}{23}$
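
For a quick sanity check of this estimate, the formula above transcribes directly into a small helper. Reading the logs as base 2 and the constant 23 as the bit size of the primes is our assumption; it is not stated explicitly in the minutes.

```cpp
#include <cmath>
#include <cstdio>

// Number of word-size primes, read directly off the cost formula:
// (n/2) * (log2(n) + 2*log2(|A|_inf)) / 23, assuming 23-bit primes.
double nb_primes(double n, double log2_normA) {
    return (n / 2.0) * (std::log2(n) + 2.0 * log2_normA) / 23.0;
}

// Total cost from the formula: one (2/3) n^3 solve mod p per prime,
// plus O(n) reconstruction work per prime.
double estimated_cost(double n, double log2_normA) {
    const double k = nb_primes(n, log2_normA);
    return (2.0 / 3.0) * n * n * n * k + n * k;
}

int main() {
    // Example: a 4000 x 4000 matrix with ~10-bit entries (hypothetical).
    std::printf("primes: %.0f, flops: %.3e\n",
                nb_primes(4000, 10), estimated_cost(4000, 10));
    return 0;
}
```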

2. Memory issues

  • Use the OpenMP code -> hybrid MPI/OpenMP mode (a minimal sketch follows this list).
  • Or: send the matrix and the primes to each node, and use CRA-OMP. This is a new algorithm though, and will be time-consuming to write.
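
A minimal sketch of the hybrid mode, assuming one MPI rank per node and a hypothetical solve_mod_p stub standing in for the actual LinBox mod-p solve: the matrix is broadcast once per node, and OpenMP spreads that node's share of the primes over its cores, so each node keeps a single copy of the matrices.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// Hypothetical placeholder: a real implementation would reduce A mod p
// and call the LinBox dense solver.
std::vector<double> solve_mod_p(const std::vector<std::int64_t>& A,
                                const std::vector<std::int64_t>& b,
                                std::int64_t p) {
    return {};
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 200;                         // small instance, as in section 4
    std::vector<std::int64_t> A(n * n), b(n);  // one copy per node, not per core
    std::vector<std::int64_t> primes(64, 0);   // would hold distinct ~23-bit primes

    // Rank 0 owns the input; every node receives it exactly once.
    MPI_Bcast(A.data(), n * n, MPI_INT64_T, 0, MPI_COMM_WORLD);
    MPI_Bcast(b.data(), n, MPI_INT64_T, 0, MPI_COMM_WORLD);

    // Each rank takes primes[rank], primes[rank+size], ..., and OpenMP
    // spreads that share over the node's cores.
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = rank; i < primes.size(); i += size) {
        std::vector<double> x = solve_mod_p(A, b, primes[i]);
        (void)x;  // residues would be gathered on rank 0 for CRT (omitted)
    }

    MPI_Finalize();
    return 0;
}
```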

3. What about small instances?

  • Communication time seems negligible.
  • It would be nice to time each step precisely and check the task time balance, for instance using timestamps (see the sketch after this list).
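
One simple way to get such timestamps, sketched here with MPI_Wtime (the step names are illustrative; only the logging pattern is the point): each rank logs a start/end line per step, and comparing the logs across ranks shows the task balance.

```cpp
#include <mpi.h>
#include <cstdio>

// One log line per (rank, step): diffing these across ranks shows
// whether the work is balanced.
void log_step(const char* step, int rank, double t0, double t1) {
    std::printf("%d %s %.6f %.6f\n", rank, step, t0, t1);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    // ... reduction mod p would run here ...
    double t1 = MPI_Wtime();
    log_step("mod_p_reduction", rank, t0, t1);

    t0 = MPI_Wtime();
    // ... solve mod p would run here ...
    t1 = MPI_Wtime();
    log_step("solve_mod_p", rank, t0, t1);

    MPI_Finalize();
    return 0;
}
```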

4. For the next week

  • Of utmost importance: get timings of how long each task takes and check whether it scales properly. For instance, do a sequential run, a parallel run on one node, and a parallel run on 2 nodes; time each step accurately and see how the work balances and scales.
  • It is necessary to make sure that the CRT and RR steps are not taking too long: they cannot yet be done in parallel, so they are a potential huge bottleneck for scaling. Ideally, reduction mod p and solve mod p should be the tasks that take the longest.
  • Use small instances (200 by 200, for instance).
  • About timers: user_timer gives the total CPU time, while real_timer gives the elapsed wall-clock time, waiting/halting times included (see the sketch after this list).
  • Using inputs with a large number of bits would be nice, but that should have a huge impact on the mod p step.
  • Run benchmarks on instances as large as possible, with the algorithm in full mode, using all 32 cores per node.
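
The two timer flavours can be cross-checked with standard system calls, independently of the library's own timers. A minimal sketch, using getrusage for the user (CPU) time and std::chrono for the real (wall-clock) time:

```cpp
#include <sys/resource.h>
#include <chrono>
#include <cstdio>

// User time: CPU time actually spent computing in this process.
static double user_seconds() {
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6;
}

int main() {
    const double u0 = user_seconds();
    const auto r0 = std::chrono::steady_clock::now();

    // ... the step to measure (e.g. CRT reconstruction) would run here ...

    const double user = user_seconds() - u0;
    const std::chrono::duration<double> real =
        std::chrono::steady_clock::now() - r0;
    // Real time much larger than user time means the step spent its time
    // waiting (communication, synchronization) rather than computing.
    std::printf("user %.3fs real %.3fs\n", user, real.count());
    return 0;
}
```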

5. Work repartition

  • Zhu: timings
  • Alexis: scripts for plotting

6. tl;dr

  • Automate the benchmarks to measure precisely each step of the algorithm in full mode, on small instances (100 by 100). Plot number of nodes vs. time.
  • Plot the sequential Dixon solver as well.
  • Time LinBox vs. IML (authored by Storjohann, no longer maintained, used in Sage for integer solving); a shared harness is sketched below.
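
For the LinBox vs. IML timing, one wall-clock harness could be shared between the two backends. The two lambdas below are hypothetical hooks, not the libraries' actual entry points; each would wrap the corresponding integer solve call.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Time one solver invocation in wall-clock seconds.
double time_solver(const std::function<void()>& solve) {
    const auto t0 = std::chrono::steady_clock::now();
    solve();
    const std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    auto linbox_solve = [] { /* call the LinBox Dixon solver here */ };
    auto iml_solve    = [] { /* call the IML nonsingular solver here */ };
    std::printf("LinBox: %.3fs\nIML: %.3fs\n",
                time_solver(linbox_solve), time_solver(iml_solve));
    return 0;
}
```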