WeeklyTelcon_20160119

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Howard
  • Joshua Hursey
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
  • Need to verify that library versions are still correct.
  • Cisco Weekend MTT tests didn't look good.
    • Build failure also.
    • usNIC unable to connect. Maybe a cluster issue.
    • Autogen --force didn't bring to 1.10, should remove from Cisco MTT.
    • Ralph will try to replicate MPI_Abort. Abort test itself.
    • 1.10 C strided mutex lock issue. Nathan would not be surprised if it is a bug. One failure, in a specific build configuration.
      • The memchecker-enabled build could be affecting timing. Nathan will take a look... should be simple.
    • Jeff will look at MTT things after call.
    • High CPU utilization on Async progress thread. Ralph will take a look. From -GE.
  • Once all of these issues are resolved or addressed, 1.10.2 can ship.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    • Status of Nathan's progression decay function?
    • Did Mellanox's UCX Modex stuff get merged in?
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Last week discussed OMPIO + Lustre being slow on the 2.0.0 (and master) branches. Discussed making ROMIO the default for OMPI on Lustre (only).
    • Last week discussed Group Comms weren't working for Comms of powers of 2. Nathan found massive memory issue.
    • Pull Requests - Several that Jeff, Ralph, or Howard need to review.
    • PR 896 - not going to help us avoid the Lustre issue. Reduce priority of Lustre below ROMIO.
      • Edgar Tested on Cray.
    • PRs 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff will merge them.
    • Travis is now being run on 2.0 branch.
    • Issue 1299 - hang - want to get that into 2.0.0 - Nathan can you look at?
    • Issue 1301 - check max CQ size before creating CQ. Joshua Ladd will assign it to someone.
    • Should start marking these as 2.0.0 blockers.
    • Issue 1252 - Performance - Nathan going to write a decay function for progression. Will create a Pull Request and Geoff Paulsen will test. Last big one, and kind of important.
    • HWThreads - Ralph has no interest in going backwards to support physical CPUs. A real mess of switching if it's physical or virtual.
      • What is the desire? Recent OS and BIOS seem to get it right. AMD and Intel seem to be different, and seems to come up. Generated a TON of confusion among users.
      • Perhaps Mike has a use case that really demands it. Ralph will talk with him.
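
The decay function for progression mentioned under Issue 1252 is not described in these notes. As a rough sketch only, assuming the idea is to poll an idle progress callback less and less often until it reports events again, one possible shape is below. The names `decay_state_t`, `decay_init`, `decay_should_poll`, and `decay_update` are hypothetical and are not Open MPI APIs; the doubling policy is an assumption, not the approach taken in the eventual Pull Request.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-callback state (not an Open MPI structure). */
typedef struct {
    uint32_t interval;      /* poll once every `interval` progress calls */
    uint32_t max_interval;  /* upper bound on the polling interval */
    uint32_t countdown;     /* progress calls left until the next poll */
} decay_state_t;

/* Start by polling on every progress call. */
static void decay_init(decay_state_t *s, uint32_t max_interval)
{
    assert(max_interval >= 1);
    s->interval = 1;
    s->max_interval = max_interval;
    s->countdown = 1;
}

/* Returns 1 if the callback should be polled on this progress call. */
static int decay_should_poll(decay_state_t *s)
{
    if (s->countdown > 1) {
        s->countdown--;
        return 0;
    }
    return 1;
}

/* Report the result of a poll: reset on activity, back off when idle. */
static void decay_update(decay_state_t *s, int events)
{
    if (events > 0) {
        s->interval = 1;
    } else if (s->interval < s->max_interval) {
        /* Double the interval, clamped to max_interval without overflow. */
        s->interval = (s->interval > s->max_interval / 2)
                          ? s->max_interval
                          : s->interval * 2;
    }
    s->countdown = s->interval;
}
```

In this sketch a progress loop would call `decay_should_poll` on each pass and, whenever it returns 1, invoke the callback and pass the number of completed events to `decay_update`. An idle callback is then polled every 1, 2, 4, ... passes up to `max_interval`, and returns to every pass as soon as it reports an event.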

Review Master?

  • Edgar's PR into master (tries to work around the Lustre issue by switching over to ROMIO).
    • Not sure whether the issues he's seeing are on Cray or on his cluster. Could be related, but need to get the cluster running again.
    • Wanted to see if any warnings from jenkins.
    • But running that portion of code on Edgar's cluster, hits many issues.
    • BTL flags = 305 perf got horrible (used to get better).
    • Did something else change in configure? Hitting one issue after another, independent of OMPIO.
    • OMPIO is not finding PFS2 correctly during configure. Jeff can use screen share with Edgar.
    • Issues only show up with 96 procs, which makes debugging more difficult.

MTT status:

  • Cisco - having some timeouts.

Status Updates:

  • LANL - Nathan - Not much, just trying to find the cause of the progress slowdown. Continuing to iterate on RDMA stuff to look for any remaining bugs.
    • Howard - reviewing PR on 2.0.0. Backlog of things for Edgar.
  • Houston - New component he's developed over the last few weeks. Now competitive on Cray, but too late for 2.0.0, s dynamic gen 2 - a number of new features unimplemented, but room to grow.
  • HLRS - no update.
  • IBM - Hired Joshua Hursey.
    • Deciding internally whether to use GitHub Enterprise or a GitLab-based approach.
    • Working with David Solt on first PR, getting process setup for other developers.
    • Working on writing up RFC proposals.

Status Update Rotation

  1. LANL, Houston, HLRS, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel
