Containers: Fix mpirun issue where it cannot contact the workers
There is an intermittent issue where mpirun cannot contact the workers even though nslookup can successfully resolve their DNS hostnames in the Init Container. It occurs infrequently, but often enough to matter. When it happens, the user containers restart (if restartLimit > 0), and the second attempt always seems to succeed.

This change appears to solve the issue by having the Init Container use mpirun to contact the workers and simply retrieve their hostnames. This replaces the use of nslookup and ensures that mpirun can succeed on the launcher. To support this, the Init Container must run as the given UID/GID rather than as root.

It also speeds up container start times, since only one Init Container is needed for all of the workers rather than one Init Container per worker.

I have not been able to reproduce the original error using int-test, which would (in)frequently catch this.

Signed-off-by: Blake Devcich <[email protected]>
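The readiness check described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this commit: the helper names (`build_mpirun_cmd`, `run_cmd`, `wait_for_workers`), the retry count, and the delay are hypothetical, and the `--host` and `-np` options assume Open MPI's `mpirun`. The idea is to probe the workers with the same tool the launcher will use, retrying until `mpirun` can run `hostname` on every worker.

```python
import subprocess
import time
from typing import Callable, List, Sequence, Tuple


def build_mpirun_cmd(workers: Sequence[str]) -> List[str]:
    """Build an mpirun command that runs `hostname` once per worker.

    Assumes Open MPI's mpirun: --host takes a comma-separated host list
    and -np sets the number of processes to launch.
    """
    return ["mpirun", "--host", ",".join(workers),
            "-np", str(len(workers)), "hostname"]


def run_cmd(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, stdout)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout


def wait_for_workers(
    workers: Sequence[str],
    runner: Callable[[Sequence[str]], Tuple[int, str]] = run_cmd,
    attempts: int = 10,          # hypothetical retry budget
    delay: float = 1.0,          # hypothetical pause between attempts, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Retry the mpirun probe until it succeeds, then return the hostnames."""
    cmd = build_mpirun_cmd(workers)
    for attempt in range(1, attempts + 1):
        returncode, stdout = runner(cmd)
        if returncode == 0:
            # mpirun reached every worker; each one printed its hostname.
            return stdout.split()
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"mpirun could not contact workers after {attempts} attempts"
    )
```

Because the probe uses `mpirun` itself rather than a DNS lookup, a successful return says more about whether the launcher's later `mpirun` call can reach the workers than an `nslookup` result does. The `runner` and `sleep` parameters exist only so the retry logic can be exercised without a real MPI installation.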
Showing 1 changed file with 41 additions and 25 deletions.