-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide a warning about Slurm-related envar #2077
Open
rhc54
wants to merge
1
commit into
openpmix:master
Choose a base branch
from
rhc54:topic/warn
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our discussion on this PR, how about this for the message? (this was typed directly in github; please word wrap as appropriate)
Note that this is slightly different
%s
expansion than the current message.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of issues here. First, the value we can see is the final envar value - i.e., if Slurm is going to overwrite it, they already have done so. We have no idea if the user attempted to set an envar for that param, nor what that value might have been.
Second, we still need to address the problem this PR creates - it will block any execution under Slurm 23.11 and beyond. I do think that (if we are going to do this) we need some flag the user can set to indicate that we should ignore the envar value (i.e., unsetenv it) and continue executing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, gotcha -- no, I don't think we want that. I thought we talked about adding (abstraction-breaking, but probably still necessary) code in the tools codes directly -- before
prte_init()
and friends...?That's why I was thinking we could show the value and have good confidence that it's the user's value -- not the SLURM-set value.
Am I thinking incorrectly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm afraid you're off - Slurm sets their envar before invoking
mpirun
(they just throw it into every allocation, just in case it is needed), so we only see the end result of what they do. If the user provided an envar, Slurm will have already overwritten it before invokingmpirun
- so we can't know about it.I guess that's only true for batch runs, though, as otherwise the user would have the chance to overwrite it (probably without realizing they did). In that case, our
srun
launch would fail due to the changed cmd line. Very awkward situation. Probably should just focus on the non-interactive use-case as that is the one where the problem would appear (i.e., envars getting overwritten).If we detect an envar and error out, then there is no way the user can execute
mpirun
under Slurm 23.11 or above because Slurm will always have set their new envar. I guess we could detect and parse to see if other things beyond their addition are present - IIRC, they did release a fix to try and avoid overwriting the user's settings (not sure about that, but maybe). But that starts making this rather complicated.So it seems to me we have a few non-ideal choices:
Option 3 is beginning to sound like the best to me, but I'm open 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess I was thinking something different. Are you saying that this will happen:
(and similar if I
srun
orsbatch
a script that executesecho $PRTE_MCA_plm_slurm_args
)And if the above is correct, is the script below enough to defeat the SLURM silent env var writing?
shell$ cat myscript.sh export PRTE_MCA_plm_slurm_args=foo mpirun a.out shell$ sbatch -N 2 ./myscript.sh
Or does SLURM set it in every allocation and also set it after
mpirun
and beforeprte
(i.e., in the remote environment before it fork/exec'sprte
)? I.e., in all possible cases, SLURM sets this env variable, and it's just 100% not possible for the user to overwrite it.Yeah, that's clearly bad -- we don't want that.
Per your specific options:
Hmm. I guess I need to think about this more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this is true
Pretty sure this is not true, but I have no way of testing it.
I don't believe this is true, but I honestly am not sure. There fix was aimed at
mpirun
users, but IIUC the envar was set at time-of-allocation. I don't see any way to set it between thempirun
and fork/exec ofprterun
as Slurm isn't involved there. So it would seem like users could override it if they chose, so long as they realized they need to include the new cmd line option.The issue, in my mind, is that the envar is overriding what users put in their system and user-level default param files because the envar gets processed after those files. It isn't a question of someone setting an envar that doesn't work - it's a problem of putting values into default param files and unexplainingly finding them ignored. The user didn't put anything in the environment - their set values just "disappear" somehow.
Not sure what to do with it. Still a puzzlement on right path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha.
But even at the
mpirun
level (which we do have control of in Open MPI, obviously), how can we tell the difference between the user setting this env var and SLURM itself setting this env var?Oh, so I totally misunderstood this, then. Ok.
There is an MCA API that finds the source of an MCA param value -- but IIRC, it just checks for the value that will be used (which, if the value is in both the param file and the environment, it'll report that the value is in the environment).
Sigh. What a pickle. I'm honestly not sure what the right answer is here. The only way I can think to do this is to have a horrible abstraction break of putting a check for these specific variables when reading param files, and if we're in the SLURM job and the corresponding
getenv()
returns non-NULL, emit the warning (this is arguably quite a bit worse abstraction break than checking for these env vars inmain()
or the like).The goal is simple: warn users that their value is silently being ignored. How to detect that is the problem.☹️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the problem is that it is impossible to tell who set the envar.
Only thing I can think of to try is to check for the envar and then parse the value if it is found. We know what Slurm is setting - so if we see that value, we could silently ignore it. If we see any other value, then we probably can safely assume that the user provided it, and let the MCA system handle it as usual.
Would take a little checking to ensure we correctly identify the Slurm value. I can give that a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afraid I cannot find a new enough version of Slurm I can put in my container - and as reported elsewhere, I'm having zero success getting Slurm to run in the containers. I give up. Let's just silently ignore it?