-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Difference in CLI variable expansion single-node vs multi-node #3967
Comments
@BramVanroy could you please try with #4007? I believe the issue is that with multi-node, we are doing the following:
And for single-node, we are doing this: DeepSpeed/deepspeed/launcher/runner.py Line 503 in 0a0819b
The PR I linked to above match the multi-node behavior on single-node. |
On closer inspection from the unit tests failures with #4007, my original fix was wrong. I've added the proper fix to #4007 now. The problem is actually with python/argparse. We can replicate the behavior outside of DeepSpeed: # example.py
import sys
import argparse
print(sys.argv)
parser = argparse.ArgumentParser()
parser.add_argument("--arg1", type=str)
parser.add_argument("--arg2", action="store_true")
args = parser.parse_args()
print(args) Then if we run the following, we can see that the .venv ❯ export EXTRA_ARGS="--arg1 hello --arg2"
.venv ❯ python3 example.py ${EXTRA_ARGS}
['example.py', '--arg1 hello --arg2']
usage: example.py [-h] [--arg1 ARG1] [--arg2]
example.py: error: unrecognized arguments: --arg1 hello --arg2 My best guess for why you are seeing different behavior with the multi-node setup is that the multi-node launcher is parsing this string before passing the args to python. The fix in #4007 should now allow passing arguments via a bash string for both single-node and multi-node. |
Tested the fix in #4007 and it's working. Closing this issue, but please re-open if you find the problem is still not fixed. Thanks! |
A small update on this issue: After #4769 merged, if you want to pass args in a bash string, you will need to do the following: echo ${EXTRA_ARGS}|xargs deepspeed --hostfile=$OUTPUT_DIR/hostfile train.py --deepspeed ds_config_zero2.json |
Splitting work from #4769 because we are still debugging transformers integration issues. Parsing was broken for user arguments (see #4795). Additionally, parsing of user arguments is tricky and there are lots of edge cases. For example: #4660, #4716, #3967. I've attempted to accommodate all of the possible types of string inputs and added unit tests.
Splitting work from microsoft#4769 because we are still debugging transformers integration issues. Parsing was broken for user arguments (see microsoft#4795). Additionally, parsing of user arguments is tricky and there are lots of edge cases. For example: microsoft#4660, microsoft#4716, microsoft#3967. I've attempted to accommodate all of the possible types of string inputs and added unit tests.
I am launching my training jobs through a shell script which in turn submits a PBS job to our cluster. This allows me to set some extra parameters and pass accept extra arguments to the shell script, e.g.
sh myscrtipt.sh -e "--arg1 hello --arg2
", which I can then pass to the pbs script as an environment variable. The pbs script then interpolates it to start a Python command. The command under scrutinty in the pbs file looks like this:where
${EXTRA_ARGS}
is, as per the example, the string--arg1 hello --arg2
.The shell should expand this correctly into different chunks that Python can parse correctly. And this works but only if I use more than one node. Interestingly, when I use just a single node, my EXTRA_ARGS are not correctly interpolated - instead they are seen as a single string `"--arg1 hello --arg2". So it seems that under-the-hood deepspeed is passing the CLI arguments to different nodes differently compared to when there is just one node. But I have not figured out where that distinction is made in the code.
Potentially related: #3961
The text was updated successfully, but these errors were encountered: