Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Difference in CLI variable expansion single-node vs multi-node #3967

Closed
BramVanroy opened this issue Jul 15, 2023 · 4 comments
Closed
Assignees
Labels
bug Something isn't working training

Comments

@BramVanroy
Copy link
Contributor

BramVanroy commented Jul 15, 2023

I am launching my training jobs through a shell script which in turn submits a PBS job to our cluster. This allows me to set some extra parameters and pass accept extra arguments to the shell script, e.g. sh myscrtipt.sh -e "--arg1 hello --arg2", which I can then pass to the pbs script as an environment variable. The pbs script then interpolates it to start a Python command. The command under scrutinty in the pbs file looks like this:

deepspeed --hostfile=$OUTPUT_DIR/hostfile train.py --deepspeed ds_config_zero2.json ${EXTRA_ARGS}

where ${EXTRA_ARGS} is, as per the example, the string --arg1 hello --arg2.

The shell should expand this correctly into different chunks that Python can parse correctly. And this works but only if I use more than one node. Interestingly, when I use just a single node, my EXTRA_ARGS are not correctly interpolated - instead they are seen as a single string `"--arg1 hello --arg2". So it seems that under-the-hood deepspeed is passing the CLI arguments to different nodes differently compared to when there is just one node. But I have not figured out where that distinction is made in the code.

Potentially related: #3961

@mrwyattii
Copy link
Contributor

@BramVanroy could you please try with #4007?

I believe the issue is that with multi-node, we are doing the following:

return list(map(lambda x: x if x.startswith("-") else f"'{x}'", self.args.user_args))

And for single-node, we are doing this:

cmd = deepspeed_launch + [args.user_script] + args.user_args

The PR I linked to above match the multi-node behavior on single-node.

@mrwyattii
Copy link
Contributor

On closer inspection from the unit tests failures with #4007, my original fix was wrong. I've added the proper fix to #4007 now.

The problem is actually with python/argparse. We can replicate the behavior outside of DeepSpeed:

# example.py
import sys
import argparse
print(sys.argv)

parser = argparse.ArgumentParser()
parser.add_argument("--arg1", type=str)
parser.add_argument("--arg2", action="store_true")
args = parser.parse_args()
print(args)

Then if we run the following, we can see that the EXTRA_ARGS are parsed as a single string:

.venv ❯ export EXTRA_ARGS="--arg1 hello --arg2"
.venv ❯ python3 example.py ${EXTRA_ARGS}
['example.py', '--arg1 hello --arg2']
usage: example.py [-h] [--arg1 ARG1] [--arg2]
example.py: error: unrecognized arguments: --arg1 hello --arg2

My best guess for why you are seeing different behavior with the multi-node setup is that the multi-node launcher is parsing this string before passing the args to python.

The fix in #4007 should now allow passing arguments via a bash string for both single-node and multi-node.

@mrwyattii
Copy link
Contributor

Tested the fix in #4007 and it's working. Closing this issue, but please re-open if you find the problem is still not fixed. Thanks!

@mrwyattii
Copy link
Contributor

A small update on this issue:

After #4769 merged, if you want to pass args in a bash string, you will need to do the following:

echo ${EXTRA_ARGS}|xargs deepspeed --hostfile=$OUTPUT_DIR/hostfile train.py --deepspeed ds_config_zero2.json

mrwyattii added a commit that referenced this issue Dec 15, 2023
Splitting work from #4769 because we are still debugging transformers
integration issues.

Parsing was broken for user arguments (see #4795). Additionally, parsing
of user arguments is tricky and there are lots of edge cases. For
example: #4660, #4716, #3967. I've attempted to accommodate all of the
possible types of string inputs and added unit tests.
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024
Splitting work from microsoft#4769 because we are still debugging transformers
integration issues.

Parsing was broken for user arguments (see microsoft#4795). Additionally, parsing
of user arguments is tricky and there are lots of edge cases. For
example: microsoft#4660, microsoft#4716, microsoft#3967. I've attempted to accommodate all of the
possible types of string inputs and added unit tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working training
Projects
None yet
Development

No branches or pull requests

2 participants