[BUG] Difference in CLI variable expansion single-node vs multi-node #3967

BramVanroy · 2023-07-15T15:01:13Z

I am launching my training jobs through a shell script, which in turn submits a PBS job to our cluster. This allows me to set some extra parameters and pass extra arguments to the shell script, e.g. sh myscript.sh -e "--arg1 hello --arg2", which I can then pass to the PBS script as an environment variable. The PBS script then interpolates it to start a Python command. The command under scrutiny in the PBS file looks like this:

deepspeed --hostfile=$OUTPUT_DIR/hostfile train.py --deepspeed ds_config_zero2.json ${EXTRA_ARGS}

where ${EXTRA_ARGS} is, as per the example, the string --arg1 hello --arg2.

The shell should expand this into separate tokens that Python can parse correctly. This works, but only if I use more than one node. Interestingly, when I use just a single node, my EXTRA_ARGS are not split correctly; instead they are seen as a single string, "--arg1 hello --arg2". So it seems that, under the hood, deepspeed passes the CLI arguments differently in the multi-node case than in the single-node case. But I have not figured out where that distinction is made in the code.

Potentially related: #3961

mrwyattii · 2023-07-20T20:43:11Z

@BramVanroy could you please try with #4007?

I believe the issue is that with multi-node, we are doing the following:

DeepSpeed/deepspeed/launcher/multinode_runner.py

Line 64 in 0a0819b

return list(map(lambda x: x if x.startswith("-") else f"'{x}'", self.args.user_args))

And for single-node, we are doing this:

DeepSpeed/deepspeed/launcher/runner.py

Line 503 in 0a0819b

cmd = deepspeed_launch + [args.user_script] + args.user_args

The PR I linked to above makes single-node match the multi-node behavior.

mrwyattii · 2023-07-20T22:56:08Z

On closer inspection of the unit test failures with #4007, my original fix was wrong. I've added the proper fix to #4007 now.

The problem is actually with python/argparse. We can replicate the behavior outside of DeepSpeed:

# example.py
import sys
import argparse
print(sys.argv)

parser = argparse.ArgumentParser()
parser.add_argument("--arg1", type=str)
parser.add_argument("--arg2", action="store_true")
args = parser.parse_args()
print(args)

Then if we run the following, we can see that the EXTRA_ARGS are parsed as a single string:

.venv ❯ export EXTRA_ARGS="--arg1 hello --arg2"
.venv ❯ python3 example.py ${EXTRA_ARGS}
['example.py', '--arg1 hello --arg2']
usage: example.py [-h] [--arg1 ARG1] [--arg2]
example.py: error: unrecognized arguments: --arg1 hello --arg2
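
The same behavior can be reproduced without a shell by handing argparse the argument list directly. This is a minimal sketch that reuses the parser from example.py; using parse_known_args (so the script does not exit on the error) and shlex.split (to mimic shell word splitting) are choices made for this illustration, not something DeepSpeed is confirmed to do.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument("--arg1", type=str)
parser.add_argument("--arg2", action="store_true")

extra_args = "--arg1 hello --arg2"

# One argv element containing spaces: argparse does not split it, so
# nothing is matched and the whole string is reported as unrecognized.
ns_single, unknown_single = parser.parse_known_args([extra_args])

# Three argv elements, as produced by shell-style word splitting.
ns_split, unknown_split = parser.parse_known_args(shlex.split(extra_args))

print(ns_single, unknown_single)
print(ns_split, unknown_split)
```

Only the second call fills in arg1 and arg2, so whether the string reaches Python as one element or three decides the outcome.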

My best guess for why you are seeing different behavior with the multi-node setup is that the multi-node launcher is parsing this string before passing the args to python.
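
One way such a difference could arise is sketched below. It assumes (this is not confirmed in the thread) that the multi-node path joins the command into one string that a remote shell splits again, while the single-node path passes the argument list to the process unchanged. The quoting expression is the one quoted from multinode_runner.py earlier in the thread; user_args and the join/split steps are stand-ins for illustration only.

```python
import shlex

# What the launcher would receive if the shell did not split EXTRA_ARGS.
user_args = ["--arg1 hello --arg2"]

# Quoting step quoted from multinode_runner.py: elements starting with "-"
# are left untouched, everything else is wrapped in single quotes.
quoted = list(map(lambda x: x if x.startswith("-") else f"'{x}'", user_args))

# Assumption: the multi-node command is joined into a single string and
# re-parsed by a shell on the remote host (modeled here with shlex.split).
multi_node_argv = shlex.split(" ".join(["train.py"] + quoted))

# Assumption: the single-node command is passed on as a list, unchanged.
single_node_argv = ["train.py"] + user_args

print(multi_node_argv)
print(single_node_argv)
```

Under these assumptions the multi-node argv ends up with three separate user arguments, while the single-node argv keeps them as one string.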

The fix in #4007 should now allow passing arguments via a bash string for both single-node and multi-node.

mrwyattii · 2023-07-21T16:03:36Z

Tested the fix in #4007 and it's working. Closing this issue, but please re-open if you find the problem is still not fixed. Thanks!

mrwyattii · 2023-12-14T22:11:41Z

A small update on this issue:

After #4769 merged, if you want to pass args in a bash string, you will need to do the following:

echo ${EXTRA_ARGS}|xargs deepspeed --hostfile=$OUTPUT_DIR/hostfile train.py --deepspeed ds_config_zero2.json

Splitting work from #4769 because we are still debugging transformers integration issues. Parsing was broken for user arguments (see #4795). Additionally, parsing of user arguments is tricky and there are lots of edge cases. For example: #4660, #4716, #3967. I've attempted to accommodate all of the possible types of string inputs and added unit tests.

BramVanroy added the bug and training labels Jul 15, 2023

mrwyattii self-assigned this Jul 20, 2023

mrwyattii mentioned this issue Jul 20, 2023

Fix user arg parsing in single node deployment #4007

Merged

mrwyattii closed this as completed Jul 21, 2023

YudiZh mentioned this issue Sep 5, 2023

fix user args parsing of string with spaces on runner #4265

Merged

mrwyattii mentioned this issue Dec 14, 2023

Fix for HF integrations CI #4769

Open

mrwyattii mentioned this issue Dec 15, 2023

Refactor launcher user arg parsing #4824

Merged
