Which versions of Pyxis, Slurm and enroot for running NeMo-Megatron-Launcher on one Node with 8 * A100? #73

Open
starlitsky2010 opened this issue Jun 8, 2023 · 2 comments

Comments

@starlitsky2010

Environment:

Pyxis: v0.14.0
Slurm: 19.05.5
enroot: enroot+caps_3.4.1

How I got the versions:

# cd pyxis
# git log
The log shows the tag:
commit ea7bb88a4f31f3535334f92cbcc1324d60b113d8 (HEAD -> master, tag: v0.14.0)

# srun -V
slurm-wlm 19.05.5

Error log:

launcher_scripts# cat results/download_gpt3_pile/download/log-nemo-megatron-download_gpt3_pile_23_0.err

The error I hit:
srun: unrecognized option '--container-image'
srun: unrecognized option '--container-image'
Try "srun --help" for more information

Thanks
Aaron

@roclark
Member

roclark commented Jun 8, 2023

Hey @starlitsky2010! Are Pyxis and enroot installed on all nodes in the cluster, and at the same version on each? Your Slurm version is a bit older than what we've tested previously, so an update may help if that's practical. The oldest versions we've documented with NeMo Framework on Slurm that I'm aware of are the following:

  • Slurm: 20.11.7
  • Pyxis: 0.9.1

So your Pyxis version should be fine, but Slurm could potentially be updated, though I can't say definitively that this is the problem at the moment.

Was Pyxis/enroot installed recently? Have the Slurm daemons been restarted?
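
The comparison against the oldest documented Slurm version can be sketched as a small shell check. This is only a sketch under stated assumptions: `version_ge` and `MIN_SLURM` are hypothetical names introduced here, the comparison relies on GNU `sort -V`, and the version is taken to be the last field of the `srun -V` output (as in the `slurm-wlm 19.05.5` line quoted earlier in the thread).

```shell
#!/usr/bin/env bash

# version_ge A B: succeeds when dotted version A is >= B.
# Hypothetical helper; relies on GNU `sort -V` for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Oldest Slurm version mentioned in this thread as documented with NeMo Framework.
MIN_SLURM="20.11.7"

# Only query Slurm when srun is actually on PATH.
if command -v srun >/dev/null 2>&1; then
    # Assumption: the version is the last whitespace-separated field of `srun -V`.
    installed="$(srun -V | awk '{print $NF}')"
    if version_ge "$installed" "$MIN_SLURM"; then
        echo "Slurm $installed is at least $MIN_SLURM"
    else
        echo "Slurm $installed is older than $MIN_SLURM"
    fi
fi
```

Running this on each node would also help confirm that every node reports the same Slurm version.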

@starlitsky2010
Author

starlitsky2010 commented Jun 9, 2023

Hi @roclark ,

Pyxis and enroot are installed on all nodes. The Slurm version (19.05.5) is probably too old and not compatible with the latest Pyxis release.

I've tested Pyxis v0.7.0. With that version, srun --help shows the container-related options.
On Ubuntu 20.04, the command below installs slurm-wlm 19.05.5 by default:
sudo apt install slurmd slurmctld -y
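
The `srun --help` check mentioned above can be scripted roughly as follows. This is a minimal sketch: `has_container_options` is a hypothetical helper name, and it only looks for the `--container-image` option that the failing job used, so it says nothing about whether enroot itself works.

```shell
#!/usr/bin/env bash

# has_container_options: reads help text on stdin and succeeds
# if it mentions --container-image (hypothetical helper).
has_container_options() {
    grep -q -e '--container-image'
}

# Only run the check when srun is actually on PATH.
if command -v srun >/dev/null 2>&1; then
    if srun --help 2>&1 | has_container_options; then
        echo "srun lists --container-image: the Pyxis plugin appears to be loaded"
    else
        echo "srun does not list --container-image: the Pyxis plugin is not loaded by this Slurm"
    fi
fi
```

If the option is missing, the job fails with the same "unrecognized option" error shown in the log above.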

Is there an Ubuntu version you would recommend?
How did you install Slurm? Could you share some links about it?

I'll try the following versions later:
Slurm: 20.11.7
Pyxis: 0.9.1

Thanks
Aaron
