Amazon Linux ec2-user cannot run interactive job #7

Open
griffji opened this issue Nov 17, 2021 · 4 comments
@griffji

griffji commented Nov 17, 2021

Hello,

I'm testing on Amazon Linux. The install succeeds and the cluster partially works, but the ec2-user account cannot run an interactive job. I can attach the slightly modified playbook for review if you like. In the meantime, can you provide some insight into the matter?

[root@head ~]# sacctmgr show user ec2-user accounts
      User   Def Acct     Admin
[root@head ~]# exit
logout
[ec2-user@head ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  1     unk*  dal2grid04
compute*  up    infinite  3     idle  dal2grid[01-03]
[ec2-user@head ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
[ec2-user@head ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  1     unk*  dal2grid04
compute*  up    infinite  3     idle  dal2grid[01-03]
[ec2-user@head ~]$ sinfo -V
slurm 20.11.8
[ec2-user@head ~]$
[Attached images: UbuntuSlurm, AmazonLinuxSlurm]

Thanks

JG

@pescobar
Member

Slurm users are automatically added to the Slurm accounting database on their first job submission by a Lua job submission plugin.

The Lua script should be deployed to /etc/slurm/job_submit.lua on the master host.

You should debug why the user ec2-user is not being added to slurmdb. It seems to be working as expected for the user ubuntu. Try grepping for "failed to lookup uid" in your Slurm logs, or extend the logging here, to track it down.
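As a starting point for that log search, here is a minimal sketch. The function name find_uid_failures is hypothetical, and the log path in the usage comment is an assumption about where slurmctld writes its log on this cluster (check the SlurmctldLogFile setting if yours differs).

```shell
# Hypothetical helper: print the lines of a Slurm log that mention a
# failed UID lookup, prefixed with their line numbers.
# $1 = path to the log file (assumed location, see usage below).
find_uid_failures() {
    # -i: ignore case, -n: prefix matches with line numbers.
    # "|| true" keeps "no match" from being reported as a failure.
    grep -in 'failed to lookup uid' "$1" || true
}

# Usage (assumed path; reading the log usually needs root):
#   find_uid_failures /var/log/slurm/slurmctld.log
```

An empty result only means that this particular message was not logged; other errors from the job_submit plugin would need a broader search.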

@pescobar
Copy link
Member

Have you verified whether the user is created after running sbatch --wrap="hostname" as ec2-user? You should then see ec2-user in the output of sacctmgr show users.
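That check can be scripted roughly as follows. This is a sketch, not part of the playbook: user_in_accounting is a hypothetical helper name, and it assumes sacctmgr is run with -n (no header) and -P (parsable output, fields separated by "|") so that the user name is the first field of each line.

```shell
# Hypothetical helper: succeed if user $1 appears in sacctmgr-style
# parsable output read from stdin (one record per line, "|" separated,
# user name in the first field).
user_in_accounting() {
    awk -F'|' -v u="$1" '$1 == u { found = 1 } END { exit !found }'
}

# Usage on the head node, after one submission attempt as ec2-user:
#   sbatch --wrap="hostname"
#   sacctmgr -n -P show users | user_in_accounting ec2-user \
#       && echo "present" || echo "missing"
```

Matching the whole first field avoids false positives from user names that merely contain the string you are looking for.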

@griffji
Author

griffji commented Nov 18, 2021

[ec2-user@head ~]$ vim test-job.sh
[ec2-user@head ~]$ sbatch test-job.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
[ec2-user@head ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
[ec2-user@head ~]$ sacctmgr show users
      User   Def Acct     Admin
      root       root Administ+

[ec2-user@head ~]$ sudo tail -F /var/log/slurm/slurm
slurmctld.log slurmctld.log-20211114 slurmdbd.log slurmdbd.log-20211114
[ec2-user@head ~]$ sudo tail -F /var/log/slurm/slurmctld.log |grep -i error
[2021-11-18T18:21:55.726] error: Nodes dal2grid04 not responding
[2021-11-18T18:26:55.222] error: Nodes dal2grid04 not responding
[2021-11-18T18:31:55.719] error: Nodes dal2grid04 not responding
[2021-11-18T18:36:55.211] error: Nodes dal2grid04 not responding
[2021-11-18T18:41:55.704] error: Nodes dal2grid04 not responding
[2021-11-18T18:45:11.718] error: job_submit/lua: /etc/slurm/job_submit.lua: /etc/slurm/job_submit.lua:15: attempt to concatenate local 'username' (a nil value)
[2021-11-18T18:45:11.718] error: User 1000 not found
[2021-11-18T18:45:26.665] error: job_submit/lua: /etc/slurm/job_submit.lua: /etc/slurm/job_submit.lua:15: attempt to concatenate local 'username' (a nil value)
[2021-11-18T18:45:26.665] error: User 1000 not found
[2021-11-18T18:46:55.198] error: Nodes dal2grid04 not responding
^C
[ec2-user@head ~]$ tail -F /var/log/slurm/slurmdbd.log |grep -i error
tail: cannot open ‘/var/log/slurm/slurmdbd.log’ for reading: Permission denied
^C
[ec2-user@head ~]$ sudo tail -F /var/log/slurm/slurmdbd.log |grep -i error
[2021-11-16T23:51:55.234] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[2021-11-16T23:51:55.236] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
[2021-11-16T23:52:00.254] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout

@pescobar
Member

I would suggest verifying that the user ec2-user exists on every host and has the same UID on all of them.
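One way to run that comparison is sketched below. The helper name check_uid_consistency is hypothetical. The usage comment takes the host names from the shell prompt and the sinfo output earlier in this thread, and assumes passwordless SSH from the head node to the compute nodes.

```shell
# Hypothetical helper: read "host uid" pairs from stdin and fail if any
# host lacks the user (empty uid) or reports a uid that differs from the
# first uid seen.
check_uid_consistency() {
    awk '
        NF == 0   { next }                                  # skip blank lines
        NF < 2    { print "missing on " $1; bad = 1; next } # no uid reported
        ref == "" { ref = $2 }                              # first uid is the reference
        $2 != ref { print "mismatch on " $1 ": " $2 " != " ref; bad = 1 }
        END       { exit bad + 0 }
    '
}

# Usage (assumed host names and SSH access):
#   for h in head dal2grid01 dal2grid02 dal2grid03 dal2grid04; do
#       echo "$h $(ssh -n "$h" id -u ec2-user 2>/dev/null)"
#   done | check_uid_consistency && echo "uids consistent"
```

Note that an unreachable node, such as one that is not responding, would show up as "missing" here, so a missing entry does not necessarily mean the account is absent.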

If the user exists, I guess the Lua submit script is failing for some reason when it tries to add ec2-user to the Slurm accounting database. The problem could be the "-" in the username. I would add extra debug lines to the Lua submit script (like here) to narrow it down.
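Before adding debug lines, it may help to look at the code around the line that the slurmctld log above points to (job_submit.lua:15, where 'username' is nil). A small sketch for that, with a hypothetical helper name show_context:

```shell
# Hypothetical helper: print lines (line - radius) .. (line + radius) of a
# file, each prefixed with its line number.
# $1 = file, $2 = line number, $3 = radius (defaults to 3).
# The body runs in a subshell so its variables do not leak to the caller.
show_context() (
    file="$1"
    line="$2"
    radius="${3:-3}"
    start=$((line - radius))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$((line + radius))
    awk -v s="$start" -v e="$end" \
        'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }' "$file"
)

# Usage on the master host:
#   show_context /etc/slurm/job_submit.lua 15
```

Whatever call assigns username just before line 15 is the place where an extra log line would show whether the lookup returns nothing at all or chokes on the hyphen.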
