sacct call fails when there are too many job ids? #45

Open
terrycojones opened this issue Apr 30, 2019 · 1 comment
Comments

@terrycojones
Member

It looks like a call to sacct can fail when there are too many job ids.

$ make status   
Traceback (most recent call last):
  File "/rds/project/djs200/rds-djs200-acorg/bt/root/share/virtualenvs/365/bin/slurm-pipeline-status.py", line 61, in <module>
    status = SlurmPipelineStatus(args.specification, fieldNames=args.fieldNames)
  File "/rds/project/djs200/rds-djs200-acorg/bt/root/share/virtualenvs/365/lib/python3.6/site-packages/slurm_pipeline/status.py", line 24, in __init__
    self.sacct = SAcct(self.specification, jobIds, fieldNames=fieldNames)
  File "/rds/project/djs200/rds-djs200-acorg/bt/root/share/virtualenvs/365/lib/python3.6/site-packages/slurm_pipeline/sacct.py", line 68, in __init__
    ', '.join(map(str, sorted(jobIdsOfInterest)))))
slurm_pipeline.error.SAcctError: sacct did not return information about the following job ids: 11133989, 11133990, 11133991, 11133992, 11133993, 11133994, 11133995, 11133996, 11133997, 11133998, 11133999, 11134001, 11134002, 11134003, 11134004, 11134005, 11134006, 11134007, 11134008, 11134009, 11134010, 11134011, 11134012, 11134013, 11134014, 11134015, 11134016, 11134017, 11134018, 11134019, 11134020, 11134021, 11134022, 11134023, 11134024, 11134025, 11134026, 11134027, 11134028, 11134029, 11134030, 11134031, 11134032, 11134033, 11134034, 11134040, 11134041, 11134042, 11134043, 11134044, 11134045, 11134046, 11134047, 11134048, 11134049, 11134050, 11134051, 11134052, 11134053, 11134054, 11134055, 11134056, 11134057, 11134058, 11134059, 11134060, 11134061, 11134062, 11134063, 11134064, 11134065, 11134066, 11134067, 11134068, 11134069, 11134070, 11134071, 11134072, 11134073, 11134074, 11134075, 11134076, 11134077, 11134078, 11134079, 11134080, 11134081, 11134082, 11134083, 11134084, 11134085, 11134086, 11134087, 11134088, 11134089, 11134090, 11134091, 11134092, 11134093, 11134094, 11134095, 11134096, 11134097
make: *** [status] Error 1

The code should print exactly what it tried to run (the actual sacct command line) so that the user can try it manually.
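A minimal sketch of what that could look like, assuming the command is built as an argument list before it is run. The helper names `sacct_command` and `missing_jobs_message` are hypothetical and not part of slurm_pipeline; the `SAcctError` class here is only a stand-in for the real one in `slurm_pipeline.error`.

```python
import shlex


class SAcctError(Exception):
    """Stand-in for slurm_pipeline.error.SAcctError (assumption)."""


def sacct_command(job_ids, field_names=("JobID", "State")):
    """Build the sacct argument list for the given job ids (hypothetical helper)."""
    return [
        "sacct",
        "-P",  # Parsable output, fields separated by '|'.
        "-n",  # No header line.
        "--format=" + ",".join(field_names),
        "--jobs=" + ",".join(str(job_id) for job_id in sorted(job_ids)),
    ]


def missing_jobs_message(args, missing):
    """Describe the missing job ids and show the command in copy-pasteable form."""
    # shlex.quote keeps the command safe to paste into a shell.
    command = " ".join(shlex.quote(arg) for arg in args)
    return (
        "sacct did not return information about the following job ids: "
        + ", ".join(map(str, sorted(missing)))
        + ". The command that was run: "
        + command
    )


if __name__ == "__main__":
    args = sacct_command([11133990, 11133989])
    try:
        raise SAcctError(missing_jobs_message(args, {11133989, 11133990}))
    except SAcctError as error:
        print(error)
```

With the command in the message, the user can paste it into a shell and see directly what sacct returns.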

I have a feeling this is going to require a different approach (one that I'm pretty sure was already suggested): asking sacct for all jobs since a certain time and then parsing that output for just the job ids of interest. The downside is that you may receive many thousands of job ids in which you have no interest, but the query can probably be restricted to the user who started the jobs, or similar.
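Roughly, that approach could be sketched as below. This is an illustration under assumptions, not the slurm_pipeline implementation: the function names are hypothetical, the start time and user would have to come from somewhere (for example the time the pipeline was scheduled), and the parsing assumes sacct's `-P` output with `JobID` as the first field. The `-X` option limits output to job allocations, so job steps such as `.batch` do not appear as separate lines.

```python
def sacct_since_command(start_time, user, field_names=("JobID", "State")):
    """Build a sacct command listing all of a user's jobs since start_time."""
    return [
        "sacct",
        "-X",  # Allocations only, no job steps.
        "-P",  # Parsable output, fields separated by '|'.
        "-n",  # No header line.
        "--starttime=" + start_time,  # E.g. "2019-04-30".
        "--user=" + user,
        "--format=" + ",".join(field_names),
    ]


def filter_jobs(sacct_output, job_ids_of_interest, field_names=("JobID", "State")):
    """Keep only the lines for wanted job ids; also report ids never seen.

    Returns (found, missing): found maps a job id string to a dict of its
    fields, missing is a sorted list of wanted id strings absent from the output.
    """
    wanted = {str(job_id) for job_id in job_ids_of_interest}
    found = {}
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("|")
        if fields[0] in wanted:
            found[fields[0]] = dict(zip(field_names, fields))
    missing = sorted(wanted - set(found))
    return found, missing


if __name__ == "__main__":
    # Made-up output for illustration; real output would come from running
    # the command built by sacct_since_command.
    sample = "11133989|COMPLETED\n11133990|RUNNING\n999|FAILED\n"
    print(filter_jobs(sample, [11133989, 11133990, 11133991]))
```

Filtering on the client side keeps the command line short no matter how many job ids the pipeline has, at the cost of reading lines for unrelated jobs.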

@terrycojones
Member Author

terrycojones commented May 3, 2019

Stuart Rankin clears up what's going on here:

SLURM consists of two pieces - the controller and the database daemon. Information about jobs
starts in the controller and is conveyed asynchronously to the database daemon, and sacct talks
to the database daemon. This means that at busy times, there may be a delay before job information 
reaches the database and can be queried by sacct. Eventually the information will be transferred and
the controller will forget that particular job. Note that job 11260826 is still pending, and is known to
the controller.

I suggest that it would be more reliable if your script started by using squeue to query the job (this
talks to the controller and has its own format options), and falls back to sacct if squeue responds
"no such job".
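A rough sketch of that squeue-first strategy, under several assumptions: the function names are hypothetical, jobs are queried one at a time for simplicity (a real implementation would probably batch them), and any non-zero exit status or empty output from squeue is treated as "not known to the controller" rather than matching a specific error message. The `run` parameter exists only so the logic can be exercised without a SLURM installation.

```python
import subprocess


def run_command(args):
    """Run a command and return its stdout, or '' if it exits non-zero."""
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout if result.returncode == 0 else ""


def parse_states(output):
    """Parse 'jobid|state' lines into a dict mapping job id to state."""
    states = {}
    for line in output.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[0]:
            states[fields[0]] = fields[1]
    return states


def job_states(job_ids, run=run_command):
    """Ask squeue (the controller) first, then fall back to sacct (the database)."""
    states = {}
    for job_id in map(str, job_ids):
        # %i is the job id and %T the job state in squeue's format syntax.
        found = parse_states(run(["squeue", "-h", "-j", job_id, "-o", "%i|%T"]))
        if job_id not in found:
            found = parse_states(
                run(["sacct", "-X", "-P", "-n", "-j", job_id,
                     "--format=JobID,State"])
            )
        if job_id in found:
            states[job_id] = found[job_id]
    return states
```

A job that neither command knows about is simply absent from the result, so the caller can decide whether to retry later or raise an error.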
