Fairshare explanation #25
Comments
This will be part of the new Slurm page; it can be published before the rest of the page or along with it.
As both Delta and Hydro start to get more user load, jobs will start to take longer to run, and so we'll get more and more tickets that we in the Blue Waters team used to internally refer to as "why my job not run?" tickets. So yeah, we'll definitely want to put this up so that we can refer to it. "I believe that NCSA has a policy not to reveal the specifics of the job scheduling algorithms so that the users don't try to game the system." I think maybe "policy" is too strong a word here, but broadly yes, we work really hard not to tell users the exact parameters of the job system. First, as you say, mildly knowledgeable users would probably use that knowledge to game the job system. But in addition to that, it would be very difficult to keep that documentation up to date: scheduler parameters can sometimes be tweaked on a very short timescale to react to user job behavior (or lack thereof). Since I have feelings about scheduling and schedulers and stuff, I've added myself as an assignee on this. I will do my absolute best to really contribute to this documentation, and good intentions and all that, but I definitely want to keep track of it.
I think it was a policy a couple of decades ago :-) Just having some text we can point users to when they feel their jobs are being unfairly delayed will be a big help. (Why my job not start.) We may also want to include a description of reservations ("scontrol show reservations") and node state ("sinfo -N -p <partition>") and how they affect job starts.
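A minimal sketch of those two checks (the partition name is a placeholder, not an actual cluster partition):

$ scontrol show reservations      # list active reservations that set nodes aside for specific users/accounts/times
$ sinfo -N -p <partition>         # per-node state (idle, alloc, drain, down, ...) within one partition

Nodes that are drained, down, or held inside a reservation are unavailable to ordinary jobs, which is often the real reason a queued job has not started yet.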
Right, @pmenstrom, good point. Having either a link to a page with job system commands that tell you what your status is and why your job is likely waiting, or this being that page, would be good. Also, again, either a link to a page with (or this being the page with) recommendations on how to structure a job request so that it's more likely to run and give you what you want. I typed out a version of this in a ticket yesterday. Things like: only request resources you're actually likely to use. If your code runs between 4 and 5 hours, then request (say) 6 hours, not 24, because your job will get scheduled sooner, other jobs will get scheduled sooner, and if your code malfunctions and runs away but doesn't end the job, you won't have wasted nearly as much allocation. Which in turn raises the question: how do I know how long my job runs? Well, let me link you to our page (or section of a page) on benchmarking and scaling up jobs. Hmm. Ok, I may assign creating skeletons of those pages to myself.
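As an illustration of that walltime advice (the times come from the 4-5 hour example above and are not a site recommendation), the only change in the batch script is the --time request:

#SBATCH --time=06:00:00    # runtime is usually 4-5 hours, so request modest headroom
(rather than)
#SBATCH --time=24:00:00    # hard for backfill to place, and a runaway job wastes far more allocation

A shorter, realistic request gives the backfill scheduler smaller gaps to fit the job into, so it tends to start sooner and returns its nodes to the pool sooner.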
Sounds good, let me know when you have it sketched out and I might throw a few more things into it. May end up with things that should be on separate, related pages...
This has come up again in another user ticket.
@pmenstrom can you please add the number of the current ticket? I'll track it to help inform an initial writeup. Thank you!
It is an old Campus Cluster ticket. They were asking for a bump in job priority and also asking for fairshare details: ICCPADM-4810.
I'm going to work on getting the Slurm pages (including a fairshare discussion) user-presentable in the near term and we can continue to refine them once they're published. |
The Campus Cluster user documentation currently only mentions that the secondary queue uses fairshare, but Weddie thinks it is active on all of the queues. The Delta documentation doesn't mention fairshare at all.
The Slurm "sshare -l" command shows output on CC, Nightingale, and Delta, but I am not sure if all three systems use it or if Slurm just collects the data regardless.
A general description of fairshare would be a good candidate for cross-cluster content. It should probably just be a generic discussion of how fairshare works in Slurm and maybe mention what the "sshare" command output means for the user. I believe that NCSA has a policy not to reveal the specifics of the job scheduling algorithms so that the users don't try to game the system.
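A generic sketch of that output (the account, user, and numbers are invented for illustration, the column list is abridged, and real values depend on each cluster's share configuration):

$ sshare -l
Account    User    RawShares   NormShares   RawUsage   EffectvUsage   FairShare
abcd       jdoe            1     0.002024    1234567       0.000310    0.845321

Roughly speaking, FairShare is the factor the scheduler folds into job priority: it falls as an account's recent usage (EffectvUsage) climbs above its allotted share (NormShares), and it recovers as that usage decays over time, so a group that has been running heavily will see its new jobs start with lower priority.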