Rudimentary GPU load balancing #89

Open
dcommander opened this issue Nov 15, 2018 · 4 comments

@dcommander
Member

This issue is to track an idea for a script-based GPU load balancer that would be invoked by vglrun. Ideally this script would, if VGL_DISPLAY/-d is auto:

  1. Call nvidia-smi and amdconfig to build a list of available GPUs for which load information can be obtained, as well as the load of each GPU.
  2. Somehow figure out which X screen is attached to the least-loaded GPU. I'm not quite sure how to do this. nvidia-smi can be used to figure out whether any Xorg processes are attached to the least-loaded GPU (not sure whether amdconfig can do the same thing), and ps can obtain the command line (and thus the X display number) of the Xorg process, but that doesn't tell me which X screen is attached to which GPU (there can be multiple screens per Xorg process.)
  3. Set VGL_DISPLAY to the X screen of the least-loaded GPU.
  4. ???
  5. Profit.
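
For the NVIDIA case, step 1 and the selection behind step 3 could be sketched roughly as follows. This is only an illustration under assumptions: the function name `least_loaded_gpu` is hypothetical, "load" is taken to mean `utilization.gpu` alone, and neither the `amdconfig` side nor the GPU-to-X-screen mapping of step 2 is covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VirtualGL): read "index, utilization" CSV
# rows on stdin, i.e. the shape printed by
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# and print the index of the GPU with the lowest utilization.
# Returns non-zero if no row carries a numeric utilization.
least_loaded_gpu() {
    awk -F', *' '
        # Skip rows whose utilization is not a plain number, e.g. "[N/A]".
        $2 !~ /^[0-9]+$/ { next }
        # Keep the first row seen with the smallest utilization.
        best == "" || $2 + 0 < min { min = $2 + 0; best = $1 }
        END { if (best != "") print best; else exit 1 }
    '
}
```

A caller might use it as `gpu=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | least_loaded_gpu)`. Ties go to the lower-numbered GPU simply because it is seen first.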

This does not depend on #10, but it would need to accommodate that new feature. The idea is that #10 would work the way the old GLP feature used to work on Solaris/SPARC, i.e. it would be activated by specifying a device path rather than an X display in VGL_DISPLAY/-d. Similarly, this feature could be extended by specifying autoegl instead of auto, thus instructing the script to find an EGL device path instead of an X display. If (2) above proves impossible or unwieldy, then this feature might have to rely on EGL.
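
To make the proposed `VGL_DISPLAY`/`-d` values concrete, the dispatch could look something like the sketch below. Everything here is hypothetical: the function name, the mode labels it prints, and in particular the assumption that a device path can be recognized by a leading `/`. The issue does not specify how device paths would be detected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a VGL_DISPLAY/-d value into one of four modes.
vgl_display_mode() {
    case "$1" in
        auto)    echo "auto-x11" ;;    # let the script pick an X display/screen
        autoegl) echo "auto-egl" ;;    # let the script pick an EGL device path
        /*)      echo "egl-device" ;;  # assumption: device paths are absolute
        *)       echo "x-display" ;;   # e.g. :0 or :0.1, used as given
    esac
}
```

With this split, only the two `auto*` modes would ever invoke the load-balancing script. Explicit displays and device paths pass through untouched.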

@al3x609

al3x609 commented Feb 8, 2019

Hi, this is my configuration:
[image]

I started only one X server process, but the Xorg process runs on both GPUs.

I think the -d parameter should automatically choose the least-loaded GPU, based on the GPU utilization and memory utilization percentages (utilization.gpu, utilization.memory) reported by an nvidia-smi query.

e.g.

nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
utilization.gpu [%], utilization.memory [%]
2 %, 1 %
43 %, 39 %
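
A selection based on both metrics could be sketched as below. The assumptions are mine, not the commenter's: the helper name is hypothetical, `index` is added to the query so each row identifies its GPU, and the two percentages are simply summed, which is an arbitrary weighting chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a GPU from the CSV (with header row) printed by
#   nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory --format=csv
# Score = GPU % + memory %; the lowest score wins.
pick_gpu_by_combined_load() {
    awk -F', *' '
        NR == 1 { next }                              # skip the CSV header row
        {
            gpu = $2; mem = $3
            sub(/ *%$/, "", gpu); sub(/ *%$/, "", mem)   # "43 %" -> "43"
            # Ignore rows where either value is not numeric, e.g. "[N/A]".
            if (gpu !~ /^[0-9]+$/ || mem !~ /^[0-9]+$/) next
            score = gpu + mem
            if (best == "" || score < min) { min = score; best = $1 }
        }
        END { if (best != "") print best; else exit 1 }
    '
}
```

Fed the two rows shown above, with GPU 0 as the busier one, this would print the index of the idle GPU. It still says nothing about which X screen that GPU drives.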

@algorythmic

> Somehow figure out which X screen is attached to the least-loaded GPU.

I've only been able to do this with NV-CONTROL (NV_CTRL_BINARY_DATA_XSCREENS_USING_GPU or NV_CTRL_BINARY_DATA_GPUS_USED_BY_XSCREEN).

For example, with the nv-control-dpy sample from the nvidia-settings repo:

$ DISPLAY=:0 ./nv-control-dpy --query-gpus
...
GPU Information:
  number of GPUs: 2
  number of X screens using GPU 0: 2
    Indices of X screens using GPU 0:  0 1
  number of X screens using GPU 1: 2
    Indices of X screens using GPU 1:  2 3
...
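
Output in this shape is easy to turn into a GPU-to-screen lookup. The sketch below assumes the line format shown above stays stable across versions of the sample; the function name `screens_for_gpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given `nv-control-dpy --query-gpus` output on stdin,
# print the X screen indices that use GPU $1, one per line.
screens_for_gpu() {
    # Keep only the "Indices of X screens using GPU <n>:" line for the
    # requested GPU, strip the label, then split the indices onto lines.
    sed -n "s/^ *Indices of X screens using GPU $1: *//p" | tr -s ' ' '\n'
}
```

For example, `DISPLAY=:0 ./nv-control-dpy --query-gpus | screens_for_gpu 1` would list the screens on GPU 1 of display :0.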

@dcommander
Member Author

That would be fine if we knew that all GPUs were attached to display :0, but we can’t assume that (some deployments attach them to separate X displays.) The purpose here is to divine the appropriate value of VGL_DISPLAY for the least-loaded GPU, which might be attached to any X display in the system.
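
One conceivable way to cope with GPUs spread across several X displays is to enumerate the local displays first and then query each in turn. The sketch below assumes the conventional socket directory `/tmp/.X11-unix`, where display `:N` owns a socket named `XN`. The function name is hypothetical, and the resulting list would also include X proxies such as VNC sessions, which have no GPU attached and would need to be filtered out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list local X displays by looking at socket names.
# Assumption: entries are named X<N> in the given directory
# (default /tmp/.X11-unix, the conventional location on Linux).
list_x_displays() {
    local dir=${1:-/tmp/.X11-unix} sock
    for sock in "$dir"/X*; do
        [ -e "$sock" ] || continue       # the glob matched nothing
        case "${sock##*/X}" in
            ''|*[!0-9]*) ;;              # not X<digits>; ignore it
            *) echo ":${sock##*/X}" ;;   # X12 -> :12
        esac
    done
}
```

Each display printed could then be probed for the GPUs behind its screens, for instance with the NV-CONTROL query shown earlier in the thread.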

@algorythmic

I thought that once the least-loaded GPU is identified, parsing the command line of the associated X process would tell you the display number (as you described in #89 (comment)).

Then, once you know the display number, the NV-CONTROL method can be used to determine which screen(s) of the identified display are on the GPU in question.

For example, the following code works for me (relying on the nv-control-dpy sample from the nvidia-settings repo just as a POC):

# find index and X server PID of the least-utilized GPU that has X running
# (assumes the process name is the 8th pmon column and is exactly "X")
read -r gpu pid _ < <(nvidia-smi pmon -c 1 | awk '$8 == "X" { print $1,$2,$4 }' \
    | sort -n -k3 | head -n1)
# find DISPLAY in command line of X PID (arguments are NUL-separated, so
# convert them to lines before matching)
display=$(tr '\0' '\n' < "/proc/$pid/cmdline" | grep -x ':[0-9]\+' | head -n1)
# use NV-CONTROL to find first SCREEN on GPU and DISPLAY identified
screen=$(DISPLAY=$display ./nv-control-dpy --query-gpus \
    | sed -n "s/ *Indices of X screens using GPU $gpu:  //p" | cut -d' ' -f1)
# set VGL_DISPLAY and export it so that child processes see it
export VGL_DISPLAY="${display}.${screen}"
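
The first step of this POC could be made a little more tolerant, along the lines of the sketch below. Assumptions, all of them mine: the X server may show up as either `X` or `Xorg`, the process name is the last `pmon` column, `sm` utilization is the fourth, and a `-` in that column (no sample) is treated as 0, which matches what `sort -n` does above. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi pmon -c 1` output on stdin and print
# "<gpu index> <pid>" for the X server on the least-utilized GPU.
# Returns non-zero if no X server process is listed.
least_loaded_x_gpu() {
    awk '
        /^#/ { next }                               # pmon header lines
        $NF != "X" && $NF != "Xorg" { next }        # not an X server process
        {
            sm = ($4 ~ /^[0-9]+$/) ? $4 + 0 : 0     # "-" means no sample yet
            if (best == "" || sm < min) { min = sm; best = $1 " " $2 }
        }
        END { if (best != "") print best; else exit 1 }
    '
}
```

It would slot in as `read -r gpu pid < <(nvidia-smi pmon -c 1 | least_loaded_x_gpu)`. Checking the exit status (or for an empty `$pid`) before reading `/proc/$pid/cmdline` avoids building a bogus `VGL_DISPLAY` when no X server is attached to any GPU.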

3 participants