Provider Stats Endpoint fails to account for Service Count in GPU deployments #147

Open
andy108369 opened this issue Nov 16, 2023 · 0 comments
Labels
repo/provider Akash provider-services repo issues sev2

Comments

@andy108369
Contributor

Description:
The provider stats endpoint misreports GPU utilization for deployments in which a GPU-requesting service has a service count greater than 1. The problem is present in provider version 0.4.7 and Akash Network version 0.26.2.

Issue Details:
The current implementation of the provider stats endpoint does not factor in the 'service count' for deployments that request GPUs. As a result, the reported total GPU usage and availability are inaccurate.

Example Scenario:
Consider a GPU deployment consisting of two services:

  1. First service with count: 14 and gpu: 2.
  2. Second service with count: 1 and gpu: 2.

The total GPU usage should be 30 (calculated as 14*2 + 1*2), but the provider stats do not reflect this.
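The arithmetic above can be sketched as follows. This is a minimal illustration, not the provider's code: the `Service` class and the service names are hypothetical. The "naive" sum that ignores the count happens to equal the 4 GPUs reported by the endpoint, which is consistent with the count being dropped, although this issue does not confirm that as the root cause.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for one service in a deployment."""
    name: str
    count: int  # number of replicas of the service
    gpu: int    # GPUs requested per replica


def total_gpus(services):
    """Total GPUs in use: per-replica GPU request multiplied by service count."""
    return sum(s.count * s.gpu for s in services)


def gpus_ignoring_count(services):
    """Sum of per-replica GPU requests only, i.e. what you get if count is dropped."""
    return sum(s.gpu for s in services)


services = [
    Service(name="first", count=14, gpu=2),
    Service(name="second", count=1, gpu=2),
]

print(total_gpus(services))           # 30
print(gpus_ignoring_count(services))  # 4
```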

Observed Output:
For the provider at provider.akash-ai.com (akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e), the stats endpoint incorrectly reports only 4 GPUs in use; it should report 30. The script output below is based on the :8443/status response included under Additional info:

$ provider_info.sh provider.akash-ai.com
type       cpu      gpu  ram                 ephemeral          persistent
used       180      4    428                 3700               0
pending    0        0    0                   0                  0
available  564.8    2    1735.996597290039   3378.038669425994  0
node       171      0    448.06262588500977  869.5096673564985  N/A
node       170.78   0    447.92784881591797  869.5096673564985  N/A
node       171.495  0    447.97326850891113  869.5096673564985  N/A
node       51.525   2    392.0328540802002   769.5096673564985  N/A

Expected Behavior:
The provider stats endpoint should accurately represent the total number of GPUs in use, incorporating the 'service count' in its calculation for deployments with GPU requests.

Impact:
This inaccurate reporting can lead to misunderstandings regarding resource availability and utilization, potentially affecting scheduling decisions and overall resource management on the Akash Network.


Additional info

root@node1:~# kubectl get deployment -A -o yaml | grep -Ei 'gpu|readyReplicas'
...
    readyReplicas: 1
                - key: akash.network/capabilities.gpu.vendor.nvidia.model.a100
              nvidia.com/gpu: "2"
              nvidia.com/gpu: "2"
    readyReplicas: 1
                - key: akash.network/capabilities.gpu.vendor.nvidia.model.a100
          image: REDACTED
              nvidia.com/gpu: "2"
              nvidia.com/gpu: "2"
    readyReplicas: 14
root@node1:~# 
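As a cross-check of the kubectl output above, the expected GPU total can be computed from the Deployment objects themselves. The sketch below assumes the JSON form of the same listing (`kubectl get deployment -A -o json`) and counts only the `nvidia.com/gpu` limit of each container; the function name and the sample data are illustrative, not taken from the provider.

```python
def gpus_requested(deployment_list, resource="nvidia.com/gpu"):
    """Sum replicas * per-pod GPU limit over a Kubernetes Deployment list."""
    total = 0
    for item in deployment_list.get("items", []):
        spec = item.get("spec", {})
        # Kubernetes defaults spec.replicas to 1 when it is not set.
        replicas = spec.get("replicas", 1)
        containers = spec.get("template", {}).get("spec", {}).get("containers", [])
        per_pod = sum(
            int(c.get("resources", {}).get("limits", {}).get(resource, 0))
            for c in containers
        )
        total += replicas * per_pod
    return total


def make_deployment(replicas, gpu):
    """Build a minimal Deployment-shaped dict for the example."""
    container = {"resources": {"limits": {"nvidia.com/gpu": str(gpu)}}}
    return {"spec": {"replicas": replicas,
                     "template": {"spec": {"containers": [container]}}}}


# Sample data mirroring the two GPU deployments above (1 and 14 replicas).
sample = {"items": [make_deployment(1, 2), make_deployment(14, 2)]}
print(gpus_requested(sample))  # 30
```

To run it against a live cluster, load the JSON from standard input with `json.load(sys.stdin)` and pass the result to `gpus_requested`.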
$ curl -s -k https://provider.akash-ai.com:8443/status | jq -r . 
{
  "cluster": {
    "leases": 1,
    "inventory": {
      "active": [
        {
          "cpu": 180000,
          "gpu": 4,
          "memory": 459561500672,
          "storage_ephemeral": 3972844748800
        }
      ],
      "available": {
        "nodes": [
          {
            "cpu": 171000,
            "gpu": 0,
            "memory": 481103581184,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 170780,
            "gpu": 0,
            "memory": 480958865408,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 171495,
            "gpu": 0,
            "memory": 481007634432,
            "storage_ephemeral": 933628896213
          },
          {
            "cpu": 51525,
            "gpu": 2,
            "memory": 420942071808,
            "storage_ephemeral": 826254713813
          }
        ]
      }
    }
  },
  "bidengine": {
    "orders": 0
  },
  "manifest": {
    "deployments": 0
  },
  "cluster_public_hostname": "provider.akash-ai.com",
  "address": "akash1c6rsz4f59nkus3s5qauxxh969j2mtkkn2clk2e"
}
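The "used" row in the script output can be reproduced from the `active` section of this response. The sketch below is an assumption about how the figures relate, not the source of `provider_info.sh`: the numbers match if CPU is divided by 1000 (millicores to cores) and memory and storage by 2^30 (bytes to GiB). The `summarize_active` name is illustrative.

```python
import json

GIB = 2 ** 30  # bytes per GiB


def summarize_active(status):
    """Aggregate the 'active' inventory entries of a /status response."""
    active = status["cluster"]["inventory"]["active"]
    return {
        "cpu": sum(a["cpu"] for a in active) / 1000,  # millicores -> cores
        "gpu": sum(a["gpu"] for a in active),
        "ram": sum(a["memory"] for a in active) / GIB,
        "ephemeral": sum(a["storage_ephemeral"] for a in active) / GIB,
    }


# Trimmed copy of the response above, keeping only the 'active' section.
status = json.loads("""
{
  "cluster": {
    "leases": 1,
    "inventory": {
      "active": [
        {
          "cpu": 180000,
          "gpu": 4,
          "memory": 459561500672,
          "storage_ephemeral": 3972844748800
        }
      ]
    }
  }
}
""")

used = summarize_active(status)
print(used)

expected_gpu = 14 * 2 + 1 * 2  # from the example scenario
print("GPUs missing from report:", expected_gpu - used["gpu"])
```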
@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Nov 16, 2023