Provider Feature Discovery Service #141

Open
anilmurty opened this issue Oct 16, 2023 · 19 comments
Assignees
Labels
P1 repo/provider Akash provider-services repo issues

Comments

@anilmurty

anilmurty commented Oct 16, 2023

Feature discovery service that runs on providers and makes capabilities visible to clients

@anilmurty
Author

API definition in progress. Will need a network upgrade

@anilmurty
Author

anilmurty commented Oct 17, 2023

Initial prototype and demo working. Next: work on the inventory operator (which gathers info from the provider, including the quantity and type of hardware), along with storing the data.

@brewsterdrinkwater
Collaborator

  • Artur and Scott realized that the information from third-party APIs was not giving the right labels. The team needs to maintain its own database.
  • Found a much cleaner path that uses gRPC server streaming.
  • Over the last couple of days, Scott and Artur have been talking more about the spec. Further meetings will happen to finalize it.
  • The first focus is on which GPUs are on the Akash Network. This will be helpful for Cloudmos, Praetor, and other clients.
  • Want to report allocated/unallocated counts for GPUs as well.
  • The mechanics and hard work behind the scenes are done. Working on testing next.

@brewsterdrinkwater
Collaborator

October 31st:

  • Issue with gRPC types and streaming (Scott is working on this).
  • Feature discovery within the pod is complete.
  • Allocation discovery is available as well. The logic is done and working.
  • The data can be sent to a client without issue.
  • Need to write some documentation on how to do this properly in the future.

@anilmurty anilmurty transferred this issue from akash-network/community Nov 7, 2023
@anilmurty
Author

Nov 7:

Demo from Scott of the daemonset running across all worker nodes and of the client discovery service

Next: Implement API endpoint for clients to be able to query

Greg question: Can we have historical data about capabilities (GPU models and count in particular) stored on the blockchain so that it is queryable by anyone? - this will be important if we want to offer incentives to providers.

@chainzero
Collaborator

chainzero commented Nov 7, 2023

Example JSON output of current implementation:

{
  "nodes": [
    {
      "cpu": {
        "info": [
          {
            "id": "0",
            "vendor": "GenuineIntel",
            "model": "Intel(R) Xeon(R) CPU @ 2.30GHz",
            "vcores": 2
          }
        ],
        "quantity": {
          "allocatable": "1",
          "allocated": "1"
        }
      },
      "memory": {
        "info": null,
        "quantity": {
          "allocatable": "15337752Ki",
          "allocated": "1612240896"
        }
      },
      "gpus": {
        "info": [
          {
            "vendor": "nvidia",
            "name": "t4",
            "modelid": "1eb8",
            "interface": "pci-e",
            "memory": "15gb"
          }
        ],
        "quantity": {
          "allocatable": "1",
          "allocated": "1"
        }
      },
      "storage": {
        "info": null,
        "quantity": {
          "allocatable": "253869360Ki",
          "allocated": "1073741824"
        }
      }
    },
    {
      "cpu": {
        "info": [
          {
            "id": "0",
            "vendor": "GenuineIntel",
            "model": "Intel(R) Xeon(R) CPU @ 2.20GHz",
            "vcores": 2
          }
        ],
        "quantity": {
          "allocatable": "2",
          "allocated": "0"
        }
      },
      "memory": {
        "info": null,
        "quantity": {
          "allocatable": "16365480Ki",
          "allocated": "513875968"
        }
      },
      "gpus": {
        "info": null,
        "quantity": {
          "allocatable": "0",
          "allocated": "0"
        }
      },
      "storage": {
        "info": null,
        "quantity": {
          "allocatable": "253869360Ki",
          "allocated": "0"
        }
      }
    }
  ],
  "storage": null
}
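A client consuming this output might summarize GPU availability per node along the following lines. This is only a sketch, not part of the provider code: the helper names `parse_quantity` and `summarize_gpus` are hypothetical, the sample is trimmed from the output above to the `gpus` sections, and the suffix table covers only the common Kubernetes-style suffixes (it does not handle forms such as `500m`).

```python
import json
from decimal import Decimal

# Common Kubernetes-style quantity suffixes (assumption: not an exhaustive list).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(text):
    """Convert a quantity string such as "15337752Ki" or "1612240896" to an int."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return int(Decimal(text[:-len(suffix)]) * factor)
    return int(Decimal(text))


def summarize_gpus(inventory):
    """Return one row per node: GPU models present and the unallocated GPU count."""
    rows = []
    for index, node in enumerate(inventory.get("nodes") or []):
        gpus = node.get("gpus") or {}
        quantity = gpus.get("quantity") or {}
        allocatable = parse_quantity(quantity.get("allocatable", "0"))
        allocated = parse_quantity(quantity.get("allocated", "0"))
        # "info" is null on nodes without GPUs, so fall back to an empty list.
        models = [f'{g["vendor"]} {g["name"]}' for g in gpus.get("info") or []]
        rows.append({"node": index, "models": models, "free": allocatable - allocated})
    return rows


# Sample trimmed from the JSON above (GPU sections only).
sample = json.loads("""
{
  "nodes": [
    {"gpus": {"info": [{"vendor": "nvidia", "name": "t4", "modelid": "1eb8",
                        "interface": "pci-e", "memory": "15gb"}],
              "quantity": {"allocatable": "1", "allocated": "1"}}},
    {"gpus": {"info": null,
              "quantity": {"allocatable": "0", "allocated": "0"}}}
  ],
  "storage": null
}
""")

for row in summarize_gpus(sample):
    print(row)
```

Note that the memory and storage fields mix suffixed and plain values ("15337752Ki" vs "1612240896"), so a client probably needs to normalize both through something like `parse_quantity` before comparing them.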

@chainzero
Collaborator

Update in provider Feature Discovery:

  • Hardware and allocation discovery, and their exposure via the API, have been fully tested with the code merged into the provider services/inventory operator in the dev environment.

  • Because the local dev environment is not accessible to others, the daemon set, information gathering, and API endpoint exposure were mocked on a mainnet provider.

  • REST API with JSON data return can be accessed via:
    curl -ks https://provider.akashtesting.xyz/features | jq

  • The mainnet provider has a single host with GPU resources, revealed via discovery and the API endpoint as shown in the fragment below. The host has one NVIDIA T4 that is currently allocated to a test deployment.

  • Note: the protobuf definition currently has an empty quantity field and a populated quantitylocal field. The quantitylocal field will not be part of the eventual production spec; it was introduced temporarily to circumvent a Kubernetes API Machinery unexported-field issue. A more elegant solution will replace it prior to release.

      "gpus": {
        "quantity": null,
        "info": [
          {
            "vendor": "nvidia",
            "name": "t4",
            "modelid": "1eb8",
            "interface": "pci-e",
            "memory": "15gb"
          }
        ],
        "quantitylocal": {
          "allocatable": "1",
          "allocated": "1"
        }
      }
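Until the temporary field is removed, a client reading this endpoint may want to tolerate both shapes. The sketch below is an assumption-laden illustration rather than an official client: `fetch_features` and `gpu_quantity` are hypothetical names, TLS verification is disabled only to mirror the `-k` flag in the curl command above, and the fragment is copied from the output shown here.

```python
import json
import ssl
import urllib.request


def fetch_features(url):
    """GET the features document, skipping TLS verification like `curl -k`."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before setting CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return json.load(response)


def gpu_quantity(gpus):
    """Prefer "quantity"; fall back to the temporary "quantitylocal" field."""
    quantity = gpus.get("quantity") or gpus.get("quantitylocal") or {}
    return {
        "allocatable": int(quantity.get("allocatable", "0")),
        "allocated": int(quantity.get("allocated", "0")),
    }


# Fragment copied from the mainnet provider output above.
fragment = json.loads("""
{
  "quantity": null,
  "info": [{"vendor": "nvidia", "name": "t4", "modelid": "1eb8",
            "interface": "pci-e", "memory": "15gb"}],
  "quantitylocal": {"allocatable": "1", "allocated": "1"}
}
""")

print(gpu_quantity(fragment))
```

Against a live provider, the call would be something like `fetch_features("https://provider.akashtesting.xyz/features")`; the layout of the full response is assumed here to match the earlier example output.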

@brewsterdrinkwater
Collaborator

brewsterdrinkwater commented Nov 21, 2023

November 21st, 2023:

  • API implementation is done.
  • Working on error handling now.
  • Will start merging the provider feature discovery work.
  • Engineering to connect with the Cloudmos team.
  • Working on one Kubernetes issue.

@brewsterdrinkwater
Collaborator

November 28th, 2023:

This is code complete. Artur will code review when he returns. This will NOT require a node upgrade.

@brewsterdrinkwater
Collaborator

brewsterdrinkwater commented Dec 12, 2023

December 12th, 2023:

  • Still working on one issue related to Kubernetes, gRPC, and protobuf.
  • Scott has a workaround.
  • Going to try to find a resolution for this issue.
  • Will review the code and release following that.
  • The first deliverable will be an endpoint.

@brewsterdrinkwater
Collaborator

December 19th:

  • Blocker removed: the core team finally figured out why a k8s resource type was causing panics and
    is updating the codebase with the fix. We will release it as soon as review passes.
  • Artur will review the code.
  • Scott is testing the revision while waiting for code review.

@troian troian moved this from In Progress (prioritized) to In Test (or staging) in Core Product and Engineering Roadmap Jan 9, 2024
@troian troian self-assigned this Jan 9, 2024
@brewsterdrinkwater
Collaborator

January 9th, 2024:

  • Feature is completed.
  • Final Testing coming next.
  • Plan to release tomorrow, pending testing.

@brewsterdrinkwater
Collaborator

brewsterdrinkwater commented Jan 16, 2024

January 16th, 2024:

  • Continued testing.
  • Feature discovery is part of the inventory operator, so a great deal of end-to-end testing is being done.
  • Targeted to push out this week.
  • Clients will be able to use the inventory API when the tool is released.

@brewsterdrinkwater
Collaborator

January 23rd:

  • Had a few issues around gRPC; these have been solved.
  • Making a new release and new helm charts.
  • Testing will continue.

troian added 31 commits to akash-network/provider that referenced this issue between Jan 23 and Jan 25, 2024
* feat: feature discovery enablement

* refactor(operator): feature discovery

refs akash-network/support#141

Signed-off-by: Artur Troian <[email protected]>

---------

Signed-off-by: Artur Troian <[email protected]>
Co-authored-by: chainzero <[email protected]>
@brewsterdrinkwater
Collaborator

brewsterdrinkwater commented Jan 30, 2024

January 30th, 2024:

  • Tested in Sandbox multiple times, and things are working well right now.
  • Helm chart update: pushed this morning.
  • Next up is a helm chart release to a provider on mainnet, followed by another round of testing.
  • Going to write upgrade instructions for providers. The upgrade involves some small changes to the provider.
  • Inventory API is available right now.

@brewsterdrinkwater
Collaborator

February 6th, 2024:

  • Inventory operator is cleaned up.
  • Next step is to make sure helm charts stay in sync.
  • Will test again with clean clusters. Updating existing clusters causes problems currently.
  • End to end testing is being updated.
  • Max C. is working on a scraping tool.
  • A steady rollout on mainnet is coming soon.

@brewsterdrinkwater
Collaborator

February 13th, 2024:

  • Continued testing of refactored work.
  • Helm charts have been updated.
  • Will update the community when ready.
  • Spinning up a provider on Sandbox and testing. Will spin up a mainnet provider afterward.

Action items:

  • Instructions need to be written.
  • Testing on sandbox and mainnet.

@brewsterdrinkwater
Collaborator

February 20th, 2024:

  • Testing is happening on sandbox and mainnet.
  • Instructions are updated and will be tested again by a community provider.
  • Will roll out the provider upgrade in stages.
  • Cloudmos team has updated its API for scraping.

@anilmurty anilmurty moved this from In Test (or staging) to Released (in Prod) in Core Product and Engineering Roadmap Mar 5, 2024
Projects
Status: Released (in Prod)

4 participants