AMD Support #142
Oct 24 sync:
We have access to a cluster that includes an Nvidia L40 and an AMD 210 GPU. @andy108369 is testing provider setup with them. Current status: the L40 works out of the box (as expected); the AMD GPU does not. Per @troian, we filter on "Nvidia" GPUs in nodes and providers. Artur needs to remove this filter and set up a testnet for Andrey to test with. Removing the filter likely shouldn't need a network upgrade.
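To illustrate why the AMD card is dropped while the L40 passes, here is a minimal sketch of a vendor filter. It is not the actual Akash node or provider code; the `Gpu` type, the `filter_gpus` helper, and the vendor strings are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical GPU record; not an Akash type."""
    vendor: str
    model: str


def filter_gpus(gpus, allowed_vendors):
    """Keep only GPUs whose vendor is in the allow-list (case-insensitive)."""
    allowed = {v.lower() for v in allowed_vendors}
    return [g for g in gpus if g.vendor.lower() in allowed]


# The two cards from the test cluster described above.
cluster = [Gpu("Nvidia", "L40"), Gpu("AMD", "210")]

# A filter hard-coded to Nvidia silently drops the AMD card.
nvidia_only = filter_gpus(cluster, ["nvidia"])

# Widening the allow-list (or removing the filter) lets both through.
both = filter_gpus(cluster, ["nvidia", "amd"])

print([g.model for g in nvidia_only])
print([g.model for g in both])
```

In this sketch the fix is a one-line change to the allow-list, which is consistent with the expectation that no network upgrade should be needed, though the real change may differ.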
December 5th, 2023:
refs akash-network/support#142 Signed-off-by: Artur Troian <[email protected]>
December 12th, 2023
December 19th, 2023
Next Steps:
Updates:
Test run results
Next steps:
This is possible - requires
Verification:
January 16th, 2024:
Additional notes: We currently have a limitation (applies to both Nvidia and AMD) where we (K8s) cannot allow mixing of GPU models on the same node. It is fine to mix models across the provider as long as each node only has GPUs of the same model.
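A provider operator could check an inventory against this rule before onboarding. The sketch below assumes the inventory is available as `(node_name, gpu_model)` pairs; the `mixed_model_nodes` helper and the node names are hypothetical and not part of any Akash tooling.

```python
from collections import defaultdict


def mixed_model_nodes(inventory):
    """Return {node: [models]} for nodes that carry more than one GPU model.

    `inventory` is an iterable of (node_name, gpu_model) pairs, one per GPU.
    """
    models_by_node = defaultdict(set)
    for node, model in inventory:
        models_by_node[node].add(model)
    return {
        node: sorted(models)
        for node, models in models_by_node.items()
        if len(models) > 1
    }


# Hypothetical provider: mixing across nodes is fine, mixing within a node is not.
inventory = [
    ("node-a", "l40"), ("node-a", "l40"),      # OK: one model on the node
    ("node-b", "mi210"),                       # OK: different model, different node
    ("node-c", "a100"), ("node-c", "h100"),    # violates the limitation
]

print(mixed_model_nodes(inventory))
```

An empty result means every node is homogeneous, which is the condition described above.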
January 23rd:
Pushed the AMD GPU support doc, now available at https://docs.akash.network/other-resources/experimental/amd-gpu-support
Support for AMD GPUs on Akash Network. There may not be any significant work necessary, but the first step is to test with one or more AMD GPUs. This is very important because AMD is working on the MI 250 chipset, which is expected to be a serious contender to Nvidia's A100 and H100 chips. Here is a blog post from MosaicML benchmarking and comparing its performance with Nvidia's chips: https://www.mosaicml.com/blog/amd-mi250
It seems like the initial work is validating whether the Kubernetes device plugin for AMD can work for us (the way the Nvidia one has): https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment
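One way to validate the plugin is to schedule a pod that requests the `amd.com/gpu` extended resource, which is the resource name the AMD device plugin advertises. The sketch below only builds such a manifest as JSON (which `kubectl apply -f` accepts); the `gpu_test_pod` helper, the pod name, and the image reference are placeholders and should be replaced with a real ROCm-capable image.

```python
import json


def gpu_test_pod(name, image, gpu_count=1):
    """Build a minimal Pod manifest requesting AMD GPUs via the device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested under limits.
                    "resources": {"limits": {"amd.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Placeholder name and image; both are assumptions for illustration only.
manifest = gpu_test_pod("amd-gpu-smoke-test", "example.invalid/rocm-test:latest")
print(json.dumps(manifest, indent=2))
```

If the pod stays `Pending` with an insufficient `amd.com/gpu` message, the plugin is likely not advertising the GPU on that node.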
Is this something that a community person can help with?