[Standardization] GPU naming convention needs further refinements #366

anjastrunk · 2023-10-23T13:04:14Z

The GPU naming in the SCS Flavor Naming Standard needs further refinement. The following aspects are described insufficiently:

* [ ] GPU capabilities are added to the flavor name as an extension. See Complete Proposal for systematic flavor naming. However, the order of extensions is unclear. As there are currently three extensions, it would make sense and facilitate parsing/mapping if extensions had a dedicated order.
* [ ] GPU naming supports the suffix h, which can be set multiple times to indicate a high-performance GPU. However, "high-performance" is not explained in detail, nor is there a mapping from h, hh, hhh... to a measurable GPU property, as is done for CPU frequency. For the sake of comparability and interoperability, the standard for GPU naming SHOULD be very strict and clear here.
* [ ] There are no examples provided for flavor naming with GPU support, as is done for CPU or memory.
* [ ] The abbreviations SMs (Streaming Multiprocessors) and EUs (Execution Units) are used without glossary/explanation. #375
* [ ] According to SCS flavor naming, the GPU generation, such as Ampere or Hopper for Nvidia, can be defined by adding an appropriate suffix to the GPU definition. IMO, stating the generation is not sufficient, as there is a huge performance difference between A10 and A100, both GPUs of generation "Ampere". Hence, we need a further refinement here to point out GPU capabilities more precisely.
* [ ] The standard does not support defining the number of physical or virtual GPUs.

garloff · 2023-11-06T16:34:04Z

GPU capabilities are added to the flavor name as an extension. See Complete Proposal for systematic flavor naming. However, the order of extensions is unclear. As there are currently three extensions, it would make sense and facilitate parsing/mapping if extensions had a dedicated order.

The order is and always has been fixed.
A sentence to make this clear is added with PR #374.

garloff · 2023-11-06T16:39:59Z

[ ] GPU naming supports the suffix h, which can be set multiple times to indicate a high-performance GPU. However, "high-performance" is not explained in detail, nor is there a mapping from h, hh, hhh... to a measurable GPU property, as is done for CPU frequency. For the sake of comparability and interoperability, the standard for GPU naming SHOULD be very strict and clear here.

Agreed. The wording suggests using it for HBM memory, which I think is well-defined (and meaningful, as it does make a difference).
But we don't have frequency criteria listed, mainly because this would create a large table, as the notion of "high" frequency depends very much on the GPU vendor and generation.
So with the current wording, we allow a vendor to use the h to differentiate two flavors where one has a higher-frequency GPU than the other (but is otherwise the same). This is imperfect, as vendors will take different approaches to this unless we define it, so we may need to create this table ...

The other option is to narrow things down and say that h is HBM memory, period.

garloff · 2023-11-06T16:44:46Z

[ ] There are no examples provided for flavor naming with GPU support, as is done for CPU or memory.

True. We could easily add this.
Why not use SCS-16V-64-500s_GNa-14h as an example? (This flavor exists on one of our partner clouds.)
It denotes a PCI pass-through Nvidia Ampere GPU with 14 SMs and HBM memory. (The h could also have meant an especially high frequency; I happen to know that here it is HBM memory.)
Want to submit a PR? Want me to do it?
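
To make the example above concrete, here is a minimal Python sketch that splits the GPU extension of a flavor name into its parts. It assumes the extension has the shape `_[G/g]X[N][-M][h]`, with `G` for pass-through and `g` for a virtual GPU, vendor letters `N`/`A`/`I` for nVidia/AMD/Intel, and a single lowercase generation letter. The regular expression and the function name `parse_gpu` are illustrative, not part of the standard or its tooling.

```python
import re

# Assumed shape of the GPU extension: _[G/g]X[N][-M][h]
# (a simplification for illustration; the standard text is authoritative).
GPU_EXT = re.compile(
    r"_(?P<mode>[Gg])(?P<vendor>[NAI])(?P<gen>[a-z])?"
    r"(?:-(?P<units>\d+))?(?P<high>h*)(?=_|$)"
)


def parse_gpu(flavor: str):
    """Return the GPU properties encoded in a flavor name, or None."""
    m = GPU_EXT.search(flavor)
    if m is None:
        return None  # no GPU extension found
    return {
        "passthrough": m["mode"] == "G",  # G = pass-through, g = vGPU
        "vendor": m["vendor"],
        "generation": m["gen"],
        "units": int(m["units"]) if m["units"] else None,  # SMs/CUs/EUs
        "high": len(m["high"]),  # number of trailing h qualifiers
    }


print(parse_gpu("SCS-16V-64-500s_GNa-14h"))
```

For the flavor quoted above, this yields pass-through, vendor `N`, generation `a`, 14 units, and one `h` qualifier. What the `h` means (HBM or high frequency) cannot be recovered from the name, which is exactly the ambiguity under discussion.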

garloff · 2023-11-06T16:49:34Z

* [ ]  [Abbreviation SUs (Streaming Multiprocessors) and EUs (Execution Units)) are used without glossary/explanation. #375](https://github.com/SovereignCloudStack/standards/issues/375)

That can easily be addressed. I just added it to the PR #374 as it fit nicely.

Signed-off-by: Kurt Garloff <[email protected]>

garloff · 2023-11-06T16:54:09Z

Want to submit a PR? Want me to do it?

Added it also to PR #374.

garloff · 2023-11-06T16:59:59Z

* [ ]  According to SCS flavor naming, the GPU generation, such as Ampere or Hopper for Nvidia, can be defined by adding an appropriate suffix to the GPU definition. IMO, stating the generation is not sufficient, as there is a huge performance difference between A40 and A100, both GPUs of generation "Ampere". Hence, we need a further refinement here to point out GPU capabilities more precisely.

The number of SMs should give you an indication of how much performance to expect, possibly together with the h qualifier (HBM memory): _GNa-84 (A40) vs. _GNa-108h (A100).

garloff · 2023-11-06T17:13:51Z

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?

So this would need a real extension:
_[Ix][G/g]X[N][-M[h][-O[h]]]
I denotes the number of GPUs and -O the amount of memory (in GiB). An h after the number of SMs/CUs/EUs would denote high frequency; an h after O (memory) would denote memory with a bandwidth > 1 TiB/s.
This would be backwards compatible.
If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options.
Obviously, you may be able to come up with a better proposal.
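
As a rough illustration of the proposed extension `_[Ix][G/g]X[N][-M[h][-O[h]]]`, the following Python sketch parses such a name and checks whether a workload with a given VRAM requirement fits on one GPU. The flavor name `SCS-16V-64_2xGNa-108h-80h` is hypothetical, as are the helper names `parse_proposed` and `fits`; the vendor letters and the single-letter generation are simplifying assumptions, and the ~18 GiB figure is taken from the question above.

```python
import re

# Sketch of the PROPOSED grammar _[Ix][G/g]X[N][-M[h][-O[h]]].
# This is an illustration of the proposal, not an adopted standard.
PROPOSED = re.compile(
    r"_(?:(?P<count>\d+)x)?(?P<mode>[Gg])(?P<vendor>[NAI])(?P<gen>[a-z])?"
    r"(?:-(?P<units>\d+)(?P<unit_h>h)?"
    r"(?:-(?P<vram>\d+)(?P<mem_h>h)?)?)?(?=_|$)"
)


def parse_proposed(flavor: str):
    """Parse the proposed GPU extension; return None if there is none."""
    m = PROPOSED.search(flavor)
    if m is None:
        return None
    return {
        "count": int(m["count"]) if m["count"] else 1,  # I, default 1 GPU
        "units": int(m["units"]) if m["units"] else None,  # M
        "high_freq": m["unit_h"] is not None,  # h after M
        "vram_gib": int(m["vram"]) if m["vram"] else None,  # O
        "high_bw": m["mem_h"] is not None,  # h after O
    }


def fits(flavor: str, need_gib: float):
    """True/False if per-GPU VRAM is known, None if the name omits it."""
    gpu = parse_proposed(flavor)
    if gpu is None or gpu["vram_gib"] is None:
        return None
    return gpu["vram_gib"] >= need_gib


print(fits("SCS-16V-64_2xGNa-108h-80h", 18))  # hypothetical new-style name
print(fits("SCS-16V-64-500s_GNa-14h", 18))    # existing name, VRAM unknown
```

The existing name still parses under this sketch, which matches the backwards-compatibility claim for names of that form, but the VRAM question can only be answered for names that carry the `-O` part.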

anjastrunk · 2023-11-09T13:17:16Z

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?

So this would need a real extension: _[Ix][G/g]X[N][-M[h][-O[h]]] I denotes the number of GPUs and -O the amount of memory (in GiB). An h after the number of SMs/CUs/EUs would denote high frequency; an h after O (memory) would denote memory with a bandwidth > 1 TiB/s. This would be backwards compatible. If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options. Obviously, you may be able to come up with a better proposal.

As I do not feel competent enough to judge this approach, I will forward the improvement of the GPU definition in the SCS flavor standard to our GPU expert. This may take some time.

* Clarify SMs/CUs/EUs belonging to nVidia/AMD/Intel. Also add a line clarifying the order of extensions being fixed.
* Explain CUs, EUs, SMs. This addresses #375.
* Add GPU example as desired by #366.
* Correct example.
* Escape all underscores in examples. They get interpreted otherwise by markdown.
* Use vendor-neutral terminology for GPU processing units
* Typo and minor wording

Signed-off-by: Kurt Garloff <[email protected]>
Signed-off-by: Matthias Büchse <[email protected]>
Co-authored-by: Matthias Büchse <[email protected]>

garloff · 2023-12-14T09:52:24Z

Any feedback?

cah-patrickthiem · 2024-04-09T15:11:46Z

Just for the record.
I did some research on how hyperscalers name their GPU flavors, to maybe get some inspiration or find "common practices". However, none of the big players seems to have a clear naming scheme. My findings are presented below:

Microsoft Azure https://learn.microsoft.com/de-de/azure/virtual-machines/ncads-h100-v5

example: Standard_NC40ads_H100_v5. The name starts with a string, followed by 40, which implies 40 vCPU cores. "ads" is not explained anywhere; several flavors carry such cryptic, unexplained strings. H100 obviously refers to the Nvidia H100, and v5 seems to be "version 5" of that flavor.
The flavor name itself does not show that this flavor has 1 H100 GPU. That is fine by itself, but there is also the Standard_NC80adis_H100_v5 flavor, in which just one number changes from 40 to 80 (vCPU cores), yet it comes with 2 H100 GPUs.
In general, Azure seems to encode only the vCPU core count together with the GPU model name, plus some other details around them.

Google Cloud https://cloud.google.com/compute/docs/gpus?hl=de#a100-gpus

example: g2-standard-16. "g2" is the type of flavor; my guess is that it is just a naming convention with no real meaning. "standard", on the other hand, refers to the performance a user can expect; it is linked to low- to mid-range GPUs (in this example an Nvidia L4). The trailing number implies the vCPU count (here 16).
example2: a2-highgpu-2g. Here again the name starts with the type "a2", but it is followed by "highgpu", which translates to "high performance gpu". In this category, "2g" refers to 2 GPUs; with rising GPU count, the vCPU core count and RAM rise accordingly. a2 seems to refer to A100 GPUs, and it also seems that before A100 GPUs hit the market, those "a2" flavors came with V100 GPUs (the generation prior to the A100).
example3: a3-highgpu-8g. Here a3 implies that it is a flavor serving H100 GPUs.

AWS https://aws.amazon.com/de/ec2/instance-types/

In AWS, flavors are categorized by utility. Those categories (like general usage, data processing, accelerated computing, etc.) are divided again into subcategories (for "accelerated computing" e.g. P4, P5, G5, F1 or Trn1, where the letter(s) imply the "architecture" and the number the generation within that specific architecture).
- P5 would be Nvidia H100, P4 would be Nvidia A100
examples from the category "accelerated computing": p3.2xlarge, p3.16xlarge
- p3 refers to Nvidia V100 GPUs
- 2xlarge means 1 GPU
- 16xlarge means 8 GPUs
- with rising GPU count, the vCPU core count and RAM rise accordingly

garloff · 2024-10-13T15:49:38Z

Sidenote: The multi-GPU feature is not yet included in #780.

garloff · 2024-10-15T10:38:25Z

Discussion seems to have continued on #546

mbuechse · 2024-11-01T18:31:12Z

Can this be closed now thanks to #780?

mbuechse · 2024-11-01T18:31:42Z

I'm closing this. Please open a new one with whatever's remaining.

anjastrunk added the enhancement New feature or request label Oct 23, 2023

anjastrunk added this to Sovereign Cloud Stack Oct 23, 2023

github-project-automation bot moved this to Backlog in Sovereign Cloud Stack Oct 23, 2023

anjastrunk added the SCS-VP10 Related to tender lot SCS-VP10 label Oct 23, 2023

anjastrunk mentioned this issue Oct 23, 2023

Align IaaS Flavor description between SCS and Gaia-X SovereignCloudStack/gx-credential-generator#63

Closed

anjastrunk added the standards Issues / ADR / pull requests relevant for standardization & certification label Oct 26, 2023

garloff added a commit that referenced this issue Nov 6, 2023

Add GPU example as desired by #366.

6faa65d

Signed-off-by: Kurt Garloff <[email protected]>

anjastrunk self-assigned this Nov 9, 2023

anjastrunk assigned cah-patrickthiem and unassigned anjastrunk Mar 5, 2024

anjastrunk moved this from Backlog to Doing in Sovereign Cloud Stack Mar 25, 2024

cah-patrickthiem mentioned this issue Apr 3, 2024

GPU-flavor-naming-refinement #546

Closed

anjastrunk mentioned this issue Apr 15, 2024

[EPIC] IaaS standards #285

Open

59 tasks

garloff mentioned this issue Oct 13, 2024

Feat/add gpu vram #780

Merged

mbuechse closed this as completed Nov 1, 2024

github-project-automation bot moved this from Doing to Done in Sovereign Cloud Stack Nov 1, 2024
