[Standardization] GPU naming convention needs further refinements #366

Closed
5 of 6 tasks
anjastrunk opened this issue Oct 23, 2023 · 14 comments
Labels
enhancement New feature or request SCS-VP10 Related to tender lot SCS-VP10 standards Issues / ADR / pull requests relevant for standardization & certification

Comments

@anjastrunk
Contributor

anjastrunk commented Oct 23, 2023

GPU naming in the SCS Flavor Naming Standard needs further refinement. The following aspects are described insufficiently:

  • GPU capabilities are added to the flavor name as an extension. See Complete Proposal for systematic flavor naming. However, the order of extensions is unclear. As there are currently three extensions, it would make sense and facilitate parsing/mapping if extensions had a dedicated order.
  • GPU naming supports the suffix h, which can be repeated to indicate a high-performance GPU. However, "high-performance" is neither explained in detail, nor is there a mapping from h, hh, hhh... to a measurable GPU property, as is done for CPU frequency. In favor of comparability and interoperability, the standard for GPU naming SHOULD be very strict and clear here.
  • There are no examples provided for flavor naming with GPU support, as is done for CPU or memory.
  • The abbreviations SMs (Streaming Multiprocessors) and EUs (Execution Units) are used without glossary/explanation. #375
  • According to SCS flavor naming, the GPU generation, such as Ampere or Hopper for Nvidia, can be defined by adding an appropriate suffix to the GPU definition. IMO, specifying the generation is not sufficient, as there is a huge performance difference between A10 and A100, both GPUs of generation "Ampere". Hence, we need a further refinement here to describe GPU capabilities more precisely.
  • Standard does not support definition of number of physical or virtual GPUs
@anjastrunk anjastrunk added the enhancement New feature or request label Oct 23, 2023
@anjastrunk anjastrunk added the SCS-VP10 Related to tender lot SCS-VP10 label Oct 23, 2023
@anjastrunk anjastrunk added the standards Issues / ADR / pull requests relevant for standardization & certification label Oct 26, 2023
@garloff
Member

garloff commented Nov 6, 2023

  • GPU capabilities are added to the flavor name as an extension. See Complete Proposal for systematic flavor naming. However, the order of extensions is unclear. As there are currently three extensions, it would make sense and facilitate parsing/mapping if extensions had a dedicated order.

The order is and always has been fixed.
A sentence to make this clear is added with PR #374.

@garloff
Member

garloff commented Nov 6, 2023

  • GPU naming supports the suffix h, which can be repeated to indicate a high-performance GPU. However, "high-performance" is neither explained in detail, nor is there a mapping from h, hh, hhh... to a measurable GPU property, as is done for CPU frequency. In favor of comparability and interoperability, the standard for GPU naming SHOULD be very strict and clear here.

Agreed. The wording suggests using it for HBM memory, which I think is well-defined (and meaningful, as it does make a difference).
But we don't have frequency criteria listed, mainly because this would create a large table, as the notion of "high" frequency is very much dependent on the GPU vendor and generation.
So with the current wording, we allow a vendor to use the h to differentiate two different flavors where one has a higher frequency GPU than the other (but otherwise the same). This is imperfect, as vendors will have different approaches to this without us defining it, so we may need to create this table ...

The other option is to narrow things down and say that h is HBM memory, period.

@garloff
Member

garloff commented Nov 6, 2023

  • There are no examples provided for flavor naming with GPU support, as is done for CPU or memory.

True. We could easily add this.
Why not use SCS-16V-64-500s_GNa-14h as an example? (This flavor exists on one of our partner clouds.)
PCI pass-through Nvidia Ampere with 14 SMs and HBM memory. (The h could also have meant especially high frequency, but I happen to know it's HBM memory.)
Want to submit a PR? Want me to do it?

@garloff
Member

garloff commented Nov 6, 2023

  • The abbreviations SMs (Streaming Multiprocessors) and EUs (Execution Units) are used without glossary/explanation. #375 (https://github.com/SovereignCloudStack/standards/issues/375)

That can easily be addressed. I just added it to the PR #374 as it fit nicely.

garloff added a commit that referenced this issue Nov 6, 2023
@garloff
Member

garloff commented Nov 6, 2023

Want to submit a PR? Want me to do it?

Added it also to PR #374.

@garloff
Member

garloff commented Nov 6, 2023

  • According to SCS flavor naming, the GPU generation, such as Ampere or Hopper for Nvidia, can be defined by adding an appropriate suffix to the GPU definition. IMO, specifying the generation is not sufficient, as there is a huge performance difference between A40 and A100, both GPUs of generation "Ampere". Hence, we need a further refinement here to describe GPU capabilities more precisely.

The number of SMs should give you an indication of how much performance to expect. Together with maybe the h qualifier (HBM memory). _GNa-84 (A40) vs _GNa-108h (A100).

@garloff
Member

garloff commented Nov 6, 2023

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?
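A rough sketch of the arithmetic behind that question, counting the weights only. The figure of 5 bits per weight for "4bit+" quantization is an assumption chosen for illustration (quantization formats store scales and other metadata on top of the 4-bit values), and runtime overhead such as the KV cache is ignored.

```python
GIB = 2**30  # bytes per GiB

def weights_gib(params: float, bits_per_weight: float) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return params * bits_per_weight / 8 / GIB

# 30 billion parameters at exactly 4 bits per weight: roughly 14 GiB
plain_4bit = weights_gib(30e9, 4.0)

# Assumed effective 5 bits per weight (illustrative only): roughly 17.5 GiB
with_overhead = weights_gib(30e9, 5.0)

print(f"{plain_4bit:.1f} GiB, {with_overhead:.1f} GiB")
```

Either way, a flavor name that does not state the VRAM size cannot answer whether such a model fits.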

So this would need a real extension:
_[Ix][G/g]X[N][-M[h][-O[h]]]
I denotes the number of GPUs and -O the amount of memory (in GiB). h behind SMs/CUs/EUs would denote high frequency; h behind O (memory) would denote memory with bandwidth > 1 TiB/s.
This would be backwards compatible.
If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options.
Obviously, you may be able to come up with a better proposal.
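To make the proposed pattern concrete, here is a minimal parsing sketch in Python. It only illustrates the syntax `_[Ix][G/g]X[N][-M[h][-O[h]]]` from this comment and is not part of the standard. The function name `parse_proposed`, the field names, the reading of G as pass-through and g as virtual GPU, and the flavor name `SCS-16V-64-500s_2xGNa-108h-80h` are assumptions made up for the example.

```python
import re

# Hypothetical regex for the proposal: _[Ix][G/g]X[N][-M[h][-O[h]]]
PROPOSED_GPU = re.compile(
    r"_(?:(?P<count>\d+)x)?"                                # I: number of GPUs
    r"(?P<kind>[Gg])(?P<vendor>[A-Z])(?P<gen>[a-z0-9]*)"    # G/g, X, N
    r"(?:-(?P<units>\d+)(?P<high_freq>h?)"                  # -M[h]: SMs/CUs/EUs
    r"(?:-(?P<vram>\d+)(?P<fast_mem>h?))?)?$"               # -O[h]: VRAM in GiB
)

def parse_proposed(flavor: str) -> dict:
    """Decode the GPU extension at the end of a flavor name."""
    m = PROPOSED_GPU.search(flavor)
    if m is None:
        raise ValueError(f"no GPU extension found in {flavor!r}")
    return {
        "count": int(m["count"]) if m["count"] else 1,  # default: one GPU
        "passthrough": m["kind"] == "G",                # assumed: g = virtual GPU
        "vendor": m["vendor"],
        "generation": m["gen"] or None,
        "units": int(m["units"]) if m["units"] else None,
        "high_freq": m["high_freq"] == "h",
        "vram_gib": int(m["vram"]) if m["vram"] else None,
        "fast_memory": m["fast_mem"] == "h",
    }

old_style = parse_proposed("SCS-16V-64-500s_GNa-14h")         # existing name
new_style = parse_proposed("SCS-16V-64-500s_2xGNa-108h-80h")  # made-up name
```

The existing name still parses, which is the backwards compatibility claimed above. Note, though, that its h would now be read as high frequency rather than HBM memory, so the meaning of existing names could shift.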

@anjastrunk anjastrunk self-assigned this Nov 9, 2023
@anjastrunk
Contributor Author

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?

So this would need a real extension: _[Ix][G/g]X[N][-M[h][-O[h]]] I denotes the number of GPUs and -O the amount of memory (in GiB). h behind SMs/CUs/EUs would denote high frequency; h behind O (memory) would denote memory with bandwidth > 1 TiB/s. This would be backwards compatible. If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options. Obviously, you may be able to come up with a better proposal.

As I do not feel competent enough to judge this approach, I will forward the improvement of the GPU definition in the SCS flavor standard to our GPU expert. This may take some time.

garloff added a commit that referenced this issue Nov 10, 2023
* Clarify SMs/CUs/EUs belonging to nVidia/AMD/Intel.
  Also add a line clarifying the order of extensions being fixed.
* Explain CUs, EUs, SMs.
  This addresses #375.
* Add GPU example as desired by #366.
* Correct example.
* Escape all underscores in examples.
  They get interpreted otherwise by markdown.
* Use vendor-neutral terminology for GPU processing units
* Typo and minor wording

Signed-off-by: Kurt Garloff <[email protected]>
Signed-off-by: Matthias Büchse <[email protected]>
Co-authored-by: Matthias Büchse <[email protected]>
@garloff
Member

garloff commented Dec 14, 2023

Any feedback?

@cah-patrickthiem

cah-patrickthiem commented Apr 9, 2024

Just for the record.
I did some research on how hyperscalers name their GPU flavors, to maybe get some inspiration or "common practices". However, none of the big players seems to have a clear naming scheme. My findings follow:

Microsoft Azure https://learn.microsoft.com/de-de/azure/virtual-machines/ncads-h100-v5

  • example: Standard_NC40ads_H100_v5. It starts with a fixed string, followed by 40, which implies 40 vCPU cores. "ads" is not explained anywhere; several flavors carry such cryptic, unexplained strings. H100 obviously refers to the Nvidia H100, and v5 seems to be "version 5" of that flavor.
  • the flavor name itself does not show that this flavor has 1 H100 GPU. That is fine by itself, but there is also the Standard_NC80adis_H100_v5 flavor, where, apart from the suffix, only one number changes from 40 to 80 (vCPU cores), yet it comes with 2 H100 GPUs.
  • in general, Azure seems to encode only the vCPU core count together with the GPU model name, plus some other details around it.

Google Cloud https://cloud.google.com/compute/docs/gpus?hl=de#a100-gpus

  • example: g2-standard-16. "g2" is the type of flavor; my guess is that it is just a naming convention with no real meaning. "standard", on the other hand, refers to the performance a user can expect and is linked to low- to mid-range GPUs (in this example an Nvidia L4). The trailing number implies the vCPU count (here 16).
  • example 2: a2-highgpu-2g. Here again it starts with the type "a2", but followed by "highgpu", which translates to "high-performance GPU". In this category, however, "2g" refers to 2 GPUs; with rising GPU count, the vCPU core count and RAM rise accordingly. a2 seems to refer to A100 GPUs, and it also seems that before A100 GPUs hit the market those "a2" flavors came with V100 GPUs (the generation prior to A100).
  • example 3: a3-highgpu-8g. Here a3 implies that it is a flavor serving H100 GPUs.

AWS https://aws.amazon.com/de/ec2/instance-types/

  • in AWS, flavors are categorized by use case. Those categories (such as general purpose, data processing, accelerated computing, etc.) are divided again into subcategories (for "accelerated computing" e.g. P4, P5, G5, F1 or Trn1, where the letter(s) imply the "architecture" and the number the generation within that architecture)
    • P5 would be Nvidia H100, P4 would be Nvidia A100
  • examples from the category "accelerated computing": p3.2xlarge, p3.16xlarge
    • p3 refers to Nvidia V100 GPUs
    • 2xlarge means 1 GPU
    • 16xlarge means 8 GPUs
    • with rising GPU count, the vCPU core count and RAM rise accordingly
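As a small illustration of these findings, the sketch below splits the example names into the parts described above. The regular expressions are hypothetical and fitted only to the names quoted in this comment; they are not vendor-documented grammars, and the field names are invented for the example.

```python
import re

# Hypothetical patterns, derived only from the example names in this comment.
PATTERNS = {
    "azure": re.compile(
        r"^Standard_NC(?P<vcpus>\d+)(?P<suffix>[a-z]+)"
        r"_(?P<gpu_model>[A-Z]\d+)_v(?P<version>\d+)$"
    ),
    "gcp": re.compile(r"^(?P<series>[a-z]\d+)-(?P<tier>[a-z]+)-(?P<size>\d+g?)$"),
    "aws": re.compile(r"^(?P<family>[a-z]+)(?P<generation>\d+)\.(?P<size>\d*x?large)$"),
}

def split_name(provider: str, name: str) -> dict:
    """Split a flavor name into its visible parts (all values stay strings)."""
    m = PATTERNS[provider].match(name)
    if m is None:
        raise ValueError(f"{name!r} does not match the {provider} pattern")
    return m.groupdict()

for provider, name in [
    ("azure", "Standard_NC40ads_H100_v5"),
    ("azure", "Standard_NC80adis_H100_v5"),
    ("gcp", "g2-standard-16"),
    ("gcp", "a2-highgpu-2g"),
    ("aws", "p3.2xlarge"),
    ("aws", "p3.16xlarge"),
]:
    print(provider, name, split_name(provider, name))
```

Of these examples, only the Google Cloud "2g" form exposes a GPU count directly; for the Azure and AWS names the count has to be looked up in the provider documentation.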

@garloff
Member

garloff commented Oct 13, 2024

Sidenote: The multi-GPU feature is not yet included in #780.

@garloff
Member

garloff commented Oct 15, 2024

Discussion seems to have continued on #546

@mbuechse
Contributor

mbuechse commented Nov 1, 2024

Can this be closed now thanks to #780?

@mbuechse
Contributor

mbuechse commented Nov 1, 2024

I'm closing this. Please open a new one with whatever's remaining.

@mbuechse mbuechse closed this as completed Nov 1, 2024
@github-project-automation github-project-automation bot moved this from Doing to Done in Sovereign Cloud Stack Nov 1, 2024
4 participants