Enable labels for ClusterUUID and CliqueId #965

Open
wants to merge 1 commit into
base: main

Conversation

ArangoGutierrez
Collaborator

No description provided.

internal/resource/types.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/resource/nvml-device.go (resolved)
internal/resource/nvml-device.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
@ArangoGutierrez ArangoGutierrez marked this pull request as draft September 25, 2024 08:21
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review September 25, 2024 08:27
@@ -218,6 +224,45 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(devices []resource.Device) (Labeler, error) {
	var commonClusterUUID, commonCliqueID string
Member

nit: could we separate these lines? It makes future diffs easier.

Collaborator Author

The PR has now taken a new direction.

cmd/gpu-feature-discovery/main.go (outdated, resolved)
cmd/gpu-feature-discovery/main.go (outdated, resolved)
cmd/gpu-feature-discovery/main.go (resolved)
Comment on lines 189 to 192
{{- if .Values.imex.enabled }}
- name: imex-nodes-config
  mountPath: {{ .Values.imex.configFile | quote }}
{{- end }}
Contributor

I'm not sure if we need an enabled flag around any of these mounting directives. Instead we should wrap it with a {{- if typeIs "string" .Values.imexNodesConfigFile }} such that when this is nil (the default), we use the default value that is set for this file in the CLI.

Collaborator Author

Noted, will edit

Contributor

Actually, thinking about this again -- we always need to mount (at least) the folder for this, don't we? Otherwise the file won't be made available in the container (even though there is a default path set for it in the binary).

internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
internal/lm/nvml.go (outdated, resolved)
@@ -1,7 +1,7 @@
github.com/NVIDIA/go-gpuallocator v0.5.0 h1:166ICvPv2dU9oZ2J3kJ4y3XdbGCi6LhXgFZJtrqeu3A=
github.com/NVIDIA/go-gpuallocator v0.5.0/go.mod h1:zos5bTIN01hpQioOyu9oRKglrznImMQvm0bZllMmckw=
github.com/NVIDIA/go-nvlib v0.6.1 h1:0/5FvaKvDJoJeJ+LFlh+NDQMxMlVw9wOXrOVrGXttfE=
github.com/NVIDIA/go-nvlib v0.6.1/go.mod h1:9UrsLGx/q1OrENygXjOuM5Ey5KCtiZhbvBlbUIxtGWY=
github.com/NVIDIA/go-nvlib v0.6.2-0.20240928162840-41955a08425b h1:k5ptZB9RGUaR5RcK0R8Cfa4mtTHrSZZ73BFyD3c6KvM=
Collaborator Author

Change introduced at NVIDIA/go-nvlib#44

@@ -182,6 +186,10 @@ spec:
  mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
- name: host-sys
  mountPath: "/sys"
{{- if and (eq (typeOf .Values.imexNodesConfigFile) "string") (ne .Values.imexNodesConfigFile "") }}
Member

Question: Why not use typeIs?

Collaborator Author

changed back to typeIs


// ClusterUUID
byteSlice := gfInfo.ClusterUuid[:]
clusterUUID := string(byteSlice)
Member

Why not just:

Suggested change
- clusterUUID := string(byteSlice)
+ clusterUUID := string(gfInfo.ClusterUuid)

Collaborator Author

The resulting label is nvidia.com/gpu.clusteruuid:�Hg�/B��U���6~�
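
The garbled label above is what a direct byte-to-string conversion produces: the 16 raw bytes are not printable text, and Kubernetes label values are limited to 63 characters drawn from alphanumerics, '-', '_' and '.'. A minimal sketch of one way around this follows, assuming the fabric info exposes the cluster UUID as a [16]uint8. The helper name formatClusterUUID is hypothetical and not taken from the PR; it hex-encodes the bytes into the usual 8-4-4-4-12 layout.

```go
package main

import "fmt"

// formatClusterUUID renders 16 raw bytes as a hex string in the
// 8-4-4-4-12 layout. The name and signature are hypothetical; they are
// chosen for this sketch and are not the PR's actual code.
func formatClusterUUID(raw [16]uint8) string {
	b := raw[:]
	// %x on a byte slice prints two lowercase hex digits per byte.
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	// Example input: the bytes 0x00 through 0x0f.
	var raw [16]uint8
	for i := range raw {
		raw[i] = uint8(i)
	}
	fmt.Println(formatClusterUUID(raw)) // 00010203-0405-0607-0809-0a0b0c0d0e0f
}
```

The result is always 36 characters long and contains only hex digits and hyphens, so it fits the label value constraints.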

@@ -254,3 +353,20 @@ func getDeviceClasses(devices []resource.Device) ([]uint32, error) {
	}
	return classes, nil
}

// generateUUID generates a UUID-like string based on an MD5 hash of the seed and random bytes.
func generateUUID(seed string) string {
Member (@elezar, Oct 1, 2024)

Question: Should this really be random? Given the same input, should we not expect the UUID to be the same?

Does using https://pkg.go.dev/hash/[email protected] not serve our purposes here? (I'm fine with using md5 -- in which case we should just add the relevant //nolint annotations.)

Collaborator Author

I tried hash and md5; both return values that are too long to be accepted as a label value. I have added the //nolint to use the UUID generator that @klueska suggested.

Contributor

It's not random -- it just uses the rand package sourced with a specific seed to guarantee it always gives us back the same value.

Contributor

There's likely no reason we couldn't use the crypto package for the same purpose though -- I just used rand when I first wrote this because it was the first one I found that did what I needed.

sort.Strings(ips)

// Join the sorted IPs into a single string
sortedIPs := strings.Join(ips, "\n")
Member

It may be more readable to generate the config ID here instead of where we consume sortedIPs. We're not really interested in the IPs, but the ID associated with the unique set.

Collaborator Author

You mean not creating the variable sortedIPs, but passing strings.Join(ips, "\n") directly when we call generateUUID?

Contributor

Agreed. In fact, I would reorganize this whole function to get the clusterUUID and cliqueIDs first, then generate the IMEX domain ID and insert it directly in the labels without this intermediate variable.

labels := Labels{
	"nvidia.com/gpu.clusteruuid": commonClusterUUID,
	"nvidia.com/gpu.cliqueid": commonCliqueID,
	"nvidia.com/gpu.imex-domain": generateUUID(sortedIPs) + "-" + commonCliqueID,
Member

Does this make sense if generateUUID doesn't return the same value across restarts?

Collaborator Author

Given that it is the same seed, the resulting UUID should be the same.

Collaborator Author

From all my testing, I have always got the same label 0f884867-ba2f-4294-9155-b495ff367eea, so the UUID is tied to the IPs, even after a GFD restart.

Contributor

I wrote the generateUUID function this was borrowed from specifically to ensure that it always generates the same number given the same seed value.
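
The determinism discussed in this thread can be sketched as follows. This is an illustration only, not the PR's generateUUID: it assumes the seed string is reduced to an integer with FNV-1a (an assumed choice; the PR mentions MD5) and used to seed a private math/rand source, so the same seed yields the same 16 bytes. The name deterministicUUID is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// deterministicUUID derives a UUID-shaped string from a seed string.
// Hypothetical helper for illustration; not the PR's implementation.
func deterministicUUID(seed string) string {
	// Reduce the seed string to a 64-bit integer (assumed choice: FNV-1a).
	h := fnv.New64a()
	_, _ = h.Write([]byte(seed))

	// A private generator seeded with that value always produces the same
	// byte stream for the same seed. math/rand is not cryptographically
	// secure, which is why linters tend to flag it; that is acceptable
	// here because the value is an identifier, not a secret.
	r := rand.New(rand.NewSource(int64(h.Sum64()))) //nolint:gosec
	b := make([]byte, 16)
	_, _ = r.Read(b)

	// Set the version (4) and variant bits so the result is UUID-shaped.
	b[6] = (b[6] & 0x0f) | 0x40
	b[8] = (b[8] & 0x3f) | 0x80

	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	seed := "10.0.0.1\n10.0.0.2"
	// Both calls print the same value because the seed is identical.
	fmt.Println(deterministicUUID(seed))
	fmt.Println(deterministicUUID(seed))
}
```

Because the output depends only on the seed, a label built from it should survive a restart as long as the sorted list of IPs is unchanged.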

@@ -199,6 +205,9 @@ spec:
- name: host-sys
  hostPath:
    path: "/sys"
- name: imex-nodes-config
  hostPath:
    path: "/etc/nvidia-imex"
Contributor

This needs to be relative to hostDriverRoot

Member

This mount is going to fail if this doesn't exist on the host.

Contributor (@klueska, Oct 1, 2024)

Isn't there a way to make it just mount an empty directory if it doesn't exist?

Contributor

I believe the option is something like DirectoryOrCreate

Comment on lines +38 to +39
imex:
  nodesConfigFile: null
Contributor

I don't think this should be configurable from helm. Making it configurable from the config file should be enough.

Comment on lines 298 to 300
if err != nil {
	return nil, fmt.Errorf("error getting clique ID: %v", err)
}
Contributor

What is this random err check here?

Collaborator Author

Removed; it is a leftover from when we were calling getGpuFabric twice.

Comment on lines 241 to 245
if err != nil && os.IsNotExist(err) {
	return nil, nil
}
if err != nil {
	return nil, fmt.Errorf("failed to open imex config file: %v", err)
}
Member

Suggested change
- if err != nil && os.IsNotExist(err) {
- 	return nil, nil
- }
- if err != nil {
- 	return nil, fmt.Errorf("failed to open imex config file: %v", err)
- }
+ if os.IsNotExist(err) {
+ 	return nil, nil
+ } else if err != nil {
+ 	return nil, fmt.Errorf("failed to open imex config file: %v", err)
+ }

Contributor

I prefer the first one myself .... I have a hatred for else statements if they can be avoided.
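
Either style treats a missing config file as "nothing to label" rather than as a failure. A minimal sketch of that behaviour, written without else branches, is shown below. It makes several assumptions that are not confirmed by the PR: the file holds one node IP per line, blank lines and lines starting with '#' are skipped, and the helper names parseNodes and loadNodes are hypothetical. It also swaps os.IsNotExist for errors.Is(err, fs.ErrNotExist), which the Go documentation recommends for new code.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"sort"
	"strings"
)

// parseNodes keeps non-empty, non-comment lines, sorts them, and joins
// them with newlines. The file format is an assumption of this sketch.
func parseNodes(content string) string {
	var ips []string
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		ips = append(ips, line)
	}
	sort.Strings(ips)
	return strings.Join(ips, "\n")
}

// loadNodes returns ("", nil) when the file does not exist, so callers
// can skip labeling instead of failing. Any other error is wrapped.
func loadNodes(path string) (string, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", fmt.Errorf("failed to read imex config file: %w", err)
	}
	return parseNodes(string(data)), nil
}

func main() {
	fmt.Println(parseNodes("10.0.0.2\n\n# comment\n10.0.0.1\n"))

	// The path below is a placeholder that is expected not to exist.
	nodes, err := loadNodes("/nonexistent/nodes_config.cfg")
	fmt.Printf("%q %v\n", nodes, err)
}
```

Sorting before joining means the resulting string does not depend on the order in which nodes are listed in the file.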

Comment on lines 278 to 246
ok, err := d.IsFabricAttached()
if !ok {
	continue
}
if err != nil {
	return nil, fmt.Errorf("error checking imex capability: %v", err)
}
Member

rename the ok variable. Should we also not check err first?

Suggested change
- ok, err := d.IsFabricAttached()
- if !ok {
- 	continue
- }
- if err != nil {
- 	return nil, fmt.Errorf("error checking imex capability: %v", err)
- }
+ isFabricAttached, err := d.IsFabricAttached()
+ if err != nil {
+ 	return nil, fmt.Errorf("error checking imex capability: %v", err)
+ }
+ if !isFabricAttached {
+ 	continue
+ }
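
The ordering matters because a failed call usually returns false together with the error, so testing the boolean first would skip the device and silently drop the error. A minimal sketch of the err-first pattern follows. The fabricDevice interface and fakeDevice type are hypothetical stand-ins for the part of resource.Device used here, and countFabricAttached is an illustrative function, not code from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// fabricDevice is a hypothetical stand-in for the relevant part of
// resource.Device.
type fabricDevice interface {
	IsFabricAttached() (bool, error)
}

// fakeDevice is a test double used only in this sketch.
type fakeDevice struct {
	attached bool
	err      error
}

func (d fakeDevice) IsFabricAttached() (bool, error) { return d.attached, d.err }

// errProbe simulates a failing query.
var errProbe = errors.New("probe failed")

// countFabricAttached checks the error before the boolean, so a failed
// query is reported instead of being treated as "not attached".
func countFabricAttached(devices []fabricDevice) (int, error) {
	count := 0
	for _, device := range devices {
		isFabricAttached, err := device.IsFabricAttached()
		if err != nil {
			return 0, fmt.Errorf("error checking imex capability: %w", err)
		}
		if !isFabricAttached {
			continue
		}
		count++
	}
	return count, nil
}

func main() {
	healthy := []fabricDevice{fakeDevice{attached: true}, fakeDevice{}}
	fmt.Println(countFabricAttached(healthy))

	failing := []fabricDevice{fakeDevice{err: errProbe}}
	fmt.Println(countFabricAttached(failing))
}
```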

@@ -218,6 +231,86 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
Member

Let's move the IMEX specifics to a separate file.

Collaborator Author

With a new NewImexLabeler?

@@ -218,6 +231,86 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
Member

Suggested change
- func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
+ func newImexDomainLabeler(configFile string, devices []resource.Device) (Labeler, error) {


var commonClusterUUID string
var commonCliqueID string
for _, d := range device {
Member

Suggested change
- for _, d := range device {
+ for _, device := range devices {

byteSlice[10:16])

// CliqueID
cliqueId := fmt.Sprintf("%d", uint64(gfInfo.CliqueId))
Member

Since CliqueId is already a uint32 should we not just be able to Sprintf it directly?

Suggested change
- cliqueId := fmt.Sprintf("%d", uint64(gfInfo.CliqueId))
+ cliqueId := fmt.Sprintf("%d", gfInfo.CliqueId)

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez
Collaborator Author

The unit test has been fixed and golangci-lint is happy now; the PR is ready for review.

3 participants