Enable labels for ClusterUUID and CliqueId #965
internal/lm/nvml.go
Outdated
@@ -218,6 +224,45 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(devices []resource.Device) (Labeler, error) {
	var commonClusterUUID, commonCliqueID string
nit: could we separate these lines? It makes future diffs easier.
The PR has now taken a new direction.
{{- if .Values.imex.enabled }}
- name: imex-nodes-config
  mountPath: {{ .Values.imex.configFile | quote }}
{{- end }}
I'm not sure we need an `enabled` flag around any of these mounting directives. Instead, we should wrap it with `{{- if typeIs "string" .Values.imexNodesConfigFile }}` so that when this is `nil` (the default), we use the default value that is set for this file in the CLI.
Noted, will edit
Actually, thinking about this again -- we always need to mount (at least) the folder for this, don't we? Otherwise the file won't be made available in the container (even though there is a default path set for it in the binary).
@@ -1,7 +1,7 @@
github.com/NVIDIA/go-gpuallocator v0.5.0 h1:166ICvPv2dU9oZ2J3kJ4y3XdbGCi6LhXgFZJtrqeu3A=
github.com/NVIDIA/go-gpuallocator v0.5.0/go.mod h1:zos5bTIN01hpQioOyu9oRKglrznImMQvm0bZllMmckw=
github.com/NVIDIA/go-nvlib v0.6.1 h1:0/5FvaKvDJoJeJ+LFlh+NDQMxMlVw9wOXrOVrGXttfE=
github.com/NVIDIA/go-nvlib v0.6.1/go.mod h1:9UrsLGx/q1OrENygXjOuM5Ey5KCtiZhbvBlbUIxtGWY=
github.com/NVIDIA/go-nvlib v0.6.2-0.20240928162840-41955a08425b h1:k5ptZB9RGUaR5RcK0R8Cfa4mtTHrSZZ73BFyD3c6KvM=
Change introduced at NVIDIA/go-nvlib#44
@@ -182,6 +186,10 @@ spec:
  mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
- name: host-sys
  mountPath: "/sys"
{{- if and (eq (typeOf .Values.imexNodesConfigFile) "string") (ne .Values.imexNodesConfigFile "") }}
Question: Why not use `typeIs`?
changed back to typeIs
internal/resource/nvml-device.go
Outdated
// ClusterUUID
byteSlice := gfInfo.ClusterUuid[:]
clusterUUID := string(byteSlice)
Why not just:
- clusterUUID := string(byteSlice)
+ clusterUUID := string(gfInfo.ClusterUuid)
The resulting label is nvidia.com/gpu.clusteruuid:�Hg�/B��U���6~�
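The garbled label above is what one would expect when raw bytes are cast directly to a string: the bytes of the cluster UUID are not printable text. A minimal sketch of the alternative, hex-formatting the bytes in the usual 8-4-4-4-12 grouping, might look like the following. The function name `formatClusterUUID` and the sample bytes are hypothetical, and the `[16]uint8` shape of the field is an assumption based on the slicing in the diff.

```go
package main

import "fmt"

// formatClusterUUID renders 16 raw bytes as lowercase hex in the usual
// 8-4-4-4-12 grouping. The name is hypothetical, not taken from the PR.
func formatClusterUUID(raw [16]uint8) string {
	b := raw[:]
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	// Made-up sample bytes, for illustration only.
	raw := [16]uint8{0xde, 0xad, 0xbe, 0xef, 0x00, 0x01, 0x02, 0x03,
		0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}
	fmt.Println(formatClusterUUID(raw)) // deadbeef-0001-0203-0405-060708090a0b
}
```

The result contains only hex digits and dashes, which are valid characters for a Kubernetes label value.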
@@ -254,3 +353,20 @@ func getDeviceClasses(devices []resource.Device) ([]uint32, error) {
	}
	return classes, nil
}

// generateUUID generates a UUID-like string based on an MD5 hash of the seed and random bytes.
func generateUUID(seed string) string {
Question: Should this really be random? Given the same input, should we not expect the UUID to be the same?
Does using https://pkg.go.dev/hash/[email protected] not serve our purposes here? (I'm fine with using `md5` -- in which case we should just add the relevant `//nolint` annotations.)
I tried hash and md5; both return values that are too long to be accepted as a label value. I have added the `//nolint` annotation and used the UUID generator that @klueska suggested.
It's not random -- it just uses the `rand` package, seeded with a specific value, to guarantee it always gives us back the same result.
There's likely no reason we couldn't use the crypto package for the same purpose though -- I just used `rand` when I first wrote this because it was the first one I found that did what I needed.
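The determinism described in this thread can be illustrated with a small sketch: hash the seed string to an integer, use it to seed a private `math/rand` source, and format 16 generated bytes as a UUID-like string. This is not the function from the PR; the name `seededUUID` and the choice of FNV-1a for deriving the seed are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// seededUUID derives a UUID-like string from seed. The same seed always
// yields the same string, because the pseudo-random source is private and
// seeded from a hash of the input. Hypothetical stand-in, not the PR's code.
func seededUUID(seed string) string {
	h := fnv.New64a()
	h.Write([]byte(seed)) // hash.Hash.Write never returns an error
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	b := make([]byte, 16)
	r.Read(b) // (*rand.Rand).Read always fills b and returns a nil error
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	a := seededUUID("10.0.0.1\n10.0.0.2")
	b := seededUUID("10.0.0.1\n10.0.0.2")
	fmt.Println(a == b, len(a)) // true 36
}
```

At 36 characters, the result stays within the 63-character limit for Kubernetes label values.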
internal/lm/nvml.go
Outdated
sort.Strings(ips)

// Join the sorted IPs into a single string
sortedIPs := strings.Join(ips, "\n")
It may be more readable to generate the config ID here instead of where we consume `sortedIPs`. We're not really interested in the IPs, but in the ID associated with the unique set.
You mean not creating the variable `sortedIPs`, but directly passing `strings.Join(ips, "\n")` when we call `generateUUID`?
Agreed. In fact, I would reorganize this whole function to get the clusterUUID and cliqueIDs first, then generate the IMEX domain ID, and then insert it directly into the labels without this intermediate variable.
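As a sketch of the idea in this thread, the helper below returns an ID for a set of IPs directly, so callers never see the joined string. Sorting first makes the ID independent of the order in which the IPs were read. The name `domainIDFromIPs` is hypothetical, and the FNV-1a hash is only a stand-in for whichever ID generator is actually used.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// domainIDFromIPs returns an identifier for the set of IPs that does not
// depend on input order. Hypothetical helper; FNV-1a is a stand-in hash.
func domainIDFromIPs(ips []string) string {
	sorted := append([]string(nil), ips...) // copy so the caller's slice is not reordered
	sort.Strings(sorted)

	h := fnv.New64a()
	h.Write([]byte(strings.Join(sorted, "\n")))
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	a := domainIDFromIPs([]string{"10.0.0.2", "10.0.0.1"})
	b := domainIDFromIPs([]string{"10.0.0.1", "10.0.0.2"})
	fmt.Println(a == b) // true: same set, different order
}
```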
internal/lm/nvml.go
Outdated
labels := Labels{
	"nvidia.com/gpu.clusteruuid": commonClusterUUID,
	"nvidia.com/gpu.cliqueid":    commonCliqueID,
	"nvidia.com/gpu.imex-domain": generateUUID(sortedIPs) + "-" + commonCliqueID,
Does this make sense if `generateUUID` doesn't return the same value across restarts?
Given that it is the same seed, the resulting UUID should be the same.
In all my testing, I have always got the same label, `0f884867-ba2f-4294-9155-b495ff367eea`, so the UUID is tied to the IPs, even after a GFD restart.
I wrote the generateUUID function this was borrowed from specifically to ensure that it always generates the same number given the same seed value.
@@ -199,6 +205,9 @@ spec:
- name: host-sys
  hostPath:
    path: "/sys"
- name: imex-nodes-config
  hostPath:
    path: "/etc/nvidia-imex"
This needs to be relative to hostDriverRoot
This mount is going to fail if this doesn't exist on the host.
Isn't there a way to make it just mount an empty directory if it doesn't exist?
I believe the option is something like DirectoryOrCreate
imex:
  nodesConfigFile: null
I don't think this should be configurable from helm. Making it configurable from the config file should be enough.
internal/lm/nvml.go
Outdated
if err != nil {
	return nil, fmt.Errorf("error getting clique ID: %v", err)
}
What is this random err check here?
Removed; it was a leftover from when we were calling getGpuFabric twice.
internal/lm/nvml.go
Outdated
if err != nil && os.IsNotExist(err) {
	return nil, nil
}
if err != nil {
	return nil, fmt.Errorf("failed to open imex config file: %v", err)
}
- if err != nil && os.IsNotExist(err) {
- 	return nil, nil
- }
- if err != nil {
- 	return nil, fmt.Errorf("failed to open imex config file: %v", err)
- }
+ if os.IsNotExist(err) {
+ 	return nil, nil
+ } else if err != nil {
+ 	return nil, fmt.Errorf("failed to open imex config file: %v", err)
+ }
I prefer the first one myself .... I have a hatred for else statements if they can be avoided.
internal/lm/nvml.go
Outdated
ok, err := d.IsFabricAttached()
if !ok {
	continue
}
if err != nil {
	return nil, fmt.Errorf("error checking imex capability: %v", err)
}
Rename the `ok` variable. Should we not also check `err` first?
- ok, err := d.IsFabricAttached()
- if !ok {
- 	continue
- }
- if err != nil {
- 	return nil, fmt.Errorf("error checking imex capability: %v", err)
- }
+ isFabricAttached, err := d.IsFabricAttached()
+ if err != nil {
+ 	return nil, fmt.Errorf("error checking imex capability: %v", err)
+ }
+ if !isFabricAttached {
+ 	continue
+ }
internal/lm/nvml.go
Outdated
@@ -218,6 +231,86 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
Let's move the IMEX specifics to a separate file.
With a new `NewImexLabeler`?
internal/lm/nvml.go
Outdated
@@ -218,6 +231,86 @@ func newGPUModeLabeler(devices []resource.Device) (Labeler, error) {
	return labels, nil
}

func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
- func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
+ func newImexDomainLabeler(configFile string, devices []resource.Device) (Labeler, error) {
internal/lm/nvml.go
Outdated
var commonClusterUUID string
var commonCliqueID string
for _, d := range device {
- for _, d := range device {
+ for _, device := range devices {
internal/resource/nvml-device.go
Outdated
	byteSlice[10:16])

// CliqueID
cliqueId := fmt.Sprintf("%d", uint64(gfInfo.CliqueId))
Since `CliqueId` is already a `uint32`, should we not be able to `Sprintf` it directly?
- cliqueId := fmt.Sprintf("%d", uint64(gfInfo.CliqueId))
+ cliqueId := fmt.Sprintf("%d", gfInfo.CliqueId)
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
The unit test has been fixed and golangci-lint is happy now. The PR is ready for review.