Changes from main backport to the branch #95

Merged on Nov 25, 2024
43 commits
569d071
Update formatting
iglunchadze Oct 28, 2024
17f82f7
Update wording
iglunchadze Oct 28, 2024
0e47bb9
Remove paragraphs from end
iglunchadze Oct 28, 2024
08f669c
fix public ip allocation for gpu nodes
cyril-k Nov 6, 2024
857cb83
Adding csi-driver-mounted-fs solution to support k8s-inference + upda…
idanb1 Nov 13, 2024
bd44e6c
MSP-3313: Add NSYS profiling in GPT3 mlperf implementation
rdjjke Nov 14, 2024
4249255
Merge pull request #83 from nebius/gpt3-impl-nsys-profiling/0
rdjjke Nov 17, 2024
197a5f5
Add configs for H200 nodes to GPT3 impl
rdjjke Nov 17, 2024
d1edccd
Merge pull request #88 from nebius/gpt3-impl-h200-configs/0
rdjjke Nov 17, 2024
e4383a9
Merge pull request #69 from nebius/fix/public-ips-k8s-nodes
malibora Nov 20, 2024
255915d
Merge branch 'main' into update/k8s-training-readme
malibora Nov 20, 2024
21399cd
Merge pull request #63 from iglunchadze/update/k8s-training-readme
malibora Nov 20, 2024
f83971b
nccl_use_infiniband true and nccl_benchmark_min_threshold 45
asteny Nov 20, 2024
0731d9b
Merge branch 'main' into fix/csi-driver-mounted-fs-to-k8s-inference
idanb1 Nov 20, 2024
d4a95b6
Merge pull request #82 from nebius/fix/csi-driver-mounted-fs-to-k8s-i…
idanb1 Nov 20, 2024
212e487
add nccl_use_infiniband to example
asteny Nov 20, 2024
61931bd
Merge pull request #90 from nebius/nccl_use_infiniband
asteny Nov 20, 2024
0ca10bf
bump soperator 1.15.3
asteny Nov 20, 2024
090ec6f
Merge pull request #91 from nebius/bump_1_15_3
asteny Nov 20, 2024
8584e20
Merge pull request #93 from nebius/release/soperator
asteny Nov 20, 2024
8791f52
Platform and preset moved to variables across the library;
elijah-k-nebius Nov 21, 2024
04ab264
Added "region" variable to control platform defaults (k8s-inference);
elijah-k-nebius Nov 21, 2024
5eed389
Added "region" variable to control platform defaults (k8s-inference (…
elijah-k-nebius Nov 21, 2024
1984342
Added "region" variable to control platform defaults (k8s-inference (…
elijah-k-nebius Nov 21, 2024
3c8423a
Added "region" variable to control platform defaults (k8s-training);
elijah-k-nebius Nov 21, 2024
ab7b2de
Added "region" variable to control platform defaults (GlusterFS module);
elijah-k-nebius Nov 21, 2024
5216c7a
Added "region" variable to control platform defaults (NFS Server);
elijah-k-nebius Nov 21, 2024
e704aad
Tf fmt
d3vil-st Nov 21, 2024
3f3ce80
Tests fixed;
elijah-k-nebius Nov 21, 2024
d5512e2
Tf fmt;
elijah-k-nebius Nov 21, 2024
2991b9e
Tests fixed (2);
elijah-k-nebius Nov 21, 2024
1948cca
Added "region" variable to control platform defaults (WireGuard);
elijah-k-nebius Nov 21, 2024
157adb9
Added "region" variable to control platform defaults (Slurm);
elijah-k-nebius Nov 21, 2024
4a55aaf
terraform.tfvars files refactored;
elijah-k-nebius Nov 21, 2024
f4e08df
TF fmt
d3vil-st Nov 21, 2024
5a18cda
Added region variables for tests
d3vil-st Nov 21, 2024
26de355
Clean region variable for tests
d3vil-st Nov 21, 2024
1f26685
Clean region variable for tests
d3vil-st Nov 21, 2024
8d281b8
Added "region" variable to control platform defaults (GlusterFS (2));
elijah-k-nebius Nov 21, 2024
b7f9b41
Presets fixed;
elijah-k-nebius Nov 21, 2024
398051a
Added "region" variable to control platform defaults (WireGuard (2));
elijah-k-nebius Nov 21, 2024
3a0ab6b
Tests fixed (3);
elijah-k-nebius Nov 21, 2024
88aab79
Merge pull request #94 from nebius/feature/platform-and-preset-moved-…
elijah-k-nebius Nov 25, 2024
1 change: 1 addition & 0 deletions .github/workflows/terraform.yml
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ jobs:

env:
TF_VAR_subnet_id: vpcsubnet-e00dgdntmhgkeej1z3
TF_VAR_region: eu-north1
TF_VAR_loki_access_key_id: ${{ secrets.SA_ACCESS_KEY_ID }}
TF_VAR_loki_secret_key: ${{ secrets.SA_SECRET_KEY }}

Expand Down
18 changes: 15 additions & 3 deletions k8s-inference/README.md
Expand Up @@ -75,6 +75,7 @@ There are additional configurable variables in `variables.tf`.
# Cloud environment and network
parent_id = "" # The project-id in this context
subnet_id = "" # Use the command "nebius vpc v1alpha1 network list" to see the subnet id
region = "" # The project region.
ssh_user_name = "" # Username you want to use to connect to the nodes
ssh_public_key = {
key = "put your public ssh key here" OR
Expand Down Expand Up @@ -266,13 +267,13 @@ apiVersion: v1
metadata:
name: external-storage-persistent-volume
spec:
storageClassName: hostpath
storageClassName: csi-mounted-fs-path-sc
capacity:
storage: "<SIZE>"
accessModes:
- ReadWriteMany
hostPath:
path: "<HOST-PATH>" # "/mnt/filestore/<sub-directory>" or "/mnt/glusterfs/<sub-directory>"
path: "<HOST-PATH>" # "/mnt/data/<sub-directory>" or "/mnt/glusterfs/<sub-directory>"

---

Expand All @@ -281,10 +282,21 @@ apiVersion: v1
metadata:
name: external-storage-persistent-volumeclaim
spec:
storageClassName: hostpath
storageClassName: csi-mounted-fs-path-sc
accessModes:
- ReadWriteMany
resources:
requests:
storage: "<SIZE>"
```

## CSI limitations:
- The FS must be mounted on all NodeGroups, because attaching a PV to a pod running on a node without the FS will fail.
- A single PV can fill the entire shared FS.
- The FS size is not updated automatically if a PV grows beyond its spec size.
- For now, the FS size can't be updated through the API, only through NEBOPS.
- `volumeMode: Block` is not supported.

## Good to know:
- PVs in ReadWriteMany mode will work.
- MSP has started testing this solution to enable early integration with mk8s.
2 changes: 2 additions & 0 deletions k8s-inference/gluster-fs.tf
Expand Up @@ -7,4 +7,6 @@ module "glusterfs" {
disk_count_per_vm = var.glusterfs_disk_count_per_vm
disk_size = var.glusterfs_disk_size
ssh_public_key = local.ssh_public_key
platform = local.cpu_nodes_platform
preset = local.cpu_nodes_preset
}
7 changes: 6 additions & 1 deletion k8s-inference/helm.tf
Expand Up @@ -30,7 +30,7 @@ module "o11y" {
enabled = var.enable_dcgm,
node_groups = {
node_group_name = {
gpus = tonumber(split("gpu-", var.gpu_nodes_preset)[0])
gpus = tonumber(split("gpu-", local.gpu_nodes_preset)[0])
instance_group_id = nebius_mk8s_v1_node_group.gpu.id
}
}
Expand All @@ -39,3 +39,8 @@ module "o11y" {
}
test_mode = var.test_mode
}

module "csi-mounted-fs-path" {
source = "../modules/csi-mounted-fs-path"
count = var.enable_filestore ? 1 : 0
}
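The `gpus` expression in this file derives the GPU count from the preset name. A minimal sketch of how that parse behaves, assuming presets always start with `<N>gpu-`; the local and output names below are hypothetical and not part of this PR:

```hcl
locals {
  # Hypothetical sample value; the 8-GPU preset name comes from the
  # comment that this PR removes from terraform.tfvars.
  example_gpu_preset = "8gpu-128vcpu-1600gb"

  # split("gpu-", "8gpu-128vcpu-1600gb") returns ["8", "128vcpu-1600gb"],
  # so element 0 is the GPU count as a string and tonumber converts it.
  example_gpu_count = tonumber(split("gpu-", local.example_gpu_preset)[0])
}

output "example_gpu_count" {
  value = local.example_gpu_count
}
```

A preset that does not contain `gpu-` would leave the whole string in element 0, and `tonumber` would then fail, so the expression relies on the GPU preset naming convention.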
22 changes: 22 additions & 0 deletions k8s-inference/locals.tf
Expand Up @@ -2,6 +2,28 @@ locals {
release-suffix = random_string.random.result
ssh_public_key = var.ssh_public_key.key != null ? var.ssh_public_key.key : (
fileexists(var.ssh_public_key.path) ? file(var.ssh_public_key.path) : null)

regions_default = {
eu-west1 = {
cpu_nodes_platform = "cpu-d3"
cpu_nodes_preset = "16vcpu-64gb"
gpu_nodes_platform = "gpu-h200-sxm"
gpu_nodes_preset = "1gpu-16vcpu-200gb"
}
eu-north1 = {
cpu_nodes_platform = "cpu-e2"
cpu_nodes_preset = "16vcpu-64gb"
gpu_nodes_platform = "gpu-h100-sxm"
gpu_nodes_preset = "1gpu-16vcpu-200gb"
}
}

current_region_defaults = local.regions_default[var.region]

cpu_nodes_preset = coalesce(var.cpu_nodes_preset, local.current_region_defaults.cpu_nodes_preset)
cpu_nodes_platform = coalesce(var.cpu_nodes_platform, local.current_region_defaults.cpu_nodes_platform)
gpu_nodes_platform = coalesce(var.gpu_nodes_platform, local.current_region_defaults.gpu_nodes_platform)
gpu_nodes_preset = coalesce(var.gpu_nodes_preset, local.current_region_defaults.gpu_nodes_preset)
}

resource "random_string" "random" {
Expand Down
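The fallback logic in `locals.tf` can be read as "use the explicit variable if set, otherwise the regional default". A trimmed, self-contained sketch of that pattern, assuming only the two regions shown in the diff and keeping a single key per region for brevity:

```hcl
variable "region" {
  type = string
}

variable "gpu_nodes_platform" {
  type    = string
  default = null # null means "use the region default"
}

locals {
  # Trimmed copy of the map from the diff, one key per region.
  regions_default = {
    eu-west1  = { gpu_nodes_platform = "gpu-h200-sxm" }
    eu-north1 = { gpu_nodes_platform = "gpu-h100-sxm" }
  }

  # coalesce returns its first argument that is neither null nor an empty string.
  gpu_nodes_platform = coalesce(
    var.gpu_nodes_platform,
    local.regions_default[var.region].gpu_nodes_platform
  )
}

output "gpu_nodes_platform" {
  value = local.gpu_nodes_platform
}
```

With `region = "eu-west1"` and no override, the local resolves to `gpu-h200-sxm`; setting `gpu_nodes_platform` explicitly takes precedence. A region that is not a key in `regions_default` makes the index lookup fail, so the map has to be extended before a new region can be used.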
12 changes: 6 additions & 6 deletions k8s-inference/main.tf
Expand Up @@ -31,8 +31,8 @@ resource "nebius_mk8s_v1_node_group" "cpu-only" {
}
]
resources = {
platform = var.cpu_nodes_platform
preset = var.cpu_nodes_preset
platform = local.cpu_nodes_platform
preset = local.cpu_nodes_preset
}
filesystems = var.enable_filestore ? [
{
Expand Down Expand Up @@ -68,13 +68,13 @@ resource "nebius_mk8s_v1_node_group" "gpu" {
}
network_interfaces = [
{
subnet_id = var.subnet_id
public_ip = var.gpu_nodes_assign_public_ip ? {} : null
subnet_id = var.subnet_id
public_ip_address = var.gpu_nodes_assign_public_ip ? {} : null
}
]
resources = {
platform = var.gpu_nodes_platform
preset = var.gpu_nodes_preset
platform = local.gpu_nodes_platform
preset = local.gpu_nodes_preset
}
filesystems = var.enable_filestore ? [
{
Expand Down
19 changes: 11 additions & 8 deletions k8s-inference/terraform.tfvars
@@ -1,17 +1,20 @@
# Cloud environment and network
# parent_id = "" # The project-id in this context
# subnet_id = "" # Use the command "nebius vpc v1alpha1 network list" to see the subnet id
# ssh_user_name = "" # Username you want to use to connect to the nodes
# parent_id = "" # The project-id in this context
# subnet_id = "" # Use the command "nebius vpc v1alpha1 network list" to see the subnet id
# region = "" # Project region
# ssh_user_name = "" # Username you want to use to connect to the nodes
# ssh_public_key = {
# key = "put your public ssh key here" OR
# path = "put path to ssh key here"
# }

# K8s modes
cpu_nodes_count = 1 # Number of CPU nodes
cpu_nodes_preset = "16vcpu-64gb" # The CPU node preset
gpu_nodes_count = 1 # Number of GPU nodes
gpu_nodes_preset = "1gpu-16vcpu-200gb" # The GPU node preset. Set to "8gpu-128vcpu-1600gb", to deploy nodes with 8 GPUs.
# K8s nodes
cpu_nodes_count = 1 # Number of CPU nodes
gpu_nodes_count = 1 # Number of GPU nodes
# cpu_nodes_platform = # CPU nodes platform
# cpu_nodes_preset = # CPU nodes preset
# gpu_nodes_platform = # GPU nodes platform
# gpu_nodes_preset = # GPU nodes preset

# Observability
enable_grafana = true # Enable or disable Grafana deployment with true or false
Expand Down
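Since the platform and preset lines are now commented out, a filled-in `terraform.tfvars` only needs `region` plus whichever overrides are wanted. A hypothetical example, assuming the `eu-north1` defaults from `locals.tf` and the 8-GPU preset named in the removed comment:

```hcl
# Hypothetical values; parent_id, subnet_id and the SSH settings are left out on purpose.
region = "eu-north1" # selects the eu-north1 defaults from locals.tf

cpu_nodes_count = 1
gpu_nodes_count = 1

# The platform stays unset, so the regional default ("gpu-h100-sxm") applies.
# Only the preset is overridden, to deploy nodes with 8 GPUs.
gpu_nodes_preset = "8gpu-128vcpu-1600gb"
```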
16 changes: 11 additions & 5 deletions k8s-inference/variables.tf
@@ -1,4 +1,4 @@
# K8s cluster
# Global
variable "parent_id" {
description = "Project ID."
type = string
Expand All @@ -9,6 +9,12 @@ variable "subnet_id" {
type = string
}

variable "region" {
description = "The current region."
type = string
}

# K8s cluster
variable "k8s_version" {
description = "Kubernetes version to be used in the cluster."
type = string
Expand Down Expand Up @@ -114,13 +120,13 @@ variable "cpu_nodes_count" {
variable "cpu_nodes_platform" {
description = "Platform for nodes in the CPU-only node group."
type = string
default = "cpu-e2"
default = null
}

variable "cpu_nodes_preset" {
description = "CPU and RAM configuration for nodes in the CPU-only node group."
type = string
default = "16vcpu-64gb"
default = null
}

variable "cpu_disk_type" {
Expand All @@ -145,13 +151,13 @@ variable "gpu_nodes_count" {
variable "gpu_nodes_platform" {
description = "Platform for nodes in the GPU node group."
type = string
default = "gpu-h100-sxm"
default = null
}

variable "gpu_nodes_preset" {
description = "Configuration for GPU amount, CPU, and RAM for nodes in the GPU node group."
type = string
default = "1gpu-16vcpu-200gb"
default = null
}

variable "gpu_disk_type" {
Expand Down
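Because `region` is used as a key into `regions_default`, an unsupported value only surfaces when the lookup is evaluated. One possible way to fail earlier with a clearer message is a `validation` block on the variable. This is a sketch and not part of this PR; the allowed list simply mirrors the two regions defined in `locals.tf`:

```hcl
variable "region" {
  description = "The current region."
  type        = string

  validation {
    # Keep this list in sync with the keys of local.regions_default.
    condition     = contains(["eu-west1", "eu-north1"], var.region)
    error_message = "Unsupported region. Expected one of: eu-west1, eu-north1."
  }
}
```

The trade-off is that the region list then lives in two places, so adding a region means updating both the map and the validation.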