
Add Support for AWS Launch Template Configuration #2668

Merged: 19 commits into develop from 2603-aws-node-launch-template on Sep 19, 2024

Conversation

@viniciusdc (Contributor) commented Aug 30, 2024

Reference Issues or PRs

This is an extended version of the fantastic work suggested and developed by @joneszc in his original PR #2621.

The significant difference from his PR is that the launch-template and AMI-type constructs move into their own pydantic schema (which also lets users customize the template further). This also reduces the conditional logic in the Terraform HCL.

What does this implement/fix?

  • Adds a new custom field, launch_template, to the aws provider config schema:

    • Allows admins to specify a pre_bootstrap_command to run during EC2 instance creation
    • Allows use of custom AMIs
  • Adds a new input variable, ami_type, to handle GPU vs. standard AMI selection. This previously happened only in the Terraform code, but since this type must be set to CUSTOM when using a user-made AMI, that logic now lives in Python. (This variable is not exposed to the user; a schema sketch follows this list.)

Note: Both of these configurations depend on a new way of spinning up node_groups, using launch_templates instead of the usual handling by the AWS provider resources and scaling group. This only affects resources that have the launch_template field populated, but it does cause the node_group, and all instances within it, to be recreated.

  • tests/tests_unit/test_cli_validate.py was also updated to better reflect the original exception in case of schema errors during assertion.
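
As a rough illustration of that schema change, here is a minimal pydantic sketch, assuming the field names discussed in this PR (ami_id, pre_bootstrap_command) and the standard EKS AMI type values; the merged code may differ in naming and structure:

    # Hypothetical sketch of the new launch_template schema (not the merged code)
    from enum import Enum
    from typing import Optional

    from pydantic import BaseModel

    class AWSAmiType(str, Enum):
        AL2_x86_64 = "AL2_x86_64"
        AL2_x86_64_GPU = "AL2_x86_64_GPU"
        CUSTOM = "CUSTOM"  # required whenever a user-supplied AMI is used

    class AWSNodeLaunchTemplate(BaseModel):
        # Optional custom AMI; when set, the node group's ami_type must be CUSTOM
        ami_id: Optional[str] = None
        # Shell script to run on the EC2 instance before the EKS bootstrap
        pre_bootstrap_command: Optional[str] = None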

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

How to test this PR?

Any other comments?

@viniciusdc viniciusdc marked this pull request as draft August 30, 2024 15:06
@tylergraff (Contributor) commented

@viniciusdc How can we help get this PR prioritized for review and merge? We need this capability in order to make progress on our air-gapped deployment investigation.

viniciusdc and others added 2 commits September 6, 2024 14:22
@viniciusdc viniciusdc marked this pull request as ready for review September 6, 2024 17:24
@viniciusdc (Contributor, Author) commented

@joneszc can you try out these changes and check whether the generated MIME user_data file works as expected? I will do a deployment this evening as well.

@viniciusdc (Contributor, Author) commented

How to test

These are valid configs that should work:

    test:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 1
      gpu: false
      single_subnet: false
      permissions_boundary:
      launch_template:
        node_prebootstrap_command: |
          #!/bin/bash
          echo "Hello, Nebari!"
    test2:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 1
      gpu: false
      single_subnet: false
      permissions_boundary:
      ami_type: CUSTOM
      launch_template:
        ami_id: ami-0c3f3b5f2f3f3f3f3
        node_prebootstrap_command: |
          #!/bin/bash
          echo "Hello, Nebari!"

launch_template can also be defined at the provider level, in which case it affects all node_groups; note that for ami_id to be honored, ami_type needs to be set to CUSTOM.
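
For context, the pre_bootstrap_command above ends up inside a MIME multipart user_data document attached to the launch template (it appears base64-encoded in the Terraform plan output later in this thread). A rough Python illustration of that structure, purely for clarity; the PR itself assembles it from a Terraform template (user_data.tftpl), not Python:

    import base64
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    pre_bootstrap = '#!/bin/bash\necho "Hello, Nebari!"\n'

    # EKS launch templates expect user_data as a MIME multipart document
    doc = MIMEMultipart(boundary="==MYBOUNDARY==")
    doc.attach(MIMEText(pre_bootstrap, _subtype="x-shellscript"))

    # EC2 stores user_data base64-encoded, as seen in the plan output below
    user_data_b64 = base64.b64encode(doc.as_bytes()).decode()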

@joneszc (Contributor) commented Sep 6, 2024

      ami_type: CUSTOM
      launch_template:
        ami_id: ami-0c3f3b5f2f3f3f3f3
        node_prebootstrap_command: |
          #!/bin/bash
          echo "Hello, Nebari!"

@viniciusdc

Initial test resulted in the following error:

[screenshot: error output]

@joneszc (Contributor) commented Sep 6, 2024

@joneszc can you try out these changes and check whether the generated MIME user_data file works as expected? I will do a deployment this evening as well.

Hello @viniciusdc
My question with this ami_type check is whether a CUSTOM gpu-enabled AMI will be possible if a manually entered "CUSTOM" value is constantly overridden by the base AL2_x86_64_GPU ami_type.
I personally think the original Nebari approach, inferring the ami_type from the gpu true/false condition, would be best, combined with an additional switch to auto-trigger a value of CUSTOM whenever the user specifies an ami_id.

Also, the options to set ami_id and pre_bootstrap_command should be independent, not mutually inclusive. Will this validator prevent the user from running a custom AMI unless they also include a pre_bootstrap_command? We have use cases for a custom AMI without commands, a custom AMI with commands, and commands without a custom AMI.

@viniciusdc viniciusdc added this to the 2024.9.1 Release milestone Sep 9, 2024
@viniciusdc (Contributor, Author) commented

@joneszc @tylergraff I am testing this right now:

My question with this ami_type check is whether a CUSTOM gpu-enabled AMI will be possible if a manually entered "CUSTOM" value is constantly overridden by the base AL2_x86_64_GPU ami_type.
I personally think the original Nebari approach, inferring the ami_type from the gpu true/false condition, would be best, combined with an additional switch to auto-trigger a value of CUSTOM whenever the user specifies an ami_id.

You are right; that's what I had in mind when I first included that validator, mainly to guard against the user misplacing something in the YAML. But I agree with you that the current code does more harm to extensibility than good. I have a fix for that already and will commit shortly.

Also, the options to set ami_id and pre_bootstrap_command should be independent, not mutually inclusive. Will this validator prevent the user from running a custom AMI unless they also include a pre_bootstrap_command? We have use cases for a custom AMI without commands, a custom AMI with commands, and commands without a custom AMI.

I agree, that was an oversight on my part; I will address it shortly. The only thing I am curious about is this last part here:

and commands without a custom AMI.

I am not sure that will be possible when launching an instance via a launch_template whose ami_id field is left unset. The spawning request itself will prompt for one, but I am not entirely sure yet whether it will be able to auto-assign one or whether it will fail.
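
A minimal sketch of the relaxed validation being discussed, using hypothetical names (the merged implementation may differ): ami_id and pre_bootstrap_command stay independent, and ami_type is only forced to CUSTOM when an ami_id is actually provided:

    from typing import Optional

    from pydantic import BaseModel, model_validator

    class AWSNodeGroup(BaseModel):
        instance: str
        gpu: bool = False
        ami_type: Optional[str] = None
        # e.g. {"ami_id": ..., "pre_bootstrap_command": ...}; neither key required
        launch_template: Optional[dict] = None

        @model_validator(mode="after")
        def _infer_ami_type(self):
            ami_id = (self.launch_template or {}).get("ami_id")
            if ami_id:
                # A user-supplied AMI requires the CUSTOM ami_type on EKS
                self.ami_type = "CUSTOM"
            elif self.ami_type is None:
                self.ami_type = "AL2_x86_64_GPU" if self.gpu else "AL2_x86_64"
            return self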

@viniciusdc (Contributor, Author) commented

While testing, I found a minor bug with some validations in an internal _nebari method. Currently addressing it:

/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:239 in deploy

    236     def deploy(
    237         self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
    238     ):
  ❱ 239         self.check_immutable_fields()
    240
    241         with super().deploy(stage_outputs, disable_prompt):
    242             env_mapping = {}

/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:275 in check_immutable_fields

    272             bottom_level_schema = self.config
    273             if len(keys) > 1:
    274                 print(keys)
  ❱ 275                 bottom_level_schema = functools.reduce(
    276                     lambda m, k: getattr(m, k), keys[:-1], self.config
    277                 )
    278             extra_field_schema = schema.ExtraFieldSchema(

/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:276 in <lambda>

    273             if len(keys) > 1:
    274                 print(keys)
    275                 bottom_level_schema = functools.reduce(
  ❱ 276                     lambda m, k: getattr(m, k), keys[:-1], self.config
    277                 )
    278             extra_field_schema = schema.ExtraFieldSchema(
    279                 **bottom_level_schema.model_fields[keys[-1]].json_schema_extra or {}

AttributeError: 'dict' object has no attribute 'test'
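
The AttributeError happens because functools.reduce() walks the nested config keys with getattr(), which breaks once an intermediate level is a plain dict (here, the node_groups mapping) rather than a pydantic model. A hedged sketch of one possible fix, not necessarily the patch that landed:

    import functools

    def get_attr_or_item(obj, key):
        # Support both pydantic-style models (attributes) and plain dicts (items)
        return obj[key] if isinstance(obj, dict) else getattr(obj, key)

    class Cfg:  # stand-in for the pydantic config model
        node_groups = {"test": {"instance": "t2.micro"}}

    keys = ["node_groups", "test"]
    bottom = functools.reduce(get_attr_or_item, keys[:-1], Cfg())
    print(bottom[keys[-1]])  # {'instance': 't2.micro'}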

@viniciusdc (Contributor, Author) commented Sep 13, 2024

Hi @tylergraff @joneszc, while reviewing and testing these proposed changes (using launch_templates), I found the following very important information in the AWS docs:

  • Default managed node groups on AWS can't have their associated launch_templates switched to a custom one directly.
    • It's worth noting, though, that while this is cited in various sections of the AWS docs, none of them provide many details, so I will be testing this myself to be sure.
  • When a launch template is provided, tags specified in the node group config apply to the EKS Nodegroup resource only and are not propagated to EC2 instances.

A node group can enter a DEGRADED status with a message similar to the following error:

"The Amazon EC2 Launch Template : lt-xxxxxxxxxxxxxxxxx has a new version associated with your Autoscaling group, which is not managed by Amazon EKS. Expected Launch Template version: x".

This error occurs when the Amazon EC2 launch template version for your managed node group doesn't match the version that Amazon EKS creates. Existing node groups that don't use a custom launch template can't be directly updated. To resolve this error, create a launch template and version with your preferred settings. Then, use the launch template to create the node group. If the new node group is launched from your custom template, then create new versions of the template. You can use this template without placing the node group in a DEGRADED status.

and also this one:

Managed node groups are always deployed with a launch template to be used with the Amazon EC2 Auto Scaling group. When you don't provide a launch template, the Amazon EKS API creates one automatically with default values in your account. However, we don't recommend that you modify auto-generated launch templates. Furthermore, existing node groups that don't use a custom launch template can't be updated directly. Instead, you must create a new node group with a custom launch template to do so.

This is not a big problem, as we now have ways to control this so that it never happens, but I wanted to flag it: in our docs we will be presenting this launch_template parameter as an immutable one. This means that only new deployments or new node_groups will be able to benefit from it.

Also, when using launch_templates, upgrades to Kubernetes versions and AMIs are no longer performed automatically in the UI, so you will need to run a Nebari deployment to upgrade the available Kubernetes versions.

@viniciusdc (Contributor, Author) commented

Looking great!

[terraform]:   # module.kubernetes.aws_eks_node_group.main[3] must be replaced
[terraform]: -/+ resource "aws_eks_node_group" "main" {
[terraform]:       ~ ami_type               = "AL2_x86_64" -> (known after apply)
[terraform]:       ~ arn                    = "arn:aws:eks:us-east-1:******:nodegroup/vini-template-dev/test/----" -> (known after apply)
[terraform]:       ~ capacity_type          = "ON_DEMAND" -> (known after apply)
[terraform]:       ~ disk_size              = 0 -> 50 # forces replacement
[terraform]:       ~ id                     = "vini-template-dev:test" -> (known after apply)
[terraform]:       ~ instance_types         = [ # forces replacement
[terraform]:           + "t2.micro",
[terraform]:         ]
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       ~ release_version        = "1.29.6-20240910" -> (known after apply)
[terraform]:       ~ resources              = [
[terraform]:           - {
[terraform]:               - autoscaling_groups              = [
[terraform]:                   - {
[terraform]:                       - name = "eks-test-----"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - remote_access_security_group_id = ""
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       ~ status                 = "ACTIVE" -> (known after apply)
[terraform]:         tags                   = {
[terraform]:             "Environment"                                             = "dev"
[terraform]:             "Owner"                                                   = "terraform"
[terraform]:             "Project"                                                 = "vini-template"
[terraform]:             "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test"
[terraform]:             "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       ~ version                = "1.29" -> (known after apply)
[terraform]:         # (6 unchanged attributes hidden)
[terraform]: 
[terraform]:       - launch_template {
[terraform]:           - id      = "lt-02c4517df4f8e2bbc" -> null
[terraform]:           - name    = "test" -> null
[terraform]:           - version = "1" -> null
[terraform]:         }
[terraform]: 
[terraform]:       - update_config {
[terraform]:           - max_unavailable            = 1 -> null
[terraform]:           - max_unavailable_percentage = 0 -> null
[terraform]:         }
[terraform]: 
[terraform]:         # (1 unchanged block hidden)
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_eks_node_group.main[4] will be created
[terraform]:   + resource "aws_eks_node_group" "main" {
[terraform]:       + ami_type               = (known after apply)
[terraform]:       + arn                    = (known after apply)
[terraform]:       + capacity_type          = (known after apply)
[terraform]:       + cluster_name           = "vini-template-dev"
[terraform]:       + disk_size              = 50
[terraform]:       + id                     = (known after apply)
[terraform]:       + instance_types         = [
[terraform]:           + "t2.micro",
[terraform]:         ]
[terraform]:       + labels                 = {
[terraform]:           + "dedicated" = "test1"
[terraform]:         }
[terraform]:       + node_group_name        = "test1"
[terraform]:       + node_group_name_prefix = (known after apply)
[terraform]:       + node_role_arn          = "arn:aws:iam::******:role/vini-template-dev-eks-node-group-role"
[terraform]:       + release_version        = (known after apply)
[terraform]:       + resources              = (known after apply)
[terraform]:       + status                 = (known after apply)
[terraform]:       + subnet_ids             = [
[terraform]:           + "subnet-******",
[terraform]:           + "subnet-******",
[terraform]:         ]
[terraform]:       + tags                   = {
[terraform]:           + "Environment"                                             = "dev"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "vini-template"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test1"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + tags_all               = {
[terraform]:           + "Environment"                                             = "dev"
[terraform]:           + "Owner"                                                   = "terraform"
[terraform]:           + "Project"                                                 = "vini-template"
[terraform]:           + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test1"
[terraform]:           + "propagate_at_launch"                                     = "true"
[terraform]:         }
[terraform]:       + version                = (known after apply)
[terraform]: 
[terraform]:       + scaling_config {
[terraform]:           + desired_size = 0
[terraform]:           + max_size     = 1
[terraform]:           + min_size     = 0
[terraform]:         }
[terraform]:     }
[terraform]: 
[terraform]:   # module.kubernetes.aws_launch_template.main["test"] will be destroyed
[terraform]:   # (because key ["test"] is not in for_each map)
[terraform]:   - resource "aws_launch_template" "main" {
[terraform]:       - arn                     = "arn:aws:ec2:us-east-******:launch-template/lt-02c4517df4f8e2bbc" -> null
[terraform]:       - default_version         = 1 -> null
[terraform]:       - disable_api_stop        = false -> null
[terraform]:       - disable_api_termination = false -> null
[terraform]:       - id                      = "lt-02c4517df4f8e2bbc" -> null
[terraform]:       - instance_type           = "t2.micro" -> null
[terraform]:       - latest_version          = 1 -> null
[terraform]:       - name                    = "test" -> null
[terraform]:       - security_group_names    = [] -> null
[terraform]:       - tags                    = {} -> null
[terraform]:       - tags_all                = {} -> null
[terraform]:       - user_data               = "TUlNRS1W***FsbCAteSBodG9wCgoKCgotLT09TVlCT1VOREFSWT09LS0K" -> null
[terraform]:       - vpc_security_group_ids  = [
[terraform]:           - "sg-******",
[terraform]:         ] -> null
[terraform]: 
[terraform]:       - block_device_mappings {
[terraform]:           - device_name = "/dev/xvda" -> null
[terraform]: 
[terraform]:           - ebs {
[terraform]:               - iops        = 0 -> null
[terraform]:               - throughput  = 0 -> null
[terraform]:               - volume_size = 50 -> null
[terraform]:               - volume_type = "gp2" -> null
[terraform]:             }
[terraform]:         }
[terraform]: 
[terraform]:       - metadata_options {
[terraform]:           - http_endpoint               = "enabled" -> null
[terraform]:           - http_put_response_hop_limit = 0 -> null
[terraform]:           - http_tokens                 = "required" -> null
[terraform]:           - instance_metadata_tags      = "enabled" -> null
[terraform]:         }
[terraform]:     }
[terraform]: 

@viniciusdc (Contributor, Author) commented

Running with the example below:

    test:
      instance: t2.micro
      min_nodes: 1
      max_nodes: 1
      gpu: false
      single_subnet: false
      launch_template:
        pre_bootstrap_command: "#!/bin/bash\n# This script is executed before the node is bootstrapped\n# You can use this script to install additional packages or configure the node\n# For example, to install the `htop` package, you can run:\n# sudo apt-get update\n# sudo apt-get install -y htop"
      permissions_boundary:

ami_id is not required; the Terraform resource handles that accordingly when applying the node_group. I haven't tested a custom ami_id yet (though I could attest it switches to CUSTOM), as I don't have a custom image.

@Adam-D-Lewis mentioned this pull request Sep 17, 2024
@joneszc (Contributor) commented Sep 17, 2024

ami_id is not required; the Terraform resource handles that accordingly when applying the node_group. I haven't tested a custom ami_id yet (though I could attest it switches to CUSTOM), as I don't have a custom image.

Hello @viniciusdc
I just tested the custom ami_id option and the custom-ami nodes failed to join the cluster...

[screenshot: nodes failing to join the cluster]

It appears you changed the format of the environment variable statements set in the bootstrap.sh command.
I had specifically formatted those variables as ${<variable>} to avoid the following error:

Cloud-init v. 19.3-46.amzn2.0.2 running 'modules:final' at Tue, 17 Sep 2024 19:34:47 +0000. Up 18.47 seconds.
+ CLUSTER_NAME='{{ cluster_name }}'
+ B64_CLUSTER_CA='{{ cluster_cert_authority }}'
+ API_SERVER_URL='{{ cluster_endpoint }}'
+ /etc/eks/bootstrap.sh '{{' cluster_name '}}' --b64-cluster-ca '{{' cluster_cert_authority '}}' --apiserver-endpoint '{{' cluster_endpoint '}}'
2024-09-17T19:34:56+0000 [eks-bootstrap] INFO: starting...
2024-09-17T19:34:56+0000 [eks-bootstrap] INFO: --b64-cluster-ca='{{'
2024-09-17T19:34:56+0000 [eks-bootstrap] INFO: --apiserver-endpoint='{{'
2024-09-17T19:35:04+0000 [eks-bootstrap] INFO: Using kubelet version 1.29.6
2024-09-17T19:35:04+0000 [eks-bootstrap] INFO: Using containerd as the container runtime
Created symlink from /etc/systemd/system/multi-user.target.wants/sys-fs-bpf.mount to /etc/systemd/system/sys-fs-bpf.mount.
'/etc/eks/configure-clocksource.service' -> '/etc/systemd/system/configure-clocksource.service'
Created symlink from /etc/systemd/system/multi-user.target.wants/configure-clocksource.service to /etc/systemd/system/configure-clocksource.service.
2024-09-17T19:35:04+0000 [eks-bootstrap] INFO: Using IP family: ipv4
base64: invalid input
Exited with error on line 415
Sep 17 19:35:04 cloud-init[3100]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]

As you can see above, the env variables don't render on the node when enclosed in {{' '}}. (Terraform's templatefile() only interpolates ${var} placeholders; the {{ var }} forms are passed through to the script literally, which is why they show up verbatim in the log.)

Additionally, when invoking a custom ami_id without any pre_bootstrap_command, an empty user_data section is populated. However, this empty user_data part does not seem to prevent nodes from joining the cluster:

[screenshot: launch template user_data]

Finally, the following conditional, which checks whether ami_type is set to "CUSTOM", does not appear to be functioning:

[screenshot: Terraform conditional]

The bootstrap.sh command fails to render and the node fails to join the cluster.

I was able to continue testing by instead using the following conditional with --disable-render:
include_bootstrap_cmd = each.value.launch_template.ami_id != null ? true : false

@viniciusdc (Contributor, Author) commented

Hi @joneszc, thank you so much for the testing. Indeed, while reworking that template I ended up incorrectly assigning the values there; thanks for catching that. I fixed it this afternoon while testing the CUSTOM logic issue you reported above. I had also noticed the blank user_data sections and fixed that. Will finish testing and commit the new changes.

@viniciusdc (Contributor, Author) commented

I am addressing an issue with ami_id and launch_template.ami_profile_name that might be related to the faulty logic you saw with CUSTOM. After you continued testing with the --disable-render flag, were you able to get your instance running with the custom code?

@joneszc (Contributor) commented Sep 18, 2024

I am addressing an issue with ami_id and launch_template.ami_profile_name that might be related to the faulty logic you saw with CUSTOM. After you continued testing with the --disable-render flag, were you able to get your instance running with the custom code?

Hello @viniciusdc
Yes! Testing both pre_bootstrap_command and setting a custom ami_id succeeded after using --disable-render, adjusting the include_bootstrap_cmd conditional as mentioned, and fixing the user_data.tftpl environment-variable format:

CLUSTER_NAME=${cluster_name}
B64_CLUSTER_CA=${cluster_cert_authority}
API_SERVER_URL=${cluster_endpoint}

/etc/eks/bootstrap.sh $CLUSTER_NAME --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL

@viniciusdc (Contributor, Author) commented Sep 18, 2024

Testing is now complete. I ensured this was backward compatible with an existing deployment, and the new nodes also succeeded in joining the cluster, as shown below in a running pod/container on one such node:

Attested that pre_bootstrap_command works:

[screenshot: 2024-09-17 23:52]

    test1:
      instance: t2.micro
      min_nodes: 0
      max_nodes: 1
      gpu: false
      single_subnet: false
      permissions_boundary:
      launch_template:
        pre_bootstrap_command: |
          #!/bin/bash
          mkdir test-userscript
          touch /test-userscript/userscript.txt
          echo "Created by bash shell script" >> /test-userscript/userscript.txt

Deployment with custom AMI (testing cluster):

[screenshot: 2024-09-18 01:48]

    test2:
      instance: t2.micro
      min_nodes: 0
      max_nodes: 1
      gpu: false
      single_subnet: false
      permissions_boundary:
      launch_template:
        ami_id: ami-0bca0e3b9fe48307a

@viniciusdc (Contributor, Author) commented

For more details, here's the docs PR: nebari-dev/nebari-docs#525

@viniciusdc (Contributor, Author) commented Sep 18, 2024

It's worth noting that the AWS provider seems to have an issue updating the scaling_config.desired_size field. So if you are using a custom AMI and want to scale up from the Nebari config, right now you will need to recreate the group to make that work, or use the console UI to scale (advised). This only affects deployments whose node_group has the launch_template and uses a custom ami_id.

I will address this inconsistency in a follow-up PR at another time.

@dcmcand dcmcand merged commit 5f5e53c into develop Sep 19, 2024
28 checks passed
@dcmcand dcmcand deleted the 2603-aws-node-launch-template branch September 19, 2024 14:51
joneszc referenced this pull request in nebari-dev/nebari-docs Oct 28, 2024