RFD - Add Nebari Configuration Options for Air-Gapped or Secure Deployments #52

Open
joneszc opened this issue Aug 15, 2024 · 7 comments

joneszc commented Aug 15, 2024

Status: Draft 🚧 / Open for comments 💬
Author(s): @joneszc
Date Created: 15-08-2024
Date Last updated: 15-08-2024
Decision deadline: N/A

Title

Addition of Nebari Configuration Options for Deploying Nebari in Air-Gapped and/or Secure Environments

Summary

Nebari already offers a number of configuration options for deploying into AWS air-gapped networks, private subnets, or other secure environments.

For example:

  • Pointing to existing subnets and an existing security group for overriding the default Nebari provisioning of aws_vpc, aws_subnet, aws_internet_gateway, aws_security_group:
amazon_web_services:
  existing_subnet_ids:
    - subnet-xzxzxzxzxzxzxzx
    - subnet-zxzxzxzxzxzxzxzxz
  existing_security_group_id: sg-xzxzxzxzxzx
  • Setting the load balancer as internal as opposed to internet-facing while also pointing to existing subnets:
ingress:
  terraform_overrides:
    load-balancer-annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-xzxzxzxzxzxzxzx,subnet-zxzxzxzxzxzxzxzxz"
  • Potentially overriding conda configs with private conda channels:
conda_store:
  extra_settings:
    CondaStore:
      conda_allowed_channels:
        - s3://my_bucket/my-conda-repository/main/
  • Deploying Nebari with self-signed certificates:
certificate:
  type: self-signed
  • Rebuilding with a locally issued or customer-provided certificate (after creating a k8s secret):
certificate:
  type: existing
  secret_name: nebari-custom-secret

However, configuration options could be expanded to enable:

  • Option to set image overrides for all Nebari containers, individually, or set private registry mirrors for docker, quay, gcr, etc.
  • Option to employ customized, hardened or otherwise customer-approved AMIs for EKS nodes
  • Option to specify launch commands (pre-bootstrap) that configure security/networking/performance requirements of EKS nodes
  • Option to control the EKS cluster endpoint access, beyond the default "Public", to set either "Private" or "Public and Private"

User benefit

  • Nebari users would have the option to control all container-image overrides and set private registries/mirrors.
  • Nebari users would be able to modify EKS nodes and/or use a customized EC2 AMI to meet the target environment's networking, security, and performance compliance requirements, and to integrate with customer systems.
  • Nebari would be deployable into private AWS subnets, with the option to override the current default "Public" cluster API server endpoint and set either an entirely "Private" endpoint or a "Public and Private" endpoint.

Design Proposal

  • Enable option to control EKS cluster endpoint access settings as discussed in #2586 and proposed in PR#2618:
amazon_web_services:
  eks_endpoint_access: 'private'
  • Enable option to run custom launch commands on EKS nodes as discussed in #2603 and proposed in PR#2621.
    Note that this would also provide a way to set private container registries/mirrors, by adding containerd configs/imports:
amazon_web_services:
  node_prebootstrap_command: |
    #!/bin/bash
    mkdir -p /etc/containerd/certs.d/_default
    cat <<-EOT > /etc/containerd/certs.d/_default/hosts.toml
    [host."https://registry.gitlab.example.com"]
      capabilities = ["pull", "resolve"]
    EOT
  • Enable option to specify custom AMI IDs for EKS nodes as discussed in #2604 and also proposed in PR#2621:
amazon_web_services:
  node_groups:
    general:
      instance: m5.2xlarge
      custom_ami: ami-0xzxzxzxzxzxzx
      min_nodes: 1
      max_nodes: 1
      gpu: false
      single_subnet: false
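
As a rough illustration of the first item above, the eks_endpoint_access value could be translated into the two booleans that the Terraform aws_eks_cluster resource accepts in its vpc_config block (endpoint_public_access and endpoint_private_access). This is a minimal sketch, not the code from PR#2618: the function name is hypothetical, and only the 'private' value appears in the proposal, so the 'public' and 'public_and_private' spellings are assumptions.

```python
from typing import Dict

# Hypothetical mapping. Only 'private' is shown in the proposal; the other
# two spellings are assumptions made for this sketch.
_ENDPOINT_ACCESS_FLAGS: Dict[str, Dict[str, bool]] = {
    "public": {"endpoint_public_access": True, "endpoint_private_access": False},
    "private": {"endpoint_public_access": False, "endpoint_private_access": True},
    "public_and_private": {"endpoint_public_access": True, "endpoint_private_access": True},
}


def endpoint_access_flags(eks_endpoint_access: str = "public") -> Dict[str, bool]:
    """Translate the config value into EKS endpoint access booleans."""
    key = eks_endpoint_access.strip().lower()
    if key not in _ENDPOINT_ACCESS_FLAGS:
        allowed = ", ".join(sorted(_ENDPOINT_ACCESS_FLAGS))
        raise ValueError(f"eks_endpoint_access must be one of: {allowed}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(_ENDPOINT_ACCESS_FLAGS[key])


print(endpoint_access_flags("private"))
```

Rejecting unknown values up front, rather than silently falling back to "Public", seems the safer choice for a secure deployment, since a typo would otherwise expose the API server endpoint.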

Alternatives or approaches considered (if any)

In addition to using node pre-bootstrap commands to override containerd configs for setting private registry mirrors, increased terraform and helm override options could be enabled to specify container images and tags to reflect custom-built or privately-hosted container-images.

Best practices

User impact

  • Per PR#2621, users would need to understand whether or not their custom EKS node AMI is GPU-enabled, to ensure proper toggling of gpu: <true/false> (setting a custom AMI ID would trigger the terraform logic to automatically switch the ami_type value to "CUSTOM", as required by AWS EKS).
  • Additionally, per PR#2621, users would not be required to manually enter the override bootstrap.sh command when setting a custom AMI ID, as the Terraform logic would supply the command for them.
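
The ami_type switching described above can be sketched as follows. This is an assumption-laden illustration rather than the Terraform logic from PR#2621: the function name is hypothetical, and AL2_x86_64 / AL2_x86_64_GPU are standard EKS AMI type names assumed here as the non-custom defaults.

```python
from typing import Optional


def resolve_ami_type(custom_ami: Optional[str], gpu: bool) -> str:
    """Pick the EKS ami_type for a node group.

    EKS requires ami_type to be "CUSTOM" whenever an explicit AMI ID is
    supplied, so the gpu flag only selects the AMI type when no custom
    AMI is set. (With a custom AMI, gpu still has to describe the image
    truthfully for other GPU-related settings; that is outside this sketch.)
    """
    if custom_ami:
        return "CUSTOM"
    # Assumed defaults: the Amazon Linux 2 EKS-optimized AMI types.
    return "AL2_x86_64_GPU" if gpu else "AL2_x86_64"


print(resolve_ami_type("ami-0xzxzxzxzxzxzx", gpu=False))
print(resolve_ami_type(None, gpu=True))
```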

Unresolved questions

@tylergraff

I propose adding a "Nebari Secure Deployment Guide" to this RFD.

This would take the form of one or more nebari-config.yaml files which each utilize inline comments to comprehensively document configuration parameters relevant to various aspects of security. For example, one config file may demonstrate how to override the default docker container locations with a custom-specified repository. This config file could be named e.g. "nebari-config-custom-docker-repo.yaml".

Another example could specify the above docker repository along with AMI IDs and elimination of the AWS internet gateway. This config file could be called e.g. "nebari-config-aws-airgap.yaml". These are only examples: further discussion can refine exactly what configuration goes into each yaml file and how the files are named.

These files could then be used in associated CI/CD pipelines to validate that the configuration state they describe continues to be supported by Nebari as new versions are released. I propose that the work to hook up this CI/CD mechanism is not part of this RFD.

I do not have a recommended location for these nebari-config.yaml files yet.

@Adam-D-Lewis
Member

Somewhat related, I know @viniciusdc was working on a way to auto-generate documentation from the code. We could potentially add the documentation for each new nebari-config setting to the pydantic models themselves. Do you have an issue that shows what you were working on, @viniciusdc?
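
To make the idea concrete, here is a minimal sketch of keeping per-setting documentation next to the field definitions and rendering it from code. It uses stdlib dataclasses as a stand-in so the example has no dependencies; in a pydantic model, the description argument of Field would play the same role. The NodeGroup class, its doc strings, and render_docs are all hypothetical and do not reflect the actual Nebari schema or the tooling mentioned above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class NodeGroup:
    # Hypothetical model: field names follow the RFD example, docs are invented.
    instance: str = field(metadata={"doc": "EC2 instance type for the node group."})
    custom_ami: Optional[str] = field(
        default=None,
        metadata={"doc": "AMI ID to use instead of the default EKS-optimized image."},
    )
    gpu: bool = field(
        default=False,
        metadata={"doc": "Whether the nodes (and the AMI) are GPU-enabled."},
    )


def render_docs(model: type) -> str:
    """Render one 'name: description' line per field of a dataclass."""
    return "\n".join(
        f"{f.name}: {f.metadata.get('doc', '(undocumented)')}" for f in fields(model)
    )


print(render_docs(NodeGroup))
```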

@dcmcand

dcmcand commented Aug 26, 2024

* Enable option to control EKS cluster endpoint access settings as discussed in [#2586](https://github.com/nebari-dev/nebari/issues/2586) and proposed in [PR#2618](https://github.com/nebari-dev/nebari/pull/2618):
amazon_web_services:
  eks_endpoint_access: 'private'

I think this is fine and your proposed method in #2618 makes sense.

* Enable option to run custom launch commands on EKS nodes as discussed in [#2603](https://github.com/nebari-dev/nebari/issues/2603) and proposed in [PR#2621](https://github.com/nebari-dev/nebari/pull/2621)
  Note that this would also resolve the functionality to set private container registries/mirrors by adding containerd configs/imports:
amazon_web_services:
  node_prebootstrap_command: |
    #!/bin/bash
    mkdir -p /etc/containerd/certs.d/_default
    cat <<-EOT > /etc/containerd/certs.d/_default/hosts.toml
    [host."https://registry.gitlab.example.com"]
      capabilities = ["pull", "resolve"]
    EOT

I think this syntax would be quite awkward. As much as possible, prebaking things like additional repos into your AMI would solve this. For running a script on startup, there is the user data approach (reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html). Perhaps something like specifying a path to a user data file or config would work. Beyond the awkwardness of including the script inline, it also seems like a potential security footgun.

* Enable option to specify custom AMI IDs for EKS nodes as discussed in [#2604](https://github.com/nebari-dev/nebari/issues/2604) and also proposed in [PR#2621](https://github.com/nebari-dev/nebari/pull/2621)
amazon_web_services:
  node_groups:
    general:
      instance: m5.2xlarge
      custom_ami: ami-0xzxzxzxzxzxzx
      min_nodes: 1
      max_nodes: 1
      gpu: false
      single_subnet: false

Slight nitpick, but rather than custom_ami, we could just call it node_ami. There is no reason it has to be custom; a user may want to use a vendor-provided node AMI for some reason.

@joneszc
Author

joneszc commented Sep 3, 2024

@viniciusdc @dcmcand

The original intent of the Add aws_launch_template PR was to make it easy on the user to run commands and customized AmazonLinux AMIs using launch_template/user_data sections, with less crisscrossing of config variable options and more built-in logic to reduce the risk of blocking nodes from joining the cluster (e.g. missing or faulty user-defined bootstrap.sh commands).

Your replacement PR includes the same launch_template/user_data approach as originally proposed in PR and still includes the inline scripting option with the pre_bootstrap_command/node_prebootstrap_command var. The main difference I see between the PRs is that your rendition moves the pre-bootstrap command option to the node_groups level to cut out some previously proposed logic in the terraform, while also requiring more due diligence from the user in mapping additional variables.

When specifying an AMI image_id, your PR will require the user to manually set the bootstrap.sh command (PR#2621 triggers a pre-set bootstrap.sh command as necessary and EKS otherwise sets the bootstrap.sh command automatically when image_id is not specified). You are proposing to call this variable, for the second part of the user_data block, user_data, which seems misleading since the overarching user_data section might already include pre-bootstrap commands and could require a bootstrap.sh command. If you are setting the onus on the user to provide the bootstrap.sh command, please rename the variable to something like override_bootstrap_command to clue the user in on providing the /etc/eks/bootstrap.sh command. Additionally, your PR adds updates to relocate some previously existing logic, from terraform to python, for setting ami_type but falls short of checking to ensure that the ami_type is always CUSTOM when a user specifies an AMI image_id. EKS will fail if ami_type is set to anything other than CUSTOM when setting image_id.

Our original PR#2621 was not engaging users to specify ami_type; rather, the intent was to enable users to customize the default AmazonLinux AMIs for security purposes (e.g. apply STIGs). I'm glad to see you aren't giving the user free rein to set the ami_type, which would expand the scope of this PR to accommodate user_data configuration schema updates for AmazonLinux2023, which transitions from Content-Type: text/x-shellscript; charset="us-ascii" to Content-Type: application/node.eks.aws (YAML). We anticipate addressing Nebari's AWS migration from AL2 to AL2023 in a separate security issue/PR, in due course, given the upcoming deprecation of AL2.

@viniciusdc

Hi @joneszc, thanks for the valuable follow-up. Indeed, I took the liberty of expanding the PR to be a bit more generic in terms of AMI customization. As the draft suggests, that was just a small pass-through to see how the config would be exposed to a user, and I was already considering the scope of a "security" deployment option, or a set of settings leading to one, in the future.

The main difference I see between the PRs is that your rendition moves the pre-bootstrap command option to the node_groups level to cut out some previously proposed logic in the terraform, while also requiring more due diligence from the user in mapping additional variables.

It's worth noting that I've kept your original option to set the launch_template as a global config for all node groups. So, both the node_groups and the aws provider field will have access to that variable, though a node_group.launch_template would still take priority over the global one when available.
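
The precedence described here can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary layout, the resolve_launch_templates name, and the choice to replace the global template wholesale (rather than merge it key by key) are hypothetical and not taken from either PR.

```python
from typing import Any, Dict, Optional

Template = Optional[Dict[str, Any]]


def resolve_launch_templates(aws_config: Dict[str, Any]) -> Dict[str, Template]:
    """Return the effective launch_template for every node group.

    A node group's own launch_template wins when present; otherwise the
    provider-level (global) one applies. Assumption: the override replaces
    the global template as a whole instead of being merged with it.
    """
    global_template: Template = aws_config.get("launch_template")
    resolved: Dict[str, Template] = {}
    for name, group in aws_config.get("node_groups", {}).items():
        own: Template = group.get("launch_template")
        resolved[name] = own if own is not None else global_template
    return resolved


config = {
    "launch_template": {"pre_bootstrap_command": "echo global"},
    "node_groups": {
        "general": {"instance": "m5.2xlarge"},
        "worker": {
            "instance": "m5.xlarge",
            "launch_template": {"ami_id": "ami-0xzxzxzxzxzxzx"},
        },
    },
}
print(resolve_launch_templates(config))
```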

You are proposing to call this variable for the second part of the user_data block, user_data, which seems misleading since the overarching user_data section might already include pre-bootstrap commands and could require a bootstrap.sh command

I thought about that, and I agree the name needs to be more accurate, since it's not exactly what it says it is. At the same time, I will enforce that this can only be passed when the AMI type is set to CUSTOM. As you also mentioned, I didn't have time to add that to the PR yet, but it was the original goal.

As a counter-suggestion, what about override_user_data? While I see the point of override_bootstrap_command as a good source of direction for the user, I think it's limiting compared with the broader flexibility of MIME. But I am not strongly opinionated on this.

Our original nebari-dev/nebari#2621 was not engaging users to specify ami_type; rather, the intent was to enable users to customize the default AmazonLinux AMIs for security purposes (e.g. apply STIGs).

That's also different from the PR's intention. The main goal was to clean up the handling logic only from within the Terraform resources; the reason it shows up to users right now is mainly due to a current narrow distinction between what should be passed down as a Terraform variable and what should be allowed in the nebari-config.yaml. They consume from the same model right now, but in theory, they should be separate entities, and that is something I plan to address on another occasion.

AmazonLinux2023, which transitions from Content-Type: text/x-shellscript; charset="us-ascii" to Content-Type: application/node.eks.aws (YAML). We anticipate addressing Nebari's AWS migration from AL2 to AL2023 in a separate security Issue/PR--in due course of the upcoming deprecation of AL2.

You mentioned this exciting prospect. Right now, in either of the PRs, we are "hardcoding" that as part of the template file, which, while not harmful, is not preferable. So maybe the best course of action would be to use user_data as a path to a template file and only guarantee a set of variables to these templates, such as certificate, cluster_name, etc.

@joneszc
Author

joneszc commented Sep 4, 2024

@viniciusdc

I could see calling the variable "user_data" if you weren't including the pre_bootstrap_command var. My team's use cases, for which we originally requested these features, are predominantly in favor of the pre-bootstrap command option in conjunction with taking the burden off the Nebari user for setting the /etc/eks/bootstrap.sh command. Again, in our original PR, we included logic to trigger the bootstrap.sh command+args, as is necessary when using a CUSTOM AMI. We pondered adding a bootstrap_extra_args or bootstrap_args_override var but didn't see that as an imminent need of Nebari.

Since you are, in effect, requiring users to manually enter the bootstrap.sh command when setting ami_id (or else face the pitfall of nodes failing to join the cluster), while also including the pre_bootstrap_command, you are potentially dealing with three chronological parts to your user_data: pre-bootstrap-user-data, bootstrap-user-data, and post-bootstrap-user-data. You could wrap the entirety of the pre-bootstrap + bootstrap.sh-override + post-bootstrap options into a single variable and continue to call it "user_data". Alternatively, you could follow the model of either eksctl, which enables users to enter preBootstrapCommands and/or overrideBootstrapCommand (the onus is on the user to ensure the bootstrap.sh command is manually entered when specifying an ami-id), or the aws eks user_data terraform submodule, which enables both pre_bootstrap_user_data and post_bootstrap_user_data while also offering the option to set enable_bootstrap_user_data to true/false. The point is, if a Nebari user wants to run a custom AMI, and you don't include the /etc/eks/bootstrap.sh command in the user_data file for them, then they will need to know exactly under which variable to write the bootstrap.sh command+args.
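
The three chronological parts can be illustrated with a small sketch that assembles an AL2-style shell user data script and wraps it in the MIME multipart format, using the text/x-shellscript content type mentioned earlier in the thread. The function and parameter names are hypothetical, and the default bootstrap line only shows the basic /etc/eks/bootstrap.sh invocation with a cluster name; a real deployment would likely pass further arguments.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Optional


def build_script(
    cluster_name: str,
    pre_bootstrap: Optional[str] = None,
    bootstrap_override: Optional[str] = None,
    post_bootstrap: Optional[str] = None,
    include_bootstrap: bool = True,
) -> str:
    """Concatenate pre-bootstrap, bootstrap, and post-bootstrap in order."""
    parts = ["#!/bin/bash", "set -e"]
    if pre_bootstrap:
        parts.append(pre_bootstrap.rstrip())
    if include_bootstrap:
        # Supplying a default keeps a custom-AMI node from silently
        # failing to join the cluster when no override is given.
        parts.append(bootstrap_override or f"/etc/eks/bootstrap.sh {cluster_name}")
    if post_bootstrap:
        parts.append(post_bootstrap.rstrip())
    return "\n".join(parts) + "\n"


def wrap_as_mime(script: str) -> str:
    """Wrap the script as a single-part MIME multipart document."""
    message = MIMEMultipart("mixed", boundary="==BOUNDARY==")
    message.attach(MIMEText(script, "x-shellscript", "us-ascii"))
    return message.as_string()


script = build_script(
    "my-cluster",
    pre_bootstrap="echo 'configure registry mirrors here'",
    post_bootstrap="echo 'node joined'",
)
print(wrap_as_mime(script))
```

Keeping the bootstrap step as a default that can be overridden or disabled mirrors the options discussed above: the user is not forced to write the command, but still has a single, clearly named place to change it.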

@tylergraff

@viniciusdc @dcmcand I believe this RFD has served its purpose and can be closed. The changes discussed here have been implemented and merged, and are slated for release 2024.9.1.
