Add cluster support, different software provisioning (#16)
* Adding clustering support in Custom VPC on AWS (#15)

* Adding custom VPC for cluster interface

* Adding cluster auto configuration

* Adding variables in the virl2-base-config

* fixing formatting

* Adding support for compute nodes creation

* fix for the ubuntu user ssh keys

* fix for the image copy

* adding dynamic hostnames for compute nodes

* Adding fix for dynamic interfaces

* adding interface sorting based on the netplan route-metric

* Moving computes behind the NAT GW

* Fixing formatting

* resetting config.yml to defaults and importing updated node definitions

---------

Co-authored-by: amieczko <[email protected]>
Co-authored-by: Ralph Schmieder <[email protected]>

* Fix Azure and consistency changes

- resource names use underscores, not commas
- whitespace / eol
- use proper main function in interface_fix.py

* Change to more unified package handling

Software packages (.deb) are no longer installed individually; instead,
the .pkg file as available from CCO is provided.  This way, all the
CML-relevant Debian packages (CML, PaTTY, IOL tools) will be pulled
from the software distribution package.  The only downside right now is
that the package stored in cloud storage is slightly bigger than the sum
of the actually required Debian packages, as the .pkg usually includes
additional packages for upgrades (like a new kernel and such).

In addition:

- remove patty and iol customization scripts
- fix customize script so that it works with changed hostnames
- change config vars in common and app section of config.yml
  - provide common.enable_patty boolean flag
  - rename app.deb to app.software
- remove unused / commented-out code blocks

* Documentation updates, some refactor

- documentation updates, also fix some image paths
- upload script changes to match move from .deb to .pkg
- add flavor_compute option to specify flavor for cluster computes
- wait for service availability in some more places
- ensure PaTTY restarts with virl2 target
- sign the commit

* Update / fix documentation

* added Andrzej's comment

* add is_controller function and refactor

- add common.sh with is_controller() function to ensure that
  controller-only functionality is not installed on computes (a hedged
  sketch follows this list)
- do not stop/start CML target for post-processing in cml.sh
- add service restarts in post-process patch scripts, where needed
- remove bridge0 and prevent re-creation of bridge0 during service
  restarts
- update documentation
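
For illustration, a minimal sketch of what such a helper could look like
(hypothetical code, not the exact content of common.sh; the CFG_HOSTNAME
placeholder is an assumption):

```bash
#!/bin/bash
# common.sh -- hypothetical sketch only; the real file in this commit
# may determine the role differently.  Assumes the controller hostname
# is rendered into the deployment (CFG_HOSTNAME is illustrative).

is_controller() {
    [ "$(hostname -s)" = "${CFG_HOSTNAME:-cml-controller}" ]
}

# Example use in a customization script: skip controller-only work.
if ! is_controller; then
    echo "compute node detected, skipping controller-only setup"
    exit 0
fi
```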

* Improve network configuration

- move network configuration changes from postprocess into cml_configure
- remove the 00-cml-base.yaml Netplan configuration, as not needed for
  cloud
- select the correct gateway device for PaTTY if multiple default routes
  are present (sketched below)
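
A hedged illustration of how such a selection could work (the
metric-based choice and the jq dependency are assumptions, not
necessarily what cml_configure does):

```bash
# Illustrative only: pick the default route with the lowest metric and
# use its interface as the PaTTY gateway device (requires iproute2 with
# JSON output support and jq).
GW_DEV=$(ip -json route show default |
    jq -r 'sort_by(.metric // 0) | .[0].dev')
echo "using ${GW_DEV} as the PaTTY gateway device"
```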

* Support an external secrets manager (#19)

* Support an external secrets manager

* Add support for CyberArk Conjur
* Add support for Hashicorp Vault
* Dummy secrets manager creates random passwords if undefined
* Update prepare scripts to allow user to turn on/off secrets manager

* Change secrets generation to use random_password

* Change secrets generation to use random_password
* Change the ordering of .envrc.example to make more sense
* Fix bug with keys that have a null value in the config
* Create a sensitive output variable that contains the generated/retrieved secrets
* Update documentation

* Fix bracket placement

* Track changes to modules/secrets/{vault,conjur}.tf

* Since these files are already tracked, .gitignore doesn't
  do anything to stop changes to these files from being checked in.
  This is in line with the already existing behavior for the AWS and AZ
  deploy modules.

* Update documentation

- fix formatting / white space throughout
- change some of the secret manager section in the top level README
- unset all but the license token secret to force random secrets by default

* Fix minor things

- typo in README
- restart target instead of controller in 00-patch_vmx.sh
- move root password change (for AWS) before compute/controller check
- white space corrections

* Format all shell code with shfmt

- using "shfmt -ci -i 4 -"
- move root password change fragment to beginning of cml.sh (as early as
  possible)
- wait between retries when updating the domain name (letsencrypt.sh)

* wip

* Add better dependencies

- controller and compute depend on their subnets
- reboot at end of cloud-init via power-state
- update docs
- ready for skip bridge creation flag in 2.7.1

* Make AWS resource names consistent

* Allow to specify existing VPC

- add a new AWS option to specify an existing VPC ID.  By default, it's an
  empty string.  In this case, a custom VPC resource will be created.  If
  a valid VPC ID is provided, then this VPC is used instead
- updated documentation.
- removed the root password for console access (added for
  troubleshooting)

* Make gateway ID a configurable option

* Documentation and cluster compute calculation

* Fix typo in var name

* Only reboot after provision success

* Make the user provision work (again)

* Add documentation changes and minor tweaks

- change "experimental" to "beta" in README
- small changes in cml.sh

* some final touches on CHANGELOG

---------

Co-authored-by: Andrzej Mieczkowski <[email protected]>
Co-authored-by: amieczko <[email protected]>
Co-authored-by: Chris McCoy <[email protected]>
4 people authored Jun 4, 2024
1 parent 4ad91f5 commit a908806
Showing 55 changed files with 1,652 additions and 376 deletions.
72 changes: 72 additions & 0 deletions .envrc.example
@@ -0,0 +1,72 @@
#
# This file is part of Cisco Modeling Labs
# Copyright (c) 2024, Cisco Systems, Inc.
# All rights reserved.
#

#########
# Configs
#########
#export TF_VAR_cfg_file=""
#export TF_VAR_cfg_extra_vars=""

########
# Clouds
########

#
# AWS
#

#export TF_VAR_aws_access_key=""
#export TF_VAR_aws_secret_key=""

#
# Azure
#

#export TF_VAR_subscription_id=""
#export TF_VAR_tenant_id=""

#########
# Secrets
#########

#
# Conjur
#

#export CONJUR_APPLIANCE_URL="https://conjur-server.example.com"
#export CONJUR_ACCOUNT="example"
## Initialize Conjur, saving the Certificate to the user's home in
## ~/conjur-server.pem
# conjur init --url "$CONJUR_APPLIANCE_URL" --account "$CONJUR_ACCOUNT" --force
## Log in with a Host API Key. The user's short hostname is used to identify
## the host. These would be set up ahead of time in Conjur. This only needs
## to be performed once.
# conjur login --id "host/org/tenant/$(hostname -s)"
# conjur whoami
## Once you are logged in with the Conjur CLI, you can use the macOS Keychain
## to access the required credentials to set up the environment variables.
#export CONJUR_AUTHN_LOGIN="$(security find-generic-password -s ${CONJUR_APPLIANCE_URL}/authn -a login -w | cut -d ':' -f 2 | base64 -d -i -)"
#export CONJUR_AUTHN_API_KEY="$(security find-generic-password -s ${CONJUR_APPLIANCE_URL}/authn -a password -w | cut -d ':' -f 2 | base64 -d -i -)"
## Or, change for other OSes
#export CONJUR_AUTHN_LOGIN=""
#export CONJUR_AUTHN_API_KEY=""
#export CONJUR_CERT_FILE="/etc/conjur.pem"
# -or for Windows-
#set CONJUR_APPLIANCE_URL=https://conjur-server.example.com
#set CONJUR_ACCOUNT=example
#set CONJUR_AUTHN_LOGIN=""
#set CONJUR_AUTHN_API_KEY=""
#set CONJUR_CERT_FILE=C:\conjur-server.pem

#
# Hashicorp Vault
#

#export VAULT_ADDR="https://vault-server.example.com:8200"
## This logs into the Vault CLI and refreshes the user's token.
# vault login #-method=ldap
# -or for Windows-
#set VAULT_ADDR=https://vault-server.example.com:8200
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.terraform
.terraform.lock.hcl
terraform.tfstate*
.terraform.tfstate.lock.info
28 changes: 23 additions & 5 deletions CHANGELOG.md
@@ -2,6 +2,27 @@

Lists the changes for the tool releases.

## Version 0.3.0

- allow cluster deployments on AWS.
- manage and use a non-default VPC
- optionally allow to use an already existing VPC and gateway
- allow to enable EBS encryption (fixes #8)
- a `cluster` section has been added to the config file. Some keywords have changed (`hostname` -> `controller_hostname`). See also a new "Cluster" section in the [AWS documentation](documentation/AWS.md)
- introduce secret managers for storing secrets.
- supported are dummy (use raw_secrets, as before), Conjur and Vault
- also support randomly generated secrets
- by default, the dummy module with random secrets is configured
- the license token secret needs to be configured regardless
- use the CML .pkg software distribution file instead of multiple .deb packages (this is a breaking change -- you need to change the configuration and upload the .pkg to cloud storage instead of the .deb; `deb` -> `software`).
- the PaTTY customization script has been removed. PaTTY is included in the .pkg. Its installation and configuration is now controlled by a new keyword `enable_patty` in the `common` section of the config.
  > [!NOTE]
  > Poll time is hard-coded to 5 seconds in the `cml.sh` script. If a longer poll time and/or additional options like console and VNC access are needed then this needs to be changed manually in the script.
- add a common script file which currently has a function to determine whether the instance is a controller or not. This makes it easier to install only controller-relevant elements and omit them on computes (usable within the main `cml.sh` file as well as in the customization scripts).
- explicitly disable bridge0 and also disable the virl2-bridge-setup.py script by inserting `exit()` as the 2nd line. This will ensure that service restarts will not try to re-create the bridge0 interface. This will be obsolete / a no-op with 2.7.1 which includes a "skip bridge creation" flag.
- each instance will be rebooted at the end of cloud-init to come up with newly installed software / kernel and in a clean state.
- add configuration options `cfg.aws.vpc_id` and `cfg.aws.gw_id` to specify the VPC and gateway ID that should be used. If left empty, then a custom VPC will be created (fixes #9)

## Version 0.2.1

- allow to select provider using a script and split out TF providers
@@ -10,8 +31,7 @@ Lists the changes for the tool releases.
- fixed image paths for the AWS documentation
- mentioned the necessary "prepare" step in the overall README.md
- fix copying from cloud-storage to instance storage
- address 16KB cloud-init limitation in AWS (not entirely removed but pushed
out farther)
- address 16KB cloud-init limitation in AWS (not entirely removed but pushed out farther)

## Version 0.2.0

@@ -25,9 +45,7 @@ Lists the changes for the tool releases.
- improved upload tool
- better error handling in case no images are available
- modified help text
- completely reworked the AWS policy creation section to
provide step-by-step instructions to accurately describe the
policy creation process
- completely reworked the AWS policy creation section to provide step-by-step instructions to accurately describe the policy creation process
- added the current ref-plat images to the `config.yml` file
- provided the current .pkg file name to the `config.yml` file

146 changes: 126 additions & 20 deletions README.md
@@ -1,12 +1,24 @@
# README

Version 0.2.1, March 04 2024
Version 0.3.0, June 4 2024

This repository includes scripts, tooling and documentation to provision an instance of Cisco Modeling Labs (CML) in various cloud services. Currently supported are Amazon Web Services (AWS) and Microsoft Azure.
With CML 2.7, you can run CML instances on Azure and AWS. We have tested CML deployments using this tool chain in both clouds. **The use of this tool is considered BETA**. The tool has certain requirements and prerequisites which are described in this README and in the [documentation](documentation) directory.

> **IMPORTANT** The CML deployment procedure and the tool chain / code provided in this repository are **considered "experimental"**. If you encounter any errors or problems that might be related to the code in this repository then please open an issue on the [Github issue tracker for this repository](https://github.com/CiscoDevNet/cloud-cml/issues).
*It is very likely that this tool chain cannot be used "as-is"*. It should be forked and adapted to specific customer requirements and environments.

> **IMPORTANT** Read the section below about cloud provider selection (prepare script).
> [!IMPORTANT]
>
> **Support:**
>
> - For customers with a valid service contract, CML cloud deployments are supported by TAC within the outlined constraints. Beyond this, support is done with best effort as cloud environments, requirements and policy can differ to a great extent.
> - With no service contract, support is done on a best effort basis via the issue tracker.
>
> **Features and capabilities:** Changes to the deployment tooling will be considered like any other feature by adding them to the product roadmap. This is done at the discretion of the CML team.
>
> **Error reporting:** If you encounter any errors or problems that might be related to the code in this repository then please open an issue on the [Github issue tracker for this repository](https://github.com/CiscoDevNet/cloud-cml/issues).

> [!IMPORTANT]
> Read the section below about [cloud provider selection](#important-cloud-provider-selection) (prepare script).

## General requirements

@@ -20,7 +32,7 @@ Furthermore, the user needs to have access to the cloud service. E.g. credential

The tool chain / build scripts and Terraform can be installed on the on-prem CML controller or, when this is undesirable due to support concerns, on a separate Linux instance.

That said, it *should be possible* to run the tooling also on macOS with tools installed via [Homebrew](https://brew.sh/). Or on Windows with WSL. However, this hasn't been tested by us.
That said, the tooling also runs on macOS with tools installed via [Homebrew](https://brew.sh/), or on Windows with WSL. However, Windows hasn't been tested by us.

### Preparation

@@ -35,29 +47,113 @@ Some of the steps and procedures outlined below are preparation steps and only n

#### Important: Cloud provider selection

The tooling supports multiple cloud providers (currently AWS and Azure). Not everyone wants both providers. It is mandatory to select and configure which provider to use. This is a two step process:
The tooling supports multiple cloud providers (currently AWS and Azure). Not everyone wants both providers. The **default configuration is set to use AWS only**. If Azure should be used either instead or in addition then the following steps are mandatory:

1. Run the `prepare.sh` script to modify and prepare the tool chain. If on Windows, use `prepare.bat`. You can actually choose to use both, if that's what you want.
2. Configure the proper target ("aws" or "azure") in the configuration file

The first step is unfortunately required, since it is impossible to dynamically select different cloud configurations within the same Terraform HCL configuration. See [this SO link](https://stackoverflow.com/questions/70428374/how-to-make-the-provider-configuration-optional-and-based-on-the-condition-in-te) for some more context and details.

The default "out-of-the-box" is AWS, so if you want to run on Azure, don't forget to run the prepare script.
The default "out-of-the-box" configuration is AWS, so if you want to run on Azure, don't forget to run the prepare script.

#### Managing secrets

> [!WARNING]
> It is a best practice to **not** keep your CML secrets and passwords in Git!

CML cloud supports these storage methods for the required platform and application secrets:

- Raw secrets in the configuration file (as supported with previous versions)
- Random secrets by not specifying any secrets
- [Hashicorp Vault](https://www.vaultproject.io/)
- [CyberArk Conjur](https://www.conjur.org/)

See the sections below for additional details on how to use and manage secrets.

##### Referencing secrets

You can refer to the secret maintained in the secrets manager by updating `config.yml` appropriately. If you use the `dummy` secrets manager, it will use the `raw_secret` as specified in the `config.yml` file, and the secrets will **not** be protected.

```yaml
secret:
  manager: conjur
  secrets:
    app:
      username: admin
      # Example using Conjur
      path: example-org/example-project/secret/admin_password
```
Refer to the `.envrc.example` file for examples of how to set up the environment variables needed to use an external secrets manager.

##### Random secrets

If you want random passwords to be generated when applying, based on [random_password](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/password), leave the `raw_secret` undefined:

```yaml
secret:
  manager: dummy
  secrets:
    app:
      username: admin
      # raw_secret: # Undefined
```

> [!NOTE]
>
> You can retrieve the generated passwords after applying with `terraform output cml2secrets`.

The included default `config.yml` configures generated passwords for the following secrets:

- App password (for the UI)
- System password for the OS system administration user
- Cluster secret when clustering is enabled

Regardless of the secret manager in use and whether you use random passwords or not: you **must** provide a valid Smart Licensing token for the system to work.
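
As a usage sketch (the output name `cml2secrets` is taken from the note above; the `jq` step is optional and assumes `jq` is installed):

```bash
# Deploy, then inspect the generated/retrieved secrets (marked sensitive).
terraform apply
terraform output cml2secrets

# Machine-readable form, e.g. for feeding other tooling:
terraform output -json cml2secrets | jq .
```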

##### CyberArk Conjur installation

> [!IMPORTANT]
> CyberArk Conjur is not currently in the Terraform Registry. You must follow its [installation instructions](https://github.com/cyberark/terraform-provider-conjur?tab=readme-ov-file#terraform-provider-conjur) before running `terraform init`.

These steps are only required if using CyberArk Conjur as an external secrets manager.
1. Download the [CyberArk Conjur provider](https://github.com/cyberark/terraform-provider-conjur/releases).
2. Copy the custom provider to `~/.terraform.d/plugins/localhost/cyberark/conjur/<version>/<architecture>/terraform-provider-conjur_v<version>`
   ```bash
   $ mkdir -vp ~/.terraform.d/plugins/localhost/cyberark/conjur/0.6.7/darwin_arm64/
   $ unzip ~/terraform-provider-conjur_0.6.7-4_darwin_arm64.zip -d ~/.terraform.d/plugins/localhost/cyberark/conjur/0.6.7/darwin_arm64/
   $
   ```
3. Create a `.terraformrc` file in the user's home:

   ```hcl
   provider_installation {
     filesystem_mirror {
       path = "/Users/example/.terraform.d/plugins"
       include = ["localhost/cyberark/conjur"]
     }
     direct {
       exclude = ["localhost/cyberark/conjur"]
     }
   }
   ```

### Terraform installation

Terraform can be downloaded for free from [here](https://developer.hashicorp.com/terraform/downloads). This site also has instructions on how to install it on various supported platforms.
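
For example, on Ubuntu it can be installed from HashiCorp's apt repository (the commands below follow HashiCorp's published instructions; on macOS, the `hashicorp/tap` Homebrew tap provides the same package):

```bash
# Add HashiCorp's signing key and apt repository, then install Terraform.
wget -O- https://apt.releases.hashicorp.com/gpg |
    sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg]" \
    "https://apt.releases.hashicorp.com $(lsb_release -cs) main" |
    sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y terraform
```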

Deployments of CML using Terraform were tested using the versions mentioned below on Ubuntu Linux.
Deployments of CML using Terraform were tested using the versions mentioned below on Ubuntu Linux and macOS.

```plain
```bash
$ terraform version
Terraform v1.7.3
on linux_amd64
Terraform v1.8.0
on darwin_arm64
+ provider registry.terraform.io/ciscodevnet/cml2 v0.7.0
+ provider registry.terraform.io/hashicorp/aws v5.37.0
+ provider registry.terraform.io/hashicorp/azurerm v3.92.0
+ provider registry.terraform.io/hashicorp/random v3.6.0
+ provider registry.terraform.io/hashicorp/aws v5.45.0
+ provider registry.terraform.io/hashicorp/azurerm v3.99.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.3.3
+ provider registry.terraform.io/hashicorp/random v3.6.1
+ provider registry.terraform.io/hashicorp/vault v4.2.0
+ provider localhost/cyberark/conjur v0.6.7
$
```

@@ -74,12 +170,23 @@ See the documentation directory for cloud specific instructions:

## Customization

There's two variables which can be defined / set to further customize the behavior of the tool chain:
There are two Terraform variables which can be defined / set to further customize the behavior of the tool chain:

- `cfg_file`: This variable defines the configuration file. It defaults to `config.yml`.
- `cfg_extra_vars`: This variable defines the name of a file with additional variable definitions. The default is "none".

A typical extra vars file would look like this:
```bash
export TF_VAR_access_key="aws-something"
export TF_VAR_secret_key="aws-somethingelse"
# export TF_VAR_subscription_id="azure-something"
# export TF_VAR_tenant_id="azure-something-else"
export TF_VAR_cfg_file="config-custom.yml"
export TF_VAR_cfg_extra_vars="extras.sh"
```

A typical extra vars file would look like this (as referenced by `extras.sh` in the code above):

```plain
CFG_UN="username"
@@ -92,19 +199,18 @@ In this example, four additional variables are defined which can be used in cust

See the AWS specific document for additional information on how to define variables in the environment using tools like `direnv` ("Terraform variable definition").
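
A possible `direnv` workflow using the `.envrc.example` file at the top level of this repository (the copy step is just one way of doing it):

```bash
# Copy the example, fill in the TF_VAR_* values you need, then let direnv
# load them automatically whenever you enter the directory.
cp .envrc.example .envrc
"${EDITOR:-vi}" .envrc
direnv allow .
```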

## Extra scripts
## Additional customization scripts

The deploy module has a couple of extra scripts which are not enabled / used by default. They are:

- install IOL related files, likely obsolete with the release of 2.7 (`02-iol.sh`)
- request/install certificates from Letsencrypt (`03-letsencrypt.sh`)
- request/install certificates from LetsEncrypt (`03-letsencrypt.sh`)
- customize additional settings, here: add users and resource pools (`04-customize.sh`).

These additional scripts serve mostly as an inspiration for customization of the system to adapt to local requirements.

### Requesting a cert

The letencrypt script requests a cert if there's none already present. The cert can then be manually copied from the host to the cloud storage with the hostname as a prefix. If the host with the same hostname is started again at a later point in time and the cert files exist in cloud storage, then those files are simply copied back to the host without requesting a new certificate. This avoids running into any certificate request limits.
The letsencrypt script requests a cert if there's none already present. The cert can then be manually copied from the host to the cloud storage with the hostname as a prefix. If the host with the same hostname is started again at a later point in time and the cert files exist in cloud storage, then those files are simply copied back to the host without requesting a new certificate. This avoids running into any certificate request limits.

Certificates are stored in `/etc/letsencrypt/live` in a directory with the configured hostname.
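
A hypothetical manual copy to cloud storage on AWS could look like the following (bucket name and archive layout are placeholders, not taken from this repository):

```bash
# Bundle the issued certificate with the hostname as a prefix and push it
# to the bucket that also holds the software package (placeholder name).
HOST_PREFIX=$(hostname -s)
sudo tar -C /etc/letsencrypt -czf "/tmp/${HOST_PREFIX}-letsencrypt.tar.gz" .
aws s3 cp "/tmp/${HOST_PREFIX}-letsencrypt.tar.gz" \
    "s3://<your-cml-bucket>/${HOST_PREFIX}-letsencrypt.tar.gz"
```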

5 changes: 4 additions & 1 deletion TODO.md
@@ -3,9 +3,12 @@
Here's a list of things which should be implemented going forward. This is in no particular order at the moment.

1. Allow for multiple instances in same account/resource group. Right now, resources do not have a unique name and they should, using the random provider as is already done with the AWS VPC.
2. Allow cluster installs (e.g. multiple computes, adding a VPC cluster network). There *seems* to be an issue on AWS with IPv6 addressing which might make this impossible)
2. Allow cluster installs on Azure (AWS is working, see below).
3. Allow for certs to be pushed to cloud storage, once requested/installed.
4. Allow more than one cloud at the same time, as `prepare.sh` suggests. Right now, this does not work as Terraform requires only ONE `required_providers` block but both template files introduce an individual one. Should be addressed by making this smarter, introducing a `versions.tf` file which is built by `prepare.sh`. See <https://discuss.hashicorp.com/t/best-provider-tf-versions-tf-placement/56581/5>

## Done items

1. Work around 16kb user data limit in AWS (seems to not be an issue in Azure).
2. Allow cluster installs (e.g. multiple computes, adding a VPC cluster network). Works on AWS, thanks to amieczko.
3. Allow to use an already existing VPC