
Investigate cloud-init support #44

Closed
displague opened this issue Oct 13, 2020 · 17 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@displague
Member

displague commented Oct 13, 2020

This issue is an open investigation into what features Hegel (and perhaps other components of Tinkerbell) would need in order to support cloud-init.

Cloud-init is a provisioning service that ships with many distributions, including Debian and Ubuntu. Cloud-init has awareness of various cloud providers (and provisioning environments, like OpenStack).

Cloud-init:

  • detects the environment it is running in (taking hints from the kernel args, network, disks, embedded data)
  • accesses the metadata configuration (a network service, attached disk, a deposited file, or hardware embedded data)
  • parses the metadata
  • configures the node

For raw disk images composed of partitions (GPT, DOS, etc.), LVM volumes, or unknown or encrypted filesystems, the current approach of stamping a Docker-image-based filesystem with a file is not sufficient. These raw disks must remain pristine and trusted, and cannot be manipulated externally (by Tinkerbell) without disturbing that trust.

Tinkerbell-provisioned nodes should be able to rely on pre-installed software, such as cloud-init or Ignition (Afterburn), and kernel arguments to access the metadata service provided by Hegel.
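For context, cloud-init can already be pointed at an HTTP seed from the kernel command line via its NoCloud datasource. A minimal sketch (the URL is illustrative, and whether Hegel serves the meta-data/user-data layout that NoCloud expects is part of this investigation):

# Hypothetical boot line; NoCloud fetches <url>meta-data and <url>user-data
linux /vmlinuz root=/dev/vda1 ds=nocloud-net;s=http://192.168.1.1:50061/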

What changes to Hegel are required to provide this? What non-Tinkerbell / external changes are needed?

After some initial input and consideration, this issue should either be closed as a non-goal or result in one or several Tinkerbell proposals to address any limitations; external cloud-init issues should also be raised.

@displague
Member Author

The networking ranges and network location of the Hegel metadata service are user-configurable. In addition, network and hardware isolation is not guaranteed in Tinkerbell environments. It is therefore not possible to know or recommend that Hegel be configured for a specific address in all environments, whether public, private, or link-local.

The user must be able to define the addresses for the Hegel service. How is this address configured today? Where is that information stored? How does this address make its way into templates and workflows?


Tinkerbell does not provide DNS services. Can mDNS be used in Tinkerbell environments? Could Hegel then be addressed with a well-known name (configurable per cluster), such as "metadata.local"? What benefits would this provide, and what are the limitations and criteria for this to be feasible?
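As a data point for the mDNS question, Avahi can publish an arbitrary host record on the local segment. A minimal sketch, assuming avahi-daemon runs on the provisioner (the name and address are illustrative):

# Publish "metadata.local" pointing at the Hegel host via mDNS
# (-R skips the reverse PTR record; runs in the foreground until interrupted)
avahi-publish --address -R metadata.local 192.168.1.1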


Ignition (Afterburn) is currently Packet-aware. Could this work be extended to support Tinkerbell? What are the key differences in the spec or access methods (and location)?
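For reference, CoreOS-style images select the Afterburn/Ignition platform via a kernel argument, so Tinkerbell support would presumably hook in the same way. A sketch ("packet" is the existing platform ID; "tinkerbell" is hypothetical):

# Existing: selects the Packet metadata flavor for Ignition/Afterburn
ignition.platform.id=packet
# Hypothetical equivalent if a Tinkerbell platform were added
ignition.platform.id=tinkerbell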

@displague
Member Author

Two of the main benefits of cloud-init are network configuration and userdata retrieval.

Userdata would need to be obtained through the metadata service.

Does Tinkerbell benefit from cloud-init for network discovery purposes? DHCP is currently provided, but DHCP has the limitation of a single address per interface. Do Tinkerbell and Hegel currently provide the means to define network information more granularly than that, such that network information from the metadata service would be beneficial?
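To make the granularity question concrete, cloud-init's network-config (version 2, netplan syntax) can express what DHCP cannot, such as multiple static addresses on one interface. A sketch with illustrative values:

# Sketch: cloud-init network-config v2; values are hypothetical
version: 2
ethernets:
  eth0:
    match:
      macaddress: "08:00:27:00:00:01"
    addresses:
      - 192.168.1.5/29
      - 10.0.0.5/24
    gateway4: 192.168.1.1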

@displague
Member Author

Cloud-init benefits from ds-identify detection of the environment through local means. This is typically done through DMI (dmidecode). For a given environment, well-known DMI fields are populated with platform-identifiable patterns.

For example:

System Information
        Manufacturer: Packet
        Product Name: c3.small.x86
        Version: R1.00
        Serial Number: D5S0R8000047
        UUID: 00000000-0000-0000-0000-d05099f0314c
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: To Be Filled By O.E.M.

Can or should Tinkerbell express the opinion that DMI should be updated on each device? When would this happen in the enrolling or workflow process? What values would be used? Can a user opt-out of this? Is it technically possible to support this across unknown hardware (using common software)?
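For reference, ds-identify reads these DMI fields without dmidecode, via sysfs. Checking them on a node looks like this (the paths are the standard Linux DMI sysfs entries):

cat /sys/class/dmi/id/sys_vendor           # "Manufacturer", e.g. Packet
cat /sys/class/dmi/id/product_name         # e.g. c3.small.x86
sudo cat /sys/class/dmi/id/product_serial  # serial number; root-only on most kernels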

@displague
Member Author

displague commented Dec 9, 2020

Is it possible to use the network at Layer 2 for platform detection, or to report the metadata address, through LLDP perhaps? (@invidian)

Barring network and local hardware modifications, are we left with only kernel command-line arguments for identification (ds=tinkerbell, for example)?
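If the kernel command line ends up being the answer, ds-identify already parses a ds= token from /proc/cmdline before falling back to DMI and other checks, so a hypothetical Tinkerbell hint could look like:

# Hypothetical: no "tinkerbell" datasource exists in cloud-init today
linux /vmlinuz root=/dev/vda1 console=ttyS0 ds=tinkerbell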

@detiber
Contributor

detiber commented Mar 18, 2021

I've been able to get things working at a basic level by using sandbox/vagrant/libvirt, adding the link-local address 169.254.169.254/16 to the provisioner host, configuring user-data in the host definition, and injecting a datasource configuration into the host image using a workflow.

vagrant up provisioner --no-destroy-on-error
vagrant ssh provisioner

# workaround for https://github.com/tinkerbell/sandbox/issues/62
sudo curl -L "https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
exit

vagrant provision provisioner

pushd $IMAGE_BUILDER_DIR/images/capi # from https://github.com/kubernetes-sigs/image-builder/pull/547
make build-raw-all
cp output/ubuntu-1804-kube-v1.18.15.gz $SANDBOX_DIR/deploy/state/webroot/
popd

vagrant ssh provisioner

cd /vagrant && source .env && cd deploy
docker-compose up -d

# TODO: add 169.254.169.254 link-local address to provisioner machine
# TODO: figure out how we can incorporate this into sandbox
# TODO: will this cause issues in EM deployments?
# edit /etc/netplan/eth1.yaml
# add 169.254.169.254/16 to the addresses
# netplan apply
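# A sketch of the edit described above, assuming the sandbox defaults
# (eth1 carries 192.168.1.1 on the provisioner); merge this with the existing
# netplan file rather than overwriting it on a real setup:
sudo tee /etc/netplan/eth1.yaml > /dev/null <<'NETPLAN'
network:
  version: 2
  ethernets:
    eth1:
      addresses:
        - 192.168.1.1/24
        - 169.254.169.254/16
NETPLAN
sudo netplan apply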

# setup hook as a replacement for OSIE (https://github.com/tinkerbell/hook#the-manual-way)
pushd /vagrant/deploy/state/webroot/misc/osie
mv current current-bak
mkdir current
wget http://s.gianarb.it/tinkie/tinkie-master.tar.gz
tar xzv -C ./current -f tinkie-master.tar.gz
popd

# TODO: follow up on not needing to pull/tag/push images to internal registry for actions
# TODO: requires changes to tink-worker to avoid internal registry use
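# (192.168.1.1 below is assumed to be the provisioner's own IP in the sandbox
# defaults; it hosts both the internal Docker registry and the web root.)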
docker pull quay.io/tinkerbell-actions/image2disk:v1.0.0
docker tag quay.io/tinkerbell-actions/image2disk:v1.0.0 192.168.1.1/image2disk:v1.0.0
docker push 192.168.1.1/image2disk:v1.0.0
docker pull quay.io/tinkerbell-actions/writefile:v1.0.0
docker tag quay.io/tinkerbell-actions/writefile:v1.0.0 192.168.1.1/writefile:v1.0.0
docker push 192.168.1.1/writefile:v1.0.0
docker pull quay.io/tinkerbell-actions/kexec:v1.0.0
docker tag quay.io/tinkerbell-actions/kexec:v1.0.0 192.168.1.1/kexec:v1.0.0
docker push 192.168.1.1/kexec:v1.0.0

# TODO: investigate hegel metadata not returning proper values for 2009-04-04/meta-data/{public,local}-ipv{4,6}, currently trying to return values from hw.metadata.instance.network.addresses[] instead of hw.network.interfaces[]
# TODO: should hegel (or tink) automatically populate fields from root sources, for example metadata.instance.id from id
#       public/local ip addresses from network.addresses, etc?
# TODO: automatic hardware detection to avoid needing to manually populate metadata.instance.storage.disks[].device

cat > hardware-data-worker-1.json <<EOF
{
  "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
  "metadata": {
    "facility": {
      "facility_code": "onprem"
    },
    "userdata": "#cloud-config\nssh_authorized_keys:\n- ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCZaw/MNLTa1M93IbrpklSqm/AreHmLSauFvGJ1Q5OV5/pfyeusNoxDaOQlk3BzG3InmhWX4tk73GOBHO36ugpeorGg/fC4m+5rL42z2BND1o98Borb6x2pAGF11IcEM9m7c8k0gg9lP2OR4mDAq2BFrmJq8h77zk9LtpWEvFJfASx9iqv0s7uHdWjc3ERQ/fcgl8Lor/GYzSbvATO6StrwrLs/HusA5k9vDKyEGfGbxADMmxnnzaukqhuk8+SXf+Ni4kKReGkqjFI8uUeOLU/4sG5X5afTlW6+7KPZUhLSkZh6/bVY8m5B9AsV8M6yHEan48+258Q78lsu8lWhoscUYV49nyA61RveiBUExZYhi45jI3LUmGX3hHpVwfRMfgh0RjtrkCX8I6eSLCUX//Xu4WKkVMgQur2TLT+Nmpf4dwJgDX72nQmgbu/CHC4u2Y5FTWnHpeNLicOWecsHXxqs8U1K7rWguOfCiD/qtRhqp5Sz3m37/h/aGjGqvsa/DIc= [email protected]",
    "instance": {
      "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
      "hostname": "test-instance",
      "storage": {
        "disks": [{"device": "/dev/vda"}]
      }
    },
    "state": ""
  },
  "network": {
    "interfaces": [
      {
        "dhcp": {
          "arch": "x86_64",
          "ip": {
            "address": "192.168.1.5",
            "gateway": "192.168.1.1",
            "netmask": "255.255.255.248"
          },
          "mac": "08:00:27:00:00:01",
          "uefi": false
        },
        "netboot": {
          "allow_pxe": true,
          "allow_workflow": true
        }
      }
    ]
  }
}
EOF
docker exec -i deploy_tink-cli_1 tink hardware push < ./hardware-data-worker-1.json

cat > capi-stream-template.yml <<EOF
version: "0.1"
name: capi_provisioning
global_timeout: 6000
tasks:
  - name: "os-installation"
    worker: "{{.device_1}}"
    volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
    environment:
      MIRROR_HOST: 192.168.1.1
    actions:
      - name: "stream-image"
        image: image2disk:v1.0.0
        timeout: 90
        environment:
          IMG_URL: http://192.168.1.1:8080/ubuntu-1804-kube-v1.18.15.gz
          DEST_DISK: /dev/vda
          COMPRESSED: true
      - name: "add-tink-cloud-init-config"
        image: writefile:v1.0.0
        timeout: 90
        environment:
          DEST_DISK: /dev/vda1
          FS_TYPE: ext4
          DEST_PATH: /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
          UID: 0
          GID: 0
          MODE: 0600
          DIRMODE: 0700
          CONTENTS: |
            datasource:
              Ec2:
                metadata_urls: ["http://192.168.1.1:50061", "http://169.254.169.254:50061"]
            system_info:
              default_user:
                name: tink
                groups: [wheel, adm]
                sudo: ["ALL=(ALL) NOPASSWD:ALL"]
                shell: /bin/bash
      - name: "kexec-image"
        image: kexec:v1.0.0
        timeout: 90
        pid: host
        environment:
          BLOCK_DEVICE: /dev/vda1
          FS_TYPE: ext4
EOF
docker exec -i deploy_tink-cli_1 tink template create < ./capi-stream-template.yml

docker exec -i deploy_tink-cli_1 tink workflow create -t <TEMPLATE ID> -r '{"device_1":"08:00:27:00:00:01"}'

@displague
Member Author

That's excellent, @detiber!

Do you suppose we can close this issue given this success, or are we dependent on unreleased features?
Are there any additional or supporting features to investigate?
Do we need examples of other OSes taking advantage of this? (Ignition, Kickstart, other)?
Should we include these steps in Tinkerbell documentation?

@detiber
Contributor

detiber commented Mar 18, 2021

I definitely think we need to add some documentation; quite a bit of this isn't intuitive, such as:

  • Having to inject the datasource config into the image being booted
  • Needing to use a link-local address since networking isn't available when the userdata is pulled
  • Having to manually populate metadata.instance.id, metadata.instance.hostname
  • Not having access to networking information unless the metadata is populated manually, rather than having it auto-populated from the hardware network.interfaces configuration.

@cursedclock

@detiber Is the link-local address really needed? Shouldn't cloud-init just pull the metadata from 192.168.1.1:50061, since that's the IP listed in metadata_urls?

@displague
Member Author

@cursedclock What address should the device use to access the metadata, and how will that address be determined?

Link-local solves this problem with self-assigned addresses. It also suggests that the metadata service should use a well-known address like 169.254.169.254, which cloud-init uses as the default for various ds= values. Hegel provides basic ds=ec2 compatibility (2009-04-04), and using this address will avoid the need for additional kernel command-line arguments.

On the other hand, if we have to manipulate kernel command line arguments, we can likely provide the IP address in the same way.

This becomes more advantageous with direct Tinkerbell support in cloud-init.

@cursedclock

@displague I see; that means there would be no need for an action to modify the contents of /etc/cloud/cloud.cfg.d, right? Since the worker machine is expected to use the "default" address for pulling configuration metadata.

@displague
Member Author

Kernel args ds=ec2;metadata_urls=http://ip:port should work too. This is for cloud-init; Kickstart/Ignition take different arguments.

@tstromberg
Contributor

@nshalman has made a PoC to do cloud-init based installs. Can you comment on whether this issue can be closed?

@tstromberg tstromberg added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 27, 2021
@nshalman
Member

Alas, that was a hack using a nocloud partition on disk. And the code that we used is not currently upstream. Issue is still valid and open.

@nshalman nshalman added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 2, 2021
@nshalman nshalman removed their assignment Mar 8, 2022
@chrisdoherty4 chrisdoherty4 self-assigned this May 2, 2022
@chrisdoherty4
Member

chrisdoherty4 commented May 3, 2022

Amazon is using Hegel to provision with cloud-init and it's seemingly working. What makes us think this isn't working?

I'm verifying a few bits with cloud-init manually; once I have that data I'll include it here.

@chrisdoherty4
Member

chrisdoherty4 commented May 9, 2022

Some further investigation in #61 (comment) found disparities that need fixing.

Perhaps this issue can be closed in favor of discussion over there about redesign?

@displague
Member Author

We can close this. We can open another issue if there is more interest in introducing cloud-init support for ds=tinkerbell (or hegel) as a unique flavor of metadata, distinct from the EC2 flavor.
