
Investigate cloud-init support #44

Closed
displague opened this issue Oct 13, 2020 · 17 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@displague
Member

displague commented Oct 13, 2020

This issue is an open investigation into what features Hegel (and perhaps other components of Tinkerbell) would need in order to support cloud-init.

Cloud-init is a provisioning service that ships with many distributions, including Debian and Ubuntu. Cloud-init has awareness of various cloud providers (and provisioning environments, like OpenStack).

Cloud-init:

  • detects the environment it is running in (taking hints from the kernel args, network, disks, embedded data)
  • accesses the metadata configuration (a network service, attached disk, a deposited file, or hardware embedded data)
  • parses the metadata
  • configures the node

For raw disk images composed of partitions (GPT, DOS, etc.), LVM volumes, or unknown or encrypted filesystems, the current approach of stamping a Docker-image-based filesystem with a file is not sufficient. These raw disks must remain pristine and trusted, and cannot be manipulated externally (by Tinkerbell) without disturbing that trust.

Tinkerbell-provisioned nodes should be able to rely on pre-installed software, such as cloud-init or Ignition (Afterburn), and kernel arguments to access the metadata service provided by Hegel.
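For context, cloud-init can already be pointed at an HTTP seed from the kernel command line via its NoCloud datasource. A minimal sketch (the URL is illustrative, and whether Hegel serves the meta-data/user-data layout that NoCloud expects is part of this investigation):

# Hypothetical boot line; NoCloud fetches <url>meta-data and <url>user-data
linux /vmlinuz root=/dev/vda1 ds=nocloud-net;s=http://192.168.1.1:50061/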

What changes to Hegel are required to provide this? What non-Tinkerbell / external changes are needed?

After some initial input and consideration, this issue should either be closed as a non-goal or result in one or several Tinkerbell proposals to address any limitations; external cloud-init issues should also be raised.

@displague
Member Author

The networking ranges and network location of the Hegel metadata service are user-configurable. In addition, network and hardware isolation is not guaranteed in Tinkerbell environments. It is therefore not possible to know or recommend that Hegel be configured for a specific address in all environments, whether public, private, or link-local.

The user must be able to define the addresses for the Hegel service. How is this address configured today? Where is that information stored? How does this address make its way into templates and workflows?


Tinkerbell does not provide DNS services. Can mDNS be used in Tinkerbell environments? Could Hegel then be addressed with a well-known name (configurable per cluster), such as "metadata.local"? What benefits would this provide, and what are the limitations and criteria for this to be feasible?
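As a data point for the mDNS question, Avahi can publish an arbitrary host record on the local segment. A minimal sketch, assuming avahi-daemon runs on the provisioner (the name and address are illustrative):

# Publish "metadata.local" pointing at the Hegel host via mDNS
# (-R skips the reverse PTR record; runs in the foreground until interrupted)
avahi-publish --address -R metadata.local 192.168.1.1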


Ignition (Afterburn) is currently Packet-aware. Could this work be extended to support Tinkerbell? What are the key differences in the spec or access methods (and location)?
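For reference, CoreOS-style images select the Afterburn/Ignition platform via a kernel argument, so Tinkerbell support would presumably hook in the same way. A sketch ("packet" is the existing platform ID; "tinkerbell" is hypothetical):

# Existing: selects the Packet metadata flavor for Ignition/Afterburn
ignition.platform.id=packet
# Hypothetical equivalent if a Tinkerbell platform were added
ignition.platform.id=tinkerbell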

@displague
Member Author

Two of the main benefits of cloud-init are network configuration and userdata retrieval.

Userdata would need to be obtained through the metadata service.

Does Tinkerbell benefit from cloud-init for network discovery purposes? DHCP is currently provided, but DHCP has the limitation of a single address per interface. Do Tinkerbell and Hegel currently provide the means to define network information more granularly than that, such that network information from the metadata service would be beneficial?
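To make the granularity question concrete, cloud-init's network-config (version 2, netplan syntax) can express what DHCP cannot, such as multiple static addresses on one interface. A sketch with illustrative values:

# Sketch: cloud-init network-config v2; values are hypothetical
version: 2
ethernets:
  eth0:
    match:
      macaddress: "08:00:27:00:00:01"
    addresses:
      - 192.168.1.5/29
      - 10.0.0.5/24
    gateway4: 192.168.1.1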

@displague
Member Author

Cloud-init benefits from ds-identify detection of the environment through local means. This is typically done through DMI (dmidecode). For a given environment, well-known DMI fields are populated with platform-identifiable patterns.

For example:

System Information
        Manufacturer: Packet
        Product Name: c3.small.x86
        Version: R1.00
        Serial Number: D5S0R8000047
        UUID: 00000000-0000-0000-0000-d05099f0314c
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: To Be Filled By O.E.M.

Can or should Tinkerbell express the opinion that DMI should be updated on each device? When would this happen in the enrolling or workflow process? What values would be used? Can a user opt-out of this? Is it technically possible to support this across unknown hardware (using common software)?
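For reference, ds-identify reads these DMI fields without dmidecode, via sysfs. Checking them on a node looks like this (the paths are the standard Linux DMI sysfs entries):

cat /sys/class/dmi/id/sys_vendor           # "Manufacturer", e.g. Packet
cat /sys/class/dmi/id/product_name         # e.g. c3.small.x86
sudo cat /sys/class/dmi/id/product_serial  # serial number; root-only on most kernels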

@displague
Member Author

displague commented Dec 9, 2020

Is it possible to use the network at Layer 2 for platform detection, or to report the metadata address, through LLDP perhaps? (@invidian)

Barring network and local hardware modifications, are we left with only kernel command-line arguments for identification (ds=tinkerbell, for example)?
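If the kernel command line ends up being the answer, ds-identify already parses a ds= token from /proc/cmdline before falling back to DMI and other checks, so a hypothetical Tinkerbell hint could look like:

# Hypothetical: no "tinkerbell" datasource exists in cloud-init today
linux /vmlinuz root=/dev/vda1 console=ttyS0 ds=tinkerbell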

@detiber
Contributor

detiber commented Mar 18, 2021

I've been able to get things working at a basic level by using sandbox/vagrant/libvirt, adding the link-local address 169.254.169.254/16 to the provisioner host, configuring user-data in the host definition, and injecting a datasource configuration into the host image using a workflow.

vagrant up provisioner --no-destroy-on-error
vagrant ssh provisioner

# workaround for https://github.com/tinkerbell/sandbox/issues/62
sudo curl -L "https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
exit

vagrant provision provisioner

pushd $IMAGE_BUILDER_DIR/images/capi # from https://github.com/kubernetes-sigs/image-builder/pull/547
make build-raw-all
cp output/ubuntu-1804-kube-v1.18.15.gz $SANDBOX_DIR/deploy/state/webroot/
popd

vagrant ssh provisioner

cd /vagrant && source .env && cd deploy
docker-compose up -d

# TODO: add 169.254.169.254 link-local address to provisioner machine
# TODO: figure out how we can incorporate this into sandbox
# TODO: will this cause issues in EM deployments?
# edit /etc/netplan/eth1.yaml
# add 169.254.169.254/16 to the addresses
# netplan apply
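# A sketch of the edit described above, assuming the sandbox defaults
# (eth1 carries 192.168.1.1 on the provisioner); merge this with the existing
# netplan file rather than overwriting it on a real setup:
sudo tee /etc/netplan/eth1.yaml > /dev/null <<'NETPLAN'
network:
  version: 2
  ethernets:
    eth1:
      addresses:
        - 192.168.1.1/24
        - 169.254.169.254/16
NETPLAN
sudo netplan apply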

# setup hook as a replacement for OSIE (https://github.com/tinkerbell/hook#the-manual-way)
pushd /vagrant/deploy/state/webroot/misc/osie
mv current current-bak
mkdir current
wget http://s.gianarb.it/tinkie/tinkie-master.tar.gz
tar xzv -C ./current -f tinkie-master.tar.gz
popd

# TODO: follow up on not needing to pull/tag/push images to internal registry for actions
# TODO: requires changes to tink-worker to avoid internal registry use
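# (192.168.1.1 below is assumed to be the provisioner's own IP in the sandbox
# defaults; it hosts both the internal Docker registry and the web root.)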
docker pull quay.io/tinkerbell-actions/image2disk:v1.0.0
docker tag quay.io/tinkerbell-actions/image2disk:v1.0.0 192.168.1.1/image2disk:v1.0.0
docker push 192.168.1.1/image2disk:v1.0.0
docker pull quay.io/tinkerbell-actions/writefile:v1.0.0
docker tag quay.io/tinkerbell-actions/writefile:v1.0.0 192.168.1.1/writefile:v1.0.0
docker push 192.168.1.1/writefile:v1.0.0
docker pull quay.io/tinkerbell-actions/kexec:v1.0.0
docker tag quay.io/tinkerbell-actions/kexec:v1.0.0 192.168.1.1/kexec:v1.0.0
docker push 192.168.1.1/kexec:v1.0.0

# TODO: investigate hegel metadata not returning proper values for 2009-04-04/meta-data/{public,local}-ipv{4,6}, currently trying to return values from hw.metadata.instance.network.addresses[] instead of hw.network.interfaces[]
# TODO: should hegel (or tink) automatically populate fields from root sources, for example metadata.instance.id from id
#       public/local ip addresses from network.addresses, etc?
# TODO: automatic hardware detection to avoid needing to manually populate metadata.instance.storage.disks[].device

cat > hardware-data-worker-1.json <<EOF
{
  "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
  "metadata": {
    "facility": {
      "facility_code": "onprem"
    },
    "userdata": "#cloud-config\nssh_authorized_keys:\n- ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCZaw/MNLTa1M93IbrpklSqm/AreHmLSauFvGJ1Q5OV5/pfyeusNoxDaOQlk3BzG3InmhWX4tk73GOBHO36ugpeorGg/fC4m+5rL42z2BND1o98Borb6x2pAGF11IcEM9m7c8k0gg9lP2OR4mDAq2BFrmJq8h77zk9LtpWEvFJfASx9iqv0s7uHdWjc3ERQ/fcgl8Lor/GYzSbvATO6StrwrLs/HusA5k9vDKyEGfGbxADMmxnnzaukqhuk8+SXf+Ni4kKReGkqjFI8uUeOLU/4sG5X5afTlW6+7KPZUhLSkZh6/bVY8m5B9AsV8M6yHEan48+258Q78lsu8lWhoscUYV49nyA61RveiBUExZYhi45jI3LUmGX3hHpVwfRMfgh0RjtrkCX8I6eSLCUX//Xu4WKkVMgQur2TLT+Nmpf4dwJgDX72nQmgbu/CHC4u2Y5FTWnHpeNLicOWecsHXxqs8U1K7rWguOfCiD/qtRhqp5Sz3m37/h/aGjGqvsa/DIc= [email protected]",
    "instance": {
      "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
      "hostname": "test-instance",
      "storage": {
        "disks": [{"device": "/dev/vda"}]
      }
    },
    "state": ""
  },
  "network": {
    "interfaces": [
      {
        "dhcp": {
          "arch": "x86_64",
          "ip": {
            "address": "192.168.1.5",
            "gateway": "192.168.1.1",
            "netmask": "255.255.255.248"
          },
          "mac": "08:00:27:00:00:01",
          "uefi": false
        },
        "netboot": {
          "allow_pxe": true,
          "allow_workflow": true
        }
      }
    ]
  }
}
EOF
docker exec -i deploy_tink-cli_1 tink hardware push < ./hardware-data-worker-1.json

cat > capi-stream-template.yml <<EOF
version: "0.1"
name: capi_provisioning
global_timeout: 6000
tasks:
  - name: "os-installation"
    worker: "{{.device_1}}"
    volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
    environment:
      MIRROR_HOST: 192.168.1.1
    actions:
      - name: "stream-image"
        image: image2disk:v1.0.0
        timeout: 90
        environment:
          IMG_URL: http://192.168.1.1:8080/ubuntu-1804-kube-v1.18.15.gz
          DEST_DISK: /dev/vda
          COMPRESSED: true
      - name: "add-tink-cloud-init-config"
        image: writefile:v1.0.0
        timeout: 90
        environment:
          DEST_DISK: /dev/vda1
          FS_TYPE: ext4
          DEST_PATH: /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
          UID: 0
          GID: 0
          MODE: 0600
          DIRMODE: 0700
          CONTENTS: |
            datasource:
              Ec2:
                metadata_urls: ["http://192.168.1.1:50061", "http://169.254.169.254:50061"]
            system_info:
              default_user:
                name: tink
                groups: [wheel, adm]
                sudo: ["ALL=(ALL) NOPASSWD:ALL"]
                shell: /bin/bash
      - name: "kexec-image"
        image: kexec:v1.0.0
        timeout: 90
        pid: host
        environment:
          BLOCK_DEVICE: /dev/vda1
          FS_TYPE: ext4
EOF
docker exec -i deploy_tink-cli_1 tink template create < ./capi-stream-template.yml

docker exec -i deploy_tink-cli_1 tink workflow create -t <TEMPLATE ID> -r '{"device_1":"08:00:27:00:00:01"}'

@displague
Member Author

That's excellent, @detiber!

Do you suppose we can close this issue given this success, or are we dependent on unreleased features?
Are there any additional or supporting features to investigate?
Do we need examples of other OSes taking advantage of this? (Ignition, Kickstart, other)?
Should we include these steps in Tinkerbell documentation?

@detiber
Contributor

detiber commented Mar 18, 2021

I definitely think we need to add some documentation; quite a bit of this isn't intuitive, such as:

  • Having to inject the datasource config into the image being booted
  • Needing to use a link-local address since networking isn't available when the userdata is pulled
  • Having to manually populate metadata.instance.id, metadata.instance.hostname
  • Not having access to networking information unless the metadata is populated manually, rather than having it auto-populated from the hardware network.interfaces configuration.

@cursedclock

@detiber Is the link-local address really needed? Shouldn't cloud-init just pull the metadata from 192.168.1.1:50061, since that's the IP listed in metadata_urls?

@displague
Member Author

@cursedclock What address should the device use to access the metadata, and how will that address be determined?

Link-local solves this problem with self-assigned addresses. It also suggests that the metadata service should use a well-known address like 169.254.169.254, which cloud-init uses as the default for various ds= values. Hegel provides basic ds=ec2 compatibility (2009-04-04), and using this address will avoid the need for additional kernel command-line arguments.

On the other hand, if we have to manipulate kernel command line arguments, we can likely provide the IP address in the same way.

This becomes more advantageous with direct Tinkerbell support in cloud-init.

@cursedclock

@displague I see; that means there would be no need for an action to modify the contents of /etc/cloud/cloud.cfg.d, right? Since the worker machine is expected to use the "default" address for pulling configuration metadata.

@displague
Member Author

Kernel args ds=ec2;metadata_urls=http://ip:port should work too. This is for cloud-init; Kickstart/Ignition take different arguments.

@tstromberg
Contributor

@nshalman has made a PoC to do cloud-init based installs. Can you comment on whether this issue can be closed?

@tstromberg tstromberg added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 27, 2021
@nshalman
Member

Alas, that was a hack using a nocloud partition on disk. And the code that we used is not currently upstream. Issue is still valid and open.

@nshalman nshalman added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 2, 2021
@nshalman nshalman removed their assignment Mar 8, 2022
@chrisdoherty4 chrisdoherty4 self-assigned this May 2, 2022
@chrisdoherty4
Member

chrisdoherty4 commented May 3, 2022

Amazon is using Hegel to provision with cloud-init and it's seemingly working. What makes us think this isn't working?

I'm verifying a few bits with cloud-init manually; once I have that data I'll include it here.

@chrisdoherty4
Member

chrisdoherty4 commented May 9, 2022

Some further investigation in #61 (comment) found disparities that need fixing.

Perhaps this issue can be closed in favor of discussion over there about redesign?

@displague
Member Author

We can close this. We can open another issue if there is more interest in introducing cloud-init support for ds=tinkerbell (or hegel) as a unique flavor of metadata, distinct from the EC2 flavor.
