Cirrus-CI persistent worker pool management #158

Merged
merged 1 commit into from
Nov 15, 2023

Conversation

@cevich cevich commented Nov 10, 2023

Implement a set of scripts to help with management of a Cirrus-CI persistent worker pool of M1 Mac instances on AWS EC2.

  • Implement script to help monitor a set of M1 Mac dedicated hosts, creating new instances as slots become available.

  • Implement a script to help monitor M1 Mac instances, deploying and executing a setup script on newly created instances.

  • Implement a ssh-helper script for humans, to quickly access instances based on their EC2 instance ID.

  • Implement a setup script intended to run on M1 Macs, to help configure and join them to a pre-existing worker pool.

  • Implement a helper script intended to run on M1 Macs, to support developers with a CI-like environment.

  • At this time, all scripts are intended for manual/human-supervised use. Future commits may improve this and/or better support use inside automation.

  • Add very basic/expedient documentation.

N/B: The majority of this content, including the EC2-side setup has been developed in a rush. There are very likely major architecture, design, and scripting bugs and shortfalls. Some of these may be addressed in future commits.

@cevich cevich requested a review from edsantiago November 10, 2023 17:57

github-actions bot commented Nov 10, 2023

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's e8abd2ec2be030bfe8a6df9b4abba6edd9e753d2.

cevich commented Nov 10, 2023

Example output:

LaunchInstances.sh

$ ./LaunchInstances.sh
Working on Dedicated Host/Instance 'MacM1-4' for HostID 'h-05be30b208b58a4e6'.
Parsing new or existing instance details.

Working on Dedicated Host/Instance 'MacM1-1' for HostID 'h-017029f02ffa8ad6b'.
Dedicated host tag 'PWPoolReady' == 'false' != 'true'.

Working on Dedicated Host/Instance 'MacM1-5' for HostID 'h-05e619438b9cd6b30'.
Dedicated host tag 'PWPoolReady' == 'false' != 'true'.

Working on Dedicated Host/Instance 'MacM1-2' for HostID 'h-02dd1cd411dc692cc'.
Parsing new or existing instance details.

Working on Dedicated Host/Instance 'MacM1-3' for HostID 'h-00a912e75d556cdaa'.
Dedicated host tag 'PWPoolReady' == 'false' != 'true'.

Working on Dedicated Host/Instance 'MacM1-6' for HostID 'h-09248a5e62807ab54'.
Parsing new or existing instance details.

Working on Dedicated Host/Instance 'MacM1-brent' for HostID 'h-0b5af7819816b0b13'.
Dedicated host tag 'PWPoolReady' == 'false' != 'true'.

Processing all host and instance states.

pw_status.txt

$ cat pw_status.txt
# LaunchInstances.sh run 2023-11-10T17:52:27+00:00
#
MacM1-4 i-007c60b21cd872dea 2023-11-09T21:48:15+00:00
# MacM1-1 HOST DISABLED: PWPoolReady==false
# MacM1-5 HOST DISABLED: PWPoolReady==false
MacM1-2 i-0517fd7e8d962a4f3 2023-11-09T21:47:35+00:00
# MacM1-3 HOST DISABLED: PWPoolReady==false
MacM1-6 i-0992b66f4e272327c 2023-11-09T21:47:55+00:00
# MacM1-brent HOST DISABLED: PWPoolReady==false
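
The status file above appears to mix comment lines (the run header and the disabled-host notes) with three-field data lines (name, instance ID, launch time). As a rough sketch of how a consumer might read it, assuming that layout holds, the following hypothetical helpers (`active_instances` and `disabled_hosts` are illustrative names, not functions from this PR) separate the two kinds of lines:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the PR: they assume the pw_status.txt
# layout shown in the example output above.

# Print "name instance-id launch-time" for every active entry.
# Comment lines (starting with '#') are skipped, which also skips
# the "HOST DISABLED" notes.
active_instances() {
    local status_file="$1"
    awk '!/^#/ && NF == 3 { print $1, $2, $3 }' "$status_file"
}

# Print the name of every host recorded as disabled.
disabled_hosts() {
    local status_file="$1"
    awk '/^# .* HOST DISABLED:/ { print $2 }' "$status_file"
}
```

For example, `active_instances pw_status.txt | wc -l` would count the instances a later setup pass needs to visit.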

InstanceSSH.sh

$ ./InstanceSSH.sh i-007c60b21cd872dea
Warning: Permanently added 'ec2-44-200-37-61.compute-1.amazonaws.com' (ED25519) to the list of known hosts.
Last login: Fri Nov 10 17:55:29 2023 from 99.149.127.221

    ┌───┬──┐   __|  __|_  )
    │ ╷╭╯╷ │   _|  (     /
    │  └╮  │  ___|\___|___|
    │ ╰─┼╯ │  Amazon EC2
    └───┴──┘  macOS Sonoma 14.0


The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
ip-172-31-13-105:~ ec2-user$ exit
logout
Connection to ec2-44-200-37-61.compute-1.amazonaws.com closed.

SetupInstances.sh

$ ./SetupInstances.sh
 Operating on 3 instances from # LaunchInstances.sh run 2023-11-10T17:52:27+00:00

    Working on Instance #1/3 'MacM1-4' with ID 'i-007c60b21cd872dea' launched on '2023-11-09T21:48:15+00:00'.
    Verifying lifetime <3 days
    Looking up public DNS
    Attepting to contact 'MacM1-4' at ec2-44-200-37-61.compute-1.amazonaws.com
    Checking state of instance
    Setting up new instance
    Setup script started

    Working on Instance #2/3 'MacM1-2' with ID 'i-0517fd7e8d962a4f3' launched on '2023-11-09T21:47:35+00:00'.
    Verifying lifetime <3 days
    Looking up public DNS
    Attepting to contact 'MacM1-2' at ec2-44-192-85-246.compute-1.amazonaws.com
    Checking state of instance
    Instance setup has completed
    Cirrus worker listener process is running

    Working on Instance #3/3 'MacM1-6' with ID 'i-0992b66f4e272327c' launched on '2023-11-09T21:47:55+00:00'.
    Verifying lifetime <3 days
    Looking up public DNS
    Attepting to contact 'MacM1-6' at ec2-3-230-118-215.compute-1.amazonaws.com
    Checking state of instance
    Instance setup has completed
    Cirrus worker listener process is running

SetupInstances.sh again right away

$ ./SetupInstances.sh
 Operating on 3 instances from # LaunchInstances.sh run 2023-11-10T17:52:27+00:00

    Working on Instance #1/3 'MacM1-4' with ID 'i-007c60b21cd872dea' launched on '2023-11-09T21:48:15+00:00'.
    Verifying lifetime <3 days
    Looking up public DNS
    Attepting to contact 'MacM1-4' at ec2-44-200-37-61.compute-1.amazonaws.com
    Checking state of instance
    Setup not complete;  Now: 2023-11-10T17:55:38+00:00
    Setup running since:      2023-11-10T17:54:02+00:00
...cut...

SetupInstances.sh again after some minutes

$ ./SetupInstances.sh
 Operating on 3 instances from # LaunchInstances.sh run 2023-11-10T17:52:27+00:00

    Working on Instance #1/3 'MacM1-4' with ID 'i-007c60b21cd872dea' launched on '2023-11-09T21:48:15+00:00'.
    Verifying lifetime <3 days
    Looking up public DNS
    Attepting to contact 'MacM1-4' at ec2-44-200-37-61.compute-1.amazonaws.com
    Checking state of instance
    Instance setup has completed
    Cirrus worker listener process is running
...cut...
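
The "Verifying lifetime <3 days" and "Setup running since" lines both come down to comparing ISO-8601 timestamps. A minimal sketch of that arithmetic follows, assuming GNU `date` (the `-d` option; BSD/macOS `date` needs different options). The function names and the way the 3-day limit is passed are illustrative assumptions, not the PR's actual implementation:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not from the PR. Assumes GNU date (-d).

# Print the whole seconds elapsed between an ISO-8601 timestamp ($1)
# and "now" ($2, defaulting to the current UTC time).
age_seconds() {
    local then_ts="$1" now_ts="${2:-$(date -u -Iseconds)}"
    echo $(( $(date -u -d "$now_ts" +%s) - $(date -u -d "$then_ts" +%s) ))
}

# Succeed when the instance is younger than max_days
# (default 3, mirroring the "<3 days" log line).
within_lifetime() {
    local launched="$1" now_ts="$2" max_days="${3:-3}"
    (( $(age_seconds "$launched" "$now_ts") < max_days * 86400 ))
}
```

With the timestamps from the output above, `age_seconds 2023-11-10T17:54:02+00:00 2023-11-10T17:55:38+00:00` prints `96`, i.e. the setup script had been running for a little over a minute and a half.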

Example log content (some editing done to reduce length)

$ ./InstanceSSH.sh i-007c60b21cd872dea cat setup.log
Warning: Permanently added 'ec2-44-200-37-61.compute-1.amazonaws.com' (ED25519) to the list of known hosts.
##### Configuring paths
##### Installing podman-machine, testing, and CI deps. (~2m install time)
Running `brew update --auto-update`...

...cut...
==> Fetching go
==> Downloading https://ghcr.io/v2/homebrew/core/go/blobs/sha256:e3b1b54314a26125d0dc830958acd92f496b0dbbbc2715432625c3654ae755fc
==> Downloading https://ghcr.io/v2/homebrew/core/go-md2man/manifests/2.0.3
==> Fetching go-md2man
...cut...
==> Pouring go--1.21.4.arm64_sonoma.bottle.tar.gz
🍺  /opt/homebrew/Cellar/go/1.21.4: 12,539 files, 241.2MB
==> Running `brew cleanup go`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
...cut...
##### Adding/Configuring PW User
2023-11-10 17:55:45.241 sysadminctl[7338:128167] ----------------------------
2023-11-10 17:55:45.241 sysadminctl[7338:128167] No clear text password or interactive option was specified (adduser, change/reset password will not allow user to use FDE) !
2023-11-10 17:55:45.241 sysadminctl[7338:128167] ----------------------------
2023-11-10 17:55:45.429 sysadminctl[7338:128167] Creating user record…
2023-11-10 17:55:45.478 sysadminctl[7338:128167] Assigning UID: 502 GID: 20
2023-11-10 17:55:45.525 sysadminctl[7338:128167] Creating home directory at /Users/MacM1-4-worker
##### Starting listener supervisor process
##### Listener started at 2023-11-10T17:55:46+00:00
##### 2023-11-10T17:55:46+00:00 Starting PW pool listener as MacM1-4-worker
##### 2023-11-10T17:57:16+00:00 Pool listener watcher process tick.
...cut...
##### 2023-11-10T18:06:17+00:00 Pool listener watcher process tick.

Comment on lines 30 to 32
# For whatever reason, when this script is run through ssh, the default
# environment isn't loaded automatically.
. /etc/profile
Member Author

Bug: This needs to move down under the "Configuring paths" section below.

@edsantiago edsantiago left a comment

Sorry, still reviewing off and on as time allows. Submitting minor nits in case you need to push again.

I can't claim to understand more than a fraction of it, but I can say that I appreciate all the work you've put into detecting corner cases and making those friendlier.


dbg() {
    if ((L_DEBUG)); then
        echo "${1:-No debug message provided}" | awk -e '{print "DEBUG: "$0}' > /dev/stderr
Member

I'm having trouble understanding the need for awk here and lines 49-50. Does this not work on Mac?

    echo "DEBUG: ${*:-No debug message provided}" >&2

Member Author

Two notes (hopefully they help):

  • LaunchInstances.sh and SetupInstances.sh are intended to run on my laptop for now, not on the Macs. Only setup.sh and service_pool.sh are targeted to run on the Macs.
  • The awk is there because there are/were sometimes multi-line calls to dbg/warn/die. The awk simply makes it clear all those lines are for debugging, warning, or death. As opposed to something else like a script syntax error.
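
To illustrate the multi-line point, here is a small sketch of a prefixing helper in the same spirit. It is a simplification written for this example, not the PR's code: `_prefix_lines` is a hypothetical name, and it uses plain `awk -v` rather than the `awk -e` seen in the reviewed snippet so that it does not depend on gawk.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; _prefix_lines is a hypothetical helper.
L_DEBUG="${L_DEBUG:-0}"

# Prefix every line of a (possibly multi-line) message with "$1: ".
_prefix_lines() {
    local prefix="$1"
    shift
    printf '%s\n' "$*" | awk -v p="$prefix" '{ print p ": " $0 }'
}

# Emit a debug message on stderr, but only when L_DEBUG is non-zero.
dbg() {
    if ((L_DEBUG)); then
        _prefix_lines DEBUG "${*:-No debug message provided}" >&2
    fi
}
```

A plain `echo "DEBUG: $*" >&2` would label only the first line of a multi-line message; piping through awk labels every line, which is the behaviour described above.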

mac_pw_pool/LaunchInstances.sh Outdated

fi

msg "Adding/Configuring PW User"
PWINST=$(curl -sSLf http://instance-data/latest/meta-data/tags/instance/Name)
Member

Assuming that this is what explains the magic instance-data, would you mind adding a comment here (and the other two) with a quick explanation and a link?

Member Author

Two things for this:

  1. There is no cloud-init support on Macs (they use this instead). Remember Macs are each special unique rainbow snowflakes in a walled garden of unicorns, so Apple can ensure maximum profit and developer grief 🤣
  2. I dug this magic out of stackoverflow somewhere. AWS docs can be hard on the eyes, they don't like to put everything in one place. I think perhaps either this or this reference is maybe the most useful. Or maybe both? I have no idea how they resolve "instance-data" to 169.254.169.254 it must be magic.
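
For readers unfamiliar with the metadata lookup being discussed, the sketch below shows the same request made against the documented link-local address 169.254.169.254 instead of the `instance-data` name. It is an illustration under assumptions: `imds_url`, `instance_name`, and the `IMDS_HOST` override are hypothetical names introduced here, the call only works from inside an EC2 instance with instance tags exposed in metadata, and an instance that enforces IMDSv2 would additionally require a session token, which this sketch does not handle.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not from the PR.

# Build an instance-metadata URL for the given path.
# IMDS_HOST is an illustrative override (e.g. "instance-data");
# it defaults to the link-local metadata address.
imds_url() {
    local path="${1#/}"    # drop one leading slash, if present
    echo "http://${IMDS_HOST:-169.254.169.254}/latest/meta-data/${path}"
}

# Fetch this instance's Name tag. Only meaningful on an EC2 instance
# that exposes tags through the metadata service.
instance_name() {
    curl -sSLf --max-time 5 "$(imds_url tags/instance/Name)"
}
```

Setting `IMDS_HOST=instance-data` reproduces the URL used in the reviewed line, so the two forms can be compared side by side on an instance.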

cevich commented Nov 13, 2023

Sorry, still reviewing off and on as time allows. Submitting minor nits in case you need to push again.

SGTM. I can incorporate them as they come in, no prob. I know it's a lot to review and the whole situation is messy. BTW, I fully expect to abandon all this work once CRC's management solution is workable for podman PR-level testing (could be months, could be never, hard to say ATM).

I can't claim to understand more than a fraction of it, but I can say that I appreciate all the work you've put into detecting corner cases and making those friendlier.

Aww thanks, I appreciate it. Yeah, and unf. I'm sure we're going to encounter more weirdness over time 😖 I spent a few days trying to find ways to NOT use the aws CLI, but everything I tried fell short on the "works with dedicated hosts" part 😞

@edsantiago
Member

Okay... thanks for your responses. I don't think I can review any more today... so, LGTM.

cevich commented Nov 13, 2023

Okay... thanks for your responses. I don't think I can review any more today... so, LGTM.

Great, force-pushing changes from today then. Mostly de-duplicating what I could into a pw_lib.sh, cleanup of output, and addition of a time-check so instances aren't launched onto dedicated hosts too close together (reasons explained in comment).
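
The time-check mentioned above could take many forms; the PR's actual logic and threshold live in the scripts and are not reproduced here. As a purely illustrative sketch, a guard of this kind might compare the most recent launch time with the current time, both as epoch seconds. The function name and the 600-second default are made-up placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical guard, not the PR's implementation.

# Succeed (exit 0) when at least min_gap seconds have passed since the
# last launch. Arguments: last launch and current time as epoch seconds,
# then an optional minimum gap (600 is an arbitrary placeholder).
launch_gap_ok() {
    local last_launch="$1" now="$2" min_gap="${3:-600}"
    (( now - last_launch >= min_gap ))
}
```

A caller could then skip a host for the current pass with something like `launch_gap_ok "$last" "$(date +%s)" || continue`.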

cevich commented Nov 13, 2023

Force-push: Comment typo fix.

mac_pw_pool/setup.sh Outdated

cevich commented Nov 14, 2023

@edsantiago I've got fixes to the above and a handful of "polish" items queued up for pushing. I don't want to disturb you if you're already taking another look, so please LMK when it's safe to push again.

@edsantiago
Member

Go ahead, I'm not reviewing this atm. Thanks for checking.

Implement a set of scripts to help with management of a Cirrus-CI
persistent worker pool of M1 Mac instances on AWS EC2.

* Implement script to help monitor a set of M1 Mac dedicated hosts,
  creating new instances as slots become available.

* Implement a script to help monitor M1 Mac instances, deploying
  and executing a setup script on newly created instances.

* Implement a ssh-helper script for humans, to quickly access
  instances based on their EC2 instance ID.

* Implement a setup script intended to run on M1 Macs, to help
  configure and join them to a pre-existing worker pool.

* Implement a helper script intended to run on M1 Macs, to
  support developers with a CI-like environment.

* At this time, all scripts are intended for manual/human-supervised
  use.  Future commits may improve this and/or better support use
  inside automation.

* Add very basic/expedient documentation.

N/B: The majority of this content, including the EC2-side setup has
been developed in a rush.  There are very likely major architecture,
design, and scripting bugs and shortfalls.  Some of these may be
addressed in future commits.

Signed-off-by: Chris Evich <[email protected]>
@cevich cevich marked this pull request as ready for review November 14, 2023 18:53
cevich commented Nov 14, 2023

Force-push: Fixed a few bugs, improved spelling and other comment-tweaks. Removed draft-status as this iteration appears to be working great. Ready for final review.

github-actions bot commented Nov 14, 2023

Successfully triggered github-actions/success task to indicate successful run of cirrus-ci_retrospective integration and unit testing from this PR's aba52cf01f4c0db40c0fe2f500f584988eed1792.

@edsantiago
Member

range-diff review LGTM.

cevich commented Nov 15, 2023

Thanks a bunch Ed.

@cevich cevich merged commit d41b345 into containers:main Nov 15, 2023
9 checks passed