Use community gallery for default VM images #5167

Merged

Conversation

mboersma
Contributor

@mboersma mboersma commented Oct 7, 2024

/kind feature

What this PR does / why we need it:

Use an Azure community image gallery as the default source for VM images.

Which issue(s) this PR fixes:

See also:

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Use community gallery for default VM images

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 7, 2024
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 7, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 7, 2024
@mboersma mboersma changed the title from "Use CAPZ community gallery for default VM images" to "[WIP] Use CAPZ community gallery for default VM images" Oct 7, 2024
@mboersma
Contributor Author

mboersma commented Oct 7, 2024

/test pull-cluster-api-provider-azure-ci-entrypoint

@mboersma
Contributor Author

mboersma commented Oct 8, 2024

/test pull-cluster-api-provider-azure-ci-entrypoint

Published Kubernetes v1.29.5 images and retrying. Update: passed!

@mboersma
Contributor Author

mboersma commented Oct 8, 2024

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks

@mboersma mboersma added this to the v1.18 milestone Oct 8, 2024
@mboersma
Contributor Author

mboersma commented Oct 8, 2024

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks

Edit: AKS tests passed, with BYO nodepool restored.

@mboersma
Contributor Author

mboersma commented Oct 8, 2024

/test pull-cluster-api-provider-azure-e2e

Published v1.28.14 images.

@mboersma
Contributor Author

mboersma commented Oct 9, 2024

Note to self: this also needs some documentation updates.

  • az commands to list the current images in the community gallery
  • high-level description of publishing process
  • where to ask if images aren't being replicated to a region you need

@mboersma
Contributor Author

mboersma commented Oct 9, 2024

Looking up a particular image is basically done by convention:

  • First, get the well-known, uniquely named community gallery.
  • Second, list the image definitions there (or just use "capi-ubun2-2404").
  • Third, use the k8s version as the image version ("1.31.1" is the name of a version, for example).

Given that, I don't think there's any need for a cached API client at all, so let's purge all that code.
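
A minimal Go sketch of the lookup-by-convention idea above, not the code in this PR. The function name `communityImageID` is hypothetical; the ID layout follows the `/CommunityGalleries/.../Images/.../Versions/...` path that appears in the error message quoted in this thread, and stripping a leading "v" from the Kubernetes version is an assumption based on "1.31.1" being the name of a version.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultPublicGalleryName is the community gallery public name quoted in this thread.
const defaultPublicGalleryName = "ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019"

// communityImageID (hypothetical helper) assembles an image version ID purely
// by convention: gallery name, image definition, and the Kubernetes version
// without its leading "v". No API call or cached client is involved.
func communityImageID(gallery, definition, k8sVersion string) string {
	version := strings.TrimPrefix(k8sVersion, "v")
	return fmt.Sprintf("/CommunityGalleries/%s/Images/%s/Versions/%s", gallery, definition, version)
}

func main() {
	fmt.Println(communityImageID(defaultPublicGalleryName, "capi-ubun2-2404", "v1.31.1"))
}
```

Because the ID is derived from three known strings, nothing needs to be fetched or cached to build it; whether the version actually exists in a given region is only known when Azure resolves it.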

@mboersma mboersma force-pushed the community-gallery-ref-images branch from d66f4df to 2a4e9ef Compare October 9, 2024 21:29
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2024
@mboersma
Contributor Author

mboersma commented Oct 9, 2024

/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify
/test pull-cluster-api-provider-azure-e2e


codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 93.54839% with 2 lines in your changes missing coverage. Please review.

Project coverage is 52.73%. Comparing base (0f48c52) to head (f05ef21).
Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| azure/services/virtualmachineimages/images.go | 93.10% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5167      +/-   ##
==========================================
- Coverage   52.98%   52.73%   -0.25%     
==========================================
  Files         273      270       -3     
  Lines       29197    29014     -183     
==========================================
- Hits        15469    15301     -168     
+ Misses      12926    12922       -4     
+ Partials      802      791      -11     


@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify
/test pull-cluster-api-provider-azure-e2e

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e

flake

@nojnhuh
Contributor

nojnhuh commented Oct 10, 2024

Flakes look mostly related to #5100. The last time I dug into that, I was only seeing failures in tests with multiple control plane nodes, whereas here at least one test is flaking with only one control plane node. Not sure how much weight to put behind that observation, though.

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-optional

@nojnhuh
Contributor

nojnhuh commented Oct 16, 2024

@mboersma Can we try rebuilding the Windows images using the same version of Windows that we're currently using in tests? It looks like we're currently using "Windows Server 2019 Datacenter" build 17763, whereas this PR switches us to an undated "Windows Server Datacenter" build 20348.

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@mboersma
Contributor Author

/retest

"Standard SKU IP limit reached in this region"

@nojnhuh
Contributor

nojnhuh commented Oct 25, 2024

Looks like this error is cropping up on the ci-entrypoint job:

The gallery image /CommunityGalleries/ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019/Images/capi-win-2022-containerd/Versions/1.29.10 is not available in northeurope region. Please contact image owner to replicate to this region, or change your requested region.

And should that be using Windows 2019 instead of 2022?

@mboersma
Contributor Author

> should that be using Windows 2019 instead of 2022?

If it doesn't run the curl-to-ilb test, we can probably get away with 2022 here. Thanks for finding that--indeed I hadn't populated that image yet, but I'll fill those in and retry.

@mboersma
Contributor Author

/retest

Contributor

@nojnhuh nojnhuh left a comment

/lgtm

@@ -56,6 +56,12 @@ const (
DefaultImagePublisherID = "cncf-upstream"
// LatestVersion is the image version latest.
LatestVersion = "latest"
// DefaultPublicGalleryName is the default Azure Compute Gallery.
DefaultPublicGalleryName = "ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019"
Contributor

Parroting @jsturtevant, are you planning to open a k8s.io PR to add this new gallery to the Terraform?

Contributor Author

@mboersma mboersma Oct 25, 2024

Yes indeed, but I haven't started that yet. That's the third piece of the puzzle, in addition to this and kubernetes-sigs/image-builder#1578.

Edit: terraform PR is at kubernetes/k8s.io#7461

Contributor

does this gallery name relate to anything in here?

kubernetes/k8s.io#7461

Contributor Author

It doesn't--I don't think there's a way to recreate a community gallery with the same unique name if it were to be deleted. It gets created for you, starting with your specified prefix, when you actually share the gallery to the world.

So AFAICT it isn't / shouldn't be part of the terraform for the gallery.

Contributor Author

This implies that if the gallery were deleted, we'd have to change code to point to a new gallery with the same name prefix. That's not great. I can't add a "Delete" lock on the resource, because that has the side effect of preventing new image definitions from being added, old versions from being deleted, etc.

Contributor

So once we apply the terraform we will change the value here to reflect the unique name that it gets?

Contributor Author

@mboersma mboersma Oct 28, 2024

It's already got a unique name by virtue of me recreating everything in the CNCF community sub and then reverse-engineering the terraform. AFAICT, the unique name it gets is stored under its sharingProfile which isn't in the terraform I generated...but maybe can be added?

% az sig show -g cluster-api-gallery --gallery-name community_gallery -o yaml
description: Shared image gallery for Cluster API Provider Azure
id: /subscriptions/46678f10-4bbb-447e-98e8-d2829589f2d8/resourceGroups/cluster-api-gallery/providers/Microsoft.Compute/galleries/community_gallery
identifier:
  uniqueName: 46678f10-4bbb-447e-98e8-d2829589f2d8-COMMUNITY_GALLERY
location: northcentralus
name: community_gallery
provisioningState: Succeeded
resourceGroup: cluster-api-gallery
sharingProfile:
  communityGalleryInfo:
    communityGalleryEnabled: true
    eula: https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-azure/main/LICENSE
    publicNamePrefix: ClusterAPI
    publicNames:
    - ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019
    publisherContact: [email protected]
    publisherUri: https://github.com/kubernetes-sigs/cluster-api-provider-azure
  groups: null
  permissions: Community
sharingStatus: null
softDeletePolicy: null
tags:
  DO-NOT-DELETE: UpstreamInfra
  DateCreated: 10/24/2024
  creationTimestamp: '2024-10-24T17:36:37Z'
  jobName: image-builder-sig-ubuntu-2404
type: Microsoft.Compute/galleries
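
As a rough Go sketch of reading that value programmatically: the public name sits under `sharingProfile.communityGalleryInfo.publicNames` in the output above. The snippet assumes the same command run with `-o json` emits the same keys as the YAML shown, and the type and function names (`galleryInfo`, `publicGalleryName`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// galleryInfo mirrors only the fields needed from the gallery description;
// every other key in the az output is ignored by the decoder.
type galleryInfo struct {
	SharingProfile struct {
		CommunityGalleryInfo struct {
			PublicNames []string `json:"publicNames"`
		} `json:"communityGalleryInfo"`
	} `json:"sharingProfile"`
}

// publicGalleryName (hypothetical helper) returns the first public name of a
// community-shared gallery, or an error if the gallery has none.
func publicGalleryName(azJSON []byte) (string, error) {
	var g galleryInfo
	if err := json.Unmarshal(azJSON, &g); err != nil {
		return "", fmt.Errorf("decoding gallery JSON: %w", err)
	}
	names := g.SharingProfile.CommunityGalleryInfo.PublicNames
	if len(names) == 0 {
		return "", errors.New("gallery has no public names; is it shared to the community?")
	}
	return names[0], nil
}

func main() {
	// Trimmed-down sample containing only the relevant keys from the output above.
	sample := []byte(`{"sharingProfile":{"communityGalleryInfo":{"publicNames":["ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019"]}}}`)
	name, err := publicGalleryName(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```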

Contributor Author

@mboersma mboersma Oct 29, 2024

But from the point of view of CAPZ's code here, this is the unique identifier we need to access images. It shouldn't change except in a disaster recovery case where we have to set up a new community gallery.

Contributor Author

I could try to refactor this away from being a constant into something created at runtime that the user could override if needed. (Or I can follow up with that change after this merges.)

Contributor

If users can set the gallery name in the API, I don't think we need a separate toggle to change the default for now.
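
To illustrate that point, here is a minimal Go sketch of how a user-supplied gallery name could take precedence over the constant. `imageSpec` and `galleryFor` are hypothetical stand-ins for illustration, not CAPZ's actual API types or functions.

```go
package main

import "fmt"

// defaultPublicGalleryName is the default community gallery from the diff above.
const defaultPublicGalleryName = "ClusterAPI-f72ceb4f-5159-4c26-a0fe-2ea738f0d019"

// imageSpec is a hypothetical, simplified stand-in for the part of the
// machine API where a user can name a gallery explicitly.
type imageSpec struct {
	Gallery string
}

// galleryFor returns the user's gallery when one is set and falls back to the
// default otherwise, so no separate toggle is needed to change the default.
func galleryFor(spec imageSpec) string {
	if spec.Gallery != "" {
		return spec.Gallery
	}
	return defaultPublicGalleryName
}

func main() {
	fmt.Println(galleryFor(imageSpec{}))
	fmt.Println(galleryFor(imageSpec{Gallery: "MyGallery-00000000-0000-0000-0000-000000000000"}))
}
```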

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 35cefabcb48d7a1af907593bc20d4c4223cf677c

@mboersma
Contributor Author

I'd like to add that overall, working with Azure compute galleries has been much easier than publishing to the Azure Marketplace.

  • The structure is transparent and can be navigated like normal Azure resources.
  • The publishing pipeline is simpler and involves only az for tooling.
  • We can publish and replicate a new image in about an hour, compared to a minimum of ten hours (and often much more) with the Marketplace approach.
  • Cross-tenant security issues are easier to work through when there's just one subscription (previously we needed two, in different tenants).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2024
Contributor

@nojnhuh nojnhuh left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 1cd4768c6b1902007483a4cb4d2107857b62c9cc

Member

@nawazkh nawazkh left a comment

/lgtm

@nojnhuh
Contributor

nojnhuh commented Oct 30, 2024

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2024
@k8s-ci-robot k8s-ci-robot merged commit ea9ff03 into kubernetes-sigs:main Oct 30, 2024
33 checks passed
@mboersma mboersma deleted the community-gallery-ref-images branch October 30, 2024 20:58
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

  • Share VM images via Azure Compute Galleries
  • Consider offering reference images in Community Gallery
5 participants