Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add aks cluster extension feature azurerm_kubernetes_cluster_extension #805

Closed

Conversation

drew0ps
Copy link
Contributor

@drew0ps drew0ps commented Aug 29, 2024

Description of your changes

Added the resource azurerm_kubernetes_cluster_extension which is already present in the underlying terraform release but was not made available through the provider.

Fixes #731

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Manual tests
  • [N/A] Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Manual Test according to the process ran through. Caveat: Used clusterId instead of a clusterIdSelector.

Creation:

k get KubernetesClusterExtension/example-flux KubernetesCluster/example ResourceGroup/example
NAME                                                                               SYNCED   READY   EXTERNAL-NAME   AGE
kubernetesclusterextension.kubernetesconfiguration.azure.upbound.io/example-flux   True     True    example-flux    10m

NAME                                                          SYNCED   READY   EXTERNAL-NAME   AGE
kubernetescluster.containerservice.azure.upbound.io/example   True     True    example         108m

NAME                                     SYNCED   READY   EXTERNAL-NAME   AGE
resourcegroup.azure.upbound.io/example   True     True    example         24h

Deletion:

NAME                                                                               SYNCED   READY   EXTERNAL-NAME   AGE
kubernetesclusterextension.kubernetesconfiguration.azure.upbound.io/example-flux   True     False   example-flux    24m

Resource is in k8s until the external resource is deleted as expected.

Edit:

This does not seem to work however. I added certain diffs under non-immutable field configurationSettings. The monolith says the following debug messages:

Successfully requested update of external resource      {"controller": "managed/kubernetesconfiguration.azure.upbound.io/v1beta1, kind=kubernetesclusterextension", "request": {"name":"example-flux"}, "uid": "redacted", "version": "222673", "external-name": "example-flux", "requeue-after": "2024-08-30T21:37:50+02:00"}
Async update starting...        {"trackerUID": "redacted", "resourceName": "example-flux", "gvk": "kubernetesconfiguration.azure.upbound.io/v1beta1, Kind=KubernetesClusterExtension", "tfID": "/subscriptions/redacted/resourceGroups/example/providers/Microsoft.ContainerService/managedClusters/example/providers/Microsoft.KubernetesConfiguration/extensions/example-flux"}
provider-azure  Updating the external resource  {"uid": "redacted", "name": "example-flux", "gvk": "kubernetesconfiguration.azure.upbound.io/v1beta1, Kind=KubernetesClusterExtension"}
provider-azure  Async update ended.     {"trackerUID": "redacted", "resourceName": "example-flux", "gvk": "kubernetesconfiguration.azure.upbound.io/v1beta1, Kind=KubernetesClusterExtension", "error": null, "tfID": "/subscriptions/test/resourceGroups/example/providers/Microsoft.ContainerService/managedClusters/example/providers/Microsoft.KubernetesConfiguration/extensions/example-flux"}

@drew0ps drew0ps force-pushed the add-aks-extensions-feature branch 2 times, most recently from 06797ad to 30726e8 Compare August 29, 2024 15:12
@drew0ps drew0ps marked this pull request as draft August 29, 2024 16:33
@drew0ps
Copy link
Contributor Author

drew0ps commented Aug 30, 2024

Pipeline run failure
It seems that the cluster reference makes it's way into the extension CR eventually, only quite late. 20 minutes into the pipeline run I still got:

begin time: logger.go:42: 20:13:59 | case | Creating namespace: kuttl-test-dashing-lark

logger.go:42: 20:31:56 | case/0-apply | 
 message: 'cannot resolve references: mg.Spec.ForProvider.ClusterID: referenced
field was empty (referenced resource may not yet be ready)'
reason: ReconcileError

Then eventually the field gets populated:

    logger.go:42: 20:32:19 | case/0-apply |   spec:
    logger.go:42: 20:32:19 | case/0-apply |     deletionPolicy: Delete
    logger.go:42: 20:32:19 | case/0-apply |     forProvider:
    logger.go:42: 20:32:19 | case/0-apply |       clusterId: /subscriptions/2895a7df-ae9f-41b8-9e78-3ce4926df838/resourceGroups/example/providers/Microsoft.ContainerService/managedClusters/example
    logger.go:42: 20:32:19 | case/0-apply |       clusterIdRef:
    logger.go:42: 20:32:19 | case/0-apply |         name: example

But then there is not enough time for the extension to get created:
logger.go:42: 20:34:01 | case/0-apply | test step failed 0-apply

In this run the field wasn't populated whatsoever.

Copy link
Collaborator

@turkenf turkenf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @drew0ps, I left a few comments for you to consider.

@turkenf
Copy link
Collaborator

turkenf commented Sep 3, 2024

/test-examples="examples/containerservice/v1beta1/kubernetescluster.yaml"

@turkenf
Copy link
Collaborator

turkenf commented Sep 4, 2024

Regardless of the problem of not being able to create the Kubernetes cluster, what we need to pay attention to in this PR is the group name of the resource we added. We should add r.ShortGroup = "containerservice" configuration, see example:

r.ShortGroup = "containerservice"

Could you also squash your commits? It would help keep the commit history cleaner and more organized.

@drew0ps drew0ps force-pushed the add-aks-extensions-feature branch 3 times, most recently from 5537229 to 45d3e5a Compare September 4, 2024 12:33
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 4, 2024

Shortgroup config fixed here, commits squashed. Will squash them again when it's ready for merge. Regenerating code and testing again.

@drew0ps drew0ps force-pushed the add-aks-extensions-feature branch 2 times, most recently from f9d30ae to 9f4f9dd Compare September 4, 2024 15:11
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 5, 2024

@turkenf I ran uptest locally with make e2e and that successfully created my resources - although it seems make e2e doesn't work as I expected as it produces an error

 k get ResourceGroup/example KubernetesCluster/example KubernetesClusterExtension/example
NAME                                            SYNCED   READY   EXTERNAL-NAME    AGE
resourcegroup.azure.upbound.io/example-rg-kce   True     True    example-rg-kce   17m

NAME                                                                 SYNCED   READY   EXTERNAL-NAME    AGE
kubernetescluster.containerservice.azure.upbound.io/example-kc-kce   True     True    example-kc-kce   17m

NAME                                                                       SYNCED   READY   EXTERNAL-NAME   AGE
kubernetesclusterextension.containerservice.azure.upbound.io/example-kce   True     True    example-kce     17m

And the error it produces:

    logger.go:42: 19:30:37 | case/0-apply | running command: [/Users/thisisme/codebase/mycompany/provider-test/provider-upjet-azure/.cache/tools/darwin_x86_64/kubectl-v1.24.3 annotate managed --all upjet.upbound.io/test=true --overwrite]
    logger.go:42: 19:30:45 | case/0-apply | error: the server doesn't have a resource type "managed"
    logger.go:42: 19:30:45 | case/0-apply | command failure, skipping 5 additional commands

Opened an issue on this

@drew0ps drew0ps marked this pull request as ready for review September 5, 2024 18:17
@turkenf
Copy link
Collaborator

turkenf commented Sep 6, 2024

/test-examples="examples/containerservice/v1beta1/kubernetesclusterextension.yaml"

@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 6, 2024

I stopped the kind cluster and recreated it with make e2e - now I get a different behaviour. My resources are still created successfully:

k get ResourceGroup/example-rg-kce KubernetesCluster/example-kc-kce KubernetesClusterExtension/example-kce
NAME                                            SYNCED   READY   EXTERNAL-NAME    AGE
resourcegroup.azure.upbound.io/example-rg-kce   True     True    example-rg-kce   37m

NAME                                                                 SYNCED   READY   EXTERNAL-NAME    AGE
kubernetescluster.containerservice.azure.upbound.io/example-kc-kce   True     True    example-kc-kce   37m

NAME                                                                       SYNCED   READY   EXTERNAL-NAME   AGE
kubernetesclusterextension.containerservice.azure.upbound.io/example-kce   True     True    example-kce     37m

but I no longer see the error message that is complaining about a missing resource. Instead, it just hangs and then times out:

logger.go:42: 10:44:38 | case/0-apply | running command: [/Users/me/codebase/company/provider-test/provider-upjet-azure/.cache/tools/darwin_x86_64/kubectl-v1.24.3 annotate managed --all upjet.upbound.io/test=true --overwrite]
    logger.go:42: 11:04:38 | case/0-apply | command failure, skipping 5 additional commands
    logger.go:42: 11:04:38 | case/0-apply | test step failed 0-apply
    case.go:361: failed in step 0-apply
    case.go:363: command "${KUBECTL} annotate managed --all upjet.upbound.io/test=true --overwrite" exceeded 1200 sec timeout, context deadline exceeded
I0906 11:04:39.337534    8357 request.go:655] Throttling request took 1.038372665s, request: GET:https://127.0.0.1:55370/apis/streamanalytics.azure.upbound.io/v1beta1?timeout=32s
    logger.go:42: 11:04:46 | case | Failed to collect events for case in ns kuttl-test-deciding-reindeer: no matches for kind "Event" in version "events.k8s.io/v1beta1"
    logger.go:42: 11:04:46 | case | Trying with events eventsv1 API...
    logger.go:42: 11:04:46 | case | case events from ns kuttl-test-deciding-reindeer:

@turkenf
Copy link
Collaborator

turkenf commented Sep 6, 2024

Fixed issue creating Kubernetes Cluster resource. Could you please add the uptest.upbound.io/timeout: "3600" annotation to the KubernetesClusterExtension resource and re-trigger the uptest?

@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 6, 2024

@turkenf Thanks for fixing the AKS deployment topic. I'm afraid this would still not unblock this particular PR - the extension is stuck in creating state seemingly indefinitely. This is really weird because this issue is not present when I apply with make run or even with make e2e - (even though the annotation command seems to be broken in the make e2e case therefore local uptests can never succeed.)

@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 9, 2024

@turkenf Thank you for the fix regarding the missing error messages, indeed the last run is more vocal about the issue:

message: "async create failed: failed to create the resource: [***0 creating Scoped
 Extension (Scope: \"/subscriptions/2895a7df-ae9f-41b8-9e78-3ce4926df838/resourceGroups/example-rg-kce/providers/Microsoft.ContainerService/managedClusters/example-kc-kce\"\nExtension
Name: \"example-kce\"): performing Create: Put \"https://management.azure.com/subscriptions/2895a7df-ae9f-41b8-9e78-3ce4926df838/resourceGroups/example-rg-kce/providers/Microsoft.ContainerService/managedClusters/example-kc-kce/providers/Microsoft.KubernetesConfiguration/extensions/example-kce?api-version=2022-11-01\":
The Resource Provider was not registered\n\nResource Providers (APIs) in Azure
eed to be registered before they can be used - however the Resource\nProvider
was not registered, and calling the API returned the following error:\n\nThe
subscription is not registered to use namespace 'Microsoft.KubernetesConfiguration'.
 See  https://aka.ms/rps-not-found for how to register subscriptions.

@turkenf
Copy link
Collaborator

turkenf commented Sep 10, 2024

/test-examples="examples/containerservice/v1beta1/kubernetesclusterextension.yaml"

@turkenf
Copy link
Collaborator

turkenf commented Sep 10, 2024

/test-examples="examples/containerservice/v1beta1/kubernetescluster.yaml"

1 similar comment
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 10, 2024

/test-examples="examples/containerservice/v1beta1/kubernetescluster.yaml"

@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 11, 2024

The tests never get past the apply stage, even for an already merged resource examples/containerservice/v1beta1/kubernetescluster.yaml, even though all the statuses are set to true and ready. I was able to reproduce this issue using make e2e on my own Azure sub and I saw the error event in the resource group level activity logs complaining about the following:

The PutAgentPoolHandler.PUT request limit has been exceeded for agentpool /subscriptions/mysub/resourcegroups/example-rg-kce/providers/Microsoft.ContainerService/managedClusters/example-kc-kce/agentPools/default, please retry again in 36 seconds

I suspect the same thing is happening with the pipeline.

@mergenci
Copy link
Collaborator

mergenci commented Sep 13, 2024

/test-examples="examples/containerservice/v1beta2/kubernetesclusterextension.yaml"

Edit: This was an error.

@mergenci
Copy link
Collaborator

/test-examples="examples/containerservice/v1beta2/kubernetescluster.yaml"

@mergenci
Copy link
Collaborator

Thanks @sergenyalcin for analyzing the Uptest control plane dump. Let me report his findings that he shared in private with me.

2024-09-11T16:20:48Z	DEBUG	provider-azure	Diff detected	{"uid": "0fef3b05-9b1d-425f-ac87-94f6bd4bf38c", "name": "example-kc-kce", "gvk": "containerservice.azure.upbound.io/v1beta1, Kind=KubernetesCluster", "instanceDiff": "*terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{\"default_node_pool.0.upgrade_settings.#\":*terraform.ResourceAttrDiff{Old:\"1\", New:\"0\", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, \"default_node_pool.0.upgrade_settings.0.max_surge\":*terraform.ResourceAttrDiff{Old:\"10%\", New:\"\", NewComputed:false, NewRemoved:true, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, RawConfig:cty.NilVal, RawState:cty.NilVal, RawPlan:cty.NilVal, Meta:map[string]interface {}(nil)}"}

After KubernetesCluster creation, the provider is trying to delete default_node_pool.0.upgrade_settings.0.max_surge at the log line above. For some reason, the update is not successful — I don't see any errors reported by the provider or the Terraform layer. So, the provider enters an update loop, trying to delete the field forever. The update loop also explains the rate limiting that @drew0ps reported above.

@sergenyalcin noted that a Terraform provider version bump might have introduced the regression, if the Uptests were ever successful.

I triggered another Uptest for the cluster resource, this time with v1beta2, which is slightly different than v1beta1 manifest. Just in case…

@mergenci
Copy link
Collaborator

mergenci commented Sep 13, 2024

Of course the Uptest failed, because of the region error 😒

Creating a free tier cluster is unavailable at this time in region westeurope.

@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 13, 2024

@mergenci Thanks a lot for following up and sharing insights - It seems that we are unable to proceed with the uptests because of the issue your described. What's weird to me is that even though the update loop occurs and the tests fail the KubernetesCluster still reports ready state in the status field.
With regards to this PR, I have tested the changes numerous times and it seems stable to me (as I showed in the SIG meeting), so in my opinion it would still be safe to merge it in time for the next release if noone sees any potential risks.

@drew0ps drew0ps force-pushed the add-aks-extensions-feature branch 2 times, most recently from fbdc471 to e5caa7a Compare September 13, 2024 14:57
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 13, 2024

Way better state now - uptests progressed until case-2 import, there I ran into an issue though:

    logger.go:42: 18:08:37 | case/2-import |     - lastTransitionTime: "2024-09-13T16:01:51Z"
message: 'async create failed: failed to create the resource: [{0 A resource
with the ID "/subscriptions/mysub/resourceGroups/example-rg-kce/providers/Microsoft.ContainerService/managedClusters/example-kc-kce/providers/Microsoft.KubernetesConfiguration/extensions/example-kce"
already exists - to be managed via Terraform this resource needs to be imported
into the State. Please see the resource documentation for "azurerm_kubernetes_cluster_extension"
for more information.

Looking at past issues, Ive noticed that this usually indicates an issue with externalname config. However, it seems good to me. TF doc:

terraform import azurerm_kubernetes_cluster_extension.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/resourceGroup1/providers/Microsoft.ContainerService/managedClusters/cluster1/providers/Microsoft.KubernetesConfiguration/extensions/extension1

externalname config:
config.TemplatedStringAsIdentifier("name", "{{ .parameters.kubernetes_cluster_id }}/providers/Microsoft.KubernetesConfiguration/extensions/{{ .external_name }}"),

Which seems to be correct.

…s to the examples folder and made slight changes.

Added a simple flux extension install for automated tests.
Specifying maxSurge in the AKS cluster spec precents the update loop of the cluster and unblocks tests.
Signed-off-by: drew0ps <[email protected]>
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 16, 2024

/test-examples="examples/containerservice/v1beta1/kubernetesclusterextension.yaml"

@drew0ps drew0ps marked this pull request as draft September 16, 2024 19:11
@drew0ps
Copy link
Contributor Author

drew0ps commented Sep 16, 2024

closing this in favor of a way cleaner #818

@drew0ps drew0ps closed this Sep 16, 2024
@drew0ps drew0ps deleted the add-aks-extensions-feature branch September 18, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request for azurerm_kubernetes_cluster_extension resource
3 participants