[Bug]: Can't install provider-aws 1.0.0 on crossplane versions before 1.14 #1147

Closed
1 task done
mbbush opened this issue Feb 8, 2024 · 7 comments · Fixed by #1157
Labels: bug (Something isn't working), is:triaged (Indicates that an issue has been reviewed)

Comments

@mbbush
Collaborator

mbbush commented Feb 8, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

No response

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

  1. Create a kind cluster.
  2. Install UXP v1.13.2-up.3.
  3. Install provider-aws v0.47.1, including at least one of the provider packages that have conversion webhooks (ec2, rds, elasticache, or a couple of others).
  4. Wait for the providers to stabilize.
  5. Upgrade the providers to provider-aws v1.0.0.

What happened?

The providers that have conversion webhooks went into a crash loop and never became ready. Resources from ec2, rds, elasticache, etc. became completely unusable.

Relevant Error Output Snippet

From provider-aws-elasticache v1.0.0:
2024-02-08T02:22:36Z	DEBUG	provider-aws	Starting	{"sync-interval": "1h0m0s", "poll-interval": "10m0s", "poll-jitter": "30s", "max-reconcile-rate": 100}
2024-02-08T02:22:38Z	INFO	provider-aws	Beta feature enabled	{"flag": "EnableBetaManagementPolicies"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=Cluster"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=Cluster"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=ParameterGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=ParameterGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta2, Kind=ReplicationGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta2, Kind=ReplicationGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.webhook	Registering webhook	{"path": "/convert"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	Conversion webhook enabled	{"GVK": "elasticache.aws.upbound.io/v1beta2, Kind=ReplicationGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=SubnetGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=SubnetGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=User"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=User"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=UserGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.builder	skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called	{"GVK": "elasticache.aws.upbound.io/v1beta1, Kind=UserGroup"}
2024-02-08T02:22:38Z	INFO	controller-runtime.metrics	Starting metrics server
2024-02-08T02:22:38Z	INFO	controller-runtime.webhook	Starting webhook server
2024-02-08T02:22:38Z	INFO	Stopping and waiting for non leader election runnables
2024-02-08T02:22:38Z	INFO	Stopping and waiting for leader election runnables
2024-02-08T02:22:38Z	INFO	controller-runtime.metrics	Serving metrics server	{"bindAddress": ":8080", "secure": false}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=cluster", "source": "kind source: *v1beta1.Cluster"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=cluster"}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta2, kind=replicationgroup", "source": "kind source: *v1beta2.ReplicationGroup"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta2, kind=replicationgroup"}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta2, kind=replicationgroup", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=usergroup", "source": "kind source: *v1beta1.UserGroup"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=usergroup"}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=usergroup", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta2, kind=replicationgroup"}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=parametergroup", "source": "kind source: *v1beta1.ParameterGroup"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=parametergroup"}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=parametergroup", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=cluster", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=user", "source": "kind source: *v1beta1.User"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=user"}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=user", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=usergroup"}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=parametergroup"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta2, kind=replicationgroup"}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=cluster"}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=user"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=parametergroup"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=cluster"}
2024-02-08T02:22:38Z	INFO	Starting EventSource	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=subnetgroup", "source": "kind source: *v1beta1.SubnetGroup"}
2024-02-08T02:22:38Z	INFO	Starting Controller	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=subnetgroup"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=usergroup"}
2024-02-08T02:22:38Z	INFO	Starting workers	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=subnetgroup", "worker count": 100}
2024-02-08T02:22:38Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=subnetgroup"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=user"}
2024-02-08T02:22:38Z	INFO	All workers finished	{"controller": "managed/elasticache.aws.upbound.io/v1beta1, kind=subnetgroup"}
2024-02-08T02:22:38Z	INFO	Stopping and waiting for caches
2024-02-08T02:22:38Z	ERROR	controller-runtime.source.EventHandler	failed to get informer from cache	{"error": "Timeout: failed waiting for *v1beta1.Cluster Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	k8s.io/[email protected]/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	k8s.io/[email protected]/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56
2024-02-08T02:22:38Z	INFO	Stopping and waiting for webhooks
2024-02-08T02:22:38Z	INFO	Stopping and waiting for HTTP servers
2024-02-08T02:22:38Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-02-08T02:22:38Z	INFO	Wait completed, proceeding to shutdown the manager
provider: error: Cannot start controller manager: open /tls/server/tls.crt: no such file or directory

Crossplane Version

v1.13.2-up.3 and earlier (I also tested v1.13.2-up.1 and v1.12.3-up.1)

Provider Version

0.47.1 upgrading to 1.0.0

Kubernetes Version

No response

Kubernetes Distribution

Both EKS and Kind

Additional Info

No response

@mbbush added the bug and needs:triage labels on Feb 8, 2024
@turkenf
Collaborator

turkenf commented Feb 8, 2024

@mbbush thank you for bringing this up. I was able to reproduce this issue with provider-aws-elasticache.
CC: @sergenyalcin, @ulucinar

  • k logs provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p -n upbound-system
2024-02-08T18:39:00Z	ERROR	controller-runtime.source.EventHandler	failed to get informer from cache	{"error": "Timeout: failed waiting for *v1beta1.SubnetGroup Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	k8s.io/[email protected]/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	k8s.io/[email protected]/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56
2024-02-08T18:39:00Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-02-08T18:39:00Z	INFO	Wait completed, proceeding to shutdown the manager
provider: error: Cannot start controller manager: open /tls/server/tls.crt: no such file or directory
  • k get providers -w
NAME                          INSTALLED   HEALTHY   PACKAGE                                                   AGE
provider-aws-elasticache      True        False     xpkg.upbound.io/upbound/provider-aws-elasticache:v1.0.0   6m54s
upbound-provider-family-aws   True        True      xpkg.upbound.io/upbound/provider-family-aws:v1.0.0        6m46s
provider-aws-elasticache      True        True      xpkg.upbound.io/upbound/provider-aws-elasticache:v1.0.0   7m19s
provider-aws-elasticache      True        False     xpkg.upbound.io/upbound/provider-aws-elasticache:v1.0.0   7m24s
provider-aws-elasticache      True        True      xpkg.upbound.io/upbound/provider-aws-elasticache:v1.0.0   8m49s
provider-aws-elasticache      True        False     xpkg.upbound.io/upbound/provider-aws-elasticache:v1.0.0   8m53s
  • k get pods -n upbound-system -w
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      0/1     Pending       0          0s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      0/1     ContainerCreating   0          0s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      1/1     Running             0          51s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      0/1     Error               0          56s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      1/1     Running             1 (3s ago)   57s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      0/1     Error               1 (7s ago)   61s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      0/1     CrashLoopBackOff    1 (13s ago)   73s
provider-aws-elasticache-e008782434ec-75cfb79bb-f6w7p      1/1     Running             2 (13s ago)   73s

@turkenf
Collaborator

turkenf commented Feb 12, 2024

After discussing with the team we think that this is a Crossplane/UXP issue, so I will transfer the issue to the UXP repo.

@turkenf turkenf transferred this issue from crossplane-contrib/provider-upjet-aws Feb 12, 2024
@phisco
Contributor

phisco commented Feb 12, 2024

@turkenf AFAICT this is because provider-aws, here, expects the certificates either to be at /tls/certs by default, which is the correct directory from 1.14 on, or in the folder passed as CERTS_DIR. Before 1.14, however, we used WEBHOOK_TLS_CERT_DIR. From 1.14 on we moved to TLS_SERVER_CERTS_DIR, but we kept WEBHOOK_TLS_CERT_DIR for backward compatibility, as explained here.

So, TL;DR: I think provider-aws should use WEBHOOK_TLS_CERT_DIR instead of CERTS_DIR here and, if possible, overwrite it with the value of TLS_SERVER_CERTS_DIR, which at some point is going to be the only env var set on the provider.

@phisco phisco transferred this issue from upbound/universal-crossplane Feb 12, 2024
@turkenh
Contributor

turkenh commented Feb 12, 2024

Thanks @phisco 🙏

Linking the correct implementation from provider-kubernetes: crossplane-contrib/provider-kubernetes@c8bbc9e

@mbbush
Collaborator Author

mbbush commented Feb 14, 2024

FWIW, when we experienced this issue in our EKS cluster, it was running upstream crossplane v1.12.2, not UXP.

@ulucinar
Collaborator

ulucinar commented Feb 15, 2024

Thank you all for the discussions here and the guidance given. Opened #1157, which implements the following configuration protocol:

  1. If the --certs-dir command-line option is supplied, it's used.
  2. If the --certs-dir command-line option is not supplied, the following environment variables are used in the given order: CERTS_DIR (for backwards-compatibility reasons), TLS_SERVER_CERTS_DIR (the new environment variable, which has replaced the WEBHOOK_TLS_CERT_DIR env. variable in Crossplane), and WEBHOOK_TLS_CERT_DIR (for backwards-compatibility).
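
For readers who want to see the lookup order spelled out, here is a minimal Go sketch of the protocol described above. It is not the code from #1157: the helper name resolveCertsDir, its lookup and fallback parameters, and the "/tls/server" fallback (borrowed from the path in the error message) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// Environment variables from the protocol above, in lookup order.
var certDirEnvVars = []string{
	"CERTS_DIR",            // kept for backwards compatibility
	"TLS_SERVER_CERTS_DIR", // the newer Crossplane variable
	"WEBHOOK_TLS_CERT_DIR", // the older Crossplane variable
}

// resolveCertsDir is a hypothetical helper. flagValue is the value of
// --certs-dir ("" when the flag was not supplied), lookup is os.LookupEnv
// or a stub, and fallback is returned when nothing else is set.
func resolveCertsDir(flagValue string, lookup func(string) (string, bool), fallback string) string {
	// 1. An explicit command-line option always wins.
	if flagValue != "" {
		return flagValue
	}
	// 2. Otherwise the first non-empty environment variable is used.
	for _, name := range certDirEnvVars {
		if v, ok := lookup(name); ok && v != "" {
			return v
		}
	}
	return fallback
}

func main() {
	// The fallback is an assumed default taken from the error message.
	fmt.Println(resolveCertsDir("", os.LookupEnv, "/tls/server"))
}
```

With this ordering, a provider running where only WEBHOOK_TLS_CERT_DIR is set would pick up that directory instead of falling through to the default.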

@mbbush, I think you've already seen it, but as a workaround, you can use the --certs-dir command-line option (or the CERTS_DIR env. variable) for now to get your installation working against Crossplane versions prior to v1.14 (assuming the certificate and key filenames did not change).

@mbbush
Collaborator Author

mbbush commented Feb 15, 2024

Thanks for the workaround suggestion. I've already opted to instead use this as an excuse to finally upgrade to crossplane 1.14, which I've wanted to do for a long time, but there were always more urgent things to work on.

It seems like the kind of thing that would be good to add to the release notes. Maybe @jeanduplessis could do that?
