
[Bug]: RDS instances not syncing due to wrong AZ in spec #1379

Open
bobdanek opened this issue Jun 25, 2024 · 7 comments
Labels
bug Something isn't working is:triaged Indicates that an issue has been reviewed.

Comments

@bobdanek

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

  • rds.aws.upbound.io/v1beta1 - Instance

Resource MRs required to reproduce the bug

exampledb.yml.txt
mysqlinstanceandservice.yml.txt
xmysqlinstance.yml.txt
xmysqlinstanceandservice.yml.txt

Steps to Reproduce

I don't have a way to reproduce this outside our environment, but here's an approximation:

  • Create a new RDS instance via a composite resource. Set multiAz to true. Do not specify an availability zone. (A minimal sketch follows this list.)
  • Examine the "composed" instance (instances.rds.aws.upbound.io) to see in which AZ it landed (status.atProvider.availabilityZone)
  • Examine the "composed" instance (instances.rds.aws.upbound.io) to see which AZ ends up in the spec (spec.forProvider.availabilityZone)
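For illustration only, the composed Instance is shaped roughly like this (names and values are hypothetical and trimmed to the relevant fields; this is not one of our actual manifests):

apiVersion: rds.aws.upbound.io/v1beta1
kind: Instance
metadata:
  name: example-db
spec:
  forProvider:
    region: us-east-1
    engine: mysql
    instanceClass: db.t3.medium
    multiAz: true
    # availabilityZone deliberately omitted; the provider fills it in
  providerConfigRef:
    name: default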

What happened?

Expected: RDS instance created successfully, and the instance managed resource stays synced. The AZ in which the instance was created matches the AZ that appears in the spec.

Actual behavior: RDS instance created successfully, but at some point Synced on the Instance becomes False because a different availability zone appears in the spec, which triggers a replacement plan. Replacement is blocked (thankfully) by "prevent_destroy":true, but any unrelated changes we want to make are blocked along with it.

Relevant Error Output Snippet

conditions:
  - lastTransitionTime: "2024-06-25T16:41:47Z"
    message: 'observe failed: cannot run plan: plan failed: Instance cannot be destroyed:
      Resource aws_db_instance.example-12345-abcde has lifecycle.prevent_destroy
      set, but the plan calls for this resource to be destroyed. To avoid this error
      and continue with the plan, either disable lifecycle.prevent_destroy or reduce
      the scope of the plan using the -target flag.'
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2024-02-07T13:40:20Z"
    reason: Finished
    status: "True"
    type: AsyncOperation
  - lastTransitionTime: "2023-11-21T19:16:10Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-07T13:48:19Z"
    message: 'apply failed: Instance cannot be destroyed: Resource aws_db_instance.example-12345-abcde
      has lifecycle.prevent_destroy set, but the plan calls for this resource to be
      destroyed. To avoid this error and continue with the plan, either disable lifecycle.prevent_destroy
      or reduce the scope of the plan using the -target flag.'
    reason: ApplyFailure
    status: "False"
    type: LastAsyncOperation

Crossplane Version

1.14.9

Provider Version

0.40.102

Kubernetes Version

v1.28.9-eks-036c24b

Kubernetes Distribution

EKS

Additional Info

  • I'm fairly new to Crossplane; apologies in advance if I mix up terminology and concepts.
  • All affected instances are using multi-AZ
  • AZ is not specified in any of our manifests, so I'm not sure why an AZ, let alone a different AZ, is appearing in the instance managed resource
  • I discovered this was occurring when trying to troubleshoot why changes to caCertIdentifier (from rds-ca-2019 to rds-ca-rsa2048-g1) were not being applied.
  • We use a forked provider that feeds into our Crossplane libraries: https://github.com/grafana/crossplane-provider-aws/tree/release-0.40
@bobdanek added the bug and needs:triage labels on Jun 25, 2024
@bobdanek
Author

I found a workaround that will unblock me, though it doesn't explain why things got into this state to begin with:

If I delete availabilityZone from spec.forProvider, the resource gets updated a few seconds later, adding back availabilityZone but with the correct/expected value.

Example with kubectl:

kubectl patch instances.rds.aws.upbound.io example-db --type json -p '[{ "op": "remove", "path": "/spec/forProvider/availabilityZone" }]'
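To confirm the field was late-initialized again with the expected value (resource name hypothetical):

kubectl get instances.rds.aws.upbound.io example-db \
  -o jsonpath='{.spec.forProvider.availabilityZone}'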

@mbbush
Collaborator

mbbush commented Jul 4, 2024

The process of copying fields from the observed status.atProvider to spec.forProvider is called Late Initialization, and it has seen significant improvements since provider version 0.40. I don't have an explanation for how the wrong AZ could have gotten onto the resource, although frankly the most likely explanation is that someone or something in your infrastructure set it by mistake (a typo in a kubectl command, a bug in a composition, something like that).
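For illustration (hypothetical AZ value), late initialization roughly does this:

# Before: the user leaves the field unset
spec:
  forProvider:
    multiAz: true
status:
  atProvider:
    availabilityZone: us-east-1a   # observed from AWS

# After: the observed value is copied into the spec
spec:
  forProvider:
    multiAz: true
    availabilityZone: us-east-1a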

Version 0.40 is really too old for bug reports to be meaningful. Since then we've completely changed how we reconcile resources (no longer forking Terraform processes for each managed resource), which made a huge improvement in the provider's compute requirements, and upgraded to Terraform provider AWS version 5.x, to name just some of the major changes. I would encourage you to upgrade to a newer version of the provider; I expect this issue would likely not recur. If it does, I'd be happy to look at a bug report with more specific steps to reproduce.

By the way, it looks like the fork you linked is missing the backport of a bug fix for a regression introduced in v0.40.0 for the Role.iam resource, which is one of the most commonly used resources. Maybe you don't need it because you don't use that resource, but you should at least be aware of the issue and the fix (which is coincidentally also related to late initialization).

@mbbush added the is:triaged label and removed the needs:triage label on Jul 4, 2024
@WolfGanGeRTech

I also had this issue. The reason it happens: with multiAz=true enabled, if the instance in its initial AZ has any issue, AWS fails it over to another AZ, and this error starts to appear (this is pretty common).

I have the same problem with many other fields, like engineVersion, because AWS performs minor upgrades and Crossplane starts reporting a sync error.

It would be really cool if we could have a way to specify that some fields should be ignored. initProvider kind of does this, but we need to provide an initial value, which in most cases doesn't make sense.
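A rough sketch of what that looks like today (values hypothetical):

spec:
  initProvider:
    availabilityZone: us-east-1a   # used only at creation; later drift in this field is not enforced
  forProvider:
    multiAz: true

The pain point is that you still have to pick some initial value, even when you'd rather let AWS choose.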

@WolfGanGeRTech

This also relates to another issue I have, #1370: basically, I just want to import an existing ReplicationGroup and tell Crossplane to ignore the authTokenSecretRef field, but (AFAIK) that is not possible because Crossplane always tries to sync it.

@pixiono

pixiono commented Aug 2, 2024

We are facing the same issue.


github-actions bot commented Nov 1, 2024

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Nov 1, 2024
@applike-ss

> I also had this issue. The reason it happens: with multiAz=true enabled, if the instance in its initial AZ has any issue, AWS fails it over to another AZ, and this error starts to appear (this is pretty common).
>
> I have the same problem with many other fields, like engineVersion, because AWS performs minor upgrades and Crossplane starts reporting a sync error.
>
> It would be really cool if we could have a way to specify that some fields should be ignored. initProvider kind of does this, but we need to provide an initial value, which in most cases doesn't make sense.

Exactly what I stumbled upon too. During OS maintenance it switches over to the secondary AZ and suddenly the resource is no longer in sync.
I'd also like to see a feature to ignore certain fields.
