
[Bug]: RDS instances not syncing due to wrong AZ in spec #1379

Open
bobdanek opened this issue Jun 25, 2024 · 7 comments
Labels
bug Something isn't working is:triaged Indicates that an issue has been reviewed.

Comments

@bobdanek

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

  • rds.aws.upbound.io/v1beta1 - Instance

Resource MRs required to reproduce the bug

exampledb.yml.txt
mysqlinstanceandservice.yml.txt
xmysqlinstance.yml.txt
xmysqlinstanceandservice.yml.txt

Steps to Reproduce

I don't have a way to reproduce this outside our environment, but here's an approximation:

  • Create a new RDS instance via a composite resource. Set multiAz to true. Do not specify an availability zone. (A minimal sketch follows this list.)
  • Examine the "composed" instance (instances.rds.aws.upbound.io) to see in which AZ it landed (status.atProvider.availabilityZone)
  • Examine the "composed" instance (instances.rds.aws.upbound.io) to see which AZ ends up in the spec (spec.forProvider.availabilityZone)
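For illustration only, the composed Instance is shaped roughly like this (names and values are hypothetical and trimmed to the relevant fields; this is not one of our actual manifests):

apiVersion: rds.aws.upbound.io/v1beta1
kind: Instance
metadata:
  name: example-db
spec:
  forProvider:
    region: us-east-1
    engine: mysql
    instanceClass: db.t3.medium
    multiAz: true
    # availabilityZone deliberately omitted; the provider fills it in
  providerConfigRef:
    name: default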

What happened?

Expected: RDS instance created successfully, and the instance managed resource stays synced. The AZ in which the instance was created matches the AZ that appears in the spec.

Actual behavior: RDS instance created successfully, but at some point Synced on the Instance becomes False because a different availability zone appears in the spec, which triggers a replacement plan. Replacement is blocked (thankfully) by "prevent_destroy":true, but any unrelated changes we want to make are blocked along with it.

Relevant Error Output Snippet

conditions:
  - lastTransitionTime: "2024-06-25T16:41:47Z"
    message: 'observe failed: cannot run plan: plan failed: Instance cannot be destroyed:
      Resource aws_db_instance.example-12345-abcde has lifecycle.prevent_destroy
      set, but the plan calls for this resource to be destroyed. To avoid this error
      and continue with the plan, either disable lifecycle.prevent_destroy or reduce
      the scope of the plan using the -target flag.'
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2024-02-07T13:40:20Z"
    reason: Finished
    status: "True"
    type: AsyncOperation
  - lastTransitionTime: "2023-11-21T19:16:10Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-07T13:48:19Z"
    message: 'apply failed: Instance cannot be destroyed: Resource aws_db_instance.example-12345-abcde
      has lifecycle.prevent_destroy set, but the plan calls for this resource to be
      destroyed. To avoid this error and continue with the plan, either disable lifecycle.prevent_destroy
      or reduce the scope of the plan using the -target flag.'
    reason: ApplyFailure
    status: "False"
    type: LastAsyncOperation

Crossplane Version

1.14.9

Provider Version

0.40.102

Kubernetes Version

v1.28.9-eks-036c24b

Kubernetes Distribution

EKS

Additional Info

  • I'm fairly new to Crossplane; apologies in advance if I mix up terminology and concepts.
  • All affected instances are using multi-AZ
  • AZ is not specified in any of our manifests, so I'm not sure why an AZ, let alone a different AZ, is appearing in the instance managed resource
  • I discovered this was occurring when trying to troubleshoot why changes to caCertIdentifier (from rds-ca-2019 to rds-ca-rsa2048-g1) were not being applied.
  • We use a forked provider that feeds into our Crossplane libraries: https://github.com/grafana/crossplane-provider-aws/tree/release-0.40
@bobdanek added the bug and needs:triage labels on Jun 25, 2024
@bobdanek
Author

I found a workaround that will unblock me, though it doesn't explain why things got into this state to begin with:

If I delete availabilityZone from spec.forProvider, the resource gets updated a few seconds later, adding back availabilityZone but with the correct/expected value.

Example with kubectl:

kubectl patch instances.rds.aws.upbound.io example-db --type json -p '[{ "op": "remove", "path": "/spec/forProvider/availabilityZone" }]'
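To confirm the field was late-initialized again with the expected value (resource name hypothetical):

kubectl get instances.rds.aws.upbound.io example-db \
  -o jsonpath='{.spec.forProvider.availabilityZone}'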

@mbbush
Collaborator

mbbush commented Jul 4, 2024

The process of copying fields from the observed status.atProvider to spec.forProvider is called Late Initialization, and it has seen significant improvements since provider version 0.40. I don't have an explanation for how the wrong AZ could have gotten onto the resource, although frankly the most likely explanation is that someone or something in your infrastructure set it by mistake (a typo in a kubectl command, a bug in a composition, something like that).
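For illustration (hypothetical AZ value), late initialization roughly does this:

# Before: the user leaves the field unset
spec:
  forProvider:
    multiAz: true
status:
  atProvider:
    availabilityZone: us-east-1a   # observed from AWS

# After: the observed value is copied into the spec
spec:
  forProvider:
    multiAz: true
    availabilityZone: us-east-1a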

Version 0.40 is really too old for bug reports to be meaningful. Since then we've completely changed how we reconcile resources (no longer forking Terraform processes for each managed resource), which made a huge improvement in the provider's compute requirements, and upgraded to Terraform provider AWS version 5.x, to name just some of the major changes. I would encourage you to upgrade to a newer version of the provider; I expect this issue would likely not recur. If it does, I'd be happy to look at a bug report with more specific steps to reproduce.

By the way, it looks like the fork you linked is missing the backport of a bug fix for a regression introduced in v0.40.0 for the Role.iam resource, which is one of the most commonly used resources. Maybe you don't need it because you don't use that resource, but you should at least be aware of the issue and the fix (which is coincidentally also related to late initialization).

@mbbush added the is:triaged label and removed the needs:triage label on Jul 4, 2024
@WolfGanGeRTech

I also had this issue. The reason it happens: with multiAz=true enabled, if the instance in its initial AZ has any issue, AWS fails it over to another AZ, and this error starts to appear (this is pretty common).

I have the same problem with many other fields, like engineVersion, because AWS performs minor upgrades and Crossplane starts reporting a sync error.

It would be really cool if we could have a way to specify that some fields should be ignored. initProvider kind of does this, but we need to provide an initial value, which in most cases doesn't make sense.
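A rough sketch of what that looks like today (values hypothetical):

spec:
  initProvider:
    availabilityZone: us-east-1a   # used only at creation; later drift in this field is not enforced
  forProvider:
    multiAz: true

The pain point is that you still have to pick some initial value, even when you'd rather let AWS choose.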

@WolfGanGeRTech

This also relates to another issue I have, #1370: basically, I just want to import an existing ReplicationGroup and tell Crossplane to ignore the authTokenSecretRef field, but (AFAIK) that is not possible because Crossplane always tries to sync it.

@pixiono

pixiono commented Aug 2, 2024

We are facing the same issue.


github-actions bot commented Nov 1, 2024

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Nov 1, 2024
@applike-ss

> I also had this issue. The reason it happens: with multiAz=true enabled, if the instance in its initial AZ has any issue, AWS fails it over to another AZ, and this error starts to appear (this is pretty common).
>
> I have the same problem with many other fields, like engineVersion, because AWS performs minor upgrades and Crossplane starts reporting a sync error.
>
> It would be really cool if we could have a way to specify that some fields should be ignored. initProvider kind of does this, but we need to provide an initial value, which in most cases doesn't make sense.

Exactly what I stumbled upon too. During OS maintenance it switches over to the secondary AZ and suddenly the resource is no longer in sync.
I'd also like to see a feature to ignore certain fields.
