Add support for msk connect resources #1162

Merged: 8 commits into crossplane-contrib:main on Mar 28, 2024

mbbush (Collaborator) commented Feb 16, 2024

Description of your changes

Adds support for the three MSK Connect resources.

Note: AWS has very limited API support for these resources. Other than the scaling object, every field in the Terraform schema has ForceNew: true, so from the Crossplane perspective the resources are mostly immutable once created. To change anything else, you have to delete the resource and create a new one.

The WorkerConfiguration also has no delete method in the AWS API, so once you've created a config with a given name, that name is taken forever.
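
To illustrate what ForceNew means in practice, here is a minimal sketch in Go. It is not the Terraform plugin SDK: `fieldSchema`, `requiresReplacement`, and `connectorSchema` are hypothetical names, and the field list is an illustrative subset rather than the real provider schema.

```go
package main

// fieldSchema is a hypothetical, stripped-down stand-in for a Terraform
// schema entry. Only the ForceNew flag matters for this sketch.
type fieldSchema struct {
	ForceNew bool
}

// requiresReplacement reports whether changing any of the named fields would
// force the resource to be deleted and recreated instead of updated in place.
// Field names that are not in the schema are ignored.
func requiresReplacement(schema map[string]fieldSchema, changed []string) bool {
	for _, name := range changed {
		if f, ok := schema[name]; ok && f.ForceNew {
			return true
		}
	}
	return false
}

// connectorSchema is an assumed, illustrative subset of connector fields.
// Only the scaling-related field is updatable in place.
var connectorSchema = map[string]fieldSchema{
	"name":                    {ForceNew: true},
	"connector_configuration": {ForceNew: true},
	"capacity":                {ForceNew: false}, // the scaling object
}
```

With this model, `requiresReplacement(connectorSchema, []string{"capacity"})` is false, while a change to any other listed field returns true, which is why the resources behave as mostly immutable.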

There is some surprising lifecycle behavior around the Connector resource, and I'm not sure whether it can be resolved.

The AWS API exposes the State of a connector, which can be "Creating", "Running", "Deleting", or "Failed". The Terraform provider uses this state to determine whether a create call succeeded, but does not expose it as an attribute of the resource, so I can't think of a way to get it into Crossplane.

When I create a Connector resource that is broken (as in the connector-nokafka.yaml example, where the Kafka cluster doesn't exist), the create call runs for just over 20 minutes (I had to increase the timeout slightly from the 20 minutes configured in the Terraform provider) and then fails, setting the following conditions:

  - lastTransitionTime: "2024-02-17T23:47:54Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  - lastTransitionTime: "2024-02-17T23:47:54Z"
    reason: Creating
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-02-18T00:08:23Z"
    message: 'async create failed: failed to create the resource: [{0 waiting for
      MSK Connect Connector (arn:aws:kafkaconnect:us-east-2:905418119848:connector/connector-broken/4f63f714-30c2-4dd7-bf0c-edced7778d4e-2)
      create: unexpected state ''FAILED'', wanted target ''RUNNING''. last error:
      UnknownError.Unknown: The last operation failed. Retry the operation.  []}]'
    reason: AsyncCreateFailure
    status: "False"
    type: LastAsyncOperation

The provider then immediately runs an Observe, which finds that the connector exists and that its observed state matches the spec (since the State is not exposed in the Terraform schema). It decides that creation must have succeeded after all, fills in status.atProvider, and sets the following conditions:

  - lastTransitionTime: "2024-02-17T23:47:54Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  - lastTransitionTime: "2024-02-18T00:08:25Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-18T00:08:25Z"
    reason: Success
    status: "True"
    type: LastAsyncOperation
  - lastTransitionTime: "2024-02-18T00:08:25Z"
    reason: UpToDate
    status: "True"
    type: Test

I'm not sure how best to reconcile the behavior of the AWS resource and the Terraform provider with what Crossplane expects.
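
The sequence above can be modelled with a small Go sketch. This is a simplification for illustration, not the provider's actual Observe code: `connectorParams`, `connectorObservation`, and `looksReady` are hypothetical names, and the two compared fields are an assumed subset.

```go
package main

// connectorParams is a hypothetical subset of the fields that appear in both
// the desired spec and the observed status.
type connectorParams struct {
	Name                string
	KafkaConnectVersion string
}

// connectorObservation models what the observe step can see. There is
// deliberately no State field, because the Terraform schema does not expose it.
type connectorObservation struct {
	Exists bool
	Params connectorParams
}

// looksReady mimics the decision described above: if the resource exists and
// every compared field matches the spec, it is treated as ready.
func looksReady(desired connectorParams, observed connectorObservation) bool {
	return observed.Exists && observed.Params == desired
}
```

Because the observation carries no state, a connector that is Failed in AWS but whose fields match the spec still makes `looksReady` return true, which matches the flip from AsyncCreateFailure to Available shown in the conditions.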

Fixes #374
Fixes #1146

I have:

  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Manually, in kind, using the manifests I committed. I was able to get the connector to enter a Running state, in which it did nothing because there was no traffic, and no actual working code to run, but the infrastructure was all healthy.

I'll run uptest on the manifests that don't require manual intervention.

Output from local run with the connector.yaml example:

NAME                                       READY   SYNCED   EXTERNAL-NAME   AGE
group.cloudwatchlogs.aws.upbound.io/test   True    True     op-vggf5w1o     154m

NAME                                           READY   SYNCED   EXTERNAL-NAME       AGE
securitygrouprule.ec2.aws.upbound.io/egress    True    True     sgrule-2572981504   154m
securitygrouprule.ec2.aws.upbound.io/ingress   True    True     sgrule-2191228865   154m

NAME                                  READY   SYNCED   EXTERNAL-NAME          AGE
securitygroup.ec2.aws.upbound.io/sg   True    True     sg-0b03fcec08eeca5d9   154m

NAME                                   READY   SYNCED   EXTERNAL-NAME              AGE
subnet.ec2.aws.upbound.io/subnet-az1   True    True     subnet-04e98dd98ea750f9f   154m
subnet.ec2.aws.upbound.io/subnet-az2   True    True     subnet-040975500b81b55f1   154m

NAME                         READY   SYNCED   EXTERNAL-NAME           AGE
vpc.ec2.aws.upbound.io/vpc   True    True     vpc-0319cb8653512d954   154m

NAME                                               READY   SYNCED   EXTERNAL-NAME                                                             AGE
deliverystream.firehose.aws.upbound.io/connector   True    True     arn:aws:firehose:us-east-2:226209437353:deliverystream/example-absdfsdf   154m

NAME                                READY   SYNCED   EXTERNAL-NAME   AGE
role.iam.aws.upbound.io/connector   True    True     connector       154m

NAME                                   READY   SYNCED   EXTERNAL-NAME                                                                                 AGE
cluster.kafka.aws.upbound.io/example   True    True     arn:aws:kafka:us-east-2:226209437353:cluster/example/d9f0e042-f520-4e34-a6d6-1ac5fce3e82d-4   154m

NAME                                         READY   SYNCED   EXTERNAL-NAME                                                                                       AGE
configuration.kafka.aws.upbound.io/example   True    True     arn:aws:kafka:us-east-2:226209437353:configuration/example/eae7df28-446c-4c93-9ecb-ab0737de0e01-4   154m

NAME                                                     READY   SYNCED   EXTERNAL-NAME                            AGE
connector.kafkaconnect.aws.upbound.io/connector          True    True     c6c84280-7796-42cc-9f4b-42607692741a-2   29m

NAME                                               READY   SYNCED   EXTERNAL-NAME                            AGE
customplugin.kafkaconnect.aws.upbound.io/example   True    True     29fec958-2694-4251-9a62-b2f7de64f7fc-2   154m

NAME                                                        READY   SYNCED   EXTERNAL-NAME                            AGE
workerconfiguration.kafkaconnect.aws.upbound.io/connector   True    True     2d5604c3-ec17-4394-be2f-8b07131963c3-2   154m

NAME                                 READY   SYNCED   EXTERNAL-NAME   AGE
bucket.s3.aws.upbound.io/connector   True    True     op-bonq5b6d     154m

NAME                               READY   SYNCED   EXTERNAL-NAME   AGE
object.s3.aws.upbound.io/example   True    True     empty.zip       154m

mbbush (Collaborator, Author) commented Feb 16, 2024

/test-examples="examples/kafkaconnect/v1beta1/customplugin.yaml"

mbbush (Collaborator, Author) commented Feb 16, 2024

/test-examples="examples/kafkaconnect/v1beta1/workerconfiguration.yaml"

mbbush (Collaborator, Author) commented Feb 16, 2024

/test-examples="examples/kafkaconnect/v1beta1/connector-broken.yaml"

mbbush (Collaborator, Author) commented Feb 16, 2024

/test-examples="examples/kafkaconnect/v1beta1/connector-broken.yaml"

The previous run actually failed because the AWS account hit a VPC limit. It would be great if someone from Upbound could submit AWS quota increase requests for the number of VPCs per region.

mbbush (Collaborator, Author) commented Mar 1, 2024

/test-examples="examples/kafkaconnect/v1beta1/connector-broken.yaml"

mbbush (Collaborator, Author) commented Mar 1, 2024

The previous run failed with

2024-03-01T19:10:45.0295557Z     logger.go:42: 19:10:44 | case/0-apply |       message: 'connect failed: cannot initialize the Terraform plugin SDK async external
2024-03-01T19:10:45.0296976Z     logger.go:42: 19:10:44 | case/0-apply |         client: cannot get terraform setup: cannot get referenced Provider: default:
2024-03-01T19:10:45.0298532Z     logger.go:42: 19:10:44 | case/0-apply |         Timeout: failed waiting for *v1beta1.ProviderConfig Informer to sync'

I haven't seen this particular failure before, but it doesn't seem at all related to this PR.

Maybe it will pass a second time?

/test-examples="examples/kafkaconnect/v1beta1/connector-broken.yaml"

mbbush (Collaborator, Author) commented Mar 2, 2024

The previous failure was caused by upbound/universal-crossplane#442. Pinning the UXP version in the Makefile resolved the issue.

/test-examples="examples/kafkaconnect/v1beta1/connector-nokafka.yaml"

mbbush (Collaborator, Author) commented Mar 2, 2024

Uptest is failing because of too many VPCs. This is a problem in the Upbound AWS account, which I don't have access to.

@jeanduplessis Could you please take a look at why cloud-nuke doesn't seem to be cleaning them up? Requesting a limit increase from AWS on the number of VPCs per region would also be very helpful.

jeanduplessis (Collaborator) commented

/test-examples="examples/kafkaconnect/v1beta1/connector-nokafka.yaml"

jeanduplessis (Collaborator) commented

@mbbush it should be fine again now. Ping me directly if the uptest run fails with that issue again.

mbbush (Collaborator, Author) commented Mar 5, 2024

/test-examples="examples/kafkaconnect/v1beta1/connector-nokafka.yaml"

Resolved review thread: Makefile (outdated)

mbbush (Collaborator, Author) commented Mar 6, 2024

I believe this is ready for review. @turkenf do you think you could take a look before the next provider-aws release?

I don't see any path forward to improving the behavior of the provider when creating a Connector results in a connector in state Failed in AWS. The state is not exposed by the Terraform provider, so the only way I could query it would be to extract the AWS client from the provider and make an explicit API call.

I tried that and was able to extract the AWS client, but it was wrapped in a different type, and I couldn't figure out how to extract the underlying AWS SDK client (it's probably possible, I just don't know how). But regardless of whether it's possible, I'm not sure it's wise.

My preference would be to merge this as-is and release it, with the known issue that the provider does not expose the connector's error states. If there's consensus that it is wise to add the connector's status field to the schema by invoking the SDK client directly, then I can work on that for a future release, as it would be a non-breaking change.

Nothing significant has changed, so I'm confident the e2e tests will still pass. I'll rerun the ones that are testable automatically, but the full example which requires manual intervention is kind of a pain to run and takes a very long time, so I'd rather only do it one more time after this is otherwise approved.

mbbush (Collaborator, Author) commented Mar 6, 2024

/test-examples="examples/kafkaconnect/v1beta1/customplugin.yaml"

mbbush (Collaborator, Author) commented Mar 11, 2024

/test-examples="examples/kafkaconnect/v1beta1/connector-nokafka.yaml"

jeanduplessis (Collaborator) commented

/test-examples="examples/kafkaconnect/v1beta1/connector-nokafka.yaml"

turkenf (Collaborator) left a comment

Thank you for your effort in this PR @mbbush, I left a few initial review comments for you to consider.

Resolved review threads:
  • config/kafkaconnect/config.go (4 threads)
  • examples/kafkaconnect/v1beta1/connector.yaml (5 threads, 1 outdated)
mbbush force-pushed the msk-connect branch 2 times, most recently from 4e6dfb0 to 7a22d4a, on March 16, 2024 06:42
mbbush requested a review from turkenf on March 18, 2024 18:02

turkenf (Collaborator) left a comment

Thank you @mbbush, we can merge after resolving the conflicts.

mbbush and others added 5 commits on March 28, 2024 09:31
Signed-off-by: Matt Bush
Co-authored-by: Fatih Türken

mbbush (Collaborator, Author) commented Mar 28, 2024

@turkenf Rebased and resolved conflicts. It's not clear to me if you've already started the process of releasing v1.3.0. It would be great if this made it in, but it's not a huge problem for me if it doesn't.

Signed-off-by: Matt Bush

turkenf (Collaborator) commented Mar 28, 2024

@mbbush, we will include this PR.

mbbush (Collaborator, Author) commented Mar 28, 2024

Thanks! I'm rebasing/committing from the train on my way in to the office, so I didn't have the bandwidth to run make generate locally. But I applied the diff from the failed check-diff ci job using patch, so it should pass now.

turkenf (Collaborator) commented Mar 28, 2024

/test-examples="examples/ec2/v1beta1/vpc.yaml"

Uptest run: https://github.com/crossplane-contrib/provider-upjet-aws/actions/runs/8470993317

turkenf merged commit 9c77960 into crossplane-contrib:main on Mar 28, 2024 (12 checks passed)
Successfully merging this pull request may close these issues.

  • Request for aws_mskconnect_* resources
  • Moving location(1), mskconnect(3) resources to v1beta1 version
3 participants