Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrote ISM Policy reconciler #846

Merged

Conversation

rkthtrifork
Copy link
Contributor

@rkthtrifork rkthtrifork commented Jun 18, 2024

Description

The ISM Policy reconciler was constantly trying to update the ISM Policy and it was not handling reconciliation requeue in some cases. There were possibly other issues as well. Below I have described what caused the different issues I encountered

  • The ISM Policy request was different from the response, but they were both made with the same struct. This caused the reconciler to always see the existing ISM Policy and the ISM Policy from the CR as different and try to update it. I have created a separate struct model for each to separate the logic and in the code I now compare the existing policy with the policy from the CR by comparing both the Policy IDs and the policy spec
  • There were some very complex cases in the code that were very difficult to understand so I have attempted to make the code more concise and easy to read and understand
  • I have added reconciliation requeuing to all cases so the operator doesn't just stop reconciling the ISM Policy in some cases

One thing I am wondering is that I am not sure why we would want to create a CR without specifying the cluster ID and then the operator automatically links it to that cluster ID so it breaks if the OpenSearch CR is deleted. Is this intended and why? I'm talking about the section with the comment "Check cluster ref has not changed"

Tested cases:

  • A new ISM Policy is created through a CR and the operator creates it in the OpenSearch Cluster
  • The CR for an ISM Policy that is created by the operator is removed and the operator removes it in the OpenSearch Cluster
  • An ISM Policy that already exists in the OpenSearch Cluster is created through a CR and the operator ignores it and marks it as existing
  • The CR for an ISM Policy that was pre-existing and therefore was not created by the operator is removed and the operator does not remove the ISM Policy from the OpenSearch Cluster
  • An ISM Policy that already exists in the OpenSearch Cluster is created through a CR and the operator ignores it and marks it as existing. The ISM Policy is then manually removed from the OpenSearch Cluster and the operator now applies the ISM Policy from the CR

The test for ISM Policies is currently failing miserably, but I decided to create the PR to get feedback before I dive into fixing it.

Issues Resolved

#833
#732
Possibly other issues

Check List

  • Commits are signed per the DCO using --signoff
  • Unittest added for the new/changed functionality and all unit tests are successful
  • Customer-visible features documented
  • No linter warnings (make lint)

If CRDs are changed:

  • CRD YAMLs updated (make manifests) and also copied into the helm chart
  • Changes to CRDs documented

Please refer to the PR guidelines before submitting this pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rkthtrifork
Copy link
Contributor Author

rkthtrifork commented Jun 21, 2024

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR? @swoehrl-mw (I'm mentioning you since I've seen you the most here)

@rkthtrifork
Copy link
Contributor Author

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

@rkthtrifork
Copy link
Contributor Author

rkthtrifork commented Jun 21, 2024

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

@cthtrifork
Copy link
Contributor

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

@rkthtrifork rkthtrifork marked this pull request as draft August 5, 2024 06:54
@swoehrl-mw
Copy link
Collaborator

Hi @cthtrifork. There currently is no community call for the operator. I'm quite swamped with work so don't really have the time for a 1:1 call, the best I can offer is an async discussion either here or you can hit me up in the opensearch slack.

cc @prudhvigodithi

@swoehrl-mw
Copy link
Collaborator

Hi @rkthtrifork. Sorry for not getting back to you, not much time and your comments got lost among all my messages.

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR?

This boils down to basically personal style. I'm fine with using explicit returns.

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

The r.updateStatus is just a little helper for testing to make writing tests a bit easier. For reconciles the basic rule is: Each single reconciler should reconcile if it ran into an error or if its waiting on something. If there was no problem or nothing to do the default should be reconcile after 30s. It could be a bit confusing because for all reconcilers running under the cluster banner the top-level controller does this 30s default reconcile requeue. For the others there might be inconsistencies. If you find any, cleaning them up would be dope.

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

That was an early design decision to avoid inconsistencies and make it safer for the operator (like what should it do if the cluster changes, should it try to delete the role from the old one).

@rkthtrifork
Copy link
Contributor Author

I'm taking a second look at my code to see if there's anything I'm not happy about and I see naked returns are sometimes used and sometimes not. I see that the first reconciler (cluster reconciler) does not use naked return values and it seems somewhat random when we use naked returns and when we use explicit return values. I personally find the naked returns quite difficult to read and I have to backtrack in the code to see where the return value comes from so I would like to use explicit return values everywhere for the ISM Policy. Is this something you would have an issue with for this PR?

This boils down to basically personal style. I'm fine with using explicit returns.

Got it

I also don't understand why there are so many cases where the reconciliation is not requeued. Is this intended? And is it correctly understood that r.updateStatus is a boolean representing whether the operator should update the CR or is it something else?

The r.updateStatus is just a little helper for testing to make writing tests a bit easier. For reconciles the basic rule is: Each single reconciler should reconcile if it ran into an error or if its waiting on something. If there was no problem or nothing to do the default should be reconcile after 30s. It could be a bit confusing because for all reconcilers running under the cluster banner the top-level controller does this 30s default reconcile requeue. For the others there might be inconsistencies. If you find any, cleaning them up would be dope.

Got it

And a last question. Is there a reason why we check that the cluster ref has not been changed? Would we not expect the CR to be applied to any OpenSearch Cluster with the given name? Especially since the name has to be unique for each namespace

That was an early design decision to avoid inconsistencies and make it safer for the operator (like what should it do if the cluster changes, should it try to delete the role from the old one).

We have been talking about why it does this since we are running a test cluster where we sometimes kill OpenSearch completely during downgrades (when we retry an upgrade) and when we have tested some stuff that didn't go as planned. When we kill OpenSearch completely, we have to delete all the custom resources as well and we were unsure why it was made this way. I have no issue whatsoever with adding it back, but I wanted to raise the question first

@rkthtrifork
Copy link
Contributor Author

@swoehrl-mw Should I add back the cluster ref check and fix the test and then you can take a look at the PR?

@swoehrl-mw
Copy link
Collaborator

Should I add back the cluster ref check and fix the test and then you can take a look at the PR?

@rkthtrifork Yes, please add back the ref check. That way all reconcilers are consistent.

@rkthtrifork rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch 4 times, most recently from 3cca2d0 to a3b8382 Compare August 21, 2024 12:34
@rkthtrifork rkthtrifork marked this pull request as ready for review August 21, 2024 12:51
@rkthtrifork rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch 2 times, most recently from cc6b2c8 to 623268a Compare August 22, 2024 12:08
@rkthtrifork rkthtrifork force-pushed the patch/fix-ismpolicy-needs-update branch from 623268a to c0cc5bb Compare August 22, 2024 12:09
@rkthtrifork
Copy link
Contributor Author

@swoehrl-mw I fixed up the unit tests and added back the cluster ref check. Would you or someone else take a look?

Copy link
Collaborator

@swoehrl-mw swoehrl-mw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @rkthtrifork for your contribution.

@swoehrl-mw swoehrl-mw merged commit ec6428d into opensearch-project:main Aug 22, 2024
8 of 9 checks passed
@prudhvigodithi
Copy link
Member

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

Hey @cthtrifork I'm open for setting up a community call, we even have a slack channel #k8s-operator https://opensearch.slack.com/archives/C06QRV1RLD7, please join.
Thanks
@getsaurabh02

@rkthtrifork
Copy link
Contributor Author

@swoehrl-mw Is there a possibility of a community call or similar to discuss this? We would like to help improve this, but we need to align on the approach.

Hey @cthtrifork I'm open for setting up a community call, we even have a slack channel #k8s-operator https://opensearch.slack.com/archives/C06QRV1RLD7, please join. Thanks @getsaurabh02

I joined the Slack channel as well. Idk if there's a reason for a community meeting in regards to discussing this now that it's already merged, but I still do think there are few inconsistencies with how the reconcilers handle different cases and I think I also saw some cases where some of the reconcilers didn't requeue when I would expect it to. Getting those aligned would be cool. I would have to check up on it before I have some concrete places in the code to mention

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

4 participants