Exposing scale subresource to enable HPAs to modify the replicas of Infinispan CR #2133

Merged: 6 commits merged into infinispan:main on Sep 30, 2024

Conversation

@ZeidH (Contributor) commented Jul 29, 2024

Enhanced the spec.subresources field for the Infinispan CRD to map scaling

      scale:
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas

This enables APIs like the HorizontalPodAutoscaler and KEDA to modify the replica count of the Cache.
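
For illustration only (not part of this PR): a HorizontalPodAutoscaler targeting the Infinispan CR through the scale subresource could look roughly like the sketch below. The names and thresholds are hypothetical, and resource-based metrics additionally require the scale subresource to expose a label selector (labelSelectorPath), which the mapping above does not define.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: infinispan-hpa                  # hypothetical name
  namespace: infinispan-operator-system
spec:
  scaleTargetRef:
    apiVersion: infinispan.org/v1       # target the Infinispan CR, not the underlying StatefulSet
    kind: Infinispan
    name: test-cache
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource                    # needs a label selector on the scale subresource to resolve pods
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70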

openshift-ci bot commented Jul 29, 2024

Hi @ZeidH. Thanks for your PR.

I'm waiting for an infinispan member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ZeidH mentioned this pull request Jul 29, 2024
@ryanemerson (Contributor) left a comment


Thanks for the contribution @ZeidH. However, we generate our CRD content from code, so it's not sufficient to update the static resource files.

Looking at the Artemis PR you linked on the issue, it seems like we can generate this by adding the following annotation above the Infinispan struct:

//+kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas

You should then be able to generate the static files by running make generate.
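
Once regenerated, the static CRD YAML should contain a subresources stanza along these lines (a sketch mirroring the mapping quoted in the PR description; the exact placement within the CRD versions list, and the status subresource shown here, depend on the existing definition):

subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
  status: {}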

@ZeidH (Contributor, Author) commented Jul 31, 2024

Thanks for the clear instructions, @ryanemerson! I'm rather new to this, so I appreciate the help.
I encountered this issue with controller-gen on my version of Go (1.22.5), so I also updated it to the latest version (1.15.0).

@ryanemerson (Contributor) commented

Thanks for updating, @ZeidH. It seems like we need to increase our baseline Go version in order to use that version of controller-gen. I have created a PR (#2134) to baseline on 1.21 (1.20 is now EOL) and update the controller-gen version. I'll let you know once that PR has been merged so that you can rebase this PR.

@ZeidH (Contributor, Author) commented Aug 9, 2024

I've tested the autoscaling with this setup:

./scripts/ci/kind-with-olm.sh
make deploy-cert-manager
helm install keda kedacore/keda --namespace keda --create-namespace
skaffold dev

and a ScaledObject that targets the Infinispan CR:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: infinispan-scaler
  namespace: infinispan-operator-system
spec:
  minReplicaCount: 0
  scaleTargetRef:
    apiVersion: infinispan.org/v1
    kind: Infinispan
    name: test-cache
  triggers:
    - metadata:
        desiredReplicas: '2'
        start: 52 8 * * *
        end: 59 8 * * *
        timezone: UTC
      type: cron
---
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  annotations:
    infinispan.org/monitoring: 'true'
    infinispan.org/operatorPodTargetLabels: 'rht.comp,rht.comp_ver,rht.prod_name,rht.prod_ver,rht.subcomp_t'
  name: test-cache
  namespace: infinispan-operator-system
spec:
  service:
    container:
      livenessProbe:
        failureThreshold: 5
        initialDelaySeconds: 0
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      readinessProbe:
        failureThreshold: 5
        initialDelaySeconds: 0
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      startupProbe:
        failureThreshold: 600
        initialDelaySeconds: 3
        periodSeconds: 1
        successThreshold: 1
        timeoutSeconds: 1
    type: DataGrid
  jmx: {}
  configListener:
    enabled: true
    logging:
      level: info
  upgrades:
    type: Shutdown
  replicas: 1

The StatefulSet was able to scale from zero to n and back, but I can imagine that scaling to zero is not possible for some configurations.

@ryanemerson (Contributor) commented Sep 3, 2024

Apologies for the delay @ZeidH, but main is now baselined on go 1.21. Can you please rebase on the latest code?

@ryanemerson (Contributor) commented Sep 3, 2024

but I can imagine that for some configurations scaling to zero is not possible

Can you explain which configurations you think would cause issues?

@ryanemerson (Contributor) left a comment


Thanks @ZeidH, just a couple of minor comments and then I think we're good.

api/v1/infinispan_types.go (review thread resolved)
@ZeidH (Contributor, Author) commented Sep 9, 2024

but I can imagine that for some configurations scaling to zero is not possible

Can you explain which configurations you think would cause issues?

I can't think of any that are specific to Infinispan. The number of available configurations is so large that it would take a long time to figure out which ones can be autoscaled and which cannot. Having learned a bit about what Infinispan does over the past few months, I think the configurations where autoscaling doesn't work well are also the ones where autoscaling or scale-to-zero wouldn't make much sense, such as clustered caches where you (almost) always want a hot instance.

But from experience, there can be issues with autoscaling when using certain types of storage configuration, or, as @rigazilla mentioned in the GitHub issue, when the operator itself needs to explicitly set .spec.Replicas, e.g. during an upgrade.

What we have been attempting to do is to dynamically scale a replicated cache with the number of nodes (one cache per node), essentially making Infinispan behave like a DaemonSet. We're not done with it just yet, but so far it's been working out well.
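
For context, one way to approximate that DaemonSet-like behaviour with KEDA is its kubernetes-workload scaler, which tracks the number of pods matching a selector (for example, the pods of a DaemonSet, one per node). This is only a sketch and not part of this PR; the selector and names are hypothetical.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: infinispan-node-scaler          # hypothetical name
  namespace: infinispan-operator-system
spec:
  scaleTargetRef:
    apiVersion: infinispan.org/v1
    kind: Infinispan
    name: test-cache
  triggers:
    - type: kubernetes-workload
      metadata:
        # Match the Infinispan replica count to the number of selected pods,
        # e.g. a DaemonSet with one pod per node (selector is hypothetical).
        podSelector: 'app=per-node-daemonset'
        value: '1'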

@ryanemerson enabled auto-merge (squash) September 30, 2024 10:47
@ryanemerson merged commit 32d71f3 into infinispan:main Sep 30, 2024
2 checks passed
@ryanemerson (Contributor) commented

Thanks @ZeidH and sorry for the delay. I'll create a follow-up PR with some documentation. Let me know if you need this feature soon via OLM and I can expedite the next release.

ryanemerson added six commits to ryanemerson/infinispan-operator that referenced this pull request (Sep 30 to Oct 2, 2024)
@ZeidH (Contributor, Author) commented Oct 2, 2024

Appreciate the feedback and the completion of this issue! It will help us greatly. We're not in a rush, but we'd be very happy if this were released sometime in the coming two weeks!

@ryanemerson (Contributor) commented

@ZeidH 2.4.5 has been released now.

@ZeidH (Contributor, Author) commented Oct 10, 2024

Thank you @ryanemerson for the help!
