
Defer cEOS-lab pod check, update operator version #534

Merged
2 commits merged into openconfig:main from node-status-pr on May 6, 2024

Conversation

frasieroh
Contributor

A customer is expressing performance concerns at high scale (~100 instances across ~10 nodes). One of their findings is that cEOS-lab instances appear to start sequentially rather than in parallel.

Because the pod check is baked into (n *Node) Config instead of (n *Node) Status, we don't create the next cEOS-lab custom resource object until the previous pod has started. With this change, all of the custom resources are created at once.
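
A minimal sketch of the deferred check, assuming a client-go Clientset and hypothetical field names (the real struct in topo/node/arista/arista.go has more to it):

```go
package arista

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Node holds just enough state for this sketch; the field names are assumptions.
type Node struct {
	KubeClient kubernetes.Interface
	Namespace  string
	PodName    string
}

// Status reports the pod phase so the caller (e.g. checkNodeStatus) can poll
// for readiness, instead of the node blocking on pod startup during creation.
func (n *Node) Status(ctx context.Context) (corev1.PodPhase, error) {
	pod, err := n.KubeClient.CoreV1().Pods(n.Namespace).Get(ctx, n.PodName, metav1.GetOptions{})
	if err != nil {
		return corev1.PodUnknown, err
	}
	return pod.Status.Phase, nil
}
```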

The new operator version increases the number of reconciliation workers from 1 to runtime.NumCPU to cope with this change. It turns out the operator spends most of its time generating self-signed RSA certs; depending on how the runtime schedules the worker goroutines, there may be further gains there.
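
For context, this is roughly how a controller-runtime operator raises its worker count; a generic sketch, not the actual arista-ceoslab-operator wiring:

```go
package main

import (
	"runtime"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// buildController registers a reconciler with one worker goroutine per CPU
// instead of controller-runtime's default of a single worker.
func buildController(mgr ctrl.Manager, r reconcile.Reconciler, obj client.Object) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithOptions(controller.Options{MaxConcurrentReconciles: runtime.NumCPU()}).
		Complete(r)
}
```

With more than one worker, independent custom resources can reconcile concurrently, so the per-device certificate generation overlaps instead of queueing behind a single goroutine.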

Thanks!

The Arista node is waiting for its pod to come up in Create(). This is
problematic in high-scale scenarios because we create the pods
synchronously.

Move the check to Status() so checkNodeStatus() handles things instead.
See release notes: https://github.com/aristanetworks/arista-ceoslab-operator/releases/tag/v2.1.2

Increased the number of workers in the operator to the number of cores
(from the default value of 1). This may improve performance in
high-scale scenarios, since most of the reconciliation time is spent generating RSA certificates.
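
For a sense of scale (the key size here is an assumption, not necessarily what the operator uses), a single RSA key generation is already measurable, and one is needed per device:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"time"
)

func main() {
	// A 4096-bit key often takes hundreds of milliseconds or more to generate,
	// so doing ~100 of these serially in one worker adds noticeable wall time.
	start := time.Now()
	if _, err := rsa.GenerateKey(rand.Reader, 4096); err != nil {
		panic(err)
	}
	fmt.Printf("generated one 4096-bit RSA key in %v\n", time.Since(start))
}
```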
@coveralls

Pull Request Test Coverage Report for Build 8963970828

Details

  • 19 of 20 (95.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 65.176%

Changes Missing Coverage     Covered Lines    Changed/Added Lines    %
topo/node/arista/arista.go   19               20                     95.0%

Totals Coverage Status
  • Change from base Build 8840821359: +0.06%
  • Covered Lines: 4634
  • Relevant Lines: 7110

💛 - Coveralls

@chrisy

chrisy commented May 6, 2024

As the concerned customer, thanks for this. FWIW, I do have a hack that lets KNE avoid the local wait by launching all pods of a topology in parallel, with the obvious concern of simply shifting the problem to other APIs -- in this case, that's what made it apparent that the cEOS operator was itself somewhat serializing the work.
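
Not the actual patch, but the shape of that hack is roughly launching every node's create concurrently (the Creator interface here is a stand-in for the real KNE node type):

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Creator is a stand-in for whatever creates a single node's pod in kne.
type Creator interface {
	Create(ctx context.Context) error
}

// createAll starts every node's Create concurrently and waits for all of them,
// failing fast if any single create returns an error.
func createAll(ctx context.Context, nodes []Creator) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, n := range nodes {
		n := n // capture the loop variable for the goroutine
		g.Go(func() error { return n.Create(ctx) })
	}
	return g.Wait()
}
```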

@alexmasi
Contributor

alexmasi commented May 6, 2024

/gcbrun

@alexmasi alexmasi merged commit a88cfaf into openconfig:main May 6, 2024
5 checks passed
@frasieroh frasieroh deleted the node-status-pr branch May 6, 2024 18:59