
Defer cEOS-lab pod check, update operator version #534

Merged
2 commits merged into openconfig:main from node-status-pr on May 6, 2024

Conversation

frasieroh
Contributor

A customer is expressing performance concerns at high scale (~100 instances across ~10 nodes). One of their findings is that cEOS-lab instances appear to start sequentially rather than in parallel.

Because the pod check is baked into (n *Node) Config instead of (n *Node) Status, we don't create the next cEOS-lab custom resource object until the previous pod has started. With this change, all of the custom resources are created at once.
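
A minimal sketch of the deferred check, assuming a client-go Clientset and hypothetical field names (the real struct in topo/node/arista/arista.go has more to it):

```go
package arista

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Node holds just enough state for this sketch; the field names are assumptions.
type Node struct {
	KubeClient kubernetes.Interface
	Namespace  string
	PodName    string
}

// Status reports the pod phase so the caller (e.g. checkNodeStatus) can poll
// for readiness, instead of the node blocking on pod startup during creation.
func (n *Node) Status(ctx context.Context) (corev1.PodPhase, error) {
	pod, err := n.KubeClient.CoreV1().Pods(n.Namespace).Get(ctx, n.PodName, metav1.GetOptions{})
	if err != nil {
		return corev1.PodUnknown, err
	}
	return pod.Status.Phase, nil
}
```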

The new operator version increases the number of reconciliation workers from 1 to runtime.NumCPU to cope with this change. It turns out the operator spends most of its time generating self-signed RSA certs; depending on how the runtime schedules the worker goroutines, there may be further gains there.
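
For context, this is roughly how a controller-runtime operator raises its worker count; a generic sketch, not the actual arista-ceoslab-operator wiring:

```go
package main

import (
	"runtime"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// buildController registers a reconciler with one worker goroutine per CPU
// instead of controller-runtime's default of a single worker.
func buildController(mgr ctrl.Manager, r reconcile.Reconciler, obj client.Object) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithOptions(controller.Options{MaxConcurrentReconciles: runtime.NumCPU()}).
		Complete(r)
}
```

With more than one worker, independent custom resources can reconcile concurrently, so the per-device certificate generation overlaps instead of queueing behind a single goroutine.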

Thanks!

The Arista node is waiting for its pod to come up in Create(). This is
problematic in high-scale scenarios because we create the pods
synchronously.

Move the check to Status() so checkNodeStatus() handles things instead.
See release notes: https://github.com/aristanetworks/arista-ceoslab-operator/releases/tag/v2.1.2

Increased the number of workers in the operator to the number of cores
(from the default value of 1). This may improve performance in
high-scale scenarios, since most of the reconciliation time is spent generating RSA certificates.
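
For a sense of scale (the key size here is an assumption, not necessarily what the operator uses), a single RSA key generation is already measurable, and one is needed per device:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"time"
)

func main() {
	// A 4096-bit key often takes hundreds of milliseconds or more to generate,
	// so doing ~100 of these serially in one worker adds noticeable wall time.
	start := time.Now()
	if _, err := rsa.GenerateKey(rand.Reader, 4096); err != nil {
		panic(err)
	}
	fmt.Printf("generated one 4096-bit RSA key in %v\n", time.Since(start))
}
```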
@coveralls

Pull Request Test Coverage Report for Build 8963970828

Details

  • 19 of 20 (95.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 65.176%

Changes Missing Coverage     Covered Lines    Changed/Added Lines    %
topo/node/arista/arista.go   19               20                     95.0%

Totals Coverage Status
  • Change from base Build 8840821359: +0.06%
  • Covered Lines: 4634
  • Relevant Lines: 7110

💛 - Coveralls

@chrisy

chrisy commented May 6, 2024

As the concerned customer, thanks for this. FWIW, I do have a hack that lets KNE avoid the local wait by launching all pods of a topology in parallel, with the obvious concern of simply shifting the problem to other APIs -- in this case, that's what made it apparent that the cEOS operator was itself somewhat serializing the work.
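
Not the actual patch, but the shape of that hack is roughly launching every node's create concurrently (the Creator interface here is a stand-in for the real KNE node type):

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Creator is a stand-in for whatever creates a single node's pod in kne.
type Creator interface {
	Create(ctx context.Context) error
}

// createAll starts every node's Create concurrently and waits for all of them,
// failing fast if any single create returns an error.
func createAll(ctx context.Context, nodes []Creator) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, n := range nodes {
		n := n // capture the loop variable for the goroutine
		g.Go(func() error { return n.Create(ctx) })
	}
	return g.Wait()
}
```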

@alexmasi
Contributor

alexmasi commented May 6, 2024

/gcbrun

@alexmasi alexmasi merged commit a88cfaf into openconfig:main May 6, 2024
5 checks passed
@frasieroh frasieroh deleted the node-status-pr branch May 6, 2024 18:59