
Member info in SPC status - capacity manager part #1119

Open
wants to merge 29 commits into base: master
Conversation


@metlos metlos commented Jan 8, 2025

This slightly simplifies the capacity manager by relying more on the ready status of the SPC (which is now based on more complex logic).

A number of tests were also updated so that they no longer have to initialize the count cache just for the capacity manager; instead, they set up the capacity manager to use only SPCs for determining the space counts.

This is a follow-up to PR #1109.

Note that there are no new or updated e2e tests for this PR because it should not change any behavior whatsoever.
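To illustrate the test setup change, here is a minimal sketch of what this wiring can look like in the updated tests (the helper names such as hspc.GetSpaceCountFromSpaceProvisionerConfigs and capacity.NewClusterManager are taken from the review snippets below; the surrounding test scaffolding is assumed):

```go
// Sketch only: instead of initializing the counter cache, a test hands the
// capacity manager a space-count getter that reads the counts from the SPCs.
spc1 := hspc.NewEnabledValidTenantSPC("member1", spc.MaxNumberOfSpaces(1000))
spc2 := hspc.NewEnabledValidTenantSPC("member2", spc.MaxNumberOfSpaces(1000))
fakeClient := commontest.NewFakeClient(t, spc1, spc2)

clusterManager := capacity.NewClusterManager(
	test.HostOperatorNs,
	fakeClient,
	// the SPC-backed getter, as seen in the review snippets below
	hspc.GetSpaceCountFromSpaceProvisionerConfigs(fakeClient, test.HostOperatorNs),
)
```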

metlos and others added 25 commits November 28, 2024 20:26
the corresponding cluster. The capacity manager is simplified to take this
fact into account, even though it needs to re-check the space count from
the cache to decrease the chance of placing spaces onto full clusters just
because the reconciliation of the SPC hasn't happened yet.
The readiness reason will reflect the situation better in that case.
Co-authored-by: Francisc Munteanu <[email protected]>
Co-authored-by: Francisc Munteanu <[email protected]>
This simplifies the logic in the controller and doesn't increase the
complexity in the controller tests.
a test package and simplify how the SpaceCountGetter is obtained.
SpaceProvisionerConfig. This data is just for information purposes and is
not yet used anywhere else in the operator.
contained in the SpaceProvisionerConfig. In addition to that it only
uses the counts cache to minimize the chance of overcommitting spaces
to clusters.
@metlos metlos requested a review from mfrancisc as a code owner January 8, 2025 14:22
@openshift-ci openshift-ci bot requested a review from rsoaresd January 8, 2025 14:22

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metlos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 8, 2025

metlos commented Jan 9, 2025

/retest

@MatousJobanek MatousJobanek (Contributor) left a comment:

It looks like the PR still needs some cleanup after the changes and discussions in the previous PR. I'm talking mainly about the usage of the SPC status for storing the number of provisioned Spaces - we shouldn't do that in the capacity manager.

In other words, use the SPC only to get the readiness and the threshold for the max number of Spaces, nothing else. The Space count should be taken from the cache only.
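For illustration, the split being asked for could look roughly like this (a sketch with assumed names, not the PR's actual code):

```go
// Sketch (assumed names): the SPC contributes only its readiness and the
// max-number-of-spaces threshold; the current space count comes from the
// counter cache, never from the SPC status.
func hasFreeCapacity(spcIsReady bool, maxSpaces uint, cachedSpaceCount uint) bool {
	if !spcIsReady {
		// a not-ready SPC is never an eligible target
		return false
	}
	// 0 is treated here as "no limit configured"
	return maxSpaces == 0 || cachedSpaceCount < maxSpaces
}
```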


// when
- approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient))
+ approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient, hspc.GetSpaceCountFromSpaceProvisionerConfigs(fakeClient, test.HostOperatorNs)))
Contributor:

this doesn't seem to be correct - we shouldn't use SPC for getting the Space count

Contributor:

As mentioned below, since we don't need to worry about the counts & thresholds in this unit test, it's fine to leave them at 0, so there's no need to initialize anything.

Contributor:

btw, how about having something like this:

type CountPerCluster func() (string, int)

func ClusterCount(cluster string, count int) CountPerCluster {
	return func() (string, int) {
		return cluster, count
	}
}

func InitializeCountersWith(t *testing.T, counts ...CountPerCluster) {
	commontest.SetEnvVarAndRestore(t, "WATCH_NAMESPACE", commontest.HostOperatorNs)
	counter.Reset()
	t.Cleanup(counter.Reset)

	var members []ToolchainStatusOption
	for _, clusterCount := range counts {
		cluster, count := clusterCount()
		members = append(members, WithMember(cluster, WithSpaceCount(count)))
	}

	toolchainStatus := NewToolchainStatus(members...)
	initializeCounters(t, commontest.NewFakeClient(t), toolchainStatus)
}

so it could be used:

InitializeCountersWith(t, ClusterCount("member1", 0), ClusterCount("member2", 0))

spc1 := hspc.NewEnabledValidTenantSPC("member1", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(70))
spc2 := hspc.NewEnabledValidTenantSPC("member2", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(75))
fakeClient := commontest.NewFakeClient(t, toolchainStatus, toolchainConfig, spc1, spc2)
InitializeCounters(t, toolchainStatus)
Contributor:

I believe that we should still initialize the counter cache; without it, you won't be able to test the expected logic.

Contributor:

Based on the other comment below, we don't need to initialize the counters in these unit tests; it's fine if they just stay at 0 in the cache.

Comment on lines +242 to +243
spc1 := hspc.NewEnabledTenantSPC("member1")
spc2 := hspc.NewEnabledValidTenantSPC("member2")
Contributor:

We also need to verify that the space count thresholds are applied, not only the readiness.

Contributor Author:

I don't think that user approval tests should be testing the capacity manager. That's why I wanted to simplify these tests as much as possible. It is of no concern to the approval process why a cluster is not capable of accepting a space.

I didn't want to be radical, so I went for a "middle ground" solution where I'd just simplify the capacity manager logic to at least not rely on the ToolchainStatus. But in hindsight I think I should have just modified the usersignup controller so that the capacity manager can be completely mocked out, so that these tests actually test only what they're supposed to and don't need a ton of setup.

I do think, though, that there shouldn't be a single threshold check in the approval tests. The approval is only concerned with clusters that can be deployed to. It should not have to know the reasons why a cluster is not eligible, just whether it is or not.

Contributor:

Good point 👍
Yeah, you are right that testing the thresholds in the capacity manager tests should be enough, and here we should only test whether the SPC is (not) ready.

controllers/usersignup/approval_test.go (outdated comment, resolved)
Comment on lines -2081 to -2088
InitializeCounters(t, NewToolchainStatus(
WithMetric(toolchainv1alpha1.UserSignupsPerActivationAndDomainMetricKey, toolchainv1alpha1.Metric{
"1,external": 1,
}),
WithMetric(toolchainv1alpha1.MasterUserRecordsPerDomainMetricKey, toolchainv1alpha1.Metric{
string(metrics.External): 1,
}),
))
Contributor:

Let's keep using the counters (and their initialization) here as well. Or is there any reason for dropping them?

Comment on lines 19 to 23
// SpaceCountGetter is a function useful for mocking the counts cache; it can be passed
// to the NewClusterManager function. The returned values represent the actual number
// of spaces found for the given cluster and whether the value was found at all
// (similar to the return values of a map lookup).
SpaceCountGetter func(ctx context.Context, clusterName string) (int, bool)
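For illustration, a SpaceCountGetter that mocks the counts cache with fixed values could be as small as this (a hypothetical helper, a sketch rather than anything in the PR):

```go
// fixedSpaceCounts returns a SpaceCountGetter backed by a fixed map, mimicking
// a map lookup just like the counts cache would (hypothetical helper).
func fixedSpaceCounts(counts map[string]int) SpaceCountGetter {
	return func(_ context.Context, clusterName string) (int, bool) {
		count, found := counts[clusterName]
		return count, found
	}
}
```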
Contributor:

Is this really needed? Why not keep the original format that was there before?

Comment on lines 34 to 35
if spc.Status.ConsumedCapacity != nil {
spaceCount = spc.Status.ConsumedCapacity.SpaceCount
Contributor:

correct me if I'm wrong, but we shouldn't use the SPC for getting the number of provisioned Spaces

Comment on lines 262 to 268
if spaceCount, ok := getSpaceCount(ctx, spc.Spec.ToolchainCluster); ok {
// the cached spaceCount is always going to be fresher than (or as fresh as) what's in the SPC
if spc.Status.ConsumedCapacity == nil {
spc.Status.ConsumedCapacity = &toolchainv1alpha1.ConsumedCapacity{}
}
spc.Status.ConsumedCapacity.SpaceCount = spaceCount
}
Contributor:

as discussed in the previous PR, let's not update the status of the SPC if we are not going to update the actual resource - there is no benefit in doing that

Contributor Author:

We discussed that it is weird for a predicate to have side effects, which I completely agree with.

Now, the "side effect" happens while getting the candidate optimal clusters SO THAT the value we update can later be reused in the rest of the computation. Otherwise we'd be forced to repeatedly read the value from the count cache while sorting the array of candidate clusters. I stand by the position that it is better to do this once and reuse those values than to do it repeatedly. It is not only more performant but also more stable (think of concurrent updates of the count cache).

This behavior is documented in the docs of getOptimalTargetClusters() as well as in a comment in the body of GetOptimalTargetCluster().

I don't agree with the reasoning that we should only update the CRDs if we intend to update the actual resources. Those are data structures like any other - they're used to hold data. And if we need some data to be updated because we are going to need the updated values down the road, it is IMHO completely warranted to use the CRD for that. Note that these CRDs are local to the computation in the GetOptimalTargetCluster() function. They don't escape that function because they are not stored in the cluster nor in any "global" structure like the count cache. So there is no non-local "surprise" in having the updates in the CRDs.

* Get rid of SpaceCountGetter and just always use the counts cache.
* fix counter.GetCounts() to not be susceptible to concurrent modification
* have a dedicated data structure for the optimal cluster computation
…ty-manager-part' into member-info-in-spc-status_capacity-manager-part
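Regarding the "dedicated data structure for the optimal cluster computation" mentioned in the commits above, such a structure could look roughly like this (a sketch with assumed names, not the PR's actual type):

```go
// Sketch (assumed names): pair each candidate SPC with the space count read
// from the counter cache exactly once, so the value can be reused while the
// candidates are sorted, without touching spc.Status at all.
type candidateCluster struct {
	spc        *toolchainv1alpha1.SpaceProvisionerConfig
	spaceCount int // read from the counter cache once, up front
}

// the candidates can then be ordered by their cached space count, e.g.:
// sort.Slice(candidates, func(i, j int) bool {
//	return candidates[i].spaceCount < candidates[j].spaceCount
// })
```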

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 13 lines in your changes missing coverage. Please review.

Project coverage is 79.02%. Comparing base (69a195d) to head (4ad343b).

Files with missing lines Patch % Lines
pkg/counter/cache.go 65.00% 7 Missing ⚠️
pkg/capacity/manager.go 91.66% 3 Missing and 2 partials ⚠️
test/counter.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1119      +/-   ##
==========================================
- Coverage   79.15%   79.02%   -0.13%     
==========================================
  Files          78       78              
  Lines        7809     7814       +5     
==========================================
- Hits         6181     6175       -6     
- Misses       1449     1457       +8     
- Partials      179      182       +3     
Files with missing lines Coverage Δ
...ers/spacecompletion/space_completion_controller.go 85.48% <100.00%> (-0.24%) ⬇️
controllers/usersignup/approval.go 100.00% <100.00%> (ø)
test/spaceprovisionerconfig/util.go 100.00% <100.00%> (ø)
test/counter.go 83.33% <50.00%> (ø)
pkg/capacity/manager.go 95.04% <91.66%> (-3.45%) ⬇️
pkg/counter/cache.go 83.02% <65.00%> (-1.98%) ⬇️
