
Member info in SPC status - capacity manager part #1119

Open
wants to merge 29 commits into base: master
Conversation


@metlos metlos commented Jan 8, 2025

This slightly simplifies the capacity manager by relying more on the ready status of the SPC (which is now based on more complex logic).

A number of tests were also updated so that they no longer have to initialize the count cache just for the capacity manager; instead, they set up the capacity manager to use only SPCs for determining the space counts.

This is a follow-up to PR #1109.

Note that there are no new or updated e2e tests for this PR because it should not change any behavior whatsoever.
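To illustrate the test setup change, here is a minimal sketch of what this wiring can look like in the updated tests (the helper names such as hspc.GetSpaceCountFromSpaceProvisionerConfigs and capacity.NewClusterManager are taken from the review snippets below; the surrounding test scaffolding is assumed):

```go
// Sketch only: instead of initializing the counter cache, a test hands the
// capacity manager a space-count getter that reads the counts from the SPCs.
spc1 := hspc.NewEnabledValidTenantSPC("member1", spc.MaxNumberOfSpaces(1000))
spc2 := hspc.NewEnabledValidTenantSPC("member2", spc.MaxNumberOfSpaces(1000))
fakeClient := commontest.NewFakeClient(t, spc1, spc2)

clusterManager := capacity.NewClusterManager(
	test.HostOperatorNs,
	fakeClient,
	// the SPC-backed getter, as seen in the review snippets below
	hspc.GetSpaceCountFromSpaceProvisionerConfigs(fakeClient, test.HostOperatorNs),
)
```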

metlos and others added 25 commits November 28, 2024 20:26
the corresponding cluster. The capacity manager is simplified to take this
fact into account, even though it needs to re-check the space count from
the cache to decrease the chance of placing spaces onto full clusters just
because the reconciliation of the SPC hasn't happened yet.
The readiness reason will reflect the situation better in that case.
Co-authored-by: Francisc Munteanu <[email protected]>
Co-authored-by: Francisc Munteanu <[email protected]>
This simplifies the logic in the controller and doesn't increase the
complexity in the controller tests.
a test package and simplify how the SpaceCountGetter is obtained.
SpaceProvisionerConfig. This data is just for information purposes and is
not yet used anywhere else in the operator.
contained in the SpaceProvisionerConfig. In addition to that it only
uses the counts cache to minimize the chance of overcommitting spaces
to clusters.
@metlos metlos requested a review from mfrancisc as a code owner January 8, 2025 14:22
@openshift-ci openshift-ci bot requested a review from rsoaresd January 8, 2025 14:22

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metlos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 8, 2025

metlos commented Jan 9, 2025

/retest

@MatousJobanek MatousJobanek (Contributor) left a comment:

It looks like the PR still needs some cleanup after the changes and discussions in the previous PR. I'm talking mainly about the usage of the SPC status for storing the number of provisioned Spaces - we shouldn't do that in the capacity manager.

In other words, use the SPC only to get the readiness and the threshold for the max number of Spaces, nothing else. The Space count should be taken from the cache only.
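For illustration, the split being asked for could look roughly like this (a sketch with assumed names, not the PR's actual code):

```go
// Sketch (assumed names): the SPC contributes only its readiness and the
// max-number-of-spaces threshold; the current space count comes from the
// counter cache, never from the SPC status.
func hasFreeCapacity(spcIsReady bool, maxSpaces uint, cachedSpaceCount uint) bool {
	if !spcIsReady {
		// a not-ready SPC is never an eligible target
		return false
	}
	// 0 is treated here as "no limit configured"
	return maxSpaces == 0 || cachedSpaceCount < maxSpaces
}
```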


// when
- approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient))
+ approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient, hspc.GetSpaceCountFromSpaceProvisionerConfigs(fakeClient, test.HostOperatorNs)))
Contributor:

this doesn't seem to be correct - we shouldn't use SPC for getting the Space count

Contributor:

As mentioned below, since we don't need to worry about the counts & thresholds in this unit test, it's fine to leave them at 0, so there's no need to initialize anything.

Contributor:

btw, how about having something like this:

type CountPerCluster func() (string, int)

func ClusterCount(cluster string, count int) CountPerCluster {
	return func() (string, int) {
		return cluster, count
	}
}

func InitializeCountersWith(t *testing.T, counts ...CountPerCluster) {
	commontest.SetEnvVarAndRestore(t, "WATCH_NAMESPACE", commontest.HostOperatorNs)
	counter.Reset()
	t.Cleanup(counter.Reset)

	var members []ToolchainStatusOption
	for _, clusterCount := range counts {
		cluster, count := clusterCount()
		members = append(members, WithMember(cluster, WithSpaceCount(count)))
	}

	toolchainStatus := NewToolchainStatus(members...)
	initializeCounters(t, commontest.NewFakeClient(t), toolchainStatus)
}

so it could be used:

InitializeCountersWith(t, ClusterCount("member1", 0), ClusterCount("member2", 0))

spc1 := hspc.NewEnabledValidTenantSPC("member1", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(70))
spc2 := hspc.NewEnabledValidTenantSPC("member2", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(75))
fakeClient := commontest.NewFakeClient(t, toolchainStatus, toolchainConfig, spc1, spc2)
InitializeCounters(t, toolchainStatus)
Contributor:

I believe that we should still initialize the counter cache; without it, you won't be able to test the expected logic.

Contributor:

Based on the other comment below, we don't need to initialize the counters in these unit tests; it's fine if they just stay at 0 in the cache.

Comment on lines +242 to +243
spc1 := hspc.NewEnabledTenantSPC("member1")
spc2 := hspc.NewEnabledValidTenantSPC("member2")
Contributor:

We also need to verify that the space count thresholds are applied, not only the readiness.

Contributor Author:

I don't think that user approval tests should be testing the capacity manager. That's why I wanted to simplify these tests as much as possible. It is of no concern to the approval process why a cluster is not capable of accepting a space.

I didn't want to be radical, so I went for a "middle ground" solution where I'd just simplify the capacity manager logic to at least not rely on the ToolchainStatus. But in hindsight I think I should have just modified the usersignup controller so that the capacity manager can be completely mocked out, so that these tests actually test only what they're supposed to and don't need a ton of setup.

I do think, though, that there shouldn't be a single threshold check in the approval tests. The approval is only concerned with clusters that can be deployed to. It should not have to know the reasons why a cluster is not eligible, just whether it is or not.

Contributor:

Good point 👍
Yeah, you are right that testing the thresholds in the capacity manager tests should be enough, and here we should only test whether the SPC is (not) ready.

controllers/usersignup/approval_test.go (outdated comment, resolved)
Comment on lines -2081 to -2088
InitializeCounters(t, NewToolchainStatus(
WithMetric(toolchainv1alpha1.UserSignupsPerActivationAndDomainMetricKey, toolchainv1alpha1.Metric{
"1,external": 1,
}),
WithMetric(toolchainv1alpha1.MasterUserRecordsPerDomainMetricKey, toolchainv1alpha1.Metric{
string(metrics.External): 1,
}),
))
Contributor:

Let's keep using the counters (and their initialization) here as well. Or is there any reason for dropping them?

Comment on lines 19 to 23
// SpaceCountGetter is a function useful for mocking the counts cache; it can be passed
// to the NewClusterManager function. The returned values represent the actual number
// of spaces found for the given cluster and whether the value was found at all
// (similar to the return values of a map lookup).
SpaceCountGetter func(ctx context.Context, clusterName string) (int, bool)
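For illustration, a SpaceCountGetter that mocks the counts cache with fixed values could be as small as this (a hypothetical helper, a sketch rather than anything in the PR):

```go
// fixedSpaceCounts returns a SpaceCountGetter backed by a fixed map, mimicking
// a map lookup just like the counts cache would (hypothetical helper).
func fixedSpaceCounts(counts map[string]int) SpaceCountGetter {
	return func(_ context.Context, clusterName string) (int, bool) {
		count, found := counts[clusterName]
		return count, found
	}
}
```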
Contributor:

Is this really needed? Why not keep the original format that was there before?

Comment on lines 34 to 35
if spc.Status.ConsumedCapacity != nil {
spaceCount = spc.Status.ConsumedCapacity.SpaceCount
Contributor:

correct me if I'm wrong, but we shouldn't use the SPC for getting the number of provisioned Spaces

Comment on lines 262 to 268
if spaceCount, ok := getSpaceCount(ctx, spc.Spec.ToolchainCluster); ok {
// the cached spaceCount is always going to be fresher than (or as fresh as) what's in the SPC
if spc.Status.ConsumedCapacity == nil {
spc.Status.ConsumedCapacity = &toolchainv1alpha1.ConsumedCapacity{}
}
spc.Status.ConsumedCapacity.SpaceCount = spaceCount
}
Contributor:

as discussed in the previous PR, let's not update the status of the SPC if we are not going to update the actual resource - there is no benefit in doing that

Contributor Author:

We discussed that it is weird for a predicate to have side effects, which I completely agree with.

Now, the "side effect" happens while getting the candidate optimal clusters SO THAT the value we update can later be reused in the rest of the computation. Otherwise we'd be forced to repeatedly read the value from the count cache while sorting the array of candidate clusters. I stand by the position that it is better to do this once and reuse those values than to do it repeatedly. It is not only more performant but also more stable (think of concurrent updates of the count cache).

This behavior is documented in the docs of getOptimalTargetClusters() as well as in a comment in the body of GetOptimalTargetCluster().

I don't agree with the reasoning that we should only update the CRDs if we intend to update the actual resources. Those are data structures like any other - they're used to hold data. And if we need some data to be updated because we are going to need the updated values down the road, it is IMHO completely warranted to use the CRD for that. Note that these CRDs are local to the computation in the GetOptimalTargetCluster() function. They don't escape that function because they are not stored in the cluster nor in any "global" structure like the count cache. So there is no non-local "surprise" in having the updates in the CRDs.

* Get rid of SpaceCountGetter and just always use the counts cache.
* fix counter.GetCounts() to not be susceptible to concurrent modification
* have a dedicated data structure for the optimal cluster computation
…ty-manager-part' into member-info-in-spc-status_capacity-manager-part
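Regarding the "dedicated data structure for the optimal cluster computation" mentioned in the commits above, such a structure could look roughly like this (a sketch with assumed names, not the PR's actual type):

```go
// Sketch (assumed names): pair each candidate SPC with the space count read
// from the counter cache exactly once, so the value can be reused while the
// candidates are sorted, without touching spc.Status at all.
type candidateCluster struct {
	spc        *toolchainv1alpha1.SpaceProvisionerConfig
	spaceCount int // read from the counter cache once, up front
}

// the candidates can then be ordered by their cached space count, e.g.:
// sort.Slice(candidates, func(i, j int) bool {
//	return candidates[i].spaceCount < candidates[j].spaceCount
// })
```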

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 13 lines in your changes missing coverage. Please review.

Project coverage is 79.02%. Comparing base (69a195d) to head (4ad343b).

Files with missing lines Patch % Lines
pkg/counter/cache.go 65.00% 7 Missing ⚠️
pkg/capacity/manager.go 91.66% 3 Missing and 2 partials ⚠️
test/counter.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1119      +/-   ##
==========================================
- Coverage   79.15%   79.02%   -0.13%     
==========================================
  Files          78       78              
  Lines        7809     7814       +5     
==========================================
- Hits         6181     6175       -6     
- Misses       1449     1457       +8     
- Partials      179      182       +3     
Files with missing lines Coverage Δ
...ers/spacecompletion/space_completion_controller.go 85.48% <100.00%> (-0.24%) ⬇️
controllers/usersignup/approval.go 100.00% <100.00%> (ø)
test/spaceprovisionerconfig/util.go 100.00% <100.00%> (ø)
test/counter.go 83.33% <50.00%> (ø)
pkg/capacity/manager.go 95.04% <91.66%> (-3.45%) ⬇️
pkg/counter/cache.go 83.02% <65.00%> (-1.98%) ⬇️
