Member info in SPC status - capacity manager part #1119
base: master

Conversation
…the corresponding cluster. The capacity manager is simplified to take this fact into account, even though it still needs to re-check the space count from the cache to decrease the chance of placing spaces on full clusters just because the reconciliation of the SPC hasn't happened yet.
The readiness reason will reflect the situation better in that case.
…o member-info-in-spc-status-2
Co-authored-by: Francisc Munteanu <[email protected]>
This simplifies the logic in the controller and doesn't increase the complexity in the controller tests.
…to be less surprising.
…a test package and simplify how the SpaceCountGetter is obtained.
…o member-info-in-spc-status-2
SpaceProvisionerConfig. This data is just for information purposes and is not yet used anywhere else in the operator.
contained in the SpaceProvisionerConfig. In addition to that it only uses the counts cache to minimize the chance of overcommitting spaces to clusters.
…c-status_capacity-manager-part
…tion of the cluster in the capacity manager
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metlos

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.

/retest
It looks like the PR still needs some cleanup after the changes and discussions in the previous PR. I'm talking mainly about the use of the SPC status for storing the number of provisioned Spaces - we shouldn't do that in the capacity manager.
In other words, use the SPC only to get the readiness and the threshold for the max number of Spaces, nothing else. The Space count should be taken from the cache only.
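For illustration, a minimal sketch of that rule — `hasCapacity` and `cachedCounts` are hypothetical names, and the exact SPC fields and condition helper are assumptions based on how they are used elsewhere in this thread:

```go
// hypothetical helper illustrating the proposed split: readiness and the
// threshold come from the SPC, the space count comes from the counter cache
func hasCapacity(spc *toolchainv1alpha1.SpaceProvisionerConfig, cachedCounts map[string]int) bool {
	// only a ready SPC is a provisioning candidate
	if !condition.IsTrue(spc.Status.Conditions, toolchainv1alpha1.ConditionReady) {
		return false
	}
	// the max number of Spaces is configuration, so the SPC spec is the right source
	maxSpaces := int(spc.Spec.CapacityThresholds.MaxNumberOfSpaces)
	// the current space count is taken from the cache only, never from the SPC status
	spaceCount := cachedCounts[spc.Spec.ToolchainCluster]
	// assuming 0 means no explicit limit
	return maxSpaces == 0 || spaceCount < maxSpaces
}
```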
```diff
  // when
- approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient))
+ approved, clusterName, err := getClusterIfApproved(ctx, fakeClient, signup, capacity.NewClusterManager(test.HostOperatorNs, fakeClient, hspc.GetSpaceCountFromSpaceProvisionerConfigs(fakeClient, test.HostOperatorNs)))
```
this doesn't seem to be correct - we shouldn't use SPC for getting the Space count
as mentioned below, since we don't need to worry about the counts & thresholds in this unit test, it's fine to leave them at 0, so there's no need to initialize anything.
btw, how about having something like this:
```go
type CountPerCluster func() (string, int)

func ClusterCount(cluster string, count int) CountPerCluster {
	return func() (string, int) {
		return cluster, count
	}
}

func InitializeCountersWith(t *testing.T, counts ...CountPerCluster) {
	commontest.SetEnvVarAndRestore(t, "WATCH_NAMESPACE", commontest.HostOperatorNs)
	counter.Reset()
	t.Cleanup(counter.Reset)
	var members []ToolchainStatusOption
	for _, clusterCount := range counts {
		cluster, count := clusterCount()
		members = append(members, WithMember(cluster, WithSpaceCount(count)))
	}
	toolchainStatus := NewToolchainStatus(members...)
	initializeCounters(t, commontest.NewFakeClient(t), toolchainStatus)
}
```
so it could be used:

```go
InitializeCountersWith(t, ClusterCount("member1", 0), ClusterCount("member2", 0))
```
```go
spc1 := hspc.NewEnabledValidTenantSPC("member1", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(70))
spc2 := hspc.NewEnabledValidTenantSPC("member2", spc.MaxNumberOfSpaces(1000), spc.MaxMemoryUtilizationPercent(75))
fakeClient := commontest.NewFakeClient(t, toolchainStatus, toolchainConfig, spc1, spc2)
InitializeCounters(t, toolchainStatus)
```
I believe that we should still initialize the counter cache; without it you won't be able to test the expected logic.
based on the other comment below, we don't need to initialize the counters in these unit tests; it's fine if they stay at 0 in the cache.
```go
spc1 := hspc.NewEnabledTenantSPC("member1")
spc2 := hspc.NewEnabledValidTenantSPC("member2")
```
we also need to verify that the space count thresholds are applied, not only the readiness
I don't think that user approval tests should be testing the capacity manager. That's why I wanted to simplify these tests as much as possible. It is of no concern to the approval process why a cluster is not capable of accepting a space.
I didn't want to be radical and wanted some kind of "middle ground" solution where I'd just simplify the capacity manager logic to at least not rely on the ToolchainStatus. But in hindsight I think I should have just modified the usersignup to be able to completely mock out the capacity manager, so that these tests actually test just what they're supposed to and don't need a ton of setup. A rough sketch of what such a seam could look like follows below.
I do think though that there shouldn't be a single threshold check in the approval tests. The approval process is only concerned with clusters that can be deployed to. It should not have to know the reasons why a cluster is not eligible, just whether it is or not.
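A rough sketch of that seam — the narrow interface, its signature, and the names here are hypothetical, not the actual ClusterManager API:

```go
// clusterSelector is a hypothetical seam: the approval logic would depend on
// this narrow interface instead of the concrete capacity.ClusterManager
type clusterSelector interface {
	// returns the name of the cluster to provision to, or "" when no cluster fits
	GetOptimalTargetCluster(ctx context.Context) (string, error)
}

// fixedCluster is a trivial test double that always picks the same cluster,
// so approval tests would need no capacity or counter setup at all
type fixedCluster string

func (c fixedCluster) GetOptimalTargetCluster(_ context.Context) (string, error) {
	return string(c), nil
}
```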
Good point 👍
Yeah, you are right: testing the thresholds in the capacity manager tests should be enough, and here we should only test whether the SPC is (not) ready.
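A sketch of what such an approval test could then look like, reusing the SPC test helpers quoted in this thread (newSignup is a hypothetical fixture standing in for the setup these tests already have, and the imports/aliases are assumed from the surrounding test file):

```go
func TestApprovalOnlyChecksSPCReadiness(t *testing.T) {
	// given: one SPC that is enabled but not ready and one that is ready;
	// no counter cache initialization and no thresholds are needed here
	// (assuming automatic approval is enabled in the surrounding test config)
	spc1 := hspc.NewEnabledTenantSPC("member1")
	spc2 := hspc.NewEnabledValidTenantSPC("member2")
	fakeClient := commontest.NewFakeClient(t, spc1, spc2)
	signup := newSignup(t) // hypothetical fixture

	// when
	approved, clusterName, err := getClusterIfApproved(context.TODO(), fakeClient, signup,
		capacity.NewClusterManager(commontest.HostOperatorNs, fakeClient))

	// then: the not-ready member1 is skipped
	require.NoError(t, err)
	assert.True(t, approved)
	assert.Equal(t, "member2", clusterName)
}
```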
```go
InitializeCounters(t, NewToolchainStatus(
	WithMetric(toolchainv1alpha1.UserSignupsPerActivationAndDomainMetricKey, toolchainv1alpha1.Metric{
		"1,external": 1,
	}),
	WithMetric(toolchainv1alpha1.MasterUserRecordsPerDomainMetricKey, toolchainv1alpha1.Metric{
		string(metrics.External): 1,
	}),
))
```
let's keep using the counters (and their initialization) here as well. Or is there any reason for dropping them?
pkg/capacity/manager.go (outdated)
```go
// SpaceCountGetter is a function useful for mocking the counts cache and can be passed
// to the NewClusterManager function. The returned tuple represents the actual number
// of the spaces found for the given cluster and whether the value was actually found
// (similar to the return value of a map lookup).
SpaceCountGetter func(ctx context.Context, clusterName string) (int, bool)
```
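For what it's worth, a map-backed getter for tests would then be a one-liner — illustrative only, since this type is dropped again later in this PR (fixedSpaceCounts is a hypothetical name):

```go
// fixedSpaceCounts returns a SpaceCountGetter backed by a plain map,
// matching the documented map-lookup semantics
func fixedSpaceCounts(counts map[string]int) capacity.SpaceCountGetter {
	return func(_ context.Context, clusterName string) (int, bool) {
		count, ok := counts[clusterName]
		return count, ok
	}
}
```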
is this really needed? why not keep the original format that was there before?
pkg/capacity/manager.go (outdated)
```go
if spc.Status.ConsumedCapacity != nil {
	spaceCount = spc.Status.ConsumedCapacity.SpaceCount
```
correct me if I'm wrong, but we shouldn't use the SPC for getting the number of provisioned Spaces
pkg/capacity/manager.go (outdated)
```go
if spaceCount, ok := getSpaceCount(ctx, spc.Spec.ToolchainCluster); ok {
	// the cached spaceCount is always going to be fresher than (or as fresh as) what's in the SPC
	if spc.Status.ConsumedCapacity == nil {
		spc.Status.ConsumedCapacity = &toolchainv1alpha1.ConsumedCapacity{}
	}
	spc.Status.ConsumedCapacity.SpaceCount = spaceCount
}
```
as discussed in the previous PR, let's not update the status of the SPC if we are not going to update the actual resource - there is no benefit in doing that
We discussed that it is weird for a predicate to have side effects, which I completely agree with.
Now, the "side effect" happens while getting the candidate optimal clusters SO THAT the value we update can later be used again in the rest of the computation. Otherwise we'd be forced to repeatedly read the value from the count cache while sorting the array of candidate clusters. I stand by the fact that it is better to do it once and re-use those values than to do it repeatedly. It is not only more performant but also more stable (think of concurrent updates of the count cache). This behavior is documented in the docs of getOptimalTargetClusters() as well as in a comment in the body of GetOptimalTargetCluster().
I don't agree with the reasoning that we should only update the CRDs if we intend to persist them. Those are data structures like any other - they're used to hold data. And if we need some data to be updated because we are going to need the updated values down the road, it is IMHO completely warranted to use the CRD for that. Note that these CRDs are local to the computation in the GetOptimalTargetCluster() function. They don't escape that function because they are stored neither in the cluster nor in any "global" structure like the count cache. So there is no non-local "surprise" in having the updates in the CRDs.
* Get rid of SpaceCountGetter and just always use the counts cache.
* Fix counter.GetCounts() to not be susceptible to concurrent modification.
* Have a dedicated data structure for the optimal cluster computation.
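A minimal sketch of what that dedicated structure might look like (candidate and snapshotCandidates are hypothetical names): the cached count is read once per candidate, so the SPC status is never mutated and the cache is not re-read while sorting.

```go
// candidate pairs an SPC with a snapshot of its cached space count, taken
// once up front, so the later sorting works on stable, local values
type candidate struct {
	spc        *toolchainv1alpha1.SpaceProvisionerConfig
	spaceCount int // snapshot from the counter cache
}

func snapshotCandidates(spcs []toolchainv1alpha1.SpaceProvisionerConfig, cachedCounts map[string]int) []candidate {
	candidates := make([]candidate, 0, len(spcs))
	for i := range spcs {
		candidates = append(candidates, candidate{
			spc:        &spcs[i],
			spaceCount: cachedCounts[spcs[i].Spec.ToolchainCluster],
		})
	}
	return candidates
}
```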
…ty-manager-part' into member-info-in-spc-status_capacity-manager-part
Quality Gate passed
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #1119      +/-   ##
==========================================
- Coverage   79.15%   79.02%   -0.13%
==========================================
  Files          78       78
  Lines        7809     7814       +5
==========================================
- Hits         6181     6175       -6
- Misses       1449     1457       +8
- Partials      179      182       +3
```
This slightly simplifies the capacity manager by relying more on the ready status of the SPC (which is now based on more complex logic).
A whole bunch of tests were also updated so that they don't have to initialize the count cache just for the capacity manager; instead, the capacity manager is set up to only use SPCs for determining the space counts.
This is a follow-up PR of #1109.
Note that there are no new or updated e2e tests for this PR because it should not change any behavior whatsoever.