MON-3513: Add availability test for Metrics API #28737
Conversation
@machine424: This pull request references MON-3513 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest
2 similar comments
/retest
/retest
I'll need to check back once you've got job results to make sure the data in artifacts looks good, but this is really cool, and one of very few instances where someone has added additional disruption monitoring for their component.
disruptionBackedName := "metrics-api"

newConnectionTestName := "[sig-instrumentation] disruption/metrics-api connection/new should be available throughout the test"
reusedConnectionTestName := "[sig-instrumentation] disruption/metrics-api connection/reused should be available throughout the test"
Just want to share some info on these tests: they're extremely forgiving, allowing up to the P99 of disruption over the past three weeks plus a healthy margin before they fail. Our primary system for detecting real regressions uses large sets of data and alerting, but this dashboard provides a strong visualization of that same data: https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1 Very handy if you want to monitor how your component is doing. Once this lands, it'll eventually start appearing in TRT's alerting if something is wrong, and we'll get in touch if we detect a sustained change.
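(To make the "P99 plus margin" idea concrete, here is a purely illustrative sketch; the actual TRT queries, data sources, and margin values are not shown in this thread, and the sample numbers below are made up:)

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// allowedDisruptionSeconds illustrates the allowance described above: the P99
// of historical disruption samples plus a margin. Purely conceptual; the real
// TRT computation is not reproduced here.
func allowedDisruptionSeconds(samples []float64, marginSeconds float64) float64 {
	if len(samples) == 0 {
		return marginSeconds
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(0.99*float64(len(sorted)))) - 1
	return sorted[idx] + marginSeconds
}

func main() {
	// Hypothetical three weeks of per-job disruption measurements, in seconds.
	samples := []float64{0, 0, 1, 2, 2, 3, 5, 8, 13, 45}
	fmt.Printf("allowed: %.0fs\n", allowedDisruptionSeconds(samples, 10))
}
```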
Really interesting, thanks for the details.
We'll try to have a look at those dashboards from time to time.
// TODO: clean up/refactor following.

// For nodes metrics
newConnections, err := createAPIServerBackendSampler(adminRESTConfig, disruptionBackedName, "/apis/metrics.k8s.io/v1beta1/nodes", monitorapi.NewConnectionType)
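(Aside for readers: such a backend sampler boils down to a once-per-second availability poll against the given path. A minimal, hypothetical standalone sketch of that behavior follows; the names are invented, and origin's real helper records monitorapi intervals rather than printing:)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollAvailability hits probe once per second and prints transitions between
// available and unavailable. Hypothetical stand-in for the real sampler.
func pollAvailability(ctx context.Context, probe func(context.Context) error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	available := true
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := probe(ctx); err != nil && available {
				available = false
				fmt.Println("disruption began:", err)
			} else if err == nil && !available {
				available = true
				fmt.Println("disruption ended")
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// A probe that always succeeds, standing in for the HTTP GET.
	pollAvailability(ctx, func(context.Context) error { return nil })
}
```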
How big is the response from this request? Looks like it could be pretty big, and we'll be polling it once a second. Could we identify a smaller one if it is quite large?
It'll depend on the node count.
I pushed a new version where we only ask for the metrics of the Metrics API backend Pods themselves.
Actually, I think that checking "/apis/metrics.k8s.io/v1beta1" is sufficient.
The response looks like
oc get --raw "/apis/metrics.k8s.io/v1beta1" | jq
{
"kind": "APIResourceList",
"apiVersion": "v1",
"groupVersion": "metrics.k8s.io/v1beta1",
"resources": [
{
"name": "nodes",
"singularName": "node",
"namespaced": false,
"kind": "NodeMetrics",
"verbs": [
"get",
"list"
]
},
{
"name": "pods",
"singularName": "pod",
"namespaced": true,
"kind": "PodMetrics",
"verbs": [
"get",
"list"
]
}
]
}
And when the backends are down:
Error from server (ServiceUnavailable): the server is currently unable to handle the request
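(For illustration, here is roughly how a one-shot availability probe of that discovery endpoint could look with client-go. This is a hedged sketch, not the origin sampler code; the function name and the kubeconfig wiring are made up:)

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// metricsAPIAvailable performs a single GET against the Metrics API discovery
// endpoint. A 2xx response (the APIResourceList shown above) means the
// aggregated API server is reachable; when its backends are down, the request
// fails with ServiceUnavailable, which surfaces here as a non-nil error.
func metricsAPIAvailable(ctx context.Context, client kubernetes.Interface) error {
	_, err := client.Discovery().RESTClient().
		Get().
		AbsPath("/apis/metrics.k8s.io/v1beta1").
		DoRaw(ctx)
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := metricsAPIAvailable(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		fmt.Println("metrics API unavailable:", err)
		return
	}
	fmt.Println("metrics API available")
}
```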
reusedConnections, err = createAPIServerBackendSampler(adminRESTConfig, disruptionBackedName, fmt.Sprintf("/apis/metrics.k8s.io/v1beta1/namespaces/%s/pods", monitoringNamespace), monitorapi.ReusedConnectionType)
if err != nil {
	return err
}
It looks like you're re-using the disruptionBackedName here, which is probably a bug, as both results are probably getting muddled or ignored. But do you actually need a second set of endpoints being monitored here? Typically we'd only have one pair, making sure the API server handling the request is up, not that the various APIs within that server are working. Assuming these are handled by the same API server, I'd pitch dropping one of these sets and just hitting one new+reused pair. The CPU and network load of these requests every second is not insignificant.
I agree; as mentioned in the description #28737 (comment), we're only interested in availability (status codes 200-399). I pushed a new version, I hope it's ok now.
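(For readers following along, the end state is a single new+reused pair against one path. A sketch in the style of the diff excerpts above, reusing the createAPIServerBackendSampler helper they show; not self-contained, and the merged code may differ in details:)

```go
// One endpoint, probed over both a fresh and a reused connection; per the
// comment above, any 200-399 response counts as available.
const metricsAPIPath = "/apis/metrics.k8s.io/v1beta1"

newConnections, err := createAPIServerBackendSampler(adminRESTConfig, disruptionBackedName, metricsAPIPath, monitorapi.NewConnectionType)
if err != nil {
	return err
}
reusedConnections, err := createAPIServerBackendSampler(adminRESTConfig, disruptionBackedName, metricsAPIPath, monitorapi.ReusedConnectionType)
if err != nil {
	return err
}
```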
Job Failure Risk Analysis for sha: 93a0eff
Thanks, I'll check with the team, and maybe we'll add more for the other components (Prometheus, etc.). The monitortest framework really makes it easy to write such tests, good job on that :) What do you think about adding (in another PR) a generic disruption check for all the aggregation-layer APIs? I assume it's possible to get the available APIs from Kube, or we can provide a list of URLs. I'll continue testing and make the tests fail to learn more about the behavior, then I'll ask you for another review once the CI is greener :)
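(On the "get the available APIs from Kube" part: aggregated APIs are registered as APIService objects, so a generic check could enumerate them roughly as below. A hypothetical sketch using the kube-aggregator client; the function name is invented, and only APIServices backed by a Service go through the aggregation layer:)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	aggregatorclient "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

// aggregatedAPIPaths lists APIService objects and returns the discovery path
// for each API served through the aggregation layer. APIServices with a nil
// Spec.Service are served locally by kube-apiserver and are skipped.
func aggregatedAPIPaths(ctx context.Context, client aggregatorclient.Interface) ([]string, error) {
	list, err := client.ApiregistrationV1().APIServices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var paths []string
	for _, apiService := range list.Items {
		if apiService.Spec.Service == nil {
			continue // built-in group/version, not aggregated
		}
		paths = append(paths, fmt.Sprintf("/apis/%s/%s", apiService.Spec.Group, apiService.Spec.Version))
	}
	return paths, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	paths, err := aggregatedAPIPaths(context.Background(), aggregatorclient.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```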
var deployment *appsv1.Deployment
deployment, err = kubeClient.AppsV1().Deployments(monitoringNamespace).Get(ctx, metricsServerDeploymentName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	// TODO: remove this in 4.17
TODO: open ticket for this before merge
Force-pushed from 6336ec7 to 4d48684 (compare)
/retest
Job Failure Risk Analysis for sha: 4d48684
/retest
/payload-with-prs pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade openshift/api#1865 openshift/cluster-monitoring-operator#2268
@machine424: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-with-prs periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade openshift/api#1865 openshift/cluster-monitoring-operator#2268
@machine424: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade openshift/api#1865 openshift/cluster-monitoring-operator#2268
@machine424: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ce85df90-02de-11ef-8c62-2e3db784f5f9-0
/payload-job-with-prs pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade openshift/api#1865 openshift/cluster-monitoring-operator#2268
@machine424: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn openshift/api#1865 openshift/cluster-monitoring-operator#2268
@machine424: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/77395d40-02e1-11ef-97dd-2274b69c7a75-0
/retest pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
@machine424: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-gcp-ovn-upgrade e2e-aws-ovn-upgrade e2e-aws-ovn
It's an awesome idea, but we've caught this framework noticeably impacting CPU/memory/network when the polling increased dramatically. If this involved a bunch of new endpoints, it very likely could start to skew CPU at the very least in ways that show up and send people on wild goose chases. If we knew about it and were expecting it, perhaps we could get away with it, provided the resource requirements are not significantly different. How many endpoints would we be talking about?
Job Failure Risk Analysis for sha: 5fed856
I see. I just need to discuss that with the team, but I can think of 3 or 4 APIs.
This should ensure the availability of the Metrics API during e2e tests, including upgrades. Thus it should also help with https://issues.redhat.com/browse/MON-3539. The correctness of the API (whether the right/expected content is returned) should be tested elsewhere; we already have tests for that in CMO, and the HPA tests already make use of the API. These tests only check availability.
/test e2e-gcp-ovn-upgrade e2e-aws-ovn-upgrade e2e-aws-ovn
Job Failure Risk Analysis for sha: 2deef4c
1 similar comment
Job Failure Risk Analysis for sha: 2deef4c
Code looks good, and the data coming out looks good. We could probably get away with adding another 3-4 of these; your generic framework could be very useful. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgoodwin, machine424. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
1 similar comment
/retest
@machine424: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merged commit c38cf13 into openshift:master