Add metrics to acs for eni provisioning workflow monitoring #4443

Open
wants to merge 5 commits into
base: master

chacheng21

Summary

Adds new metrics and supporting variables to the ACS session to monitor the ENI provisioning workflow.

Implementation details

Added new metrics and supporting variables to the ACS session's startSessionOnce function.

Testing

Tested by running all ACS unit tests.

New tests cover the changes:
No new tests; existing tests were modified to account for the new metrics (i.e., the calls that create metrics are mocked).
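
The mocking approach described above could look roughly like the following. This is a minimal sketch, not the agent's actual test code: the Entry and EntryFactory interfaces are simplified assumptions modeled on the New(...) and Done(err) calls visible in this PR's diff, and fakeFactory and startSession are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Entry and EntryFactory are simplified, hypothetical interfaces modeled on
// the New(name) and Done(err) calls that appear in the diff.
type Entry interface {
	Done(err error)
}

type EntryFactory interface {
	New(name string) Entry
}

// fakeFactory is a hand-written test double that records which metrics
// were created and which were completed.
type fakeFactory struct {
	created []string
	done    []string
}

type fakeEntry struct {
	factory *fakeFactory
	name    string
}

func (f *fakeFactory) New(name string) Entry {
	f.created = append(f.created, name)
	return &fakeEntry{factory: f, name: name}
}

func (e *fakeEntry) Done(err error) {
	e.factory.done = append(e.factory.done, e.name)
}

// startSession is a hypothetical function under test. It creates one
// metric and completes it, as the session code in the diff does.
func startSession(f EntryFactory) error {
	m := f.New("ACSStartSession.DiscoverPollEndpointDuration")
	m.Done(nil)
	return nil
}

func main() {
	f := &fakeFactory{}
	if err := startSession(f); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("created:", f.created)
	fmt.Println("done:", f.done)
}
```

With a double like this, an existing test only needs to pass the fake factory into the session and assert on the recorded names; a generated mock would serve the same purpose.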

Description for the changelog

Enhancement - Add metrics to monitor latency in the ACS session

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?
No

Does this PR include the addition of new environment variables in the README?
No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@chacheng21 chacheng21 self-assigned this Nov 27, 2024
@chacheng21 chacheng21 requested a review from a team as a code owner November 27, 2024 21:57
@@ -46,6 +46,11 @@ const (
ACSDisconnectTimeoutMetricName = agentAvailabilityNamespace + ".ACSDisconnectTimeout"
TCSDisconnectTimeoutMetricName = agentAvailabilityNamespace + ".TCSDisconnectTimeout"

// ACS Session Metrics
acsStartSessionNamespace = "ACSStartSession"
ACSDiscoverPollEndpointDurationName = acsStartSessionNamespace + ".DiscoverPollEndpointDuration"
Contributor

Why are we calling these two duration metrics? What duration is being measured?

Contributor

Clarified offline: these metrics include a duration by default, which will tell us the connection duration.

@tshan2001
Contributor

I don't think we're supposed to merge into the master branch, since master is used for version releases. We should merge into dev instead.

@@ -262,8 +274,13 @@ func (s *session) startSessionOnce(ctx context.Context) error {
})
return err
}
acsConnectionMetric.Done(err)

This ACS metric won't fire if client.Connect returns a non-nil error. Is that intentional? If so, this behavior is inconsistent with the DiscoverPollEndpoint metric.

Contributor

Because ACS infinitely retries its connection, is this metric going to be too noisy if we fire a failure metric on every failed connection?

Author

I don't think we would want to fire metrics on failure. To keep the two metrics consistent, I think I will fire the DiscoverPollEndpoint metric only for successful calls as well.

s.firstDiscoverPollEndpointTime = time.Now()
}

discoverPollEndpointMetric := s.metricsFactory.New(metrics.ACSDiscoverPollEndpointDurationName)
Contributor

DiscoverPollEndpoint can be called by the TACS session as well. Can we emit the metric in a single place, https://github.com/aws/amazon-ecs-agent/blob/master/ecs-agent/api/ecs/client/ecs_client.go#L679, which is used by both ACS and TACS?

Author

Good point, I will move the metric over there.
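
The suggestion to emit the metric in one shared place can be sketched like this. The types below (counter, client, acsSession, tacsSession) and the placeholder endpoint are hypothetical and greatly simplified; the sketch only shows the design idea that instrumenting the shared client method covers every caller.

```go
package main

import "fmt"

// counter is a hypothetical metric sink that counts emissions.
type counter struct {
	calls int
}

// client is a hypothetical stand-in for a shared ECS client used by
// more than one session type.
type client struct {
	metrics *counter
}

// discoverPollEndpoint emits the metric itself, so every caller is
// instrumented without duplicating metric code at each call site.
func (c *client) discoverPollEndpoint() (string, error) {
	endpoint := "https://example.invalid/poll" // placeholder value
	c.metrics.calls++
	return endpoint, nil
}

// acsSession and tacsSession are hypothetical callers sharing one client.
type acsSession struct{ c *client }
type tacsSession struct{ c *client }

func (s acsSession) start() error {
	_, err := s.c.discoverPollEndpoint()
	return err
}

func (s tacsSession) start() error {
	_, err := s.c.discoverPollEndpoint()
	return err
}

func main() {
	shared := &client{metrics: &counter{}}
	_ = acsSession{c: shared}.start()
	_ = tacsSession{c: shared}.start()
	fmt.Println("metric emissions:", shared.metrics.calls)
}
```

Neither session contains any metric code here, which is the point of the suggestion: a new caller of the shared method would be measured automatically.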

6 participants