Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Agent Health Extension #1423

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Adding Agent Health Extension #1423

wants to merge 24 commits into from

Conversation

Paramadon
Copy link
Contributor

@Paramadon Paramadon commented Nov 12, 2024


Description of the Issue

The CloudWatch Agent lacked an agenthealth configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200, 400, 408, 419, and 429) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.


Changes Made

  • Added agenthealth Configuration:

    • Configured the CloudWatch Agent to track the following status codes across all APIs:
      • 200: Success
      • 400: Bad Request
      • 408: Request Timeout
      • 419: Authentication Timeout
      • 429: Too Many Requests
    • This ensures that health metrics for these common response codes are captured and reported.
  • Updated CloudWatch Agent Configuration JSON:

    • Introduced a new section under the metrics collection configuration to specify agenthealth metrics.
    • Ensured compatibility with existing metrics and logs collection by integrating the new configuration seamlessly.
  • Validated the Configuration:

    • Tested the updated configuration with the CloudWatch Agent to ensure the new metrics are collected and reported correctly.
    • Confirmed that the metrics appear as expected in CloudWatch for all specified APIs.

Impact

  • Provides real-time visibility into API status codes, helping identify anomalies or patterns that indicate issues.
  • Enhances observability for critical services by adding detailed health metrics.

Testing

  • Deployed the updated CloudWatch Agent health configuraion
  • Verified that metrics for status codes 200, 400, 408, 419, and 429 are captured in the header as you can see from the image below:
Screenshot 2024-11-15 at 4 36 21 PM

@Paramadon Paramadon requested a review from a team as a code owner November 12, 2024 20:47
@Paramadon Paramadon changed the title fixing issue Adding Agent Health Extension Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant