
[FEATURE] Add metrics in spark #117

Closed
penghuo opened this issue Oct 31, 2023 · 4 comments
Labels
0.2 enhancement New feature or request

penghuo commented Oct 31, 2023

Requirements

  • Implement a FlintMetrics library that depends on MetricRegistry but does not depend on Spark.
  • FlintMetrics provides AOP support for metrics publishing.
  • FlintCore depends on FlintMetrics.
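A minimal sketch of what a Spark-independent metrics facade could look like. The class and method names here are illustrative, not the actual Flint API; a real implementation would delegate to Dropwizard's `com.codahale.metrics.MetricRegistry` rather than the plain map used below:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative stand-in for a FlintMetrics facade: a registry of named
// counters with no Spark dependency. The real library would delegate to
// com.codahale.metrics.MetricRegistry instead of this map.
public class FlintMetricsSketch {
    private static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public static void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public static long count(String name) {
        LongAdder c = counters.get(name);
        return c == null ? 0 : c.sum();
    }

    public static void main(String[] args) {
        increment("opensearch.read.2xx.count");
        increment("opensearch.read.2xx.count");
        System.out.println(count("opensearch.read.2xx.count")); // prints 2
    }
}
```

Keeping the facade free of Spark imports is what lets FlintCore depend on it without pulling Spark onto the classpath.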

Tasks

Metrics

Dependency Services

  • OpenSearch, dimensions: [clientId, domainName]
    • opensearch.read.403.count
    • opensearch.read.5xx.count
    • opensearch.read.2xx.count
    • opensearch.write.403.count
    • opensearch.write.5xx.count
    • opensearch.write.2xx.count
  • S3
  • Glue

Interactive Job, dimensions: [clientId, domainName, instance]

repl status

  • repl.running.count
  • repl.failed.count
  • repl.success.count
  • repl.processingTime (Timer)

statement status

  • repl.statement.running.count
  • repl.statement.failed.count
  • repl.statement.success.count
  • repl.statement.processingTime (Timer)
  • repl.statement.resultSize

requestIndex status

  • repl.requestIndex.waitingStatementSize
  • repl.requestIndex.read.4xx.count
  • repl.requestIndex.read.5xx.count
  • repl.requestIndex.read.2xx.count
  • repl.requestIndex.write.4xx.count
  • repl.requestIndex.write.5xx.count
  • repl.requestIndex.write.2xx.count
  • repl.requestIndex.heartbeatFailed.count

resultIndex status

  • repl.resultIndex.write.4xx.count
  • repl.resultIndex.write.5xx.count
  • repl.resultIndex.write.2xx.count

Streaming Job, dimensions: [clientId, domainName, instance, type]

  • streaming.running.count
  • streaming.failed.count
  • streaming.success.count
  • streaming.heartbeatFailed.count
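The metric names above are flat strings, while dimensions (clientId, domainName, instance, type) travel separately. One common way to carry dimensions to a StatsD-compatible backend is the Datadog-style tag extension of the StatsD line format; the helper below is purely illustrative and is not Flint's actual encoding:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: encodes a metric name plus dimensions in the
// Datadog-style StatsD tag format ("name:value|type|#k:v,k:v").
// Flint's actual dimension handling may differ.
public class MetricLine {
    public static String counter(String name, long value, Map<String, String> dims) {
        String tags = dims.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(","));
        return name + ":" + value + "|c" + (tags.isEmpty() ? "" : "|#" + tags);
    }

    public static void main(String[] args) {
        Map<String, String> dims = new LinkedHashMap<>();
        dims.put("clientId", "c1");
        dims.put("domainName", "d1");
        System.out.println(counter("streaming.failed.count", 1, dims));
        // prints streaming.failed.count:1|c|#clientId:c1,domainName:d1
    }
}
```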

Query Optimizer

@penghuo penghuo added enhancement New feature or request untriaged 0.1.1 labels Oct 31, 2023
@penghuo penghuo self-assigned this Oct 31, 2023
@dai-chen dai-chen removed the untriaged label Nov 1, 2023
@penghuo penghuo changed the title [FEATURE] Add metrics for all dependency services [FEATURE] Add metrics in spark Nov 16, 2023
@penghuo penghuo added 0.2 and removed 0.1.1 labels Dec 5, 2023
noCharger commented

  1. Developed a high-level sequence diagram following offline discussions with @penghuo.

[SeqDiagram: high-level sequence diagram image]

  2. To progress, we need to clarify which components are currently in place and identify those requiring development. Key tasks include:
  • Understanding the existing solution for CloudWatchSink integration. This involves reviewing two PRs, #173 and #176. @vamsi-amazon, feel free to share your insights here.
  • Investigating engineering best practices for integrating Spark metrics with CloudWatch. This encompasses examining existing solutions, such as the one involving the AmazonCloudWatchAgent. Based on these insights, the goal is to develop a Proof of Concept (POC).
  • Conducting a thorough evaluation of the various options to understand their trade-offs, including cost implications. This step is crucial to ensure that the chosen solution aligns with project objectives.


penghuo commented Dec 12, 2023

Limitation: the per-account limit is 300 TPS.


noCharger commented Dec 21, 2023

[Summary] Flint Metrics Framework

Approach one: Codahale/Dropwizard aggregated metrics

Pros:

  • Offers a complete solution using the stable version of Dropwizard Metrics, which is natively supported by Spark's metrics system.

Cons:

  • Lacks support for StatsD, which is well suited to real-time data processing. This also limits extensibility between Flint and various backend monitoring tools (like Graphite, Datadog, Telegraf, etc.).

Approach two: CloudWatch Agent Integration

Pros:

  • A complete polling solution for distributed systems, leveraging the CloudWatch Agent. It can retrieve custom metrics from applications or services using the StatsD and collectd protocols.
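The StatsD side of this approach is just plain-text datagrams over UDP, which the CloudWatch Agent can listen for (by default on port 8125). A minimal sketch of an emitter, with a loopback receiver standing in for the agent (host, port, and metric names here are placeholders):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Sketch of the StatsD wire protocol the CloudWatch Agent can ingest:
// plain-text datagrams like "name:value|c" sent over UDP.
public class StatsDEmit {
    public static void send(String host, int port, String metric) throws Exception {
        byte[] payload = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Loopback receiver stands in for the CloudWatch Agent.
        try (DatagramSocket receiver = new DatagramSocket(0)) {
            send("127.0.0.1", receiver.getLocalPort(), "repl.failed.count:1|c");
            byte[] buf = new byte[512];
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            receiver.receive(pkt);
            System.out.println(new String(pkt.getData(), 0, pkt.getLength(),
                    StandardCharsets.UTF_8)); // prints repl.failed.count:1|c
        }
    }
}
```

Because the transport is fire-and-forget UDP, emitting a metric never blocks or fails the application, which is part of why this approach suits distributed jobs.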

Cons:

  • Requires additional setup and configuration effort, especially on EMR serverless environments, which may increase complexity.

Approach three: Leverage Spark event logs

Spark's event logs, a comprehensive record of events during a Spark application's execution, can also be parsed for metrics, although this differs from direct use of the Dropwizard Metrics library. These logs, typically in JSON format, record detailed information about Spark activity and can be analyzed post-execution for insight into job performance and system behavior.

While the Dropwizard Metrics library offers real-time metrics for ongoing monitoring, event logs are better suited to retrospective analysis, debugging, and performance audits. Extracting metrics from them means parsing the JSON to find the relevant entries, which can be resource-intensive and is generally more complex than real-time monitoring. Event logs are therefore a valuable resource for after-the-fact analysis, but they serve a different purpose than the immediate insight offered by Dropwizard's real-time metrics.

Cost Analysis:

  • Both approaches one and two involve translating Dropwizard metrics like Meter, Counter, Histogram, and Timer to CloudWatch MetricDatum. This could lead to higher CloudWatch costs due to the potential increase in the number of MetricDatums sent.
  • StatsD aggregates and samples incoming metrics over a period before sending them off to the backend monitoring system. This reduces network overhead and the load on the monitoring database, as fewer data points are transmitted.
  • The CloudWatch Agent approach may be more cost-effective for large-scale distributed systems. However, its use in EMR serverless environments could consume more vCPU hours, especially if deployed as a sidecar application.
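The aggregation point above can be illustrated with a simplified model (not any particular client's implementation): a StatsD-style client sums counter increments locally and emits one datapoint per flush interval, so N increments cost one network write instead of N.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of StatsD-style aggregation: increments within a flush
// interval collapse into a single emitted datapoint.
public class FlushAggregator {
    private long pending = 0;
    private final List<Long> emitted = new ArrayList<>();

    public void increment() { pending++; }

    public void flush() {            // called once per interval
        if (pending > 0) {
            emitted.add(pending);    // one datapoint regardless of increment count
            pending = 0;
        }
    }

    public List<Long> emitted() { return emitted; }

    public static void main(String[] args) {
        FlushAggregator agg = new FlushAggregator();
        for (int i = 0; i < 1000; i++) agg.increment();
        agg.flush();
        System.out.println(agg.emitted().size());  // prints 1
        System.out.println(agg.emitted().get(0));  // prints 1000
    }
}
```

Translated to CloudWatch, this means one MetricDatum per counter per interval rather than one per event, which is the cost lever the comparison above is pointing at.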

cc: @anirudha @penghuo @vamsi-amazon

noCharger commented

Closed as completed.
