
[FEATURE] Add metrics in spark #117

Closed
penghuo opened this issue Oct 31, 2023 · 4 comments
Labels
0.2 enhancement New feature or request

penghuo commented Oct 31, 2023

Requirements

  • Implement a FlintMetrics library that depends on MetricRegistry but does not depend on Spark.
  • FlintMetrics provides AOP support for metrics publishing.
  • FlintCore depends on FlintMetrics.
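A minimal sketch of what a Spark-independent metrics facade could look like. The class and method names here are illustrative, not the actual Flint API; a real implementation would delegate to Dropwizard's `com.codahale.metrics.MetricRegistry` rather than the plain map used below:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative stand-in for a FlintMetrics facade: a registry of named
// counters with no Spark dependency. The real library would delegate to
// com.codahale.metrics.MetricRegistry instead of this map.
public class FlintMetricsSketch {
    private static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public static void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public static long count(String name) {
        LongAdder c = counters.get(name);
        return c == null ? 0 : c.sum();
    }

    public static void main(String[] args) {
        increment("opensearch.read.2xx.count");
        increment("opensearch.read.2xx.count");
        System.out.println(count("opensearch.read.2xx.count")); // prints 2
    }
}
```

Keeping the facade free of Spark imports is what lets FlintCore depend on it without pulling Spark onto the classpath.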

Tasks

Metrics

Dependency Services

  • OpenSearch, dimensions: [clientId, domainName]
    • opensearch.read.403.count
    • opensearch.read.5xx.count
    • opensearch.read.2xx.count
    • opensearch.write.403.count
    • opensearch.write.5xx.count
    • opensearch.write.2xx.count
  • S3
  • Glue

Interactive Job, dimensions: [clientId, domainName, instance]

repl status

  • repl.running.count
  • repl.failed.count
  • repl.success.count
  • repl.processingTime (Timer)

statement status

  • repl.statement.running.count
  • repl.statement.failed.count
  • repl.statement.success.count
  • repl.statement.processingTime (Timer)
  • repl.statement.resultSize

requestIndex status

  • repl.requestIndex.waitingStatementSize
  • repl.requestIndex.read.4xx.count
  • repl.requestIndex.read.5xx.count
  • repl.requestIndex.read.2xx.count
  • repl.requestIndex.write.4xx.count
  • repl.requestIndex.write.5xx.count
  • repl.requestIndex.write.2xx.count
  • repl.requestIndex.heartbeatFailed.count

resultIndex status

  • repl.resultIndex.write.4xx.count
  • repl.resultIndex.write.5xx.count
  • repl.resultIndex.write.2xx.count

Streaming Job, dimensions: [clientId, domainName, instance, type]

  • streaming.running.count
  • streaming.failed.count
  • streaming.success.count
  • streaming.heartbeatFailed.count
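The metric names above are flat strings, while dimensions (clientId, domainName, instance, type) travel separately. One common way to carry dimensions to a StatsD-compatible backend is the Datadog-style tag extension of the StatsD line format; the helper below is purely illustrative and is not Flint's actual encoding:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: encodes a metric name plus dimensions in the
// Datadog-style StatsD tag format ("name:value|type|#k:v,k:v").
// Flint's actual dimension handling may differ.
public class MetricLine {
    public static String counter(String name, long value, Map<String, String> dims) {
        String tags = dims.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(","));
        return name + ":" + value + "|c" + (tags.isEmpty() ? "" : "|#" + tags);
    }

    public static void main(String[] args) {
        Map<String, String> dims = new LinkedHashMap<>();
        dims.put("clientId", "c1");
        dims.put("domainName", "d1");
        System.out.println(counter("streaming.failed.count", 1, dims));
        // prints streaming.failed.count:1|c|#clientId:c1,domainName:d1
    }
}
```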

Query Optimizer

@penghuo penghuo added enhancement New feature or request untriaged 0.1.1 labels Oct 31, 2023
@penghuo penghuo self-assigned this Oct 31, 2023
@dai-chen dai-chen removed the untriaged label Nov 1, 2023
@penghuo penghuo changed the title [FEATURE] Add metrics for all dependency services [FEATURE] Add metrics in spark Nov 16, 2023
@penghuo penghuo added 0.2 and removed 0.1.1 labels Dec 5, 2023
noCharger commented

  1. Developed a high-level sequence diagram following offline discussions with @penghuo.

[SeqDiagram: high-level sequence diagram image]

  2. To progress, we need to clarify which components are currently in place and identify those requiring development. Key tasks include:
  • Understanding the existing solution for CloudWatchSink integration. This involves reviewing two PRs, #173 and #176. @vamsi-amazon, feel free to share your insights here.
  • Investigating engineering best practices for integrating Spark metrics with CloudWatch. This encompasses examining existing solutions, such as the one involving the AmazonCloudWatchAgent. Based on these insights, the goal is to develop a Proof of Concept (POC).
  • Conducting a thorough evaluation of the various options to understand their trade-offs, including cost implications. This step is crucial to ensure that the chosen solution aligns with project objectives.


penghuo commented Dec 12, 2023

Limitation: the per-account limit is 300 TPS.


noCharger commented Dec 21, 2023

[Summary] Flint Metrics Framework

Approach one: Codahale/Dropwizard aggregated metrics

Pros:

  • Offers a complete solution using the stable version of Dropwizard Metrics, which is natively supported by Spark's metrics system.

Cons:

  • Lacks support for StatsD, which is well suited to real-time data processing. This also limits extensibility between Flint and various backend monitoring tools (like Graphite, Datadog, Telegraf, etc.).

Approach two: CloudWatch Agent Integration

Pros:

  • A complete polling solution for distributed systems, leveraging the CloudWatch Agent. It can retrieve custom metrics from applications or services using the StatsD and collectd protocols.
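The StatsD side of this approach is just plain-text datagrams over UDP, which the CloudWatch Agent can listen for (by default on port 8125). A minimal sketch of an emitter, with a loopback receiver standing in for the agent (host, port, and metric names here are placeholders):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Sketch of the StatsD wire protocol the CloudWatch Agent can ingest:
// plain-text datagrams like "name:value|c" sent over UDP.
public class StatsDEmit {
    public static void send(String host, int port, String metric) throws Exception {
        byte[] payload = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Loopback receiver stands in for the CloudWatch Agent.
        try (DatagramSocket receiver = new DatagramSocket(0)) {
            send("127.0.0.1", receiver.getLocalPort(), "repl.failed.count:1|c");
            byte[] buf = new byte[512];
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            receiver.receive(pkt);
            System.out.println(new String(pkt.getData(), 0, pkt.getLength(),
                    StandardCharsets.UTF_8)); // prints repl.failed.count:1|c
        }
    }
}
```

Because the transport is fire-and-forget UDP, emitting a metric never blocks or fails the application, which is part of why this approach suits distributed jobs.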

Cons:

  • Requires additional setup and configuration effort, especially on EMR serverless environments, which may increase complexity.

Approach three: Leverage Spark event logs

Spark's event logs, a comprehensive record of events during a Spark application's execution, can also be parsed for metrics, although this differs from direct use of the Dropwizard Metrics library. These logs, typically in JSON format, record detailed information about Spark activity and can be analyzed post-execution for insight into job performance and system behavior.

While the Dropwizard Metrics library offers real-time metrics for ongoing monitoring, event logs are better suited to retrospective analysis, debugging, and performance audits. Extracting metrics from them means parsing the JSON to find the relevant entries, which can be resource-intensive and is generally more complex than real-time monitoring. Event logs are therefore a valuable resource for after-the-fact analysis, but they serve a different purpose than the immediate insight offered by Dropwizard's real-time metrics.

Cost Analysis:

  • Both approaches one and two involve translating Dropwizard metrics like Meter, Counter, Histogram, and Timer to CloudWatch MetricDatum. This could lead to higher CloudWatch costs due to the potential increase in the number of MetricDatums sent.
  • StatsD aggregates and samples incoming metrics over a period before sending them off to the backend monitoring system. This reduces network overhead and the load on the monitoring database, as fewer data points are transmitted.
  • The CloudWatch Agent approach may be more cost-effective for large-scale distributed systems. However, its use in EMR serverless environments could consume more vCPU hours, especially if deployed as a sidecar application.
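The aggregation point above can be illustrated with a simplified model (not any particular client's implementation): a StatsD-style client sums counter increments locally and emits one datapoint per flush interval, so N increments cost one network write instead of N.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of StatsD-style aggregation: increments within a flush
// interval collapse into a single emitted datapoint.
public class FlushAggregator {
    private long pending = 0;
    private final List<Long> emitted = new ArrayList<>();

    public void increment() { pending++; }

    public void flush() {            // called once per interval
        if (pending > 0) {
            emitted.add(pending);    // one datapoint regardless of increment count
            pending = 0;
        }
    }

    public List<Long> emitted() { return emitted; }

    public static void main(String[] args) {
        FlushAggregator agg = new FlushAggregator();
        for (int i = 0; i < 1000; i++) agg.increment();
        agg.flush();
        System.out.println(agg.emitted().size());  // prints 1
        System.out.println(agg.emitted().get(0));  // prints 1000
    }
}
```

Translated to CloudWatch, this means one MetricDatum per counter per interval rather than one per event, which is the cost lever the comparison above is pointing at.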

cc: @anirudha @penghuo @vamsi-amazon

noCharger commented

Closed as completed.
