Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace events lib changes #8816

Closed

Conversation

rishabhmaurya
Copy link
Contributor

@rishabhmaurya rishabhmaurya commented Jul 21, 2023

Description

As part of tracing framework #7648, we added tracing capabilities in OpenSearch, this proposal is to add diagnosis metrics which can be associated with spans. It can help provide additional insights on resource consumption around different parts of the system. When enabled, resource consumption will be recorded between start and end of the span and will publish the net consumption. It will consider resources consumed by all threads involved in the lifecycle of a span. Example usage, assuming segment merge is instrumented with tracing framework, on enabling diagnostic on segment merge traces, it will start emitting resource usage by merges such as heap allocation, cpu usage etc. Such diagnostic metrics could pin point the bottlenecks in the system and exact scaling requirements.

Tenet
  1. Minimal latency overhead. No impact when disabled.
  2. Failure in computing diagnosis metrics should not fail the request execution.
  3. Should be behind dynamic cluster setting based feature flag.
Trace Event Listener

In order to add diagnosis metrics associated with span, trace event listeners are required which are called on span start and span end. To handle thread context switch and record resource consumption associated with the new thread spawned, Runnables associated with a span can be wrapped with RunnableWrapper, which is provided by this framework. Wrapped Runnable generates start and end events to accurately measure the resource consumption. TraceEventListener are generic and can be used for other purposes if needed and not just diagnosis.
Context Propagation

Above figure illustrates the expected life cycle of a span, which is started in the middle of an execution of a task, started by Thread 1 (of ThreadPool 1). It performs an IO operation, and on completion of IO, Thread 2 (of ThreadPool 1) resumes the task. Thread 2 also does an IO operation and task is resumed later by the Thread 3 (of ThreadPool 1) on IO completion. Thread 3 performs 3 different sub-tasks, which it executes in parallel by spawning three different threads in different executor services, and is non-blocking in nature, so it continues to work on rest of the task, while sub-tasks are delegated to different threads. However, it does wait for sub-tasks completion, without blocking the thread, and on response to the last sub-task, the task is resumed by Thread 2 (of ThreadPool 1) to its completion and the Span is closed somewhere in between. ThreadPool 3, which doesn't have the Runnable wrapped and doesn't have the span and registered event listeners information, thus span events are not generated.

Diagnostic Trace Event Listener

This is a generic implementation of trace event listeners which records resource consumption using provided ThreadResourceRecorder and emit corresponding metrics using provided MetricEmitter.

Thread Resource Recorder

Thread resource recorder exposes start and end recording method which are used by DiagnosticTraceEventListener to record resource consumption. On start recording, it observes the current usage of a resource using provided resource observer. On end recording, it again observes the current readings, computes the difference in observed metric from start point and emits them. The metric is of delta temporality type.

Thread Resource Observer

It simply exposes observe() method to observe the current reading associated with a thread.

Related Tasks

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -10,6 +10,9 @@
*/

dependencies {
// logging
api "org.apache.logging.log4j:log4j-api:${versions.log4j}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check once if we can add logger dependency in Telemetry lib. Only core lib has this dependency which is a special lib. Though i dont see a problem but we can re-confirm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit hesitant too initially, I will check with others on it. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build fails if we add other dependencies. Please verify.

* @return an unmodifiable map of measurement names to measurement objects
*/
public Map<String, Measurement<Number>> getMeasurements() {
return measurements;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defensive copy?

* @param span the span that has completed
* @param t the thread associated with the span
*/
void onSpanComplete(Span span, Thread t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: onSpanEnd

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2023

Gradle Check (Jenkins) Run Completed with:

@Gaganjuneja
Copy link
Contributor

@rishabhmaurya, Thanks a lot for putting this up. I have a few question.

  1. I want to understand the motivation behind combining the span and diagnostic information. Do we want to capture the diagnostic info for a span/trace or to capture the diagnostic info we are creating the span.
  2. Anyways as of now these metrics are being collated at span level but emitted independently. So can we keep these 2 pieces separate and link with exemplars through spanContext?
  3. How are you thinking of linking these metrics and spans later for analysis? I see you are passing the attributes. Do attributes always have traceID?
  4. Maybe too early but how do these metrics work in case of sampling? Will these also be sampled along with span or not?
  5. I have concerns about exposing the Span publicly. If you just need SpanType and hasEnded attribute then can we just expose these properties through either SpanScope or minimum version of span?

@rishabhmaurya
Copy link
Contributor Author

rishabhmaurya commented Aug 2, 2023

I want to understand the motivation behind combining the span and diagnostic information. Do we want to capture the diagnostic info for a span/trace or to capture the diagnostic info we are creating the span.

the motivation is to capture additional metrics around resource usage associated with a span. This is an optional capability which can be enabled on per span basis if needed and default behavior would be disabled diagnostics. This is helpful for long running jobs, where resource consumption can be recorded for a span like segment merges, peer recovery, index engine start etc and for such long running job, the overhead would be very minimal when compared to the runtime of these jobs.
Diagnosis shouldn't be enabled for operations which have very tight SLA of order of 10ms latency.
We do something similar with resource tracking framework at task level and this is a generic way to solve the problem and is built in an extensible way.

Anyways as of now these metrics are being collated at span level but emitted independently. So can we keep these 2 pieces separate and link with exemplars through spanContext?

I didn't quite get this part. But you're right about collection and emission and they are independent. This framework leaves emission logic open ended. For example, This PR is using opentelemetry for metric emission. - https://github.com/rishabhmaurya/OpenSearch/pull/44/commits. Whereas, if we want to use something else, like stats API to emit into or logging exporter, it can be integrated too.

How are you thinking of linking these metrics and spans later for analysis? I see you are passing the attributes. Do attributes always have traceID?

traceID would a regular field associated with metric which can used to filter when visualizing, it will not be a dimension attribute to avoid high cardinality dimension.

Maybe too early but how do these metrics work in case of sampling? Will these also be sampled along with span or not?

It is the right time to discuss it. It depends on when we are sampling, if it is at collector level, then all of these diagnosis metrics will be available to collector. Metric sampling works differently at collector, they can both be aggregated or sampled at collector and a lot depends on the aggregation function. I will add more details around it.

I have concerns about exposing the Span publicly. If you just need SpanType and hasEnded attribute then can we just expose these properties through either SpanScope or minimum version of span?

Adding points we discussed offline around it a few days ago -
SpanScope is getting bloated with a lot of redundant methods same as Span. We wanted to avoid exposing Span because of endSpan() method, which we only want to be exposed to framework internal classes like SpanScope, which can call it when SpanScope is closed. Is there any other reason why we don't want to expose Span publicly?

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2023

Gradle Check (Jenkins) Run Completed with:

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:


> Task :checkCompatibility
Checking compatibility for: https://github.com/opensearch-project/reporting.git with ref: main
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git]
Compatible components: [https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]

BUILD SUCCESSFUL in 33m 10s

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git]
Compatible components: [https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/reporting.git]

BUILD SUCCESSFUL in 36m 35s

@Gaganjuneja
Copy link
Contributor

I want to understand the motivation behind combining the span and diagnostic information. Do we want to capture the diagnostic info for a span/trace or to capture the diagnostic info we are creating the span.

the motivation is to capture additional metrics around resource usage associated with a span. This is an optional capability which can be enabled on per span basis if needed and default behavior would be disabled diagnostics. This is helpful for long running jobs, where resource consumption can be recorded for a span like segment merges, peer recovery, index engine start etc and for such long running job, the overhead would be very minimal when compared to the runtime of these jobs. Diagnosis shouldn't be enabled for operations which have very tight SLA of order of 10ms latency. We do something similar with resource tracking framework at task level and this is a generic way to solve the problem and is built in an extensible way.

Make sense, I was just thinking if we can keep metric and span separate and link them with traceID then metrics can be supported throughout as most of the work is already done by you.

Anyways as of now these metrics are being collated at span level but emitted independently. So can we keep these 2 pieces separate and link with exemplars through spanContext?

I didn't quite get this part. But you're right about collection and emission and they are independent. This framework leaves emission logic open ended. For example, This PR is using opentelemetry for metric emission. - https://github.com/rishabhmaurya/OpenSearch/pull/44/commits. Whereas, if we want to use something else, like stats API to emit into or logging exporter, it can be integrated too.

My point is, we are not doing any aggregation at span level so there metrics captured for thread are independent and can be independently emitted without clubbing with the span logic.

How are you thinking of linking these metrics and spans later for analysis? I see you are passing the attributes. Do attributes always have traceID?

traceID would a regular field associated with metric which can used to filter when visualizing, it will not be a dimension attribute to avoid high cardinality dimension.

How this is being done? Can you please point me to the code path?

Maybe too early but how do these metrics work in case of sampling? Will these also be sampled along with span or not?

It is the right time to discuss it. It depends on when we are sampling, if it is at collector level, then all of these diagnosis metrics will be available to collector. Metric sampling works differently at collector, they can both be aggregated or sampled at collector and a lot depends on the aggregation function. I will add more details around it.

I am talking for head sampling.

I have concerns about exposing the Span publicly. If you just need SpanType and hasEnded attribute then can we just expose these properties through either SpanScope or minimum version of span?

Adding points we discussed offline around it a few days ago - SpanScope is getting bloated with a lot of redundant methods same as Span. We wanted to avoid exposing Span because of endSpan() method, which we only want to be exposed to framework internal classes like SpanScope, which can call it when SpanScope is closed. Is there any other reason why we don't want to expose Span publicly?

No this is the main concern. Anyways this is being discussed here - https://github.com/opensearch-project/OpenSearch/pull/8831/files/7bac1e33dea887e9c35e08e54b8e7d5d2bc53bcf#r1281897313

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 26m 48s

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Aug 3, 2023

Gradle Check (Jenkins) Run Completed with:

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

BUILD SUCCESSFUL in 27m 58s

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2023

Gradle Check (Jenkins) Run Completed with:

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 26m 36s

@rishabhmaurya
Copy link
Contributor Author

cc @reta looking forward to your comments. Retagging in case you missed it earlier, thanks.

@rishabhmaurya could you please rebase? there are conflicts, thank you

There were some disruptive changes with last rebase, but its rebased now. Thank you.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@reta
Copy link
Collaborator

reta commented Aug 11, 2023

@rishabhmaurya I am not sure what is the state of this pull request but here are my general observations:

  • the telemetry support is split across 2 pieces (right now): tracing and metric
  • the telemetry may provide both or just one
  • the metrics and tracing should be separated (they are very much coupled now)

In case of OTel (our plugin):

Please bring this pull request into the state where at least checks are passing, so it could be reviewed knowing nothing is missed, thank you.

@rishabhmaurya
Copy link
Contributor Author

My bad on not getting to this one timely, I got side tracked into other work and didn't get chance to take it forward.

the metrics and tracing should be separated (they are very much coupled now)

If you look at the decoupling for example all the wrappers and Span event listener interfaces and implemented, they are very much decoupled.

the metrics, if available, could be integrated with tracing using SpanProcessor

I like the idea of using SpanProcessor for otel implementation, we still would need SpanProcess similar interfaces defined in opensearch telemetry packages, which would be similar to SpanEventListener, I will rename it and try to make it compatible with SpanProcessor.

the tracing should be using semantic conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/span-general.md#general-thread-attributes) to facilitate context passing (like thread ids, etc)

Sure, I will change them in the next iteration.

@rishabhmaurya
Copy link
Contributor Author

@Gaganjuneja @reta
I was thinking about using SpanProcessor however it can only be tied the to Span start/end events. The reason for using TraceEventListener was to accurately measure the resource consumed by the different threads involved in the lifecycle of the span, so just start and end event of span were not sufficient, thus I associated Runnable start and end event as well.
The framework implementation has changed quite a bit and if my understanding is correct, SpanScope is something which can start and end multiple times now for a given span depending on when a thread starts/ends its execution.
Does it makes sense to associate these events with SpanScope instead of Span start/end and Runnable start/end?

@Gaganjuneja
Copy link
Contributor

@Gaganjuneja @reta I was thinking about using SpanProcessor however it can only be tied the to Span start/end events. The reason for using TraceEventListener was to accurately measure the resource consumed by the different threads involved in the lifecycle of the span, so just start and end event of span were not sufficient, thus I associated Runnable start and end event as well. The framework implementation has changed quite a bit and if my understanding is correct, SpanScope is something which can start and end multiple times now for a given span depending on when a thread starts/ends its execution. Does it makes sense to associate these events with SpanScope instead of Span start/end and Runnable start/end?

@rishabhmaurya, I think here we have 2 uber points.

  1. How can we evaluate the metric in the scope of a span. Either through EventListener and Runnables or SpanScope.
  2. How can we integrate the metric and traces?

As @reta also mentioned, metric and traces are two distinct components and shouldn't be coupled as in the running prod environment it both may be independently enabled/disabled or even coexist. Traces would get sampled as well but metrics we may not want to sample as these are aggregate in nature. So in case we need to integrate them together then it should be done at a framework level once both are generated.

But if I understood your use case, then you just want to aggregate a metric for all the threads being used in the scope of a span. We need to think of a way to do that like using some callbacks/logic with span state or maybe adding spanId as an attribute to the metric etc.

@rishabhmaurya
Copy link
Contributor Author

rishabhmaurya commented Sep 21, 2023

This PR shouldn't be confused with adding metrics support in the framework. It has following components -

  1. Adding event listeners around spans to precisely track when a thread start/end working on a span for all threads involved for a given span (and subspans).
  2. Thread resource recorder and observer interfaces (ThreadResourceObserver & ThreadResourceRecorder) to track resource utilization of threads involved in a span (aka DiagnosticSpan).
  3. Emission of resource utilization computed in 2 using MetricEmitter interface.

If you look at the interfaces, all 3 are components are very much decoupled. MetricEmitter when implmented in OpenTelemetry uses OpenTelemetry meters which are part of another PR - rishabhmaurya#47

The motivation of this PR is, when diagnosis is enabled in tracing framework, it will start recording resource usage for all threads within span boundaries. Later metrics would have attributes associated with them to filter/aggregate on those attributes.

@Gaganjuneja
Copy link
Contributor

I think we are not digressing towards the Metrics Framework (Anyways for that I have put up an RFC #10141). I think we need the answer of following fundamental questions and then we should be proceed.

  1. How are we going to use these metrics?
  2. The impact on metric if spans are getting sampled?
  3. What is the expected state when either metrics or tracing or both are disabled.
  4. If this is so coupled with the scope of a Span then why can't we just make it an attribute to a Span?

@rishabhmaurya
Copy link
Contributor Author

How are we going to use these metrics?

The use of metric is open ended in the framework lib. If we use OpenTelemetry to emitMetric() then the opentelemetry meters would be used to emit these metrics and metrics would contain the span attributes which can be used to filter/aggregate metrics later on.
For instance, assuming PeerRecovery code is wrapped around spans, we can measure the resources consumed by the threads involved in PeerRecovery. We can visualize the overall impact and can drill down on the basis of attributes such as source_node, destination_node, index, shard or which phase consumed most resources (send_files, translog_recovery)as these attributes would be associated with metrics while emission.
Currently, its not adding trace-id as an attribute, however if added, consumption can be filtered on the basis of individual traces too if needed.

The impact on metric if spans are getting sampled?

As far as span context is propagated, span events would be generated and resource recording will happen and metric would be emitted.

What is the expected state when either metrics or tracing or both are disabled.

When metrics are disabled, resource usage would be computed but will not be emitted.
When tracing is completely disabled, resource usage won't be computed as there are no spans for which the resource needs to be computed and thus no metrics would be generated.
If both are disabled, nothing would be computed or emitted.

If this is so coupled with the scope of a Span then why can't we just make it an attribute to a Span?

the only reason its coupled with Span is because its recording the usage of a span and its not coupled directly to span but depends on the events of the spans.

Did you mean to say why can't we emit these metrics as part of span attribute or emit a span event?
If yes, we want them as metrics and not attributes. Metrics can be stored in a timeseries database, can be visualized, aggregated and filtered, which isn't possible with raw span attributes/events.

@reta
Copy link
Collaborator

reta commented Sep 22, 2023

@rishabhmaurya let's try to get back to basics

  1. We have two separate components: traces and metrics (see please [RFC] Metrics Framework #10141)
  2. By and large, those components are independent and could function independently together or individually

Now, let's try to understand what are the relations between tracing and metrics:

  • tracing is about execution flow: how the individual request travels through the system
  • metrics are about operational state of the system expressed in terms of quantifiable measures

What are the lifetimes of tracing data and metrics in the functioning (up and running) system? Most of the metrics live as along as the system is running (I am discarding the persistent layer as it is not relevant at the moment), the system like OpenSearch may have thousands of metrics (and it really has) exposed. Most of the tracing data (like spans) live for a duration of the request (again, I am discarding the persistent layer as it is not relevant at the moment), the system like OpenSearch may have thousands of spans every single second.

Span have a mechanism to capture additional details (like attributes and events), but I would argue that spans should not mess with metrics in any way imaginable. Why is that:

When enabled, resource consumption will be recorded between start and end of the span and will publish the net consumption. It will consider resources consumed by all threads involved in the lifecycle of a span.

First of all, we already have the tracking mechanism, see please TaskResourceTrackingService & related APIs, we definitely don't need over tracking here. Secondly, could you please explain what are practical reasons to track CPU time for each and every span in the system? I would suspect the >90% of them would have 0 time spent but the overhead of calling ThreadMXBean would be huge.

@rishabhmaurya
Copy link
Contributor Author

rishabhmaurya commented Sep 24, 2023

Thanks @reta for the feedback. You have raised all valid concerns and its a privilege to discuss them with folks here who understand this part of system well and get their feedback. I would like to trace back my thought process on how I arrived at this solution and then we can decide if we want to move ahead with this solution or a different solution.

Its mostly derived from managed service use cases (and should be helpful for the whole community) where we don't have enough insights on system level metrics for background jobs like peer recovery, merges, index management jobs like ultrawarm-migrations/transforms/rollover etc. These job can have impact on the search/indexing latency or resiliency of the system and it becomes very hard to troubleshoot such problem both for live and past events and pin point to specific job/event.

My plan was to add capability to emit metrics which are available with the performance analyzer plugin - https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/ only for background jobs and not for each search/indexing request. The number of events generated by background tasks is most of the time directly proportional to the rate of growth of data, various indexing policies and cluster events. Whereas, the number of events generated by the requests is directly proportional to the number of requests. The majority of the events of any background task are long running ( > 1s) unlike requests where majority of events are shortlived (< 10 msecs). These are important invariants in the strategy to design the impact measurement for background tasks and for this reason the proposal here cannot be used for requests.

Your concern is valid and it's crucial to consider the trade-offs, including the potential overhead, and ensure that the data collected serves a meaningful purpose in optimizing system performance or diagnosing issues. There is no one-size-fits-all answer, and this approach should be selective in nature where operator can choose for which background job this should be enabled and which metrics to emit. For instance, if we want to get insights on segment merges, we may want to enable metrics like - cpu time, heap allocation rate, IO read/write throughput, Disk metrics, page faults - major/minor. Similarly, for peer recovery, there could be contention between threads esp while translog recovery, so we may want to monitor additionally thread block_time.

Once the impact is quantified, these metrics can be used for various purposes - (1) Improving the resiliency by detecting issues faster, understanding the bottlenecks because of a background task and devising the remediation by changing configuration at various layers - OpenSearch (Shards/Index or Cluster), JVM or by using the right hardware or scaling the machines. (2) Building automation to throttle the background tasks at most granular level like shards and nodes on which they run. This would contribute towards improving the availability of the system. (3) Improving the quality of software releases by using these metrics to create baselines of background tasks impact, for different workloads defined in the benchmark.

Why I'm trying to club span events with metrics?
If we have spans around these background jobs, then we don't have to write any additional code to get these system level metrics around them. It would just be some configuration to be enabled and choosing available metrics to be computed/emitted against each of these background task. Spans generated for background jobs would be very less in number compared to spans for search/indexing request, so scaling should not be an issue. Also, overhead of calling ThreadMXBean or any metric computation logic would be limited too. I agree we should not enable such metrics for regular search/indexing requests.

@msfroh
Copy link
Collaborator

msfroh commented Sep 25, 2023

@rishabhmaurya -- I've read through your comment above, and I'm still not entirely sure what the purpose of this PR should be -- which I think is more on me than it is on you -> I lack sufficient context, but I would like to get it. Essentially, I would like to review your PR, but I don't know where to start.

Can you provide a tl;dr: explanation in three sentences that conveys the following:

  1. Who needs to care about this PR -- essentially, who is the target audience?
  2. What problem does this PR solve?
  3. How does the PR solve the problem?

@rishabhmaurya
Copy link
Contributor Author

Who needs to care about this PR -- essentially, who is the target audience?

Target audience is cluster administrators to get insights into scaling requirements, schedule (if possible) background tasks at appropriate time, insights into increased latency because of contention with background tasks.
Background job could be segment merges, snapshots, shard peer recovery, index management jobs etc.

What problem does this PR solve?

  • It assumes once a background job is wrapped around span, its span events can be used to trigger computation of resources consumed by the threads involved. It will compute the delta of resources consumed by threads from start to end of the span and publish it in form of metrics.
  • Initial plan is to support metrics which can be computed using ThreadMXBean. Ultimately support subset of metrics from https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/ like Disk IO - readBytes, writeBytes, readSysCallRate, writeSysCallRate, page faults etc.
  • Admins should be able to control for which background job this diagnosis needs to be enabled dynamically. They should also have control to select which metrics needs to be enabled against a background job.

How does the PR solve the problem?

  • It exposes trace event listeners and runnable event listeners (Runnable wrapper to listen to event for a Runnable when it starts/stop executing a task).
  • It adds a DiagnosisSpan, which is a wrapper over a Span. DiagnosticsEventListener which is a TraceEventListener have access to DiagnosisSpan to persist the start reading and other attributes and use it to compute delta at end event and publish the metrics with the right attributes.
  • This PR adds ThreadResourceObserver and ThreadResourceRecorder. ThreadResourceRecorder starts and ends recording for a ThreadResourceObserver is observing and is invoked by DiagnosticsEventListener.
  • Its a pluggable architecture so one can write a plugin to add logic to observe and record usage of some other resource.

Thanks @msfroh for taking a look.

@reta
Copy link
Collaborator

reta commented Sep 25, 2023

Thanks @reta for the feedback. You have raised all valid concerns and its a privilege to discuss them with folks here who understand this part of system well and get their feedback. I would like to trace back my thought process on how I arrived at this solution and then we can decide if we want to move ahead with this solution or a different solution.

Thanks for the context @rishabhmaurya, it is interesting that you have mentioned merges a few times, we have been thinking about them a lot but it seems like there should be a lot of prerequisite work to be done in order to even get this processes somehow instrumented (merge schedulers etc). Right now Apache Lucene does use internal pools and threads to spawn merges but has to be changed.

I would like to suggest the following plan of action:

  • kick off [RFC] Metrics Framework #10141 - that will give the foundation for metrics (as the holistic APIs)
  • look into existing TaskResourceTrackingService, it is supposed to be tracing the most important flows in the system (which are usually executed as tasks)
  • specifically related to merges, look into instrumenting the Apache Lucene merge process

With respect to this particular pull request, I would suggest to put it on hold (draft mode?) and get back to it when we a) have all pieces together b) conclude that this is the way to implement resource tracking or alike

@rishabhmaurya
Copy link
Contributor Author

@reta my understanding for merge was, since we use custom OpenSearchConcurrentMergeScheduler, thus wrapping doMerge() with span should be sufficient since its a single threaded task.

@reta
Copy link
Collaborator

reta commented Sep 25, 2023

@reta my understanding for merge was, since we use custom OpenSearchConcurrentMergeScheduler, thus wrapping doMerge() with span should be sufficient since its a single threaded task.

@rishabhmaurya hm ... doMerge() in called in MergeThread , whereas the merge thread is created by getMergeThread, wrapping doMerge() won't be sufficient since we have to propagate the context to the MergeThread first, unless I am missing something.

@rishabhmaurya
Copy link
Contributor Author

@rishabhmaurya hm ... doMerge() in called in MergeThread , whereas the merge thread is created by getMergeThread, wrapping doMerge() won't be sufficient since we have to propagate the context to the MergeThread first, unless I am missing something.

@reta since merge thread is single threaded for a single merge task, do we have to care of context propagation? It sounds like a little hack but should work if we pass the tracer to OpenSearchConcurrentMergeScheduler constructor and start and end the span in doMerge().

@reta
Copy link
Collaborator

reta commented Sep 26, 2023

@rishabhmaurya hm ... doMerge() in called in MergeThread , whereas the merge thread is created by getMergeThread, wrapping doMerge() won't be sufficient since we have to propagate the context to the MergeThread first, unless I am missing something.

@reta since merge thread is single threaded for a single merge task, do we have to care of context propagation? It sounds like a little hack but should work if we pass the tracer to OpenSearchConcurrentMergeScheduler constructor and start and end the span in doMerge().

@rishabhmaurya No, we need to see not only merge thread(s), which would appear as isolated spans, but how the merge was triggered, background task or force merge, other means. For this purposes we need context propagation.

@rishabhmaurya
Copy link
Contributor Author

@reta how about propagating context to merge thread in getMergeThread() -

protected MergeThread getMergeThread(MergeSource mergeSource, MergePolicy.OneMerge merge) throws IOException {

so if the thread executing IndexWriter merge functions has the context, it should work?

@reta
Copy link
Collaborator

reta commented Sep 26, 2023

so if the thread executing IndexWriter merge functions has the context, it should work?

Yeah, that the instrumentation that I mentioned and needed to be in place, however we also need to make sure the flow where OpenSearchConcurrentMergeScheduler is running is also instrumented. If you think to take that, please create an issue and submit the pull request, thank you.

@rishabhmaurya
Copy link
Contributor Author

@reta created #10244 feel free to assign it to me.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jan 13, 2024
@kotwanikunal
Copy link
Member

@rishabhmaurya Are you still working on this one?

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Feb 23, 2024
@kotwanikunal
Copy link
Member

Please reopen if you are still planning to contribute this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants