-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace events lib changes #8816
Trace events lib changes #8816
Conversation
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
@@ -10,6 +10,9 @@ | |||
*/ | |||
|
|||
dependencies { | |||
// logging | |||
api "org.apache.logging.log4j:log4j-api:${versions.log4j}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check once if we can add logger dependency in Telemetry lib
. Only core
lib has this dependency which is a special lib. Though i dont see a problem but we can re-confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a bit hesitant too initially, I will check with others on it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build fails if we add other dependencies. Please verify.
* @return an unmodifiable map of measurement names to measurement objects | ||
*/ | ||
public Map<String, Measurement<Number>> getMeasurements() { | ||
return measurements; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defensive copy?
* @param span the span that has completed | ||
* @param t the thread associated with the span | ||
*/ | ||
void onSpanComplete(Span span, Thread t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: onSpanEnd
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
27d5511
to
993f0e5
Compare
Gradle Check (Jenkins) Run Completed with:
|
@rishabhmaurya, Thanks a lot for putting this up. I have a few question.
|
the motivation is to capture additional metrics around resource usage associated with a span. This is an optional capability which can be enabled on per span basis if needed and default behavior would be disabled diagnostics. This is helpful for long running jobs, where resource consumption can be recorded for a span like segment merges, peer recovery, index engine start etc and for such long running job, the overhead would be very minimal when compared to the runtime of these jobs.
I didn't quite get this part. But you're right about collection and emission and they are independent. This framework leaves emission logic open ended. For example, This PR is using opentelemetry for metric emission. - https://github.com/rishabhmaurya/OpenSearch/pull/44/commits. Whereas, if we want to use something else, like stats API to emit into or logging exporter, it can be integrated too.
traceID would a regular field associated with metric which can used to filter when visualizing, it will not be a dimension attribute to avoid high cardinality dimension.
It is the right time to discuss it. It depends on when we are sampling, if it is at collector level, then all of these diagnosis metrics will be available to collector. Metric sampling works differently at collector, they can both be aggregated or sampled at collector and a lot depends on the aggregation function. I will add more details around it.
Adding points we discussed offline around it a few days ago - |
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Compatibility status:
|
Gradle Check (Jenkins) Run Completed with:
|
Compatibility status:
|
Make sense, I was just thinking if we can keep metric and span separate and link them with traceID then metrics can be supported throughout as most of the work is already done by you.
My point is, we are not doing any aggregation at span level so there metrics captured for thread are independent and can be independently emitted without clubbing with the span logic.
How this is being done? Can you please point me to the code path?
I am talking for head sampling.
No this is the main concern. Anyways this is being discussed here - https://github.com/opensearch-project/OpenSearch/pull/8831/files/7bac1e33dea887e9c35e08e54b8e7d5d2bc53bcf#r1281897313 |
3a79984
to
6e7471d
Compare
Gradle Check (Jenkins) Run Completed with:
|
6e7471d
to
71206b9
Compare
Compatibility status:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Compatibility status:
|
e16a9df
to
997a01f
Compare
Gradle Check (Jenkins) Run Completed with:
|
Compatibility status:
|
997a01f
to
52a6212
Compare
There were some disruptive changes with last rebase, but its rebased now. Thank you. |
Gradle Check (Jenkins) Run Completed with:
|
@rishabhmaurya I am not sure what is the state of this pull request but here are my general observations:
In case of OTel (our plugin):
Please bring this pull request into the state where at least checks are passing, so it could be reviewed knowing nothing is missed, thank you. |
My bad on not getting to this one timely, I got side tracked into other work and didn't get chance to take it forward.
If you look at the decoupling for example all the wrappers and Span event listener interfaces and implemented, they are very much decoupled.
I like the idea of using SpanProcessor for otel implementation, we still would need SpanProcess similar interfaces defined in opensearch telemetry packages, which would be similar to SpanEventListener, I will rename it and try to make it compatible with SpanProcessor.
Sure, I will change them in the next iteration. |
@Gaganjuneja @reta |
@rishabhmaurya, I think here we have 2 uber points.
As @reta also mentioned, metric and traces are two distinct components and shouldn't be coupled as in the running prod environment it both may be independently enabled/disabled or even coexist. Traces would get sampled as well but metrics we may not want to sample as these are aggregate in nature. So in case we need to integrate them together then it should be done at a framework level once both are generated. But if I understood your use case, then you just want to aggregate a metric for all the threads being used in the scope of a span. We need to think of a way to do that like using some callbacks/logic with span state or maybe adding spanId as an attribute to the metric etc. |
This PR shouldn't be confused with adding metrics support in the framework. It has following components -
If you look at the interfaces, all 3 are components are very much decoupled. MetricEmitter when implmented in OpenTelemetry uses OpenTelemetry meters which are part of another PR - rishabhmaurya#47 The motivation of this PR is, when diagnosis is enabled in tracing framework, it will start recording resource usage for all threads within span boundaries. Later metrics would have attributes associated with them to filter/aggregate on those attributes. |
I think we are not digressing towards the Metrics Framework (Anyways for that I have put up an RFC #10141). I think we need the answer of following fundamental questions and then we should be proceed.
|
The use of metric is open ended in the framework lib. If we use OpenTelemetry to
As far as span context is propagated, span events would be generated and resource recording will happen and metric would be emitted.
When metrics are disabled, resource usage would be computed but will not be emitted.
the only reason its coupled with Span is because its recording the usage of a span and its not coupled directly to span but depends on the events of the spans. Did you mean to say why can't we emit these metrics as part of span attribute or emit a span event? |
@rishabhmaurya let's try to get back to basics
Now, let's try to understand what are the relations between tracing and metrics:
What are the lifetimes of tracing data and metrics in the functioning (up and running) system? Most of the metrics live as along as the system is running (I am discarding the persistent layer as it is not relevant at the moment), the system like OpenSearch may have thousands of metrics (and it really has) exposed. Most of the tracing data (like spans) live for a duration of the request (again, I am discarding the persistent layer as it is not relevant at the moment), the system like OpenSearch may have thousands of spans every single second. Span have a mechanism to capture additional details (like attributes and events), but I would argue that spans should not mess with metrics in any way imaginable. Why is that:
First of all, we already have the tracking mechanism, see please |
Thanks @reta for the feedback. You have raised all valid concerns and its a privilege to discuss them with folks here who understand this part of system well and get their feedback. I would like to trace back my thought process on how I arrived at this solution and then we can decide if we want to move ahead with this solution or a different solution. Its mostly derived from managed service use cases (and should be helpful for the whole community) where we don't have enough insights on system level metrics for background jobs like peer recovery, merges, index management jobs like ultrawarm-migrations/transforms/rollover etc. These job can have impact on the search/indexing latency or resiliency of the system and it becomes very hard to troubleshoot such problem both for live and past events and pin point to specific job/event. My plan was to add capability to emit metrics which are available with the performance analyzer plugin - https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/ only for background jobs and not for each search/indexing request. The number of events generated by background tasks is most of the time directly proportional to the rate of growth of data, various indexing policies and cluster events. Whereas, the number of events generated by the requests is directly proportional to the number of requests. The majority of the events of any background task are long running ( > 1s) unlike requests where majority of events are shortlived (< 10 msecs). These are important invariants in the strategy to design the impact measurement for background tasks and for this reason the proposal here cannot be used for requests. Your concern is valid and it's crucial to consider the trade-offs, including the potential overhead, and ensure that the data collected serves a meaningful purpose in optimizing system performance or diagnosing issues. There is no one-size-fits-all answer, and this approach should be selective in nature where operator can choose for which background job this should be enabled and which metrics to emit. For instance, if we want to get insights on segment merges, we may want to enable metrics like - cpu time, heap allocation rate, IO read/write throughput, Disk metrics, page faults - major/minor. Similarly, for peer recovery, there could be contention between threads esp while translog recovery, so we may want to monitor additionally thread block_time. Once the impact is quantified, these metrics can be used for various purposes - (1) Improving the resiliency by detecting issues faster, understanding the bottlenecks because of a background task and devising the remediation by changing configuration at various layers - OpenSearch (Shards/Index or Cluster), JVM or by using the right hardware or scaling the machines. (2) Building automation to throttle the background tasks at most granular level like shards and nodes on which they run. This would contribute towards improving the availability of the system. (3) Improving the quality of software releases by using these metrics to create baselines of background tasks impact, for different workloads defined in the benchmark. Why I'm trying to club span events with metrics? |
@rishabhmaurya -- I've read through your comment above, and I'm still not entirely sure what the purpose of this PR should be -- which I think is more on me than it is on you -> I lack sufficient context, but I would like to get it. Essentially, I would like to review your PR, but I don't know where to start. Can you provide a tl;dr: explanation in three sentences that conveys the following:
|
Target audience is cluster administrators to get insights into scaling requirements, schedule (if possible) background tasks at appropriate time, insights into increased latency because of contention with background tasks.
Thanks @msfroh for taking a look. |
Thanks for the context @rishabhmaurya, it is interesting that you have mentioned merges a few times, we have been thinking about them a lot but it seems like there should be a lot of prerequisite work to be done in order to even get this processes somehow instrumented (merge schedulers etc). Right now Apache Lucene does use internal pools and threads to spawn merges but has to be changed. I would like to suggest the following plan of action:
With respect to this particular pull request, I would suggest to put it on hold (draft mode?) and get back to it when we a) have all pieces together b) conclude that this is the way to implement resource tracking or alike |
@reta my understanding for merge was, since we use custom |
@rishabhmaurya hm ... |
@reta since merge thread is single threaded for a single merge task, do we have to care of context propagation? It sounds like a little hack but should work if we pass the tracer to |
@rishabhmaurya No, we need to see not only merge thread(s), which would appear as isolated spans, but how the merge was triggered, background task or force merge, other means. For this purposes we need context propagation. |
@reta how about propagating context to merge thread in OpenSearch/server/src/main/java/org/opensearch/index/engine/OpenSearchConcurrentMergeScheduler.java Line 192 in 13fc7cc
so if the thread executing IndexWriter merge functions has the context, it should work? |
Yeah, that the instrumentation that I mentioned and needed to be in place, however we also need to make sure the flow where OpenSearchConcurrentMergeScheduler is running is also instrumented. If you think to take that, please create an issue and submit the pull request, thank you. |
This PR is stalled because it has been open for 30 days with no activity. |
@rishabhmaurya Are you still working on this one? |
Please reopen if you are still planning to contribute this. |
Description
As part of tracing framework #7648, we added tracing capabilities in OpenSearch, this proposal is to add diagnosis metrics which can be associated with spans. It can help provide additional insights on resource consumption around different parts of the system. When enabled, resource consumption will be recorded between start and end of the span and will publish the net consumption. It will consider resources consumed by all threads involved in the lifecycle of a span. Example usage, assuming segment merge is instrumented with tracing framework, on enabling diagnostic on segment merge traces, it will start emitting resource usage by merges such as heap allocation, cpu usage etc. Such diagnostic metrics could pin point the bottlenecks in the system and exact scaling requirements.
Tenet
Trace Event Listener
In order to add diagnosis metrics associated with span, trace event listeners are required which are called on span start and span end. To handle thread context switch and record resource consumption associated with the new thread spawned,
Runnables
associated with a span can be wrapped with RunnableWrapper, which is provided by this framework. Wrapped Runnable generates start and end events to accurately measure the resource consumption. TraceEventListener are generic and can be used for other purposes if needed and not just diagnosis.Above figure illustrates the expected life cycle of a span, which is started in the middle of an execution of a task, started by
Thread 1
(ofThreadPool 1
). It performs an IO operation, and on completion of IO,Thread 2
(ofThreadPool 1
) resumes the task.Thread 2
also does an IO operation and task is resumed later by theThread 3
(ofThreadPool 1
) on IO completion.Thread 3
performs 3 different sub-tasks, which it executes in parallel by spawning three different threads in different executor services, and is non-blocking in nature, so it continues to work on rest of the task, while sub-tasks are delegated to different threads. However, it does wait for sub-tasks completion, without blocking the thread, and on response to the last sub-task, the task is resumed byThread 2
(ofThreadPool 1
) to its completion and the Span is closed somewhere in between.ThreadPool 3
, which doesn't have the Runnable wrapped and doesn't have the span and registered event listeners information, thus span events are not generated.Diagnostic Trace Event Listener
This is a generic implementation of trace event listeners which records resource consumption using provided
ThreadResourceRecorder
and emit corresponding metrics using providedMetricEmitter
.Thread Resource Recorder
Thread resource recorder exposes start and end recording method which are used by
DiagnosticTraceEventListener
to record resource consumption. On start recording, it observes the current usage of a resource using provided resource observer. On end recording, it again observes the current readings, computes the difference in observed metric from start point and emits them. The metric is of delta temporality type.Thread Resource Observer
It simply exposes
observe()
method to observe the current reading associated with a thread.Related Tasks
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.