[Feature Request] Ability to (filter) trace the bulk/search requests with high latency or resource consumption #12315
Comments
[Triage - attendees 1 2 3 4 5 6] Adding @reta for more thoughts |
@backslasht I believe tail based sampling is the best option in this case since we cannot make the decision upfront regarding latency
Tail-based sampling would be done outside of OpenSearch (on the OTel exporter side), so I believe we should not be too concerned about the additional computation and memory. |
@backslasht I don't fully understand how the threshold-based solution would work. Don't you have to capture all spans since you don't know at the start that they will breach the configured threshold? It's only when they complete (or even all spans within a trace complete) that you know the threshold wasn't breached and they are safe to discard. I'm confused about how this is different from tail-based sampling in practice. |
I think the ask here is to have a delayed sampling decision (within the context of a live request) because
We need to see how it can be accommodated in the otel framework. |
Single span collector node, just to clarify here
Not all requests, but only the requests of specific types (actions), just to clarify on that |
@reta - Tail-based sampling works if the number of requests is in the hundreds per second, but it becomes a bottleneck when the count grows to tens of thousands of requests per second. It puts additional pressure on OpenSearch to collect, process and export. Also, if the sampling is done outside the OpenSearch cluster, it brings in additional resource consumption w.r.t. network I/O and compression.
@andrross - As @reta pointed out, in tail-based sampling the decision is taken much later, mostly in external systems, since they get to see the full view of the request (coordinator nodes + N data nodes). What I suggest is a variation of that, where the decision to capture is taken by one or more of the data nodes and communicated back to the coordinator so that the corresponding spans (coordinator spans and the spans of the relevant data nodes) are captured. Though this doesn't provide the full view of the request, it does provide a view of the problematic parts (spans) of the request, which will help in debugging the issue. |
@backslasht Got it, this does seem like a variation of tail-based sampling where essentially each component can make an independent tail decision about whether its span should be captured. It also seems like any communication/coordination between the different components to ensure a coherent trace gets captured when one of the spans makes the decision to capture will be a challenge here (though to be honest I don't know OpenTelemetry well). |
@backslasht I think it is clear that in order to support a feature like that, the complete state of the trace (spread across the cluster nodes) has to be kept somewhere for the duration of the request. In view of your next comment ...
... keeping the trace state of tens of thousands of requests on the OpenSearch side in case any of them may backfire is an unsound design decision from the start (at least, to me). OpenSearch does not do any processing over spans - it merely collects them and sends them over the wire (in batches). The overhead of that could be measured and accounted for, but it is very lightweight. There are many large high-volume systems out there that do use tracing at scale; it needs infrastructure, obviously, but that is a different problem. In the end, the hit will be taken by the collector that has to accommodate the tail sampling requirement - it makes the system more reliable (if the collector dies or needs scaling, there is no visible impact for users). Tail/adaptive sampling is a difficult problem to solve; I think we have to stay realistic and explore the limits of the existing systems before making any statements regarding how they behave or may behave. To my knowledge, we have not done any of that yet. |
I guess I was not very clear in my previous comment. The suggestion is not to keep the traces of completed requests. As per the design today, the traces are kept in memory (in coordinator nodes) while the requests are still being processed on data nodes, and are written to the wire when the response is sent back. The proposal is an optimization wherein OpenSearch can decide whether the trace needs to be sent to the wire based on its significance, and the significance is determined by a certain rule.
I agree, we don't have benchmarks on the impact of collecting traces at large scale. I can go create large volumes of requests and measure the impact, but I am open to suggestions if you have any thoughts on large-scale workloads? |
This is not how it works today (if by traces we mean the tracing instrumentation): the trace spans are flushed as they are ended, and nothing is kept in memory (besides the contextual details, which are basically trace / span ids)
We have benchmarking framework, we could simulate workloads. |
Yes, but in most cases the parent span ends after the child has ended (except in some async scenarios). |
Thanks! exactly the point. |
@backslasht Thank you for filing the issue. I do see the point of this feature and did a quick load test to see how just creating a lot of spans can impact overall cluster performance. With the load test we could see a significant increase in node network bandwidth when the number of spans generated is high. This does make a significant difference and can impact the workload and operations. Following are the tests which I conducted: Test Environment Setup:
[Figure: when traces are disabled] [Figure: when traces are enabled] To further check whether bandwidth utilization gets worse with the span count, I could see that with 328K requests it increased further, at one point using the full network bandwidth of the node, and a drop in spans was observed during that time frame. Beyond spans, the overall node health checks also stalled, causing nodes to drop from and then rejoin the cluster. So, to overcome such behavior and be fault tolerant, I do agree there is merit in exporting a limited set of spans. We can capture a child span if it is above a threshold, and from there go on to capture its parents, i.e. a branch up to the root span, which would have the details for all spans that have not yet ended and are waiting for the request to be processed. I will soon come up with the overall proposal for the same. |
@nishchay21 there are tradeoffs with every approach:
To understand what exactly you experimented with, could you please share:
Thank you. |
@reta I do agree that we need to keep a 100% sampling rate for the above solution as well, which might consume some extra resources on the cluster. To put it another way, this would be most helpful where someone wants to enable tail sampling on the cluster and get the ability to trace only the spans which are anomalous. As tail sampling has its own caveats (high storage cost, extra processing cost and extra network bandwidth), this solution will help reduce those costs and still get to the anomalous spans. To answer the questions:
|
I think, judging from your reply, the reasoning about this problem is not correct: we don't need anomalous spans, we need traces that are outliers. Basically, we need the whole-system view for such outliers, not just "out of context" spans (this is why it is a hard problem). PS: If we think a bit about the tradeoffs, I would like to refer you to the recently added segment-based replication, where we traded compute for network bandwidth.
So this basically is not a "valid" test - the collector contributes to the network consumption, but it has to be excluded. Please test with the collector deployed separately so we would only count the cost of exporting spans. |
@reta by anomalous spans I basically meant spans which are themselves outliers. Just to explain further, the proposal is to capture not just the single span which is an outlier, but also the spans from there up the chain of parents [only if the parent is still open and has not ended]. So to explain this:
Once the parent span receives the information about the child being sampled, there are two possibilities:
1. Parent span is still recording - If the parent span is still recording, we will have that span sampled as well and store the information for it. Once the span is marked to be sampled, we will send the same sampling information up the hierarchy until we reach the root node. This way we will be able to sample the request up to its root.
2. Parent span has stopped recording - Another possibility is that the parent span has ended and is no longer in the recording phase. In this case we will not have the parent wait to be sampled and will just drop the parent information. As the parent is not waiting for the child span to complete, we know that the overall request does not depend much on this span and there is no use in capturing it. So in this case we will just ignore the span and send the information on to its parent span, and so on up the hierarchy.
As we have already discussed above, keeping all the traces or spans in memory for long can cause a lot of memory usage, so we will not do that in this approach and will just let the parent span get dropped if it is not marked to be sampled and ended before the child span. The main advantages of this approach:
To add further detail on the test: I ran a custom collector on the node which just reads the traces and pushes them out. So only the traces themselves are exported; this does not involve any other data from the collector itself. I believe this should emulate the same behavior. If required, we can test with the collector outside the node as well. |
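The propagation rule described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the `Span` class, its fields, and `mark_outlier` are hypothetical names, and the sketch ignores the cross-node communication that the thread identifies as the hard part.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical in-memory span record (not an OpenTelemetry type)."""
    name: str
    parent: Optional["Span"] = None
    ended: bool = False    # True once the span has stopped recording
    sampled: bool = False  # True once the span is marked for capture


def mark_outlier(span: Span) -> List[str]:
    """Mark an outlier span as sampled and walk up its parent chain.

    Parents that are still recording are marked as sampled too.
    Parents that have already ended are skipped (their data is dropped),
    and the walk continues towards the root.
    Returns the names of the spans that were marked, child first.
    """
    span.sampled = True
    marked = [span.name]
    node = span.parent
    while node is not None:
        if not node.ended:
            node.sampled = True
            marked.append(node.name)
        node = node.parent
    return marked


# Example: the root is still open, an intermediate async phase has already ended.
root = Span("rest_request")
phase = Span("query_phase", parent=root, ended=True)
shard = Span("shard_query", parent=phase)
captured = mark_outlier(shard)
```

In this example only `shard_query` and `rest_request` end up marked; the already-ended `query_phase` span is lost, which is the gap in the trace that the async objection raised later in the thread points at.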
@nishchay21 this sounds very complicated to me and I honestly don't understand why we need to build that:
This decision could only be made upon completion and will not work with async scenarios where the parent span may have ended well before the child (the timing will be reconciled at query time). OpenSearch uses async heavily everywhere.
Tracing is supposed to be optional instrumentation, but it will now leak everywhere: responses, requests (we need to understand which node the span comes from, the node may die meanwhile, losing the most important part of the trace completely, ...). We are struggling with instrumentation at the moment (it is very difficult to do right), yet we are looking for a complex solution to problems we don't have (to me), building a tracing framework on top of another tracing framework. |
Have had an offline discussion with @Gaganjuneja and @nishchay21, here are the conclusions we ended up with:
|
Hi @reta, thank you for the offline discussion. For point 2, I have done a POC around detecting the outlier spans within the OTel plugin itself, without polluting the OpenSearch core. Will get back to you with the details on this soon. |
Hi @reta, So here is what we plan to do for the detection of an outlier span :
Note: This way we will not pollute the OpenSearch core itself and will have a minimal implementation within the plugin. Also, just to add, the memory consumption of this implementation will be equivalent to the memory consumption if we enable tail sampling. |
I doubt that is going to work (since the initial trace could be initiated well outside the context of OpenSearch itself), but we have discussed that already (I hope I am missing something). Looking forward to seeing the implementation, thank you. |
@reta So, if the initial traces are generated outside the context of OpenSearch, then we respect the client sampling decision itself and will not override it with our decision [which holds true today as well]. This will act as a new sampler within the core itself, and the feature would be applied only if the decision is taken by this sampler in core. |
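The precedence described in the last comment (an incoming client decision wins, otherwise the local threshold rule applies) might look like the sketch below. The function name `decide_capture` and its parameters are hypothetical, and this is not the OpenTelemetry sampler interface: a standard head sampler is consulted when a span starts, whereas this decision needs the latency and so can only be made once the span has ended.

```python
from typing import Optional


def decide_capture(
    client_sampled: Optional[bool],
    latency_ms: float,
    threshold_ms: float,
) -> bool:
    """Decide whether a finished span should be exported.

    client_sampled is the sampling decision carried by an incoming trace
    context, or None when the trace originates inside OpenSearch.
    """
    if client_sampled is not None:
        # Trace started outside OpenSearch: respect the client's decision as-is.
        return client_sampled
    # Trace started locally: capture only when the span breached its threshold.
    return latency_ms > threshold_ms
```

For example, `decide_capture(False, 1200.0, 300.0)` stays unsampled even though the span is slow, because the client already opted out, while `decide_capture(None, 1200.0, 300.0)` is captured.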
Is your feature request related to a problem? Please describe
With the introduction of the Request Tracing Framework (RTF) using OpenTelemetry (OTel), requests can be traced to identify the code paths/modules which take more time to execute. While this solves the intended problem, enabling tracing for all requests has an overhead with respect to additional computation/CPU and memory. The recommended solution to reduce the overhead is to sample the requests. OTel supports two types of sampling techniques: i) head based probabilistic sampling and ii) tail sampling.
While tail sampling is good, it still requires the additional computation and memory, as the optimization only applies to the network data transfer. Head based probabilistic sampling may not capture the requests with high latency if their occurrence is rare (1 out of 10K requests).
This leaves us with two options: either sample everything at additional cost, or not sample at all.
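To make the head-sampling concern concrete, here is a small back-of-the-envelope calculation. The 1-in-10K outlier rate comes from the text above; the 1% sampling ratio and the one million requests are assumed values chosen only for illustration, and the model assumes each request is sampled independently of its latency.

```python
def expected_outliers_captured(
    total_requests: int, outlier_rate: float, sampling_ratio: float
) -> float:
    """Expected number of slow requests that head sampling happens to keep."""
    return total_requests * outlier_rate * sampling_ratio


def prob_at_least_one_captured(
    total_requests: int, outlier_rate: float, sampling_ratio: float
) -> float:
    """Probability that at least one slow request is both slow and sampled."""
    p_hit = outlier_rate * sampling_ratio
    return 1.0 - (1.0 - p_hit) ** total_requests


# Assumed workload: 1M requests, 1 in 10K is slow, 1% head sampling.
expected = expected_outliers_captured(1_000_000, 1 / 10_000, 0.01)
chance = prob_at_least_one_captured(1_000_000, 1 / 10_000, 0.01)
```

Under these assumptions roughly 100 of the million requests are slow, but only about one of them is expected to be traced, and there is roughly a one-in-three chance that none is traced at all.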
Describe the solution you'd like
Threshold based trace capture: Given that neither head nor tail based sampling solves the problem efficiently, we need a new way to identify whether a particular request is important during request processing and then capture all the related spans/traces. This can be achieved by configuring thresholds for different spans, where a request is considered important only when one or more spans have breached the threshold configured for those span(s). For example, if we have a span on a particular piece of aggregation code, we can set up a threshold
search.aggregation.date_histogram.latency > 300ms
, then any request which takes more than 300ms can be captured. The threshold can be configured as a dynamic setting, similar to the way log levels are configured today. This is one thought process; I would like to get the community's inputs on this.
Related component
Other
Describe alternatives you've considered
i) Head based probabilistic sampling.
ii) Tail sampling.
Additional context
No response
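As a rough illustration of the threshold based capture proposed under "Describe the solution you'd like", the check could be sketched as below. Only the span name `search.aggregation.date_histogram` and the 300ms value come from the example in this issue; the dictionary layout and the helper names `breached_spans` and `should_capture` are assumptions, not an existing OpenSearch setting or API.

```python
from typing import Dict, List

# Hypothetical threshold table, keyed by span name, values in milliseconds.
THRESHOLDS_MS: Dict[str, float] = {
    "search.aggregation.date_histogram": 300.0,
}


def breached_spans(
    span_latencies_ms: Dict[str, float], thresholds_ms: Dict[str, float]
) -> List[str]:
    """Return the names of spans whose latency exceeded their configured threshold."""
    return [
        name
        for name, latency in span_latencies_ms.items()
        if name in thresholds_ms and latency > thresholds_ms[name]
    ]


def should_capture(
    span_latencies_ms: Dict[str, float], thresholds_ms: Dict[str, float]
) -> bool:
    """A request is considered important when one or more spans breached a threshold."""
    return len(breached_spans(span_latencies_ms, thresholds_ms)) > 0


# Example request: the date_histogram span took 450ms, the fetch span 20ms.
request_spans = {"search.aggregation.date_histogram": 450.0, "search.fetch": 20.0}
capture = should_capture(request_spans, THRESHOLDS_MS)
```

Spans without a configured threshold (such as the assumed `search.fetch` above) never trigger capture in this sketch, so the table can be updated at runtime to widen or narrow what counts as important.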