Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix updating the parent span #9450

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
Fix upadting the parent span
Signed-off-by: Gagan Juneja <[email protected]>
  • Loading branch information
Gagan Juneja committed Aug 20, 2023
commit 0cd65e800a59349d1122d6f9c46d77e31bcc8b74
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,11 @@ public SpanContext getCurrentSpan() {
private void endSpan(Span span) {
if (span != null) {
span.endSpan();
setCurrentSpanInContext(span.getParentSpan());
if (span.getParentSpan() != null && !span.getParentSpan().hasEnded()) {
Copy link
Collaborator

@reta reta Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gaganjuneja the thing you're trying to fix should never happen, I think we are managing context restoration incorrectly. If the execution is synchronous, thing are clear: the parent could not be finished before children, now the asynchronous case is different:

  • the parent span and child spans do not belong to the same thread
  • when the we start new span scope, we should capture the current span associated with the current thread (which could be not the same as the parent span)
  • when the span is ended, we have to restore the span that was active on that particular thread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reta for taking a look.
I had run a couple of tests and deep dive to understand the issue.

As of now Our context restoration in based on the fact that

* A ThreadContext is a map of string headers and a transient map of keyed objects that are associated with
Every opensearch thread is managed by a thread pool or executor being responsible for stashing and restoring the threads context. So ideally even we are setting the wrong/ended parent but it should be removed once the thread call is over and context is restored. One specific scenario where this is not happening is scheduled cluster monitoring actions.

Screenshot 2023-08-27 at 10 08 16 AM

Here if at Thread1 the context is not stashed the state will be persisted. This exactly happens with the NodeStatsTransportAction and possibly with other actions where startSpan through TransportService::sendRequest is creating the span and persisting with the current Management thread without stashing the context after the call is over.

We can fix this by adding the ThreadContext.stashContext() here https://github.com/opensearch-project/OpenSearch/pull/9415/files#diff-518cfddb0699fc99133f7a7dd5dc2224465c903405ae4aa9a8440abd03d1a686R868

We can fix this case by case. But now the question is should we add a check at the framework level like when we are starting the span then check if the current span is already ended then ignore that and create a new span without the current parent. Capturing the currentSpan while creating the span also wont help because most of the cases the context is being propagated to a caller thread from the calling thread.

Your thoughts on this please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be more cases from the ClusterApplierService.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking @Gaganjuneja , so indeed we don't manage context properly. So we need to make very clear distinction between spans and scopes. The OpenTelemetry models it differently but we don't: the spans could outlive the scopes, but the scopes provide this short-living context propagation (I will share the idea and will help you to deal with API changes):

            // Creates a Span that is not associated with the thread (it is not attached to anything)
            final Span httpRequestSpan = tracer.startSpan(SpanBuilder.from(httpRequest));
            // Attaches the span to current thread (scope)
            final (final SpanScope scope = tracer.spanInScope(httpRequestSpan)) {
                HttpChannel traceableHttpChannel = new TraceableHttpChannel(httpChannel, httpRequestSpanScope);
                handleIncomingRequest(httpRequest, traceableHttpChannel, httpRequest.getInboundException());
            } 
            // When scope is closed, the previously attached span is restored, but the current span is not closed

With this model (span + scope), we will be able to cover any async and reactive context propagation. On the side note, OpenTelemetry has pretty good debugging capabilities to help validating the spans / scopes lifecycle:

-Dio.opentelemetry.context.enableStrictContext=true

We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Scope should be closed and cleared out the current span from the current context. Span could be ended later on once the aync operation is completed.

-Dio.opentelemetry.context.enableStrictContext=true

Yes, we also took the idea of Strict check from here only.

We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.

MockXxx implementation we had written because OTEL is a plugin and we need to take the otel dependency inside the test framework and in future if multiple telemetry implementations are there then it would be difficult to do.

Let me also think about it and share the thoughts.

setCurrentSpanInContext(span.getParentSpan());
} else {
setCurrentSpanInContext(null);
}
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -90,4 +90,10 @@ public interface Span {
*/
String getSpanId();

/**
* Returns whether the span is ended or not.
* @return Span end Status.
*/
boolean hasEnded();

}
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,28 @@ public void testCreateSpanWithAttributesWithParentMock() {
verify(mockTracerContextStorage, never()).get(TracerContextStorage.CURRENT_SPAN);
}

public void testCreateSpaWithParentEndsBeforeChild() {
DefaultTracer defaultTracer = new DefaultTracer(mockTracingTelemetry, mockTracerContextStorage);
Attributes attributes = Attributes.create().addAttribute("name", "value");
when(mockTracingTelemetry.createSpan("span_name", mockParentSpan, attributes)).thenReturn(mockSpan);
when(mockParentSpan.hasEnded()).thenReturn(true);
SpanScope childSpanScope = defaultTracer.startSpan("span_name", new SpanContext(mockParentSpan), attributes);
childSpanScope.close();
verify(mockTracingTelemetry).createSpan("span_name", mockParentSpan, attributes);
verify(mockTracerContextStorage).put(TracerContextStorage.CURRENT_SPAN, null);
}

public void testCreateSpaWithParentEndsAfterChild() {
DefaultTracer defaultTracer = new DefaultTracer(mockTracingTelemetry, mockTracerContextStorage);
Attributes attributes = Attributes.create().addAttribute("name", "value");
when(mockTracingTelemetry.createSpan("span_name", mockParentSpan, attributes)).thenReturn(mockSpan);
when(mockParentSpan.hasEnded()).thenReturn(false);
SpanScope childSpanScope = defaultTracer.startSpan("span_name", new SpanContext(mockParentSpan), attributes);
childSpanScope.close();
verify(mockTracingTelemetry).createSpan("span_name", mockParentSpan, attributes);
verify(mockTracerContextStorage).put(TracerContextStorage.CURRENT_SPAN, mockParentSpan);
}

public void testCreateSpanWithAttributes() {
TracingTelemetry tracingTelemetry = new MockTracingTelemetry();
DefaultTracer defaultTracer = new DefaultTracer(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,11 @@ public String getSpanId() {
return delegateSpan.getSpanContext().getSpanId();
}

@Override
public boolean hasEnded() {
return !delegateSpan.isRecording();
}

io.opentelemetry.api.trace.Span getDelegateSpan() {
return delegateSpan;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,15 @@ public void testEndSpanTest() {
verify(mockSpan).end();
}

public void testHasEndSpanTest() {
Span mockSpan = getMockSpan();
when(mockSpan.isRecording()).thenReturn(false);
OTelSpan oTelSpan = new OTelSpan("spanName", mockSpan, null);
oTelSpan.endSpan();
assertTrue(oTelSpan.hasEnded());
verify(mockSpan).end();
}

public void testAddAttributeString() {
Span mockSpan = getMockSpan();
OTelSpan oTelSpan = new OTelSpan("spanName", mockSpan, null);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -141,10 +141,7 @@ public String getSpanId() {
return spanId;
}

/**
* Returns whether the span is ended or not.
* @return span end status.
*/
@Override
public boolean hasEnded() {
synchronized (lock) {
return hasEnded;
Expand Down