-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix updating the parent span #9450
Open
Gaganjuneja
wants to merge
3
commits into
opensearch-project:main
Choose a base branch
from
Gaganjuneja:main-telemetry-span-end-bug
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 1 commit
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
Fix upadting the parent span
Signed-off-by: Gagan Juneja <[email protected]>
- Loading branch information
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gaganjuneja the thing you're trying to fix should never happen, I think we are managing context restoration incorrectly. If the execution is synchronous, thing are clear: the parent could not be finished before children, now the asynchronous case is different:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @reta for taking a look.
I had run a couple of tests and deep dive to understand the issue.
As of now Our context restoration in based on the fact that
OpenSearch/server/src/main/java/org/opensearch/common/util/concurrent/ThreadContext.java
Line 74 in 8cfde6c
Every opensearch thread is managed by a thread pool or executor being responsible for stashing and restoring the threads context
. So ideally even we are setting the wrong/ended parent but it should be removed once the thread call is over and context is restored. One specific scenario where this is not happening is scheduled cluster monitoring actions.Here if at Thread1 the context is not stashed the state will be persisted. This exactly happens with the NodeStatsTransportAction and possibly with other actions where startSpan through TransportService::sendRequest is creating the span and persisting with the current Management thread without stashing the context after the call is over.
We can fix this by adding the ThreadContext.stashContext() here https://github.com/opensearch-project/OpenSearch/pull/9415/files#diff-518cfddb0699fc99133f7a7dd5dc2224465c903405ae4aa9a8440abd03d1a686R868
We can fix this case by case. But now the question is should we add a check at the framework level like when we are starting the span then check if the current span is already ended then ignore that and create a new span without the current parent. Capturing the currentSpan while creating the span also wont help because most of the cases the context is being propagated to a caller thread from the calling thread.
Your thoughts on this please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There could be more cases from the ClusterApplierService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking @Gaganjuneja , so indeed we don't manage context properly. So we need to make very clear distinction between spans and scopes. The
OpenTelemetry
models it differently but we don't: the spans could outlive the scopes, but the scopes provide this short-living context propagation (I will share the idea and will help you to deal with API changes):With this model (span + scope), we will be able to cover any async and reactive context propagation. On the side note, OpenTelemetry has pretty good debugging capabilities to help validating the spans / scopes lifecycle:
We are building our MockXxx implementation but I think we should stop doing it and use OpenTelemetry in tests instead, it will save us tons of time and efforts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Scope should be closed and cleared out the current span from the current context. Span could be ended later on once the aync operation is completed.
Yes, we also took the idea of Strict check from here only.
MockXxx implementation we had written because OTEL is a plugin and we need to take the otel dependency inside the test framework and in future if multiple telemetry implementations are there then it would be difficult to do.
Let me also think about it and share the thoughts.