-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC4J-611 Add OpenTelemetry tracing to dfsclient #716
Conversation
Jira Issue: https://hpccsystems.atlassian.net/browse/HPCC4J-611 Jirabot Action Result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpmcmu provided some high level feedback, let's discuss offline
@@ -294,6 +311,7 @@ private boolean retryRead() | |||
} | |||
catch (Exception e) | |||
{ | |||
this.readSpan.end(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This situation seems important enough to report as an exception on the span, and if we can determine the readSpan has failed at this point, we should set the span status to fail.
@@ -344,6 +391,13 @@ public RowServiceInputStream(DataPartition dp, FieldDef rd, FieldDef pRd, int co | |||
|
|||
this.dataPart = dp; | |||
|
|||
this.readSpan = readSpan; | |||
if (readSpan != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also test if readspan is valid
this.readSpan = readSpan; | ||
if (readSpan != null) | ||
{ | ||
this.readSpanTraceID = "00-" + readSpan.getSpanContext().getTraceId() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually the trace context and is propagated via a "traceparent" header, let's refactor this.readSpanTraceID
{ | ||
this.prefetchException = e; | ||
|
||
if (readSpan != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and is valid
|
||
if (readSpan != null) | ||
{ | ||
Attributes attributes = Attributes.of( ServerAttributes.SERVER_ADDRESS, getIP(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the server address, etc attributes only added in exception handlers?
@@ -1986,7 +2128,8 @@ private void makeFetchObject(StringBuilder sb) | |||
|
|||
private String makeGetVersionRequest() | |||
{ | |||
final String versionMsg = RFCCodes.RFCStreamReadCmd + "{ \"command\" : \"version\", \"handle\": \"-1\", \"format\": \"binary\" }"; | |||
final String trace = readSpan != null ? "\"_trace/traceparent\" : \"" + readSpanTraceID + "\",\n" : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to send the traceparent is the span isn't valid
return GlobalOpenTelemetry.get().getTracer("DFSClient"); | ||
} | ||
|
||
public static Span createRootSpan(String name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure DFUClient should ever create a root span, since this is a client library we should inherit the trace from the caller
HpccFileException wrappedException = new HpccFileException("Failed on initial remote read read trans", e); | ||
if (writeSpan != null) | ||
{ | ||
Attributes attributes = Attributes.of(ServerAttributes.SERVER_ADDRESS, this.rowServiceIP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this seems to be repeated in a few places, why not add a function in utils to handle the attributes annotation in a single place. Also, why aren't we annotating the span as soon as it's created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to cleanup how I am handling adding events / exceptions in the next PR, so I was holding off on cleaning up how attributes get added for those events. However, those attributes should have been added to the span when it was originally created. I will do that as part of this PR.
return versionMsg; | ||
} | ||
|
||
private void makeInitialWriteRequest() throws Exception | ||
{ | ||
String jsonRecordDef = RecordDefinitionTranslator.toJsonRecord(this.recordDef).toString(); | ||
|
||
final String trace = writeSpan != null ? "\"_trace/traceparent\" : \"" + writeSpanTraceID + "\",\n" : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might not want to report traceparents unless we know it's a valid traceparent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I added a check during construction, such that writeSpan will only be set if its context is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpmcmu looks fine, I did leave a couple of comments we need to resolve.
I don't want to delay this pr anymore than needed, but we have to confirm the traceparent is being propagated in the dafilserv requests (perhaps simple junit test cases?)
Also, since you're creating new trace/spans, we need to exercise that logic as well.
Can you capture log output showing the toplevel trace/span, and at least one of the child spans and confirm the child inherits the trace id from the parent span?
@rpastrana Still looking into unit tests to capture the trace IDs, but here is example log output for a read: |
- Added tracing support to read pathways - Added tracing support to write pathways Signed-off-by: James McMullan [email protected]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpmcmu approved
@@ -287,16 +374,19 @@ private static class RowServiceResponse | |||
|
|||
private String makeGetVersionRequest() | |||
{ | |||
final String versionMsg = RFCCodes.RFCStreamReadCmd + "{ \"command\" : \"version\", \"handle\": \"-1\", \"format\": \"binary\" }"; | |||
final String trace = traceContextHeader != null ? "\"_trace/traceparent\" : \"" + traceContextHeader + "\",\n" : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith FYI all HPCC4J requests into DAFILESrv will contain a new header with OTel trace context. We've confirmed on our end that the header does not seem to break the server side, but we'd like to get your thoughts. Also, we're opening a jira requesting that dafilesrv at minimum consume this info and propagate it, and optimally add helpful instrumentation based on jtrace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpmcmu - is a backslash in an JSON identifier valid? Regardless it looks a little odd, why not just "traceparent" or if want a separate node, generate '"_trace" : { "traceparent": traceContextHeader }' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith it's supposed to be
"_trace": { "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00" } ,
as you point out. Once fixed, will that layout work for your side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I would make it :
"trace": { "parent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00" }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The leading underscore comes from the roxie protocol, I vote we keep it for symetry unless it breaks some sort of convention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right ok, and "traceparent" is also part of that protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it was recently added: https://hpccsystems.atlassian.net/issues/HPCC-31685
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. So bar the fix to the format, it seems fine.
@@ -513,8 +662,11 @@ public void write(byte[] b) throws IOException | |||
*/ | |||
public void write(byte[] b, int off, int len) throws IOException | |||
{ | |||
String request = "{ \"format\" : \"binary\", \"handle\" : \"" + this.handle + "\"" | |||
+ (useOldProtocol ? "" : ", \"command\" : \"continue\"") + " }"; | |||
final String trace = traceContextHeader != null ? "\"_trace/traceparent\" : \"" + traceContextHeader + "\",\n" : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any value in sending this every request ?
Will it always match the one sent in makeInitialWriteRequest ? If so, dafilesrv can maintain what it saw and reuse, rather than re-read etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most impactful value in reporting it is validation of expected parent trace. The server side can choose to drop it and ignore, how concerned are we about the potentially superfluous header in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the expectation is that it should match the initial request value? If it does not, should it ignore it and warn, or use the new value. Is it likely to change, or is it impossible?
In general, the dafilesrv protocol does not expect clients to resend info. that has been encompassed in the initial request, e.g. the "node" type and format etc. - the "handle" on all subsequent requests represents the reference to established stream info.
If this is a field value that might change, then I'm not particularly concerned with the extra expense of the comm traffic, assuming the client is dealing with reasonable sized chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it comes down to how much overhead we're incurring vs how much it could help us track a problematic activity...
The overhead seems minimal, but it also doesn't seem particularly helpful.
If the client/server ever get out of synch (I don't know how likely this scenario is), could the traceparent help us track the problem (ie, the out of synch request would be tracked under a known trace)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not particularly fussed either way, but it is out of character with other info that is sent once.
And, no, I don't think it will be useful if the client gets out of sync. If that happens, the handle (which is the key reference), will be out of sync, the server (dafilesrv) will reject it (challenge response) and the client should reinitiate the connection, along with sending "_trace" info.
Signed-off-by: James McMullan [email protected]
Type of change:
Checklist:
Testing: