Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC4J-611 Add OpenTelemetry tracing to dfsclient #722

Merged
merged 1 commit into from
Jul 19, 2024

Conversation

jpmcmu
Copy link
Contributor

@jpmcmu jpmcmu commented Jul 10, 2024

  • Updated event attribute reporting
  • Fixed trace structure in dfsclient protocol
  • Added tracing to FileUtility

Signed-off-by: James McMullan [email protected]

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).

Checklist:

  • I have created a corresponding JIRA ticket for this submission
  • My code follows the code style of this project.
    • I have applied the Eclipse code-format template provided.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the HPCC Systems CONTRIBUTORS document (https://github.com/hpcc-systems/HPCC-Platform/wiki/Guide-for-contributors).
  • The change has been fully tested:
    • This change does not cause any existing JUnits to fail.
    • I have include JUnit coverage to test this change
    • I have performed system test and covered possible regressions and side effects.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • This change fixes the problem, not just the symptom

Testing:

@jpmcmu jpmcmu requested a review from rpastrana July 10, 2024 13:29
Copy link

Jira Issue: https://hpccsystems.atlassian.net/browse/HPCC4J-611

Jirabot Action Result:
Workflow Transition: Merge Pending
Additional PR: #722

@jpmcmu
Copy link
Contributor Author

jpmcmu commented Jul 16, 2024

@rpastrana Wrapped up code review changes, please review

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu looks good, but left you a few suggestions. Let's discuss.

@@ -129,6 +161,15 @@ public boolean hasOperation()

public void startOperation(String operationName)
{
if (taskSpan != null)
{
taskSpan = Utils.createChildSpan(taskSpan, operationName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it makes sense to have a single Utils.createSpan(parentSpan, spanName)? If parentSpan != null, it sets it as the parent

}

taskSpan.end();

long totalOperationTime = System.nanoTime();
totalOperationTime -= operationStart;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of these context values seem worth reporting in the span

@@ -1095,7 +1149,8 @@ private static void performRead(String[] args, TaskContext context)
for (int i = 0; i < datasets.length; i++)
{
String datasetName = datasets[i];
context.startOperation("Read " + datasetName);
context.startOperation("FileUtility.Read_" + datasetName);
context.setTaskSpanAttributes(Attributes.of(AttributeKey.stringKey("server.url"), connString));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to add individual attributes after this initial set attributes call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we expose addTaskSpanAttribute(Attribute)?

context.startOperation("Read Test " + datasetName);
context.startOperation("FileUtility.ReadTest_" + datasetName);

context.setTaskSpanAttributes(Attributes.of(AttributeKey.stringKey("server.url"), connString));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other attributes we might want to report to help track issues?

@@ -1463,7 +1521,9 @@ private static void performCopy(String[] args, TaskContext context)
String srcFile = copyPairs[i];
String destFile = copyPairs[i+1];

context.startOperation("Copy " + srcFile + " -> " + destFile);
context.startOperation("FileUtility.Copy_ " + srcFile + " -> " + destFile);
context.setTaskSpanAttributes(Attributes.of(AttributeKey.stringKey("server.src.url"), srcURL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the context.staroperation call add any relevant attributes to the span?


private static FileWriteContext constructReadContext(FieldDef recordDef, CompressionAlgorithm fileCompression, int connectTimeoutMs, int socketOpTimeoutMs)
{
FileWriteContext context = new FileWriteContext();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do all/some of these end up tracked in a span?

}

public static Span createSpan(String name)
{
return Utils.getTelemetryTracer().spanBuilder(name)
Span span = Utils.getTelemetryTracer().spanBuilder(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine the two functions?
Span createSpan(String name, Span parent)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we combining these 2?

@jpmcmu
Copy link
Contributor Author

jpmcmu commented Jul 19, 2024

@rpastrana please review

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu reviwed

public AtomicLong bytesRead = new AtomicLong(0);
public AtomicLong bytesWritten = new AtomicLong(0);

public Span taskSpan = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the operation span?

private Stack<TaskOperation> operations = new Stack<TaskOperation>();
public List<JSONObject> operationResults = new ArrayList<JSONObject>();

public void setTaskSpanAttributes(Attributes attributes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setCurrentOperationSpanAttributes?

}
}

public void addWarn(String warn)
{
synchronized(warnMessages)
if (operations.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal, but since you created the hasOperation function, why not use it here?

{
return;
}
TaskOperation op = operations.peek();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you created the getOperation() function, why not use it here and in all the TaskContext methods

private static class TaskOperation
{
public String currentOperationDesc = "";
public long operationStart = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but the var name is somewhat ambiguous, operationStartTimeNS ? or startTimeNS?

double writeBandwidth = (double) bytesWritten.get() / (1_000_000.0 * timeInSeconds);
results.put("Write Bandwidth", String.format("%.2f MB/s", writeBandwidth));

TaskOperation op = getOperation();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this makes sense, but could L291-294 be condensed into the following?

operationResults.add(endCurrentOperation());

@@ -659,6 +764,10 @@ private static void executeTasks(Runnable[] tasks, int numThreads) throws Except

public void run()
{
// Make the task span is current for the thread, otherwise spans created
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a typo in the comment

@@ -1095,7 +1149,8 @@ private static void performRead(String[] args, TaskContext context)
for (int i = 0; i < datasets.length; i++)
{
String datasetName = datasets[i];
context.startOperation("Read " + datasetName);
context.startOperation("FileUtility.Read_" + datasetName);
context.setTaskSpanAttributes(Attributes.of(AttributeKey.stringKey("server.url"), connString));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we expose addTaskSpanAttribute(Attribute)?

return true;
}

public TaskOperation getOperation()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if for nothing other than symmetry, addOperation? setOperation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want all new operations to go through startOperation / endOperation. So I don't want to expose an addOperation etc

}

public static Span createSpan(String name)
{
return Utils.getTelemetryTracer().spanBuilder(name)
Span span = Utils.getTelemetryTracer().spanBuilder(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we combining these 2?

@jpmcmu
Copy link
Contributor Author

jpmcmu commented Jul 19, 2024

Tracing results from Kibana, second screenshot is the list of events under the read span.
Screenshot 2024-07-19 at 2 22 27 PM
Screenshot 2024-07-19 at 2 22 50 PM

- Updated event attribute reporting
- Fixed trace structure in dfsclient protocol
- Added tracing to FileUtility
- Improved operation tracking in TaskContext
- Added FileReadContext & FileWriteContext to DFSClient APIs

Signed-off-by: James McMullan [email protected]
@jpmcmu
Copy link
Contributor Author

jpmcmu commented Jul 19, 2024

With ESP tracing enabled:
Screenshot 2024-07-19 at 4 17 02 PM

@rpastrana
Copy link
Member

Tracing results from Kibana, second screenshot is the list of events under the read span. Screenshot 2024-07-19 at 2 22 27 PM Screenshot 2024-07-19 at 2 22 50 PM

@jakesmith once we get dafilesrv instrumentation we'll be able to track these transactions end-to-end. We're able to deploy small, local Elastic stack based trace aggregator which would be very helpful in the instrumentation process. We can share that info if interested.

@rpastrana rpastrana merged commit a27f95b into hpcc-systems:candidate-9.6.x Jul 19, 2024
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants