fix: tritonfrontend gRPC Streaming Segmentation Fault #7671

Merged
KrishnanPrash merged 15 commits into main from kprashanth-grpc-trace on Oct 7, 2024

Conversation

KrishnanPrash
Contributor

What does the PR do?

Currently, the tritonfrontend bindings do not support tracing, because a nullptr is passed as the TraceManager object to the respective frontends. Because nothing checks that the TraceManager object is valid, any operation that assumes it is valid leads to a segmentation fault.

This PR wraps tracing operations in an if (trace_manager_obj) check to ensure the TraceManager is not nullptr before it is used.
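A minimal sketch of the guard, for illustration only. `DemoTraceManager`, `DemoStreamHandler`, and `Process` are hypothetical stand-ins, not the real Triton classes; only the `if (trace_manager_)` check mirrors the pattern described in this PR.

```cpp
#include <string>

// Hypothetical stand-in for a trace manager; not Triton's TraceManager API.
class DemoTraceManager {
 public:
  // Records the model name and counts how many traces were started.
  int StartTrace(const std::string& model_name) {
    last_model_ = model_name;
    return ++started_;
  }
  int started() const { return started_; }
  const std::string& last_model() const { return last_model_; }

 private:
  int started_ = 0;
  std::string last_model_;
};

// Hypothetical handler that may be constructed without a trace manager,
// which is the situation the bindings create by passing nullptr.
class DemoStreamHandler {
 public:
  explicit DemoStreamHandler(DemoTraceManager* trace_manager)
      : trace_manager_(trace_manager) {}

  // Returns true if the request was traced, false if tracing was skipped.
  bool Process(const std::string& model_name) {
    if (trace_manager_) {
      // The pointer is dereferenced only after it is known to be non-null.
      trace_manager_->StartTrace(model_name);
      return true;
    }
    return false;
  }

 private:
  DemoTraceManager* trace_manager_;  // may be nullptr
};
```

Without the check, `Process` would dereference a null pointer whenever the handler is built without a manager, which is undefined behavior and typically shows up as a segmentation fault.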

Additional Changes:

  1. Added test case test_streaming_inference() in server/qa/L0_python_api/test_kserve.py to catch errors in the CI pipeline.
  2. Added a beta tag to the tritonfrontend README.md.
  3. Refactored how testing_utils is imported and used to prevent shadowing based on this suggestion.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Test plan:

Added server/qa/L0_python_api/test_kserve.py::test_streaming_inference()

  • CI Pipeline ID: 18855908

KrishnanPrash added the bug label on Sep 29, 2024
KrishnanPrash self-assigned this on Sep 29, 2024

if (trace_manager_) {
  GrpcServerCarrier carrier(state->context_->ctx_.get());
  auto start_options =
      trace_manager_->GetTraceStartOptions(carrier, request.model_name());

Contributor

I see now. I think this is not the only potential place for a segfault. Do we want to fix it in other places as well, or are we targeting only the streaming case at the moment?

Contributor Author

Would be happy to add checks in other places as well. Could you provide an example where these checks would be needed?

Contributor

@KrishnanPrash Can you look for TRITON_ENABLE_TRACING blocks and break the logic out into a separate utility function in grpc_utils.h?
This function would take trace_manager_ as input. If trace_manager_ is nullptr, the logic is skipped; otherwise the same logic runs.
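A rough sketch of what such a helper could look like. Every name here (`DemoTrace`, `DemoTraceManager`, `SampleTrace`, `StartTraceIfAvailable`) is a hypothetical stand-in chosen for illustration; this is not the actual contents of grpc_utils.h or the TraceManager API.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; not Triton's real types.
struct DemoTrace {
  std::string model_name;
};

class DemoTraceManager {
 public:
  // Creates a trace object for the given model.
  std::shared_ptr<DemoTrace> SampleTrace(const std::string& model_name) {
    auto trace = std::make_shared<DemoTrace>();
    trace->model_name = model_name;
    return trace;
  }
};

// Hypothetical utility: takes the trace manager as input and skips the
// tracing logic entirely when the manager is nullptr.
inline std::shared_ptr<DemoTrace> StartTraceIfAvailable(
    DemoTraceManager* trace_manager, const std::string& model_name) {
  if (trace_manager == nullptr) {
    return nullptr;  // tracing skipped
  }
  return trace_manager->SampleTrace(model_name);
}
```

Centralizing the check this way means each call site only has to handle a possibly empty trace pointer, rather than repeating the null check around every tracing block.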

Contributor

I think this place (HandleGenerate) seems to be relevant:

server/src/http_server.cc

Lines 3235 to 3237 in dbb064f

TRITONSERVER_InferenceTrace* triton_trace = nullptr;
std::shared_ptr<TraceManager::Trace> trace =
    StartTrace(req, model_name, &triton_trace);

I can see that HandleInfer on HTTP is guarded:

server/src/http_server.cc

Lines 3599 to 3602 in dbb064f

if (trace_manager_) {
  // If tracing is enabled see if this request should be traced.
  trace = StartTrace(req, model_name, &triton_trace);
}

qq to @GuanLuo, I can see that you've added guards for HandleInfer (link above), was there a reason not to guard HandleGenerate?

Contributor

@KrishnanPrash, do you need to support trace update with this PR?

Contributor

> I can see that you've added guards for HandleInfer (link above), was there a reason not to guard HandleGenerate?

Probably just a missed spot.

Contributor

I see that in the HandleInfer() method the trace_manager_ check is not guarded by the tracing compile-time macro (#ifdef TRITON_ENABLE_TRACING). Should it be?

A secondary question: if tracing is compiled in, should trace_manager_ ever be null? Could we make that a precondition instead of a runtime check when tracing is enabled? (Just a question.)

Contributor Author

I don't believe trace_manager_ needs to be guarded with an #ifdef TRITON_ENABLE_TRACING compile-time macro, because the StartTrace() function has its own check for whether tracing is enabled.

As for the secondary question, if tracing is compiled in, trace_manager_ should technically never be null after being passed to the services. However, because the bindings do not yet support tracing, a hopefully temporary situation arises of tracing being enabled, but trace_manager_ being null.
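To make the distinction concrete, here is a small sketch of how a compile-time check and a runtime check can coexist. All types and method names (`DemoTrace`, `DemoTraceManager`, `DemoHttpHandler`, `SampleTrace`) are hypothetical, the bodies are assumptions rather than the real http_server.cc code, and the macro is defined locally only to emulate a tracing-enabled build.

```cpp
#include <memory>
#include <string>

// Defined here only to emulate a build with tracing compiled in.
#define TRITON_ENABLE_TRACING 1

// Hypothetical stand-ins for illustration; not Triton's real types.
struct DemoTrace {
  std::string model_name;
};

class DemoTraceManager {
 public:
  std::shared_ptr<DemoTrace> SampleTrace(const std::string& model_name) {
    auto trace = std::make_shared<DemoTrace>();
    trace->model_name = model_name;
    return trace;
  }
};

class DemoHttpHandler {
 public:
  explicit DemoHttpHandler(DemoTraceManager* trace_manager)
      : trace_manager_(trace_manager) {}

  std::shared_ptr<DemoTrace> HandleInfer(const std::string& model_name) {
    std::shared_ptr<DemoTrace> trace;
    // Runtime check: the manager can be null even in a tracing-enabled build.
    if (trace_manager_) {
      trace = StartTrace(model_name);
    }
    return trace;
  }

 private:
  std::shared_ptr<DemoTrace> StartTrace(const std::string& model_name) {
#ifdef TRITON_ENABLE_TRACING
    // Compile-time check: this branch exists only when tracing is built in.
    return trace_manager_->SampleTrace(model_name);
#else
    (void)model_name;
    return nullptr;
#endif
  }

  DemoTraceManager* trace_manager_;  // may be nullptr
};
```

The macro answers "was tracing compiled in?", while the pointer check answers "was a manager actually supplied?". In this sketch the two are independent, which is why the runtime check is not redundant in a tracing-enabled build.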

Contributor

Ah, so this is temporary? Once tracing support is added, can we remove the runtime check?

Contributor Author

We could remove the runtime check, but personally I am in favor of keeping these checks because they let us catch unexpected behavior earlier rather than later.

After tracing support is provided in tritonfrontend, we could probably modify these checks and return an error to fail earlier with a cleaner error message.
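One possible shape for such a fail-early check, sketched under assumptions: `DemoStatus`, `DemoTraceManager`, `CheckTracingAvailable`, and the `tracing_requested` flag are all hypothetical and do not correspond to existing Triton APIs or error types.

```cpp
#include <string>

// Hypothetical status type for illustration; Triton has its own error types.
struct DemoStatus {
  bool ok;
  std::string message;
};

// Hypothetical stand-in; only its presence or absence matters here.
class DemoTraceManager {};

// Returns an error when tracing is requested but no manager was supplied,
// instead of silently skipping the tracing logic.
inline DemoStatus CheckTracingAvailable(
    const DemoTraceManager* trace_manager, bool tracing_requested) {
  if (tracing_requested && trace_manager == nullptr) {
    return DemoStatus{
        false, "tracing was requested but no trace manager is configured"};
  }
  return DemoStatus{true, ""};
}
```

A caller could run this once at startup or at the top of a request handler and surface `message` to the user, so a misconfiguration is reported explicitly rather than as missing traces.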

@oandreeva-nv
Contributor

lgtm, I would also let @rmccorm4 take a look

    "INPUT0": input_text,
}

response = requests.post(url, json=data, stream=True)

Collaborator

@rmccorm4 Oct 5, 2024

comment for future: I'm not sure that stream=True here is meaningful -- what was your intention or expected behavior by setting it?

(Can revisit later though)

@rmccorm4
Collaborator

rmccorm4 commented Oct 5, 2024

LGTM - pending passing pipeline with latest changes (including L0_build_variants, L0_python_api, L0_trace, etc)

- std::shared_ptr<TraceManager::Trace> trace =
-     StartTrace(req, model_name, &triton_trace);
+ std::shared_ptr<TraceManager::Trace> trace;
+ if (trace_manager_) {

Contributor

Why not put this check into StartTrace()?

Contributor Author

Will make these changes as well, as part of the refactoring ticket [DLIS-7380].

KrishnanPrash merged commit b247eb5 into main on Oct 7, 2024
3 checks passed
KrishnanPrash deleted the kprashanth-grpc-trace branch on October 7, 2024 17:06
Labels
bug Something isn't working

6 participants