tensorRT runtime provider is not thread-safe #19275
Comments
Would it be possible for you to provide us a repro test case so we can look into it deeper from our end?
I can't provide a test case immediately. Our C++ code is wrapped by Go and I can't provide the source to everything. However, I could provide an ONNX model with a fitting image if that helps?
Do you know whether the whole model can be run by the TRT EP, or whether there are multiple partitions with some of them assigned to the CUDA EP or CPU?
Yes, please provide the ONNX model.
Yes, the model can be run by TRT. We first thought that this was a TensorRT problem and tested it against the TensorRT backend without onnxruntime code. I uploaded the ONNX model and a simple sample image for forwarding. This is a segmentation task and the output must not be all zeros. I check the output matrix with a countNonZero function and fail the test if it contains only zero values. Switch between the CPU and TensorRT runtime providers and you will see the different behavior. Important: this only occurs with concurrent forwarding.
Model forwarding info
Just tested this against the onnxruntime master code. The issue is still present. Please let me know if you need help reproducing it.
If you can provide a simple C++ program for us to repro, that would be great. BTW, could you help try the TRT EP provider option "trt_cuda_graph_enable" to enable CUDA graph, to see whether the concurrent issue still exists?
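For reference, a minimal sketch of setting "trt_cuda_graph_enable" through the C++ provider-options V2 API might look like the following. The option keys follow the TRT EP provider-options naming; the helper function name and the extra FP16 key are illustrative, not taken from this thread.

#include <onnxruntime_cxx_api.h>

// Illustrative helper: build SessionOptions with the TensorRT EP and
// CUDA graph capture enabled via the V2 provider options.
Ort::SessionOptions makeTrtSessionOptions() {
  Ort::SessionOptions so;
  OrtTensorRTProviderOptionsV2* trtOptions = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateTensorRTProviderOptions(&trtOptions));
  const char* keys[] = {"trt_cuda_graph_enable", "trt_fp16_enable"};
  const char* values[] = {"1", "1"};
  Ort::ThrowOnError(Ort::GetApi().UpdateTensorRTProviderOptions(trtOptions, keys, values, 2));
  so.AppendExecutionProvider_TensorRT_V2(*trtOptions);  // copies the options
  Ort::GetApi().ReleaseTensorRTProviderOptions(trtOptions);
  return so;
}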
To repro, can we feed all zeros of shape [1, 3, 80, 128] into the input? And if executed concurrently, the output matrix will be zero (when it should be nonzero). Is that the correct understanding? (Then we don't need the .png image you provided?)
The input image (input.png) is required. Load and resize it to 128x80, then apply normalization with mean and std (see the preprocessing sketch after the code below). If all input values are zero, the output will also be zero. Here is a part of our forward code:
void ONNXRuntime::forward(const vector<ModelMat>& inputs, vector<ModelMat>& outputs) {
  const int inputSize = inputs.size();
  const int outputSize = outputs.size();

  // Check arguments.
  if (inputSize != inputNames_.size()) {
    throw std::runtime_error(
      "ONNXRuntime::forward: expected " + to_string(inputNames_.size()) + " inputs, got " + to_string(inputSize)
    );
  } else if (outputSize != outputNames_.size()) {
    throw std::runtime_error(
      "ONNXRuntime::forward: expected " + to_string(outputNames_.size()) + " outputs, got " + to_string(outputSize)
    );
  }

  // Create the tensors.
  const vector<Ort::Value> inputTensors = createTensors_(inputs, inputTypes_);
  vector<Ort::Value> outputTensors = createTensors_(outputs, outputTypes_);

  // TODO: remove once tensorrt issue is fixed: https://github.com/microsoft/onnxruntime/issues/19275
  if (runMxRequired_) {
    unique_lock<std::mutex> lock(runMx_);
    session_->Run(runOpts_, inputNamesCstr_.data(), inputTensors.data(), inputSize, outputNamesCstr_.data(), outputTensors.data(), outputSize);
    return;
  }

  // Forward pass.
  // Mutex is not needed: https://github.com/microsoft/onnxruntime/issues/114
  session_->Run(runOpts_, inputNamesCstr_.data(), inputTensors.data(), inputSize, outputNamesCstr_.data(), outputTensors.data(), outputSize);
}
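And the preprocessing described above (resize to 128x80, mean/std normalization) could look roughly like this OpenCV-based sketch. The use of OpenCV, the function name, and the mean/std values are assumptions for illustration; the actual values were not given in this thread.

#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

// Hypothetical preprocessing sketch: resize to 128x80 and apply per-channel
// mean/std normalization. Mean and std values below are placeholders.
cv::Mat preprocessInput(const std::string& path) {
  cv::Mat img = cv::imread(path, cv::IMREAD_COLOR);
  cv::resize(img, img, cv::Size(128, 80), 0, 0, cv::INTER_AREA);
  img.convertTo(img, CV_32FC3, 1.0 / 255.0);  // scale to [0, 1]
  std::vector<cv::Mat> ch(3);
  cv::split(img, ch);
  const float mean[3]   = {0.485f, 0.456f, 0.406f};  // placeholder values
  const float stddev[3] = {0.229f, 0.224f, 0.225f};  // placeholder values
  for (int c = 0; c < 3; ++c) {
    ch[c] = (ch[c] - mean[c]) / stddev[c];
  }
  cv::merge(ch, img);
  return img;  // HWC float32; NCHW conversion happens when building the tensor
}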
Here is our crash testing code in Go:
func init() {
	App.AddCommand(&grumble.Command{
		Name: "crash",
		Help: "crash",
		Run:  runCrash,
		Args: func(a *grumble.Args) {
			a.String("src", "source image")
			a.String("npack", "model")
			a.String("cache", "cache")
		},
	})
}

func runCrash(ctx *grumble.Context) (err error) {
	cl := ctx.App.CloserOneWay()
	defer func() {
		cErr := cl.Close()
		if cErr != nil && err == nil {
			err = cErr
		}
	}()
	go cos.ListenForInterrupts(cl)

	// Our model options.
	lo := nlib.DefaultSemanticSegmentationOptions()
	lo.Backend = nlib.ModelBackend_ONNXRuntime_TensorRT
	lo.GPUID = 0
	lo.ConfThresh = 0.5
	lo.Workers = runtime.NumCPU()
	lo.MatPoolSize = 3 * runtime.NumCPU()
	lo.Interpolation = nlib.InterArea
	lo.ResizeMaskToInput = false
	lo.TRTEnableFP16 = true
	lo.TRTCacheDir = ctx.Args.String("cache")

	// Load the locate model.
	lp, err := npack.OpenFile(cl.CloserTwoWay(), ctx.Args.String("npack"))
	if err != nil {
		return
	}
	model, err := nlib.NewSemanticSegmentationModel(cl.CloserTwoWay(), lp, lo)
	if err != nil {
		return
	}

	src := nlib.NewMat()
	defer src.Free()
	err = src.ReadFromFile(ctx.Args.String("src"))
	if err != nil {
		return
	}
	src.SetReadOnly(true)

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mask := nlib.NewMat()
			defer mask.Free()
			for i := 0; i < 20; i++ {
				err := model.Segment(src, mask)
				if err != nil {
					panic(err)
				}
				c, err := mask.CountNonZero()
				if err != nil {
					panic(err)
				} else if c == 0 {
					panic(fmt.Errorf("zero"))
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("took:", time.Since(start))
	return
}
It'd help save us time if you could just provide us the input as either an ONNX TensorProto or a numpy array. Thanks!
Sorry, I missed @chilo-ms' reply. Will be back with more info soon.
We think we've identified the root cause. It's a bit complicated. We had previously fixed it but had to back out that fix due to another issue. Unfortunately, we weren't able to resume that work until now. We will now work on implementing a full solution and provide you an ETA.
@jywu-msft That commit fixes the issue. Thanks for working on a final fix :) @chilo-ms Thank you for the info. We always did a synchronous warmup straight after loading the model. It would be great if the thread-safety of the session Run were documented in the docs; last time I checked, I only found some issues talking about it. I started to prepare a small C++ file to reproduce this issue. Do you still need it, or is there a way to add this to the onnxruntime tests?
Given that InferenceSession::Run() is guaranteed to be thread-safe, meaning multiple threads can call this function concurrently, the TRT EP needs to handle concurrency carefully; if not, the following concurrency issue might happen:
- To perform inference concurrently in multiple streams, it is suggested to use one TRT execution context per stream. In the design of the TRT EP (without the per-thread-context implementation), if multiple threads call InferenceSession::Run() concurrently, the TRT execution context instance is shared by all the threads while each thread acquires a different stream from ORT. So the TRT EP ends up having one TRT execution context using multiple streams, which is not suggested. But since the whole compute_func() is protected by the lock, and if cudaStreamSynchronize() is enforced there, one TRT execution context per stream is guaranteed. Therefore, the TRT EP needs to call cudaStreamSynchronize() in compute_func(), i.e. wait until the stream has completed all operations, to prevent the concurrency issue in GitHub issue #19275.
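In plain CUDA/TensorRT terms, the pattern described above could be sketched as follows. This is an illustrative sketch of the idea only, not the actual TRT EP code; the names are made up, and it assumes input/output tensor addresses were already bound on the context.

#include <mutex>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

std::mutex contextMutex;  // protects the single shared TRT execution context

// Illustrative compute step: serialize access to the shared execution context
// and synchronize the stream before releasing the lock, so the context is
// effectively used by only one stream at a time.
bool computeOnSharedContext(nvinfer1::IExecutionContext& context, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(contextMutex);
  // Enqueue inference on the stream owned by the calling thread
  // (assumes tensor addresses were already set via setTensorAddress).
  if (!context.enqueueV3(stream)) {
    return false;
  }
  // Wait until the stream has completed all operations before another thread
  // reuses the same execution context with a different stream.
  return cudaStreamSynchronize(stream) == cudaSuccess;
}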
@chilo-ms That was fast. Just checked out the latest main branch and the issue is fixed now. Thank you very much!
Hi Everyone, |
The fix should have been in the 1.17 release. Check your tensorRT version; the newest version from NVIDIA has some problems.
Hi @r0l1, thanks for your suggestion. I am using TensorRT with onnxruntime-gpu 1.17.1 (per the docs, TensorRT 8.6 is compatible).
My understanding is that onnxruntime depends on these TensorRT libs, and since the patch landed in 1.17, 1.17.1 should be holding good.
Describe the issue
Doing concurrent session->Run calls without a mutex lock causes corrupt output matrices. The CPU and CUDA runtime providers work as expected. Using the tensorRT provider causes this issue.
To reproduce
Forward matrices concurrently to one session without a mutex lock. Use the tensorRT runtime provider. The output matrices are corrupt and invalid.
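A minimal, self-contained sketch of the pattern that triggers the issue might look like this. The model path, tensor names, input data, and thread count are placeholders, and error handling is trimmed; it is not the reporter's actual code.

#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt-concurrency-repro");
  Ort::SessionOptions so;
  OrtTensorRTProviderOptionsV2* trtOptions = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateTensorRTProviderOptions(&trtOptions));
  so.AppendExecutionProvider_TensorRT_V2(*trtOptions);
  Ort::Session session(env, "model.onnx", so);  // placeholder model path (Linux char* path)

  const char* inputNames[] = {"input"};    // placeholder tensor names
  const char* outputNames[] = {"output"};
  const std::array<int64_t, 4> shape{1, 3, 80, 128};

  auto worker = [&]() {
    std::vector<float> input(1 * 3 * 80 * 128, 0.5f);  // placeholder input data
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());
    for (int i = 0; i < 20; ++i) {
      // Multiple threads call Run() on the same session concurrently, no mutex.
      auto outputs = session.Run(Ort::RunOptions{nullptr},
                                 inputNames, &tensor, 1, outputNames, 1);
      // Inspect outputs[0] here; with the affected TRT EP versions the result
      // can come back all zeros even though it should be nonzero.
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < 8; ++t) {  // placeholder thread count
    threads.emplace_back(worker);
  }
  for (auto& th : threads) {
    th.join();
  }
  Ort::GetApi().ReleaseTensorRTProviderOptions(trtOptions);
  return 0;
}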
Urgency
This issue is a release blocker and we must downgrade to ONNXRuntime 1.12.1 with tensorRT 8.4.3.
Platform
Linux
OS Version
linux 6.7.0 kernel
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.16.3
ONNX Runtime API
C++
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 11.8.0 TensorRT 8.6.1.6