[Performance] TensorRT EP produces different inference results compared to CUDA/CPU #22354

Closed
kylekam opened this issue Oct 8, 2024 · 6 comments
Labels: ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to the TensorRT execution provider), performance (issues related to performance regressions)

Comments


kylekam commented Oct 8, 2024

Describe the issue

Inference results with YOLOv8 in C++ differ between the TensorRT EP and the CUDA EP. I'm unable to share all of the code because most of it is proprietary and it would require a lot of refactoring to make it viewable.

We use YOLOv8 as an object detector and results are satisfactory with the CUDA EP, but we would like to speed up inference using TensorRT.

#21457 may be related, but their issues lie with a newer version of TensorRT, not 8.6.1.6.

I've noticed slightly different results from the preprocessing steps when preprocessing on CPU vs. GPU, but nothing that should significantly affect the inference results. The output from YOLOv8 using TensorRT produces bounding boxes that are believable, but the confidence scores are terrible.

To reproduce

Code snippet to initialize the environment.

  // ORT Environment
  std::string instanceName{ "Image classifier" };
  auto ortEnv = std::make_unique<Ort::Env>(OrtLoggingLevel::ORT_LOGGING_LEVEL_WARNING, instanceName.c_str());
  Ort::SessionOptions sessionOptions;

  // Attempt to add to GPU
  int deviceCount;
  cudaGetDeviceCount(&deviceCount);
  const auto& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* tensorrt_options = nullptr;
  try
  {
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));
    std::vector<const char*> keys
    {
      "device_id",
      "trt_max_workspace_size",
      "trt_engine_cache_enable",
      "trt_engine_cache_path",
      "trt_timing_cache_enable",
      "trt_layer_norm_fp32_fallback",
      "trt_fp16_enable",
    };
    std::vector<const char*> values
    {
      "0",
      "24696061952",
      "1",
      "../models",
      "1",
      "1",
      "0",
    };
    Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(tensorrt_options, keys.data(), values.data(), keys.size()));
    
    // Try setting lower priority for CUDA streams
    int priority_high;
    int priority_low;
    cudaStream_t stream_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&stream_low, cudaStreamNonBlocking, priority_low);

    tensorrt_options->user_compute_stream = stream_low;
    tensorrt_options->has_user_compute_stream = true;
  }
  catch (Ort::Exception& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failing to update ORT Tensor RT provider options: %s ", e.what());
    return nullptr;
  }
  sessionOptions.AppendExecutionProvider_TensorRT_V2(*tensorrt_options);
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

  // Initialize session
  try
  {
    mSession = std::make_unique<Ort::Session>(*ortEnv, _modelFilePath.c_str(), sessionOptions);
  }
  catch (std::bad_alloc& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failed to initialize ORT session: %s ", e.what());
    return nullptr;
  }

  try
  {
    mIOBinding = std::make_unique<Ort::IoBinding>(*mSession);
  }
  catch (std::bad_alloc& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failed to initialize IOBinding session: %s ", e.what());
    return nullptr;
  }

Snippet to run inference.

mIOBinding->BindInput(mInputNameString.c_str(), *inputTensors);
mIOBinding->BindOutput(mOutputNameString.c_str(), *outputTensors);
mSession->Run(Ort::RunOptions{ nullptr }, *mIOBinding);

Urgency

No response

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.2, CUDNN 8.9.2, TensorRT 8.6.1.6

Model File

No response

Is this a quantized model?

No

kylekam added the performance label Oct 8, 2024
github-actions bot added the ep:CUDA and ep:TensorRT labels Oct 8, 2024
tianleiwu (Contributor) commented

See #21345 (comment) for TRT fp16 accuracy.

kylekam (Author) commented Oct 10, 2024

This issue is with FP32, but I appreciate your response. I'll see if there's anything I can use in that thread.

tianleiwu (Contributor) commented

You can try the TensorRT tool Polygraphy to investigate the precision issue on the TensorRT side.

See some examples: https://github.com/NVIDIA/TensorRT/tree/release/10.4/tools/Polygraphy/examples/cli/run/01_comparing_frameworks
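
For instance, a minimal comparison between ONNX Runtime and TensorRT along the lines of the linked examples (yolov8.onnx is a placeholder for your exported model, and the tolerances are just illustrative) could look like:

  polygraphy run yolov8.onnx --trt --onnxrt --atol 1e-3 --rtol 1e-3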

chilo-ms (Contributor) commented

@kylekam
Did you try trtexec to see whether the output's confidence scores are terrible as well?
If yes, then the issue comes from TRT itself. Otherwise, the issue might come from the "old" TRT EP.
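
For reference, a standalone TensorRT baseline that builds an engine and dumps its outputs (yolov8.onnx again being a placeholder model name) could be run with something like:

  trtexec --onnx=yolov8.onnx --dumpOutput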

Since you can't share the model, it's hard for us to investigate, but there are still several things you can try with the TRT EP (a short sketch of the first two items follows this list):

  • Set the environment variable NVIDIA_TF32_OVERRIDE=0.
  • Disable graph optimization: sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);
  • In TRT 8.6, even when the model is FP32, TRT may still build some layers in lower precision if that improves performance. TRT 10 adds a flag to build a strongly typed network, meaning TRT sticks to the precisions specified in the ONNX model.
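
A minimal C++ sketch of the first two items, assuming the same sessionOptions object as in the snippet above (where exactly the environment variable is set is an assumption; it just needs to be in effect before the TRT engine is built):

  // Disable TF32 so TRT does not silently run FP32 GEMMs/convolutions in TF32.
  // _putenv is the Windows CRT call (requires <cstdlib>); on Linux this would be
  // setenv("NVIDIA_TF32_OVERRIDE", "0", 1).
  _putenv("NVIDIA_TF32_OVERRIDE=0");

  // Rule out ORT graph optimizations as a source of numerical differences.
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);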

chilo-ms (Contributor) commented

Just wondering, could you use TRT 10 instead of the old TRT 8.6?
Even if we later root-cause the issue, it would be hard for us to fix it for the combination of ORT 1.15.1 + TRT 8.6.

kylekam (Author) commented Oct 15, 2024

After some investigation, it turns out the issue came from a preprocessing step, so it wasn't with TensorRT itself.

kylekam closed this as completed Oct 15, 2024