[Performance] TensorRT EP produces different inference results compared to CUDA/CPU #22354

Closed
kylekam opened this issue Oct 8, 2024 · 6 comments
Labels: ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to the TensorRT execution provider), performance (issues related to performance regressions)

Comments


kylekam commented Oct 8, 2024

Describe the issue

Inference results with YOLOv8 in C++ differ between the TensorRT EP and the CUDA EP. I'm unable to share all of the code because most of it is proprietary and it would require a lot of refactoring to make it viewable.

We use YOLOv8 as an object detector and results are satisfactory with the CUDA EP, but we would like to speed up inference using TensorRT.

#21457 may be related, but their issues lie with a newer version of TensorRT, not 8.6.1.6.

I've noticed slightly different results from the preprocessing steps when preprocessing on CPU vs. GPU, but nothing that should significantly affect the inference results. The output from YOLOv8 using TensorRT produces bounding boxes that are believable, but the confidence scores are terrible.

To reproduce

Code snippet to initialize the environment.

  // ORT Environment
  std::string instanceName{ "Image classifier" };
  auto ortEnv = std::make_unique<Ort::Env>(OrtLoggingLevel::ORT_LOGGING_LEVEL_WARNING, instanceName.c_str());
  Ort::SessionOptions sessionOptions;

  // Attempt to add to GPU
  int deviceCount;
  cudaGetDeviceCount(&deviceCount);
  const auto& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* tensorrt_options = nullptr;
  try
  {
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));
    std::vector<const char*> keys
    {
      "device_id",
      "trt_max_workspace_size",
      "trt_engine_cache_enable",
      "trt_engine_cache_path",
      "trt_timing_cache_enable",
      "trt_layer_norm_fp32_fallback",
      "trt_fp16_enable",
    };
    std::vector<const char*> values
    {
      "0",
      "24696061952",
      "1",
      "../models",
      "1",
      "1",
      "0",
    };
    Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(tensorrt_options, keys.data(), values.data(), keys.size()));
    
    // Try setting lower priority for CUDA streams
    int priority_high;
    int priority_low;
    cudaStream_t stream_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&stream_low, cudaStreamNonBlocking, priority_low);

    tensorrt_options->user_compute_stream = stream_low;
    tensorrt_options->has_user_compute_stream = true;
  }
  catch (Ort::Exception& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failing to update ORT Tensor RT provider options: %s ", e.what());
    return nullptr;
  }
  sessionOptions.AppendExecutionProvider_TensorRT_V2(*tensorrt_options);
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

  // Initialize session
  try
  {
    mSession = std::make_unique<Ort::Session>(*ortEnv, _modelFilePath.c_str(), sessionOptions);
  }
  catch (std::bad_alloc& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failed to initialize ORT session: %s ", e.what());
    return nullptr;
  }

  try
  {
    mIOBinding = std::make_unique<Ort::IoBinding>(*mSession);
  }
  catch (std::bad_alloc& e)
  {
    TrdDiags::Instance().error(__FUNCTION__, "Failed to initialize IOBinding session: %s ", e.what());
    return nullptr;
  }

Snippet to run inference.

mIOBinding->BindInput(mInputNameString.c_str(), *inputTensors);
mIOBinding->BindOutput(mOutputNameString.c_str(), *outputTensors);
mSession->Run(Ort::RunOptions{ nullptr }, *mIOBinding);

Urgency

No response

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.2, CUDNN 8.9.2, TensorRT 8.6.1.6

Model File

No response

Is this a quantized model?

No

kylekam added the performance label Oct 8, 2024
github-actions bot added the ep:CUDA and ep:TensorRT labels Oct 8, 2024
tianleiwu (Contributor) commented

See #21345 (comment) for TRT fp16 accuracy.

kylekam (Author) commented Oct 10, 2024

This issue is with FP32, but I appreciate your response. I'll see if there's anything I can use in that thread.

tianleiwu (Contributor) commented

You can try the TensorRT tool Polygraphy to investigate the precision issue on the TensorRT side.

See some examples: https://github.com/NVIDIA/TensorRT/tree/release/10.4/tools/Polygraphy/examples/cli/run/01_comparing_frameworks
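
For instance, a minimal comparison between ONNX Runtime and TensorRT along the lines of the linked examples (yolov8.onnx is a placeholder for your exported model, and the tolerances are just illustrative) could look like:

  polygraphy run yolov8.onnx --trt --onnxrt --atol 1e-3 --rtol 1e-3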

chilo-ms (Contributor) commented

@kylekam
Did you try trtexec to see whether the output's confidence scores are terrible as well?
If yes, then the issue comes from TRT itself. Otherwise, the issue might come from the "old" TRT EP.
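
For reference, a standalone TensorRT baseline that builds an engine and dumps its outputs (yolov8.onnx again being a placeholder model name) could be run with something like:

  trtexec --onnx=yolov8.onnx --dumpOutput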

Since you can't share the model, it's hard for us to investigate, but there are still several things you can try with the TRT EP (a short sketch of the first two items follows this list):

  • Set the environment variable NVIDIA_TF32_OVERRIDE=0.
  • Disable graph optimization: sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);
  • In TRT 8.6, even when the model is FP32, TRT may still build some layers in lower precision if that improves performance. TRT 10 adds a flag to build a strongly typed network, meaning TRT sticks to the precisions specified in the ONNX model.
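
A minimal C++ sketch of the first two items, assuming the same sessionOptions object as in the snippet above (where exactly the environment variable is set is an assumption; it just needs to be in effect before the TRT engine is built):

  // Disable TF32 so TRT does not silently run FP32 GEMMs/convolutions in TF32.
  // _putenv is the Windows CRT call (requires <cstdlib>); on Linux this would be
  // setenv("NVIDIA_TF32_OVERRIDE", "0", 1).
  _putenv("NVIDIA_TF32_OVERRIDE=0");

  // Rule out ORT graph optimizations as a source of numerical differences.
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);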

chilo-ms (Contributor) commented

Just wondering, could you use TRT 10 instead of the old TRT 8.6?
Even if we later root-cause the issue, it would be hard for us to fix it for the combination of ORT 1.15.1 + TRT 8.6.

kylekam (Author) commented Oct 15, 2024

After some investigation, it turns out the issue came from a preprocessing step, so it wasn't with TensorRT itself.

kylekam closed this as completed Oct 15, 2024