
[Performance] Abnormal latencies on certain tasks and a GPU on standby. #17720

Open
Manutea opened this issue Sep 27, 2023 · 4 comments
Labels: ep:CUDA (issues related to the CUDA execution provider), stale (issues that have not been addressed in a while; categorized by a bot)

Comments


Manutea commented Sep 27, 2023

Describe the issue

Some inferences (around task IDs ~250, 450, 800 and 1700) cost much more than the others. During these slow iterations the GPU appears to do nothing and sits idle. I see the same problem on P100 and RTX 8000 GPUs, and with both the AlexNet and GoogLeNet models.

Perhaps this is related to this discussion? #14023

[Screenshots: per-iteration latency plots showing the abnormal spikes]

I also see these stalls with onnxruntime_perf_test, using this command:
./onnxruntime_perf_test -I -S 1 -e cuda -r 2048 -p profile.json -s /data/model/googlenet/dynamic_batch_googlenet_opt.onnx

To reproduce

#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <numeric>
#include <string>
#include <vector>

void onnx_benchmark_GPU(std::string &modelPath, std::string &inputTensorName, std::string &outputTensorName, int deviceId, int batch)
{
  std::vector<float> image(batch * 3 * 224 * 224,150);
  std::vector<int64_t> inputDims = {batch, 3, 224, 224};
  std::vector<int64_t> outputDims = {batch, 1000};
  
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "InferenceGPU");
  Ort::SessionOptions sessionOptions;

  //sessionOptions.SetGraphOptimizationLevel(ORT_DISABLE_ALL);
  sessionOptions.EnableProfiling("gpu_profile_file");
  OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, deviceId);
  if (status != nullptr) {
    printf("Provider error\n");
    exit(EXIT_FAILURE);
  }

  Ort::Session session(env, modelPath.c_str(), sessionOptions);
  Ort::MemoryInfo infoCuda("Cuda", OrtAllocatorType::OrtArenaAllocator, deviceId, OrtMemTypeDefault);
  Ort::Allocator cudaAllocator(session, infoCuda);

  int num_iterations = 2048;
  Ort::IoBinding binding(session);
  std::ofstream f("mesureonnx.txt");
  for (int i = 0; i < num_iterations; i++)
  {
    auto input = cudaAllocator.GetAllocation(image.size() * sizeof(float));
    cudaMemcpy(input.get(), image.data(), sizeof(float) * image.size(), cudaMemcpyHostToDevice);
    
    auto startGPU = std::chrono::high_resolution_clock::now();

    // Create an OrtValue tensor backed by data on CUDA memory
    Ort::Value boundX = Ort::Value::CreateTensor(infoCuda, reinterpret_cast<float*>(input.get()), image.size(), inputDims.data(), inputDims.size());
    std::vector<float> outputData(std::accumulate(outputDims.begin(), outputDims.end(), 1, std::multiplies<int>()));
    auto output = cudaAllocator.GetAllocation(outputData.size() * sizeof(float));

    // Create an OrtValue tensor backed by data on CUDA memory
    Ort::Value boundY = Ort::Value::CreateTensor(infoCuda, reinterpret_cast<float*>(output.get()), outputData.size(), outputDims.data(), outputDims.size());
    binding.BindInput(inputTensorName.c_str(), boundX);
    binding.BindOutput(outputTensorName.c_str(), boundY);
    binding.SynchronizeInputs();

    session.Run(Ort::RunOptions(), binding);
    binding.SynchronizeOutputs();
    auto endGPU = std::chrono::high_resolution_clock::now();
    auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
    binding.ClearBoundInputs();
    binding.ClearBoundOutputs();

    f<<i<<" -- GPU inference duration : "<<durationGPU.count()<< "ns Throughput : " << (1.0/(durationGPU.count()/1e9))*batch << std::endl;
  }                        
} 

Urgency

No response

Platform

Linux

OS Version

CentOS Linux release 7.6.1810 (Core)

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

ONNX Runtime 1.15.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7

Model File

No response

Is this a quantized model?

Unknown

github-actions bot added the ep:CUDA label on Sep 27, 2023
hariharans29 (Member) commented
Is your benchmark code just adding a loop to this test: TEST(CApiTest, io_binding_cuda)?

The block of code whose duration you measure contains an allocation (GetAllocation()) and two device synchronizations (binding.SynchronizeInputs() and binding.SynchronizeOutputs()). Even if the allocation is for the same number of bytes each time and no real allocation happens on every iteration (because of an underlying memory pool in the allocator), I would move it out of the timed block. In any case, the device synchronizations you added to ensure that the copy on the default stream has completed (cudaMemcpy(input.get(), image.data(), sizeof(float) * image.size(), cudaMemcpyHostToDevice);) may be contributing to the variance if the device was doing something else at that time. I think SynchronizeInputs() is superfluous because cudaMemcpy() is a blocking call anyway, so the data will have been copied to the CUDA buffer by the end of that call. SynchronizeOutputs() isn't really needed either, as Run() should do a stream sync before returning.
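
For illustration, a rough sketch of that restructuring (reusing the variables from your snippet; allocation and binding are hoisted out of the loop, and only the Run() call is timed):

// Sketch: allocate and bind once, outside the loop; time only Run().
auto input = cudaAllocator.GetAllocation(image.size() * sizeof(float));
size_t outputCount = outputDims[0] * outputDims[1];
auto output = cudaAllocator.GetAllocation(outputCount * sizeof(float));

Ort::Value boundX = Ort::Value::CreateTensor(infoCuda, reinterpret_cast<float*>(input.get()), image.size(), inputDims.data(), inputDims.size());
Ort::Value boundY = Ort::Value::CreateTensor(infoCuda, reinterpret_cast<float*>(output.get()), outputCount, outputDims.data(), outputDims.size());
binding.BindInput(inputTensorName.c_str(), boundX);
binding.BindOutput(outputTensorName.c_str(), boundY);

for (int i = 0; i < num_iterations; i++)
{
  // cudaMemcpy on the default stream blocks until the copy completes,
  // so no SynchronizeInputs() is needed afterwards.
  cudaMemcpy(input.get(), image.data(), sizeof(float) * image.size(), cudaMemcpyHostToDevice);

  auto startGPU = std::chrono::high_resolution_clock::now();
  session.Run(Ort::RunOptions(), binding);  // Run() syncs the stream before returning
  auto endGPU = std::chrono::high_resolution_clock::now();

  auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
  f<<i<<" -- GPU inference duration : "<<durationGPU.count()<<"ns"<<std::endl;
}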

Do you see such variances when the input is already on the right device and no IOBinding is used, i.e. when you supply OrtValues backed by CUDA memory via a regular Run() call?
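
For reference, a minimal sketch of that variant (again reusing your variables; the input is a CUDA-backed OrtValue passed straight to a regular Run(), and ORT allocates the outputs itself):

// Sketch: no IOBinding; a CUDA-backed input OrtValue goes directly to Run().
auto input = cudaAllocator.GetAllocation(image.size() * sizeof(float));
cudaMemcpy(input.get(), image.data(), sizeof(float) * image.size(), cudaMemcpyHostToDevice);

Ort::Value inputTensor = Ort::Value::CreateTensor(infoCuda, reinterpret_cast<float*>(input.get()), image.size(), inputDims.data(), inputDims.size());
const char* inputNames[] = {inputTensorName.c_str()};
const char* outputNames[] = {outputTensorName.c_str()};

for (int i = 0; i < num_iterations; i++)
{
  auto startGPU = std::chrono::high_resolution_clock::now();
  // ORT allocates the outputs; only the Run() call itself is timed.
  auto outputs = session.Run(Ort::RunOptions{nullptr}, inputNames, &inputTensor, 1, outputNames, 1);
  auto endGPU = std::chrono::high_resolution_clock::now();
  auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
  f<<i<<" -- GPU inference duration : "<<durationGPU.count()<<"ns"<<std::endl;
}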


Manutea commented Sep 28, 2023

Hello, thank you for replying.

I tried without using IOBinding.

void onnx_benchmark_GPU(std::string &modelPath, int deviceId)
{
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ModelInference");

  Ort::SessionOptions options;
  options.EnableProfiling("gpu_profile_file");
  OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_CUDA(options, deviceId);
  Ort::Session session(env, modelPath.c_str(), options);

  std::array<float, 1*3*224*224> input_data;
  input_data.fill(150.0f);
  std::vector<int64_t> input_dims = {1, 3, 224, 224};

  auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_data.data(), input_data.size(), input_dims.data(), input_dims.size());

  const char* inputNames[] = {"input"};
  const char* outputNames[] = {"output"};
  std::ofstream f("mesureonnx.txt");

  for(int i=0; i<500; ++i)
  {
    auto startGPU = std::chrono::high_resolution_clock::now();
    auto output_tensors = session.Run(Ort::RunOptions{nullptr}, inputNames, &input_tensor, 1, outputNames, 1);
    auto endGPU = std::chrono::high_resolution_clock::now();
    auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
    f<<i<<" -- GPU inference duration : "<<durationGPU.count()<< "ns" << std::endl;
  }
}

Here too, I notice that some inferences are abnormally slow, and they occur at the same task IDs in every run.

[Figure: per-iteration latency plot]
With Perfetto: [screenshot, 2023-09-28]
And nvvp: [screenshot, 2023-09-28]


Manutea commented Oct 3, 2023

I've also just tried it with the CPU execution provider, and I see the same behavior: the CPU is periodically left waiting for something.

void onnx_benchmark_CPU(std::string &filePath, std::string &modelPath, std::string &inputTensorName, std::string &outputTensorName, int batch)
{
  std::vector<float> image(batch * 3 * 224 * 224, 150);
  std::vector<int64_t> inputDims = {batch, 3, 224, 224};
  std::vector<int64_t> outputDims = {batch, 1000};

  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "InferenceCPU");
  Ort::SessionOptions sessionOptions;
  sessionOptions.EnableProfiling("cpu_profile_file");
  sessionOptions.SetIntraOpNumThreads(1);
  Ort::Session session(env, modelPath.c_str(), sessionOptions);

  auto memoryInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value inputTensor = Ort::Value::CreateTensor<float>(memoryInfo, image.data(), image.size(), inputDims.data(), inputDims.size());

  const char* inputNames[] = {inputTensorName.c_str()};
  const char* outputNames[] = {outputTensorName.c_str()};

  std::ofstream file(filePath, std::ios::app);
  int nbIterations = 800;
  for (int i = 0; i < nbIterations; i++)
  {
    auto startCPU = std::chrono::high_resolution_clock::now();
    auto outputTensors = session.Run(Ort::RunOptions{nullptr}, inputNames, &inputTensor, 1, outputNames, 1);
    auto endCPU = std::chrono::high_resolution_clock::now();
    auto durationCPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endCPU - startCPU);
    std::cout<<"CPU inference duration : "<<durationCPU.count()<< " ns" <<std::endl;
    if(i>0)
      file << batch << " " << durationCPU.count() << "\n";
  }
  file.close();
}

[Figure: per-iteration CPU latency plot; screenshot from 2023-10-03]


github-actions bot commented Nov 2, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label on Nov 2, 2023