[Performance] Abnormal latencies on certain tasks and a GPU on standby. #17720
Comments
Is your benchmark code just adding a loop to this test?
The block of code for which the duration is measured has an allocation. Do you see such variances when the input is already on the right device and no IOBinding is used, i.e. you supply OrtValues backed by CUDA memory via a regular Run() call?
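For readers of the thread, here is a minimal sketch of that suggestion, i.e. wrapping CUDA device memory in an OrtValue and passing it to a plain Run() call. This is not code from the thread; the input shape, fill value, and device id are assumptions borrowed from the replies below.

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Sketch only: build an OrtValue that points at CUDA device memory, so a
// regular session.Run() call needs no host-to-device copy of the input.
Ort::Value make_cuda_input(int device_id) {
    // MemoryInfo describing CUDA device memory on the chosen device.
    Ort::MemoryInfo cuda_mem_info("Cuda", OrtDeviceAllocator, device_id, OrtMemTypeDefault);

    const std::vector<int64_t> dims = {1, 3, 224, 224};  // assumed model input shape
    const size_t count = 1 * 3 * 224 * 224;

    // Allocate and fill device memory; error handling elided for brevity.
    float* device_ptr = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&device_ptr), count * sizeof(float));
    std::vector<float> host(count, 150.0f);
    cudaMemcpy(device_ptr, host.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // The OrtValue wraps the device pointer directly; the allocation must
    // outlive the returned Ort::Value.
    return Ort::Value::CreateTensor<float>(cuda_mem_info, device_ptr, count,
                                           dims.data(), dims.size());
}
```

Such a tensor can then be passed to session.Run() exactly like the CPU-backed tensor in the replies below.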
Hello, thank you for replying. I tried without using IOBinding:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <chrono>
#include <fstream>
#include <string>
#include <vector>

void onnx_benchmark_GPU(std::string &modelPath, int deviceId)
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ModelInference");
    Ort::SessionOptions options;
    options.EnableProfiling("gpu_profile_file");

    // Append the CUDA execution provider; the returned status should be
    // checked and released so a non-null status does not leak.
    OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_CUDA(options, deviceId);
    if (status != nullptr)
        Ort::GetApi().ReleaseStatus(status);

    Ort::Session session(env, modelPath.c_str(), options);

    // CPU-resident input tensor: 1x3x224x224, filled with a constant value.
    std::array<float, 1*3*224*224> input_data;
    input_data.fill(150.0f);
    std::vector<int64_t> input_dims = {1, 3, 224, 224};
    auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_data.data(), input_data.size(), input_dims.data(), input_dims.size());

    const char* inputNames[] = {"input"};
    const char* outputNames[] = {"output"};

    std::ofstream f("mesureonnx.txt");
    for (int i = 0; i < 500; ++i)
    {
        auto startGPU = std::chrono::high_resolution_clock::now();
        auto output_tensors = session.Run(Ort::RunOptions{nullptr}, inputNames, &input_tensor, 1, outputNames, 1);
        auto endGPU = std::chrono::high_resolution_clock::now();
        auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
        f << i << " -- GPU inference duration : " << durationGPU.count() << " ns" << std::endl;
    }
}
```

I've also noticed here that some inferences are still too slow, and at the same task ids in each test.
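For comparison, the IOBinding variant asked about in the earlier question might look roughly like the sketch below. It reuses the `session` and `input_tensor` from the code above, and the device id 0 and output name "output" are assumptions:

```cpp
// Sketch only: bind the input once, let ORT allocate the output in CUDA
// device memory, then time repeated Run() calls on the binding.
Ort::IoBinding binding(session);
binding.BindInput("input", input_tensor);

// Ask ORT to place the output on CUDA device 0 (assumed id).
Ort::MemoryInfo cuda_mem_info("Cuda", OrtDeviceAllocator, 0, OrtMemTypeDefault);
binding.BindOutput("output", cuda_mem_info);

for (int i = 0; i < 500; ++i) {
    auto start = std::chrono::high_resolution_clock::now();
    session.Run(Ort::RunOptions{nullptr}, binding);
    auto end = std::chrono::high_resolution_clock::now();
    // ... record (end - start) as in the loop above ...
}
```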
I've also just tried it with the CPU provider:

```cpp
#include <onnxruntime_cxx_api.h>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

void onnx_benchmark_CPU(std::string &filePath, std::string &modelPath, std::string &inputTensorName, std::string &outputTensorName, int batch)
{
    std::vector<float> image(batch * 3 * 224 * 224, 150);
    std::vector<int64_t> inputDims = {batch, 3, 224, 224};
    std::vector<int64_t> outputDims = {batch, 1000};  // note: declared but not used below

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "InferenceCPU");
    Ort::SessionOptions sessionOptions;
    sessionOptions.EnableProfiling("cpu_profile_file");
    sessionOptions.SetIntraOpNumThreads(1);
    Ort::Session session(env, modelPath.c_str(), sessionOptions);

    auto memoryInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value inputTensor = Ort::Value::CreateTensor<float>(memoryInfo, image.data(), image.size(), inputDims.data(), inputDims.size());

    const char* inputNames[] = {inputTensorName.c_str()};
    const char* outputNames[] = {outputTensorName.c_str()};

    std::ofstream file(filePath, std::ios::app);
    int nbIterations = 800;
    for (int i = 0; i < nbIterations; i++)
    {
        auto startCPU = std::chrono::high_resolution_clock::now();
        auto outputTensors = session.Run(Ort::RunOptions{nullptr}, inputNames, &inputTensor, 1, outputNames, 1);
        auto endCPU = std::chrono::high_resolution_clock::now();
        auto durationCPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endCPU - startCPU);
        std::cout << "CPU inference duration : " << durationCPU.count() << " ns" << std::endl;
        // Skip the first iteration so session warm-up is not recorded.
        if (i > 0)
            file << batch << " " << durationCPU.count() << "\n";
    }
    file.close();
}
```
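To make the per-run outliers easier to compare across hardware and models, the recorded durations could be summarized instead of eyeballed line by line. A minimal sketch (not from the original benchmarks), assuming the per-iteration `durationGPU.count()` / `durationCPU.count()` values are collected into a `std::vector<int64_t>`:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch: report median, p99, and maximum over per-iteration durations (ns),
// which makes spikes like those at taskId ~250/450/800/1700 stand out.
void summarize(std::vector<int64_t> ns) {
    if (ns.empty()) return;
    std::sort(ns.begin(), ns.end());
    auto pct = [&](double p) { return ns[static_cast<size_t>(p * (ns.size() - 1))]; };
    std::cout << "median: " << pct(0.50) << " ns\n"
              << "p99:    " << pct(0.99) << " ns\n"
              << "max:    " << ns.back()  << " ns\n";
}
```

Calling `summarize()` after the benchmark loop gives one three-line summary per run instead of hundreds of raw measurements.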
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
Some inferences (taskId ~250, 450, 800 and 1700) take noticeably longer than the others.
During these slow iterations the GPU appears to do nothing and sits idle.
I have the same problem with P100 and RTX 8000 GPUs.
I've tried the AlexNet and GoogLeNet models.
Perhaps this is related to discussion #14023?
I also see these idle periods with onnxruntime_perf_test, using this command:
```
./onnxruntime_perf_test -I -S 1 -e cuda -r 2048 -p profile.json -s /data/model/googlenet/dynamic_batch_googlenet_opt.onnx
```
To reproduce
Urgency
No response
Platform
Linux
OS Version
CentOS Linux release 7.6.1810 (Core)
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
ONNX Runtime 1.15.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.7
Model File
No response
Is this a quantized model?
Unknown