
[Performance] Some allocations persist even after many Run() invocations on fixed inputs #17758

Open
SolomidHero opened this issue Oct 2, 2023 · 5 comments
Labels
stale issues that have not been addressed in a while; categorized by a bot

Comments

@SolomidHero

SolomidHero commented Oct 2, 2023

Describe the issue

Prerequisites

I use an onnxruntime model for audio processing, which performs best without any operations of non-deterministic duration. Thus many practitioners advise removing 1) allocations, 2) blocking, 3) sleeping, etc.
In my algorithm I run the ONNX model on every chunk of the input signal, at 100 Hz.
So I need to run such an ONNX model without additional allocations over time.
I've seen #14960, but it doesn't seem to help.

Problem description

I use onnxruntime with the C++ API. Here is how I configured and run the model:

  1. Set options for the environment and session:
// environment
Ort::Env env;

// session options (in my code these calls live inside a CustomSessionOptions class)
Ort::SessionOptions session_options;
session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
session_options.EnableCpuMemArena();
session_options.EnableMemPattern();
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  2. Create the Ort session session_ with the env and options above.
  3. Create input and output vectors of fixed size and define their corresponding shapes; each vector's size is just the product of its shape dimensions.
  4. Create tensors with Ort::Value::CreateTensor for each data vector and keep them in an Ort::Value vector.
  5. Bind IO to those tensors (see the sketch after step 6):
Ort::IoBinding io_binding(session_);
io_binding.BindInput(input_name, input_value);    // for each input
io_binding.BindOutput(output_name, output_value); // for each output
  6. Perform the run step:
Ort::RunOptions run_options{nullptr};
session_.Run(
  run_options,
  input_names.data(), inputs.data(), input_names.size(),
  output_names.data(), outputs.data(), output_names.size()
);
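
(To make steps 3-6 concrete: a minimal sketch for a single float input and output, where in_data, out_data, input_shape, output_shape, and the tensor names "input"/"output" are hypothetical placeholders:)

// CPU memory info so tensors wrap the preallocated vectors without copying
Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

// step 4: tensors reuse the fixed-size buffers
Ort::Value input_value = Ort::Value::CreateTensor<float>(
    mem_info, in_data.data(), in_data.size(),
    input_shape.data(), input_shape.size());
Ort::Value output_value = Ort::Value::CreateTensor<float>(
    mem_info, out_data.data(), out_data.size(),
    output_shape.data(), output_shape.size());

// step 5: bind once
Ort::IoBinding io_binding(session_);
io_binding.BindInput("input", input_value);    // hypothetical tensor name
io_binding.BindOutput("output", output_value); // hypothetical tensor name

// step 6: with IoBinding, Run() has an overload that takes the binding
// directly, so the name/value arrays need not be passed on every call
Ort::RunOptions run_options{nullptr};
session_.Run(run_options, io_binding);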

Things to note:

  • the run step (step 6 above) is executed many times
  • all shapes are constant, the vector sizes are the same, and the Ort values are the same
  • the underlying ONNX model has only shape-deterministic operations, i.e. Conv, LSTM, FC, etc.

Memory profiling

Then I started using gperftools with tcmalloc. I compared the difference between two heap-profile snapshots (at 100 seconds and 120 seconds of runtime):

pprof --nodefraction=0.0003 --edgefraction=0.0003 --show_bytes --cum --text --base=$name.0009.heap bin/$exec_name $name.0010.heap
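
(For context, such snapshots can also be produced programmatically with gperftools' heap profiler; HeapProfilerStart/HeapProfilerDump/HeapProfilerStop come from <gperftools/heap-profiler.h>:)

#include <gperftools/heap-profiler.h>

HeapProfilerStart("onnxrt");  // snapshots are written as onnxrt.NNNN.heap
// ... run the model for ~100 s ...
HeapProfilerDump("baseline"); // e.g. the *.0009.heap snapshot above
// ... run for another ~20 s ...
HeapProfilerDump("after");    // e.g. the *.0010.heap snapshot above
HeapProfilerStop();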

And I found that onnxruntime still performs some allocations/deallocations:

   10112  10.5%  96.6%    10112  10.5% onnxruntime::Tensor::InitOrtValue@931860
...
    7176   7.4% 104.0%     7176   7.4% absl::lts_20220623::inlined_vector_internal::Storage::Resize@8b8310
       0   0.0% 104.0%     7176   7.4% onnxruntime::ExecutionFrame::ExecutionFrame
       0   0.0% 104.0%     7176   7.4% onnxruntime::IExecutionFrame::Init
       0   0.0% 104.0%     7176   7.4% onnxruntime::StreamExecutionContext::StreamExecutionContext
       0   0.0% 104.0%     2688   2.8% onnxruntime::Reshape::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::Add::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::Transpose::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::UntypedBroadcastTwo
       0   0.0% 104.0%     1920   2.0% onnxruntime::Slice10::Compute
       0   0.0% 104.0%     1920   2.0% onnxruntime::SliceBase::Compute
    1680   1.7% 105.8%     1680   1.7% OrtValue::Init
...
    -432  -0.4% 105.4%     -432  -0.4% google::protobuf::Arena::CreateMaybeMessage@c685c0
...
    -432  -0.4% 104.9%     -432  -0.4% onnx::OnnxParser::Parse@b8b5c0

Don't pay attention to the percentages: onnxruntime is not the only component allocating, so the numbers don't sum to 100%.

What I expect: no onnxruntime entity allocates or deallocates during steady-state Run() calls.

To reproduce

I think I need to share a minimal model, but I don't have one available; I would have to create one and provide a how-to guide. I believe, though, that the problem can be understood, and perhaps solved, from the explanation above if anyone knows about this behavior.

Urgency

No response

Platform

Mac

OS Version

12.3.1

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@SolomidHero
Author

I've got a model that exhibits this problem: Silero VAD v3
vad_model.onnx.zip

@snnn closed this as not planned Oct 2, 2023
@snnn reopened this Oct 2, 2023
@snnn
Member

snnn commented Oct 2, 2023

If you do not want to see absl data structures in the call stack, you might build from source with "--cmake_extra_defines onnxruntime_DISABLE_ABSEIL=ON", which could make a difference.
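
(For reference, a from-source build invocation with that define might look like the following; the build.sh flags here are an assumption based on the usual onnxruntime build script:)

./build.sh --config Release --parallel --cmake_extra_defines onnxruntime_DISABLE_ABSEIL=ON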

@wschin
Contributor

wschin commented Oct 3, 2023

Small vectors are likely created for shape manipulation and live on the stack. Their allocation/deallocation is usually far cheaper than typical model run time. Did you profile with the real model you want to run? If yes, what percentage of the time is spent in allocation/deallocation?
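
(One way to estimate that percentage, a minimal sketch assuming the session_, run_options, and name/value arrays from the issue description are in scope: measure the mean Run() latency directly, then compare it with the time a CPU profiler attributes to malloc/free:)

#include <chrono>
#include <cstdio>

constexpr int kIters = 1000;
auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < kIters; ++i)
  session_.Run(run_options,
               input_names.data(), inputs.data(), input_names.size(),
               output_names.data(), outputs.data(), output_names.size());
auto t1 = std::chrono::steady_clock::now();
std::printf("mean Run() time: %.1f us\n",
            std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);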

@SolomidHero
Author

SolomidHero commented Oct 3, 2023

@snnn
Hi, thank you for the response!
Building with that flag makes onnxruntime use STL containers instead of abseil, right?
Am I correct that inside session.Run() an execution plan is created and some mappings are looked up via abseil hash maps?
I also found something like this in my report:

...
       0   0.0% 116.7%      432   5.6% onnx::OpSchema::TypeConstraintParam::TypeConstraintParam
       0   0.0% 116.7%      432   5.6% onnx::OpSet_Onnx_ver15::ForEachSchema
       0   0.0% 116.7%      432   5.6% std::__1::__function::__func::operator@28810
     432   5.6% 122.3%      432   5.6% std::__1::vector::vector
       0   0.0% 122.3%     -432  -5.6% onnx::OpSchema::OpSchema
       0   0.0% 122.3%     -432  -5.6% onnx::OpSet_Onnx_ver17::ForEachSchema
       0   0.0% 122.3%     -432  -5.6% std::__1::__function::__func::operator@29020
...
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchema::Finalize
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchema::ParseAndSetTypes
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchemaRegistry::OpSchemaRegisterOnce::OpSchemaRegisterOnce
       0   0.0% 122.3%     -864 -11.1% onnx::RegisterSchema
    -864 -11.1% 111.1%     -864 -11.1% std::__1::__hash_table::__assign_multi@bc3b30
       0   0.0% 111.1%     -864 -11.1% std::__1::__hash_table::__construct_node_hash@15540
       0   0.0% 111.1%     -864 -11.1% std::__1::unordered_map::unordered_map@14a60
    -864 -11.1% 100.0%     -864 -11.1% std::__1::unordered_set::unordered_set@141d0
...

@wschin
Hi, thank you for the suggestion. A few points:

  • gperftools provides a heap profiler (https://gperftools.github.io/gperftools/heapprofile.html). As far as I understand, it registers only heap allocations, not stack ones, so my analysis in the first message concerns heap alloc/dealloc only. The entries I found in the memory-profiler report might therefore not be the stack-based shape vectors you assumed.
  • I also tried CPU profiling from gperftools, but wall-time profiling seems a little hard on macOS. Do you have a guide on how to calculate "the percentage of allocation/deallocation time"?

I use this model:

> I've got a model that exhibits this problem: Silero VAD v3
> vad_model.onnx.zip
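
(A profiler-free cross-check for the measurement question above, purely illustrative and not an onnxruntime API: override the global allocation operators and count heap allocations around a single warmed-up Run():)

#include <atomic>
#include <cstdlib>
#include <new>

static std::atomic<long> g_heap_allocs{0};

void* operator new(std::size_t n) {
  g_heap_allocs.fetch_add(1, std::memory_order_relaxed); // count every heap allocation
  if (void* p = std::malloc(n)) return p;
  throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }

// Usage: read g_heap_allocs before and after one session_.Run(...) call;
// a nonzero delta confirms per-run heap traffic independently of the tcmalloc report.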

Contributor

github-actions bot commented Nov 2, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Nov 2, 2023