
[Performance] Some allocations persist even after many Run() invocations on fixed inputs #17758

Open
SolomidHero opened this issue Oct 2, 2023 · 5 comments
Labels
stale issues that have not been addressed in a while; categorized by a bot

Comments

@SolomidHero

SolomidHero commented Oct 2, 2023

Describe the issue

Prerequisites

I use an onnxruntime model for audio processing, which performs best without any operations of non-deterministic duration. Thus many practitioners advise removing 1) allocations, 2) blocking, 3) sleeping, etc.
In my algorithm I run the ONNX model on every chunk of the input signal, at 100 Hz.
So I need to run such an ONNX model without additional allocations over time.
I've seen #14960, but it doesn't seem to help.

Problem description

I use onnxruntime with the C++ API. Here is how I configured and run the model:

  1. Set options for the environment and session:
// environment
Ort::Env env;

// session options (in my code these calls live inside a CustomSessionOptions class)
Ort::SessionOptions session_options;
session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
session_options.EnableCpuMemArena();
session_options.EnableMemPattern();
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  2. Create the Ort session session_ with the env and options above.
  3. Create input and output vectors of fixed size and define their corresponding shapes; each vector's size is just the product of its shape dimensions.
  4. Create tensors with Ort::Value::CreateTensor for each data vector and keep them in an Ort::Value vector.
  5. Bind IO to those tensors (see the sketch after step 6):
Ort::IoBinding io_binding(session_);
io_binding.BindInput(input_name, input_value);    // for each input
io_binding.BindOutput(output_name, output_value); // for each output
  6. Perform the run step:
Ort::RunOptions run_options{nullptr};
session_.Run(
  run_options,
  input_names.data(), inputs.data(), input_names.size(),
  output_names.data(), outputs.data(), output_names.size()
);
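
(To make steps 3-6 concrete: a minimal sketch for a single float input and output, where in_data, out_data, input_shape, output_shape, and the tensor names "input"/"output" are hypothetical placeholders:)

// CPU memory info so tensors wrap the preallocated vectors without copying
Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

// step 4: tensors reuse the fixed-size buffers
Ort::Value input_value = Ort::Value::CreateTensor<float>(
    mem_info, in_data.data(), in_data.size(),
    input_shape.data(), input_shape.size());
Ort::Value output_value = Ort::Value::CreateTensor<float>(
    mem_info, out_data.data(), out_data.size(),
    output_shape.data(), output_shape.size());

// step 5: bind once
Ort::IoBinding io_binding(session_);
io_binding.BindInput("input", input_value);    // hypothetical tensor name
io_binding.BindOutput("output", output_value); // hypothetical tensor name

// step 6: with IoBinding, Run() has an overload that takes the binding
// directly, so the name/value arrays need not be passed on every call
Ort::RunOptions run_options{nullptr};
session_.Run(run_options, io_binding);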

Things to note:

  • the run step (step 6 above) is executed many times
  • all shapes are constant, the vector sizes are the same, and the Ort values are the same
  • the underlying ONNX model has only shape-deterministic operations, i.e. Conv, LSTM, FC, etc.

Memory profiling

Then I started using gperftools with tcmalloc. I compared the difference between two heap-profile snapshots (at 100 seconds and 120 seconds of runtime):

pprof --nodefraction=0.0003 --edgefraction=0.0003 --show_bytes --cum --text --base=$name.0009.heap bin/$exec_name $name.0010.heap
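
(For context, such snapshots can also be produced programmatically with gperftools' heap profiler; HeapProfilerStart/HeapProfilerDump/HeapProfilerStop come from <gperftools/heap-profiler.h>:)

#include <gperftools/heap-profiler.h>

HeapProfilerStart("onnxrt");  // snapshots are written as onnxrt.NNNN.heap
// ... run the model for ~100 s ...
HeapProfilerDump("baseline"); // e.g. the *.0009.heap snapshot above
// ... run for another ~20 s ...
HeapProfilerDump("after");    // e.g. the *.0010.heap snapshot above
HeapProfilerStop();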

And I found that onnxruntime still performs some allocations/deallocations:

   10112  10.5%  96.6%    10112  10.5% onnxruntime::Tensor::InitOrtValue@931860
...
    7176   7.4% 104.0%     7176   7.4% absl::lts_20220623::inlined_vector_internal::Storage::Resize@8b8310
       0   0.0% 104.0%     7176   7.4% onnxruntime::ExecutionFrame::ExecutionFrame
       0   0.0% 104.0%     7176   7.4% onnxruntime::IExecutionFrame::Init
       0   0.0% 104.0%     7176   7.4% onnxruntime::StreamExecutionContext::StreamExecutionContext
       0   0.0% 104.0%     2688   2.8% onnxruntime::Reshape::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::Add::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::Transpose::Compute
       0   0.0% 104.0%     2184   2.3% onnxruntime::UntypedBroadcastTwo
       0   0.0% 104.0%     1920   2.0% onnxruntime::Slice10::Compute
       0   0.0% 104.0%     1920   2.0% onnxruntime::SliceBase::Compute
    1680   1.7% 105.8%     1680   1.7% OrtValue::Init
...
    -432  -0.4% 105.4%     -432  -0.4% google::protobuf::Arena::CreateMaybeMessage@c685c0
...
    -432  -0.4% 104.9%     -432  -0.4% onnx::OnnxParser::Parse@b8b5c0

Don't pay attention to the percentages: onnxruntime is not the only component allocating, so the numbers don't sum to 100%.

What I expect: no onnxruntime entity allocates or deallocates during steady-state Run() calls.

To reproduce

I think I need to share a minimal model, but I don't have one available; I would have to create one and provide a how-to guide. I believe, though, that the problem can be understood, and perhaps solved, from the explanation above if anyone knows about this behavior.

Urgency

No response

Platform

Mac

OS Version

12.3.1

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@SolomidHero
Author

I've got a model that exhibits this problem: Silero VAD v3
vad_model.onnx.zip

@snnn closed this as not planned Oct 2, 2023
@snnn reopened this Oct 2, 2023
@snnn
Member

snnn commented Oct 2, 2023

If you do not want to see absl data structures in the call stack, you might build from source with "--cmake_extra_defines onnxruntime_DISABLE_ABSEIL=ON", which could make a difference.
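
(For reference, a from-source build invocation with that define might look like the following; the build.sh flags here are an assumption based on the usual onnxruntime build script:)

./build.sh --config Release --parallel --cmake_extra_defines onnxruntime_DISABLE_ABSEIL=ON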

@wschin
Contributor

wschin commented Oct 3, 2023

Small vectors are likely created for shape manipulation and live on the stack. Their allocation/deallocation is usually far cheaper than typical model run time. Did you profile with the real model you want to run? If yes, what percentage of the time is spent in allocation/deallocation?
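
(One way to estimate that percentage, a minimal sketch assuming the session_, run_options, and name/value arrays from the issue description are in scope: measure the mean Run() latency directly, then compare it with the time a CPU profiler attributes to malloc/free:)

#include <chrono>
#include <cstdio>

constexpr int kIters = 1000;
auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < kIters; ++i)
  session_.Run(run_options,
               input_names.data(), inputs.data(), input_names.size(),
               output_names.data(), outputs.data(), output_names.size());
auto t1 = std::chrono::steady_clock::now();
std::printf("mean Run() time: %.1f us\n",
            std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);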

@SolomidHero
Author

SolomidHero commented Oct 3, 2023

@snnn
Hi, thank you for the response!
Building with that flag makes onnxruntime use STL containers instead of abseil, right?
Am I correct that inside session.Run() an execution plan is created and some mappings are looked up via abseil hash maps?
I also found something like this in my report:

...
       0   0.0% 116.7%      432   5.6% onnx::OpSchema::TypeConstraintParam::TypeConstraintParam
       0   0.0% 116.7%      432   5.6% onnx::OpSet_Onnx_ver15::ForEachSchema
       0   0.0% 116.7%      432   5.6% std::__1::__function::__func::operator@28810
     432   5.6% 122.3%      432   5.6% std::__1::vector::vector
       0   0.0% 122.3%     -432  -5.6% onnx::OpSchema::OpSchema
       0   0.0% 122.3%     -432  -5.6% onnx::OpSet_Onnx_ver17::ForEachSchema
       0   0.0% 122.3%     -432  -5.6% std::__1::__function::__func::operator@29020
...
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchema::Finalize
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchema::ParseAndSetTypes
       0   0.0% 122.3%     -864 -11.1% onnx::OpSchemaRegistry::OpSchemaRegisterOnce::OpSchemaRegisterOnce
       0   0.0% 122.3%     -864 -11.1% onnx::RegisterSchema
    -864 -11.1% 111.1%     -864 -11.1% std::__1::__hash_table::__assign_multi@bc3b30
       0   0.0% 111.1%     -864 -11.1% std::__1::__hash_table::__construct_node_hash@15540
       0   0.0% 111.1%     -864 -11.1% std::__1::unordered_map::unordered_map@14a60
    -864 -11.1% 100.0%     -864 -11.1% std::__1::unordered_set::unordered_set@141d0
...

@wschin
Hi, thank you for the suggestion. A few points:

  • gperftools provides a heap profiler (https://gperftools.github.io/gperftools/heapprofile.html). As far as I understand, it registers only heap allocations, not stack ones, so my analysis in the first message concerns heap alloc/dealloc only. The entries I found in the memory-profiler report might therefore not be the stack-based shape vectors you assumed.
  • I also tried CPU profiling from gperftools, but wall-time profiling seems a little hard on macOS. Do you have a guide on how to calculate "the percentage of allocation/deallocation time"?

I use this model:

> I've got a model that exhibits this problem: Silero VAD v3
> vad_model.onnx.zip
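
(A profiler-free cross-check for the measurement question above, purely illustrative and not an onnxruntime API: override the global allocation operators and count heap allocations around a single warmed-up Run():)

#include <atomic>
#include <cstdlib>
#include <new>

static std::atomic<long> g_heap_allocs{0};

void* operator new(std::size_t n) {
  g_heap_allocs.fetch_add(1, std::memory_order_relaxed); // count every heap allocation
  if (void* p = std::malloc(n)) return p;
  throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }

// Usage: read g_heap_allocs before and after one session_.Run(...) call;
// a nonzero delta confirms per-run heap traffic independently of the tcmalloc report.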

Contributor

github-actions bot commented Nov 2, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Nov 2, 2023