
Remove useless NodeProto serializations #18791

Merged: 5 commits into microsoft:main on Jan 4, 2024

Conversation

neNasko1
Contributor

Description

This pull request improves the efficiency of inference session creation by eliminating unnecessary Node::ToProto invocations, along with the ~NodeProto destructor cost that follows them.

Motivation and Context

This pull request targets low-hanging fruit in the inference session creation process: removing the unneeded Node::ToProto calls streamlines the code and improves overall performance. The flame graphs show the gain as a reduced share of time spent in Node::ToProto.

Code Snippet

TEST(InferenceSessionTests, Bench) {
  // Initialize logging manager
  auto logging_manager = std::make_unique<logging::LoggingManager>(
      std::unique_ptr<ISink>(new CLogSink()), logging::Severity::kVERBOSE, false,
      LoggingManager::InstanceType::Temporal);

  // Create environment
  std::unique_ptr<Environment> env;
  auto st = Environment::Create(std::move(logging_manager), env);
  ASSERT_TRUE(st.IsOK());

  // Configure session options
  SessionOptions so;
  so.execution_mode = ExecutionMode::ORT_SEQUENTIAL;
  so.graph_optimization_level = TransformerLevel::Level2;
  so.intra_op_param.thread_pool_size = 1;

  // Initialize and load the InferenceSession
  InferenceSessionTestGlobalThreadPools session1{so, *env};
  ASSERT_STATUS_OK(session1.Load("big.onnx"));
  ASSERT_STATUS_OK(session1.Initialize());
}

big.onnx model creation

import onnx
import numpy as np
from spox import argument, build, Tensor, Var
from spox.opset.ai.onnx import v17 as op
from spox.opset.ai.onnx.ml.v3 import label_encoder

a = argument(Tensor(np.int64, ('N',)))
c = a

for x in range(1000):
    c = op.mul(c, op.const(np.ones(10000, dtype=np.int64)))

for x in range(3000):
    all_strings = list("random_string" + str(i) for i in range(100))
    all_ints = list(range(len(all_strings)))
    c = label_encoder(
        c,
        keys_int64s=all_ints,
        values_strings=all_strings
    )
    c = label_encoder(c, keys_strings=all_strings, values_int64s=all_ints)

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big.onnx")

Testing in Release with perf yields:
Before: 3.3% spent in Node::ToProto
After: 1.6% spent in Node::ToProto

@cbourjau
Contributor

Thanks for looking into this! Can you or somebody else tell me if there is any functional difference between calling check_node here and calling check_graph earlier? The latter would have the advantage that every node is still in its NodeProto format.

@neNasko1
Contributor Author

The improvement is more pronounced when working on Nodes with large attributes:

import onnx
import numpy as np
from spox import argument, build, Tensor, Var
from spox.opset.ai.onnx import v17 as op
from spox.opset.ai.onnx.ml.v3 import label_encoder

a = argument(Tensor(np.int64, ('N',)))
c = a

for x in range(100):
    all_strings = list("random_string" + str(i) for i in range(100000))
    all_ints = list(range(len(all_strings)))
    c = op.if_(
        op.const(True),
        then_branch=lambda: [
            label_encoder(
                c,
                keys_int64s=all_ints,
                values_strings=all_strings
            )
        ],
        else_branch=lambda: [
            label_encoder(
                c,
                keys_int64s=all_ints,
                values_strings=all_strings
            )
        ]
    )[0]
    c = label_encoder(c, keys_strings=all_strings, values_int64s=all_ints)

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big.onnx")

On my machine the provided InferenceSessionTests.Bench goes down from 4800ms to 4000ms for the Release build.

@neNasko1 changed the title from "Remove useless nodeproto ser" to "Remove useless NodeProto serializations" on Dec 14, 2023
@neNasko1
Contributor Author

> Thanks for looking into this! Can you or somebody else tell me if there is any functional difference between calling check_node here and calling check_graph earlier? The latter would have the advantage that every node is still in its NodeProto format.

Also, can someone elaborate on whether check_graph is even needed in this context?

@cbourjau
Contributor

Sorry for randomly tagging you, @snnn, but I saw your name in the git blame. Would you know the right person to review this?

@snnn added the "core runtime" label (issues related to core runtime) on Dec 20, 2023
@neNasko1
Contributor Author

neNasko1 commented Jan 2, 2024

Am I right to assume that in this case check_graph can be skipped? If this is not the case, maybe some additional tests have to be added, since currently all are passing.

@justinchuby
Contributor

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-ortmodule-distributed, orttraining-linux-gpu-ci-pipeline, Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

@justinchuby
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Linux OpenVINO CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@justinchuby
Contributor

/azp run Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@justinchuby requested review from @snnn and @skottmckay on January 3, 2024
@neNasko1
Contributor Author

neNasko1 commented Jan 3, 2024

TLDR: this change now creates NodeProto objects only when they are actually needed, which is a significant improvement for nodes with large attributes.
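
For readers skimming the thread, the actual diff is not reproduced here, but the pattern it implements is roughly the following. This is a minimal self-contained sketch with made-up stand-in names (NodeProtoLike, NodeLike, CheckNode are illustrative, not onnxruntime or ONNX symbols): handing a factory callback to the checking path means the expensive proto is only materialized on the branch that actually needs it.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Stand-ins for ONNX NodeProto and onnxruntime's in-memory Node (illustrative only).
struct NodeProtoLike {
  std::vector<std::string> attributes;
};

struct NodeLike {
  std::vector<std::string> big_attributes;
  // Expensive: serializing copies every attribute.
  NodeProtoLike ToProto() const { return NodeProtoLike{big_attributes}; }
};

// Before: callers serialized unconditionally, even when the proto went unused.
// After: pass a factory so the proto is built only where it is needed.
void CheckNode(bool needs_full_proto,
               const std::function<NodeProtoLike()>& make_proto) {
  if (!needs_full_proto) {
    return;  // cheap path: no serialization, no ~NodeProto cost
  }
  NodeProtoLike proto = make_proto();  // built only on this branch
  std::cout << "checked node with " << proto.attributes.size() << " attributes\n";
}

int main() {
  NodeLike node{std::vector<std::string>(100000, "x")};
  CheckNode(false, [&] { return node.ToProto(); });  // no serialization happens
  CheckNode(true, [&] { return node.ToProto(); });   // serialization happens once
  return 0;
}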

@justinchuby
Contributor

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-ortmodule-distributed, orttraining-linux-gpu-ci-pipeline, Linux QNN CI Pipeline, Windows ARM64 QNN CI Pipeline

@justinchuby
Contributor

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Linux OpenVINO CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@cbourjau
Contributor

cbourjau commented Jan 3, 2024

> Can you or somebody else tell me if there is any functional difference between calling check_node here and calling check_graph earlier? The latter would have the advantage that every node is still in its NodeProto format.

From afar I'm still wondering if we couldn't just call check_graph in the very beginning. Am I missing something obvious?

@skottmckay
Contributor

skottmckay commented Jan 3, 2024

I think there's still a gap. Modifications to the graph due to optimizations can potentially invalidate nodes as they can change the values that are available. e.g. if you fuse 2 nodes, the value between the two nodes becomes internal to the fused node (i.e. it is no longer produced in the graph), and any other node that was consuming that value would be broken (consumers can include nodes in subgraphs).

Obviously this is an invalid fusion and the error is in the fusion itself, but if we don't run the checker after making these sorts of changes we don't discover the issue until graph execution time, which makes it far harder to debug.

However, that's an existing issue, so I assume optimizer bugs are obvious enough to be caught prior to check-in.
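
To make that failure mode concrete, here is a small self-contained sketch (illustrative only; SimpleNode and Validate are made-up names, not onnxruntime code): an invalid fusion hides the intermediate value 't' inside the fused node, so 't' is no longer produced in the graph, and only a validation pass over the modified graph surfaces the broken consumer before execution.

#include <iostream>
#include <set>
#include <string>
#include <vector>

struct SimpleNode {  // illustrative, not onnxruntime::Node
  std::string name;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Every node input must be a graph input or produced as some node's output.
bool Validate(const std::vector<SimpleNode>& nodes,
              const std::set<std::string>& graph_inputs) {
  std::set<std::string> available = graph_inputs;
  for (const auto& n : nodes)
    for (const auto& o : n.outputs) available.insert(o);
  for (const auto& n : nodes)
    for (const auto& i : n.inputs)
      if (available.count(i) == 0) {
        std::cout << n.name << " consumes missing value '" << i << "'\n";
        return false;
      }
  return true;
}

int main() {
  // A -> (t) -> B, plus C which also consumes t.
  std::vector<SimpleNode> nodes = {
      {"A", {"x"}, {"t"}}, {"B", {"t"}, {"y"}}, {"C", {"t"}, {"z"}}};
  std::set<std::string> graph_inputs = {"x"};
  std::cout << "before fusion valid: " << Validate(nodes, graph_inputs) << "\n";

  // Invalid fusion: A and B are merged, 't' becomes internal to the fused
  // node and is no longer produced in the graph, so C is now broken.
  std::vector<SimpleNode> fused = {{"AB_fused", {"x"}, {"y"}}, {"C", {"t"}, {"z"}}};
  std::cout << "after fusion valid: " << Validate(fused, graph_inputs) << "\n";
  return 0;
}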

@skottmckay
Contributor

> From afar I'm still wondering if we couldn't just call check_graph in the very beginning. Am I missing something obvious?

Potentially you could on the initial load, but following that we only call check_node for new nodes (e.g. created by optimizers). Graph::Resolve, which is where VerifyNodeAndOpMatch is called from, runs many (too many but that's a different issue) times so it needs to be as efficient as possible.

There may be other historical reasons as well.
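
A rough sketch of that incremental pattern, assuming made-up types (GraphLike and GraphNode are illustrative; the real logic lives in Graph::Resolve / VerifyNodeAndOpMatch and is more involved): each resolve pass remembers which nodes it has already verified and only pays the per-node checking cost for nodes added since the previous pass.

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct GraphNode {  // illustrative stand-in for onnxruntime::Node
  int index;
  std::string op_type;
};

class GraphLike {
 public:
  void AddNode(std::string op_type) {
    nodes_.push_back({next_index_++, std::move(op_type)});
  }

  // Resolve runs many times; only nodes not seen by a previous pass get the
  // (comparatively expensive) per-node check.
  void Resolve() {
    int checked = 0;
    for (const auto& n : nodes_) {
      if (checked_nodes_.count(n.index)) continue;  // already verified
      CheckNode(n);                                  // per-node check, e.g. a check_node equivalent
      checked_nodes_.insert(n.index);
      ++checked;
    }
    std::cout << "resolve pass checked " << checked << " node(s)\n";
  }

 private:
  void CheckNode(const GraphNode& n) {
    std::cout << "  checking " << n.op_type << " #" << n.index << "\n";
  }

  int next_index_ = 0;
  std::vector<GraphNode> nodes_;
  std::unordered_set<int> checked_nodes_;
};

int main() {
  GraphLike g;
  g.AddNode("Mul");
  g.AddNode("LabelEncoder");
  g.Resolve();                         // initial load: checks both nodes
  g.AddNode("FusedMulLabelEncoder");   // e.g. added by an optimizer
  g.Resolve();                         // later pass: checks only the new node
  return 0;
}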

@skottmckay
Contributor

/azp run Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@skottmckay merged commit 4e2d88b into microsoft:main on Jan 4, 2024
62 of 70 checks passed