Multi-partition support for context binary cache feature #18865

Merged: 32 commits merged into main from qnn_ctx_multi_partition_support on Feb 1, 2024
Changes from 5 commits

Commits (32):
eba7e03  Support multi-partition for context cache feature (HectorSVC, Dec 14, 2023)
8ffa12e  2. Load and execute the model with multiple EPContext (HectorSVC, Dec 15, 2023)
8117368  3. Mode: run with QDQ model + QNN context model (HectorSVC, Dec 16, 2023)
8a00784  Remove QNN EP options: qnn_context_cache_enable, qnn_context_cache_pa… (HectorSVC, Dec 18, 2023)
16058a8  update test code (HectorSVC, Dec 18, 2023)
aacab16  Update test code to reflect the changes which move provider options t… (HectorSVC, Dec 19, 2023)
a5a9aef  Merge branch 'main' of https://github.com/microsoft/onnxruntime (HectorSVC, Dec 20, 2023)
1567bf2  merge main (HectorSVC, Dec 20, 2023)
22b4c93  Fix Linux build (HectorSVC, Dec 27, 2023)
de53da1  fix some build issues (HectorSVC, Dec 29, 2023)
c3883b1  Set inputs outputs explicitly to make sure the order is same as the u… (HectorSVC, Jan 18, 2024)
a457b70  Merge branch 'main' into qnn_ctx_multi_partition_support (HectorSVC, Jan 19, 2024)
30c1ed7  resolve conflict (HectorSVC, Jan 20, 2024)
55d10b2  resolved merge conflicts (HectorSVC, Jan 21, 2024)
ce3c64f  resolve merge conflicts (HectorSVC, Jan 21, 2024)
8c55f19  remove the validation mode (HectorSVC, Jan 22, 2024)
e7c0827  clean up some not used code (HectorSVC, Jan 22, 2024)
d3feaa4  renaming (HectorSVC, Jan 22, 2024)
33516cd  Update tests (HectorSVC, Jan 23, 2024)
445bc1b  fix the issue relate to initializer handling (HectorSVC, Jan 25, 2024)
9c7bdfc  Move QNN context cache related tests to a separate file (HectorSVC, Jan 25, 2024)
3dfd94b  rename some tests (HectorSVC, Jan 25, 2024)
3b8e879  Add UT to verify the multi-partition support (HectorSVC, Jan 26, 2024)
ff2c313  fill some gaps in UT and fix an issue relate to context cache path (HectorSVC, Jan 26, 2024)
c8ea83d  add one more UT to dumps the context cache model with 2 EPContext nodes (HectorSVC, Jan 26, 2024)
3dbb95d  formating (HectorSVC, Jan 26, 2024)
9eb32aa  Use ContribOp FusedMatMul instead of Add op to make sure the float32 … (HectorSVC, Jan 26, 2024)
1d4fa6f  Merge branch 'main' into qnn_ctx_multi_partition_support (HectorSVC, Jan 26, 2024)
74c5cef  update according review comments. (HectorSVC, Feb 1, 2024)
0137172  update according review comments (HectorSVC, Feb 1, 2024)
b79a256  formating (HectorSVC, Feb 1, 2024)
9e71147  revert a change (HectorSVC, Feb 1, 2024)
8 changes: 8 additions & 0 deletions include/onnxruntime/core/framework/execution_provider.h
@@ -326,6 +326,14 @@
*/
virtual std::vector<AllocatorPtr> CreatePreferredAllocators() { return std::vector<AllocatorPtr>(); };

/**
* Get the array of pointers to the EPContext nodes.
* Returns an empty vector by default if the execution provider does not supply any.
*/
virtual const std::vector<const Node*> GetEpContextNodes() const {
return std::vector<const Node*>();

cpplint warning (execution_provider.h:334): Add #include <vector> for vector<>  [build/include_what_you_use] [4]
}

private:
const std::string type_;

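For orientation, here is a minimal sketch (not part of this PR) of how an execution provider might implement this hook; the class name and the ep_context_nodes_ member are hypothetical:

#include <vector>

#include "core/framework/execution_provider.h"
#include "core/graph/graph.h"

namespace onnxruntime {

// Hypothetical EP that records the EPContext nodes it creates while
// compiling its partitions and hands them back to the graph partitioner.
class MyExecutionProvider : public IExecutionProvider {
 public:
  MyExecutionProvider() : IExecutionProvider("MyExecutionProvider") {}

  const std::vector<const Node*> GetEpContextNodes() const override {
    // One EPContext node per compiled partition; empty when the EP
    // generated no context cache.
    return ep_context_nodes_;
  }

 private:
  std::vector<const Node*> ep_context_nodes_;
};

}  // namespace onnxruntime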
6 changes: 0 additions & 6 deletions include/onnxruntime/core/session/onnxruntime_c_api.h
@@ -3593,17 +3593,11 @@ struct OrtApi {
*
* QNN supported keys:
* "backend_path": file path to QNN backend library.
* "qnn_context_cache_enable": 1 to enable QNN graph creation from cached QNN context file. If it's enabled: QNN EP will
* load from cached QNN context binary if it exist. It will generate a context binary file if it's not exist
* "qnn_context_cache_path": explicitly provide the QNN context cache file. Default to model_file.onnx.bin if not provided.
* "profiling_level": QNN profiling level, options: "off", "basic", "detailed". Default to off.
* "rpc_control_latency": QNN RPC control latency.
* "vtcm_mb": QNN VTCM size in MB. default to 0(not set).
* "htp_performance_mode": QNN performance mode, options: "burst", "balanced", "default", "high_performance",
* "high_power_saver", "low_balanced", "low_power_saver", "power_saver", "sustained_high_performance". Default to "default".
* "qnn_context_embed_mode", 1 means dump the QNN context binary into node attribute EPContext->ep_cache_context in the ONNX skeleton model.
* 0 means dump the QNN context binary into separate bin file and set the path to EPContext->ep_cache_context.
* The path is relative path to the ONNX skeleton model file.
* "qnn_saver_path": File path to the QNN Saver backend library. If specified, QNN Saver will be enabled and will
* dump QNN API calls to disk for replay/debugging. QNN Saver produces incorrect model inference results and
* may alter model/EP partitioning. Use only for debugging.
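As a usage sketch, appending the QNN EP with the remaining provider options through the C++ API could look like this; the option values and model path are illustrative only:

#include <string>
#include <unordered_map>

#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_example");
  Ort::SessionOptions session_options;

  // Illustrative values; backend_path must point to a real QNN backend library.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "QnnHtp.dll"},
      {"profiling_level", "basic"},
      {"htp_performance_mode", "burst"}};
  session_options.AppendExecutionProvider("QNN", qnn_options);

  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}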
include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h
@@ -235,3 +235,21 @@
// Use this config to control the minimum size of the initializer when externalizing it during serialization
static const char* const kOrtSessionOptionsOptimizedModelExternalInitializersMinSizeInBytes =
"session.optimized_model_external_initializers_min_size_in_bytes";

// Enable the EP context feature to dump the partitioned graph, which includes the EP context, into an ONNX file.
// The dumped ONNX model with the EP context can be reused for later inference to avoid the EP graph partitioning/compilation overhead.

cpplint warning (onnxruntime_session_options_config_keys.h:240): Lines should be <= 120 characters long  [whitespace/line_length] [2]
// "0": disable. (default)
// "1": enable.
static const char* const kOrtSessionOptionEpContextEnable = "ep.ep_context_enable";

// Specify the file path for the ONNX model that contains the EP context.
// Defaults to original_file_name_ctx.onnx if not specified.
static const char* const kOrtSessionOptionEpContextFilePath = "ep.ep_context_file_path";

// Flag to specify whether to dump the EP context into the ONNX model.
// "0": dump the EP context into a separate file and store the file path in the ONNX model.
// "1": dump the EP context into the ONNX model. (default)
static const char* const kOrtSessionOptionEpContextEmbedMode = "ep.ep_context_embed_mode";

// Dump the model after graph partitioning to file "partitioned_graph.onnx".
static const char* const kDumpPartitionedGraph = "session.dump_partitioned_graph";

cpplint warning (onnxruntime_session_options_config_keys.h:255): Could not find a newline character at the end of the file.  [whitespace/ending_newline] [5]
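A sketch of how these new session options would be set from user code, assuming an EP that supports the EP context feature (e.g. QNN) is appended before session creation:

#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

int main() {
  Ort::Env env;
  Ort::SessionOptions session_options;

  // Dump the partitioned graph, including its EPContext nodes, to an ONNX file.
  session_options.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
  // Optional: pick the output path instead of the default *_ctx.onnx name.
  session_options.AddConfigEntry(kOrtSessionOptionEpContextFilePath, "model_ctx.onnx");
  // Embed the EP context binary inside the model instead of a separate file.
  session_options.AddConfigEntry(kOrtSessionOptionEpContextEmbedMode, "1");

  // ... append an EP that produces EPContext nodes, then create the session.
  return 0;
}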
75 changes: 72 additions & 3 deletions onnxruntime/core/framework/graph_partitioner.cc
@@ -16,6 +16,7 @@
#include "core/graph/function_utils.h"
#include "core/graph/graph_viewer.h"
#include "core/graph/model.h"
#include "core/session/onnxruntime_session_options_config_keys.h"

// uncomment this line to count non-CUDA ops in ONNX domain
// #define COUNT_NON_CUDA_OPS
@@ -634,9 +635,68 @@
return Status::OK();
}

static Status CreateEpContextModel(const ExecutionProviders& execution_providers,
const Graph& graph,
const std::string& ep_context_path,
const logging::Logger& logger) {
std::vector<const Node*> all_ep_context_nodes;
for (const auto& ep : execution_providers) {
const std::vector<const Node*> ep_context_nodes = ep->GetEpContextNodes();
all_ep_context_nodes.insert(all_ep_context_nodes.begin(), ep_context_nodes.begin(), ep_context_nodes.end());
}

auto get_ep_context_node = [&all_ep_context_nodes](const std::string& node_name) -> std::pair<bool, const Node*> {
for (auto& node : all_ep_context_nodes) {
if (node_name == node->Name()) {
return std::make_pair(true, node);
}
}
return std::make_pair(false, static_cast<const Node*>(nullptr));
};

onnxruntime::PathString context_cache_path;
PathString model_pathstring = graph.ModelPath().ToPathString();
if (all_ep_context_nodes.size() > 0) {
if (!ep_context_path.empty()) {
context_cache_path = ToPathString(ep_context_path);
} else if (!model_pathstring.empty()) {
context_cache_path = model_pathstring + ToPathString("_ctx.onnx");
}

bool file_exist = std::filesystem::is_regular_file(context_cache_path) && std::filesystem::exists(context_cache_path);

cpplint warning (graph_partitioner.cc:666): Lines should be <= 120 characters long  [whitespace/line_length] [2]

if (file_exist) {
// The user needs to remove the existing file to re-generate it
LOGS(logger, INFO) << "EP context file already exists.";
return Status::OK();
}

Model ep_context_model(graph.Name(), false, ModelMetaData(), PathString(), IOnnxRuntimeOpSchemaRegistryList(),
graph.DomainToVersionMap(), {}, logger);
auto& ep_graph = ep_context_model.MainGraph();
ep_graph.SetDescription(graph.Description());
for (const auto& node : graph.Nodes()) {
// The fused node and the EPContext node share the same node name
auto ep_context_node = get_ep_context_node(node.Name());
// Use the EPContext node created by the EP if the name matches; otherwise use the node from the original model
if (ep_context_node.first) {
ep_graph.AddNode(*ep_context_node.second);
} else {
ep_graph.AddNode(node);
}
}
ORT_RETURN_IF_ERROR(Model::Save(ep_context_model, context_cache_path));
}

return Status::OK();
}

 static Status PartitionOnnxFormatModel(const PartitionParams& partition_params, GraphPartitioner::Mode mode,
                                        const ExecutionProviders& execution_providers,
-                                       KernelRegistryManager& kernel_registry_manager) {
+                                       KernelRegistryManager& kernel_registry_manager,
+                                       bool ep_context_enabled,
+                                       std::string ep_context_path,
+                                       const logging::Logger& logger) {
bool modified_graph = false;

auto& graph = partition_params.graph.get();
@@ -654,6 +714,10 @@
partition_params.debug_graph_fn));
}

if (ep_context_enabled) {
ORT_RETURN_IF_ERROR(CreateEpContextModel(execution_providers, graph, ep_context_path, logger));
}

// expand any nodes that have an ONNX function definition but no matching ORT kernel.
modified_graph = false;
ORT_RETURN_IF_ERROR(InlineNodes(graph, modified_graph));
@@ -840,6 +904,8 @@

Status GraphPartitioner::Partition(Graph& graph, FuncManager& func_mgr,
const layout_transformation::TransformLayoutFunction& transform_layout_function,
const ConfigOptions& config_options,
const logging::Logger& logger,
Mode mode,
const layout_transformation::DebugGraphFn& debug_graph_fn) const {
// Currently this is a greedy partitioning algorithm driven by the provider preferences the user supplied when calling ONNX Runtime.
@@ -884,8 +950,11 @@

if (mode == Mode::kNormal || mode == Mode::kAssignOnly) {
#if !defined(ORT_MINIMAL_BUILD)
-    ORT_RETURN_IF_ERROR(PartitionOnnxFormatModel(partition_params, mode,
-                                                 providers_, kernel_registry_mgr_));
+    bool ep_context_enabled = config_options.GetConfigOrDefault(kOrtSessionOptionEpContextEnable, "0") == "1";
+    std::string ep_context_path = config_options.GetConfigOrDefault(kOrtSessionOptionEpContextFilePath, "");

cpplint warning (graph_partitioner.cc:954): Add #include <string> for string  [build/include_what_you_use] [4]
+    ORT_RETURN_IF_ERROR(PartitionOnnxFormatModel(partition_params, mode, providers_,
+                                                 kernel_registry_mgr_, ep_context_enabled,
+                                                 ep_context_path, logger));
#else
return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "ONNX models are not supported in this build.");
#endif //! defined(ORT_MINIMAL_BUILD)
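Putting the pieces together, a sketch of the two-step workflow these changes enable; file names are illustrative, and both sessions are assumed to append an EP (such as QNN) that produces EPContext nodes:

#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

int main() {
  Ort::Env env;

  // Step 1: compile once and dump the EP context model next to the source model.
  {
    Ort::SessionOptions so;
    so.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1");
    // ... append the QNN EP here. CreateEpContextModel() then writes
    // model.onnx_ctx.onnx (or the path given via ep.ep_context_file_path).
    Ort::Session compile_session(env, ORT_TSTR("model.onnx"), so);
  }

  // Step 2: later runs load the pre-compiled context model directly,
  // skipping the EP graph partitioning/compilation overhead.
  {
    Ort::SessionOptions so;
    // ... append the QNN EP here as well.
    Ort::Session run_session(env, ORT_TSTR("model.onnx_ctx.onnx"), so);
  }
  return 0;
}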
3 changes: 3 additions & 0 deletions onnxruntime/core/framework/graph_partitioner.h
@@ -13,6 +13,7 @@ namespace onnxruntime {
class ExecutionProviders;
class KernelRegistryManager;
class Model;
struct ConfigOptions;

class GraphPartitioner {
public:
@@ -31,6 +32,8 @@
// Run partitioning.
Status Partition(Graph& graph, FuncManager& func_mgr,
const layout_transformation::TransformLayoutFunction& transform_layout_function,
const ConfigOptions& config_options,
const logging::Logger& logger,
Mode mode = Mode::kNormal,
const layout_transformation::DebugGraphFn& debug_graph_fn = {}) const;
