Skip to content

Commit

Permalink
[ORTModule] Use Default Topo-order for GraphViewer (#18410)
Browse files Browse the repository at this point in the history
ORT's default topo-order is a reversed DFS algorithm, while the
priority-based topo-order is a forward BFS algorithm. It's likely that
the default order is better than priority-based order on memory because
tensor memory is more likely to be released right after it's consumed.

Currently ORTModule uses priority-based order, for some models, it sorts
lots of small Ops to the beginning, this introduces big CPU overhead at
the beginning (see below screenshot), this PR is to use default order
for training. The priority-based order is heavily used for some
recompute optimization, so if there is recompute enabled, we will still
use priority-based order.

This PR also adds an optimization to the default order, which is to move
all Shape/Size Ops to right after their parent nodes. This is to make
sure the shape and size nodes are executed right after their parents so
it's possible the input tensor memory can be released as soon as
possible. This is especially important for non-CPU devices or for
training case where some gradient graphs use only shape/size of tensors
from forward.

Profiling result:
Before
<img width="910" alt="截屏2023-11-13 12 09 02"
src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752">

After
<img width="910" alt="截屏2023-11-13 12 10 44"
src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab">
  • Loading branch information
centwang authored Nov 30, 2023
1 parent e1d1033 commit 148495e
Show file tree
Hide file tree
Showing 3 changed files with 39 additions and 3 deletions.
29 changes: 29 additions & 0 deletions onnxruntime/core/graph/graph_viewer.cc
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,12 @@ GraphViewer::GraphViewer(const Graph& graph, const IndexedSubGraph* filter_info)
: ConstGraphNodes::NodeFilterFunc(nullptr))},
filter_info_{filter_info} {
std::vector<const Node*> leaf_nodes;
// Keep the info of shape and size nodes and their parents so that after topological sort, we can move them
// right after their parents. This is to make sure the shape and size nodes are executed right after their parents
// so it's possible the input tensor memory can be released as soon as possible. This is especially important
// for non-CPU devices or for training case where some gradient graphs use only shape/size of tensors from forward.
InlinedHashSet<NodeIndex> shape_size_nodes;
InlinedHashMap<NodeIndex, InlinedVector<NodeIndex>> shape_size_parents;
for (auto& node : graph_->Nodes()) {
// This is a leaf node (without any output node)
if (node.OutputNodesBegin() == node.OutputNodesEnd()) {
Expand All @@ -66,6 +72,15 @@ GraphViewer::GraphViewer(const Graph& graph, const IndexedSubGraph* filter_info)
if (node.InputEdgesBegin() == node.InputEdgesEnd()) {
root_nodes_.push_back(node.Index());
}
if ((node.OpType() == "Shape" || node.OpType() == "Size") && node.InputEdgesBegin() != node.InputEdgesEnd()) {
shape_size_nodes.insert(node.Index());
NodeIndex parent = node.InputNodesBegin()->Index();
if (shape_size_parents.find(parent) == shape_size_parents.end()) {
shape_size_parents[parent] = InlinedVector<NodeIndex>{node.Index()};
} else {
shape_size_parents[parent].push_back(node.Index());
}
}
}

graph.ReverseDFSFrom(
Expand All @@ -76,6 +91,20 @@ GraphViewer::GraphViewer(const Graph& graph, const IndexedSubGraph* filter_info)
},
NodeCompare());

auto original = std::move(nodes_in_topological_order_);
nodes_in_topological_order_.reserve(original.size());
for (auto& node : original) {
if (shape_size_nodes.find(node) != shape_size_nodes.end()) {
continue;
}
nodes_in_topological_order_.push_back(node);
if (shape_size_parents.find(node) != shape_size_parents.end()) {
for (auto& following_node : shape_size_parents[node]) {
nodes_in_topological_order_.push_back(following_node);
}
}
}

#if !defined(ORT_MINIMAL_BUILD)
graph.KahnsTopologicalSort(
[this](const Node* n) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -238,8 +238,14 @@ def _get_session_config(self):
session_options.enable_mem_pattern = False
session_options.enable_mem_reuse = False
session_options.use_deterministic_compute = _are_deterministic_algorithms_enabled()
# default to PRIORITY_BASED execution order
session_options.execution_order = onnxruntime.ExecutionOrder.PRIORITY_BASED
# DEFAULT order is reversed DFS order, while PRIORITY_BASED order is forward BFS order.
# DEFAULT order is likely to be better than PRIORITY_BASED order on memory. However, our recompute feature
# requires PRIORITY_BASED order to work properly. So we use PRIORITY_BASED order when recompute is enabled.
session_options.execution_order = (
onnxruntime.ExecutionOrder.PRIORITY_BASED
if self._runtime_options.memory_optimizer_config != ""
else onnxruntime.ExecutionOrder.DEFAULT
)
# 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2.
session_options.log_severity_level = int(self._debug_options.logging.log_level)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,8 @@ TEST(MemoryOptimizerTests, GeluRecompute) {
ASSERT_EQ(original_gelu_node->Priority(), static_cast<int>(ExecutionPriority::DEFAULT));
}

TEST(MemoryOptimizerTests, TileRecompute) {
// Disable this UT for now. It has strong dependency on graph topological order, which is not correct logically.
TEST(MemoryOptimizerTests, DISABLED_TileRecompute) {
const logging::Logger* logger = &logging::LoggingManager::DefaultLogger();
auto model_uri = MODEL_FOLDER "recompute_tile.onnx";
std::shared_ptr<Model> model;
Expand Down

0 comments on commit 148495e

Please sign in to comment.