Optimize onnxruntime::InferenceSession::Initialize with focus on GraphViewer. For large models the speedup of this function can be up to 3x. #19080
Can the default compiled binaries even train a model, or is a specialized compilation needed?
Is this training-specific? If so, could the additional code that tracks whether a node is a forward node be placed inside #if defined(ENABLE_TRAINING)?
nit: IsForwardNode/is_forward_node_ would be consistent with the coding standards.
This is for the PriorityNodeCompare class in graph_viewer.cc.
Reading graph_viewer and this comment makes me curious: is this purely for visualization? The graph_viewer is used by a lot of operations in TransformGraph. Does 'will be output first' mean for printing, or is it really required for the graph transformation?
Is the information about training available at runtime as well? If so, we could pass that information to the graph viewer and skip this expensive portion of code.
Researching further into avoiding the hash computation, it would probably be best to have a special key type which precomputes the hash, so the hash value is not recomputed on every lookup, e.g.
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1661r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0920r2.html
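A minimal sketch of such a precomputed-hash key (illustrative types only, not ORT's containers and not the exact design from the papers above):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type that computes its hash once at construction so
// that repeated lookups do not re-hash the string every time.
struct PrehashedKey {
  std::string value;
  std::size_t hash;

  explicit PrehashedKey(std::string v)
      : value(std::move(v)), hash(std::hash<std::string>{}(value)) {}

  bool operator==(const PrehashedKey& other) const { return value == other.value; }
};

// Hash functor that simply returns the cached hash.
struct PrehashedKeyHash {
  std::size_t operator()(const PrehashedKey& k) const { return k.hash; }
};

int main() {
  // The attribute name is hashed exactly once, here:
  const PrehashedKey backward_key{"__backwardpass"};

  std::unordered_map<PrehashedKey, int, PrehashedKeyHash> attributes;
  attributes.emplace(PrehashedKey{"__backwardpass"}, 1);

  // No string hashing on this lookup; only the cached value is used.
  return attributes.find(backward_key) != attributes.end() ? 0 : 1;
}
```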
Updating the forward node on each change is kind of ugly and risky. With regard to performance it should be unproblematic, since the string compare is quite efficient: it does a length check first before actually comparing bytes.
The reason for caching isForwardNode_ is that each check for kBackwardNodeAttributeName within PriorityNodeCompare actually computes the hash of the string, which is the costly part (see the sketch below).
Alternate solutions would change the container and precompute the hash of kBackwardNodeAttributeName, incorporating it into the key. The downside of this solution is that it would only work where collisions are handled by lists (chaining) instead of multiple probes of hash keys.
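A minimal sketch of that caching, with illustrative stand-ins for the real Node and attribute types:

```cpp
#include <string>
#include <unordered_map>

// Illustrative stand-ins, not the actual ORT Node/attribute types.
static const std::string kBackwardNodeAttributeName = "__backwardpass";

class Node {
 public:
  // A plain bool read for PriorityNodeCompare; no string hashing per call.
  bool IsForwardNode() const { return is_forward_node_; }

  void AddAttribute(const std::string& name, int value) {
    attributes_[name] = value;
    UpdateIsForwardNode();  // keep the cache in sync on every mutation
  }

  void RemoveAttribute(const std::string& name) {
    attributes_.erase(name);
    UpdateIsForwardNode();  // removal must refresh the cache too
  }

 private:
  void UpdateIsForwardNode() {
    // The attribute name is hashed here, once per mutation, instead of
    // once per comparator invocation.
    is_forward_node_ =
        attributes_.find(kBackwardNodeAttributeName) == attributes_.end();
  }

  std::unordered_map<std::string, int> attributes_;
  bool is_forward_node_ = true;
};
```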
Seems to be a duplicate of the same comment below.
I am not particularly happy tracking kBackwardNodeAttributeName in multiple places. Is there a chance to have a common pair of functions to add/remove nodes, to minimize breaking changes in the future?
I'm not familiar with the training code and the usage of this attribute. When is the attribute added, and once added is it actually ever removed?
I'm not familiar with training either. IMHO it doesn't matter whether this attribute gets removed in normal workflows. What matters is that it can be removed, and not checking for removal would change the behaviour and potentially even introduce bugs.
In this case it doesn't seem like a standard attribute, so I would like to understand how it's used before adding more code to production to handle it theoretically changing. There's a binary-size and maintenance cost, and if it's a purely internal value with specific usage it may be better to validate that the usage remains as expected via unit tests.
Based on this GitHub code search it has a very internal name, only seems to be set in a training optimizer, and only seems to be read in the graph_viewer code.
Unless there's some external usage of this magic value outside of ORT, it seems like it would be simpler for a Node to have a bool member that is directly set by the training optimizer, instead of the indirect, costly usage of a specially named attribute (sketched below).
@askhade do you know if this special value is used outside of ORT and must be set in the Node attributes?
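A minimal sketch of the direct-bool alternative, assuming hypothetical accessor names (SetForwardNode/IsForwardNode are my invention, not the actual ORT API):

```cpp
// Hypothetical minimal Node with the suggested bool member.
class Node {
 public:
  // Set directly by the training optimizer instead of adding a
  // specially named attribute.
  void SetForwardNode(bool is_forward) { is_forward_node_ = is_forward; }
  bool IsForwardNode() const { return is_forward_node_; }

 private:
  bool is_forward_node_ = true;  // forward unless the optimizer tags it
};

// In a comparator, the forward/backward check becomes a plain bool
// comparison instead of an attribute-map lookup with string hashing.
bool ForwardBeforeBackward(const Node& a, const Node& b) {
  if (a.IsForwardNode() != b.IsForwardNode()) {
    return a.IsForwardNode();  // forward nodes sort first
  }
  return false;  // tie on this criterion
}
```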
I checked the code and IMO it is OK to make this a bool member of the Node class instead of making it a named attribute. There is widespread usage of this in the code, so the change should not be very cumbersome. @mtavenrath let me know if you have any questions regarding this. Tagging @pengwa to validate this.
@mtavenrath what timeline are you targeting for this change? We may need a couple of days to get this reviewed by Peng.
One note - when finding usages to update, you need to search for both the kBackwardNodeAttributeName constant and the string "__backwardpass". Ideally we can make all places use the constant.
https://github.com/search?q=repo%3Amicrosoft%2Fonnxruntime%20kBackwardNodeAttributeName&type=code
https://github.com/search?q=repo%3Amicrosoft%2Fonnxruntime%20__backwardpass&type=code
@askhade I'm happy with any timeline, as long as it is not a month+ timeline.
FYI @askhade @skottmckay Yes, it is a purely internally used attribute. It was first introduced as a backward-data-range-specific perf improvement in https://github.com/microsoft/onnxruntime/blame/dfeda9019cfed2d6df5bcacc54269c7de481bdee/onnxruntime/core/providers/rocm/rocm_kernel.h#L29.
A second usage pattern: in priority-based topological ordering, the training code path considers a backward-tagged node to have lower priority than a forward node (onnxruntime/orttraining/orttraining/core/optimizer/memory_optimizer/memory_insight.cc, line 172 in dfeda90).
Both usages have similar logic when setting the backward pass to true, e.g. in d5d6924#diff-8d8d103ec215ba8edb8ab23e876080adfd60f6f377084ffeca041c8b4f189a2cR13 and in onnxruntime/orttraining/orttraining/core/optimizer/memory_optimizer/memory_insight.cc, line 172 in dfeda90.
Plus, "YieldOp" is used in ORTModule builds only (e.g. --enable_training), so I think it is possible to restrict the forward getter/setter to the ENABLE_TRAINING macro, as sketched below. This may need some changes in onnxruntime/core/session/provider_bridge_ort.cc to wrap the new bool property. (Not sure whether you tried building/running your local code with ROCm, but I feel the change should be needed to make the ROCm EP code work.)
One thing I have noticed for a while: both training and inference builds can use priority-based order. It is indeed needed for some training features, but I don't know how much value it brings for model inferencing.
If most inference users don't have such a need, maybe we can load nodes_in_topological_order_with_priority_ lazily, i.e. only initialize it the first time a user needs it via GetNodesInTopologicalOrder(ExecutionOrder::PRIORITY_BASED), as sketched below. @skottmckay @askhade
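A minimal sketch of that lazy mode, assuming simplified stand-ins for GraphViewer, NodeIndex, and ExecutionOrder:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical, simplified stand-ins for the real ORT types.
using NodeIndex = std::size_t;
enum class ExecutionOrder { DEFAULT, PRIORITY_BASED };

class GraphViewer {
 public:
  const std::vector<NodeIndex>& GetNodesInTopologicalOrder(
      ExecutionOrder order = ExecutionOrder::DEFAULT) const {
    if (order == ExecutionOrder::PRIORITY_BASED) {
      // Pay the sorting cost only on first use, not during session init.
      std::call_once(priority_order_computed_,
                     [this] { ComputePriorityBasedOrder(); });
      return nodes_in_topological_order_with_priority_;
    }
    return nodes_in_topological_order_;
  }

 private:
  void ComputePriorityBasedOrder() const {
    // Placeholder for the expensive priority-based topological sort.
    nodes_in_topological_order_with_priority_ = nodes_in_topological_order_;
  }

  std::vector<NodeIndex> nodes_in_topological_order_;
  mutable std::vector<NodeIndex> nodes_in_topological_order_with_priority_;
  mutable std::once_flag priority_order_computed_;
};
```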
Prefer InlinedVector
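For context, a hedged illustration of the difference: as far as I know, ORT's InlinedVector is backed by absl::InlinedVector, which keeps its first N elements in inline storage:

```cpp
#include "absl/container/inlined_vector.h"

int main() {
  // The first 8 elements live in inline storage, so no heap allocation
  // happens for typical small sizes, unlike std::vector.
  absl::InlinedVector<int, 8> indices;
  for (int i = 0; i < 8; ++i) indices.push_back(i);  // still inline
  indices.push_back(8);  // only now does the vector spill to the heap
  return 0;
}
```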
This change shaves nearly 2 s off the original 20 s of TransformGraph and will be most effective when loading models for inference.
For training models the hash computation is still done once. Previously the hash was potentially computed twice: once for find and once for at. By keeping the iterator returned by find and using it later to read the value, the number of hash computations can be halved (saving ~1 s instead of 2 s); see the sketch below.
If it's known that the model is used for inference only, it would be great if PriorityNodeCompare could skip this test altogether. This could be achieved most efficiently by making it a template class specialized for inference / training.
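A minimal sketch of the halving described above (illustrative attribute map, not the actual ORT code):

```cpp
#include <string>
#include <unordered_map>

int GetAttributeOrZero(const std::unordered_map<std::string, int>& attrs,
                       const std::string& name) {
  // Double hashing: find() hashes the key, then at() hashes it again.
  //   if (attrs.find(name) != attrs.end()) return attrs.at(name);

  // Single hashing: keep the iterator from find() and dereference it.
  auto it = attrs.find(name);
  return it != attrs.end() ? it->second : 0;
}
```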