[Experimental] Add a path to fallback more nodes to CPUs. #19769
Conversation
Force-pushed from 8da6b51 to ffa61d7
Shape-related nodes don't only start with `Shape` or `Size`. In a dynamo-captured ONNX model, a shape-related chain can start with a graph input. A new transform is added to fall back `all` nodes that can be reversely traversed from a `shape-like` variable. Some `shape-like` variables are listed below:
- all inputs of Range
- 2nd input of Reshape
- 2nd input of Unsqueeze
- 1st input of ConstantOfShape
- 2nd-to-last inputs of Slice

Fix header; Remove unused variable; Versioning shape inputs; Fix
Force-pushed from ffa61d7 to ed79ec7
Force-pushed from 08ac5f3 to f896fb8
orttraining/orttraining/test/python/orttraining_test_aggressive_cpu_fallback.py (code-scanning annotations marked as fixed)
Force-pushed from e9de2b8 to 5abdb86
Fix typo; Write to fixed place; Remove unused imports; Run it; Fix; Change test location
Force-pushed from 5abdb86 to af9319b
```diff
@@ -39,6 +39,7 @@ steps:
   timeoutInMinutes: 60

   # Entry point for all ort training api tests
+  # TODO: move onnxscript installation to CI image.
```
I am not sure when it will be the right time as onnxscript is a relatively new tool.
```cpp
std::unordered_map<std::string, std::unordered_map<int64_t, std::vector<size_t>>> shape_related_inputs_in_nodes = {
    // 2nd input of Expand-13 is a shape-related input.
    {"Expand", {{13 /* since version */, {1} /* shape inputs' indices */}}},
```
Instead of reverse traversal from this pre-specified list of ops (which requires periodic maintenance: updating it as new ops are added to the ONNX standard, as op versions are revised, as shape input indices change across op version revisions, etc.), can the reverse traversal start from a provider-assigned node requiring a specific input on CPU? Usually any input needed on CPU by a provider node is "shape like", and this information is available in the kernel def of the node. That seems like a more "automated" way than the pre-cooked list approach.
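A hedged sketch of that idea follows. `RequiresInputOnCpu` is a hypothetical stand-in for the real kernel-def query (ORT kernel defs record which inputs a kernel expects in CPU memory), so only the overall shape of the loop is meant literally:

```cpp
#include <unordered_set>

#include "core/graph/graph.h"

namespace onnxruntime {

// Hypothetical helper: would consult the kernel def of the provider-assigned
// kernel for `node` and report whether `input_index` must live in CPU memory.
bool RequiresInputOnCpu(const Node& node, size_t input_index);

// Seed the reverse traversal from inputs that provider kernels need on CPU,
// instead of from a hand-maintained op list.
std::unordered_set<const NodeArg*> CollectCpuInputSeeds(const Graph& graph) {
  std::unordered_set<const NodeArg*> seeds;
  for (const Node& node : graph.Nodes()) {
    const auto& input_defs = node.InputDefs();
    for (size_t i = 0; i < input_defs.size(); ++i) {
      // Any input a kernel wants in CPU memory is usually "shape like",
      // so its producer chain is a candidate for CPU fallback.
      if (RequiresInputOnCpu(node, i)) {
        seeds.insert(input_defs[i]);
      }
    }
  }
  return seeds;
}

}  // namespace onnxruntime
```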
```cpp
// shape = onnx::Concat(s0, s1)
// reshaped = onnx::Reshape(x, shape)
// Then, the shape-producing node is Concat.
std::unordered_set<const Node*> shape_producing_nodes;
```
InlinedHashSet
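The suggestion is to use ORT's small-size-optimized hash set instead of `std::unordered_set`; a minimal sketch of the proposed change, assuming the usual `core/common/inlined_containers.h` header, would be:

```cpp
#include "core/common/inlined_containers.h"

// InlinedHashSet stores small sets inline and avoids per-element heap
// allocations, which is the usual reason it is preferred inside ORT.
InlinedHashSet<const Node*> shape_producing_nodes;
```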
```cpp
// 2. finds `shape` is a shape-related variable since Reshape's 2nd input is a shape-related input,
// 3. and then records the producer of `shape` (i.e., `Concat`).
for (auto& input_index : shape_input_indices) {
  auto input = node.InputDefs().at(input_index);
```
Should we check if this is an initializer?
What is the difference? From the perspective of finding shape-related nodes, a graph input and an initializer are the same. I am not sure if ORT has different assumptions somewhere.
```cpp
// Stop the traversal when a "Shape" node is found.
graph.ReverseDFSFrom(
    start_nodes,
    [&shape_related_node_indices](const Node* n) {
```
Are there nodes where shape is just one of the outputs, but the rest of the computation should be done on device?
I am not aware of any examples. If you are looking for an op producing both CPU and GPU outputs, attention could be a case when it wants to pass the forward's random seed (an int64 scalar) to the backward.
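To make the traversal concrete, here is a hedged sketch of how the `ReverseDFSFrom` excerpt above could continue; the exact callback signatures (enter/leave/comparator/stop) are an assumption based on the excerpt, not a quote of the PR's code:

```cpp
// Sketch only: collect every node reachable backwards from the
// shape-producing seeds, but stop at "Shape" nodes so the tensor
// computation feeding Shape stays on the device.
std::unordered_set<NodeIndex> shape_related_node_indices;
graph.ReverseDFSFrom(
    start_nodes,
    // enter: called once per visited node; record it for CPU fallback.
    [&shape_related_node_indices](const Node* n) {
      shape_related_node_indices.insert(n->Index());
    },
    /*leave*/ nullptr,
    /*comparator*/ nullptr,
    // stop: do not traverse past a Shape node; its input is a real tensor.
    [](const Node* /*from*/, const Node* to) {
      return to->OpType() == "Shape";
    });
```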
```diff
@@ -39,6 +43,132 @@ static bool IsSmallInitializer(const onnxruntime::GraphViewer& graph, const Node
 }
 }  // namespace

+std::unordered_set<NodeIndex> GetShapeRelatedNodes(const onnxruntime::GraphViewer& viewer) {
+  // Conceptually, this function traverses from shape-consuming nodes
+  // to fall back all their upstream nodes to CPU. Consider a graph
```
Maybe add some TODOs to enhance this for situations where it won't work:
(1) There is no shape "consumer" at all, i.e. the "shape like" output eventually becomes a graph output (a rare corner case, but there are definitely models like this).
(2) Cases where the shape subgraph is split across graph levels: the main graph has some portion of the shape nodes and a subgraph has another portion. In this case the "shape consumer" at the main-graph level will be a subgraph-containing node (If/Loop/Scan), and the shape info may be consumed "explicitly" (as a graph input to If/Loop/Scan) or implicitly by the node, i.e. not as an explicit graph input but through some node in the subgraph referencing the main-graph node output(s). A sketch of detecting the implicit case follows.
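For case (2), a hedged sketch of how the implicit consumption could at least be detected; `ContainsSubgraph` and `ImplicitInputDefs` are existing `Node` accessors to the best of my knowledge, and the rest is illustrative:

```cpp
// Sketch only: find main-graph values that flow into nested subgraphs, since
// they may be shape consumers that a node-level op-type scan would miss.
for (const Node& node : graph.Nodes()) {
  if (!node.ContainsSubgraph()) continue;
  // Explicit inputs of If/Loop/Scan are covered by the normal per-op scan;
  // implicit inputs are outer-scope values referenced inside the subgraph.
  for (const NodeArg* outer_scope_value : node.ImplicitInputDefs()) {
    // If outer_scope_value is shape-like, its producer chain should also be
    // considered for CPU fallback (left as a TODO, as suggested above).
    (void)outer_scope_value;
  }
}
```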
```cpp
// 1st input of ConstantOfShape is a shape-related input.
{"ConstantOfShape", {{9, {0}}, {20, {0}}, {21, {0}}}},
// 2nd to 5th inputs of Slice-13 are shape-related inputs.
{"Slice", {{13, {1, 2, 3, 4}}}}};
```
I don't know if operator Range is inlined but it could be considered as consuming a shape as well.
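For context, a hedged sketch of how such a versioned map could be consulted per node, assuming `Node::OpType()` and `Node::SinceVersion()` and picking the newest entry that does not exceed the node's opset version (not the PR's exact code):

```cpp
// Sketch only: find the shape-related input indices registered for this
// node's op type and opset version.
const auto op_it = shape_related_inputs_in_nodes.find(node.OpType());
if (op_it != shape_related_inputs_in_nodes.end()) {
  const int64_t node_version = node.SinceVersion();
  const std::vector<size_t>* shape_input_indices = nullptr;
  int64_t best_version = -1;
  // Pick the newest "since version" not exceeding the node's version,
  // mirroring how ONNX resolves op schemas.
  for (const auto& [since_version, indices] : op_it->second) {
    if (since_version <= node_version && since_version > best_version) {
      best_version = since_version;
      shape_input_indices = &indices;
    }
  }
  // shape_input_indices, if non-null, lists which inputs are shape-related.
}
```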
```cpp
      to_stop);
  }

  return shape_related_node_indices;
```
What happens if a shape input is on CUDA when this algorithm is moved to CPU?
It will fall back the producer of the shape input and its upstream nodes to CPU.
Force-pushed and now I can't update this branch anymore. #19875 continues the work.
Shape-related nodes don't only start with `Shape` or `Size`. In a dynamo-captured ONNX model, a shape-related chain can start with a graph input. A new transform is added to fall back `all` nodes which can be reversely traversed from a `shape-like` variable. Some `shape-like` variables are listed below:

- all inputs of Range
- 2nd input of Reshape
- 2nd input of Unsqueeze
- 1st input of ConstantOfShape
- 2nd-to-last inputs of Slice

For example, the comment below explains the desired CPU ops for a `Reshape`.

This PR fixes my llama model + AtenOp. The running time is reduced from 4.x sec to 0.6 sec. The side effect of this change on other graph transformers is still unclear, so it's off by default. To enable it, set `ORT_AGGRESSIVE_CPU_FALLBACK=1`. Ideally, we should fall back all `small computation nodes` (nodes with small inputs/outputs) to CPU, but shape information is not available for each `NodeArg`. We should also improve shape inference in ORT in the future.

The old `GetCpuPreferredNodes` traverses the graph `topologically` from CPU-output-`generating` nodes and tries to place downstream nodes on CPU when possible. This PR is different since it traverses the graph `reversely topologically`, starting with CPU-output-`consuming` nodes.
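As a usage note, the opt-in is a plain environment variable; a minimal sketch of the check (the variable name comes from the description above, the parsing details are illustrative):

```cpp
#include <cstdlib>
#include <cstring>

// Returns true when the user opted into the aggressive CPU fallback path.
bool AggressiveCpuFallbackEnabled() {
  const char* value = std::getenv("ORT_AGGRESSIVE_CPU_FALLBACK");
  return value != nullptr && std::strcmp(value, "1") == 0;
}
```

Launched as, e.g., `ORT_AGGRESSIVE_CPU_FALLBACK=1 python train.py`, the new transform runs; without the variable, behavior is unchanged.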