
[Experimental] Add a path to fallback more nodes to CPUs. #19769

Closed
wants to merge 6 commits

Conversation

wschin (Contributor) commented on Mar 4, 2024:

Shape-related subgraphs don't always start with Shape or Size nodes. In a dynamo-captured ONNX model, such a subgraph can start from a graph input. This PR adds a new transform that falls back to CPU all nodes reachable by reverse traversal from a shape-like variable. Some shape-like variables are listed below.

  • all inputs of Range
  • 2nd input of Reshape
  • 2nd input of Unsqueeze
  • 1st input of ConstantOfShape
  • 2nd to 5th inputs of Slice (starts, ends, axes, steps).

For example, the comments below mark which ops should run on CPU to feed a Reshape.

def model(x, dim0, dim1, ...):
  # Should fall back to CPU
  dim2 = dim0 + dim1
  # Should fall back to CPU
  new_shape = opset.Concat(dim0, dim2)
  # On GPU
  new_x = opset.Reshape(x, new_shape)
  return new_x

This PR fixes my llama model + AtenOp: the running time drops from 4.x sec to 0.6 sec. The side effects of this change on other graph transformers are still unclear, so it's off by default; to enable it, set ORT_AGGRESSIVE_CPU_FALLBACK=1. Ideally, we would fall back every small computation node (nodes with small inputs/outputs) to CPU, but shape information is not available for every NodeArg. We should also improve shape inference in ORT in the future.
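For reference, here is a minimal sketch of enabling the flag from Python. The environment-variable name comes from this PR; the model path and provider list are placeholders, and setting the variable before session creation is an assumption about when it is read:

import os

# Assumption: the flag is read when the InferenceSession is created,
# so it must be set beforehand. The variable name is from this PR.
os.environ["ORT_AGGRESSIVE_CPU_FALLBACK"] = "1"

import onnxruntime as ort

# "model.onnx" and the provider list are placeholders for this sketch.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)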

The old GetCpuPreferredNodes traverses the graph in topological order from nodes that generate CPU outputs and tries to place downstream nodes on CPU when possible. This PR is different: it traverses the graph in reverse topological order, starting from nodes that consume CPU outputs.
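As a rough illustration only (not the actual ORT code), the new traversal boils down to something like the following on a toy in-memory graph. The dict-based node representation, the producer_of map, and the SHAPE_INPUT_INDICES table are simplifications of the C++ implementation:

from collections import deque

# Illustrative subset of the op -> shape-like input indices table.
SHAPE_INPUT_INDICES = {
    "Range": [0, 1, 2],
    "Reshape": [1],
    "Unsqueeze": [1],
    "ConstantOfShape": [0],
    "Slice": [1, 2, 3, 4],
}

def shape_related_nodes(nodes, producer_of):
    # nodes: list of dicts like {"name": "concat_0", "op": "Concat", "inputs": ["dim0", "dim2"]}.
    # producer_of: maps a tensor name to its producing node dict; graph inputs
    # and initializers have no entry.
    # Seed the traversal with the producers of shape-like inputs,
    # e.g. the Concat feeding Reshape's 2nd input.
    start = []
    for node in nodes:
        for idx in SHAPE_INPUT_INDICES.get(node["op"], []):
            if idx < len(node["inputs"]):
                producer = producer_of.get(node["inputs"][idx])
                if producer is not None:
                    start.append(producer)

    # Reverse traversal: every node upstream of a shape-like value is shape-related.
    cpu_nodes, queue = set(), deque(start)
    while queue:
        node = queue.popleft()
        if node["name"] in cpu_nodes:
            continue
        cpu_nodes.add(node["name"])
        if node["op"] == "Shape":
            continue  # the real transform also stops expanding at Shape nodes
        for tensor in node["inputs"]:
            producer = producer_of.get(tensor)
            if producer is not None:
                queue.append(producer)
    return cpu_nodes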

wschin force-pushed the wechi/more-cpu-fallback-for-shape branch 3 times, most recently from 8da6b51 to ffa61d7 on March 5, 2024 00:47
Shape-related nodes don't only start with `Shape` or `Size`.
In a dynamo-captured ONNX model, they can start from a graph input.
A new transform is added to fall back `all` nodes that can be
reversely traversed from a `shape-like` variable. Some
`shape-like` variables are listed below.
- all inputs of Range
- 2nd input of Reshape
- 2nd input of Unsqueeze
- 1st input of ConstantOfShape
- 2nd-to-last inputs of Slice.

Fix header

Remove unused variable

Versioning shape inputs

Fix
wschin force-pushed the wechi/more-cpu-fallback-for-shape branch from ffa61d7 to ed79ec7 on March 5, 2024 01:38
wschin force-pushed the wechi/more-cpu-fallback-for-shape branch from 08ac5f3 to f896fb8 on March 5, 2024 22:52
wschin force-pushed the wechi/more-cpu-fallback-for-shape branch 3 times, most recently from e9de2b8 to 5abdb86 on March 6, 2024 18:37
Fix typo

Write to fixed place

Remove unused imports

run it

Fix

Change test location
wschin force-pushed the wechi/more-cpu-fallback-for-shape branch from 5abdb86 to af9319b on March 6, 2024 19:08
wschin closed this on Mar 6, 2024
wschin reopened this on Mar 6, 2024
wschin requested a review from a team as a code owner on March 6, 2024 23:45
@@ -39,6 +39,7 @@ steps:
timeoutInMinutes: 60

# Entry point for all ort training api tests
# TODO: move onnxscript installation to CI image.
wschin (Contributor, Author) commented:
I am not sure when it will be the right time as onnxscript is a relatively new tool.


std::unordered_map<std::string, std::unordered_map<int64_t, std::vector<size_t>>> shape_related_inputs_in_nodes = {
// 2nd input of Expand-13 is a shape-related input.
{"Expand", {{13 /* since version */, {1} /* shape inputs' indices */}}},
hariharans29 (Member) commented on Mar 7, 2024:
Instead of reverse traversal from this pre-specified list of ops (which requires periodic maintenance: updating for new ops added to the ONNX standard, op version revisions, shape-input indices across version revisions, etc.), could the reverse traversal start from a provider-assigned node that requires a specific input on CPU? Usually any input a provider node needs on CPU is "shape like", and this information is available in the node's kernel def. That seems like a more "automated" way than the pre-cooked list approach.
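A hypothetical sketch of that suggestion, reusing the toy graph structures from the illustration in the PR description. cpu_input_indices_from_kernel_def stands in for whatever kernel-def lookup ORT would actually use and is not a real API:

def seed_from_kernel_defs(nodes, producer_of, cpu_input_indices_from_kernel_def):
    # cpu_input_indices_from_kernel_def: hypothetical callable mapping a node to
    # the indices of inputs its assigned kernel requires on CPU.
    start = []
    for node in nodes:
        for idx in cpu_input_indices_from_kernel_def(node):
            producer = producer_of.get(node["inputs"][idx])
            if producer is not None:
                start.append(producer)
    # `start` would then feed the same reverse traversal as in the sketch above,
    # removing the need for a hand-maintained op table.
    return start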

snnn previously approved these changes on Mar 7, 2024
// shape = onnx::Concat(s0, s1)
// reshaped = onnx::Reshape(x, shape)
// Then, the shape-producing node is Concat.
std::unordered_set<const Node*> shape_producing_nodes;
A reviewer (Member) commented:
InlinedHashSet

// 2. finds `shape` is a shape-related variable since Reshape's 2nd input is a shape-related input,
// 3. and then records the producer of `shape` (i.e., `Concat`).
for (auto& input_index : shape_input_indices) {
auto input = node.InputDefs().at(input_index);
A reviewer (Member) commented:
Should we check if this is an initializer?

wschin (Contributor, Author) replied:
What is the difference? From the perspective of finding shape-related nodes, a graph input and an initializer are the same. I am not sure if ORT has different assumptions somewhere.

// Stop the traversal when a "Shape" node is found.
graph.ReverseDFSFrom(
start_nodes,
[&shape_related_node_indices](const Node* n) {
A reviewer (Member) commented:
Are there nodes where the shape is just one of the outputs, but the rest of the computation should be done on the device?

wschin (Contributor, Author) replied:
I am not aware of any examples. If you are looking for an op producing both CPU and GPU outputs, attention could be a case: it may want to pass the forward pass's random seed (an int64 scalar) to the backward pass.

@@ -39,6 +43,132 @@ static bool IsSmallInitializer(const onnxruntime::GraphViewer& graph, const Node
}
} // namespace

std::unordered_set<NodeIndex> GetShapeRelatedNodes(const onnxruntime::GraphViewer& viewer) {
// Conceptually, this function traverses from shape-consuming nodes
// to fall back all their upstream nodes to CPU. Consider a graph
hariharans29 (Member) commented on Mar 7, 2024:
Maybe add some TODOs to enhance this for situations where it won't work:

(1) There is no shape "consumer" at all, i.e., the "shape like" output eventually becomes a graph output (a rare corner case, but there are definitely models like this).

(2) Cases where the shape subgraph is split across graph levels: the main graph has some of the shape nodes and a subgraph has the rest. In that case the "shape consumer" at the main-graph level will be a subgraph-containing node (If/Loop/Scan), and the shape info may be consumed explicitly (as a graph input to If/Loop/Scan) or implicitly (not as an explicit graph input, but via some node in the subgraph referencing the main-graph node output(s)).

// 1st input of ConstantOfShape is a shape-related input.
{"ConstantOfShape", {{9, {0}}, {20, {0}}, {21, {0}}}},
// 2nd to 5th inputs of Slice-13 are shape-related inputs.
{"Slice", {{13, {1, 2, 3, 4}}}}};
A reviewer (Member) commented:
I don't know if operator Range is inlined but it could be considered as consuming a shape as well.

to_stop);
}

return shape_related_node_indices;
A reviewer (Member) commented:
What happens if a shape input is on CUDA when this algorithm is moved to CPU?

wschin (Contributor, Author) replied:
It will fall back the producer of the shape input and its upstream nodes to CPU.

wschin (Contributor, Author) commented on Mar 13, 2024:

Force-pushed and now I can't update this branch anymore. #19875 continues the work.
