Fix convergence for dolly+stage3 training (microsoft#17685)
### Fix convergence for dolly+stage3 training

In
[ZeROOffloadSubscriber](https://github.com/microsoft/onnxruntime/blob/216214b7d302cb504d1e5a647f65b6fe49c22dbb/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py#L359C7-L359C28),
we defined some PythonOps that take an input and return it in place, for
example:

https://github.com/microsoft/onnxruntime/blob/216214b7d302cb504d1e5a647f65b6fe49c22dbb/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py#L223C20-L223C20.
While this is legal in PyTorch, when ORT runs such a PythonOp it releases the
input OrtValue as soon as the op completes, allowing the underlying buffer to
be erased or overwritten. The PythonOp's returned OrtValue, however, still
points to that address, so reading from or writing to it may produce wrong
results or even undefined behavior.
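
To make the failure concrete, here is a minimal, hypothetical sketch of the
pattern involved: a `torch.autograd.Function` whose forward returns its input
tensor unchanged, so the output aliases the input buffer. The class name and
tensors are illustrative only, not the actual subscriber code.

```python
import torch


class PassThroughFunction(torch.autograd.Function):
    """Hypothetical PythonOp: returns its input tensor as-is (in place)."""

    @staticmethod
    def forward(ctx, x):
        # No copy is made: the output aliases the input buffer. If the
        # runtime releases the input OrtValue once this op completes, the
        # returned tensor points at memory that may be erased or reused.
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


x = torch.randn(4, requires_grad=True)
y = PassThroughFunction.apply(x)
assert y.data_ptr() == x.data_ptr()  # the output reuses the input buffer
```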


```
/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy.
  warnings.warn(f".rank-{get_rank()}: {message}")
  0%| | 1/1000 [00:04<1:15:08,  4.51s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0}
  0%| | 1/1000 [00:04<1:15:08,  4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7
  0%| | 2/1000 [00:04<31:53,  1.92s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
[... the same lr warning and {'loss': 0.0, ...} lines repeat for steps 3-17 ...]
  2%| | 17/1000 [00:06<01:59,  8.22it/s]Traceback (most recent call last):
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module>
    main()
  File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train
    return inner_training_loop(
  File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop
    self.deepspeed.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step
    self._take_model_step(lr_kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step
    self.optimizer.step()
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step
    if self._overflow_check_and_loss_scale_update():
  File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update
    self._update_scale(self.overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
  2%| | 17/1000 [00:06<06:07,  2.67it/s]
[2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/onnxruntime/training/language-modeling/run_clm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-25_08:30:51
  host      : orttrainingdev10.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1065120)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/bert_ort/pengwa/py38) [email protected]@orttrainingdev10:/bert_ort/pengwa/optim
```

## The Fix

For outputs that reuse an input buffer without ORT being aware of it, we now
detect the reuse on the fly (in the first iteration, by comparing the output
tensor addresses against the input tensor addresses) and perform an implicit
copy before setting the tensor as the PythonOp's output.
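
A minimal sketch of that detection, assuming we hold the forward inputs, the
forward outputs, and the `tensor_reuse_map` recorded on the node (the helper
name and signature are illustrative, not the actual runner code):

```python
from typing import List

import torch


def clone_unexpected_aliases(
    inputs: List[torch.Tensor],
    outputs: List[torch.Tensor],
    tensor_reuse_map: List[int],
) -> List[torch.Tensor]:
    """tensor_reuse_map[i] is the input index that output i is declared to
    reuse, or -1 if no reuse is planned for that output."""
    # Record the input buffer addresses once, for O(1) aliasing checks.
    input_addresses = {t.data_ptr() for t in inputs}

    fixed_outputs = []
    for output_index, output in enumerate(outputs):
        declared = tensor_reuse_map[output_index]
        if declared == -1 and output.data_ptr() in input_addresses:
            # ORT's buffer plan does not expect this output to alias an
            # input, but it does: clone it so the value survives the
            # release of the input OrtValue.
            output = output.detach().clone()
        fixed_outputs.append(output)
    return fixed_outputs
```

The copy is made only when the declared reuse plan and the observed addresses
disagree, which is exactly the case the `tensor_reuse_map` warning in the log
above reports.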


With this fix (left: PyTorch, right: ORT):


![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)
pengwa authored and kleiti committed Mar 22, 2024
1 parent 26533f7 commit b36efed
Showing 11 changed files with 657 additions and 155 deletions.
85 changes: 53 additions & 32 deletions orttraining/orttraining/core/framework/torch/torch_proxy.cc
@@ -10,9 +10,7 @@
#include "orttraining/core/framework/torch/gil.h"
#include "core/platform/env.h"

namespace onnxruntime {
namespace language_interop_ops {
namespace torch {
namespace onnxruntime::language_interop_ops::torch {

void PythonObjectDeleter(PyObject* ptr) { Py_XDECREF(ptr); };

@@ -130,6 +128,18 @@ PyObject* CreateRequiresGradFlags(
return flags;
}

PyObject* CreateInplaceMap(
const std::vector<int64_t>& inplace_map) {
PyObject* inplace_map_obj = Ort_PyList_New(inplace_map.size(), "inplace_map");

for (size_t output_index = 0; output_index < inplace_map.size(); ++output_index) {
PyObject* input_index = PyLong_FromLong(inplace_map[output_index]);
Ort_PyList_SetItem_NoIncref(inplace_map_obj, output_index, input_index, std::to_string(__LINE__));
}

return inplace_map_obj;
}

void InvokeRunner(
PyObject* callback_runner,
PyObject* args,
@@ -197,14 +207,15 @@ PythonObjectPtr CreatePythonCallArguments(
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
const bool is_training_mode,
const bool is_inplace,
const std::string& invoke_id) {
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
const std::string& func_name) {
ORT_ENFORCE(PyCallable_Check(callback), "Forward callback is not callable.");
// The number of variables before those of
// autograd.Function.apply and autograd.Function.backward.
// The extra variables are used to configure the launch
// forward and backward runners.
constexpr int64_t num_control_args = 6;
constexpr int64_t num_control_args = 7;

// All arguments created for Python call will be destroyed along with PythonObjectPtr.
PythonObjectPtr args(Ort_PyTuple_New(num_control_args + len, "forward_arguments_tuple"), PythonObjectDeleter);
@@ -216,11 +227,16 @@
Ort_PyTuple_SetItem_NoIncref(args.get(), 2, tensor_flags, "tensor_flags");
PyObject* is_training_mode_arg = is_training_mode ? Py_True : Py_False;
Ort_PyTuple_SetItem_Incref(args.get(), 3, is_training_mode_arg, "is_training_mode");
PyObject* is_inplace_arg = is_inplace ? Py_True : Py_False;
Ort_PyTuple_SetItem_Incref(args.get(), 4, is_inplace_arg, "is_inplace_mode");

PyObject* inplace_map_arg = CreateInplaceMap(inplace_map);
Ort_PyTuple_SetItem_NoIncref(args.get(), 4, inplace_map_arg, "inplace_map");

PyObject* kernel_invoke_id_arg = PyBytes_FromStringAndSize(invoke_id.c_str(), invoke_id.size());
Ort_PyTuple_SetItem_NoIncref(args.get(), 5, kernel_invoke_id_arg, "kernel_invoke_id_arg");

PyObject* func_name_arg = PyBytes_FromStringAndSize(func_name.c_str(), func_name.size());
Ort_PyTuple_SetItem_NoIncref(args.get(), 6, func_name_arg, "func_name_arg");

// Tensor inputs to call autograd.Function.apply or autograd.Function.backward.
for (size_t i = 0; i < tensor_args.size(); ++i) {
if (!tensor_args[i].has_value()) {
@@ -246,18 +262,19 @@
}

void Invoke(
const std::string& func_name,
PyObject* runner,
PyObject* callback,
const std::vector<int64_t>& requires_grads,
const std::vector<std::optional<OrtValue>>& tensor_args,
const std::vector<int64_t>& tensor_indices,
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues,
const bool is_training_mode,
const bool is_inplace,
const std::string& invoke_id) {
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues) {
const auto len = tensor_args.size() + obj_args.size();
CheckArguments(len, requires_grads, tensor_args, tensor_indices, obj_args, obj_indices);
RefCountTracker::GetInstance().Reset();
@@ -271,8 +288,9 @@ void Invoke(
obj_args,
obj_indices,
is_training_mode,
is_inplace,
invoke_id);
inplace_map,
invoke_id,
func_name);

RefCountTracker::GetInstance().DumpDetails("Before Invoke Python Call");
InvokeRunner(runner, args.get(), is_training_mode, diff_ctx, returned_ortvalues);
@@ -282,17 +300,18 @@
}

void TorchProxy::Forward(
const std::string& func_name,
void* callback,
const std::vector<int64_t>& requires_grads,
const std::vector<std::optional<OrtValue>>& tensor_args,
const std::vector<int64_t>& tensor_indices,
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues,
const bool is_training_mode,
const bool is_inplace,
const std::string& invoke_id) {
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues) {
// Semantically, this lock uniquely takes the ownership of TorchProxy
// so that there will be only one of TorchProxy::Forward TorchProxy::Backward
// can be run at one time.
@@ -301,29 +320,31 @@
GilGuard guard;
auto runner = OrtTorchFunctionPool::GetInstance().GetForwardRunner();
Invoke(
func_name,
runner,
reinterpret_cast<PyObject*>(callback),
requires_grads,
tensor_args,
tensor_indices,
obj_args,
obj_indices,
diff_ctx,
returned_ortvalues,
is_training_mode,
is_inplace,
invoke_id);
inplace_map,
invoke_id,
diff_ctx,
returned_ortvalues);
}

void TorchProxy::Backward(
const std::string& func_name,
void* callback,
const std::vector<std::optional<OrtValue>>& tensor_args,
const std::vector<int64_t>& tensor_indices,
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
std::vector<OrtValue>& returned_ortvalues,
const bool is_inplace,
const std::string& invoke_id) {
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
std::vector<OrtValue>& returned_ortvalues) {
// Semantically, this lock uniquely takes the ownership of TorchProxy
// so that there will be only one of TorchProxy::Forward TorchProxy::Backward
// can be run at one time.
@@ -336,19 +357,19 @@
const auto all_input_count = tensor_args.size() + obj_args.size();
const std::vector<int64_t> requires_grads(all_input_count, 0);
Invoke(
func_name,
runner,
reinterpret_cast<PyObject*>(callback),
requires_grads,
tensor_args,
tensor_indices,
obj_args,
obj_indices,
nullptr /* context to store */,
returned_ortvalues,
true /* is_training_mode */,
is_inplace,
invoke_id);
inplace_map,
invoke_id,
nullptr /* context to store */,
returned_ortvalues);
}
} // namespace torch
} // namespace language_interop_ops
} // namespace onnxruntime

} // namespace onnxruntime::language_interop_ops::torch
16 changes: 9 additions & 7 deletions orttraining/orttraining/core/framework/torch/torch_proxy.h
@@ -37,27 +37,29 @@ class TorchProxy {
};

void Forward(
const std::string& func_name,
void* callback,
const std::vector<int64_t>& requires_grads,
const std::vector<std::optional<OrtValue>>& tensor_args,
const std::vector<int64_t>& tensor_indices,
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues,
const bool is_training_mode,
const bool is_inplace,
const std::string& invoke_id);
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
void** diff_ctx,
std::vector<OrtValue>& returned_ortvalues);

void Backward(
const std::string& func_name,
void* callback,
const std::vector<std::optional<OrtValue>>& tensor_args,
const std::vector<int64_t>& tensor_indices,
const std::vector<void*>& obj_args,
const std::vector<int64_t>& obj_indices,
std::vector<OrtValue>& return_args,
const bool is_inplace,
const std::string& invoke_id);
const std::vector<int64_t>& inplace_map,
const std::string& invoke_id,
std::vector<OrtValue>& return_args);

private:
TorchProxy(){};
1 change: 0 additions & 1 deletion orttraining/orttraining/core/graph/gradient_builder.cc
@@ -1765,7 +1765,6 @@ IMPLEMENT_GRADIENT_BUILDER(GetPythonOpGradient) {
ORT_ENFORCE(utils::HasString(src_attrs.at("func_name")));
attrs.push_back(MakeAttribute("func_name", src_attrs.at("func_name").s()));
attrs.push_back(MakeAttribute("output_convention", src_attrs.at("input_convention").s()));
attrs.push_back(MakeAttribute("inplace", src_attrs.at("inplace").i()));

// input_tensor_types[i] store the type of autograd.Function.apply's ith output.
// Note that PythonOpGrad's 0-th input is the Python context generated by PythonOp.
29 changes: 20 additions & 9 deletions orttraining/orttraining/core/graph/training_op_defs.cc
@@ -3908,10 +3908,16 @@ Return true if all elements are true and false otherwise.
AttributeProto::INTS)
// Other attributes.
.Attr(
"inplace",
"Indicate if the output should reuse input memory.",
AttributeProto::INT,
static_cast<int64_t>(0))
"tensor_reuse_map",
"A int array indicating whether output at each index is reusing specific input or not."
"If the given index is -1, it means the output is not reusing any input."
"For example, there are 2 tensor inputs and 3 tensor outputs (including ctx), "
"tensor_reuse_map = [-1, 1, 0] means"
"- the output 0 (ctx) don't reuse any input buffer."
"- the output 1 reuses the input 1."
"- the output 2 reuses the input 0.",
AttributeProto::INTS,
false)
.Attr(
"training_mode",
"Indicate if the model is exported in training_mode, by default, False.",
@@ -4033,11 +4039,6 @@ Return true if all elements are true and false otherwise.
"func_name",
"Name of custom class.",
AttributeProto::STRING)
.Attr(
"inplace",
"Indicate if the output should reuse input memory. Todo(pengwa): do we need it?",
AttributeProto::INT,
static_cast<int64_t>(0))
.Attr(
"input_tensor_types",
"Input types of autograd.Function.backward (including only tensor inputs)."
@@ -4069,6 +4070,16 @@
"A string inidicating autograd.Function.backward outputs's type."
"value 'c' - non-tensor output; value 'd' - tensor output.",
AttributeProto::STRING)
.Attr(
"tensor_reuse_map",
"A int array indicating whether output at each index is reusing specific input or not."
"If the given index is -1, it means the output is not reusing any input."
"For example, there are 3 inputs (including ctx) and 2 outputs, tensor_reuse_map = [2, 1] means"
"- the output 0 reuses the input 2."
"- the output 1 reuses the input 1."
"Be noted: the input 0 is ctx.",
AttributeProto::INTS,
false)
.Attr(
"comment",
"comment only for debugging purposes.",
orttraining/orttraining/python/training/ortmodule/_custom_autograd_function_exporter.py
@@ -118,7 +118,6 @@ def _export_pt_1_10(g, n, *args, **kwargs):
"wrap exportable sub-nn.Module's as ORTModule."
)

inplace = kwargs["inplace"]
# TODO move to public API once the exporter team exposes that
training_mode = None
if get_runtime_pytorch_version() >= version.parse("1.12"):
@@ -260,7 +259,6 @@ def _export_pt_1_10(g, n, *args, **kwargs):

attrs = {
"func_name_s": func_full_qual_name,
"inplace_i": inplace,
"input_convention_s": cconv,
"outputs": n.outputsSize(),
"input_tensor_types_i": input_tensor_types,