
[Training] qat #18534

Open
xll426 opened this issue Nov 21, 2023 · 6 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), training (issues related to ONNX Runtime training; typically submitted using template)

Comments


xll426 commented Nov 21, 2023

Describe the issue

RuntimeError: /onnxruntime_src/orttraining/orttraining/core/optimizer/qdq_fusion.cc:25 int onnxruntime::{anonymous}::ReplaceOrCreateZeroPointInitializer(onnxruntime::Graph&, onnxruntime::Node&) zero_point_tensor_int != nullptr was false. Expected: zero point initializer with name input-0_zero_point to be present in the graph. Actual: not found.

To reproduce

_ = mnist_with_loss(*[output.name for output in onnx_model.graph.output])

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

PyTorch Version

1.13.1

Execution Provider

CUDA

Execution Provider Library Version

cuda11.6

xll426 added the training label Nov 21, 2023
github-actions bot added the ep:CUDA label Nov 21, 2023
xadupre (Member) commented Nov 21, 2023

Could you share more about how you got this error? I assume you called the function create_training_artifacts. Was it a quantized model?

xll426 (Author) commented Nov 22, 2023

When I run the qat.py script directly, it reports this error.

import logging
import os

import onnx
from onnxruntime.training import onnxblock


def create_training_artifacts(model_path, artifacts_dir, model_prefix):
    """Using onnxblock, this function creates the training artifacts for the model at the path provided.

    The artifacts created can be used to train the model using onnxruntime.training.api. The artifacts are:
    1. The training graph
    2. The eval graph
    3. The optimizer graph
    4. The checkpoint file
    """

    class MNISTWithLoss(onnxblock.TrainingBlock):
        def __init__(self):
            super().__init__()
            self.loss = onnxblock.loss.CrossEntropyLoss()

        def build(self, output_name):
            return self.loss(output_name)

    mnist_with_loss = MNISTWithLoss()
    print(model_path)
    onnx_model, eval_model, optimizer_model = onnx.load(model_path), None, None

    # Build the training and eval graphs
    logging.info("Using onnxblock to create the training artifacts.")

    # Debugging helper, currently unused:
    # def traverse_nodes(graph):
    #     for node in graph.node:
    #         print(node.name)
    #         if node.name == "input-0_zero_point":
    #             print("Node Name:", node.name)
    #             print("Op Type:", node.op_type)
    #             print("Input(s):", list(node.input))
    #             print("Output(s):", list(node.output))
    #             print("Attributes:")
    #             for attribute in node.attribute:
    #                 print(f"  {attribute.name}: {attribute}")

    with onnxblock.base(onnx_model):
        # Traverse all nodes (debugging):
        # traverse_nodes(onnx_model.graph)
        _ = mnist_with_loss(*[output.name for output in onnx_model.graph.output])
        training_model, eval_model = mnist_with_loss.to_model_proto()

    # Build the optimizer graph
    optimizer = onnxblock.optim.AdamW()
    with onnxblock.empty_base():
        _ = optimizer(mnist_with_loss.parameters())
        optimizer_model = optimizer.to_model_proto()

    # Create the training artifacts
    train_model_path = os.path.join(artifacts_dir, f"{model_prefix}_train.onnx")
    logging.info(f"Saving the training model to {train_model_path}.")
    onnx.save(training_model, train_model_path)  # save the built training graph, not the original model
    eval_model_path = os.path.join(artifacts_dir, f"{model_prefix}_eval.onnx")
    logging.info(f"Saving the eval model to {eval_model_path}.")
    onnx.save(eval_model, eval_model_path)
    optimizer_model_path = os.path.join(artifacts_dir, f"{model_prefix}_optimizer.onnx")
    logging.info(f"Saving the optimizer model to {optimizer_model_path}.")
    onnx.save(optimizer_model, optimizer_model_path)
    trainable_params, non_trainable_params = mnist_with_loss.parameters()
    checkpoint_path = os.path.join(artifacts_dir, f"{model_prefix}_checkpoint.ckpt")
    logging.info(f"Saving the checkpoint to {checkpoint_path}.")
    onnxblock.save_checkpoint((trainable_params, non_trainable_params), checkpoint_path)

    return train_model_path, eval_model_path, optimizer_model_path, checkpoint_path

xll426 (Author) commented Nov 22, 2023

(screenshot of the error attached)

xll426 (Author) commented Nov 22, 2023

It is a quantized model. We use this model for Quantization-Aware Training (QAT), following https://github.com/microsoft/onnxruntime/tree/v1.16.2/orttraining/orttraining/test/python/qat_poc_example

baijumeswani (Contributor) commented

@xll426 QAT in ORT is currently in an experimental phase. It is known that the feature is not complete yet. I will find some time to fix the POC. Sorry about your experience.

baijumeswani (Contributor) commented

https://github.com/microsoft/onnxruntime/pull/19290 should fix this. Sorry for the late response and fix.
