Support LoRA and Control Net in Stable Diffusion demo #18593

Merged
merged 5 commits into from
Nov 28, 2023

Conversation

tianleiwu
Contributor

@tianleiwu tianleiwu commented Nov 28, 2023

Description

(1) Export onnx model with LoRA weights for both SD 1.5 and SDXL
(2) Export onnx model with Control Net for both SD 1.5 and SDXL. For SD 1.5, it is allowed to use multiple control nets. For SDXL, at most one control net is supported right now.
(3) Add demo of LCM LoRA
(4) Add demo of control net.

Motivation and Context

@tianleiwu tianleiwu merged commit f13380f into main Nov 28, 2023
89 of 92 checks passed
@tianleiwu tianleiwu deleted the tlwu/sdxl_lora_controlnet branch November 28, 2023 23:46
wejoncy added a commit that referenced this pull request Dec 5, 2023
commit e066fca7770987c9c2c91babca9d74e95291e39f
Author: Adrian Lizarraga <[email protected]>
Date:   Mon Dec 4 17:54:58 2023 -0800

    [Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465)
    Allows specifying a dictionary of tensor-level quantization overrides:
    ```
    TensorQuantOverrides = dictionary :
        Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
        list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
        per-channel quantization, the list contains a dictionary for each channel in the tensor.
        Each dictionary contains optional overrides with the following keys and values.
              'quant_type' = QuantType : The tensor's quantization data type.
              'scale' =  Float         : The scale value to use. Must also specify `zero_point` if set.
              'zero_point' = Int       : The zero-point value to use. Must also specify `scale` if set.
              'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if also
                                         set `scale` or `zero_point`.
              'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if also
                                         set `scale` or `zero_point`.
              'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                         Invalid if also set `scale` or `zero_point`.
              'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                         Invalid if also set `scale` or `zero_point`.
    ```

    - All of the options are optional.
    - Some combinations are invalid.
    - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
    are also specified.
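
The combination rules listed above can be sketched as a small check. This is an illustrative helper only: `validate_override` is a hypothetical name, not part of the quantization API, and it encodes nothing beyond the rules stated in the option description.

```python
def validate_override(override: dict) -> list:
    """Return a list of rule violations for one override dictionary.

    Hypothetical helper: it encodes only the rules stated in the option text.
    """
    problems = []
    has_scale = "scale" in override
    has_zero_point = "zero_point" in override

    # 'scale' and 'zero_point' must be given together.
    if has_scale != has_zero_point:
        problems.append("'scale' and 'zero_point' must be specified together")

    # These keys are invalid once an explicit scale or zero-point is given.
    if has_scale or has_zero_point:
        for key in ("symmetric", "reduce_range", "rmax", "rmin"):
            if key in override:
                problems.append(f"'{key}' is invalid when 'scale' or 'zero_point' is set")
    return problems


print(validate_override({"scale": 1.0, "zero_point": 127}))
print(validate_override({"scale": 1.0}))
print(validate_override({"scale": 0.5, "zero_point": 0, "rmax": 2.5}))
```

For per-channel overrides, the same check would be applied to each dictionary in the list.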

    Example for per-tensor quantization overrides:
    ```Python3
    extra_options = {
        "TensorQuantOverrides": {
            "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
            "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
            "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
        },
    }
    ```

    Example for per-channel quantization overrides (Conv weight and bias):
    ```Python3
    extra_options = {
        "TensorQuantOverrides": {
            "WGT": [
                {
                    "quant_type": quantization.QuantType.QUInt8,
                    "rmin": 0.0,
                    "rmax": 2.5,
                    "reduce_range": True,
                },
                {
                    "quant_type": quantization.QuantType.QUInt8,
                    "rmin": 0.2,
                    "rmax": 2.55,
                    "reduce_range": False,
                },
            ],
            "BIAS": [
                {"zero_point": 0, "scale": 0.000621},
                {"zero_point": 0, "scale": 0.23},
            ],
        },
    }
    ```
    Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
    that inspects the model and returns suitable quantization
    configurations.

    Example usage:
    ```python3
    from onnxruntime.quantization import quantize, QuantType
    from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

    qnn_config = get_qnn_qdq_config(input_model_path,
                                    data_reader,
                                    activation_type=QuantType.QUInt16,
                                    weight_type=QuantType.QUInt8)

    quantize(input_model_path,
             output_model_path,
             qnn_config)
    ```
    Make it possible to create more QDQ models that run on QNN EP.

    ---------

    Signed-off-by: adrianlizarraga <[email protected]>

commit 01b5c789177c2b062d4c4f9b6abdce12be9b3b64
Author: Tianlei Wu <[email protected]>
Date:   Mon Dec 4 16:03:47 2023 -0800

    Add SD-Turbo and refine diffusion demo (#18694)

    [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast
    generative text-to-image model that is distilled from [Stable Diffusion
    2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It is
    targeted for 512x512 resolution.

    1. Support sd-turbo model.
    1. Refine ControlNet in demo:
        + Cache the ControlNet model so that it is downloaded only once.
        + Do not download default images in script. Instead, update the document to
          use wget to download the example image.
        + Fix a control image processing issue that causes a shape mismatch in
          inference.
    1. Refine arguments:
        + Change argument --disable-refiner to --enable-refiner since the refiner is
          not used in most cases.
        + Rename --refiner-steps to --refiner_denoising_steps.
        + Add abbreviations for the most used arguments.
        + Add logic to set default arguments for different models.
    1. Refine torch model cache:
        + Share the cached torch model among different engines to save disk space.
        + Only download the fp16 model (previously, ORT_CUDA downloaded the fp32 model).
    1. Do not use VAE slicing when the image size is small.
    1. For LCM scheduler, allow guidance scale 1.0~2.0.
    1. Allow sdxl-turbo to use the refiner.

    Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on
    A100-SXM4-80GB.

    Batch | Steps | TRT 8.6 static | ORT_TRT static | ORT_CUDA static | TRT 8.6 dynamic | ORT_TRT dynamic | ORT_CUDA dynamic
    -- | -- | -- | -- | -- | -- | -- | --
    1 | 1 | 32.07 | 30.55 | 32.89 | 36.41 | 38.30 | 34.83
    4 | 1 | 125.36 | 97.40 | 97.49 | 118.24 | 114.95 | 99.10
    1 | 4 | 62.29 | 60.24 | 62.50 | 72.49 | 77.82 | 67.66
    4 | 4 | 203.51 | 173.11 | 168.32 | 217.14 | 215.71 | 172.53

    * Dynamic engine is built for batch size 1 to 8, image size 512x512 to
    768x768, optimized for batch size 1 and 512x512

commit d514a960eefc19fb69d54497b6b582cfdf6e85f1
Author: Edward Chen <[email protected]>
Date:   Mon Dec 4 13:38:36 2023 -0800

    Remove "Python Checks" pipeline status from readme as that pipeline no longer exists. (#18697)

commit c02a3861451a29d7a517dd4aaa82c239d2f34d2d
Author: Caroline Zhu <[email protected]>
Date:   Mon Dec 4 13:37:14 2023 -0800

    [js/web/training] Implemented runEvalStep & runOptimizerStep (#18259)
    * implemented runEvalStep and runOptimizerStep
    * added hasEvalModel and hasOptimizerModel boolean fields in
    TrainingSession representation
    * added evalInputNames and evalOutputNames fields to
    TrainingSessionHandler & TrainingSession
    * removed the inputNamesEncoded and outputNamesEncoded fields from
    TrainingSessionHandler -- since none of the training methods require the
    input names and output names as parameters, there's no need to store
    them.
    * part of the work for implementing web bindings for training
    * previous PR: #18250

    ---------

    Co-authored-by: Ashwini Khade <[email protected]>

commit 5353adcde37a118bdd25882482fd584c5ed3f343
Author: Jiajia Qin <[email protected]>
Date:   Tue Dec 5 05:18:37 2023 +0800

    [js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658)
    With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1,
    1, 16, 16] becomes 0.59ms from 6.64ms.
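
For reference, a naive transposed convolution with one input channel and one output channel reduces to a scatter-add of the scaled kernel. The sketch below uses NumPy on a single 2-D plane. It is not the WGSL shader from this PR: the function name is hypothetical, padding and dilation are omitted, and reading the shapes above as an 18x32 plane with a 16x16 kernel is an assumption.

```python
import numpy as np


def naive_conv_transpose_2d(x: np.ndarray, w: np.ndarray, stride: int = 1) -> np.ndarray:
    """Naive single-channel transposed convolution (no padding, no dilation)."""
    in_h, in_w = x.shape
    k_h, k_w = w.shape
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(in_h):
        for j in range(in_w):
            # Each input element adds a scaled copy of the kernel to the output.
            top, left = i * stride, j * stride
            out[top:top + k_h, left:left + k_w] += x[i, j] * w
    return out


x = np.ones((18, 32), dtype=np.float32)
w = np.ones((16, 16), dtype=np.float32)
print(naive_conv_transpose_2d(x, w).shape)  # (33, 47)
```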

commit a5b2291e0fe7c7d42f30154ccb20d6cde1380c3c
Author: trajep <[email protected]>
Date:   Tue Dec 5 04:26:50 2023 +0800

    [Transformer Optimization]Return model directly for unknown model type (#18642)

    This pull request improves the handling of unsupported model
    types in the optimization process.

commit 2f8b86b93906d0dd0549aca22798c660aa10db91
Author: Deoksang Kim <[email protected]>
Date:   Sat Dec 2 09:48:55 2023 +0900

    Fix typo in the TensorShape (#17813)

    The function name in the log should be SizeToDimension

commit 92ee664f64e96a8cc7308302a3e4f67f95254d1f
Author: Jiajia Qin <[email protected]>
Date:   Sat Dec 2 07:35:35 2023 +0800

    [js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661)
    Currently, for non-uniform variables, we still use the `array<u32, N>` type
    instead of `array<vec4<u32>, N1>`. So we can't always treat all variables
    with rank > 4 as uniforms to index.

    This PR fixes the errors below:
    ```
    error(s) generated while compiling the shader:
    :5:44 error: index 4 out of bounds [0..1]
                 return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]);
                                               ^
    FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32
    FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size()
    ```

commit eaaf27015e8d99c5a072caa40e0f4627f14a93e3
Author: Changming Sun <[email protected]>
Date:   Fri Dec 1 15:30:16 2023 -0800

    Remove EnvSetupScript parameter from win-ci.yml (#18662)
    This makes the code more consistent. Currently, some TRT pipelines download TRT
    binaries on the fly, while other TRT pipelines use a preinstalled
    version. This PR makes them the same.

commit 9c45fe4957ff3d027b5024abb170947db2cb0408
Author: Rachel Guo <[email protected]>
Date:   Fri Dec 1 14:47:46 2023 -0800

    Fix macos xcframework test stage codesign info (#18649)
    Remove the development ID and force code signing to be not required in the
    test macOS target.

    Fix a failure that occurred in the iOS_Full_xcframwork stage of the
    Zip-Nuget-Java-NodeJS packaging pipeline.

    ---------

    Co-authored-by: rachguo <[email protected]>

commit a3538056314c10c1c4d5b769e86426434d486322
Author: Edward Chen <[email protected]>
Date:   Fri Dec 1 13:49:45 2023 -0800

    Fix Windows TVM CI workflow (#18667)

    Fix issue with installing LLVM dependency.

commit b22f49ff35b3c7b3ae339128e21898810e4c2919
Author: Edward Chen <[email protected]>
Date:   Fri Dec 1 09:41:25 2023 -0800

    Fix unit tests failures in build with contrib ops disabled (#18659)

    Fix unit tests failures in build with contrib ops disabled.
    - QDQTransformerTests.QDQPropagation_GH11605_Opset12_19
    - TransposeOptimizerTests.QnnTransposeNonConstBroadcastInput

commit fcea2cb7f184d608efa1e5c72f9e25072e82009d
Author: Bowen Bao <[email protected]>
Date:   Fri Dec 1 09:36:18 2023 -0800

    [Dort] Run type promotion pass to resolve dtype discrepancy (#18516)

    Fixes CI failures mentioned in #18507

    But we should not keep two separate dort impls in both pytorch and
    onnxruntime. They are out of sync.

commit 05a9c957647b3cae0d2ad305950c14bf5f305bc8
Author: snadampal <[email protected]>
Date:   Fri Dec 1 11:16:44 2023 -0600

    [DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847)

    Add ACL as the DNNL runtime option for aarch64 platforms. Update
    makefile and the python wheel build script.
    This is to enable the optimized ACL gemm kernels for dnnl execution
    provider on aarch64 platform.

commit d69842226b47e5336568103541b071447caeb9bf
Author: Jian Chen <[email protected]>
Date:   Fri Dec 1 07:57:46 2023 -0800

    Update the template files to correct stage to fix the python cuda 12 packaging pipeline (#18651)

commit 182c525416eb5cbace8df52b6809a77ffc91545d
Author: guyang3532 <[email protected]>
Date:   Fri Dec 1 19:27:50 2023 +0800

    Support MatMulBnb4 in PaddingElimination (#18646)

    Also support the Cast pattern between the input and embedding node for
    sparsity inspection.

commit ccfea559428b1374d0109bfaacc273ce11f4ef3c
Author: Hector Li <[email protected]>
Date:   Thu Nov 30 21:09:13 2023 -0800

    [QNN EP] Enable QNN HTP VTCM size setting (#18653)
    [QNN EP] Enable QNN HTP VTCM size setting

commit 9c9e6adeb2f31c73cebd7e92622c86f084858f68
Author: Tianlei Wu <[email protected]>
Date:   Thu Nov 30 18:19:31 2023 -0800

    Add SDXL Turbo to demo (#18627)

    * Add SDXL Turbo to the demo.
    * Change default scheduler to EulerA for XL or Turbo since DDIM does not
    work well with small steps.

    Example to run the model in demo (See README for instructions):
    ```
    python3 demo_txt2img_xl.py --version xl-turbo --height 512 --width 512 --denoising-steps 1 --scheduler UniPC "little cute gremlin sitting on a bed, cinematic"
    ```

commit c7732a78d7e815de489fed22cfee610a445b9ca2
Author: Wanming Lin <[email protected]>
Date:   Fri Dec 1 09:47:56 2023 +0800

    [WebNN EP] Fixed bug in op checking (#18638)

commit 73d9b035090a2bd4e56252dee10174d3f01e5f6f
Author: Xu Xing <[email protected]>
Date:   Fri Dec 1 09:10:33 2023 +0800

    [js/webgpu] Add multidimensional(>4) uniform support (#18546)

    This change removes the check of enableShapesUniforms. When all uses of
    this are removed, enableShapesUniforms can be removed too.

commit 73a2eb82eb9364b4dea8df2cd6a46affd008b15c
Author: Wanming Lin <[email protected]>
Date:   Fri Dec 1 08:19:22 2023 +0800

    Fixed bug in Flatten's axis (#18645)

    Flatten's axis is in the range [-r, r] rather than [-r, r-1].
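
The accepted range can be illustrated with a short shape calculation that follows the ONNX Flatten definition (dimensions before `axis` form the rows, the remaining dimensions form the columns). The helper name `flatten_shape` is hypothetical, and this is a sketch rather than the code changed in this PR.

```python
import math


def flatten_shape(shape, axis):
    """Output shape of ONNX Flatten for an input of rank r = len(shape)."""
    r = len(shape)
    # axis == r is valid: every dimension goes into the first output dimension.
    if not -r <= axis <= r:
        raise ValueError(f"axis {axis} is outside [-{r}, {r}]")
    if axis < 0:
        axis += r
    # math.prod of an empty sequence is 1, which covers axis == 0 and axis == r.
    return (math.prod(shape[:axis]), math.prod(shape[axis:]))


print(flatten_shape((2, 3, 4), 1))   # (2, 12)
print(flatten_shape((2, 3, 4), 3))   # (24, 1)
print(flatten_shape((2, 3, 4), -3))  # (1, 24)
```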

commit 6781b6cf3d4708e32e6bd546afa5b2b785290270
Author: Jiajia Qin <[email protected]>
Date:   Fri Dec 1 07:47:08 2023 +0800

    [js/webgpu] add bool type for Expand/Gather (#18615)
    In the [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
    Expand is used with bool type, which runs on the CPU EP.

    | Kernel | Shape | Provider |
    | -------- | ------- | ------- |
    | Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | CPUExecutionProvider |

    After this change, it will run on jsep.

    | Kernel | Shape | Provider |
    | -------- | ------- | ------- |
    | Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | JsExecutionProvider |

commit efee9abdb72f73163943df80f0e6db1f5c23c42c
Author: Yi Zhang <[email protected]>
Date:   Fri Dec 1 07:44:44 2023 +0800

    Reduce downloads in Nuget-Java pipeline to reduce connection exception (#18635)
    1. Add a new stage to download java tools from https://oss.sonatype.org
    and publish them to pipeline artifact
    2. Remove downloads in other jobs, they get the java tools from pipeline
    artifact
    3. Consolidate final_java_testing stages.

    This reduces downloads in order to reduce connection errors like the one below.

    ```
    --2023-11-28 07:16:31--  https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar
    Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23
    Connecting to oss.sonatype.org (oss.sonatype.org)|3.227.40.198|:443... connected.
    HTTP request sent, awaiting response... 502 Bad Gateway
    2023-11-28 07:16:32 ERROR 502: Bad Gateway.
    ```

commit 4025bd8ebdda49331af45c7632cb5975fedf69c2
Author: zesongw <[email protected]>
Date:   Fri Dec 1 04:59:36 2023 +0800

    [WebNN EP] Fix bug of padding in Op ConvTranspose (#18577)

    Get the dimensions of H and W according to the layout.

commit b1e749e3beb8fe543500f7ba51ddc9754639525d
Author: Jiajia Qin <[email protected]>
Date:   Fri Dec 1 04:57:29 2023 +0800

    [js/webgpu] Add program name into webgpuProfiling info (#18640)
    Currently, we only print the kernelName, which is hard to distinguish
    which shader we actually used. For example, GroupedConv/Conv2DMatMul
    both belong to Conv kernel. It's not intuitive for profiling.

commit c5ea1547c6d1070e6b6296fbf8e6d681107b8c7f
Author: Dmitri Smirnov <[email protected]>
Date:   Thu Nov 30 10:50:24 2023 -0800

    Eliminate intermediate string conversion buffer. (#18608)
    Make use of the unsafe string constructor, which can convert a native
    UTF-8 string straight into the string instance buffer.
    This reduces garbage.

commit e7f64f4510483bf0a94ce46478f02ead8d70e0d2
Author: Yulong Wang <[email protected]>
Date:   Thu Nov 30 09:50:47 2023 -0800

    [js/web] fix ESLint by excluding generated .js from tsconfig.json (#18634)
    ESLint sometimes fails with an error.

    The root cause is that some large generated JavaScript files in the
    tsconfig's include path cause the TypeScript parser to fail in a line of
    `string.match()` with a regex on a huge string (~8MB), causing the
    following error:
    ```
    RangeError: Maximum call stack size exceeded
    ```

    The solution is to remove the large files from the tsconfig's include
    path. Previously I excluded the `web/dist/` folder and this PR excludes
    `web/test/ort.test[.min].js`.

commit 23a91c8ba889d77589d6acf44fa9e9bce5fbb701
Author: Changming Sun <[email protected]>
Date:   Thu Nov 30 08:07:47 2023 -0800

    Fix warning C4003 in ORT python binding code (#18612)
    Fix warning C4003 in ORT python binding code.
    It's better to fix the warning instead of suppressing it.

commit 1b5675ff0fc7b2d9894ef06a7727efe0aad7cbd2
Author: Changming Sun <[email protected]>
Date:   Thu Nov 30 08:07:13 2023 -0800

    Update post-merge-jobs.yml: increase timeout value for the Ios job (#18602)

commit 148495ebc55827c8c521ea41493052ddbc428ab2
Author: Vincent Wang <[email protected]>
Date:   Thu Nov 30 20:17:22 2023 +0800

    [ORTModule] Use Default Topo-order for GraphViewer (#18410)

    ORT's default topo-order is a reversed DFS algorithm, while the
    priority-based topo-order is a forward BFS algorithm. It's likely that
    the default order is better than priority-based order on memory because
    tensor memory is more likely to be released right after it's consumed.

    Currently ORTModule uses the priority-based order. For some models, it sorts
    lots of small Ops to the beginning, which introduces a big CPU overhead at
    the beginning (see the screenshot below). This PR uses the default order
    for training. The priority-based order is heavily used for some
    recompute optimizations, so if recompute is enabled, we still
    use the priority-based order.

    This PR also adds an optimization to the default order, which is to move
    all Shape/Size Ops to right after their parent nodes. This makes
    sure the shape and size nodes are executed right after their parents so
    that the input tensor memory can be released as soon as
    possible. This is especially important for non-CPU devices or for
    the training case where some gradient graphs use only the shape/size of
    tensors from forward.
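
The difference between the two traversal styles can be seen on a toy graph. The sketch below is not ORT's GraphViewer implementation: it ignores node priorities and the Shape/Size adjustment, and the function names and the `inputs_of` graph encoding are hypothetical.

```python
from collections import deque


def build_consumers(inputs_of):
    """Invert a node -> producers mapping into node -> consumers."""
    consumers = {node: [] for node in inputs_of}
    for node, producers in inputs_of.items():
        for producer in producers:
            consumers[producer].append(node)
    return consumers


def reversed_dfs_order(inputs_of):
    """Walk backwards from sink nodes; emit a node after all of its producers."""
    consumers = build_consumers(inputs_of)
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for producer in inputs_of[node]:
            visit(producer)
        order.append(node)

    for node in inputs_of:
        if not consumers[node]:  # sink node (graph output)
            visit(node)
    return order


def forward_bfs_order(inputs_of):
    """Kahn-style order: emit nodes as soon as all of their producers are done."""
    consumers = build_consumers(inputs_of)
    pending = {node: len(producers) for node, producers in inputs_of.items()}
    queue = deque(node for node, count in pending.items() if count == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for consumer in consumers[node]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                queue.append(consumer)
    return order


# Two independent chains (a1 -> a2, b1 -> b2) that both feed "out".
graph = {"a1": [], "a2": ["a1"], "b1": [], "b2": ["b1"], "out": ["a2", "b2"]}
print(reversed_dfs_order(graph))  # ['a1', 'a2', 'b1', 'b2', 'out']
print(forward_bfs_order(graph))   # ['a1', 'b1', 'a2', 'b2', 'out']
```

In the reversed-DFS order each chain finishes before the next one starts, so the output of `a1` is consumed immediately. The BFS order interleaves the chains and keeps both intermediate tensors alive for longer.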

    Profiling result:
    Before
    <img width="910" alt="截屏2023-11-13 12 09 02"
    src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752">

    After
    <img width="910" alt="截屏2023-11-13 12 10 44"
    src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab">

commit e1d1033131114dc2634e664d009e061d900a9554
Author: Vincent Wang <[email protected]>
Date:   Thu Nov 30 18:32:36 2023 +0800

    [ORTModule] Remove Unused Arguments from Generated Triton Code (#18636)

    This PR:
    - Remove unused arguments from generated triton code,
    - Remove unnecessary mask for symbolic shape case from generated triton
    code.
    - Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.

commit 5c67a00d8e9ba3604593b6fe25a1e3da0c8ef65b
Author: George Wu <[email protected]>
Date:   Wed Nov 29 22:27:51 2023 -0800

    Revert "remove full protobuf requirement for tensorrt ep" (#18626)

    Reverts microsoft/onnxruntime#18413

    there's a timing issue here. we eventually want to get this change
    merged in but we need to update OSS onnx-tensorrt first.

commit c20488ced70488c9e95b6c11fdea309efe2fdc99
Author: Jambay Kinley <[email protected]>
Date:   Wed Nov 29 18:27:04 2023 -0800

    skip_infer for SkipGroupNorm in SymbolicShapeInference (#18630)
    https://github.com/microsoft/onnxruntime/pull/18273 added
    `SkipGroupNorm` contrib op but it did not skip onnx shape inference for
    this op in `SymbolicShapeInference`.

    This leads to failed shape inference of the transformers optimized model
    with `enable_skip_group_norm=True`. Also results in an invalid float16
    model for the SD CUDA example.

    This PR adds `SkipGroupNorm` to `skip_infer` so that it skips onnx shape
    inference for this op and instead uses the relevant dispatcher.
    Fix shape inference failure for models with `SkipGroupNorm` nodes.

commit 227dcb3a88eb8c36bfc5c0341156ce96291597ac
Author: Yang Gu <[email protected]>
Date:   Thu Nov 30 10:01:12 2023 +0800

    [js/webgpu] Log the key and program info for artifact (#18365)

    With uniform support, ideally we may just keep one artifact for each
    program to save the compilation time. This PR just logs the related
    info, including key and program name, so that we may understand better
    the situation.

commit 7335760424b052ff041285571cf52b77f9ebb009
Author: satyajandhyala <[email protected]>
Date:   Wed Nov 29 15:30:33 2023 -0800

    [JS/Web] Add uniforms to Einsum (#18531)
    Add uniforms to Einsum.
    Improve performance.

commit 483c490ec4db2d2b5001e42f5c842abfc9e379af
Author: Edward Chen <[email protected]>
Date:   Wed Nov 29 14:38:44 2023 -0800

    Refine error checks in onnxruntime/core/providers/coreml/model/model.mm. (#18620)

commit d2dfbf41795e72911643e2ffcadac069b72580bd
Author: Dmitri Smirnov <[email protected]>
Date:   Wed Nov 29 10:44:59 2023 -0800

    Add float16 type support to SplitToSequence and make code type independent (#18594)
    Add support for `float16` type to address the below issue.
    Re-work the code to make it type independent.
    This reduces binary size by ~11 K.

    ![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6)
    This PR addresses https://github.com/microsoft/onnxruntime/issues/18481

commit 68209307daadfe21a74a36d44c4c170b91141772
Author: Yi Zhang <[email protected]>
Date:   Thu Nov 30 02:32:42 2023 +0800

    Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 (#18614)
    Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to
    Onnxruntime-Win-CPU-2022
    Reduce the maintenance cost

commit 38b640c797613e2396f2975ccd4d8ff0e95a5baa
Author: Wanming Lin <[email protected]>
Date:   Thu Nov 30 00:00:23 2023 +0800

    [WebNN EP] Re-implement Unsqueeze, Squeeze, Flatten with WebNN's reshape (#18585)

    WebNN will not provide `unsqueeze`, `squeeze`, `flatten2d` ops, as they
    can be easily implemented with reshape.
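
To make the reshape equivalence concrete, here is a minimal NumPy sketch. It stands in for the idea only: it does not use the WebNN API, and the three function names are hypothetical.

```python
import numpy as np


def squeeze_via_reshape(x, axes):
    """Drop the given size-1 axes by reshaping to the remaining dimensions."""
    axes = {a % x.ndim for a in axes}
    if any(x.shape[a] != 1 for a in axes):
        raise ValueError("only size-1 dimensions can be squeezed")
    return x.reshape([d for i, d in enumerate(x.shape) if i not in axes])


def unsqueeze_via_reshape(x, axes):
    """Insert size-1 dimensions at the given output positions."""
    out_rank = x.ndim + len(axes)
    axes = {a % out_rank for a in axes}
    dims = iter(x.shape)
    return x.reshape([1 if i in axes else next(dims) for i in range(out_rank)])


def flatten_via_reshape(x, axis):
    """Flatten to 2-D: dimensions before `axis` become rows, the rest columns."""
    if axis < 0:
        axis += x.ndim
    rows = int(np.prod(x.shape[:axis], dtype=np.int64))
    return x.reshape(rows, -1)


x = np.zeros((1, 3, 1, 4))
print(squeeze_via_reshape(x, [0, 2]).shape)                    # (3, 4)
print(unsqueeze_via_reshape(np.zeros((3, 4)), [0, 2]).shape)   # (1, 3, 1, 4)
print(flatten_via_reshape(np.zeros((2, 3, 4)), 1).shape)       # (2, 12)
```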

commit 14a343441dcd530bec24e18e34c3c068993eb06c
Author: Edward Chen <[email protected]>
Date:   Tue Nov 28 17:14:20 2023 -0800

    Fix Objective-C static analysis build (#18606)

    - Patch abseil to fix a compile error about not finding `cxxabi.h`.
    - Fix some static analysis warnings.

commit e833d22f143f86529f4863b5da6cac4eb4a78bbb
Author: ivberg <[email protected]>
Date:   Tue Nov 28 16:58:51 2023 -0800

    Change QNN EP Profiling logs to output to CSV (#18201)
    Change QNN EP Profiling logs to output to CSV. Output is in a similar
    format to QNN SDK Tools (instead of to ORT logs)

    https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#configuration-options
    (profiling_level)
    It is hard to read and interpret QNN profiling logs in the ORT logs.

    ---------

    Co-authored-by: Hector Li <[email protected]>

commit f13380f3d8d25df797be60b4899b43504a5576b5
Author: Tianlei Wu <[email protected]>
Date:   Tue Nov 28 15:46:42 2023 -0800

    Support LoRA and Control Net in Stable Diffusion demo (#18593)
    (1) Export onnx model with LoRA weights for both SD 1.5 and SDXL
    (2) Export onnx model with Control Net for both SD 1.5 and SDXL. For SD
    1.5, it is allowed to use multiple control nets. For SDXL, at most one
    control net is supported right now.
    (3) Add demo of LCM LoRA
    (4) Add demo of control net.

commit 50e6235af111e5113860dfd7a0ece55dc00316a0
Author: Yulong Wang <[email protected]>
Date:   Tue Nov 28 15:15:59 2023 -0800

    [js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525)
    This PR includes a change inspired by #18452 to resolve a
    requirement: a shader may depend on an instance of `IndicesHelper` to
    generate a WGSL code snippet, but the `IndicesHelper` instance is not
    necessarily an input/output of the program. So the existing
    `declareVariables()` function does not work in this scenario.

    In order to support this requirement, I added this "use" function to
    `interface ShaderHelper`, which takes a helper-like object as parameter.
    The hidden implementation `ShaderHelperImpl` class will iterate the
    helpers and call `impl()` for each.

    @axinging @qjia7

commit a49f31b6705bdd8a9b9cd7b7b4a9bbc0ebba07a2
Author: Jian Chen <[email protected]>
Date:   Tue Nov 28 13:23:01 2023 -0800

    Remove drop-nuget artifact from all pipelines (#18592)
    Currently, the `drop-nuget` artifact only contains protoc.exe which is
    also part of the `drop-extra` artifact.

commit e24733cfe9b3e0d40419942f2d6337925c351606
Author: Mike Guo <[email protected]>
Date:   Wed Nov 29 03:42:39 2023 +0800

    fix the Olive CI pipeline failure on Windows (#18464)

    Fix the https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1046
    failure for Windows

commit 288b80d363bc120c8d3c0ca3c2fe4252e16f4c56
Author: Rachel Guo <[email protected]>
Date:   Tue Nov 28 10:11:53 2023 -0800

    Add MacOS build to ORT C Pod (#18550)
    As title.

    1. Add macOS build as an optionally enabled arch for the pod, with changes to
    the existing build_ios_framework/assemble_c_pod scripts.
    2. Enable the macOS build arch in the iOS packaging pipeline (currently for
    variants other than Mobile) and check that the output artifacts are correct.
    3. Write a macOS test target scheme in the test app and integrate it into the
    iOS packaging CI testing pipeline.

    Currently the changes only apply to the onnxruntime-c pod, as the original
    request was from ORT SPM, which consumes only the onnxruntime-c pod as
    the binary target. TODO: look into adding the macOS platform to the objc
    pod as well.

    Enable macOS platform support in CocoaPods, and potentially produce a
    binary target for enabling the macOS platform in SPM as well.

    Replace https://github.com/microsoft/onnxruntime/pull/18334

    ---------

    Co-authored-by: rachguo <[email protected]>
    Co-authored-by: rachguo <[email protected]>
    Co-authored-by: Edward Chen <[email protected]>

commit 05046e5452f7a1f47bb1f4c01ddfa86eb6fac77f
Author: Chen Fu <[email protected]>
Date:   Tue Nov 28 10:01:09 2023 -0800

    Adding unit test for sm80 prepack (#18514)
    Prepacking code for the block q4 x fp16 GEMM CUDA kernel, for SM80 hardware.
    This prepares for the addition of a Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
    This kernel requires sophisticated quantized weight rearrangement to
    speed up loading data to tensor cores. To facilitate the addition, this
    change includes the following:

    1. matrix_layout.h: A new layout lib that facilitates iterating over matrix
    elements and tiles while balancing memory safety and performance.
    2. prepack_sm80.h: Code for rearranging quantized weights, scales and
    offsets (a.k.a. prepacking).
    3. blkq4_fp16_sm80_prepack_test.cc: Unit tests that explicitly test the
    memory safety and correctness of the prepacking code.

    Currently the prepacking code runs on the CPU with single-threaded code. We
    run this on the CPU in order to minimize GPU memory fragmentation. On the
    other hand, hopefully we will get around to parallelizing this part of the
    code. This should be straightforward with the unit tests in place.

commit 8d5ecc4dae0686d032a81c3633fdaf213572a722
Author: Adrian Lizarraga <[email protected]>
Date:   Tue Nov 28 09:46:47 2023 -0800

    [Quantization] Fix scale/zero-point for 16-bit QDQ Softmax (#18589)
    Sets the appropriate scale and zero-point values for 16-bit QDQ Softmax.
    Previously, the scale/zp were set to fixed values that were specific to 8-bit quantization.
    Generate more accurate 16-bit QDQ models that contain Softmax.
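
Softmax outputs lie in [0, 1], so fixed quantization parameters for it depend only on the integer width. The sketch below assumes the plain asymmetric (affine) formula `scale = (rmax - rmin) / (qmax - qmin)` for unsigned types. The exact constants chosen in this PR are not stated in the commit message and may differ, and `affine_qparams` is a hypothetical name.

```python
def affine_qparams(rmin: float, rmax: float, num_bits: int):
    """Scale and zero-point for unsigned affine quantization of [rmin, rmax]."""
    qmin, qmax = 0, (1 << num_bits) - 1
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point


# The scale for an 8-bit output is far too coarse for a 16-bit output.
for bits in (8, 16):
    scale, zero_point = affine_qparams(0.0, 1.0, bits)
    print(bits, scale, zero_point)
```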

commit 0b7048e7d621b271b0ab4748e566f57d11b49be5
Author: Sheil Kumar <[email protected]>
Date:   Tue Nov 28 09:26:48 2023 -0800

    Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384)

    Update winml to use #cores - #soc cores by Default as the number of
    intraopthreads

    ---------

    Co-authored-by: Sheil Kumar <[email protected]>

commit a6d872640764ea50ec460f7a717e5b369921f8b4
Author: Yi Zhang <[email protected]>
Date:   Wed Nov 29 01:04:25 2023 +0800

    Update ADO windows image to custom image (#18598)
    Update Azure-Pipelines-EO-Windows2022-aiinfra to
    onnxruntime-win-CPU-2022 in Nuget_Package_CPU.
    To make debugging easier, use flex-downloadPipelineArtifact.
    Azure-Pipelines-EO-Windows2022-aiinfra uses the 1ES window-latest image, so
    the pipeline might fail because of an unexpected upgrade.
    Verified:
    https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=384425&view=results
    I think we should replace all Azure-Pipelines-EO-Windows2022-aiinfra.

commit 3ea27c29253aad7c02015e2af6d37dedafe2c9c3
Author: Jian Chen <[email protected]>
Date:   Tue Nov 28 09:03:46 2023 -0800

    Create a new Nuget Package pipeline for CUDA 12 (#18135)

commit 94a6020a7f59f22101653988a36bca02593eb816
Author: Xavier Dupré <[email protected]>
Date:   Tue Nov 28 03:56:00 2023 -0800

    Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539)

    TfIdfVectorizer has two steps: first it searches for n-grams in the
    input, then it weights the results. The second step was not
    parallelized; this PR addresses that. Previously, two vectors the size
    of the output were allocated to compute the results. The first one,
    frequencies, served as an intermediate vector between the two steps.
    It is now split into multiple small vectors, one per thread, which
    reduces memory consumption for batches with more rows than threads.

    Performance and memory consumption:

    For one model, the change is 15% faster (4 cores, model size is
    ~6Mb, batch size is 100). Here is another benchmark on
    a machine with 32 cores with different vocabulary sizes and batch
    sizes. The tested TfIdfVectorizer only deals with unigrams and processes
    sequences of 10 tokens (integers).

    ![image](https://github.com/microsoft/onnxruntime/assets/22452781/0bb9abe9-ed81-44da-b5c4-ad2a12f129bd)
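
The idea of replacing one output-sized frequency vector with small per-worker buffers can be sketched as follows. This is an illustration only, not the C++ kernel: the function name `tfidf_rows`, the unigram-only lookup, and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def tfidf_rows(rows, vocab_index, weights, n_threads=4):
    """Count unigrams per row, then weight them, one row per task."""
    n_features = len(weights)

    def process(row):
        # Row-sized scratch buffer owned by the running task. Only the
        # tasks currently executing (at most n_threads) hold one.
        freq = [0] * n_features
        for token in row:
            idx = vocab_index.get(token)
            if idx is not None:
                freq[idx] += 1
        # Second step (weighting) now also runs inside the worker.
        return [f * w for f, w in zip(freq, weights)]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(process, rows))


result = tfidf_rows(
    rows=[[1, 2, 2], [3, 9]],
    vocab_index={1: 0, 2: 1, 3: 2},
    weights=[1.0, 0.5, 2.0],
)
```

Because the scratch buffers scale with the number of workers rather than the number of rows, the saving only shows up when the batch has more rows than threads, which matches the description above.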

commit 3f42fbad2e42cf03c01eb0428b06e24f4ad2d427
Author: Ran Gal <[email protected]>
Date:   Mon Nov 27 23:54:38 2023 -0800

    deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543)

    Deleted the unused random_device variables because they caused a
    warning that was treated as an error.

    **_Please check whether the declaration is required for random number
    generation. If so, either add a dummy reference to the variable or
    turn off the warning-as-error behavior._**

commit fc8631e2f11d85c84ab9cc711aacb9c589b6f71a
Author: Jiajia Qin <[email protected]>
Date:   Tue Nov 28 13:21:47 2023 +0800

    [js/web] Fix conv2dMatmul errors due to #18452 (#18562)
    Currently, every conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
    reports compilation errors. Models that include this kind of shape,
    such as mobilenetv2-12 and resnet50, are affected.

    The errors were introduced by #18452
    https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
    which accidentally passes `components` to `x`. But the components of `x`
    should be `innerElementSize`, not `components`. And when
    `innerElementSize` is 3, we should use `1` in the current design.
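
The rule described above can be written down as a tiny sketch. It is in Python for illustration only (the real code is TypeScript), and the function name `x_components` is hypothetical.

```python
def x_components(inner_element_size: int) -> int:
    """Pick the vector width for input `x` from innerElementSize.

    Per the description above, a width of 3 is not used directly in the
    current design, so it falls back to scalar access (1).
    """
    return 1 if inner_element_size == 3 else inner_element_size
```

For example, `x_components(3)` yields 1 while `x_components(4)` stays 4, so inputs with inChannels = 3 no longer receive an unsupported width.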

commit b9fd9c5665c998fea8786a2e9fee2776e667845c
Author: cao lei <[email protected]>
Date:   Mon Nov 27 13:41:12 2023 -0800

    remove dead code in openvino EP (#18457)
    Remove dead code in openvino EP

commit dd355e39a063c124142f60d6cc14f6d48692e1f7
Author: Caroline Zhu <[email protected]>
Date:   Mon Nov 27 10:30:13 2023 -0800

    [js/web/training] Added parameters methods (#18250)
    * Implemented: `getParametersSize`, `getContiguousParameters`
    (equivalent to copyParametersToBuffer), and `loadParametersBuffer`
    (equivalent to copyParametersFromBuffer)
    * As part of these changes, `getParametersSize` was added to the
    TrainingSession interface so that users know what size buffer to create
    for `loadParametersBuffer`
    * The parameters methods in the interface were modified to take in a
    Float32Array instead
    * Part of the work for implementing web bindings for training
    * Enables federated learning in the web
    * Previous PR: #18006
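
The relationship between the three methods can be illustrated with a toy model. This is a conceptual sketch in Python, not the onnxruntime-web API: the class `ToySession` and its snake_case method names are hypothetical stand-ins that only mirror the roles described in the list above.

```python
from array import array


class ToySession:
    """Stand-in for a training session holding named float32 parameters."""

    def __init__(self, params):
        self.params = {name: array("f", values) for name, values in params.items()}

    def get_parameters_size(self):
        # Total number of float32 elements across all parameters.
        return sum(len(values) for values in self.params.values())

    def get_contiguous_parameters(self):
        # Copy every parameter into one flat float32 buffer.
        flat = array("f")
        for values in self.params.values():
            flat.extend(values)
        return flat

    def load_parameters_buffer(self, buffer):
        # The caller must size the buffer using get_parameters_size().
        if len(buffer) != self.get_parameters_size():
            raise ValueError("buffer length does not match parameter size")
        offset = 0
        for name, values in self.params.items():
            n = len(values)
            self.params[name] = array("f", buffer[offset:offset + n])
            offset += n


session = ToySession({"w": [1.0, 2.0], "b": [0.5]})
size = session.get_parameters_size()
flat = session.get_contiguous_parameters()
session.load_parameters_buffer(array("f", [0.0, 0.0, 1.0]))
```

In a federated learning setting, a client would export the flat buffer, a server would aggregate buffers from several clients, and each client would load the aggregated buffer back.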

    ---------

    Co-authored-by: Ashwini Khade <[email protected]>

commit a2fd8a6fc083f43d6535f5acd24219c140812c87
Author: Hector Li <[email protected]>
Date:   Fri Nov 24 20:41:27 2023 -0800

    [QNN EP] Return INVALID_GRAPH if failed to load from context binary (#18485)
    Make sure QNN EP returns INVALID_GRAPH if an error is encountered while
    loading from the context binary file.

commit 2f608338cb46398fc3806cb6d1fd3ba7961b1a9f
Author: cloudhan <[email protected]>
Date:   Fri Nov 24 18:04:48 2023 +0800

    Setup default python formatter for new python plugin (#18563)

commit 7b2aefa85688a02a58c5dd7bddc90e7f81f44c3a
Author: Ted Themistokleous <[email protected]>
Date:   Fri Nov 24 05:04:23 2023 -0500

    undo hipify of __half to rocblas_half (#18573)

    Fixes build issue seen with newer ROCm releases

    Co-authored-by: Jeff Daily <[email protected]>

commit b9c935f6050b3a57e23dbb79e739489f25f6924a
Author: mindest <[email protected]>
Date:   Fri Nov 24 17:22:00 2023 +0800

    [ROCm] Some fixes in tunable (#18575)

    * Fix workspace size for hipBLASLt algos at 32M
    * Update according to API changes

commit 62f00ad8e7b7bbaf144e9af2bb19d9bf63dcd291
Author: Rachel Guo <[email protected]>
Date:   Thu Nov 23 14:26:57 2023 -0800

    [CoreML] Add Softmax and Split op support (#18358)
    As title.

    Adds operator support that the yolov8 model was missing.
    https://github.com/microsoft/onnxruntime/issues/17654

    Now the model support info looks like:

    _CoreMLExecutionProvider::GetCapability, number of partitions supported
    by CoreML: 3 number of nodes in the graph: 233 number of nodes supported
    by CoreML: 230_

    (The only 3 unsupported nodes are Concat ops, because 3D input shapes are
    not currently supported by the CoreML EP Concat.)

    ---------

    Co-authored-by: rachguo <[email protected]>
    Co-authored-by: rachguo <[email protected]>
    Co-authored-by: Edward Chen <[email protected]>

commit 6f3c1f9dc9c08ec52c3c2e975e35308b08219494
Author: cloudhan <[email protected]>
Date:   Thu Nov 23 12:06:19 2023 +0800

    [ROCm] Update ck for GemmFloat8 (#18487)

commit 1c79897c90f959d30ed68c9b36d82be0024d806b
Author: Adrian Lizarraga <[email protected]>
Date:   Wed Nov 22 19:40:33 2023 -0800

    [QNN EP] Support LpNormalization (#18561)
    Add support for the ONNX LpNormalization operator (p == 2). This is
    translated to QNN's L2Norm operator.
    Support more models with QNN EP
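
For reference, the p == 2 case divides each element by the L2 norm taken along the chosen axis. Below is a minimal sketch for a single 1-D vector; the function name `l2_normalize` and the zero-norm guard are additions for the example and are not taken from the PR.

```python
import math


def l2_normalize(vector):
    """Divide every element by the vector's L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        # Guard added for this sketch only: leave an all-zero vector unchanged.
        return list(vector)
    return [v / norm for v in vector]


normalized = l2_normalize([3.0, 4.0])
```

Here the norm is 5, so the result is `[0.6, 0.8]`.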

commit 43a5147e015e105547aa0e6862462a352fa43c5f
Author: pengwa <[email protected]>
Date:   Thu Nov 23 11:39:00 2023 +0800

    Memory optimization refactor and refinement (#17481)

    Currently the memory optimizer runs graph transformations and prints
    recompute opportunities at INFO level, but the ORT backend emits so many
    INFO-level logs that users have a hard time finding that information. So
    we want a Python binding API to retrieve the memory optimization
    opportunities instead of depending on the MemoryOptimizer's default
    logging.
    Then we can print ORTModule feature statistics using this information.
    Also, with such an API, the analysis can run after an ORT session has
    been created, when the allocation plan is done, so it can take buffer
    reuse into account. This avoids suggesting recomputation subgraphs that
    reuse other subgraphs' output buffers.

    Check
    https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
    for the new flow using `MemoryOptimizer`.

    This pull request makes the following refactorings:
    1. Print the log in the ORTModule Python script, along with ORTModule
    feature enabling stats. This is implemented by exposing an API
    `get_serialized_ortmodule_memory_stat` to retrieve the memory
    optimization opportunities.
    2. Analyze memory optimization opportunities with ORT memory planning
    taken into account. This is done by first creating the execution graph
    without enabling MemoryOptimizer, then calling
    `execution_agent.get_serialized_ortmodule_memory_stat`, which internally
    considers the session memory allocation planner when analyzing
    memory optimization opportunities. As a direct result, the memory
    optimization opportunities can show stashed activations that
    reuse other buffers.
    3. Move recompute analysis logic from memory_optimizer.h/cc to
    recompute_analysis.h/cc.
    4. Abstract optimization strategies so that each has its own
    implementation. This makes introducing new strategies (for example
    compression and decompression) easier.

    New logging matrix (INFO level); at WARNING level, the details are NOT
    shown.
    ```
    2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
    ***** ONNX Runtime Training (ORTModule) is accelerating your model *****

    ORTModule is enabled with following features ON/OFF for [training] mode:

      ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
      Cast Propagation      :   ON    :   Level 1 enabled
      Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
      Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                          Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
       - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
       - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
       - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
       - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
       - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
       - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
       - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
       - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
       - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
       - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
      Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
       - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
      Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
      TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
      ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

    Total ORT initialization overhead is 10.73s where export takes 8.39s.
    Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

    Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

    Note 1: use comma to enable multiple plans at the same time.
      export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
    Note 2: saving is calculated based on the 1st batch symbolic dim values:
      inputs_input_ids_dim0=1,
      inputs_input_ids_dim1=1024,
      inputs_attention_mask_dim0=1,
      inputs_attention_mask_dim1=1024,
      inputs_labels_dim0=1,
      inputs_labels_dim1=1024,

    ************************************************************************
    ```
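
The "Saving(B)" column can be reproduced from the "Saving Symbolic(Bytes)" column by plugging in the first-batch dimension values listed in Note 2. The sketch below copies four of the symbolic expressions from the log above; the function name `plan_savings` is made up for the example.

```python
def plan_savings(dim0, dim1):
    """Evaluate symbolic saving expressions copied from the log, in bytes."""
    return {
        "Reshape+Where+BiasSoftmax+:1:-1": 640.0 * dim0 * dim1**2,
        "Cast+:1:-1": dim0 * dim1 * (384.0 * dim1 - 64.0),
        "BiasSoftmax+:1:-1": 128.0 * dim0 * dim1 * (dim1 - 1),
        "BiasGelu+:1:-1": dim0 * (122880.0 * dim1 - 20480.0),
    }


# inputs_input_ids_dim0=1 and inputs_input_ids_dim1=1024, as in Note 2.
savings = plan_savings(dim0=1, dim1=1024)
for config, saved_bytes in savings.items():
    print(f"{config:40s} {int(saved_bytes):>15,d}")
```

The printed values match the table (671,088,640 for Plan 1, 402,587,648 for Plan 2, and so on), so the same expressions can be used to estimate savings for other batch sizes or sequence lengths.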

    If DEVINFO level is enabled, then more details about the memory
    optimizations are printed.
    ```

    MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
    ==========================================================================================================================================
    |Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - ReuseFreq :  Output 0(3),                                                                                                  |
    |       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - ReuseFreq :  Output 0(2),                                                                                                  |
    |       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
    |       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
    |       |                                                                                                                                |
    |       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
    |       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
    |       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    |1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
    |       |                                                                                                                                |
    |       |>>Option 1     : Recompute subgraph Add+                                                                                        |
    |       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
    |       |  Stashed Activations:                                                                                                          |
    |       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
    |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
    ==========================================================================================================================================
    Note: use comma as a separator for enabling more than one subgraphs.

    ************************************************************************

    ```
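
A small sketch of assembling the comma-separated `ORTMODULE_MEMORY_OPT_CONFIG` value shown in the logs. The helper names are hypothetical, and the field names `strategy` and `count` are assumptions inferred from the log output (for example "requested count=-1"), not documented names. The sketch also assumes the variable has to be set before ORTModule is initialized.

```python
import os


def parse_plan(entry):
    """Split '<subgraph pattern>:<strategy>:<count>' into its three parts."""
    subgraph, strategy, count = entry.rsplit(":", 2)
    return subgraph, int(strategy), int(count)


def build_config(entries):
    """Validate each entry, then join them with commas."""
    for entry in entries:
        parse_plan(entry)  # raises ValueError on a malformed entry
    return ",".join(entries)


config = build_config(["BiasGelu+:1:-1", "Cast+:2:-1"])
os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = config
```

This reproduces the "User config" line of the MemoryInsight summary above, `BiasGelu+:1:-1,Cast+:2:-1`.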

commit 64dacc2892d31603a5723959d308bb9c4b05d0ea
Author: Jiajia Qin <[email protected]>
Date:   Thu Nov 23 07:58:06 2023 +0800

    [js/webgpu] Add BatchNormalization Op (#18468)
    This PR adds `BatchNormalization` with `float` support.

    Some TODOs:
    1. Inputs with mixed data types, for example x/y in float16
    with bias/scale in float32 or double.
    2. Training mode support.

    Many models use `BatchNormalization` ops. However, because the op was
    missing in jsep, all of them ran on the CPU, which resulted in very poor
    performance. With this PR, the densenet-9 model goes from 250.69 ms
    to 20.29 ms.
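
For reference, inference-mode batch normalization applies a per-channel affine transform using stored statistics. The sketch below handles one channel in plain Python. It is not the WebGPU shader from this PR, and the function name `batch_norm_inference` is made up for the example.

```python
import math


def batch_norm_inference(x, scale, bias, mean, var, epsilon=1e-5):
    """y = scale * (x - mean) / sqrt(var + epsilon) + bias for one channel."""
    inv_std = 1.0 / math.sqrt(var + epsilon)
    return [scale * (v - mean) * inv_std + bias for v in x]


# epsilon=0.0 keeps the arithmetic exact for this small check.
y = batch_norm_inference([1.0, 3.0], scale=2.0, bias=0.5, mean=1.0, var=1.0, epsilon=0.0)
```

Training mode, listed as a TODO above, would additionally compute the mean and variance from the current batch and update the running statistics.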

commit fa106942a7962e68f1659cd65f5a7cdb498b8c03
Author: Xu Xing <[email protected]>
Date:   Thu Nov 23 06:42:55 2023 +0800

    [js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452)

    This change refactored matmul/conv related programs to support shape
    uniforms. Currently only matmul shape uniforms are fully enabled.
    TODOs: add input dependencies for conv related programs, turn clipMax
    and clipMin to uniforms.

commit 42c6799c59b5770809a6b4df208d3da5a0270486
Author: Scott McKay <[email protected]>
Date:   Thu Nov 23 08:27:47 2023 +1000

    Update transpose optimization to be more QDQ aware (#18444)
    Rework some aspects of the transpose optimizer to ensure we have valid
    QDQ node units when it is done.

    Conceptually we need to let individual Transpose nodes move through the
    graph when optimizing. That can invalidate existing QDQ node units or
    require new ones. We can fix this after inserting new nodes, or when
    transpose optimization finishes moving Transpose nodes.

    Fixes when inserting a new node
    - TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a
    node's inputs
    - if there was a DQ node providing the input, add a Q -> DQ after
    inserting the Unsqueeze/Transpose to make a QDQ node unit for the new
    node.
    - Unsqueeze/Transpose don't change data, so we can copy the
    type/scale/zero point from the existing DQ

    Fixes when transpose optimization completes moving Transpose nodes
    - Remove empty DQ -> Q pairs if the type/scale/zero point match
      - Pushing a Transpose through may have resulted in an existing
        Transpose/Reshape being cancelled and removed, leaving an empty QDQ
        node unit
      - the Transpose being moved may have started in a QDQ node unit
    - Transpose that got blocked inside an existing QDQ node unit
      - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets
        blocked after the DQ
      - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and
        repair the original QDQ node unit
    - Transpose moves past a DQ providing a graph output
      - insert a Q -> DQ so the Transpose is in a QDQ node unit
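
The "remove empty DQ -> Q pairs" step can be illustrated on a simplified linear chain of nodes. This is a toy model, not the optimizer's graph API: `QuantParams`, the tuple-based chain, and `drop_empty_dq_q_pairs` are hypothetical, and a real graph would also need to check consumers and graph outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantParams:
    dtype: str
    scale: float
    zero_point: int


def drop_empty_dq_q_pairs(chain):
    """Remove a DQ immediately followed by a Q with identical parameters.

    `chain` is a list of (op_type, params) tuples in execution order.
    A DQ -> Q pair with nothing in between and matching type/scale/zero
    point round-trips the quantized data unchanged, so it can be dropped.
    """
    result = []
    for op_type, params in chain:
        previous = result[-1] if result else None
        if previous and previous[0] == "DQ" and op_type == "Q" and previous[1] == params:
            result.pop()  # drop the DQ; the matching Q is skipped as well
        else:
            result.append((op_type, params))
    return result


p = QuantParams("uint8", 0.1, 128)
chain = [("Q", p), ("DQ", p), ("Q", p), ("DQ", p), ("MatMul", None)]
simplified = drop_empty_dq_q_pairs(chain)
```

In this example the middle DQ -> Q pair disappears and the chain becomes Q -> DQ -> MatMul. A pair whose scale or zero point differs is left in place, because it requantizes the data.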

    This replaces the existing phase 2 logic which flipped a DQ -> Transpose
    to fix a broken QDQ node unit. The new approach should handle more
    scenarios and hopefully produce a better graph.

    Additionally the logic to handle updates to shared initializers that
    feed DQ nodes was simplified (i.e. largely removed). When we update the
    shared initializer a Squeeze (if broadcast) and Transpose is added
    between the initializer and the DQ for other usages of it. We only need
    to check for this pattern in EstimateTransposeValueCost by looking past
    a DQ node. We do not need to track the individual DQ nodes leading to an
    updated shared initializer.
    This change was initially made to fix a QNN issue where a non-const
    input was transposed and the QDQ node units were broken.

commit 841f7ed3e0c393b22b1631c090c61b20fc62f876
Author: satyajandhyala <[email protected]>
Date:   Wed Nov 22 14:14:24 2023 -0800

    [JS/Web] Added uniforms to Expand op. (#18558)
    Added uniforms to the Expand operator kernel to improve performance.

commit 1c555c5fc11d673df9db4f08ebf389c9929e85c0
Author: Arthur Islamov <[email protected]>
Date:   Thu Nov 23 00:12:07 2023 +0400

    [JS/Web] Resize & BiasSplitGelu fp16 support (#18536)
    Resize and BiasSplitGelu fp16 support on WebGPU

commit 3f0ebd673622d3663011ae33fc6070f1f2ea3af3
Author: Xavier Dupré <[email protected]>
Date:   Wed Nov 22 18:15:24 2023 +0100

    Fix opset import in GemmFloat8 python unit tests (#18489)
    The unit tests fail if a development version of onnx is used. The
    opset is set to 19.

commit 32fabb555501a020751b6123de94c7fc14086f2b
Author: Xavier Dupré <[email protected]>
Date:   Wed Nov 22 18:15:11 2023 +0100

    Fix opset version of the optimizer in function generate_artifacts (#18300)
    `generate_artifacts` generates 4 graphs for training. All graphs should
    share the same opset version, the one coming from the model to train,
    but the optimizer is left un…