SkipGroupNorm fusion and SDXL Pipeline Update #18273
Merged
"Skip GroupNorm fusion since bias shape is expected to be [32], Got %s", str(instance_norm_bias.shape)
)
instance_norm_bias = self.model.get_constant_value(instance_norm.input[2])
if instance_norm_bias is None or instance_norm_scale.shape != instance_norm_scale.shape:
Check warning (Code scanning / CodeQL): Comparison of identical values
Comparison of identical values; use cmath.isnan() if testing for not-a-number.
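The warning points at `instance_norm_scale.shape != instance_norm_scale.shape`, which is always false, so only the `is None` half of the condition can ever trigger. A minimal sketch of what the check presumably intended, comparing the bias shape against the scale shape. The helper name `bias_is_unusable` is hypothetical and plain tuples stand in for array shapes; the fix that was actually merged may differ.

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def bias_is_unusable(bias_shape: Optional[Shape], scale_shape: Shape) -> bool:
    """Return True when the bias is missing or its shape differs from the scale's.

    Hypothetical helper for illustration only; it is not part of the fusion script.
    """
    # Comparing scale_shape with itself could never be True, so the bias
    # shape is compared against the scale shape instead.
    return bias_shape is None or bias_shape != scale_shape
```

With this form, a bias of shape `(16,)` against a scale of shape `(32,)` is rejected, whereas the flagged expression would have let it through.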
wangyems previously approved these changes on Nov 3, 2023
tianleiwu force-pushed the tlwu/skip_group_norm_fusion_script branch from 9f910a2 to 42f3a67 on November 4, 2023 00:24
kunal-vaishnavi previously approved these changes on Nov 4, 2023
tianleiwu changed the title from "SkipGroupNorm fusion for SDXL" to "SkipGroupNorm fusion and SDXL Pipeline Update" on Nov 6, 2023
kunal-vaishnavi approved these changes on Nov 7, 2023
tianleiwu added a commit that referenced this pull request on Nov 7, 2023
Update a few optimizations for Stable Diffusion XL:
(1) Add SkipGroupNorm fusion.
(2) Remove GroupNorm fusion limits. Previously, GroupNorm was fused only when the number of channels was one of `320, 640, 960, 1280, 1920, 2560, 128, 256, 512`, so some GroupNorm nodes in the refiner were not fused.
(3) Tune SkipLayerNormalization to use the vectorized kernel for hidden sizes 320, 640 and 1280.

Pipeline Improvements:
(4) Enable CUDA graph for unetxl.
(5) Change optimization to generate an optimized fp32 model with ORT, then convert it to fp16. Otherwise, the fp16 model might be invalid.
(6) Add option to enable-vae-slicing.

Bug fixes:
(a) Fix VAE decode in SD demo.
(b) Fix UniPC add_noise missing a parameter.
(c) EulerA exception in SDXL demo. Disable it for now.
(d) Batch size > 4 causes an error in VAE without slicing. Force VAE slicing on when batch size > 4.

#### Performance Test on A100-SXM4-80GB

Description of the experiments in the results:
*Baseline*: removed GroupNorm fusion limits; CUDA graph is enabled in Clip and VAE, but not in Clip2 and UNet.
*UNetCG*: Enable CUDA graph on UNet
*SLN*: Tune SkipLayerNormalization
*SGN*: Add SkipGroupNorm fusion

The latency (ms) of generating a 1024x1024 image with 30 steps of the base model and 9 steps of the refiner model:

| | Baseline | UNetCG | UNetCG+SLN | UNetCG+SLN+SGN |
| -- | -- | -- | -- | -- |
| Base Clip | 3.74 | 3.70 | 3.88 | 3.81 |
| Base Unet x30 | 2567.73 | 2510.69 | 2505.09 | 2499.99 |
| Refiner Clip | 7.59 | 7.42 | 7.41 | 7.58 |
| Refiner Unet x 9 | 814.43 | 803.03 | 802.20 | 799.06 |
| Refiner VAE Decoder | 84.62 | 85.18 | 85.24 | 87.43 |
| E2E | 3480.56 | 3412.05 | 3405.77 | 3400.23 |

Enabling CUDA graph brought the major gain (around 68 ms). SLN tuning gained about 6 ms, and SkipGroupNorm fusion about 5 ms. SkipGroupNorm fusion does not reduce latency much, but it also reduces memory usage, so enabling it is recommended.

### Motivation and Context
Additional optimizations upon previous work in #17536.
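The per-step gains quoted above can be rechecked from the E2E row of the table. The following is a small arithmetic sketch; the variable and function names are illustrative, and the only data used are the four E2E latencies reported in the commit message.

```python
# End-to-end latencies (ms) copied from the E2E row of the table above.
e2e_ms = {
    "Baseline": 3480.56,
    "UNetCG": 3412.05,
    "UNetCG+SLN": 3405.77,
    "UNetCG+SLN+SGN": 3400.23,
}


def step_gains(latencies):
    """Latency saved (ms) by each configuration relative to the one before it."""
    names = list(latencies)
    return {
        f"{prev} -> {cur}": round(latencies[prev] - latencies[cur], 2)
        for prev, cur in zip(names, names[1:])
    }


gains = step_gains(e2e_ms)

# Overall improvement of the last configuration over the baseline, in percent.
total_pct = round(
    100 * (e2e_ms["Baseline"] - e2e_ms["UNetCG+SLN+SGN"]) / e2e_ms["Baseline"], 2
)

for step, gain in gains.items():
    print(f"{step}: {gain} ms")
print(f"total: {total_pct}%")
```

These are single reported numbers rather than repeated measurements, so differences of a few milliseconds (such as the VAE decoder row getting slightly slower) may be within run-to-run noise.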
jambayk added a commit that referenced this pull request on Nov 30, 2023
### Description
#18273 added the `SkipGroupNorm` contrib op but did not skip ONNX shape inference for this op in `SymbolicShapeInference`. This leads to failed shape inference for the transformers-optimized model with `enable_skip_group_norm=True`, and also results in an invalid float16 model for the SD CUDA example.

This PR adds `SkipGroupNorm` to `skip_infer` so that ONNX shape inference is skipped for this op and the relevant dispatcher is used instead.

### Motivation and Context
Fix shape inference failure for models with `SkipGroupNorm` nodes.
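The control flow described here (skip the generic ONNX inference for an op, then let a per-op handler fill in the shapes) can be modeled with a toy example. This is a simplified sketch and not the real `SymbolicShapeInference` class: the node dictionaries, `SKIP_INFER_OPS`, `DISPATCHER` and both helper functions are hypothetical, and the handler assumes the output has the same shape as the first input.

```python
# Ops for which the generic (schema-based) inference is skipped.
SKIP_INFER_OPS = {"SkipGroupNorm"}

# Standard ops that the stand-in generic inference knows how to handle.
STANDARD_OPS = {"Relu"}


def generic_onnx_infer(node, shapes):
    """Stand-in for generic inference; it fails on ops without a known schema."""
    if node["op_type"] not in STANDARD_OPS:
        raise RuntimeError(f"no shape inference for {node['op_type']}")
    shapes[node["outputs"][0]] = shapes[node["inputs"][0]]


def propagate_first_input_shape(node, shapes):
    """Custom handler: assume the output shape equals the first input's shape."""
    shapes[node["outputs"][0]] = shapes[node["inputs"][0]]


# Per-op handlers used in place of (or in addition to) generic inference.
DISPATCHER = {"SkipGroupNorm": propagate_first_input_shape}


def infer_node(node, shapes):
    """Skip generic inference for listed ops, then run any custom handler."""
    if node["op_type"] not in SKIP_INFER_OPS:
        generic_onnx_infer(node, shapes)
    handler = DISPATCHER.get(node["op_type"])
    if handler is not None:
        handler(node, shapes)


shapes = {"x": ("N", "H", "W", 320)}
infer_node({"op_type": "SkipGroupNorm", "inputs": ["x"], "outputs": ["y"]}, shapes)
```

In this model, leaving `SkipGroupNorm` out of `SKIP_INFER_OPS` makes `infer_node` raise before the handler runs, which mirrors the failure mode the PR description reports.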
wejoncy added a commit that referenced this pull request on Dec 5, 2023
commit e066fca7770987c9c2c91babca9d74e95291e39f Author: Adrian Lizarraga <[email protected]> Date: Mon Dec 4 17:54:58 2023 -0800 [Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465) Allows specifying a dictionary of tensor-level quantization overrides: ``` TensorQuantOverrides = dictionary : Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains a dictionary for each channel in the tensor. Each dictionary contains optional overrides with the following keys and values. 'quant_type' = QuantType : The tensor's quantization data type. 'scale' = Float : The scale value to use. Must also specify `zero_point` if set. 'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set. 'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also set `scale` or `zero_point`. 'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also set `scale` or `zero_point`. 'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. 'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if also set `scale` or `zero_point`. ``` - All of the options are optional. - Some combinations are invalid. - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale` are also specified. 
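The validity rules listed above can be expressed as a small checker. This is a sketch under stated assumptions: `validate_override` is a hypothetical helper, not part of the quantization package, and it encodes only the constraints spelled out in the option description (`scale` and `zero_point` must come together, and the other keys are invalid once either is set).

```python
def validate_override(override: dict) -> list:
    """Return a list of rule violations for one override dictionary.

    Hypothetical helper that mirrors the documented constraints only.
    """
    errors = []
    has_scale = "scale" in override
    has_zero_point = "zero_point" in override

    # 'scale' and 'zero_point' must be given together.
    if has_scale != has_zero_point:
        errors.append("'scale' and 'zero_point' must be specified together")

    # The remaining options are invalid once 'scale' or 'zero_point' is set.
    if has_scale or has_zero_point:
        for key in ("symmetric", "reduce_range", "rmax", "rmin"):
            if key in override:
                errors.append(
                    f"'{key}' is invalid when 'scale' or 'zero_point' is set"
                )
    return errors


ok = validate_override({"scale": 1.0, "zero_point": 127})
bad = validate_override({"scale": 1.0, "rmax": 2.5})
```

Returning a list of messages instead of raising on the first problem makes it possible to report every invalid combination for a tensor in one pass.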
Example for per-tensor quantization overrides: ```Python3 extra_options = { "TensorQuantOverrides": { "SIG_OUT": [{"scale": 1.0, "zero_point": 127}], "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}], }, } ``` Example for per-channel quantization overrides (Conv weight and bias): ```Python3 extra_options = { "TensorQuantOverrides": { "WGT": [ { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.0, "rmax": 2.5, "reduce_range": True, }, { "quant_type": quantization.QuantType.QUInt8, "rmin": 0.2, "rmax": 2.55, "reduce_range": False, }, ], "BIAS": [ {"zero_point": 0, "scale": 0.000621}, {"zero_point": 0, "scale": 0.23}, ], }, } ``` Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method that inspects the model and returns suitable quantization configurations. Example usage: ```python3 from quantization import quantize, QuantType from quantization.execution_providers.qnn import get_qnn_qdq_config qnn_config = get_qnn_qdq_config(input_model_path, data_reader, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8) quantize(input_model_path, output_model_path, qnn_config) ``` Make it possible to create more QDQ models that run on QNN EP. --------- Signed-off-by: adrianlizarraga <[email protected]> commit 01b5c789177c2b062d4c4f9b6abdce12be9b3b64 Author: Tianlei Wu <[email protected]> Date: Mon Dec 4 16:03:47 2023 -0800 Add SD-Turbo and refine diffusion demo (#18694) [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) is a fast generative text-to-image model that distilled from [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1). It is targeted for 512x512 resolution. 1. Support sd-turbo model. 1. Refiner ControlNet in demo + Cache the ControlNet model so that it is downloaded only once. + Do not download default images in script. 
Instead update document to use wget to download example image. + Fix an issue of control image processing that causes shape mismatch in inference. 1. Refine arguments: + Change argument --disable-refiner to --enable-refiner since refiner is not used in most cases + Rename --refiner-steps to --refiner_denoising_steps + Add abbreviations for most used arguments. + Add logic to set default arguments for different models. 1. Refine torch model cache: + Share cached torch model among different engines to save disk space. + Only download fp16 model (previously, ORT_CUDA downloads fp32 model). 1. Do not use vae slicing when image size is small. 1. For LCM scheduler, allow guidance scale 1.0~2.0. 2. Allow sdxl-turbo to use refiner Average latency in ms for SD-Turbo (FP16, EulerA, 512x512) on A100-SXM4-80GB. Batch | Steps | TRT 8.6 static | ORT_TRT static | ORT_CUDA static | TRT 8.6 dynamic | ORT_TRT dynamic | ORT_CUDA dynamic -- | -- | -- | -- | -- | -- | -- | -- 1 | 1 | 32.07 | 30.55 | 32.89 | 36.41 | 38.30 | 34.83 4 | 1 | 125.36 | 97.40 | 97.49 | 118.24 | 114.95 | 99.10 1 | 4 | 62.29 | 60.24 | 62.50 | 72.49 | 77.82 | 67.66 4 | 4 | 203.51 | 173.11 | 168.32 | 217.14 | 215.71 | 172.53 * Dynamic engine is built for batch size 1 to 8, image size 512x512 to 768x768, optimized for batch size 1 and 512x512 commit d514a960eefc19fb69d54497b6b582cfdf6e85f1 Author: Edward Chen <[email protected]> Date: Mon Dec 4 13:38:36 2023 -0800 Remove "Python Checks" pipeline status from readme as that pipeline no longer exists. 
(#18697) commit c02a3861451a29d7a517dd4aaa82c239d2f34d2d Author: Caroline Zhu <[email protected]> Date: Mon Dec 4 13:37:14 2023 -0800 [js/web/training] Implemented runEvalStep & runOptimizerStep (#18259) * implemented runEvalStep and runOptimizerStep * added hasEvalModel and hasOptimizerModel boolean fields in TrainingSession representation * added evalInputNames and evalOutputNames fields to TrainingSessionHandler & TrainingSession * removed the inputNamesEncoded and outputNamesEncoded fields from TrainingSessionHandler -- since none of the training methods require the input names and output names as parameters, there's no need to store them. * part of the work for implementing web bindings for training * previous PR: #18250 --------- Co-authored-by: Ashwini Khade <[email protected]> commit 5353adcde37a118bdd25882482fd584c5ed3f343 Author: Jiajia Qin <[email protected]> Date: Tue Dec 5 05:18:37 2023 +0800 [js/webgpu] Use the naive convTranspose when in/out channels are both 1 (#18658) With this change, convTranspose with input0 [1, 18, 32, 1], input1 [1, 1, 16, 16] becomes 0.59ms from 6.64ms. commit a5b2291e0fe7c7d42f30154ccb20d6cde1380c3c Author: trajep <[email protected]> Date: Tue Dec 5 04:26:50 2023 +0800 [Transformer Optimization]Return model directly for unknown model type (#18642) This pull request is used to improves the handling of unsupported model types in the optimization process. commit 2f8b86b93906d0dd0549aca22798c660aa10db91 Author: Deoksang Kim <[email protected]> Date: Sat Dec 2 09:48:55 2023 +0900 Fix typo in the TensorShape (#17813) The function name in the log should be SizeToDimension commit 92ee664f64e96a8cc7308302a3e4f67f95254d1f Author: Jiajia Qin <[email protected]> Date: Sat Dec 2 07:35:35 2023 +0800 [js/webgpu] Fix shader errors in indicesGet/Set when rank > 4 (#18661) Currently, for non-uniform variables, we still use `array<u32, N>` type instead of array<vec4<u32>, N1>`. 
So we can't always treat all variables with rank > 4 as uniforms to index. This PR fixes below errors: ``` error(s) generated while compiling the shader: :5:44 error: index 4 out of bounds [0..1] return uniforms.input_strides[4] * (outputIndices[4] % uniforms.input_shape[4])+uniforms.input_strides[3] * (outputIndices[3] % uniforms.input_shape[3])+uniforms.input_strides[2] * (outputIndices[2] % uniforms.input_shape[2])+uniforms.input_strides[1] * (outputIndices[1] % uniforms.input_shape[1])+uniforms.input_strides[0] * (outputIndices[0] % uniforms.input_shape[0]); ^ FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - float32 FAILED #OpTest# - expand.jsonc [webgpu]Expand - Expand 5D - float32 Expand 5 - shape < input.size() commit eaaf27015e8d99c5a072caa40e0f4627f14a93e3 Author: Changming Sun <[email protected]> Date: Fri Dec 1 15:30:16 2023 -0800 Remove EnvSetupScript parameter from win-ci.yml (#18662) To make the code more consistent. Now some TRT pipelines download TRT binaries on-the-fly, while other TRT pipelines use a preinstalled version. This PR make them the same. commit 9c45fe4957ff3d027b5024abb170947db2cb0408 Author: Rachel Guo <[email protected]> Date: Fri Dec 1 14:47:46 2023 -0800 Fix macos xcframework test stage codesign info (#18649) <!-- Describe your changes. --> Remove developement id and force codesign not required in the test macos target. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix failure happened in iOS_Full_xcframwork stage in Zip-Nuget-Java-NodeJS packaging pipeline. --------- Co-authored-by: rachguo <[email protected]> commit a3538056314c10c1c4d5b769e86426434d486322 Author: Edward Chen <[email protected]> Date: Fri Dec 1 13:49:45 2023 -0800 Fix Windows TVM CI workflow (#18667) Fix issue with installing LLVM dependency. 
commit b22f49ff35b3c7b3ae339128e21898810e4c2919 Author: Edward Chen <[email protected]> Date: Fri Dec 1 09:41:25 2023 -0800 Fix unit tests failures in build with contrib ops disabled (#18659) Fix unit tests failures in build with contrib ops disabled. - QDQTransformerTests.QDQPropagation_GH11605_Opset12_19 - TransposeOptimizerTests.QnnTransposeNonConstBroadcastInput commit fcea2cb7f184d608efa1e5c72f9e25072e82009d Author: Bowen Bao <[email protected]> Date: Fri Dec 1 09:36:18 2023 -0800 [Dort] Run type promotion pass to resolve dtype discrepancy (#18516) Fixes CI failures mentioned in #18507 But we should not keep two separate dort impls in both pytorch and onnxruntime. They are out of sync. commit 05a9c957647b3cae0d2ad305950c14bf5f305bc8 Author: snadampal <[email protected]> Date: Fri Dec 1 11:16:44 2023 -0600 [DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847) Add ACL as the DNNL runtime option for aarch64 platforms. Update makefile and the python wheel build script. <!-- Describe your changes. --> Add ACL as the DNNL runtime option for aarch64 platforms. Update makefile and the python wheel build script. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> This is to enable the optimized ACL gemm kernels for dnnl execution provider on aarch64 platform. 
commit d69842226b47e5336568103541b071447caeb9bf Author: Jian Chen <[email protected]> Date: Fri Dec 1 07:57:46 2023 -0800 Update the template files to correct stage to fix the python cuda 12 packaging pipeline (#18651) commit 182c525416eb5cbace8df52b6809a77ffc91545d Author: guyang3532 <[email protected]> Date: Fri Dec 1 19:27:50 2023 +0800 Support MatMulBnb4 in PaddingElimination (#18646) Also support Cast pattern between input and embedding node for sparsity inspecting commit ccfea559428b1374d0109bfaacc273ce11f4ef3c Author: Hector Li <[email protected]> Date: Thu Nov 30 21:09:13 2023 -0800 [QNN EP] Enable QNN HTP VTCM size setting (#18653) [QNN EP] Enable QNN HTP VTCM size setting commit 9c9e6adeb2f31c73cebd7e92622c86f084858f68 Author: Tianlei Wu <[email protected]> Date: Thu Nov 30 18:19:31 2023 -0800 Add SDXL Turbo to demo (#18627) * Add SDXL Turbo to the demo. * Change default scheduler to EulerA for XL or Turbo since DDIM does not work well with small steps. Example to run the model in demo (See README for instructions): ``` python3 demo_txt2img_xl.py --version xl-turbo --height 512 --width 512 --denoising-steps 1 --scheduler UniPC "little cute gremlin sitting on a bed, cinematic" ``` commit c7732a78d7e815de489fed22cfee610a445b9ca2 Author: Wanming Lin <[email protected]> Date: Fri Dec 1 09:47:56 2023 +0800 [WebNN EP] Fixed bug in op checking (#18638) commit 73d9b035090a2bd4e56252dee10174d3f01e5f6f Author: Xu Xing <[email protected]> Date: Fri Dec 1 09:10:33 2023 +0800 [js/webgpu] Add multidimensional(>4) uniform support (#18546) This change removes the check of enableShapesUniforms. When all uses of this are removed, enableShapesUniforms can be removed too. commit 73a2eb82eb9364b4dea8df2cd6a46affd008b15c Author: Wanming Lin <[email protected]> Date: Fri Dec 1 08:19:22 2023 +0800 Fixed bug in Flatten's axis (#18645) Flatten's axis is in the range [-r, r] rather than [-r, r-1]. 
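To see why the upper bound matters for the Flatten fix above, the output shape can be sketched for every allowed axis value. `flatten_shape` is an illustrative helper, not WebNN EP code; it follows the ONNX Flatten definition, where dimensions before `axis` collapse into the first output dimension and the rest into the second.

```python
from math import prod


def flatten_shape(shape, axis=1):
    """Output shape of ONNX Flatten for a given input shape and axis."""
    r = len(shape)
    # axis == r is valid: all dimensions go into the first output dimension.
    if not -r <= axis <= r:
        raise ValueError(f"axis {axis} is outside [-{r}, {r}]")
    if axis < 0:
        axis += r
    # prod of an empty slice is 1, which covers axis == 0 and axis == r.
    return (prod(shape[:axis]), prod(shape[axis:]))
```

For an input of shape `(2, 3, 4)`, `axis=3` gives `(24, 1)`; a check against `[-r, r-1]` would wrongly reject that case.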
commit 6781b6cf3d4708e32e6bd546afa5b2b785290270 Author: Jiajia Qin <[email protected]> Date: Fri Dec 1 07:47:08 2023 +0800 [js/webgpu] add bool type for Expand/Gather (#18615) In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model, it uses expand with bool type running on cpu ep. | Kernel | Shape | Provider | | -------- | ------- | ------- | | Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | CPUExecutionProvider | After this change, it will run on jsep. | Kernel | Shape | Provider | | -------- | ------- | ------- | | Expand | "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] | JsExecutionProvider | commit efee9abdb72f73163943df80f0e6db1f5c23c42c Author: Yi Zhang <[email protected]> Date: Fri Dec 1 07:44:44 2023 +0800 Reduce downloads in Nuget-Java pipeline to reduce connection exception (#18635) 1. Add a new stage to download java tools from https://oss.sonatype.org and publish them to pipeline artifact 2. Remove downloads in other jobs, they get the java tools from pipeline artifact 3. consolidate final_java_testing stages. Reduce downloads to reduce the connection error like below. ``` --2023-11-28 07:16:31-- https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23 Connecting to oss.sonatype.org (oss.sonatype.org)|3.227.40.198|:443... connected. HTTP request sent, awaiting response... 502 Bad Gateway 2023-11-28 07:16:32 ERROR 502: Bad Gateway. ``` commit 4025bd8ebdda49331af45c7632cb5975fedf69c2 Author: zesongw <[email protected]> Date: Fri Dec 1 04:59:36 2023 +0800 [WebNN EP] Fix bug of padding in Op ConvTranspose (#18577) Get the dimensions of H and W according to the layout. 
commit b1e749e3beb8fe543500f7ba51ddc9754639525d Author: Jiajia Qin <[email protected]> Date: Fri Dec 1 04:57:29 2023 +0800 [js/webgpu] Add program name into webgpuProfiling info (#18640) Currently, we only print the kernelName, which is hard to distinguish which shader we actually used. For example, GroupedConv/Conv2DMatMul both belong to Conv kernel. It's not intuitive for profiling. commit c5ea1547c6d1070e6b6296fbf8e6d681107b8c7f Author: Dmitri Smirnov <[email protected]> Date: Thu Nov 30 10:50:24 2023 -0800 Eliminate intermediate string conversion buffer. (#18608) Make use of unsafe string constructor that is able to convert native UTF-8 string straight into the string instance buffer. Reduce garbage, commit e7f64f4510483bf0a94ce46478f02ead8d70e0d2 Author: Yulong Wang <[email protected]> Date: Thu Nov 30 09:50:47 2023 -0800 [js/web] fix ESLint by excluding generated .js from tsconfig.json (#18634) ESLint will went into error sometimes. The root cause is because some large generated JavaScript file in the tsconfig's include path will cause TypeScript parser fail in a line of `string.match()` with a regex on a huge string (~8MB), causing the following error: ``` RangeError: Maximum call stack size exceeded ``` The solution is to remove the large files from the tsconfig's include path. Previously I excluded the `web/dist/` folder and this PR excludes `web/test/ort.test[.min].js`. commit 23a91c8ba889d77589d6acf44fa9e9bce5fbb701 Author: Changming Sun <[email protected]> Date: Thu Nov 30 08:07:47 2023 -0800 Fix warning C4003 in ORT python binding code (#18612) Fix warning C4003 in ORT python binding code. It's better to fix the warning instead of suppressing it. 
commit 1b5675ff0fc7b2d9894ef06a7727efe0aad7cbd2 Author: Changming Sun <[email protected]> Date: Thu Nov 30 08:07:13 2023 -0800 Update post-merge-jobs.yml: increase timeout value for the Ios job (#18602) commit 148495ebc55827c8c521ea41493052ddbc428ab2 Author: Vincent Wang <[email protected]> Date: Thu Nov 30 20:17:22 2023 +0800 [ORTModule] Use Default Topo-order for GraphViewer (#18410) ORT's default topo-order is a reversed DFS algorithm, while the priority-based topo-order is a forward BFS algorithm. It's likely that the default order is better than priority-based order on memory because tensor memory is more likely to be released right after it's consumed. Currently ORTModule uses priority-based order, for some models, it sorts lots of small Ops to the beginning, this introduces big CPU overhead at the beginning (see below screenshot), this PR is to use default order for training. The priority-based order is heavily used for some recompute optimization, so if there is recompute enabled, we will still use priority-based order. This PR also adds an optimization to the default order, which is to move all Shape/Size Ops to right after their parent nodes. This is to make sure the shape and size nodes are executed right after their parents so it's possible the input tensor memory can be released as soon as possible. This is especially important for non-CPU devices or for training case where some gradient graphs use only shape/size of tensors from forward. 
Profiling result: Before <img width="910" alt="截屏2023-11-13 12 09 02" src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752"> After <img width="910" alt="截屏2023-11-13 12 10 44" src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab"> commit e1d1033131114dc2634e664d009e061d900a9554 Author: Vincent Wang <[email protected]> Date: Thu Nov 30 18:32:36 2023 +0800 [ORTModule] Remove Unused Arguments from Generated Triton Code (#18636) This PR: - Remove unused arguments from generated triton code, - Remove unnecessary mask for symbolic shape case from generated triton code. - Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE. commit 5c67a00d8e9ba3604593b6fe25a1e3da0c8ef65b Author: George Wu <[email protected]> Date: Wed Nov 29 22:27:51 2023 -0800 Revert "remove full protobuf requirement for tensorrt ep" (#18626) Reverts microsoft/onnxruntime#18413 there's a timing issue here. we eventually want to get this change merged in but we need to update OSS onnx-tensorrt first. commit c20488ced70488c9e95b6c11fdea309efe2fdc99 Author: Jambay Kinley <[email protected]> Date: Wed Nov 29 18:27:04 2023 -0800 skip_infer for SkipGroupNorm in SymbolicShapeInference (#18630) <!-- Describe your changes. --> https://github.com/microsoft/onnxruntime/pull/18273 added `SkipGroupNorm` contrib op but it did not skip onnx shape inference for this op in `SymbolicShapeInference`. This leads to failed shape inference of the transformers optimized model with `enable_skip_group_norm=True`. Also results in an invalid float16 model for the SD CUDA example. This PR adds `SkipGroupNorm` to `skip_infer` so that it skips onnx shape inference for this op and instead uses the relevant dispatcher. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix shape inference failure for models with `SkipGroupNorm` nodes. 
commit 227dcb3a88eb8c36bfc5c0341156ce96291597ac Author: Yang Gu <[email protected]> Date: Thu Nov 30 10:01:12 2023 +0800 [js/webgpu] Log the key and program info for artifact (#18365) With uniform support, ideally we may just keep one artifact for each program to save the compilation time. This PR just logs the related info, including key and program name, so that we may understand better the situation. commit 7335760424b052ff041285571cf52b77f9ebb009 Author: satyajandhyala <[email protected]> Date: Wed Nov 29 15:30:33 2023 -0800 [JS/Web] Add uniforms to Einsum (#18531) Add uinforms to Einsum <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance. commit 483c490ec4db2d2b5001e42f5c842abfc9e379af Author: Edward Chen <[email protected]> Date: Wed Nov 29 14:38:44 2023 -0800 Refine error checks in onnxruntime/core/providers/coreml/model/model.mm. (#18620) commit d2dfbf41795e72911643e2ffcadac069b72580bd Author: Dmitri Smirnov <[email protected]> Date: Wed Nov 29 10:44:59 2023 -0800 Add float16 type support to SplitToSequence and make code type independent (#18594) Add support for `float16` type to address the below issue. Re-work the code to make it type independent. This reduces binary size by ~11 K. 
![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6) This PR addresses https://github.com/microsoft/onnxruntime/issues/18481 commit 68209307daadfe21a74a36d44c4c170b91141772 Author: Yi Zhang <[email protected]> Date: Thu Nov 30 02:32:42 2023 +0800 Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 (#18614) Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 Reduce the maintenance cost commit 38b640c797613e2396f2975ccd4d8ff0e95a5baa Author: Wanming Lin <[email protected]> Date: Thu Nov 30 00:00:23 2023 +0800 [WebNN EP] Re-implement Unsqueeze, Squeeze, Flatten with WebNN's reshape (#18585) WebNN will not provide `unsqueeze`, `squeeze`, `flatten2d` ops, as it can be easily implemented by reshape. commit 14a343441dcd530bec24e18e34c3c068993eb06c Author: Edward Chen <[email protected]> Date: Tue Nov 28 17:14:20 2023 -0800 Fix Objective-C static analysis build (#18606) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings. commit e833d22f143f86529f4863b5da6cac4eb4a78bbb Author: ivberg <[email protected]> Date: Tue Nov 28 16:58:51 2023 -0800 Change QNN EP Profiling logs to output to CSV (#18201) Change QNN EP Profiling logs to output to CSV. Output is in a similar format to QNN SDK Tools (instead of to ORT logs) https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#configuration-options (profiling_level) It is hard to read and interpret QNN profiling logs in the ORT logs. --------- Co-authored-by: Hector Li <[email protected]> commit f13380f3d8d25df797be60b4899b43504a5576b5 Author: Tianlei Wu <[email protected]> Date: Tue Nov 28 15:46:42 2023 -0800 Support LoRA and Control Net in Stable Diffusion demo (#18593) (1) Export onnx model with LoRA weights for both SD 1.5 and SDXL (2) Export onnx model with Control Net for both SD 1.5 and SDXL. For SD 1.5, it is allowed to use multiple control nets. 
For SDXL, at most one control net is supported right now. (3) Add demo of LCM LoRA (3) Add demo of control net. commit 50e6235af111e5113860dfd7a0ece55dc00316a0 Author: Yulong Wang <[email protected]> Date: Tue Nov 28 15:15:59 2023 -0800 [js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525) This PR includes a change that inspired from #18452 to resolve a requirement: a shader may depend on an instance of `IndicesHelper` to generate WGSL code snippet, but the IndicesHelper instance is not necessarily an input/output of the program. So the existing `declareVariables()` function does not work with this scenario. In order to support this requirement, I added this "use" function to `interface ShaderHelper`, which takes a helper-like object as parameter. The hidden implementation `ShaderHelperImpl` class will iterate the helpers and call `impl()` for each. @axinging @qjia7 commit a49f31b6705bdd8a9b9cd7b7b4a9bbc0ebba07a2 Author: Jian Chen <[email protected]> Date: Tue Nov 28 13:23:01 2023 -0800 Remove drop-nuget artifact from all pipelines (#18592) Currently, the `drop-nuget` artifact only contains protoc.exe which is also part of the `drop-extra` artifact. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> commit e24733cfe9b3e0d40419942f2d6337925c351606 Author: Mike Guo <[email protected]> Date: Wed Nov 29 03:42:39 2023 +0800 fix the Olive CI pipeline failure on Windows (#18464) Fix the https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1046 failure for Windows commit 288b80d363bc120c8d3c0ca3c2fe4252e16f4c56 Author: Rachel Guo <[email protected]> Date: Tue Nov 28 10:11:53 2023 -0800 Add MacOS build to ORT C Pod (#18550) <!-- Describe your changes. --> As title. 1. Add macos build as an optionally enabled arch for pod and changes to exsiting build_ios_framework/assemble_c_pod scripts. 2. 
Enable macos build arch in ios packaging pipeline (currently for variants other than Mobile) and check the output artifacts are correct. 3. Write MacOS Test Target scheme in the test app and integrate into ios packaging CI testing pipeline. Currently the changes only apply to onnxruntime-c pod. as the original request was from ORT SPM which consumes the onnxruntime-c pod only as the binary target. TODO: could look into adding macos platform to objc pod as well. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable macos platform support in cocoapods. and also potentially produce binary target for enabling macos platform in SPM as well. Replace https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <[email protected]> Co-authored-by: rachguo <[email protected]> Co-authored-by: Edward Chen <[email protected]> commit 05046e5452f7a1f47bb1f4c01ddfa86eb6fac77f Author: Chen Fu <[email protected]> Date: Tue Nov 28 10:01:09 2023 -0800 Adding unit test for sm80 prepack (#18514) Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated quantized weight rearrangement to speedup loading data to tensor-core. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitate iterating matrix elements and tiles that balance memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weight, scales and offsets (aka. prepacking) 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on CPU with single threaded code. We run this on CPU in order to minimize GPU memory fragmentation. On the other hand, hopefully we get around to parallelize this part of the code. 
Should be straight forward with the unit tests in place. commit 8d5ecc4dae0686d032a81c3633fdaf213572a722 Author: Adrian Lizarraga <[email protected]> Date: Tue Nov 28 09:46:47 2023 -0800 [Quantization] Fix scale/zero-point for 16-bit QDQ Softmax (#18589) Sets the appropriate scale and zero-point values for 16-bit QDQ Softmax. Previously, the scale/zp were set to fixed values that were specific to 8-bit quantization. Generate more accurate 16-bit QDQ models that contain Softmax. commit 0b7048e7d621b271b0ab4748e566f57d11b49be5 Author: Sheil Kumar <[email protected]> Date: Tue Nov 28 09:26:48 2023 -0800 Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384) Update winml to use #cores - #soc cores by Default as the number of intraopthreads --------- Co-authored-by: Sheil Kumar <[email protected]> commit a6d872640764ea50ec460f7a717e5b369921f8b4 Author: Yi Zhang <[email protected]> Date: Wed Nov 29 01:04:25 2023 +0800 Update ADO windows image to custom image (#18598) Update Azure-Pipelines-EO-Windows2022-aiinfra to onnxruntime-win-CPU-2022 in Nuget_Package_CPU. To make the debugging easier, use flex-downloadPipelineArtifact Azure-Pipelines-EO-Windows2022-aiinfra is using 1ES window-latest image. The pipeline might be failed by unexpected upgrade. Verified: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=384425&view=results I think we should replace all Azure-Pipelines-EO-Windows2022-aiinfra. commit 3ea27c29253aad7c02015e2af6d37dedafe2c9c3 Author: Jian Chen <[email protected]> Date: Tue Nov 28 09:03:46 2023 -0800 Create a new Nuget Package pipeline for CUDA 12 (#18135) commit 94a6020a7f59f22101653988a36bca02593eb816 Author: Xavier Dupré <[email protected]> Date: Tue Nov 28 03:56:00 2023 -0800 Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539) TfIdfVectorizer has two steps: first search for n-grams in the input, second, weight the results. The second step was not parallelized. 
The PR adresses that issue. Before two vectors were of the size of the output were allocated to compute the results. The first one, frequencies, was used as an intermediate vector between the two steps. This vector is now broken into multiple small vectors, one per thread. The memory consumption is then reduced for batches with a number of rows > the number of threads. Performance and memory consumption. For one model, the improvment is +15% faster (4 cores, model size is ~6Mb, batch size is 100). Here is another benchmark on a machine with 32 cores with different size of vocabularies and batch sizes. The tested TfIdfVectorizer only deals with unigram and processes sequences of 10 tokens (integers). ![image](https://github.com/microsoft/onnxruntime/assets/22452781/0bb9abe9-ed81-44da-b5c4-ad2a12f129bd) commit 3f42fbad2e42cf03c01eb0428b06e24f4ad2d427 Author: Ran Gal <[email protected]> Date: Mon Nov 27 23:54:38 2023 -0800 deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543) deleted the unused random_device variables because they caused a warning that was treated like an error. **_Please check if the declaration is required for the random number generation. if so, there need to be a dummy reference to the variable or turning off the warning as error behavior._** <!-- Describe your changes. --> <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> commit fc8631e2f11d85c84ab9cc711aacb9c589b6f71a Author: Jiajia Qin <[email protected]> Date: Tue Nov 28 13:21:47 2023 +0800 [js/web] Fix conv2dMatmul errors due to #18452 (#18562) Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0 will report compilation errors. Models, which include this kind of shape will be impacted, like mobilenetv2-12, resnet50 . 
The errors is introduced by #18452 https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200, which accidentally pass `components` to `x`. But `x`'s components is `innerElementSize` not `components `. And when `innerElementSize` is 3, we should use `1` in current design. commit b9fd9c5665c998fea8786a2e9fee2776e667845c Author: cao lei <[email protected]> Date: Mon Nov 27 13:41:12 2023 -0800 remove dead code in openvino EP (#18457) <!-- Describe your changes. --> Remove dead code in openvino EP <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Remove dead code in openvino EP commit dd355e39a063c124142f60d6cc14f6d48692e1f7 Author: Caroline Zhu <[email protected]> Date: Mon Nov 27 10:30:13 2023 -0800 [js/web/training] Added parameters methods (#18250) * Implemented: `getParametersSize`, `getContiguousParameters` (equivalent to copyParametersToBuffer), and `loadParametersBuffer` (equivalent to copyParametersFromBuffer) * as part of these changes, getParametersSize was added to the TrainingSession interface so that users know what size buffer to create for loadParametersBuffer * The parameters methods in the interface were modified to take in a Float32Array instead * part of the work for implementing web bindings for training * enables federated learning in the web * previous PR: #18006 --------- Co-authored-by: Ashwini Khade <[email protected]> commit a2fd8a6fc083f43d6535f5acd24219c140812c87 Author: Hector Li <[email protected]> Date: Fri Nov 24 20:41:27 2023 -0800 [QNN EP] Return INVALID_GRAPH if failed to load from context binary (#18485) [QNN EP] Return INVALID_GRAPH if failed to load from context binary Make sure QNN EP return INVALID_GRAPH if error encountered with the context binary file commit 2f608338cb46398fc3806cb6d1fd3ba7961b1a9f Author: cloudhan <[email protected]> Date: Fri Nov 24 18:04:48 2023 +0800 Setup 
default python formatter for new python plugin (#18563) commit 7b2aefa85688a02a58c5dd7bddc90e7f81f44c3a Author: Ted Themistokleous <[email protected]> Date: Fri Nov 24 05:04:23 2023 -0500 undo hipify of __half to rocblas_half (#18573) Fixes build issue seen with newer ROCm releases Co-authored-by: Jeff Daily <[email protected]> commit b9c935f6050b3a57e23dbb79e739489f25f6924a Author: mindest <[email protected]> Date: Fri Nov 24 17:22:00 2023 +0800 [ROCm] Some fixes in tunable (#18575) * Fix workspace size for hipBLASLt algos at 32M * Update according to API changes commit 62f00ad8e7b7bbaf144e9af2bb19d9bf63dcd291 Author: Rachel Guo <[email protected]> Date: Thu Nov 23 14:26:57 2023 -0800 [CoreML] Add Softmax and Split op support (#18358) <!-- Describe your changes. --> As title. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Added for yolov8 model missing operator support. https://github.com/microsoft/onnxruntime/issues/17654 Now the model support info looks like: _CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 3 number of nodes in the graph: 233 number of nodes supported by CoreML: 230_ (only missing 3 concat op support due to input 3d shape is not currently support in CoreML EP Concat). --------- Co-authored-by: rachguo <[email protected]> Co-authored-by: rachguo <[email protected]> Co-authored-by: Edward Chen <[email protected]> commit 6f3c1f9dc9c08ec52c3c2e975e35308b08219494 Author: cloudhan <[email protected]> Date: Thu Nov 23 12:06:19 2023 +0800 [ROCm] Update ck for GemmFloat8 (#18487) commit 1c79897c90f959d30ed68c9b36d82be0024d806b Author: Adrian Lizarraga <[email protected]> Date: Wed Nov 22 19:40:33 2023 -0800 [QNN EP] Support LpNormalization (#18561) Add support for the ONNX LpNormalization operator (p == 2). This is translated to QNN's L2Norm operator. 
Support more models with QNN EP commit 43a5147e015e105547aa0e6862462a352fa43c5f Author: pengwa <[email protected]> Date: Thu Nov 23 11:39:00 2023 +0800 Memory optimization refactor and refinement (#17481) Currently memory optimizer runs graph transformations and print recompute opportunities in INFO level, while ORT backend has many many INFO level logs making users hard to find those information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can create an ORT session created, where allocation plan is done, the analysis will consider buffer reuse as well. This can void giving some recomputation subgraphs that are reusing other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull requests made following refactoring: 1. Print the log in ORTModule Python script, along with ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. We are analyzing memory optimization opportunities considering ORT memory planning. This is done by firstly creating the execution graph without enabling MemoryOptimizer, then we call `execution_agent.get_serialized_ortmodule_memory_stat` which internally will consider the session memory allocation planner when analyzing memory optimization opportunity. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. 
This will make introducing new strategies (for example compression and decompression ) easier. New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - ***** ONNX Runtime Training (ORTModule) is accelerating your model ***** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0*inputs_input_ids_dim0*inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0*inputs_input_ids_dim0*inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming 
shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== |Freq | Memory Optimization Opportunities (Clustered by node-level activation patterns) | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |3 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ | | | Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 | | | Stashed Activations: | | | - ReuseFreq : Output 0(3), | | | - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Reshape+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 | | | Stashed Activations: | | | - ReuseFreq : Output 0(2), | | | - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph FusedMatMul+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Cast+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. 
| | | | | |>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph BiasGelu+ | | | Status : Enabled, requested count=-1, actual applied count=2 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |2 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Reshape+Where+ | | | Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph FusedMatMul+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Cast+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ | | | Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved | | | | | |>>Option 2 : RecomputeWithCompromise subgraph Cast+ | | | Status : Enabled, requested count=-1, actual applied count=1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph BiasSoftmax+ | | | Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph BiasGelu+ | | | Status : Enabled, requested count=-1, actual applied count=1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |1 |For each row options are mutually exclusive, only one of them can be enabled. | | | | | |>>Option 1 : Recompute subgraph Add+ | | | Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 | | | Stashed Activations: | | | - Output 0 : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved | |_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. ************************************************************************ ``` <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> commit 64dacc2892d31603a5723959d308bb9c4b05d0ea Author: Jiajia Qin <[email protected]> Date: Thu Nov 23 07:58:06 2023 +0800 [js/webgpu] Add BatchNormalization Op (#18468) This PR adds `BatchNormalization` with `float` support. Some Todos: 1. all inputs don't have same data type. For example, x/y is float16, but bias/scale is float32 or double. 2. training mode support. We see many models are using `BatchNormalization` ops. However, due to the missing in jsep, all of them run on cpu, which result very poor performance. With this PR's support, densenet-9 model becomes 20.29 ms from 250.69 ms. commit fa106942a7962e68f1659cd65f5a7cdb498b8c03 Author: Xu Xing <[email protected]> Date: Thu Nov 23 06:42:55 2023 +0800 [js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms. commit 42c6799c59b5770809a6b4df208d3da5a0270486 Author: Scott McKay <[email protected]> Date: Thu Nov 23 08:27:47 2023 +1000 Update transpose optimization to be more QDQ aware (#18444) <!-- Describe your changes. 
--> Rework some aspects of the transpose optimizer to ensure we have valid QDQ node units when it is done. Conceptually we need to let individual Transpose nodes move through the graph when optimizing. That can invalidate existing QDQ node units or require new ones. We can fix this after inserting new nodes, or when transpose optimization finishes moving Transpose nodes. Fix when inserting new node - TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a node's inputs - if there was a DQ node providing the input, add a Q -> DQ after inserting the Unsqueeze/Transpose to make a QDQ node unit for the new node. - Unsqueeze/Transpose don't change data, so we can copy the type/scale/zero point from the existing DQ Fixes when transpose optimization completes moving Transpose nodes - Remove empty DQ -> Q pairs if the type/scale/zero point match - Pushing a Transpose through may have resulted in an existing Transpose/Reshape being cancelled and removed leaving an empty QDQ node unit - the Transpose being moved may have started in a QDQ node unit - Transpose that got blocked inside existing QDQ node unit - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets blocked after the DQ - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and repair the original QDQ node unit - Transpose moves past a DQ providing a graph output - insert a Q -> DQ so the Transpose is in a QDQ node unit This replaces the existing phase 2 logic which flipped a DQ -> Transpose to fix a broken QDQ node unit. The new approach should handle more scenarios and hopefully produce a better graph. Additionally the logic to handle updates to shared initializers that feed DQ nodes was simplified (i.e. largely removed). When we update the shared initializer a Squeeze (if broadcast) and Transpose is added between the initializer and the DQ for other usages of it. We only need to check for this pattern in EstimateTransposeValueCost by looking past a DQ node. 
We do not need to track the individual DQ nodes leading to an updated shared initializer. <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Initially to fix QNN issue with non-const input being transpose and the QDQ node units being broken. commit 841f7ed3e0c393b22b1631c090c61b20fc62f876 Author: satyajandhyala <[email protected]> Date: Wed Nov 22 14:14:24 2023 -0800 [[JS/Web]Added uniform to Expand op. (#18558) <!-- Describe your changes. --> Added Uniforms to Expand operator kernel <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance commit 1c555c5fc11d673df9db4f08ebf389c9929e85c0 Author: Arthur Islamov <[email protected]> Date: Thu Nov 23 00:12:07 2023 +0400 [JS/Web] Resize & BiasSplitGelu fp16 support (#18536) Resize and BiasSplitGelu fp16 support on WebGPU commit 3f0ebd673622d3663011ae33fc6070f1f2ea3af3 Author: Xavier Dupré <[email protected]> Date: Wed Nov 22 18:15:24 2023 +0100 Fix opset import in GemmFloat8 python unit tests (#18489) The unit test are failing if a development version of onnx is used. The opset are set to 19. commit 32fabb555501a020751b6123de94c7fc14086f2b Author: Xavier Dupré <[email protected]> Date: Wed Nov 22 18:15:11 2023 +0100 Fix opset version of the optimizer in function generate_artifacts (#18300) `generate_artifacts` generates 4 graphs for training. All graphs should share the same opset version, the one coming from the model to train, but the optimizer is left un…
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024
Description
Update a few optimizations for Stable Diffusion XL:
(1) Add SkipGroupNorm fusion.
(2) Remove GroupNorm fusion limits. Previously, GroupNorm was fused only when the channel count was one of `320, 640, 960, 1280, 1920, 2560, 128, 256, 512`, so some GroupNorm nodes in the refiner were not fused.
(3) Tune SkipLayerNormalization to use the vectorized kernel for hidden sizes 320, 640 and 1280.
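To illustrate what the fusion in (1) replaces, here is a minimal pure-Python sketch of the computation, assuming the fused node adds the skip tensor (and an optional per-channel bias) to the input and then applies GroupNorm to the sum. The function name `skip_group_norm`, the list-of-pixels layout, the `epsilon` default and the choice to return the sum as a second output are all assumptions made for this example, not the ORT kernel or its API.

```python
import math


def skip_group_norm(x, skip, gamma, beta, groups, epsilon=1e-5, bias=None):
    """Reference sketch for one sample in channels-last layout.

    x, skip: list of pixels, each pixel a list of C floats.
    gamma, beta: per-channel scale and shift (length C).
    bias: optional per-channel bias added before normalization (assumption).
    Returns (normalized output, element-wise sum).
    """
    channels = len(gamma)
    assert channels % groups == 0, "channels must be divisible by groups"
    per_group = channels // groups

    # Step 1: the "skip" part, an element-wise add of input, skip and bias.
    added = [
        [xv + sv + (bias[c] if bias is not None else 0.0)
         for c, (xv, sv) in enumerate(zip(px, ps))]
        for px, ps in zip(x, skip)
    ]

    # Step 2: GroupNorm, with statistics over all pixels and the channels
    # that belong to each group.
    out = [[0.0] * channels for _ in added]
    for g in range(groups):
        group_channels = range(g * per_group, (g + 1) * per_group)
        values = [pixel[c] for pixel in added for c in group_channels]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + epsilon)
        for i, pixel in enumerate(added):
            for c in group_channels:
                out[i][c] = (pixel[c] - mean) * inv_std * gamma[c] + beta[c]
    return out, added


# Two pixels, four channels, two groups.
x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
skip = [[0.5] * 4, [-0.5] * 4]
y, s = skip_group_norm(x, skip, gamma=[1.0] * 4, beta=[0.0] * 4, groups=2)
```

Doing both steps in one node avoids writing the intermediate sum out and reading it back for a separate GroupNorm node, which is the kind of memory traffic a fused kernel is meant to save.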
Pipeline Improvements:
(4) Enable CUDA graph for unetxl.
(5) Change the optimization flow to generate an optimized fp32 model with ORT first, then convert it to fp16. Otherwise, the fp16 model might be invalid.
(6) Add an option, enable-vae-slicing, to enable VAE slicing.
Bug fixes:
(a) Fix VAE decode in the SD demo.
(b) Fix UniPC add_noise, which was missing a parameter.
(c) EulerA raises an exception in the SDXL demo. It is disabled for now.
(d) Batch size > 4 causes an error in the VAE without slicing. VAE slicing is now forced on when batch size > 4.
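The behavior described in (6) and (d) can be sketched as follows. This is a hedged illustration only: `decode_latents`, `decode_fn` and `max_unsliced_batch` are hypothetical names, and the demo's actual code may be organized differently. The idea is that slicing decodes one latent at a time instead of the whole batch, and that it is switched on automatically once the batch is larger than 4.

```python
def decode_latents(latents, decode_fn, enable_vae_slicing=False, max_unsliced_batch=4):
    """Decode a batch of latents, optionally one sample at a time.

    decode_fn takes a list of latents and returns a list of images
    (a stand-in for the VAE decoder; hypothetical signature).
    """
    # Mirror fix (d): large batches always use slicing.
    if len(latents) > max_unsliced_batch:
        enable_vae_slicing = True

    if not enable_vae_slicing:
        return decode_fn(latents)

    # Sliced path: run the decoder with batch size 1 and collect the results.
    images = []
    for latent in latents:
        images.extend(decode_fn([latent]))
    return images


# Tiny demonstration with a fake decoder that records the batch sizes it sees.
seen_batch_sizes = []


def fake_decode(batch):
    seen_batch_sizes.append(len(batch))
    return [value * 2 for value in batch]


small = decode_latents([1, 2, 3], fake_decode)          # decoded in one call
large = decode_latents([1, 2, 3, 4, 5], fake_decode)    # forced to slice
```

Slicing trades a little throughput for a lower peak memory footprint in the decoder, which is why it serves as a workaround for the large-batch failure.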
Performance Test on A100-SXM4-80GB
Description of the experiment configurations in the results:
Baseline: GroupNorm fusion limits removed; CUDA graph is enabled in Clip and VAE, but not in Clip2 and UNet.
UNetCG: Enable CUDA graph on UNet
SLN: Tune SkipLayerNormalization
SGN: Add SkipGroupNorm fusion
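The latency (ms) of generating a 1024x1024 image with 30 steps of the base model and 9 steps of the refiner model:

| | Baseline | UNetCG | UNetCG+SLN | UNetCG+SLN+SGN |
| -- | -- | -- | -- | -- |
| Base Clip | 3.74 | 3.70 | 3.88 | 3.81 |
| Base Unet x30 | 2567.73 | 2510.69 | 2505.09 | 2499.99 |
| Refiner Clip | 7.59 | 7.42 | 7.41 | 7.58 |
| Refiner Unet x9 | 814.43 | 803.03 | 802.20 | 799.06 |
| Refiner VAE Decoder | 84.62 | 85.18 | 85.24 | 87.43 |
| E2E | 3480.56 | 3412.05 | 3405.77 | 3400.23 |
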
Enabling CUDA graph on the UNet brought the largest gain (around 68 ms). SLN tuning gives a gain of about 7 ms, and SkipGroupNorm fusion about 5 ms.
SkipGroupNorm fusion does not reduce latency much, but it also reduces memory usage, so enabling it is recommended.
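The per-step gains can be recomputed from the end-to-end (E2E) latencies reported for this PR. The snippet below is just that arithmetic; the variable and function names are illustrative. Note that the SLN step works out to a little over 6 ms from the E2E figures, slightly below the rounded number quoted above.

```python
# End-to-end latency (ms) per configuration, from the E2E figures reported for this PR.
e2e_ms = {
    "Baseline": 3480.56,
    "UNetCG": 3412.05,
    "UNetCG+SLN": 3405.77,
    "UNetCG+SLN+SGN": 3400.23,
}


def incremental_gains(latencies):
    """Return the latency saved by each configuration relative to the previous one."""
    names = list(latencies)
    return {
        cur: round(latencies[prev] - latencies[cur], 2)
        for prev, cur in zip(names, names[1:])
    }


gains = incremental_gains(e2e_ms)
total_gain = round(e2e_ms["Baseline"] - e2e_ms["UNetCG+SLN+SGN"], 2)
total_pct = 100.0 * total_gain / e2e_ms["Baseline"]

for name, gain in gains.items():
    print(f"{name}: {gain:.2f} ms faster than the previous configuration")
print(f"Total: {total_gain:.2f} ms ({total_pct:.2f}% of baseline)")
```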
Motivation and Context
Additional optimizations upon previous work in #17536.