Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm #2

hubertlu-tw · 2022-02-18T03:59:57Z

Description:
Optimized _BinaryElementWise, _BinaryElementWiseRhsPerChannelBatchN, and BiasGeluGradDxKernel for AMD GPUs.

Motivation and Context

Why is this change required? What problem does it solve?

The 4% runtime difference is observed in TNLRv4 on MI100 vs. V100. The cause of discrepancy is from the following three kernels: _BinaryElementWise, _BinaryElementWiseRhsPerChannelBatchN, and BiasGeluGradDxKernel. Given by the reproducer, we are able to see the performance improvement in terms of original sub-optimal kernels by 15.94%, 19.28%, and 27.97%, respectively.

…reml model schema (microsoft#10424) * remove coremltools submodule * update cgmanifest * Copy proto files directly from coremltools

…span (microsoft#10426)

* reducesum bf16 support * bf16 for add/sub/mul/div * fix build * bf16 for Cast * bf16 for softmax Co-authored-by: root <[email protected]>

…#10379) Add a test for a missing optional input to Loop.

Enable transpose optimizer and infrastructure it depends on in a minimal extended build.

* Add qdq conv to NNAPI * fix build warning * addressed CR comments * fix a minor bug in my previous merge

Make sure quantize_statict would insert DQ -> Q before ArgMax.

* expand model tests name * skip cpu/cuda for trt when running onnxruntime_test_all * only run trt ep for c++ unit test * Update CMAKE_CUDA_ARCHITECTURES for T4 * Use new t4 agent pool * Update YAML for run T4 on Windows * revert code * Update CMAKE_CUDA_ARCHITECTURES * fix wrong value * Remove cpu/cuda directly in model tests * add only CMAKE_CUDA_ARCHITECTURES=75 * remove expanding model test name to see difference * revert code * Add fallback execution provider for unit test * Add fallback execution provider for unit test (cont) * add conditional to add fackback cuda ep * Reduction op takes much longer time for TRT 8.2, so we test smaller range of inputs * use M60 * revert code * revert code * add comments * Modify code and add comment * modify comment * update comment * add comment

* move table names to one location * remove session metadata * reload trt inputs * fix posting names * Update linux-gpu-tensorrt-daily-perf-pipeline.yml for Azure Pipelines * remove comments * Split up anubis job and perf run * add trt environ variables * No embedded links

…ild docker images correctly (microsoft#10445) * fix build errors for the migraphx and rocm dockerfile * add the numpy package in the migraphx and rocm dockerfile

Co-authored-by: Weixing Zhang <[email protected]>

) Add binding external allocation Add negative tests Add missing return status check

* Add NNAPI support of QDQ Resize * minor update to UT * fix build break * fix android UT failure * address cr comments

* add qgemm in quantization tool * add qdq support for QGemm * fix build break * fix OperatorKernels.md

* Add BeamSearch op schema * Add ONNX conversion for beams search * remove attention_mask and change input order * add option to run baseline * add check data type NULL * applies VerifyNodeAndOpMatch to subgraph * update input_ids shape * Add node name for Cast node * expose API for topk * parse parameters * Add beam search scorer * output results * fix typo * use c++ template and format python * fix build pipeline errors * symbolic shape infer of input onnx * output scores * add kernel def hash * Handle vocab_mask; move CheckSubgraph * undo insert_cast_transformer.cc and fusion_utils.py * fix typo * fix merge * update doc * add repetition penalty * refactoring: add GptSubgraph class * move BeamSearchState from .h to .cc file * adjust logits processor order * add batch generation example * fix repetition penalty for dup words in sequence * Add test * Add no repeat ngram processor * refactoring: move logits processor to classes * fix build warning * show latency * use allocator in beam state * use allocator in sequences * fix build error * move next_positions to beam state * Changes for prefix matching * removing debugs * removing more debugs * clean up * clean up * cpu doc updated * Updated docs * updated prefix_vocab_mask dimension in convert script * changes to support bxs prefix_vocab_mask in beamsearchop kernel * doc update * OperatorKernels.md updated * matching docs from artifacts * minor change in logits processor * Addressing comments * Updated the prefix vocab mask usage properly Co-authored-by: Tianlei Wu <[email protected]>

…pes when partitioning (microsoft#10452) * wip * wip * wip * save * address pr comments * address pr comments Co-authored-by: rachguo <[email protected]>

* Fuse Clip->Q to Q * Remove unused variable argmax_node * Remove braces around scalar initializer * Move GetClipConstantMinMax under ORT_MINIMAL_BUILD * Consider epsilon so we can fuse more cases

…ft#10453) * Support specifying execution providers. * Change default provider setting to None. * Add support for bert_perf_test script. * Fall back to ROCM/CUDA EP for MIGraphX/Tensorrt EP. * Assert fall back EPs are included. * Add model class AutoModelForCausalLM and other minor updates. Co-authored-by: Yao Zhang <[email protected]>

* bf16 support * bf16 support * UTs * fix build * fix UTs Co-authored-by: root <[email protected]>

* align with cpu on no input data * review comments and add tests Co-authored-by: Ubuntu <wy@linux-v100.aidmrjtolptuzevavgwhrapqcd.jx.internal.cloudapp.net>

This patch fixes the error "requested alignment X is larger than Y" in older GCC's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357

* support tnlr based offensive V4 model * Update onnx_model_tnlr.py Co-authored-by: Ubuntu <wy@linux-v100.aidmrjtolptuzevavgwhrapqcd.jx.internal.cloudapp.net>

…osoft#9486) * Contrib ops for TRT plugins: EfficientNMS and Pyramid ROI Align * Contrib ops for TRT plugins: Multilevel Crop and Resize

* wip * save * address pr comments * update * revert minor changes Co-authored-by: rachguo <[email protected]>

… as new provider option (microsoft#10450) * modify code for add additional field in OrtTensorRTProviderOptionsV2 * add include file * fix typo * fix bug * add comment * fix code * revert change

* bf16 support * minor clean up * UTs * fix build * UTs * UTs * merge commit 6b5504c * minor * ROCm code cleanup * fix build * fix build * minor Co-authored-by: Ethan Tao <[email protected]@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: root <[email protected]>

Co-authored-by: Aishwarya Bhandare <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

* Optimize cuComputePartGradGammaBeta kernel for MI100 Co-authored-by: root <[email protected]> Co-authored-by: Jeff Daily <[email protected]>

….cc (microsoft#10507)

…t#10474) * Pipeline for ONNX Runtime react native * Fix a test failure * test with custom built binaries * add onnxruntime-common package back * don't bob build when bootstrap * revise Android test * rename example to e2e * remove onnxruntime packages from package.json * remove release-it package * upgrade gradle version to the same as CI * add a pipeline for react native * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * android and ios mobile build for react native e2e * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * use android aar package template * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * use android aar package template * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * publish ios test results * add e2e tests and publish a npm package * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * remove aar from npm package * wait for view displayed * change a waiting logic * increase wait time for app launching * give more time to launch an app * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * disable metro server on testing * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * test ios simulator launching * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * fix iOS e2e test * use a publishing version of npm packages * make pretty * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * make only one onnxruntime-common package after packaging * make a powershell script of packaging universal * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Add a warning for file changes during a test * clean up * fix lint errors * fix js npm packaging * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * Update mac-react-native-ci-pipeline.yml for Azure Pipelines * resolve comments * fix a typo

…nference sessions (microsoft#10481) * Changed file-scope static variables to automatic variables or function-scope static const. * Reduce load time overhead by using constexpr. * Use node indices instead of node names to track inserted, deleted and changed nodes. Co-authored-by: Satya Jandhyala <[email protected]>

Formatting related changes

…t#10512) There's a circular dependency between onnxruntime_util and onnxruntime_framework. Remove onnxruntime_util's dependency on onnxruntime_framework.

…fallbacking CPU happens (microsoft#10522) Co-authored-by: Weixing Zhang <[email protected]>

* Fix UT * UT * UTs * enable ROCm UT * fix build attempt * minor * fix UT * fix UT * fix UTs Co-authored-by: Ethan Tao <[email protected]@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: root <[email protected]>

* Squashed commit of the following: commit 1238049 Author: Guoyu Wang <[email protected]> Date: Mon Feb 7 12:59:04 2022 -0800 Add qdq mul support commit 9cadda7 Merge: 7a32847 0f5d0a0 Author: Guoyu Wang <[email protected]> Date: Mon Feb 7 11:24:47 2022 -0800 Merge remote-tracking branch 'origin/master' into gwang-msft/qdq_mul commit 7a32847 Author: Guoyu Wang <[email protected]> Date: Mon Feb 7 00:41:30 2022 -0800 move test case to util commit c1a8f0d Author: Guoyu Wang <[email protected]> Date: Fri Feb 4 13:04:26 2022 -0800 update input/output check commit a6f0a0d Author: Guoyu Wang <[email protected]> Date: Thu Feb 3 18:37:21 2022 -0800 update quantized io check functions commit 87f4d1d Merge: 7849f07 97b8f6f Author: Guoyu Wang <[email protected]> Date: Wed Feb 2 17:22:58 2022 -0800 Merge remote-tracking branch 'origin/master' into gwang-msft/qdq_mul commit 7849f07 Author: Guoyu Wang <[email protected]> Date: Wed Feb 2 17:22:55 2022 -0800 minor update commit 7196cdf Author: Guoyu Wang <[email protected]> Date: Wed Feb 2 10:50:10 2022 -0800 init change commit 84c0077 Merge: a8c7dce 7318361 Author: Guoyu Wang <[email protected]> Date: Tue Feb 1 18:21:17 2022 -0800 Merge remote-tracking branch 'origin/master' into gwang-msft/qdq_mul commit a8c7dce Merge: 55e536c ef7b4dc Author: Guoyu Wang <[email protected]> Date: Tue Feb 1 13:51:04 2022 -0800 Merge remote-tracking branch 'origin/master' into gwang-msft/qdq_mul commit 55e536c Author: Guoyu Wang <[email protected]> Date: Tue Feb 1 11:44:34 2022 -0800 address cr comments commit d460f5b Author: Guoyu Wang <[email protected]> Date: Tue Feb 1 00:33:54 2022 -0800 fix android UT failure commit 52146cf Author: Guoyu Wang <[email protected]> Date: Mon Jan 31 16:01:13 2022 -0800 fix build break commit ec6d07d Author: Guoyu Wang <[email protected]> Date: Mon Jan 31 15:41:52 2022 -0800 minor update to UT commit 8ec8490 Author: Guoyu Wang <[email protected]> Date: Mon Jan 31 15:01:30 2022 -0800 Add NNAPI support of QDQ Resize * Update qdq add/mul test case, fix build break * Address CR comments * Add QLinearMul support * remove unused params * Address CR comments * wip * save * minor fix * fix * fix build * address pr comments * fix wrong ut tests * address comments * minor update * fix addinitializersskip Co-authored-by: Guoyu Wang <[email protected]> Co-authored-by: rachguo <[email protected]>

Bumps [karma](https://github.com/karma-runner/karma) from 6.3.2 to 6.3.14. - [Release notes](https://github.com/karma-runner/karma/releases) - [Changelog](https://github.com/karma-runner/karma/blob/master/CHANGELOG.md) - [Commits](karma-runner/karma@v6.3.2...v6.3.14) --- updated-dependencies: - dependency-name: karma dependency-type: direct:development ... Signed-off-by: dependabot[bot] <[email protected]>

…10534)

Update QDQ propagation transformer to insert new QDQ nodes instead of moving the existing one. This creates a more consistent `DQ -> op -> Q` pattern for other components to recognize. Upgrade this transformer to a basic level optimization as it yields a valid ONNX graph.

This is a preparation change for a bigger goal. On ARM64 CPUs with Big.Little, different cores are always the same architecture but different micro-architecture. Specifically, it is often that the little core has narrow memory buses that makes 128b load very slow. While if we always use 64b load in our kernels, the code will run slower on big cores. As a result, we need to run different code on different cores to achieve better performance. This change constructs a manifold that pivot based on the core micro-architecture of the current core, so that we can develop and call different kernels accordingly. Co-authored-by: Chen Fu <[email protected]>

…icrosoft#10260) * update java API for STVM EP. Issue is from PR#10019 * use_stvm -> use_tvm * rename stvm worktree * STVMAllocator -> TVMAllocator * StvmExecutionProviderInfo -> TvmExecutionProviderInfo * stvm -> tvm for cpu_targets. resolve onnxruntime::tvm and origin tvm namespaces conflict * STVMRunner -> TVMRunner * StvmExecutionProvider -> TvmExecutionProvider * tvm::env_vars * StvmProviderFactory -> TvmProviderFactory * rename factory funcs * StvmCPUDataTransfer -> TvmCPUDataTransfer * small clean * STVMFuncState -> TVMFuncState * USE_TVM -> NUPHAR_USE_TVM * USE_STVM -> USE_TVM * python API: providers.stvm -> providers.tvm. clean TVM_EP.md * clean build scripts #1 * clean build scripts, java frontend and others #2 * once more clean #3 * fix build of nuphar tvm test * final transfer stvm namespace to onnxruntime::tvm * rename stvm->tvm * NUPHAR_USE_TVM -> USE_NUPHAR_TVM * small fixes for correct CI tests * clean after rebase. Last renaming stvm to tvm, separate TVM and Nuphar in cmake and build files * update CUDA support for TVM EP * roll back CudaNN home check * ERROR for not positive input shape dimension instead of WARNING * update documentation for CUDA * small corrections after review * update GPU description * update GPU description * misprints were fixed * cleaned up error msgs Co-authored-by: Valery Chernov <[email protected]> Co-authored-by: KJlaccHoeUM9l <[email protected]> Co-authored-by: Thierry Moreau <[email protected]>

### Description Release OrtEnv before main function returns. Before this change, OrtEnv is deleted when C/C++ runtime destructs all global variables in ONNX Runtime's core framework. The callstack is like this: ``` * frame #0: 0x00007fffee39f5a6 libonnxruntime.so.1.16.0`onnxruntime::Environment::~Environment(this=0x00007fffee39fbf2) at environment.h:20:7 frame #1: 0x00007fffee39f614 libonnxruntime.so.1.16.0`std::default_delete<onnxruntime::Environment>::operator()(this=0x00007ffff4c30e50, __ptr=0x0000000005404b00) const at unique_ptr.h:85:2 frame #2: 0x00007fffee39edca libonnxruntime.so.1.16.0`std::unique_ptr<onnxruntime::Environment, std::default_delete<onnxruntime::Environment>>::~unique_ptr(this=0x5404b00) at unique_ptr.h:361:17 frame #3: 0x00007fffee39e2ab libonnxruntime.so.1.16.0`OrtEnv::~OrtEnv(this=0x00007ffff4c30e50) at ort_env.cc:43:1 frame #4: 0x00007fffee39fa96 libonnxruntime.so.1.16.0`std::default_delete<OrtEnv>::operator()(this=0x00007fffefff8f78, __ptr=0x00007ffff4c30e50) const at unique_ptr.h:85:2 frame #5: 0x00007fffee39f394 libonnxruntime.so.1.16.0`std::unique_ptr<OrtEnv, std::default_delete<OrtEnv>>::~unique_ptr(this=0x7ffff4c30e50) at unique_ptr.h:361:17 frame #6: 0x00007ffff78574b5 libc.so.6`__run_exit_handlers + 261 frame #7: 0x00007ffff7857630 libc.so.6`exit + 32 frame #8: 0x00007ffff783feb7 libc.so.6`__libc_start_call_main + 135 frame #9: 0x00007ffff783ff60 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128 frame #10: 0x0000000000abbdee node`_start + 46 ``` After this change, OrtEnv will be deleted before the main function returns and nodejs is still alive.

### Description Security fuzz test with address sanitizer found several bugs

### Description Add [Lean Attention](https://arxiv.org/abs/2405.10480) and the integration with MultiHeadAttention operator for LLM in GPU. LeanAttention speeds up self-attention for the token-generation phase (decode-phase) of decoder-only transformer models, especially on long context lengths. - [x] Initial implementation of Lean Attention (by Srikant Bharadwaj) - [x] Integration with MultiHeadAttention operator - [x] Add parity tests - [x] Add benchmark #### Implementation Details (1) Lean Attention is enabled in build for Linux, and disabled for Windows (2) Lean Attention is disabled by default. Need enable it through cuda provider option sdpa_kernel, or use environment variable `ORT_ENABLE_LEAN_ATTENTION=1` (3) It only works for token-generation (sequence_length==1, past_sequence_length > 0). (4) Like flash attention, it only works in Ampere or newer GPU. We can revisit #1 and #2 after comparing with DecoderMaskedMultiHeadAttention and XQA kernels. #### Benchmark ``` cd onnxruntime/test/python/transformers /bin/bash benchmark_mha.sh lean ``` Example outputs in H100: Note that past and present does not share buffer for MHA for now, so we can see low tflops. The relative ratio will change after buffer sharing is enabled. But we expect that the order (kernel A is faster than B) will remain the same after buffer sharing is enabled. Note that common settings `sequence_length=1; causal=True;attn_bias=None;cuda_graph=False` are not shown in the below table. batch_size | past_sequence_length | num_heads | head_size | average_latency | tflops | kernel -- | -- | -- | -- | -- | -- | -- 1 | 512 | 16 | 64 | 0.000059 | 0.0178 | ort:flash 1 | 512 | 16 | 64 | 0.000068 | 0.0155 | ort:efficient 1 | 512 | 16 | 64 | 0.000065 | 0.0161 | ort:math 1 | 512 | 16 | 64 | 0.000060 | 0.0176 | ort:lean 1 | 512 | 32 | 128 | 0.000062 | 0.0674 | ort:flash 1 | 512 | 32 | 128 | 0.000064 | 0.0661 | ort:efficient 1 | 512 | 32 | 128 | 0.000067 | 0.0625 | ort:math 1 | 512 | 32 | 128 | 0.000062 | 0.0678 | ort:lean 1 | 1024 | 16 | 64 | 0.000061 | 0.0345 | ort:flash 1 | 1024 | 16 | 64 | 0.000086 | 0.0244 | ort:efficient 1 | 1024 | 16 | 64 | 0.000065 | 0.0322 | ort:math 1 | 1024 | 16 | 64 | 0.000063 | 0.0332 | ort:lean 1 | 1024 | 32 | 128 | 0.000075 | 0.1125 | ort:flash 1 | 1024 | 32 | 128 | 0.000088 | 0.0951 | ort:efficient 1 | 1024 | 32 | 128 | 0.000079 | 0.1068 | ort:math 1 | 1024 | 32 | 128 | 0.000072 | 0.1171 | ort:lean 1 | 2048 | 16 | 64 | 0.000069 | 0.0606 | ort:flash 1 | 2048 | 16 | 64 | 0.000125 | 0.0336 | ort:efficient 1 | 2048 | 16 | 64 | 0.000064 | 0.0655 | ort:lean 1 | 2048 | 32 | 128 | 0.000098 | 0.1720 | ort:flash 1 | 2048 | 32 | 128 | 0.000132 | 0.1270 | ort:efficient 1 | 2048 | 32 | 128 | 0.000092 | 0.1828 | ort:lean 1 | 4096 | 16 | 64 | 0.000076 | 0.1097 | ort:flash 1 | 4096 | 16 | 64 | 0.000207 | 0.0406 | ort:efficient 1 | 4096 | 16 | 64 | 0.000069 | 0.1209 | ort:lean 1 | 4096 | 32 | 128 | 0.000140 | 0.2394 | ort:flash 1 | 4096 | 32 | 128 | 0.000213 | 0.1575 | ort:efficient 1 | 4096 | 32 | 128 | 0.000139 | 0.2419 | ort:lean 1 | 8192 | 16 | 64 | 0.000104 | 0.1609 | ort:flash 1 | 8192 | 16 | 64 | 0.000392 | 0.0428 | ort:efficient 1 | 8192 | 16 | 64 | 0.000093 | 0.1809 | ort:lean 1 | 8192 | 32 | 128 | 0.000212 | 0.3160 | ort:flash 1 | 8192 | 32 | 128 | 0.000360 | 0.1866 | ort:efficient 1 | 8192 | 32 | 128 | 0.000212 | 0.3162 | ort:lean 1 | 16384 | 16 | 64 | 0.000139 | 0.2410 | ort:flash 1 | 16384 | 16 | 64 | 0.000731 | 0.0459 | ort:efficient 1 | 16384 | 16 | 64 | 0.000136 | 0.2465 | ort:lean 1 | 16384 | 32 | 128 | 0.000361 | 0.3722 | ort:flash 1 | 16384 | 32 | 128 | 0.000667 | 0.2014 | ort:efficient 1 | 16384 | 32 | 128 | 0.000357 | 0.3765 | ort:lean 1 | 32768 | 16 | 64 | 0.000210 | 0.3194 | ort:flash 1 | 32768 | 16 | 64 | 0.001428 | 0.0470 | ort:efficient 1 | 32768 | 16 | 64 | 0.000209 | 0.3211 | ort:lean 1 | 32768 | 32 | 128 | 0.000659 | 0.4074 | ort:flash 1 | 32768 | 32 | 128 | 0.001270 | 0.2114 | ort:efficient 1 | 32768 | 32 | 128 | 0.000651 | 0.4123 | ort:lean 1 | 65536 | 16 | 64 | 0.000355 | 0.3785 | ort:flash 1 | 65536 | 16 | 64 | 0.002736 | 0.0491 | ort:efficient 1 | 65536 | 16 | 64 | 0.000349 | 0.3845 | ort:lean 1 | 65536 | 32 | 128 | 0.001251 | 0.4290 | ort:flash 1 | 65536 | 32 | 128 | 0.002480 | 0.2165 | ort:efficient 1 | 65536 | 32 | 128 | 0.001239 | 0.4333 | ort:lean 4 | 512 | 16 | 64 | 0.000063 | 0.0665 | ort:flash 4 | 512 | 16 | 64 | 0.000069 | 0.0607 | ort:efficient 4 | 512 | 16 | 64 | 0.000066 | 0.0634 | ort:math 4 | 512 | 16 | 64 | 0.000062 | 0.0674 | ort:lean 4 | 512 | 32 | 128 | 0.000100 | 0.1677 | ort:flash 4 | 512 | 32 | 128 | 0.000099 | 0.1703 | ort:efficient 4 | 512 | 32 | 128 | 0.000108 | 0.1557 | ort:math 4 | 512 | 32 | 128 | 0.000092 | 0.1818 | ort:lean 4 | 1024 | 16 | 64 | 0.000077 | 0.1094 | ort:flash 4 | 1024 | 16 | 64 | 0.000099 | 0.0850 | ort:efficient 4 | 1024 | 16 | 64 | 0.000081 | 0.1038 | ort:math 4 | 1024 | 16 | 64 | 0.000072 | 0.1161 | ort:lean 4 | 1024 | 32 | 128 | 0.000143 | 0.2343 | ort:flash 4 | 1024 | 32 | 128 | 0.000137 | 0.2447 | ort:efficient 4 | 1024 | 32 | 128 | 0.000150 | 0.2245 | ort:math 4 | 1024 | 32 | 128 | 0.000135 | 0.2496 | ort:lean 4 | 2048 | 16 | 64 | 0.000096 | 0.1757 | ort:flash 4 | 2048 | 16 | 64 | 0.000156 | 0.1078 | ort:efficient 4 | 2048 | 16 | 64 | 0.000089 | 0.1892 | ort:lean 4 | 2048 | 32 | 128 | 0.000223 | 0.3010 | ort:flash 4 | 2048 | 32 | 128 | 0.000217 | 0.3101 | ort:efficient 4 | 2048 | 32 | 128 | 0.000209 | 0.3209 | ort:lean 4 | 4096 | 16 | 64 | 0.000137 | 0.2448 | ort:flash 4 | 4096 | 16 | 64 | 0.000256 | 0.1312 | ort:efficient 4 | 4096 | 16 | 64 | 0.000133 | 0.2530 | ort:lean 4 | 4096 | 32 | 128 | 0.000389 | 0.3450 | ort:flash 4 | 4096 | 32 | 128 | 0.000376 | 0.3574 | ort:efficient 4 | 4096 | 32 | 128 | 0.000354 | 0.3794 | ort:lean 4 | 8192 | 16 | 64 | 0.000210 | 0.3198 | ort:flash 4 | 8192 | 16 | 64 | 0.000453 | 0.1480 | ort:efficient 4 | 8192 | 16 | 64 | 0.000206 | 0.3260 | ort:lean 4 | 8192 | 32 | 128 | 0.000725 | 0.3705 | ort:flash 4 | 8192 | 32 | 128 | 0.000693 | 0.3874 | ort:efficient 4 | 8192 | 32 | 128 | 0.000653 | 0.4114 | ort:lean 4 | 16384 | 16 | 64 | 0.000355 | 0.3782 | ort:flash 4 | 16384 | 16 | 64 | 0.000849 | 0.1581 | ort:efficient 4 | 16384 | 16 | 64 | 0.000346 | 0.3874 | ort:lean 4 | 16384 | 32 | 128 | 0.001395 | 0.3848 | ort:flash 4 | 16384 | 32 | 128 | 0.001337 | 0.4017 | ort:efficient 4 | 16384 | 32 | 128 | 0.001252 | 0.4288 | ort:lean 4 | 32768 | 16 | 64 | 0.000647 | 0.4146 | ort:flash 4 | 32768 | 16 | 64 | 0.001649 | 0.1628 | ort:efficient 4 | 32768 | 16 | 64 | 0.000639 | 0.4204 | ort:lean 4 | 32768 | 32 | 128 | 0.002721 | 0.3947 | ort:flash 4 | 32768 | 32 | 128 | 0.002601 | 0.4128 | ort:efficient 4 | 32768 | 32 | 128 | 0.002434 | 0.4411 | ort:lean 4 | 65536 | 16 | 64 | 0.001231 | 0.4361 | ort:flash 4 | 65536 | 16 | 64 | 0.003238 | 0.1658 | ort:efficient 4 | 65536 | 16 | 64 | 0.001217 | 0.4412 | ort:lean 4 | 65536 | 32 | 128 | 0.005357 | 0.4009 | ort:flash 4 | 65536 | 32 | 128 | 0.005118 | 0.4196 | ort:efficient 4 | 65536 | 32 | 128 | 0.004781 | 0.4492 | ort:lean 16 | 512 | 16 | 64 | 0.000098 | 0.1724 | ort:flash 16 | 512 | 16 | 64 | 0.000104 | 0.1616 | ort:efficient 16 | 512 | 16 | 64 | 0.000118 | 0.1420 | ort:math 16 | 512 | 16 | 64 | 0.000087 | 0.1926 | ort:lean 16 | 512 | 32 | 128 | 0.000220 | 0.3062 | ort:flash 16 | 512 | 32 | 128 | 0.000208 | 0.3237 | ort:efficient 16 | 512 | 32 | 128 | 0.000237 | 0.2838 | ort:math 16 | 512 | 32 | 128 | 0.000209 | 0.3216 | ort:lean 16 | 1024 | 16 | 64 | 0.000136 | 0.2465 | ort:flash 16 | 1024 | 16 | 64 | 0.000150 | 0.2235 | ort:efficient 16 | 1024 | 16 | 64 | 0.000148 | 0.2266 | ort:math 16 | 1024 | 16 | 64 | 0.000129 | 0.2611 | ort:lean 16 | 1024 | 32 | 128 | 0.000367 | 0.3663 | ort:flash 16 | 1024 | 32 | 128 | 0.000351 | 0.3829 | ort:efficient 16 | 1024 | 32 | 128 | 0.000400 | 0.3357 | ort:math 16 | 1024 | 32 | 128 | 0.000349 | 0.3853 | ort:lean 16 | 2048 | 16 | 64 | 0.000209 | 0.3206 | ort:flash 16 | 2048 | 16 | 64 | 0.000243 | 0.2762 | ort:efficient 16 | 2048 | 16 | 64 | 0.000201 | 0.3338 | ort:lean 16 | 2048 | 32 | 128 | 0.000671 | 0.4002 | ort:flash 16 | 2048 | 32 | 128 | 0.000645 | 0.4163 | ort:efficient 16 | 2048 | 32 | 128 | 0.000642 | 0.4185 | ort:lean 16 | 4096 | 16 | 64 | 0.000360 | 0.3732 | ort:flash 16 | 4096 | 16 | 64 | 0.000425 | 0.3162 | ort:efficient 16 | 4096 | 16 | 64 | 0.000341 | 0.3933 | ort:lean 16 | 4096 | 32 | 128 | 0.001292 | 0.4156 | ort:flash 16 | 4096 | 32 | 128 | 0.001251 | 0.4291 | ort:efficient 16 | 4096 | 32 | 128 | 0.001241 | 0.4327 | ort:lean 16 | 8192 | 16 | 64 | 0.000666 | 0.4030 | ort:flash 16 | 8192 | 16 | 64 | 0.000804 | 0.3339 | ort:efficient 16 | 8192 | 16 | 64 | 0.000627 | 0.4283 | ort:lean 16 | 8192 | 32 | 128 | 0.002541 | 0.4226 | ort:flash 16 | 8192 | 32 | 128 | 0.002454 | 0.4376 | ort:efficient 16 | 8192 | 32 | 128 | 0.002438 | 0.4405 | ort:lean 16 | 16384 | 16 | 64 | 0.001292 | 0.4156 | ort:flash 16 | 16384 | 16 | 64 | 0.001571 | 0.3417 | ort:efficient 16 | 16384 | 16 | 64 | 0.001217 | 0.4411 | ort:lean 16 | 16384 | 32 | 128 | 0.005042 | 0.4260 | ort:flash 16 | 16384 | 32 | 128 | 0.004859 | 0.4420 | ort:efficient 16 | 16384 | 32 | 128 | 0.004827 | 0.4449 | ort:lean 16 | 32768 | 16 | 64 | 0.002537 | 0.4233 | ort:flash 16 | 32768 | 16 | 64 | 0.003103 | 0.3461 | ort:efficient 16 | 32768 | 16 | 64 | 0.002385 | 0.4501 | ort:lean 16 | 32768 | 32 | 128 | 0.009961 | 0.4312 | ort:flash 16 | 32768 | 32 | 128 | 0.009605 | 0.4472 | ort:efficient 16 | 32768 | 32 | 128 | 0.009524 | 0.4510 | ort:lean 16 | 65536 | 16 | 64 | 0.005019 | 0.4279 | ort:flash 16 | 65536 | 16 | 64 | 0.006133 | 0.3502 | ort:efficient 16 | 65536 | 16 | 64 | 0.004703 | 0.4566 | ort:lean 16 | 65536 | 32 | 128 | 0.019746 | 0.4350 | ort:flash 16 | 65536 | 32 | 128 | 0.019027 | 0.4515 | ort:efficient 16 | 65536 | 32 | 128 | 0.018864 | 0.4554 | ort:lean ### Motivation and Context

guoyu-wang and others added 30 commits January 28, 2022 12:48

Remove coremltools submodule *security vulnerability* and copy the co…

5f0ba31

…reml model schema (microsoft#10424) * remove coremltools submodule * update cgmanifest * Copy proto files directly from coremltools

Remove cbegin and cend calls which do not exist in std::span or gsl::…

b02f4ec

…span (microsoft#10426)

[ROCm] BFloat16 support (microsoft#10416)

85cbe83

* reducesum bf16 support * bf16 for add/sub/mul/div * fix build * bf16 for Cast * bf16 for softmax Co-authored-by: root <[email protected]>

Allow for an optional subgraph input to have no type info. (microsoft…

baa1767

…#10379) Add a test for a missing optional input to Loop.

Enable transpose optimizer in minimal extended build (microsoft#10349)

c43c169

Enable transpose optimizer and infrastructure it depends on in a minimal extended build.

[NNAPI QDQ] Add QDQ Conv support (microsoft#10418)

68262cc

* Add qdq conv to NNAPI * fix build warning * addressed CR comments * fix a minor bug in my previous merge

Add test quantization of ArgMax for TensorRT (microsoft#10325)

ef7b4dc

Make sure quantize_statict would insert DQ -> Q before ArgMax.

Update rocm_ep and migraphx_ep to rocm4.5.2 and fix dockerfiles to bu…

062129a

…ild docker images correctly (microsoft#10445) * fix build errors for the migraphx and rocm dockerfile * add the numpy package in the migraphx and rocm dockerfile

support rocm/migraphx EP in perftest tool (microsoft#10449)

3c96760

Co-authored-by: Weixing Zhang <[email protected]>

Allow users to bind arbitrary memory using raw pointers (microsoft#10428

91b8ad5

) Add binding external allocation Add negative tests Add missing return status check

[NNAPI QDQ] Add QDQ Resize support (microsoft#10442)

7318361

* Add NNAPI support of QDQ Resize * minor update to UT * fix build break * fix android UT failure * address cr comments

add qdq support for QGemm (microsoft#10414)

1aa0789

* add qgemm in quantization tool * add qdq support for QGemm * fix build break * fix OperatorKernels.md

upgrade react-native packages to latest (microsoft#10454)

6076a26

Add logic to NNAPI EP to exclude pre-processing involving dynamic sha…

97b8f6f

…pes when partitioning (microsoft#10452) * wip * wip * wip * save * address pr comments * address pr comments Co-authored-by: rachguo <[email protected]>

Fuse Clip->Q to Q (microsoft#10434)

a405658

* Fuse Clip->Q to Q * Remove unused variable argmax_node * Remove braces around scalar initializer * Move GetClipConstantMinMax under ORT_MINIMAL_BUILD * Consider epsilon so we can fuse more cases

[ROCm] BFloat16 support (microsoft#10447)

63198a6

* bf16 support * bf16 support * UTs * fix build * fix UTs Co-authored-by: root <[email protected]>

Transformer model CUDA EP align with CPU on corner case (microsoft#9889)

bb09acf

* align with cpu on no input data * review comments and add tests Co-authored-by: Ubuntu <wy@linux-v100.aidmrjtolptuzevavgwhrapqcd.jx.internal.cloudapp.net>

cmake: disable 'attributes' error to fix the build with GCC < 9.x

6bbf016

This patch fixes the error "requested alignment X is larger than Y" in older GCC's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89357

Update orttraining-linux-ci-pipeline.yml (microsoft#10462)

4f13c8a

Support fusion for TNLR based model (microsoft#10432)

0d09dd5

* support tnlr based offensive V4 model * Update onnx_model_tnlr.py Co-authored-by: Ubuntu <wy@linux-v100.aidmrjtolptuzevavgwhrapqcd.jx.internal.cloudapp.net>

Contrib ops for TRT plugins: EfficientNMS and Pyramid ROI Align (micr…

d0ab881

…osoft#9486) * Contrib ops for TRT plugins: EfficientNMS and Pyramid ROI Align * Contrib ops for TRT plugins: Multilevel Crop and Resize

[NNAPI QDQ] Add QDQ AveragePool op support (microsoft#10464)

927f1f1

* wip * save * address pr comments * update * revert minor changes Co-authored-by: rachguo <[email protected]>

Make user capable of adding new field in OrtTensorRTProviderOptionsV2…

0f5d0a0

… as new provider option (microsoft#10450) * modify code for add additional field in OrtTensorRTProviderOptionsV2 * add include file * fix typo * fix bug * add comment * fix code * revert change

fix unit test of quant gemm (microsoft#10469)

c696da3

gradient and test (microsoft#10455)

7e5d68e

Co-authored-by: Aishwarya Bhandare <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

snnn and others added 23 commits February 9, 2022 09:31

Reorganize contrib op schemas (microsoft#10494)

7a2bf3c

Optimize cuComputePartGradGammaBeta kernel for MI100 (microsoft#10475)

c9fbd0b

* Optimize cuComputePartGradGammaBeta kernel for MI100 Co-authored-by: root <[email protected]> Co-authored-by: Jeff Daily <[email protected]>

Move QAttention/QEmbedLayerNormalization op defs to quantization_defs…

6f3ade5

….cc (microsoft#10507)

Add TRT ep perf benchmark (microsoft#10470)

4d6d4df

Add NHWC CONV contrib op (microsoft#10506)

3185680

Fix fomatting. (microsoft#10520)

a27aaba

Formatting related changes

Remove onnxruntime_util dependency on onnxruntime_framework (microsof…

f92e47e

…t#10512) There's a circular dependency between onnxruntime_util and onnxruntime_framework. Remove onnxruntime_util's dependency on onnxruntime_framework.

The transformer of memcpy is needed for ROCm EP and MIGraphX EP when …

2002a96

…fallbacking CPU happens (microsoft#10522) Co-authored-by: Weixing Zhang <[email protected]>

Fix C# pipeline build error (microsoft#10524)

318d31e

Remove unneeded code in UpsampleBilinear (microsoft#10544)

3f37609

Return a Status instead of throw an exception in GetAttrs (microsoft#…

270dec7

…10534)

Introduce load balancing dataset samplers (microsoft#10163)

7691e7e

Optimize elementwise and biasgelugrad kernels for AMD

c52ebae

Clean up for BiasGeluGradDxKernel

98014fe

Clean up for BinaryElementWiseImpl and BinaryElementWiseNoBroadcastImpl

70b6ad6

Clean up for BinaryElementWiseImpl and BinaryElementWiseNoBroadcastImpl

35882e5

hubertlu-tw requested a review from jeffdaily February 18, 2022 03:59

hubertlu-tw self-assigned this Feb 18, 2022

hubertlu-tw changed the title ~~Rocm/optim elementwise biasgelugrad~~ Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm Feb 18, 2022

xinyazhang pushed a commit that referenced this pull request Aug 22, 2024

Security fuzz address sanitizer fix Bug #2 and #3 (microsoft#21528)

48fb8a7

### Description Security fuzz test with address sanitizer found several bugs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm #2

Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm #2

hubertlu-tw commented Feb 18, 2022

Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm #2

Are you sure you want to change the base?

Optimize BinaryElementWise and BiasGeluGrad kernels for ROCm #2

Conversation

hubertlu-tw commented Feb 18, 2022