Allow configuration template to disable some SIMD. #3

Open

wants to merge 1,195 commits into main from js/allow-disable-SIMD

Conversation

jslap-ubi
Owner

Description

Motivation and Context

@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from 2f67b8f to 46f8996 Compare August 1, 2024 19:49
jslap-ubi pushed a commit that referenced this pull request Aug 1, 2024
### Description
Security fuzz test with address sanitizer found several bugs
@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from 46f8996 to d0aada7 Compare September 23, 2024 16:58
mindest and others added 27 commits October 12, 2024 13:43
### Description
DecoderMaskedMultiHeadAttention CPU kernel.
### Description
Add a new pipeline to publish ROCM package to ADO



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

### Test Link
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1615
### Description
1. Add Gemm, MatMul, Softmax, AveragePool and  Resize F16 kernels

This PR has included all changes in microsoft#22378


[AB#51066](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51066)

[AB#51026](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/51026)

2. Matrix B must be const, and the dim_size of matrices A and B should NOT be
bigger than 2 in XNNPack, so I added 2 tests in matmul_test.cc to make sure
it's really tested (that is, compute() must be called).
### Motivation and Context
- Work around Xcode 16 iOS test build issue: `error: Multiple commands produce '.../PlugIns'`.
- Fix link error in iOS static framework test.
- Update build.py to check for the right kind of build before running iOS tests on the simulator.
- Update Xcode 16 build images to 'macos-15' because that's the only image that will have Xcode 16 soon. See actions/runner-images#10703.
Move the suggest-fixes step to a separate CI workflow so that it is triggered
only on PRs and does not fail the main branch.
### Description
Support OV2024.4
Refactor tensor initialization check for external weights
Support loading OV Config
OVEP: Tensor Caching fix, Fix accuracy issues
Refactor device memory implementation to make it more generic

### Motivation and Context
The changes are required to fix accuracy issues, support loading of OV
config, support OV2024.4

---------

Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: saurabhkale17 <[email protected]>
Co-authored-by: Javier E. Martinez <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: ankitm3k <[email protected]>
Co-authored-by: Preetha Veeramalai <[email protected]>
Co-authored-by: n1harika <[email protected]>
Co-authored-by: jatinwadhwa921 <[email protected]>
### Description
Request and create the DML EP and its data transfer.
Use it to copy on device.

The PR includes changes to fix issues in DML provider.

### Motivation and Context
This enables LoRA users to run with DML, which is important for GenAI.

Co-authored-by: @PatriceVignola

---------

Co-authored-by: Patrice Vignola <[email protected]>
### Description
Add [Lean Attention](https://arxiv.org/abs/2405.10480) and its
integration with the MultiHeadAttention operator for LLMs on GPU.

LeanAttention speeds up self-attention for the token-generation phase
(decode-phase) of decoder-only transformer models, especially on long
context lengths.

- [x] Initial implementation of Lean Attention (by Srikant Bharadwaj)
- [x] Integration with MultiHeadAttention operator
- [x] Add parity tests
- [x] Add benchmark

#### Implementation Details

(1) Lean Attention is enabled in the build for Linux and disabled for
Windows.
(2) Lean Attention is disabled by default. Enable it through the CUDA
provider option sdpa_kernel, or use the environment variable
`ORT_ENABLE_LEAN_ATTENTION=1`.
(3) It only works for token generation (sequence_length == 1,
past_sequence_length > 0).
(4) Like flash attention, it only works on Ampere or newer GPUs.

We can revisit #1 and #2 after comparing with
DecoderMaskedMultiHeadAttention and XQA kernels.
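
For reference, a minimal sketch of opting in via the environment variable (the model path is a placeholder, not from this PR):

```
import os

import onnxruntime as ort

# Lean Attention is off by default; opt in before creating the session.
# (It can alternatively be selected through the CUDA provider option
# sdpa_kernel, per note (2) above.)
os.environ["ORT_ENABLE_LEAN_ATTENTION"] = "1"

# Hypothetical decoder-only model that uses the MultiHeadAttention operator.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
```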

#### Benchmark

```
cd onnxruntime/test/python/transformers 
/bin/bash benchmark_mha.sh lean
```

Example outputs in H100:

Note that past and present do not share a buffer for MHA for now, so the
TFLOPS look low. The relative ratio will change after buffer sharing is
enabled, but we expect the ordering (kernel A faster than kernel B) to
remain the same after buffer sharing is enabled.

Note that the common settings `sequence_length=1;
causal=True;attn_bias=None;cuda_graph=False` are not shown in the table
below.

batch_size | past_sequence_length | num_heads | head_size | average_latency | tflops | kernel
-- | -- | -- | -- | -- | -- | --
1 | 512 | 16 | 64 | 0.000059 | 0.0178 | ort:flash
1 | 512 | 16 | 64 | 0.000068 | 0.0155 | ort:efficient
1 | 512 | 16 | 64 | 0.000065 | 0.0161 | ort:math
1 | 512 | 16 | 64 | 0.000060 | 0.0176 | ort:lean
1 | 512 | 32 | 128 | 0.000062 | 0.0674 | ort:flash
1 | 512 | 32 | 128 | 0.000064 | 0.0661 | ort:efficient
1 | 512 | 32 | 128 | 0.000067 | 0.0625 | ort:math
1 | 512 | 32 | 128 | 0.000062 | 0.0678 | ort:lean
1 | 1024 | 16 | 64 | 0.000061 | 0.0345 | ort:flash
1 | 1024 | 16 | 64 | 0.000086 | 0.0244 | ort:efficient
1 | 1024 | 16 | 64 | 0.000065 | 0.0322 | ort:math
1 | 1024 | 16 | 64 | 0.000063 | 0.0332 | ort:lean
1 | 1024 | 32 | 128 | 0.000075 | 0.1125 | ort:flash
1 | 1024 | 32 | 128 | 0.000088 | 0.0951 | ort:efficient
1 | 1024 | 32 | 128 | 0.000079 | 0.1068 | ort:math
1 | 1024 | 32 | 128 | 0.000072 | 0.1171 | ort:lean
1 | 2048 | 16 | 64 | 0.000069 | 0.0606 | ort:flash
1 | 2048 | 16 | 64 | 0.000125 | 0.0336 | ort:efficient
1 | 2048 | 16 | 64 | 0.000064 | 0.0655 | ort:lean
1 | 2048 | 32 | 128 | 0.000098 | 0.1720 | ort:flash
1 | 2048 | 32 | 128 | 0.000132 | 0.1270 | ort:efficient
1 | 2048 | 32 | 128 | 0.000092 | 0.1828 | ort:lean
1 | 4096 | 16 | 64 | 0.000076 | 0.1097 | ort:flash
1 | 4096 | 16 | 64 | 0.000207 | 0.0406 | ort:efficient
1 | 4096 | 16 | 64 | 0.000069 | 0.1209 | ort:lean
1 | 4096 | 32 | 128 | 0.000140 | 0.2394 | ort:flash
1 | 4096 | 32 | 128 | 0.000213 | 0.1575 | ort:efficient
1 | 4096 | 32 | 128 | 0.000139 | 0.2419 | ort:lean
1 | 8192 | 16 | 64 | 0.000104 | 0.1609 | ort:flash
1 | 8192 | 16 | 64 | 0.000392 | 0.0428 | ort:efficient
1 | 8192 | 16 | 64 | 0.000093 | 0.1809 | ort:lean
1 | 8192 | 32 | 128 | 0.000212 | 0.3160 | ort:flash
1 | 8192 | 32 | 128 | 0.000360 | 0.1866 | ort:efficient
1 | 8192 | 32 | 128 | 0.000212 | 0.3162 | ort:lean
1 | 16384 | 16 | 64 | 0.000139 | 0.2410 | ort:flash
1 | 16384 | 16 | 64 | 0.000731 | 0.0459 | ort:efficient
1 | 16384 | 16 | 64 | 0.000136 | 0.2465 | ort:lean
1 | 16384 | 32 | 128 | 0.000361 | 0.3722 | ort:flash
1 | 16384 | 32 | 128 | 0.000667 | 0.2014 | ort:efficient
1 | 16384 | 32 | 128 | 0.000357 | 0.3765 | ort:lean
1 | 32768 | 16 | 64 | 0.000210 | 0.3194 | ort:flash
1 | 32768 | 16 | 64 | 0.001428 | 0.0470 | ort:efficient
1 | 32768 | 16 | 64 | 0.000209 | 0.3211 | ort:lean
1 | 32768 | 32 | 128 | 0.000659 | 0.4074 | ort:flash
1 | 32768 | 32 | 128 | 0.001270 | 0.2114 | ort:efficient
1 | 32768 | 32 | 128 | 0.000651 | 0.4123 | ort:lean
1 | 65536 | 16 | 64 | 0.000355 | 0.3785 | ort:flash
1 | 65536 | 16 | 64 | 0.002736 | 0.0491 | ort:efficient
1 | 65536 | 16 | 64 | 0.000349 | 0.3845 | ort:lean
1 | 65536 | 32 | 128 | 0.001251 | 0.4290 | ort:flash
1 | 65536 | 32 | 128 | 0.002480 | 0.2165 | ort:efficient
1 | 65536 | 32 | 128 | 0.001239 | 0.4333 | ort:lean
4 | 512 | 16 | 64 | 0.000063 | 0.0665 | ort:flash
4 | 512 | 16 | 64 | 0.000069 | 0.0607 | ort:efficient
4 | 512 | 16 | 64 | 0.000066 | 0.0634 | ort:math
4 | 512 | 16 | 64 | 0.000062 | 0.0674 | ort:lean
4 | 512 | 32 | 128 | 0.000100 | 0.1677 | ort:flash
4 | 512 | 32 | 128 | 0.000099 | 0.1703 | ort:efficient
4 | 512 | 32 | 128 | 0.000108 | 0.1557 | ort:math
4 | 512 | 32 | 128 | 0.000092 | 0.1818 | ort:lean
4 | 1024 | 16 | 64 | 0.000077 | 0.1094 | ort:flash
4 | 1024 | 16 | 64 | 0.000099 | 0.0850 | ort:efficient
4 | 1024 | 16 | 64 | 0.000081 | 0.1038 | ort:math
4 | 1024 | 16 | 64 | 0.000072 | 0.1161 | ort:lean
4 | 1024 | 32 | 128 | 0.000143 | 0.2343 | ort:flash
4 | 1024 | 32 | 128 | 0.000137 | 0.2447 | ort:efficient
4 | 1024 | 32 | 128 | 0.000150 | 0.2245 | ort:math
4 | 1024 | 32 | 128 | 0.000135 | 0.2496 | ort:lean
4 | 2048 | 16 | 64 | 0.000096 | 0.1757 | ort:flash
4 | 2048 | 16 | 64 | 0.000156 | 0.1078 | ort:efficient
4 | 2048 | 16 | 64 | 0.000089 | 0.1892 | ort:lean
4 | 2048 | 32 | 128 | 0.000223 | 0.3010 | ort:flash
4 | 2048 | 32 | 128 | 0.000217 | 0.3101 | ort:efficient
4 | 2048 | 32 | 128 | 0.000209 | 0.3209 | ort:lean
4 | 4096 | 16 | 64 | 0.000137 | 0.2448 | ort:flash
4 | 4096 | 16 | 64 | 0.000256 | 0.1312 | ort:efficient
4 | 4096 | 16 | 64 | 0.000133 | 0.2530 | ort:lean
4 | 4096 | 32 | 128 | 0.000389 | 0.3450 | ort:flash
4 | 4096 | 32 | 128 | 0.000376 | 0.3574 | ort:efficient
4 | 4096 | 32 | 128 | 0.000354 | 0.3794 | ort:lean
4 | 8192 | 16 | 64 | 0.000210 | 0.3198 | ort:flash
4 | 8192 | 16 | 64 | 0.000453 | 0.1480 | ort:efficient
4 | 8192 | 16 | 64 | 0.000206 | 0.3260 | ort:lean
4 | 8192 | 32 | 128 | 0.000725 | 0.3705 | ort:flash
4 | 8192 | 32 | 128 | 0.000693 | 0.3874 | ort:efficient
4 | 8192 | 32 | 128 | 0.000653 | 0.4114 | ort:lean
4 | 16384 | 16 | 64 | 0.000355 | 0.3782 | ort:flash
4 | 16384 | 16 | 64 | 0.000849 | 0.1581 | ort:efficient
4 | 16384 | 16 | 64 | 0.000346 | 0.3874 | ort:lean
4 | 16384 | 32 | 128 | 0.001395 | 0.3848 | ort:flash
4 | 16384 | 32 | 128 | 0.001337 | 0.4017 | ort:efficient
4 | 16384 | 32 | 128 | 0.001252 | 0.4288 | ort:lean
4 | 32768 | 16 | 64 | 0.000647 | 0.4146 | ort:flash
4 | 32768 | 16 | 64 | 0.001649 | 0.1628 | ort:efficient
4 | 32768 | 16 | 64 | 0.000639 | 0.4204 | ort:lean
4 | 32768 | 32 | 128 | 0.002721 | 0.3947 | ort:flash
4 | 32768 | 32 | 128 | 0.002601 | 0.4128 | ort:efficient
4 | 32768 | 32 | 128 | 0.002434 | 0.4411 | ort:lean
4 | 65536 | 16 | 64 | 0.001231 | 0.4361 | ort:flash
4 | 65536 | 16 | 64 | 0.003238 | 0.1658 | ort:efficient
4 | 65536 | 16 | 64 | 0.001217 | 0.4412 | ort:lean
4 | 65536 | 32 | 128 | 0.005357 | 0.4009 | ort:flash
4 | 65536 | 32 | 128 | 0.005118 | 0.4196 | ort:efficient
4 | 65536 | 32 | 128 | 0.004781 | 0.4492 | ort:lean
16 | 512 | 16 | 64 | 0.000098 | 0.1724 | ort:flash
16 | 512 | 16 | 64 | 0.000104 | 0.1616 | ort:efficient
16 | 512 | 16 | 64 | 0.000118 | 0.1420 | ort:math
16 | 512 | 16 | 64 | 0.000087 | 0.1926 | ort:lean
16 | 512 | 32 | 128 | 0.000220 | 0.3062 | ort:flash
16 | 512 | 32 | 128 | 0.000208 | 0.3237 | ort:efficient
16 | 512 | 32 | 128 | 0.000237 | 0.2838 | ort:math
16 | 512 | 32 | 128 | 0.000209 | 0.3216 | ort:lean
16 | 1024 | 16 | 64 | 0.000136 | 0.2465 | ort:flash
16 | 1024 | 16 | 64 | 0.000150 | 0.2235 | ort:efficient
16 | 1024 | 16 | 64 | 0.000148 | 0.2266 | ort:math
16 | 1024 | 16 | 64 | 0.000129 | 0.2611 | ort:lean
16 | 1024 | 32 | 128 | 0.000367 | 0.3663 | ort:flash
16 | 1024 | 32 | 128 | 0.000351 | 0.3829 | ort:efficient
16 | 1024 | 32 | 128 | 0.000400 | 0.3357 | ort:math
16 | 1024 | 32 | 128 | 0.000349 | 0.3853 | ort:lean
16 | 2048 | 16 | 64 | 0.000209 | 0.3206 | ort:flash
16 | 2048 | 16 | 64 | 0.000243 | 0.2762 | ort:efficient
16 | 2048 | 16 | 64 | 0.000201 | 0.3338 | ort:lean
16 | 2048 | 32 | 128 | 0.000671 | 0.4002 | ort:flash
16 | 2048 | 32 | 128 | 0.000645 | 0.4163 | ort:efficient
16 | 2048 | 32 | 128 | 0.000642 | 0.4185 | ort:lean
16 | 4096 | 16 | 64 | 0.000360 | 0.3732 | ort:flash
16 | 4096 | 16 | 64 | 0.000425 | 0.3162 | ort:efficient
16 | 4096 | 16 | 64 | 0.000341 | 0.3933 | ort:lean
16 | 4096 | 32 | 128 | 0.001292 | 0.4156 | ort:flash
16 | 4096 | 32 | 128 | 0.001251 | 0.4291 | ort:efficient
16 | 4096 | 32 | 128 | 0.001241 | 0.4327 | ort:lean
16 | 8192 | 16 | 64 | 0.000666 | 0.4030 | ort:flash
16 | 8192 | 16 | 64 | 0.000804 | 0.3339 | ort:efficient
16 | 8192 | 16 | 64 | 0.000627 | 0.4283 | ort:lean
16 | 8192 | 32 | 128 | 0.002541 | 0.4226 | ort:flash
16 | 8192 | 32 | 128 | 0.002454 | 0.4376 | ort:efficient
16 | 8192 | 32 | 128 | 0.002438 | 0.4405 | ort:lean
16 | 16384 | 16 | 64 | 0.001292 | 0.4156 | ort:flash
16 | 16384 | 16 | 64 | 0.001571 | 0.3417 | ort:efficient
16 | 16384 | 16 | 64 | 0.001217 | 0.4411 | ort:lean
16 | 16384 | 32 | 128 | 0.005042 | 0.4260 | ort:flash
16 | 16384 | 32 | 128 | 0.004859 | 0.4420 | ort:efficient
16 | 16384 | 32 | 128 | 0.004827 | 0.4449 | ort:lean
16 | 32768 | 16 | 64 | 0.002537 | 0.4233 | ort:flash
16 | 32768 | 16 | 64 | 0.003103 | 0.3461 | ort:efficient
16 | 32768 | 16 | 64 | 0.002385 | 0.4501 | ort:lean
16 | 32768 | 32 | 128 | 0.009961 | 0.4312 | ort:flash
16 | 32768 | 32 | 128 | 0.009605 | 0.4472 | ort:efficient
16 | 32768 | 32 | 128 | 0.009524 | 0.4510 | ort:lean
16 | 65536 | 16 | 64 | 0.005019 | 0.4279 | ort:flash
16 | 65536 | 16 | 64 | 0.006133 | 0.3502 | ort:efficient
16 | 65536 | 16 | 64 | 0.004703 | 0.4566 | ort:lean
16 | 65536 | 32 | 128 | 0.019746 | 0.4350 | ort:flash
16 | 65536 | 32 | 128 | 0.019027 | 0.4515 | ort:efficient
16 | 65536 | 32 | 128 | 0.018864 | 0.4554 | ort:lean

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
With this optimization, 96 MultiHeadAttention|Transpose ops in phi3
disappear. Phi3 goes from 107 tokens to 113 tokens on my dGPUs.

The optimization mainly skips the Transpose op when one of the transposed
dims is 1; a Reshape is enough.
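
As a quick illustration of why the skip is safe in that case (a NumPy sketch, not code from this PR; shapes are made up): swapping two adjacent dims where one of them is 1 does not change the memory order, so a Reshape to the transposed shape yields the same tensor.

```
import numpy as np

# Hypothetical attention-style layout swap (B, N, S, H) -> (B, S, N, H) with N == 1.
b, n, s, h = 2, 1, 5, 4
x = np.random.rand(b, n, s, h).astype(np.float32)

transposed = x.transpose(0, 2, 1, 3)  # real Transpose
reshaped = x.reshape(b, s, n, h)      # plain Reshape to the transposed shape

assert np.array_equal(transposed, reshaped)  # identical data, no data movement needed
```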
Bumps [cookie](https://github.com/jshttp/cookie) and
[socket.io](https://github.com/socketio/socket.io). These dependencies
needed to be updated together.
Updates `cookie` from 0.4.2 to 0.7.2
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jshttp/cookie/releases">cookie's
releases</a>.</em></p>
<blockquote>
<h2>v0.7.2</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)
bc38ffd</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2">https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2</a></p>
<h2>0.7.1</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)
<ul>
<li>Although not permitted in the spec, some users expect this to work
and user agents ignore the leading dot according to spec</li>
</ul>
</li>
<li>Add fast path for <code>serialize</code> without options, use
<code>obj.hasOwnProperty</code> when parsing (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1">https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1</a></p>
<h2>0.7.0</h2>
<ul>
<li>perf: parse cookies ~10% faster (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a> by
<a href="https://github.com/kurtextrem"><code>@​kurtextrem</code></a>
and <a
href="https://redirect.github.com/jshttp/cookie/issues/170">#170</a>)</li>
<li>fix: narrow the validation of cookies to match RFC6265 (<a
href="https://redirect.github.com/jshttp/cookie/issues/167">#167</a> by
<a href="https://github.com/bewinsnw"><code>@​bewinsnw</code></a>)</li>
<li>fix: add <code>main</code> to <code>package.json</code> for rspack
(<a href="https://redirect.github.com/jshttp/cookie/issues/166">#166</a>
by <a
href="https://github.com/proudparrot2"><code>@​proudparrot2</code></a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0">https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0</a></p>
<h2>0.6.0</h2>
<ul>
<li>Add <code>partitioned</code> option</li>
</ul>
<h2>0.5.0</h2>
<ul>
<li>Add <code>priority</code> option</li>
<li>Fix <code>expires</code> option to reject invalid dates</li>
<li>pref: improve default decode speed</li>
<li>pref: remove slow string split in parse</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/jshttp/cookie/commit/d19eaa1a2bb9ca43ac0951edd852ba4e88e410e0"><code>d19eaa1</code></a>
0.7.2</li>
<li><a
href="https://github.com/jshttp/cookie/commit/bc38ffd0eae716b199236dda061d0bdc74192dd3"><code>bc38ffd</code></a>
Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)</li>
<li><a
href="https://github.com/jshttp/cookie/commit/cf4658f492c5bd96aeaf5693c3500f8495031014"><code>cf4658f</code></a>
0.7.1</li>
<li><a
href="https://github.com/jshttp/cookie/commit/6a8b8f5a49af7897b98ebfb29a1c4955afa3d33e"><code>6a8b8f5</code></a>
Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)</li>
<li><a
href="https://github.com/jshttp/cookie/commit/58015c0b93de0b63db245cfdc5a108e511a81ad0"><code>58015c0</code></a>
Remove more code and perf wins (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
<li><a
href="https://github.com/jshttp/cookie/commit/ab057d6c06b94a7b1e3358e69a685ae49c97b627"><code>ab057d6</code></a>
0.7.0</li>
<li><a
href="https://github.com/jshttp/cookie/commit/5f02ca87688481dbcf155e49ca8b61732f30e542"><code>5f02ca8</code></a>
Migrate history to GitHub releases</li>
<li><a
href="https://github.com/jshttp/cookie/commit/a5d591ce8447dd63821779724f96ad3c774c8579"><code>a5d591c</code></a>
Migrate history to GitHub releases</li>
<li><a
href="https://github.com/jshttp/cookie/commit/51968f94b5e820adeceef505539fa193ffe2d105"><code>51968f9</code></a>
Skip isNaN</li>
<li><a
href="https://github.com/jshttp/cookie/commit/9e7ca51ade4b325307eedd6b4dec190983e9e2cc"><code>9e7ca51</code></a>
perf(parse): cache length, return early (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/jshttp/cookie/compare/v0.4.2...v0.7.2">compare
view</a></li>
</ul>
</details>
<details>
<summary>Maintainer changes</summary>
<p>This version was pushed to npm by <a
href="https://www.npmjs.com/~blakeembrey">blakeembrey</a>, a new
releaser for cookie since your current version.</p>
</details>
<br />

Updates `socket.io` from 4.7.5 to 4.8.0
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/socketio/socket.io/releases">socket.io's
releases</a>.</em></p>
<blockquote>
<h2>[email protected]</h2>
<h3>Features</h3>
<h4>Custom transport implementations</h4>
<p>The <code>transports</code> option now accepts an array of transport
implementations:</p>
<pre lang="js"><code>import { io } from &quot;socket.io-client&quot;;
import { XHR, WebSocket } from &quot;engine.io-client&quot;;
<p>const socket = io({
transports: [XHR, WebSocket]
});
</code></pre></p>
<p>Here is the list of provided implementations:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td>HTTP long-polling based on the built-in <code>fetch()</code>
method.</td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td>HTTP long-polling based on the <code>XMLHttpRequest</code> object
provided by the <code>xmlhttprequest-ssl</code> package.</td>
</tr>
<tr>
<td><code>XHR</code></td>
<td>HTTP long-polling based on the built-in <code>XMLHttpRequest</code>
object.</td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td>WebSocket transport based on the <code>WebSocket</code> object
provided by the <code>ws</code> package.</td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td>WebSocket transport based on the built-in <code>WebSocket</code>
object.</td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td>WebTransport transport based on the built-in
<code>WebTransport</code> object.</td>
</tr>
</tbody>
</table>
<p>Usage:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>browser</th>
<th>Node.js</th>
<th>Deno</th>
<th>Bun</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td>:white_check_mark:</td>
<td>:white_check_mark: (1)</td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td></td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
</tr>
<tr>
<td><code>XHR</code></td>
<td>:white_check_mark:</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td></td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td>:white_check_mark:</td>
<td>:white_check_mark: (2)</td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td>:white_check_mark:</td>
<td>:white_check_mark:</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>(1) since <a
href="https://nodejs.org/api/globals.html#fetch">v18.0.0</a>
(2) since <a
href="https://nodejs.org/api/globals.html#websocket">v21.0.0</a></p>
<p>Added in <a
href="https://github.com/socketio/engine.io-client/commit/f4d898ee9652939a4550a41ac0e8143056154c0a">f4d898e</a>
and <a
href="https://github.com/socketio/engine.io-client/commit/b11763beecfe4622867b4dec9d1db77460733ffb">b11763b</a>.</p>
<h4>Test each low-level transports</h4>
<p>When setting the <code>tryAllTransports</code> option to
<code>true</code>, if the first transport (usually, HTTP long-polling)
fails, then the other transports will be tested too:</p>
<pre lang="js"><code>import { io } from &quot;socket.io-client&quot;;
&lt;/tr&gt;&lt;/table&gt; 
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/socketio/socket.io/commit/d0fc72042068e7eaef448941add617f05e1ec236"><code>d0fc720</code></a>
chore(release): [email protected]</li>
<li><a
href="https://github.com/socketio/socket.io/commit/4a0555c671b8e848e115e81bb1472e99f348e207"><code>4a0555c</code></a>
chore(release): [email protected]</li>
<li><a
href="https://github.com/socketio/socket.io/commit/2b60df18a88432ced79042e63a62d40cd48c823b"><code>2b60df1</code></a>
chore(release): [email protected]</li>
<li><a
href="https://github.com/socketio/socket.io/commit/d4cb3758564b008f98e5d60d81b87c9faf7fc553"><code>d4cb375</code></a>
ci: ignore tests when publishing to npm</li>
<li><a
href="https://github.com/socketio/socket.io/commit/c251ae7ba77d43de73225770f1470eb2fa112c6d"><code>c251ae7</code></a>
chore(release): [email protected]</li>
<li><a
href="https://github.com/socketio/socket.io/commit/8a2f5a3da0addb386e7a0f4970e1a9696b82797e"><code>8a2f5a3</code></a>
fix(eio-client): move 'offline' event listener at the top</li>
<li><a
href="https://github.com/socketio/socket.io/commit/b04fa64365729244a9c50a6b54b12e9bcc9e55d0"><code>b04fa64</code></a>
fix(sio): allow to join a room in a middleware (uws)</li>
<li><a
href="https://github.com/socketio/socket.io/commit/7085f0e3e46cd1fd41d952450b8d01b04de83daf"><code>7085f0e</code></a>
refactor(sio-client): mangle private attributes</li>
<li><a
href="https://github.com/socketio/socket.io/commit/4f667082108235209df81d44f453826a3f5c08e7"><code>4f66708</code></a>
chore(sio-client): use babel loose mode when transpiling classes</li>
<li><a
href="https://github.com/socketio/socket.io/commit/1a95db21454b5469cc43bb602bac774a57a8bd98"><code>1a95db2</code></a>
chore(sio-client): add a script to compute the bundle size</li>
<li>Additional commits viewable in <a
href="https://github.com/socketio/socket.io/compare/[email protected]@4.8.0">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
This PR further optimizes MatMulNBits, especially for iGPUs. The phi3 demo
improves from ~8 to ~12 tokens/second on iGPUs.

Some TODOs:
1. Make the optimization more general; remove the blockSize = 32
limitation.
2. Tune the parameters, such as workgroupSize and components size (currently
only components = 1 is supported), to see how performance changes.
1. Add Python 3.13 to our Python packaging pipelines
2. Because numpy 2.0.0 doesn't support free-threaded Python, this PR also
upgrades numpy to the latest
3. Delete some unused files.
…soft#22223)

- Added a microbenchmark for the `LayerNormalization` MLFloat16 support
added in microsoft#22063.
- Updated the `LayerNormalization` MLFloat16 implementation to improve
the latency.

```
----------------------------------------------------------------------------------------------
Original MLFloat16 support                                   Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time      15599 us        15625 us           47
BM_LayerNormalization<MLFloat16, float>/1/real_time      14714 us        14824 us           39
BM_LayerNormalization<MLFloat16, float>/1/real_time      14634 us        14688 us           50


----------------------------------------------------------------------------------------------
Updated MLFloat16 support                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time       7276 us         7254 us           84
BM_LayerNormalization<MLFloat16, float>/1/real_time       6820 us         6720 us           93
BM_LayerNormalization<MLFloat16, float>/1/real_time       6840 us         6882 us           84
```
### Description
<!-- Describe your changes. -->



### Motivation and Context
NVIDIA AWQ only uses the QuantFormat.QDQ quant format.
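
For reference, QuantFormat.QDQ here refers to onnxruntime's quantization tooling; below is a minimal sketch of requesting the QDQ format with quantize_static (the model paths and calibration reader are placeholders, and this is not the AWQ flow itself):

```
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static


class DummyReader(CalibrationDataReader):
    """Hypothetical calibration reader; a real one yields dicts of model inputs."""

    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        return next(self._iter, None)


quantize_static(
    "model_fp32.onnx",             # placeholder input model
    "model_int8_qdq.onnx",         # placeholder output model
    DummyReader([]),
    quant_format=QuantFormat.QDQ,  # QDQ format, as referenced above
    weight_type=QuantType.QInt8,
)
```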
### Description
Change the hipify step to remove the -roc option to hipify-perl. This
will prefer hipblas over rocblas. rocblas can still be called directly
such as in TunableOp.

### Motivation and Context
HIP interfaces are preferred over ROC interfaces when porting from CUDA to
HIP. Calling ROC interfaces is meant for ROCm-specific enhancements or
extensions.
### Description
For now, CoreML only supports running mlmodels on CPU/ALL. However,
CPU_GPU would sometimes be a lot faster.

This PR adds support for an option to select different hardware to boost
performance.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Edward Chen <[email protected]>
### Description
Today, the stable diffusion stage failed due to an upgrade in timm, which
controlnet_aux depends on. The latest version of controlnet_aux limits the
timm version to less than 0.6.7, so upgrading controlnet_aux solves it.
controlnet_aux also uses opencv-python-headless, so pin
opencv-python-headless to 4.8.0.74 too.


### Motivation and Context
* Add in missing operators for llama run

* Add simplified layer norm ops

### Description
<!-- Describe your changes. -->
Adding additional operators to the MIGraphX EP that are supported by
MIGraphX


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Allows for more models to be run through MIGraphX EP
### Description
We are seeing this [packaging
pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=940&_a=summary)
fail because we are running into BrowserStack account issues. Disabling
this step until issues are resolved
### Description
Our nightly CPU Python package is named "ort-nightly" instead of
"onnxruntime" for historical reasons (TensorFlow was like that).
Now we would prefer to make them the same.
Make this change for all nightly Python packages, including CPU,
GPU (CUDA), and maybe others.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->


### Motivation and Context
increase FP16 test coverage for all related EPs
1. Update ROCm Nuget pipeline build version to ROCm 6.2
2. Update the AMD-GPU Agent Pool base docker image for the ROCm Nuget pipeline
test stage. Search the `AMD GPU pipeline Nuget` page in OneNote to see how
to update it.

passed pipeline:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=580846&view=results
### Description
BrowserStack account issues have been resolved -- this PR enables E2E
browserstack tests in the pipeline again
To include a bug fix:
https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9890

Discussion:

https://discourse.cmake.org/t/cmake-incorrectly-links-to-nvrtc-builtins/12723/4

This bug fix should be included in our upcoming release, because right
now our GPU package depends on "libnvrtc-builtins.so.12.2", which has a
hardcoded CUDA version: 12.2. The minor CUDA version should not be
there.
…icrosoft#22458)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Honry and others added 29 commits November 19, 2024 12:44
Chromium will rename split's output name from "output" to "outputs" in
`OpSupportLimits` to align with the spec; the EP should check which name is
available to stay compatible.
### Description
1. Delete the TVM EP because it is no longer maintained
2. Delete ORTModule-related docker files and scripts.
### Description
<!-- Describe your changes. -->
Extend the timeout for a job that always fails.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change fixes multiple tests like QDQTransformerTests.MatMul_U8S8S8
for all architectures where an architecture-specific optimized function is
not yet available, like s390x.

### Description
Matrix B is packed by 16 elements, thus a new row starts 16 items later.
Also, when moving to the next element of C, increment the index only by 1.


### Motivation and Context
This change fixes the MLAS SGEMM fallback implementation for all
architectures which don't have architecture-specific implementations
available, like s390x.
microsoft#22914)

…ime/java (microsoft#22771)"

This reverts commit 632a36a.

### Description
<!-- Describe your changes. -->



### Motivation and Context
Running E2E tests using BrowserStack failed due to this PR.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
When updating from cp38 to cp310, there were some issues in the big-model
pipeline; two jobs failed: stable_diffusion and whisper.

1. For stable_diffusion, we are now using
"nvcr.io/nvidia/pytorch:22.11-py3" from the NVIDIA repo, which is for CUDA 11
and Python 3.8; they do not provide a Python 3.10 version for CUDA 11, and
the latest version of this docker image is for CUDA 12 and Python 3.10. To
solve this problem, I use an ubuntu22.04 docker image and then install all
the Python packages needed for this job.
2. For whisper, the original docker image is ubuntu20.04, which doesn't
have Python 3.10, so it has to be updated to ubuntu22.04.
### Description
Match the new SDPA pattern for huggingface BERT models exported from the
latest transformers package.

Some changes of transformers tests in CI pipeline:
(1) Enable tests for bert, distilbert and roberta models in CI.
(2) Remove out-of-date tests for huggingface models that were marked as
slow and not enabled in CI pipeline.
(3) Upgrade transformers package version to the latest.

### Motivation and Context

Recent huggingface transformers use torch SDPA in BERT modeling. The
graph pattern change causes attention fusion to stop working. Update
the fusion script to match the new pattern.
### Description
* Reduce GQA test combinations to save about 35 minutes test time in CI
pipelines.
* Show latency of transformers tests
* Use seed in DMMHA test to avoid random failure.
* For test_flash_attn_rocm.py, change the test skipping condition from "has
cuda ep" to "does not have rocm ep", so that it does not run in the cpu build.
* For test_flash_attn_cuda.py, move flash attention and memory efficient
attention tests to different classes, so that we can skip a test suite
instead of checking in each test.

### Motivation and Context
It takes too long to run GQA tests in CI pipelines since there are too
many combinations.

###### Linux GPU CI Pipeline
Before: 5097 passed, 68 skipped, 8 warnings in 1954.64s (0:32:34)
After:  150 passed, 176 skipped, 8 warnings in 530.38s (0:08:50)
Time Saved: **1424** seconds (0:23:44)

###### Windows GPU CUDA CI Pipeline
Before: 1781 passed, 72 skipped, 6 warnings in 605.48s (0:10:05)
After: 116 passed, 118 skipped, 6 warnings in 275.48s (0:04:35) 
Time Saved: **330** seconds (0:05:30)

###### Linux CPU CI Pipeline
Before: 5093 passed, 72 skipped, 4 warnings in 467.04s (0:07:47)
- 212.96s transformers/test_gqa_cpu.py::TestGQA::test_gqa_past
- 154.12s transformers/test_gqa_cpu.py::TestGQA::test_gqa_no_past
- 26.45s
transformers/test_gqa_cpu.py::TestGQA::test_gqa_interactive_one_batch

After: 116 passed, 210 skipped, 4 warnings in 93.41s (0:01:33)
- 0.97s  transformers/test_gqa_cpu.py::TestGQA::test_gqa_past
- 19.23s transformers/test_gqa_cpu.py::TestGQA::test_gqa_no_past
- 2.41s
transformers/test_gqa_cpu.py::TestGQA::test_gqa_interactive_one_batch

Time Saved: **374** seconds (0:06:14).
Option is named onnxruntime_FORCE_GENERIC_ALGORITHMS

Follow up to microsoft#22125.

### Description
This change adds a compile-time option to disable optimized algorithms and
use generic algorithms (excluding AVX* and SSE etc. in GEMM) on x86. This
new option is intended only for testing these algorithms, not for
production use.

The following build command on Linux x86_64 builds onnxruntime with the new
option enabled:
`./build.sh --parallel --cmake_extra_defines
onnxruntime_FORCE_GENERIC_ALGORITHMS=1`

### Motivation and Context
This change allows testing generic algorithms. This may be needed for
platforms which don't have optimized implementations available, like in
microsoft#22125.
…22810)

### Description
<!-- Describe your changes. -->
Update comment for `-I` to mention that symbolic dim values can be
provided with `-f`.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description

* Install PyTorch for transformers tests. The installation happens before
the Python tests so that they can use torch if needed.
* Update protobuf and numpy versions used in transformers test.

### Motivation and Context

Currently, transformers tests are enabled in the following CI pipelines:
* Linux CPU CI Pipeline (torch for cpu-only)
* Linux GPU CI Pipeline (torch for cuda 12)
* Windows GPU CUDA CI Pipeline (torch for cpu-only right now, note that
we might change it to torch for cuda 12 in the future).

For ROCm CI Pipeline, transformer tests are enabled but skipped since
onnx package is not installed in CI.

Previously, torch was not installed before python tests, so some tests
depending on torch were skipped like
[test_bind_onnx_types_not_supported_by_numpy](https://github.com/microsoft/onnxruntime/blob/f6e1d4482941d43737d40723df16a6bf0da43ee5/onnxruntime/test/python/onnxruntime_test_python_iobinding.py#L199)
or [test
user_compute_stream](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/python/onnxruntime_test_python.py#L465-L476).

In this PR, we changed build.py to install torch before running python
tests.
### Description
Merges PR microsoft#21851, microsoft#21222.

Implements TreeEnsemble from ai.onnx.ml==5 (CPU).

---------

Co-authored-by: Bilyana Indzheva <[email protected]>
Co-authored-by: Bilyana Indzheva <[email protected]>
Co-authored-by: Christian Bourjau <[email protected]>
### Description
Add a new stage to build cuda and dml in Windows GPU CI pipeline (PR
checks) to prevent regressions introduced by new cuda tests.
Update the name prefix of all tests in cuda/testcases to CudaEp so that
they can be skipped easily.

### Motivation and Context
1. CudaNhwcEP is added by default when using the cuda ep
2. If onnxruntime_ENABLE_CUDA_EP_INTERNAL_TES is enabled, the tests in
tests/provider/cuda/testcases are added too.

### To do
Add enable_pybind in the new stage.
Currently, --enable_pybind triggers some Python tests, like
onnxruntime_test_python.py, which use the get_available_providers() API.
More discussion is needed to decide how to make this work.
### Description
Update pipeline status:
(1) replace dead link of cuda pipeline
(2) remove dead link of training distributed pipeline
(3) add webgpu pipeline

Before:
https://github.com/microsoft/onnxruntime/blob/main/README.md#builtin-pipeline-status
After:
https://github.com/microsoft/onnxruntime/blob/8ec473d013d1f41f96459b11f2ebab43f1eb3aa0/README.md#builtin-pipeline-status

### Motivation and Context
Some pipelines were removed and need to be replaced with new ones.
We need to be able to control/override the exact version of the QNN SDK used
for the Android build, as qnn-runtime (Maven package) releases lag behind
QNN SDK releases.
This PR limits the axis of the CumSum operator to be a constant when
using WebNN EP.
@Honry  @fdwr PTAL.
Slice with negative steps can be emulated by reverse+slice.
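
A quick NumPy check of that equivalence (illustration only, not the EP implementation):

```
import numpy as np

x = np.arange(10)

# Slice with a negative step ...
negative_step = x[8:2:-2]            # -> [8, 6, 4]

# ... emulated by reversing first, then slicing with a positive step
# (an original index i maps to len(x) - 1 - i in the reversed array).
emulated = x[::-1][1:7:2]

assert np.array_equal(negative_step, emulated)
```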
### Description
Fix sequential_executor.cc to avoid a segfault when profiling is used on a
model with an empty Optional.



### Motivation and Context
Fixes microsoft#22890
### Description
<!-- Describe your changes. -->
Update this patch because the original file has changed


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
AppendExecutionProvider("CoreML", {{"MLComputeUnits","MLProgram"}})
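
A hedged Python-side sketch of the same idea, assuming the option key is forwarded as a CoreML provider option (the model path and compute-units value are placeholders; the call quoted above is the C++ form):

```
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model
    providers=[("CoreMLExecutionProvider", {"MLComputeUnits": "CPUAndGPU"})],
)
```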



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
In JS, reduce on an empty array with no initial value will throw an error.
Fix it by checking the array length first.
### Description
Fixes regression in post merge pipeline caused by microsoft#22612



### Motivation and Context
So far, there are no artifactFeeds in the Public Project
### Description
- Erf
- Round
- Max
- ReduceMax
- ReduceMean
- ReduceSum
- Unsqueeze
- Squeeze
- Softmax



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Fix MatMulNBits accuracy level



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add ReduceL2 support to QNN EP. Some of the QNN AI Hub models contain
ReduceL2, such as openai_clip_CLIPTextEncoder and
openai_clip_CLIPIamgeEncoder. Without this PR, ReduceL2 is assigned to CPU
and the graph is split into 2 QNN graphs; with this PR, all nodes are
placed in the QNN EP.
Some quantized models don't have the Conv/Gemm node's bias quantized but
still leave it in float. This PR creates a sub-graph to quantize the bias
for Conv/Gemm nodes with scale = scale_input_0 * scale_input_1 and zp = 0.
We only do this for bias initializers so that ConstantFolding will fold the
sub-graph into a real quantized int32 bias initializer during the next
round of graph optimization.
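
A small NumPy sketch of the quantization rule described above (all scale and bias values are made up for illustration):

```
import numpy as np

# Scales of the two quantized inputs of a Conv/Gemm node (placeholders).
scale_input_0 = np.float32(0.02)    # e.g. activation scale
scale_input_1 = np.float32(0.005)   # e.g. weight scale

bias_fp32 = np.array([0.13, -0.07, 0.002], dtype=np.float32)

# bias scale = scale_input_0 * scale_input_1, zero point = 0,
# quantized to int32 (what ConstantFolding folds the sub-graph into).
bias_scale = scale_input_0 * scale_input_1
bias_int32 = np.round(bias_fp32 / bias_scale).astype(np.int32)

# Dequantizing recovers an approximation of the original float bias.
assert np.allclose(bias_int32 * bias_scale, bias_fp32, atol=bias_scale)
```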
@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from d0aada7 to 7ca7306 Compare November 29, 2024 21:54