ORT 1.20.0 Release: Cherry pick round 1 #22526
Conversation
- Allow specification of iOS simulator runtime version to use.
- Pick simulator runtime version (iphonesimulator 16.4) that is supported by the Xcode version (14.3.1) that we use.
- Disable CoreML EP's DepthToSpace op support for CoreML version less than 7, with DCR mode, and FP16 input. It doesn't produce the correct output in this case.
- Some cleanup of iOS test infrastructure.
### Description
Update QNN default version to 2.27 in CI pipeline.
…tization to the CPU EP (#22436)

### Description
Adds QNN provider option `offload_graph_io_quantization` to offload graph input quantization and graph output dequantization to the CPU EP. The option is disabled by default to maintain current behavior.

### Motivation and Context
Offloading the handling of I/O quantization to the CPU EP significantly improves inference latency for many models.
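For context, QNN provider options like this one are passed as string key/value pairs at session creation time. Below is a minimal C++ sketch; the model path and the `backend_path` value are placeholder assumptions, and only the option name `offload_graph_io_quantization` comes from this PR:

```cpp
#include <string>
#include <unordered_map>

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn-example");
  Ort::SessionOptions session_options;

  // Provider options are plain string key/value pairs.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "QnnHtp.dll"},             // placeholder: HTP backend library
      {"offload_graph_io_quantization", "1"}};    // enable the new option

  session_options.AppendExecutionProvider("QNN", qnn_options);

  // "model.onnx" is a placeholder path.
  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```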
### Description
This adds support for partial RotaryEmbedding to DML. Essentially, partial RotaryEmbedding consists of doing the rotary embedding calculation on a subregion of the input tensor, as if its head size was `rotary_embedding_dim`, while leaving the second part of the tensor (i.e. `head_size - rotary_embedding_dim`) alone. To achieve this, all we need to do is follow these steps:

1. Split the tensor into 2 parts
2. Run the rotary embedding algorithm on the first part, just like we were doing before on the entire tensor
3. Join the 2 parts back together

Since we're leaving the second part intact, the RotaryEmbedding fusion will still be done within DML. Also, the concat at the end is essentially free because DML optimizes it out and directly allocates the result of RotaryEmbedding at the right place. The only overhead here is the splitting of the tensor at the beginning, which we should eventually make part of the RotaryEmbedding fusion within DML.

### Motivation and Context
This fix allows us to correctly run models that have a `partial_rotary_factor` setting in Hugging Face, including NVIDIA's Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct
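To make the split/rotate/concat scheme concrete, here is a minimal C++ sketch for a single head vector. The half-rotation pairing and all names are illustrative assumptions, not the DML kernel's actual layout or code:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: applies rotary embedding to the first `rotary_dim`
// elements of one head vector and leaves the tail untouched, mirroring the
// split -> rotate -> concat steps described above. Assumes the "half
// rotation" layout (element i paired with element i + rotary_dim / 2).
std::vector<float> PartialRotary(const std::vector<float>& x,        // [head_size]
                                 const std::vector<float>& cos_cache, // [rotary_dim / 2]
                                 const std::vector<float>& sin_cache, // [rotary_dim / 2]
                                 std::size_t rotary_dim) {
  std::vector<float> out(x.size());
  const std::size_t half = rotary_dim / 2;
  for (std::size_t i = 0; i < half; ++i) {
    // Rotate the pair (x[i], x[i + half]) by the cached angle.
    out[i] = x[i] * cos_cache[i] - x[i + half] * sin_cache[i];
    out[i + half] = x[i] * sin_cache[i] + x[i + half] * cos_cache[i];
  }
  // The tail (head_size - rotary_dim elements) passes through unchanged.
  for (std::size_t i = rotary_dim; i < x.size(); ++i) out[i] = x[i];
  return out;
}
```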
do you also want to take #22508 to fix the "Big Models" pipeline?
I think that's a good idea since I see that the big models checks have already failed a couple of times - @apsonawane can you please add that one?
…es in 0.26.0 (#22508)

### Description
Pin huggingface_hub to 0.25.2 due to breaking changes in 0.26.0.

### Motivation and Context
We depend on `diffusers==0.28.0`, which [depends on](https://github.com/huggingface/diffusers/blob/v0.28.0-release/setup.py#L104) `huggingface_hub>=0.20.2`. There are breaking changes in the latest huggingface_hub 0.26.0 release that break our Big Models pipeline: [Release v0.26.0: Multi-tokens support, conversational VLMs and quality of life improvements · huggingface/huggingface_hub](https://github.com/huggingface/huggingface_hub/releases/tag/v0.26.0). Specifically, the breaking changes to `cached_download()` cause our pipeline to fail.

![image](https://github.com/user-attachments/assets/c1d15c7e-9a5d-4ef3-8d1b-35bde0a2ca82)
Please also include this change: #22516
This pull request upgrades the CMake version from v3.31.0-rc1 to v3.31.0-rc2 to include a bug fix for CUDA contributed by NVIDIA: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9902. AB#51692
And you may also need #22479 to get the Windows pipelines to pass. Or you may just need to retry repeatedly.
### Description
The recent PR #22223 introduced 2 bugs in the implementation of CPU LayerNorm f16:

- Possible access to a nullptr for the bias: `const TensorShape& bias_shape = bias->Shape();` will crash when `bias` does not exist. (Amazingly, this one seems not to be covered by any test case.)
  - Fix: guard with a pointer check.
- A race condition inside ComputeJob: `ComputeJob()` is dispatched to the threadpool and internally tries to modify `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_`, which are `std::unique_ptr`s and are not thread-safe.
  - Fix: move the modification of `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_` out of `ComputeJob()` and into `LayerNormImpl::ComputeWithoutContext()`. It may still have a race condition because `ConcurrentRunSupported` is set to `true` for the CPU EP, so an OrtMutex was added.

This should fix the recent flaky tests as well.
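For illustration, here is a hedged C++ sketch of the shape of both fixes; the names and structure are simplified stand-ins and do not match the real LayerNormImpl code:

```cpp
#include <memory>
#include <mutex>

struct Tensor;  // stand-in for onnxruntime::Tensor

// Simplified sketch, not the actual ORT implementation.
struct LayerNormImpl {
  std::unique_ptr<float[]> scale_fp32_;
  std::unique_ptr<float[]> bias_fp32_;
  std::mutex mutex_;  // stands in for the OrtMutex added by the fix

  void ComputeWithoutContext(const Tensor* scale, const Tensor* bias) {
    // Fix 1: bias is optional, so guard the dereference with a pointer check
    // instead of unconditionally calling bias->Shape().
    const bool has_bias = (bias != nullptr);

    // Fix 2: convert the fp16 scale/bias to fp32 once, under a lock, *before*
    // ComputeJob() runs on the threadpool, so the unique_ptr members are
    // never mutated concurrently.
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (!scale_fp32_) { /* convert scale to fp32 and cache it */ }
      if (has_bias && !bias_fp32_) { /* convert bias to fp32 and cache it */ }
    }

    // ComputeJob() now only reads the cached fp32 buffers.
  }
};
```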
ORT 1.20.0 release preparation: Cherry pick round 1
Approved cherry pick comments