
ORT 1.20.0 Release: Cherry pick round 1 #22526

Merged: 7 commits into rel-1.20.0 on Oct 22, 2024
Conversation

apsonawane (Contributor)

ORT 1.20.0 release preparation: Cherry pick round 1

Approved cherry pick comments

edgchen1 and others added 4 commits October 21, 2024 13:43
- Allow specification of iOS simulator runtime version to use.
- Pick simulator runtime version (iphonesimulator 16.4) that is supported by the Xcode version (14.3.1) that we use.
- Disable the CoreML EP's DepthToSpace op support for CoreML versions earlier than 7 when using DCR mode with FP16 input, since it doesn't produce the correct output in that case.
- Some cleanup of iOS test infrastructure.
### Description
Update QNN default version to 2.27 in CI pipeline
…tization to the CPU EP (#22436)

### Description
Adds QNN provider option `offload_graph_io_quantization` to offload
graph input quantization and graph output dequantization to the CPU EP.
Option is disabled by default to maintain current behavior.


### Motivation and Context
Offloading the handling of I/O quantization to the CPU EP significantly
improves inference latency for many models.
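The new option is a QNN EP provider option. A minimal sketch of how it might be passed through the onnxruntime Python API (the `backend_path` value and the `"1"`/`"0"` string encoding are assumptions for illustration, not confirmed by this PR):

```python
# Hedged sketch: enabling the new QNN EP option via onnxruntime provider options.
# The option name comes from this PR; the value format ("1") and backend_path
# are assumptions.
qnn_provider_options = {
    "backend_path": "QnnHtp.dll",          # typical QNN HTP backend library (assumption)
    "offload_graph_io_quantization": "1",  # offload I/O (de)quantization to the CPU EP
}
providers = [
    ("QNNExecutionProvider", qnn_provider_options),
    "CPUExecutionProvider",  # fallback; also runs the offloaded Q/DQ work
]
# A session would then be created as:
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
```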
### Description
This adds support for partial RotaryEmbedding to DML. Essentially,
partial RotaryEmbedding consists of doing the rotary embedding
calculation on a subregion of the input tensor as if its head size
were `rotary_embedding_dim`, while leaving the rest of the tensor
(i.e. `head_size - rotary_embedding_dim`) untouched.

To achieve this, all we need to do is follow the following steps:

1. Split the tensor into 2 parts
2. Run the rotary embedding algorithm on the first part, just like we
were doing before on the entire tensor
3. Join the 2 parts back together

Since we're leaving the middle part intact, the RotaryEmbedding fusion
will still be done within DML. Also, the concat at the end is
essentially free because DML optimizes it out and directly allocates the
result of RotaryEmbedding in the right place. The only overhead here is
the splitting of the tensor at the beginning, which we should eventually
make part of the RotaryEmbedding fusion within DML.
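The three steps above can be sketched in NumPy. This is a hedged illustration, not the DML implementation; it assumes the non-interleaved (half-split) rotation convention, whereas RotaryEmbedding also supports an interleaved layout:

```python
import numpy as np

def partial_rotary(x, cos, sin, rotary_dim):
    """Rotate the first rotary_dim features of each head, pass the rest through."""
    # 1. Split the tensor into the rotated and pass-through parts.
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    # 2. Run the rotary embedding algorithm on the first part, as if its
    #    head size were rotary_dim (half-split convention, an assumption).
    half = rotary_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # 3. Join the two parts back together.
    return np.concatenate([rotated, x_pass], axis=-1)

# Tiny example: seq_len=2, head_size=8, rotary_dim=4.
x = np.arange(16, dtype=np.float32).reshape(2, 8)
pos = np.arange(2)[:, None]                        # (seq, 1)
inv_freq = 1.0 / (10000.0 ** (np.arange(2) / 2.0)) # (rotary_dim // 2,)
cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)
out = partial_rotary(x, cos, sin, rotary_dim=4)
assert np.array_equal(out[:, 4:], x[:, 4:])  # pass-through region untouched
```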



### Motivation and Context
This fix allows us to correctly run models that have a
`partial_rotary_factor` setting in huggingface, including Nvidia's
Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct
@edgchen1 (Contributor)

do you also want to take #22508 to fix the "Big Models" pipeline?

@sophies927 (Contributor)

> do you also want to take #22508 to fix the "Big Models" pipeline?

I think that's a good idea since I see that the big models checks have already failed a couple times - @apsonawane can you please add that one?

…es in 0.26.0 (#22508)

### Description
Pin huggingface_hub to 0.25.2 due to breaking changes in 0.26.0.



### Motivation and Context
We depend on `diffusers==0.28.0`, which [depends
on](https://github.com/huggingface/diffusers/blob/v0.28.0-release/setup.py#L104)
`huggingface_hub>=0.20.2`. There are breaking changes with the latest
huggingface_hub 0.26.0 release that break our Big Models pipeline:
[Release v0.26.0: Multi-tokens support, conversational VLMs and quality
of life improvements ·
huggingface/huggingface_hub](https://github.com/huggingface/huggingface_hub/releases/tag/v0.26.0)

Specifically, the breaking changes to `cached_download()` cause our
pipeline to fail.
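The failure mode can be sketched briefly: code written against the old `cached_download()` helper stops importing once huggingface_hub 0.26.0 removes it, so the pipeline needs the pin. A hedged sketch (the import guard is illustrative; the actual fix in #22508 is the version pin itself):

```python
# huggingface_hub 0.26.0 removed the long-deprecated cached_download()
# helper, which older callers (such as the pinned diffusers release)
# still import, so the import itself fails on 0.26.0+.
try:
    from huggingface_hub import cached_download  # removed in 0.26.0
except ImportError:
    cached_download = None  # taken on 0.26.0+, or if the package is absent

pin = "huggingface_hub==0.25.2"  # the pin applied by #22508
```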

![image](https://github.com/user-attachments/assets/c1d15c7e-9a5d-4ef3-8d1b-35bde0a2ca82)
@snnn (Member) commented Oct 22, 2024

Please also include this change: #22516

This pull request upgrades the CMake version from v3.31.0-rc1 to
v3.31.0-rc2 to include a bug fix for CUDA from Nvidia:
https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9902

AB#51692
@snnn (Member) commented Oct 22, 2024

And you may also need #22479 to get the Windows pipelines to pass. Or you may just need to retry a few times.

### Description

The recent PR #22223 introduced 2 bugs in the implementation of CPU
LayerNorm f16:
- possible access to nullptr for bias:
`const TensorShape& bias_shape = bias->Shape();` will crash when `bias`
does not exist (surprisingly, this case is not covered by any test).
   - fix: guard with a pointer check
- a race condition inside `ComputeJob()`:
`ComputeJob()` is dispatched to the threadpool and internally tries to
modify `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_`,
which are `std::unique_ptr`s and are not thread-safe.
   - fix: move the modification of `LayerNormImpl::scale_fp32_` and
`LayerNormImpl::bias_fp32_` out of `ComputeJob()` and into
`LayerNormImpl::ComputeWithoutContext()`. A race condition is still
possible because `ConcurrentRunSupported` is set to `true` for the CPU
EP, so an OrtMutex was added.

This should fix the recent flaky tests as well.
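Both fixes follow a common pattern: null-check an optional input before dereferencing it, and serialize lazy initialization of shared buffers. A hedged Python sketch of that pattern, with `threading.Lock` playing the role of OrtMutex (class and field names are illustrative, not ORT's actual C++ API):

```python
import threading

class LayerNormLike:
    """Illustrative sketch: optional bias plus lock-guarded lazy fp32 buffers."""

    def __init__(self, scale_fp16, bias_fp16=None):
        self.scale_fp16 = scale_fp16
        self.bias_fp16 = bias_fp16      # bias is optional: may be None
        self._scale_fp32 = None        # shared lazily-initialized state
        self._bias_fp32 = None
        self._mutex = threading.Lock()  # plays the role of OrtMutex

    def _ensure_fp32(self):
        # Do the one-time conversion outside the per-row compute jobs,
        # under a lock, so concurrent calls cannot race on the buffers.
        with self._mutex:
            if self._scale_fp32 is None:
                self._scale_fp32 = [float(v) for v in self.scale_fp16]
            # Guard the optional bias with a None check before using it.
            if self.bias_fp16 is not None and self._bias_fp32 is None:
                self._bias_fp32 = [float(v) for v in self.bias_fp16]

    def compute(self, row):
        self._ensure_fp32()
        bias = self._bias_fp32 or [0.0] * len(row)
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        return [(v - mean) / (var + 1e-5) ** 0.5 * s + b
                for v, s, b in zip(row, self._scale_fp32, bias)]
```

Constructing the object without a bias and calling `compute()` exercises the pointer-check path; calling it from multiple threads exercises the lock.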
@sophies927 sophies927 requested a review from fs-eire October 22, 2024 17:42
@apsonawane apsonawane merged commit 2d00351 into rel-1.20.0 Oct 22, 2024
104 of 105 checks passed
@apsonawane apsonawane deleted the asonawane/cherry-picks branch October 22, 2024 20:57
@sophies927 sophies927 added release:1.20.0 cherry-picked Cherry-picked for a cherrypicks branch labels Oct 24, 2024
8 participants