
fix: updated data ops to support the complete graph on OVEP #374

Closed
wants to merge 102 commits into from

Conversation

ankitm3k

Description

The ONNX model provided by the issue author was not fully supported by OVEP and failed inference with the ort_perf_test app. This PR enables the GRU and LogSoftmax ops, which allows the whole model graph to run on OVEP during execution. The unit test for the GRU op is disabled.

We also investigated the inference output over multiple iterations with a single common input: the model produced consistent and correct output across all inference iterations during testing, resolving the regression in output after the first inference for this model architecture.
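
For context, a minimal, hypothetical sketch of what enabling an op in the EP's supported-op list amounts to (names and structure are illustrative, not the actual OVEP data ops source):

```c++
#include <set>
#include <string>

// Ops not in this set fall back to other execution providers, splitting the graph.
static const std::set<std::string> kOvepSupportedOps = {
    /* existing ops ... */ "GRU", "LogSoftmax"};  // newly enabled in this PR

bool IsOpSupportedOnOvep(const std::string& op_type) {
  return kOvepSupportedOps.count(op_type) != 0;
}
```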

This PR fixes - microsoft#19975

chilo-ms and others added 30 commits April 25, 2024 16:07
)

**Current issue:**

Once ORT gets the capability from EP's GetCapability(), it creates a
graph viewer based on the capability as below:
`viewers.push_back(std::make_unique<GraphViewer>(graph,
*cur_capability.sub_graph));` or see the code
[here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/graph_partitioner.cc#L458).

At this point, the graph viewer may generate the wrong order of `nodes_in_topological_order_` when calling
[Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107),
so during EP Compile() the EP might create a model proto with the wrong node ordering from the graph viewer when calling
[GraphViewerToProto()](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_proto_serializer.cc#L37),
because of that `nodes_in_topological_order_`.

This is a problem when TRT EP refits weights into the "weightless" engine: the engine is built from the model proto provided by TRT EP, while the weights live in the original ONNX model. Because the model proto and the original ONNX model differ in node ordering, TRT complains when refitting.

**The original model (subgraph of ResNet50):**
<img width="442" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/54722500/bb9a641d-f2f2-46c3-aebf-4084a08ff289">

**The serialized model proto generated by TRT EP:**
(The highlighted part has the wrong node order compared to the original
model.)
<img width="340" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/54722500/bbc6bf34-f960-4753-9474-a18ebc2dc48b">

**Solution 1:**
Change default comparator to `NodeCompare::operator() {return
n1->Index() > n2->Index();}`

The root cause of the different node order between the original model and the EP-generated model is that the graph viewer
[generates](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107)
a different `nodes_in_topological_order_`. Modifying the
`NodeCompare::operator()` used for sorting can fix the problem.

`NodeCompare::operator()` is used in
[Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1760),
where the input nodes of the current node are
[sorted](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1802)
by node index.
Because the sorted nodes are pushed onto a stack that later determines the final topological order in a first-in, last-out fashion, the node with the larger index should be pushed onto the stack first. That way we get a topological order in which the node with the smaller index comes first.
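
For illustration, a minimal sketch (simplified types, not the actual ORT code) of why the comparator direction matters when the sorted inputs are pushed onto a LIFO stack:

```c++
#include <algorithm>
#include <vector>

struct Node {
  int index;
  std::vector<Node*> inputs;
  int Index() const { return index; }
};

struct NodeCompare {
  // Larger index first: after the LIFO pop, the smaller-index node is visited first.
  bool operator()(const Node* n1, const Node* n2) const { return n1->Index() > n2->Index(); }
};

void PushInputs(std::vector<Node*> inputs, std::vector<Node*>& stack) {
  std::sort(inputs.begin(), inputs.end(), NodeCompare{});
  for (Node* n : inputs) stack.push_back(n);  // smallest index pushed last, popped first
}
```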

**Solution 2 (the one this PR uses):**
Use a priority-based BFS for the topological sort in GraphViewerToProto().
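
As a rough illustration of this approach (simplified types and names, not the actual GraphViewerToProto() code), a priority-based topological sort can be written as Kahn's algorithm with a min-heap keyed on node index:

```c++
#include <functional>
#include <queue>
#include <vector>

// adj[u] lists the nodes that consume u's output; node ids are 0..n-1.
std::vector<int> PriorityTopoSort(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> in_degree(n, 0);
  for (const auto& outs : adj)
    for (int v : outs) ++in_degree[v];

  // Smallest node index first, so the output order follows the original indices.
  std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
  for (int v = 0; v < n; ++v)
    if (in_degree[v] == 0) ready.push(v);

  std::vector<int> order;
  while (!ready.empty()) {
    int u = ready.top();
    ready.pop();
    order.push_back(u);
    for (int v : adj[u])
      if (--in_degree[v] == 0) ready.push(v);
  }
  return order;  // order.size() < n would indicate a cycle
}
```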
### Description

Avx2:

Int8

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 90.96 | 25.15 | -72% | 7.65 | 11.71 | 53%
Blklen32 | 90.73 | 48.55 | -46% | 7.86 | 14.28 | 81%
Blklen64 | 89.49 | 68.84 | -23% | 8.30 | 15.78 | 90%
Blklen128 | 87.38 | 78.37 | -10% | 7.90 | 16.05 | 103%
Blklen256 | 89.45 | 82.36 | -7% | 8.30 | 16.56 | 99%

Fp32

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 91.36 | 105.18 | 15% | 7.57 | 9.52 | 25%
Blklen32 | 89.30 | 105.99 | 18% | 7.65 | 9.68 | 26%
Blklen64 | 89.53 | 101.41 | 13% | 7.97 | 9.84 | 23%
Blklen128 | 85.23 | 99.71 | 16% | 7.86 | 10.39 | 32%
Blklen256 | 88.46 | 97.94 | 10% | 8.32 | 10.23 | 22%

Avx512vnni:

Int8

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 132.18 | 21.56 | -83% | 10.34 | 11.48 | 11%
Blklen32 | 168.28 | 43.69 | -74% | 11.85 | 14.73 | 24%
Blklen64 | 201.81 | 60.29 | -70% | 12.36 | 15.47 | 25%
Blklen128 | 194.92 | 57.04 | -71% | 13.03 | 14.67 | 12%
Blklen256 | 218.76 | 70.20 | -68% | 13.33 | 16.31 | 22%

Fp32

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 102.81 | 92.74 | -9% | 8.41 | 9.18 | 9%
Blklen32 | 109.49 | 97.08 | -11% | 8.83 | 11.51 | 30%
Blklen64 | 104.13 | 101.57 | -2% | 9.32 | 12.00 | 28%
Blklen128 | 108.45 | 103.69 | -4% | 9.58 | 12.45 | 29%
Blklen256 | 109.43 | 106.43 | -2% | 9.19 | 12.2 | 32%

---------

Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Co-authored-by: edgchen1 <[email protected]>
### Description
Fix the build error for Win ARM64 Release build.
graph_transform_test.cc(1,1): error C1128: number of sections exceeded
object file format limit: compile with /bigobj
[D:\build\Windows\Release\onnxruntime_test_all.vcxproj]


### Motivation and Context
Fix issue: microsoft#20406
### Description
<!-- Describe your changes. -->
Update to more generic url


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix some misc build warnings from x86 Windows build


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…ml (microsoft#20472)

### Description
<!-- Describe your changes. -->

As title.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Create a common util to get supported activation set
- Fuse activation to BatchNormalization if possible
### Description

Following the issue microsoft#19223, introduce `per_channel` attribute in
`MinMaxCalibrater` to develop per-channel calibration.

If required, this new functionality should be implemented in the other
_Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...).

### Motivation and Context
- This is the first part to solve microsoft#19223's proposal.
- If per channel calibration was allowed, the quantization algorithm
could be updated to improve quantization performance, i.e. weights
quantization per channel and not per tensor. That is why it would be
interesting to have a 'per_channel' option in any 'Calibrater' class to
produce a set of calibration vectors instead of a single scalar.
### Description

The MLAS MatMul NBits implementation requires packed B; add a condition for this.

This logic will need updating if that requirement changes.


### Motivation and Context

---------

Signed-off-by: Liqun Fu <[email protected]>
…0453)

The defines for these tests have to be checked in a consistent order. If we
check for TRT -> CUDA -> DML, we cannot reverse that order in later
defines, as we might want to build for multiple EPs.

+@PatriceVignola
### Description
<!-- Describe your changes. -->
flatbuffers::String::c_str returns a pointer that may not be null
terminated.

This causes a warning when building on an A100 with gcc 11. Not clear
why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate a
warning. Either way it's safer to use str() as that constructs a
std::string with data() and size().

It is unclear whether this is an issue in practice, since the value is read from the
flatbuffer, which most likely didn't write out an empty string (in order to
save space). There's no performance reason to use c_str instead of str, and in
LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a
std::string anyway.

```c++
struct String : public Vector<char> {
  const char *c_str() const { return reinterpret_cast<const char *>(Data()); }
  std::string str() const { return std::string(c_str(), size()); }
  // ...
};
```

```
    inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3:
/usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
```
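
A hedged illustration of the switch (the helper name below is made up; flatbuffers::String::str() builds a std::string from Data() and size()):

```c++
#include <string>
#include "flatbuffers/flatbuffers.h"

// fbs_str points at a string field read from an ORT-format flatbuffer.
std::string ToStdString(const flatbuffers::String* fbs_str) {
  // Before: std::string(fbs_str->c_str());  // c_str() may not be null terminated
  return fbs_str->str();                     // str() uses Data() and size()
}
```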

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix build error on A100
### Description
<!-- Describe your changes. -->

Error: 

**Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss)
##[error]Artifact name is not valid:
e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', '/', '"', ':', '<', '>', '|', '*', and '?'**

The date was not being expanded correctly in the artifact name. Use the predefined
pipeline variable BuildNumber instead, which serves a similar purpose as a
timestamp.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

RN CI failure

---------

Co-authored-by: rachguo <[email protected]>
Co-authored-by: rachguo <[email protected]>
… fp32. (microsoft#20486)

### Description
Perform the computation in fp32 and convert to fp16 only at the end.
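
A minimal sketch of the pattern, with illustrative types (the actual kernel differs): accumulate in fp32 and convert to fp16 once, at the end.

```c++
#include <vector>

float DotInFp32(const std::vector<float>& a, const std::vector<float>& b) {
  float acc = 0.0f;  // fp32 accumulator, even when the inputs originate from fp16 data
  const size_t n = a.size() < b.size() ? a.size() : b.size();
  for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
  return acc;  // the caller converts this final value to fp16
}
```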



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description

```
  tvm_execution_provider.cc
  denormal.cc
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj]
  D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6):
  see declaration of 'onnxruntime::GraphViewerToProto'
  D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5):
  while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)'
  
  cpuid_uarch.cc
  get_execution_providers.cc
  abi_session_options.cc
  bias_dropout_fusion.cc
  if.cc
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…nder linux (microsoft#20466)

### Description
<!-- Describe your changes. -->

[VitisAI] Fix the problem that gsl cannot be found when compiling under Linux.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Zhenze Wang <[email protected]>
…crosoft#20500)

### Description
<!-- Describe your changes. -->
Update order of steps


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CI
### Description
Fuse Cast + SoftmaxCrossEntropyLossInternal to
SoftmaxCrossEntropyLossInternal.
script changes to include qnn sdk libs with onnxruntime-qnn python
package.
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
### Description
Previously, Prelu in QNN would fail when the input is fp16 and alpha is fp32.
QNN requires alpha to be fp16 when the input is fp16.
This is resolved by casting alpha to fp16 and passing it to QNN.

### Motivation and Context
Makes QNN Prelu support fp16 case.

---------

Co-authored-by: Hector Li <[email protected]>
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->

microsoft#20418

Add back Catalyst changes only for now.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: rachguo <[email protected]>
This PR is needed for
microsoft#20411 to make sure TRT EP
uses the priority-based topological sort consistently across TRT EP.
### Description

Update README.md in /js/web/

- update compatibility table
- update links to onnxruntime.ai
Bump up version in main from 1.18.0 to 1.19.0 since the release branch
has been cut.

---------

Co-authored-by: Edward Chen <[email protected]>
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.

Block sparse attention was proposed in [Sparse
Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also
adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different
sparse layout.

In Phi-3-small, the sparse layout is static, and works with
unidirectional (causal) attention.

Compared to dense attention, the benefit of block sparse is to speed up
both training and inference. It could save memory thus support longer
context length.

- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache
concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format (a CPU sketch of this conversion follows the checklist)
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to
cubin and dispatcher.
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing.
- [x] Test relevance
- [x] Test performance
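
A hedged CPU sketch of the block-mask-to-CSR conversion mentioned above (the real implementation is a CUDA kernel; names here are illustrative):

```c++
#include <vector>

// mask is a dense [num_rows x num_cols] block mask; the output is a CSR layout of
// the nonzero blocks: row_offsets has num_rows + 1 entries, col_indices one per block.
void BlockMaskToCsr(const std::vector<int>& mask, int num_rows, int num_cols,
                    std::vector<int>& row_offsets, std::vector<int>& col_indices) {
  row_offsets.assign(num_rows + 1, 0);
  col_indices.clear();
  for (int r = 0; r < num_rows; ++r) {
    for (int c = 0; c < num_cols; ++c) {
      if (mask[r * num_cols + c] != 0) col_indices.push_back(c);
    }
    row_offsets[r + 1] = static_cast<int>(col_indices.size());
  }
}
```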

### Performance
Test in A100-SXM4-80GB with `batch_size=4, num_heads=32,
max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16,
vert_stride=8, num_layout=8`

We compare sparse attention to corresponding GQA with local attention
windows size 1024, or GQA with dense causal.

Average latency in milliseconds (for fused attention kernel used in
prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in
token generation):

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still has room for
improvement.

#### Limitations
Only right-side padding and unidirectional attention are supported.

The following are not supported in the first version:
(1) Packed mode (like PackedMultiHeadAttention, where padding has been removed from the input).
(2) Paged attention.
(3) Bidirectional attention.
(4) GPU compute capability other than 8.0, 8.6 and 8.9.
(5) Left-side padding.

Some of these limitations will be removed in the future (possibly in a new
operator).
Fix
onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637:
error: argument 'session' of command @param is not found in the argument
list of

```
OrtApi::AddExternalInitializersFromFilesInMemory(
    OrtSessionOptions *options,
    const char *const *external_initializer_file_names,
    char *const *external_initializer_file_buffer_array,
    const size_t *external_initializer_file_lengths,
    size_t num_external_initializer_files)
```
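
A hedged example of the kind of fix Doxygen expects here (the comment text is illustrative): every `\param` name must match a parameter in the declaration above, with no stale names such as `session`.

```c++
/** Adds external initializers from files that are kept in memory.
 *
 * \param options Session options to update.
 * \param external_initializer_file_names File names referenced by the model.
 * \param external_initializer_file_buffer_array In-memory buffer for each file.
 * \param external_initializer_file_lengths Length in bytes of each buffer.
 * \param num_external_initializer_files Number of entries in the arrays above.
 */
```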
…_SUPPORTED is defined. (microsoft#20509)

Only define the CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. Otherwise it can cause unused-variable warnings in some compilations.
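
A hedged sketch of the guard (class body abbreviated):

```c++
class CPUIDInfo {
  // ... other members ...
#if defined(CPUINFO_SUPPORTED)
  bool pytorch_cpuinfo_init_{false};  // only declared when cpuinfo is available
#endif
};
```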
### Description
Remove an excess trailing semicolon from a specific macro.

### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for Perl,
and the parser (ucpp) breaks with a "double semicolon" error on
the lines where the macro is applied.
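
A hedged, made-up illustration of the problem (the macro name is hypothetical):

```c++
// If the macro body already ends with ';', writing `DECLARE_COUNTER(x);` expands to
// ';;', which strict preprocessor-based parsers such as ucpp reject.
#define DECLARE_COUNTER(name) static int name = 0  // no trailing semicolon in the body
DECLARE_COUNTER(request_count);                    // the caller supplies the semicolon
```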
yihonglyu and others added 24 commits May 10, 2024 16:07
### Motivation and Context

The Intel NPU does not support 16-bit int quantized operators.
Consequently, the execution provider removes the
QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and
executes the operation as FP16 in the backend. However, if a Clip
operator was fused into a Q operator in the node unit, the removal of
Q/DQ operators results in inaccuracies because the effect of the
original Clip operators is lost.

Consider the following example:
- FP32 model: -> Op_FP32 -> Clip ->
- QDQ model: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
- After ClipQuantFusion: -> (DQ-> Op_FP32 -> Q) -> (DQ' -> Q') ->
- Intel Execution Provider strips Q/DQ: -> Op_FP16 ->

To solve this issue, we have enabled ClipQuantFusion exclusively on the
CPU execution provider.
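
A minimal sketch of the guard, assuming a helper that checks the node's assigned execution provider (names are illustrative, not the actual transformer registration):

```c++
#include <string>

// ClipQuantFusion is applied only for nodes assigned to the CPU EP, so providers
// that later strip Q/DQ (such as the Intel NPU path) keep the explicit Clip.
bool ClipQuantFusionAllowed(const std::string& assigned_ep_type) {
  return assigned_ep_type == "CPUExecutionProvider";
}
```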
### Description
Adds the extra option `QDQKeepRemovableActivations` to optionally
prevent automatic removal of Clip/Relu ops in QDQ models. The current
default behavior, which is to remove Clip/Relu, remains the same if the
new option is not enabled.

### Motivation and Context
Explicitly representing these Relu/Clip operators in the QDQ model is
necessary if optimizations or EP transformations will later remove
QuantizeLinear/DequantizeLinear operators from the model.
…osoft#20650)

Do more in the Python helper script so the Bash code in the release definition can be simplified.
TODOs:
1. Handle H * params.kvNumHeads greater than work group size limit.
2. Support BNSH kv cache.
…microsoft#20652)

### Description
Additionally, set allowPackageConflicts = True.
`#allowPackageConflicts: false # boolean. Optional. Use when command =
push && nuGetFeedType = internal. Allow duplicates to be skipped.
Default: false.`

https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/reference/nuget-command-v2?view=azure-pipelines

If the publish step partially fails, we don't need to rerun the whole package
generation workflow.
### Description
Enable Qnn nuget nightly
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- Move iOS package build to separate job so it can run in parallel with Android AAR build and be decoupled from the test stage. The test stage fails sometimes (not infrequently) and may need to be re-run.
- Update stop iOS simulator step so it doesn't fail if the start step doesn't run.
### Description
<!-- Describe your changes. -->
- Fix `logSeverityLevel`
- Correct how RCTCxxBridge is obtained; the old method could get the wrong bridge in some cases


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Yulong Wang <[email protected]>
Update the instructions of how to get test models.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
WebNN spec has removed activation option for conv and
batchNormalization. We don't need additional activation fusion in WebNN
EP anymore.

[edit by fdwr] Note this is handled in the browser now, which knows more
about the backend platform version and can more safely make decisions
about which fusions are possible (e.g. for the DirectML backend, whether
softmax and gelu can fuse successfully with their base operator).
### Description
<!-- Describe your changes. -->
* Partially revert [previous
change](microsoft#19804), and
   * Redo concurrency_test_result parser outside of post.py
* Add support for syncing memtest results to the db


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To fix the error when CI runs on two model groups.
- When running on two model groups, the [previous
change](microsoft#19804) wrongly navigated two directory levels up after running
one model group, while only one level is needed. After that, the script couldn't
find the other model group.
- Running on one model group does not reproduce the issue.
### Description
Removes ref struct return usage on netstandard 2.0 builds.

### Motivation and Context
Unblocks .NET native compilation
- Add MatMulNBits Bias input
- Add graph transformer to fuse MatMulNBits + Add
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description

Currently, there is a single bool flag to indicate whether the kernel is loaded.
However, there are v1 and v2 kernels, so the flag allows only one
kernel version to be loaded. We use the v1 kernel for prompt and the v2 kernel for
token generation, so the flag causes an issue when we want both prompt
and token generation.
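
A hedged sketch of the fix (structure and names are illustrative, not the actual ORT code): track the load state per kernel version rather than with one shared flag.

```c++
struct SparseAttentionKernelState {
  bool v1_loaded = false;  // prompt (prefill) kernel
  bool v2_loaded = false;  // token-generation kernel
};
```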

This bug was found in an integration test. The unit tests only exercise one
kernel at a time, so the issue was not found earlier.

Another possible workaround, without this fix, is to set the environment
variable `ORT_DISABLE_SPARSE_ATTENTION_V1=1`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Make C API compliant with Doxygen expectations

### Motivation and Context
Doc workflow is failing.
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that
are running on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](microsoft#20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](microsoft#20598).
Hipify MatMulNBits to accommodate the needs of the Phi-3 ONNX release.
### Description
This PR adds fusions for [OpenAI's CLIP
model](https://huggingface.co/openai/clip-vit-large-patch14-336). Here
is an example of how to run the ORT transformer optimizer for the linked
CLIP model.

```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision
encoder.
@ankitm3k ankitm3k marked this pull request as ready for review May 22, 2024 06:27
@ankitm3k ankitm3k closed this May 22, 2024
@ankitm3k ankitm3k deleted the ankit/data_ops_changes branch May 22, 2024 06:29