forked from microsoft/onnxruntime
fix: updated data ops to support the complete graph on OVEP #374
Closed
**Current issue:** Once ORT gets the capability from the EP's GetCapability(), it creates a graph viewer based on that capability, as below: `viewers.push_back(std::make_unique<GraphViewer>(graph, *cur_capability.sub_graph));` (see the code [here](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/graph_partitioner.cc#L458)). At this point, the graph viewer can generate the wrong order of `nodes_in_topological_order_` when calling [Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107), so that during EP Compile(), the EP might create a model proto with the wrong node ordering from the graph viewer when calling [GraphViewerToProto()](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_proto_serializer.cc#L37), because of `nodes_in_topological_order_`. This is a problem for the TRT EP when refitting weights into the "weightless" engine: the engine is built from the model proto provided by the TRT EP while the weights are in the original ONNX model, and the two differ in node ordering, which makes TRT complain when refitting.

**The original model (subgraph of ResNet50):**

<img width="442" alt="image" src="https://github.com/microsoft/onnxruntime/assets/54722500/bb9a641d-f2f2-46c3-aebf-4084a08ff289">

**The serialized model proto generated by TRT EP:** (The highlighted part has the wrong node order compared to the original model.)

<img width="340" alt="image" src="https://github.com/microsoft/onnxruntime/assets/54722500/bbc6bf34-f960-4753-9474-a18ebc2dc48b">

**Solution 1:** Change the default comparator to `NodeCompare::operator() {return n1->Index() > n2->Index();}`. The root cause of the different node order between the original model and the EP-generated model is that the graph viewer [generates](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_viewer.cc#L107) a different `nodes_in_topological_order_`. Modifying `NodeCompare::operator()` for sorting can fix the problem. `NodeCompare::operator()` is used in [Graph::ReverseDFSFrom](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1760), where the input nodes of the current node are [sorted](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L1802) by node index. Because the sorted nodes are pushed onto a stack, which later determines the final topological node order in a "first in, last out" fashion, the node with the larger index should be pushed onto the stack first, so that the resulting topological order places nodes with smaller indices first.

**Solution 2 (this PR uses this solution):** Use a priority-based BFS for the topological sort in GraphViewerToProto().
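Below is a minimal, self-contained sketch of the priority-based (Kahn-style) topological sort idea behind solution 2. It is illustrative only, not the actual GraphViewerToProto() code, and the adjacency-list representation is an assumption.

```c++
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Priority-based topological sort: among all nodes whose inputs have been
// emitted, always emit the one with the smallest index first. This keeps the
// serialized node order aligned with the original model's node indices.
std::vector<size_t> PriorityTopoSort(size_t num_nodes,
                                     const std::vector<std::vector<size_t>>& out_edges) {
  std::vector<size_t> in_degree(num_nodes, 0);
  for (const auto& outs : out_edges)
    for (size_t dst : outs) ++in_degree[dst];

  // Min-heap on node index: lower index => emitted earlier.
  std::priority_queue<size_t, std::vector<size_t>, std::greater<size_t>> ready;
  for (size_t i = 0; i < num_nodes; ++i)
    if (in_degree[i] == 0) ready.push(i);

  std::vector<size_t> order;
  order.reserve(num_nodes);
  while (!ready.empty()) {
    const size_t n = ready.top();
    ready.pop();
    order.push_back(n);
    for (size_t dst : out_edges[n])
      if (--in_degree[dst] == 0) ready.push(dst);
  }
  return order;  // Shorter than num_nodes only if the graph has a cycle.
}
```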
### Description

Avx2, Int8:

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 90.96 | 25.15 | -72% | 7.65 | 11.71 | 53%
Blklen32 | 90.73 | 48.55 | -46% | 7.86 | 14.28 | 81%
Blklen64 | 89.49 | 68.84 | -23% | 8.30 | 15.78 | 90%
Blklen128 | 87.38 | 78.37 | -10% | 7.90 | 16.05 | 103%
Blklen256 | 89.45 | 82.36 | -7% | 8.30 | 16.56 | 99%

Avx2, Fp32:

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 91.36 | 105.18 | 15% | 7.57 | 9.52 | 25%
Blklen32 | 89.30 | 105.99 | 18% | 7.65 | 9.68 | 26%
Blklen64 | 89.53 | 101.41 | 13% | 7.97 | 9.84 | 23%
Blklen128 | 85.23 | 99.71 | 16% | 7.86 | 10.39 | 32%
Blklen256 | 88.46 | 97.94 | 10% | 8.32 | 10.23 | 22%

Avx512vnni, Int8:

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 132.18 | 21.56 | -83% | 10.34 | 11.48 | 11%
Blklen32 | 168.28 | 43.69 | -74% | 11.85 | 14.73 | 24%
Blklen64 | 201.81 | 60.29 | -70% | 12.36 | 15.47 | 25%
Blklen128 | 194.92 | 57.04 | -71% | 13.03 | 14.67 | 12%
Blklen256 | 218.76 | 70.20 | -68% | 13.33 | 16.31 | 22%

Avx512vnni, Fp32:

BlkLen | NS(Prompt) | MLAS(Prompt) | MLAS(Prompt) Gain/Loss | NS(TokenGen) | MLAS(TokenGen) | MLAS(TokenGen) Gain/Loss
-- | -- | -- | -- | -- | -- | --
Blklen16 | 102.81 | 92.74 | -9% | 8.41 | 9.18 | 9%
Blklen32 | 109.49 | 97.08 | -11% | 8.83 | 11.51 | 30%
Blklen64 | 104.13 | 101.57 | -2% | 9.32 | 12.00 | 28%
Blklen128 | 108.45 | 103.69 | -4% | 9.58 | 12.45 | 29%
Blklen256 | 109.43 | 106.43 | -2% | 9.19 | 12.2 | 32%

---------

Signed-off-by: Liqun Fu <[email protected]>
Signed-off-by: liqunfu <[email protected]>
Co-authored-by: edgchen1 <[email protected]>
### Description
Fix the build error for the Win ARM64 Release build:

```
graph_transform_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [D:\build\Windows\Release\onnxruntime_test_all.vcxproj]
```

### Motivation and Context
Fix issue: microsoft#20406
### Description
Update to a more generic URL.
### Description
Fix some misc build warnings from the x86 Windows build.
…ml (microsoft#20472)
### Description
As title.
- Create a common util to get the supported activation set
- Fuse activation into BatchNormalization if possible
### Description
Following issue microsoft#19223, introduce a `per_channel` attribute in `MinMaxCalibrater` to enable per-channel calibration. If required, this new functionality should be implemented in the other _Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...).
### Motivation and Context
- This is the first part of solving microsoft#19223's proposal.
- If per-channel calibration were supported, the quantization algorithm could be updated to improve quantization quality, i.e. quantizing weights per channel rather than per tensor. That is why it would be useful to have a `per_channel` option in any `Calibrater` class, producing a set of calibration vectors instead of a single scalar.
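For illustration, here is a minimal sketch of the difference between per-tensor and per-channel min/max collection. It is not the `MinMaxCalibrater` implementation; the `[channels, elements_per_channel]` layout and non-empty inputs are assumptions.

```c++
#include <algorithm>
#include <utility>
#include <vector>

// Per-tensor calibration: one (min, max) pair for the whole tensor.
std::pair<float, float> MinMaxPerTensor(const std::vector<float>& data) {
  auto [mn, mx] = std::minmax_element(data.begin(), data.end());
  return {*mn, *mx};
}

// Per-channel calibration: one (min, max) pair per channel, assuming the
// tensor is laid out as [channels, elements_per_channel].
std::vector<std::pair<float, float>> MinMaxPerChannel(
    const std::vector<float>& data, size_t channels) {
  const size_t per_channel = data.size() / channels;
  std::vector<std::pair<float, float>> ranges;
  ranges.reserve(channels);
  for (size_t c = 0; c < channels; ++c) {
    auto begin = data.begin() + c * per_channel;
    auto [mn, mx] = std::minmax_element(begin, begin + per_channel);
    ranges.emplace_back(*mn, *mx);
  }
  return ranges;  // One calibration range per channel instead of one scalar pair.
}
```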
### Description
The MLAS MatMul NBits implementation requires a packed B; add a condition for this. This logic needs to be updated if that requirement changes.

---------

Signed-off-by: Liqun Fu <[email protected]>
…0453)
The defines for these tests have to be checked in a consistent order. If we check for TRT -> CUDA -> DML, we cannot reverse that order in later defines, as we might want to build for multiple EPs.

+@PatriceVignola
### Description
`flatbuffers::String::c_str` returns a pointer that may not be null terminated. This causes a warning when building on an A100 with gcc 11. It's not clear why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate the warning. Either way, it's safer to use `str()`, as that constructs a std::string with data() and size().

It's unclear whether this is an issue in reality, as it's reading from the flatbuffer, which most likely didn't write out an empty string in order to save space. There's no perf need to use `c_str` instead of `str`, and in LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a std::string anyway.

```c++
struct String : public Vector<char> {
  const char *c_str() const { return reinterpret_cast<const char *>(Data()); }
  std::string str() const { return std::string(c_str(), size()); }
  // ...
};
```

```
inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3:
/usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
```

### Motivation and Context
Fix build error on A100.
### Description
Error:

```
Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss)
##[error]Artifact name is not valid: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', '/', '"', ':', '<', '>', '|', '*', and '?'
```

The date is not correctly expanded in the artifact name. Use the predefined pipeline variable BuildNumber instead, which similarly serves as a timestamp.

### Motivation and Context
RN CI failure.

---------

Co-authored-by: rachguo <[email protected]>
Co-authored-by: rachguo <[email protected]>
… fp32. (microsoft#20486)
### Description
Perform the computation in fp32 and convert to fp16 at the end.
### Description
```
tvm_execution_provider.cc
denormal.cc
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6): see declaration of 'onnxruntime::GraphViewerToProto'
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)'
cpuid_uarch.cc
get_execution_providers.cc
abi_session_options.cc
bias_dropout_fusion.cc
if.cc
```
…nder linux (microsoft#20466)
### Description
[VitisAI] Fix the problem that GSL cannot be found when compiling under Linux.

Co-authored-by: Zhenze Wang <[email protected]>
…crosoft#20500)
### Description
Update order of steps.
### Motivation and Context
Fix CI.
### Description
Fuse Cast + SoftmaxCrossEntropyLossInternal into SoftmaxCrossEntropyLossInternal.
Script changes to include the QNN SDK libs with the onnxruntime-qnn Python package.
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
### Description
Originally, Prelu in QNN fails when the input is fp16 and alpha is fp32. QNN requires alpha to be fp16 when the input is fp16. This is resolved by casting alpha to fp16 and passing it to QNN.
### Motivation and Context
Makes QNN Prelu support the fp16 case.

---------

Co-authored-by: Hector Li <[email protected]>
Distribute writing-to-output work over all threads in MatMulNBits.
### Description
microsoft#20418

Add back Catalyst changes only for now.

Co-authored-by: rachguo <[email protected]>
This PR is needed for microsoft#20411 to make sure TRT EP uses the priority-based topological sort for consistency across TRT EP.
### Description
Update README.md in /js/web/:
- update compatibility table
- update links to onnxruntime.ai
Bump up version in main from 1.18.0 to 1.19.0 since the release branch has been cut. --------- Co-authored-by: Edward Chen <[email protected]>
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.

Block sparse attention was proposed in [Sparse Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and was also adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with a different sparse layout. In Phi-3-small, the sparse layout is static and works with unidirectional (causal) attention. Compared to dense attention, the benefit of block sparse attention is to speed up both training and inference, and it can save memory and thus support longer context lengths.

- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to cubin and dispatcher
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing
- [x] Test relevance
- [x] Test performance

### Performance
Tested on A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8`. We compare sparse attention to the corresponding GQA with a local attention window size of 1024, or GQA with dense causal attention.

Average latency in milliseconds (for the fused attention kernel used in prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for the fused attention kernel used in token generation):

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still has room to improve.

#### Limitations
Only right-side padding and unidirectional attention are supported. The following are not supported in the first version: (1) packed mode like PackedMultiHeadAttention, where padding has been removed from the input; (2) paged attention; (3) bidirectional attention; (4) GPU compute capability other than 8.0, 8.6 and 8.9; (5) left-side padding. Some of these limitations will be removed in the future (possibly in a new operator).
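For illustration, a minimal host-side sketch of the block-mask-to-CSR conversion idea (not the actual CUDA kernel; the dense row-major mask layout, one row per query block and one column per key block, is an assumption):

```c++
#include <vector>

// Convert a dense block mask (num_rows x num_cols, nonzero = attended block)
// into CSR form: row_offsets has num_rows + 1 entries, and col_indices lists
// the key-block columns that each query-block row attends to.
void BlockMaskToCsr(const std::vector<int>& block_mask,
                    int num_rows, int num_cols,
                    std::vector<int>& row_offsets,
                    std::vector<int>& col_indices) {
  row_offsets.assign(num_rows + 1, 0);
  col_indices.clear();
  for (int r = 0; r < num_rows; ++r) {
    for (int c = 0; c < num_cols; ++c) {
      if (block_mask[r * num_cols + c] != 0) {
        col_indices.push_back(c);
      }
    }
    row_offsets[r + 1] = static_cast<int>(col_indices.size());
  }
}
```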
Fix:

onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h:4637: error: argument 'session' of command @param is not found in the argument list of

```
OrtApi::AddExternalInitializersFromFilesInMemory(
    OrtSessionOptions *options,
    const char *const *external_initializer_file_names,
    char *const *external_initializer_file_buffer_array,
    const size_t *external_initializer_file_lengths,
    size_t num_external_initializer_files)
```
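One plausible shape of the fix, assuming the stale `@param session` tag simply needs to refer to the actual `options` parameter; the descriptive text below is an assumption, not the real header comment:

```c++
/**
 * \param[in] options  The session options to update
 *                     (previously documented as `session`, which does not match
 *                     any parameter name, so Doxygen reports an error).
 * \param[in] external_initializer_file_names        Names of the external initializer files.
 * \param[in] external_initializer_file_buffer_array In-memory buffer for each file.
 * \param[in] external_initializer_file_lengths      Length of each buffer.
 * \param[in] num_external_initializer_files         Number of external initializer files.
 */
```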
…_SUPPORTED is defined. (microsoft#20509)
Only define the CPUIDInfo::pytorch_cpuinfo_init_ data member when CPUINFO_SUPPORTED is defined. Otherwise it can cause unused variable warnings in some compilations.
### Description
Remove an excess trailing semicolon from a specific macro.
### Motivation and Context
I am preparing automatic generation of onnxruntime bindings for Perl, and the parser (ucpp) breaks due to the "double semicolon" error on the lines where the macro is applied.
### Motivation and Context
The Intel NPU does not support 16-bit int quantized operators. Consequently, the execution provider removes the QuantizeLinear/DeQuantizeLinear (Q/DQ) operators from node units and executes the operation as FP16 in the backend. However, if a Clip operator was fused into a Q operator in the node unit, the removal of Q/DQ operators results in inaccuracies because the effect of the original Clip operator is lost.

Consider the following example:
- FP32 model: -> Op_FP32 -> Clip ->
- QDQ model: -> (DQ -> Op_FP32 -> Q) -> (DQ' -> Clip -> Q') ->
- After ClipQuantFusion: -> (DQ -> Op_FP32 -> Q) -> (DQ' -> Q') ->
- Intel execution provider strips Q/DQ: -> Op_FP16 ->

To solve this issue, we have enabled ClipQuantFusion exclusively on the CPU execution provider.
### Description
Adds the extra option `QDQKeepRemovableActivations` to optionally prevent automatic removal of Clip/Relu ops in QDQ models. The current default behavior, which is to remove Clip/Relu, remains the same if the new option is not enabled.
### Motivation and Context
Explicitly representing these Relu/Clip operators in the QDQ model is necessary if optimizations or EP transformations will later remove QuantizeLinear/DequantizeLinear operators from the model.
…osoft#20650) Do more in the Python helper script so the Bash code in the release definition can be simplified.
TODOs:
1. Handle H * params.kvNumHeads greater than the work group size limit.
2. Support BNSH kv cache.
…microsoft#20652)
### Description
Also set allowPackageConflicts = True.

`#allowPackageConflicts: false # boolean. Optional. Use when command = push && nuGetFeedType = internal. Allow duplicates to be skipped. Default: false.`

https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/reference/nuget-command-v2?view=azure-pipelines

Once the publish partially fails, we don't need to rerun the whole package generation workflow.
### Description
Enable the QNN NuGet nightly.
- Move the iOS package build to a separate job so it can run in parallel with the Android AAR build and be decoupled from the test stage. The test stage fails sometimes (not infrequently) and may need to be re-run.
- Update the stop iOS simulator step so it doesn't fail if the start step doesn't run.
### Description
- Fix `logSeverityLevel`
- Correct getting the RCTCxxBridge; the old method would get the wrong bridge in some cases

---------

Co-authored-by: Yulong Wang <[email protected]>
Update the instructions on how to get test models.
The WebNN spec has removed the activation option for conv and batchNormalization, so we no longer need the additional activation fusion in the WebNN EP.

[edit by fdwr] Note this is handled in the browser now, which knows more about the backend platform version and can more safely make decisions about which fusions are possible (e.g. for the DirectML backend, whether softmax and gelu can fuse successfully with their base operator).
### Description
* Partially revert the [previous change](microsoft#19804)
* Redo the concurrency_test_result parser outside of post.py
* Add support for syncing the memtest result to the db

### Motivation and Context
To fix the error when CI is running on two model groups.
- When running on two model groups, the [previous change](microsoft#19804) wrongly navigates two levels up in the directory after running one model group, while only one level is needed. After that, the script can't find the other model group.
- Running on one model group can't repro the issue.
### Description
Removes ref struct return usage on netstandard 2.0 builds.
### Motivation and Context
Unblocks .NET native compilation.
- Add MatMulNBits Bias input
- Add a graph transformer to fuse MatMulNBits + Add
### Description
Currently, there is one bool flag to indicate whether the kernel is loaded. However, there are v1 and v2 kernels, so the flag allows only one version of the kernel to be loaded. We use the v1 kernel for prompt and the v2 kernel for token generation, and the single flag causes an issue when we want both prompt and token generation.

This bug was found in an integration test. The unit tests only exercise one kernel at a time, so the issue was not found before. Another possible workaround without this fix is to set the environment variable `ORT_DISABLE_SPARSE_ATTENTION_V1=1`.
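A minimal sketch of the flag-per-kernel-version idea (illustrative only; the class and member names are hypothetical, not the actual ORT code):

```c++
#include <mutex>

// Track load state per kernel version instead of sharing one flag, so the
// v1 (prompt) and v2 (token generation) kernels can both be loaded.
class SparseAttentionKernelLoader {
 public:
  void EnsureV1Loaded() {
    std::call_once(v1_once_, [] { /* load/compile the v1 kernel here */ });
  }
  void EnsureV2Loaded() {
    std::call_once(v2_once_, [] { /* load/compile the v2 kernel here */ });
  }

 private:
  std::once_flag v1_once_;  // a single shared flag previously meant only one
  std::once_flag v2_once_;  // kernel version could ever be marked as loaded
};
```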
### Description
Make C API compliant with Doxygen expectations.
### Motivation and Context
Doc workflow is failing.
### Description
This PR adds support for adding GroupQueryAttention (GQA) in models that run on CPU.
### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA for CUDA only. With the recently added support for [GQA on CPU](microsoft#20299), models where `num_attention_heads != num_key_value_heads` can now use the GQA op and [run much faster on CPU](microsoft#20598).
Hipify MatMulNBits to accommodate the needs of the Phi-3 ONNX release.
### Description
This PR adds fusions for [OpenAI's CLIP model](https://huggingface.co/openai/clip-vit-large-patch14-336).

Here is an example of how to run the ORT transformer optimizer for the linked CLIP model.

```
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers
$ python3 optimizer.py --input /path/to/model.onnx --output /path/to/model_opt.onnx --model_type clip --num_heads 16 --hidden_size 1024 --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps optimize multi-modal models that use CLIP for the vision encoder.
Description
The ONNX model provided by the issue author was not fully supported by OVEP and was failing inference with the ort_perf_test app. This PR enables the GRU and LogSoftmax ops, which helps enable the whole model graph on OVEP during execution. The unit test for the GRU op is disabled.
We also investigated the inference output over multiple iterations for a single common input: the model gave consistent and correct output across all inference iterations during testing, which rules out any post-first-inference output regression for the given model architecture.
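For illustration, a minimal sketch of what enabling additional ops in an EP's supported-op list can look like; the container and function names are hypothetical, not the actual OVEP data_ops code.

```c++
#include <set>
#include <string>

// Hypothetical supported-op registry for an execution provider. Enabling an
// op means adding it here so GetCapability() claims the corresponding nodes
// and the whole graph can run on the EP instead of falling back to CPU.
static const std::set<std::string> kSupportedOps = {
    "Add", "Conv", "MatMul", "Relu", "Softmax",
    "GRU",         // newly enabled
    "LogSoftmax",  // newly enabled
};

bool IsOpSupported(const std::string& op_type) {
  return kSupportedOps.count(op_type) > 0;
}
```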
This PR fixes microsoft#19975.