Issue when converting Whisper using --collect_cross_qk on CPU #18216

Closed
axelman03 opened this issue Nov 1, 2023 · 5 comments
Labels
core runtime (issues related to core runtime), platform:windows (issues related to the Windows platform)

Comments

@axelman03

Describe the issue

I am currently using the nightly build of ONNX Runtime to convert Whisper to ONNX. I am specifically interested in obtaining the cross QK outputs of the model, to be used eventually for timestamps. I am converting the model to run on CPU, and when I do, a runtime exception occurs:

An error occurred while trying to verify parity between PyTorch and ONNX Runtime: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_zcode' Status Message: C:\a\_work\1\s\onnxruntime\contrib_ops/cpu/transformers/beam_search_impl_whisper.h:300 onnxruntime::contrib::transformers::BeamSearchWhisper<float>::Execute decoder_subgraph_.has_decoder_masked_attention_ was false. decoder subgraph: output_cross_qk could only work with has_decoder_masked_attention

Traceback (most recent call last):
  File "C:\git\onnx-rt\1.17.0-pre\onnxruntime\onnxruntime\python\tools\transformers\models\whisper\convert_to_onnx.py", line 481, in main
    max_diff = WhisperHelper.verify_onnx(args.model_name_or_path, ort_session, device)
  File "C:\git\onnx-rt\1.17.0-pre\onnxruntime\onnxruntime\python\tools\transformers\models\whisper\whisper_helper.py", line 338, in verify_onnx
    ort_outputs = ort_session.run(None, inputs)[0][0]
  File "C:\Users\alexander.bolejack\Anaconda3\envs\S2T\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_zcode' Status Message: 
C:\a\_work\1\s\onnxruntime\contrib_ops/cpu/transformers/beam_search_impl_whisper.h:300 onnxruntime::contrib::transformers::BeamSearchWhisper<float>::Execute decoder_subgraph_.has_decoder_masked_attention_ was false. decoder subgraph: output_cross_qk could only work with has_decoder_masked_attention

Looking into it, it seems that DecoderMaskedMultiHeadAttention is only used when the --use_gpu flag is enabled, and the same appears to be true for cross QK.

Is there any way I can build and run this model on the CPU?

To reproduce

Here is the exact command I ran, using the latest version of ONNX Runtime:

python -m models.whisper.convert_to_onnx -m openai/whisper-base --output whisperbase-timestamps --use_external_data_format --precision int8 --quantize_embedding_layer --extra_decoding_ids --output_sequence_scores --overwrite --output_no_speech_probs --output_cross_qk --collect_cross_qk --use_whisper_beamsearch --provider cpu

Urgency

No response

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

#18206

ONNX Runtime API

Other / Unknown

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

github-actions bot added the platform:windows label Nov 1, 2023
@yuslepukhin added the core runtime label Nov 1, 2023

github-actions bot commented Dec 2, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label Dec 2, 2023
@thiagocrepaldi (Contributor)

@shubhambhokare1 @kunal-vaishnavi maybe this is something you are familiar with

github-actions bot removed the stale label Dec 5, 2023
@axelman03 (Author)

I tried to look a little further into it. It seems that Cross QK support requires the model to be compiled with DecoderMaskedMultiHeadAttention, which is implemented only for CUDA.
I am not familiar enough with it to say whether DecoderMaskedMultiHeadAttention is actually required for cross QK, or whether it could be brought to the CPU or other execution providers, though.
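
As a rough sanity check (assuming a checkout of the onnxruntime source tree; the exact registration files may differ), one can grep the contrib ops directory for where the op is registered:

grep -rn "DecoderMaskedMultiHeadAttention" onnxruntime/contrib_ops/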

@kunal-vaishnavi (Contributor)

> It seems that Cross QK support requires the model to be compiled with DecoderMaskedMultiHeadAttention, which is implemented only for CUDA.

Cross QK support was added in this PR as part of the WhisperBeamSearch and DecoderMaskedMultiHeadAttention ops. The feature is currently supported on CUDA and will be supported on CPU in the future.

> I am not familiar enough with it to say whether DecoderMaskedMultiHeadAttention is actually required for cross QK, or whether it could be brought to the CPU or other execution providers, though.

DecoderMaskedMultiHeadAttention is used specifically for CUDA. Cross QK support can be added to other attention ops that run on CPU.
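
In the meantime, a possible workaround (an untested sketch, assuming a CUDA-capable machine and a CUDA build of ONNX Runtime; flag combinations may need adjusting, e.g. fp16 is more typical than int8 for CUDA) is to export for the CUDA execution provider instead:

python -m models.whisper.convert_to_onnx -m openai/whisper-base --output whisperbase-timestamps --use_external_data_format --precision fp16 --extra_decoding_ids --output_sequence_scores --overwrite --output_no_speech_probs --output_cross_qk --collect_cross_qk --use_whisper_beamsearch --use_gpu --provider cuda

Alternatively, dropping --output_cross_qk and --collect_cross_qk should allow the CPU export to succeed, at the cost of losing the timestamp-related outputs.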


github-actions bot commented Jan 5, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label Jan 5, 2024
@natke removed the stale label Jan 10, 2024
@axelman03 closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 17, 2024