Flan-T5 small converted model produces wrong results with batch size > 1 and long sentences #21053
Labels
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
stale (issues that have not been addressed in a while; categorized by a bot)
Describe the issue
A Flan-T5 small model, fine-tuned for spellchecking, was converted with
https://github.com/microsoft/onnxruntime/blob/8f0e896c95eedee624b6a3c375e86e1e4263a980/onnxruntime/python/tools/transformers/convert_generation.py
Using:
python convert_generation.py -m t5-small --model_type t5 --output ./models/t5/onnx_models/t5_small_beam_search.onnx --use_gpu --past_present_share_buffer --use_decoder_masked_attention
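Before running inference it can help to confirm the input and output names the exported beam-search graph actually exposes, since the sketch further below assumes the usual names produced by convert_generation.py. This is a minimal, hypothetical check; the model path is the one from the command above.

import onnxruntime as ort

# Inspect the exported graph's I/O names (assumed path from the conversion command above).
sess = ort.InferenceSession(
    "./models/t5/onnx_models/t5_small_beam_search.onnx",
    providers=["CUDAExecutionProvider"],
)
print([i.name for i in sess.get_inputs()])   # e.g. input_ids, max_length, min_length, ...
print([o.name for o in sess.get_outputs()])  # e.g. sequences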
It works perfectly with batch size 1; however, when the batch size is increased, the output for some samples is incorrect.
Why does the length of the other samples in a batch (or the batch size itself) affect the output?
To reproduce
Infer using onnxruntime/onnxruntime/python/tools/transformers/convert_generation.py (line 648 at commit 8f0e896).
The output for the 2nd sample is incorrect when the input is
text = ['Obierika, Okonkwo best friend even stated.', 'loving it.']
but correct when the same sentence is passed on its own as
text = ['loving it.']
A sketch of this comparison follows.
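A minimal reproduction sketch, under these assumptions: the exported beam-search graph exposes the standard inputs created by convert_generation.py (input_ids, max_length, min_length, num_beams, num_return_sequences, length_penalty, repetition_penalty) and a single "sequences" output; the model path and the t5-small tokenizer are placeholders standing in for the fine-tuned checkpoint. The generation parameter values are illustrative, not the ones used when the bug was observed.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_PATH = "./models/t5/onnx_models/t5_small_beam_search.onnx"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("t5-small")             # placeholder tokenizer

session = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

def generate(texts):
    # Pad the batch to a common length; shorter sentences get pad tokens appended.
    enc = tokenizer(texts, return_tensors="np", padding=True)
    inputs = {
        "input_ids": enc["input_ids"].astype(np.int32),
        "max_length": np.array([64], dtype=np.int32),
        "min_length": np.array([1], dtype=np.int32),
        "num_beams": np.array([4], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
    }
    # sequences has shape [batch, num_return_sequences, max_length].
    sequences = session.run(["sequences"], inputs)[0]
    return [tokenizer.decode(seq[0], skip_special_tokens=True) for seq in sequences]

# Batch of two sentences with very different lengths: the 2nd output comes back wrong.
print(generate(["Obierika, Okonkwo best friend even stated.", "loving it."]))

# The same short sentence alone (batch size 1): the output is correct.
print(generate(["loving it."]))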
Urgency
No response
Platform
Linux
OS Version
Debian 11 5.10.0-30-cloud-amd64
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.18.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.2