Flan-T5 small converted model produces wrong result with batch size > 1 and long sentences #21053

Open
PKaralupov opened this issue Jun 15, 2024 · 1 comment
Labels
model:transformer: issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.
stale: issues that have not been addressed in a while; categorized by a bot

PKaralupov commented Jun 15, 2024

Describe the issue

A Flan-T5 small model fine-tuned for spell checking was converted with convert_generation.py:
https://github.com/microsoft/onnxruntime/blob/8f0e896c95eedee624b6a3c375e86e1e4263a980/onnxruntime/python/tools/transformers/convert_generation.py

Command used:
python convert_generation.py -m t5-small --model_type t5 --output ./models/t5/onnx_models/t5_small_beam_search.onnx --use_gpu --past_present_share_buffer --use_decoder_masked_attention

The converted model works perfectly with batch size 1. With a larger batch size, however, the output for some samples is incorrect.

Why do the lengths of the other samples in the batch, or the batch size itself, affect the output?
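One thing that differs between batch size 1 and a mixed-length batch is padding: with `padding="longest"`, the short sample is right-padded to the length of the longest one. The sketch below only illustrates that effect. The token ids and sample lengths are made up for illustration; only the pad id 0 and EOS id 1 match the T5 tokenizer.

```python
# Minimal sketch: how padding to the longest sample changes what the model sees.
# Token ids and lengths are hypothetical; PAD_ID = 0 and EOS_ID = 1 match T5's tokenizer.
PAD_ID = 0
EOS_ID = 1


def pad_to_longest(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


def attention_mask(padded, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[int(tok != pad_id) for tok in seq] for seq in padded]


long_sample = list(range(100, 140)) + [EOS_ID]  # 40 hypothetical tokens + EOS
short_sample = [200, 201, 202, EOS_ID]          # 3 hypothetical tokens + EOS

padded = pad_to_longest([long_sample, short_sample])
mask = attention_mask(padded)
pad_count = padded[1].count(PAD_ID)

print(f"short sample: {sum(mask[1])} real tokens, {pad_count} pad tokens")
```

If the padded positions of the short sample were not masked somewhere in the exported graph, its output could depend on how long the other samples are. Whether that is what happens here is not established by this report.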

To reproduce

Run inference using:

from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_available_providers


def create_ort_session(model_path: str, use_gpu: bool, use_sln_strict_mode: bool) -> InferenceSession:
    # Graph optimizations are disabled for this reproduction
    sess_options = SessionOptions()
    sess_options.graph_optimization_level = GraphOptimizationLevel.ORT_DISABLE_ALL
    execution_providers = ["CUDAExecutionProvider", "CPUExecutionProvider"] if use_gpu else ["CPUExecutionProvider"]
    if use_gpu:
        if "CUDAExecutionProvider" not in get_available_providers():
            raise RuntimeError("CUDAExecutionProvider is not available for --use_gpu!")

        if use_sln_strict_mode:
            # Attach CUDA-specific options to the matching provider entry
            cuda_provider_options = {"enable_skip_layer_norm_strict_mode": True}
            provider_options = {"CUDAExecutionProvider": cuda_provider_options}
            execution_providers = [
                (name, provider_options[name]) if name in provider_options else name for name in execution_providers
            ]

    return InferenceSession(model_path, sess_options, providers=execution_providers)


# Settings used in this report: GPU enabled, strict mode off
ort_session = create_ort_session("model.onnx", use_gpu=True, use_sln_strict_mode=False)



import numpy as np
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-small", legacy=False)

text = ['Obierika, Okonkwo best friend even stated " He has put a knife on the things that held us together, and we have fallen apart ".', 'loving it.']

input_ids = tokenizer(text, padding="longest", return_tensors="pt").input_ids

inputs = {
        "input_ids": input_ids.cpu().numpy().astype(np.int32),
        "max_length": np.array([512], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([5], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([5.0], dtype=np.float32),
    }

result = ort_session.run(None, inputs)
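To narrow down which samples are affected, each sample can be run on its own and compared with its row in the batched result. The following is a minimal sketch of that comparison. `find_batch_mismatches`, `strip_pad` and `fake_run` are hypothetical names introduced here, and `fake_run` is a stand-in that imitates the reported symptom so the sketch runs without a model.

```python
from typing import Callable, List, Sequence


def strip_pad(ids: Sequence[int], pad_id: int = 0) -> List[int]:
    """Drop trailing pad tokens so rows with different padded lengths compare equal."""
    ids = list(ids)
    while ids and ids[-1] == pad_id:
        ids.pop()
    return ids


def find_batch_mismatches(
    run_fn: Callable[[List[List[int]]], List[List[int]]],
    samples: List[List[int]],
    pad_id: int = 0,
) -> List[int]:
    """Return indices whose batched output differs from their single-sample output.

    run_fn takes a list of samples and returns one token-id sequence per sample.
    """
    batched = run_fn(list(samples))
    mismatches = []
    for i, sample in enumerate(samples):
        single = run_fn([sample])[0]
        if strip_pad(batched[i], pad_id) != strip_pad(single, pad_id):
            mismatches.append(i)
    return mismatches


def fake_run(batch: List[List[int]]) -> List[List[int]]:
    # Stand-in for the real model: echoes each sample, but returns junk for
    # any sample that is shorter than the longest one in the batch.
    longest = max(len(s) for s in batch)
    return [list(s) if len(s) == longest else [9, 9, 9] for s in batch]


mismatched = find_batch_mismatches(fake_run, [[5, 6, 7, 8, 1], [5, 1]])
print(mismatched)  # [1]
```

With the real model, `run_fn` would tokenize the texts, build the `inputs` dict as above, call `ort_session.run(None, inputs)`, and return `result[0][i][0]` for each sample `i`.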

Incorrect output for 2nd sample:

# Output: 'Obierika, Okonkwo\'s best friend even stated "He has put a knife on the things that hold us together, and we have fallen apart".'
tokenizer.decode(result[0][0][0], skip_special_tokens=True)

#Output: a.k.a. is the most popular of its kind in the U.K.A. and B.C., as well as many other companies that have been involved with it. (i.e. they are not available to be used for this purpose): you can find out more about them at [www.b.c.no.ph.com/reporting-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-in](https://www.b.c.no.ph.com/reporting-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-insertment-in)'
tokenizer.decode(result[0][1][0], skip_special_tokens=True)

With a shorter first sample as input:
text = ['Obierika, Okonkwo best friend even stated.', 'loving it.']

Output: 'Obierika, Okonkwo best friend ever stated.'
tokenizer.decode(result[0][0][0], skip_special_tokens=True)

Output: 'I love it.'
tokenizer.decode(result[0][1][0], skip_special_tokens=True)

With the second sample alone as input (batch size 1):
text = ['loving it.']

Output: 'love it.'
tokenizer.decode(result[0][0][0], skip_special_tokens=True)

Urgency

No response

Platform

Linux

OS Version

Debian 11 5.10.0-30-cloud-amd64

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.18.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.2
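The version details above can also be collected programmatically when reproducing on another machine. This is a minimal sketch using only the standard library. `collect_env_info` is a hypothetical helper, and the package list is an assumption based on the imports used in this report.

```python
import platform
import sys
from importlib import metadata


def collect_env_info(packages=("onnxruntime-gpu", "onnxruntime", "transformers", "numpy")):
    """Return the Python version, platform string and installed versions of the given packages."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment
            info[name] = None
    return info


for key, value in collect_env_info().items():
    print(f"{key}: {value}")
```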

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. labels Jun 15, 2024
@PKaralupov PKaralupov changed the title from "Flan-T5 converted model produces wrong result with batch size > 1 and long sentences" to "Flan-T5 small converted model produces wrong result with batch size > 1 and long sentences" Jun 15, 2024
@sophies927 sophies927 removed the ep:CUDA issues related to the CUDA execution provider label Jun 20, 2024
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Jul 21, 2024