
OVModelForSeq2SeqLM with Helsinki-NLP/opus-mt-es-en has slow inference times when exported to OpenVino #339

Open
tsmith023 opened this issue Jun 8, 2023 · 2 comments
tsmith023 commented Jun 8, 2023

I'm having trouble exporting the Helsinki-NLP/opus-mt-es-en model for language translation into the optimised OpenVINO IR format. Reading through the other issues in this repository led me to #188, which seems to describe similar symptoms.

In that case, the problem appeared to be the BigBird architecture and its lack of support in Hugging Face Optimum. However, the Helsinki-NLP/opus-mt-es-en model belongs to the MarianMT class, which is documented as supported.

Am I missing something fundamental here? Is the conversion of MarianMT models into OpenVINO IR format currently unsupported by this library, similar to the BigBird models in the issue above? Or am I not specifying some aspect of the conversion correctly, so that the export is sub-optimal? Given the documentation, it seems this should be possible.

In case it helps, I see the following in the build logs: Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for 'decoder_input_ids'.

An MRE looks like:

from optimum.intel.openvino import OVModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

# Export the PyTorch model to OpenVINO IR at load time
ov_model = OVModelForSeq2SeqLM.from_pretrained(
    "Helsinki-NLP/opus-mt-es-en",
    export=True,
    use_cache=True,
)

def run(text: str):
    pipe = pipeline("translation_es_to_en", model=ov_model, tokenizer=tokenizer)
    return pipe(text)

def export_to_ov(save_dir: str):
    # Save the exported IR (XML/BIN) files to save_dir
    ov_model.save_pretrained(save_dir)

if __name__ == "__main__":
    export_to_ov("./")

Calling run("Hola, como estas?") yields an inference time of roughly 0.63 s, while using the exported OpenVINO IR binaries in an OVMS model pipeline yields an inference time of roughly 45 s.
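For what it's worth, here is a minimal sanity check (not part of the original report, and assuming the "./" save directory from the MRE above): reload the saved IR in Python and time a full generation, which helps separate the quality of the export itself from any serving overhead.

import time
from optimum.intel.openvino import OVModelForSeq2SeqLM
from transformers import AutoTokenizer

save_dir = "./"  # assumed to match export_to_ov("./") in the MRE
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
# No export=True here: load the already-saved IR files from save_dir
reloaded_model = OVModelForSeq2SeqLM.from_pretrained(save_dir)

inputs = tokenizer("Hola, como estas?", return_tensors="pt")
start = time.time()
outputs = reloaded_model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"generation time: {time.time() - start:.3f}s")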

Any help on this one would be greatly appreciated, cheers!

P.S. I can post the config.json file being passed to the OVMS instance, but it's very long so I'll leave it until it's required!

@tsmith023 tsmith023 changed the title OVModelForSeq2SeqLM for Helsinki-NLP/opus-mt-es-en slow inference times when exported to OpenVino OVModelForSeq2SeqLM with Helsinki-NLP/opus-mt-es-en has slow inference times when exported to OpenVino Jun 8, 2023
@echarlaix echarlaix self-assigned this Jun 8, 2023
echarlaix (Collaborator) commented:

Hi @tsmith023,

Apologies for the late reply; yes, MarianMT models are supported. Concerning the slow inference you're reporting, are you comparing the resulting OpenVINO model with the original PyTorch model and finding that the OpenVINO model's latency is higher?

I'm not able to reproduce this. Could you confirm that you're still observing it with:

import time
import torch
from optimum.intel import OVModelForSeq2SeqLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Helsinki-NLP/opus-mt-es-en"
ov_model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True, use_cache=True)
torch_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokens = tokenizer("This is a sample input", return_tensors="pt")
# Single decoder step: start from the decoder start token
decoder_inputs = {"decoder_input_ids": torch.ones((1, 1), dtype=torch.long) * torch_model.config.decoder_start_token_id}

def elapsed_time(model, nb_pass=20):
    # Average latency of one forward pass over nb_pass runs
    start = time.time()
    for _ in range(nb_pass):
        model(**tokens, **decoder_inputs)
    end = time.time()
    return (end - start) / nb_pass

# warmup
elapsed_time(ov_model, nb_pass=5)

time_ov = elapsed_time(ov_model)
time_torch = elapsed_time(torch_model)
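
A small addition (not in the original snippet) that prints the two averages makes the comparison explicit:

print(f"OpenVINO: {time_ov:.4f}s per pass, PyTorch: {time_torch:.4f}s per pass")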

tsmith023 (Author) commented:

Hi @echarlaix, the problem didn't surface when executing within the Python runtime, but rather when running the exported OpenVINO IR binaries within OpenVINO itself, which is a C++ runtime. I was comparing the performance of the exported model within the Python runtime to its performance within the C++ runtime.

Do you feel this issue is better suited to the OpenVINO repository? I raised it here originally since I judged it to be a problem with the model-export logic. Let me know whether I should relocate it there, or whether you feel there is an implementation issue here 😁
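
One way to narrow this down (a suggestion, not something from the thread) is to load one of the exported IR files directly with the OpenVINO Python runtime and time a single inference, bypassing both optimum and OVMS; the file name below is an assumption based on the default naming optimum-intel uses when saving a seq2seq export.

import time
import numpy as np
from openvino.runtime import Core

core = Core()
# "openvino_encoder_model.xml" is assumed to be the saved encoder IR file name
model = core.read_model("./openvino_encoder_model.xml")
compiled = core.compile_model(model, "CPU")
request = compiled.create_infer_request()

# Dummy inputs matching the exported encoder signature (input_ids, attention_mask)
input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

start = time.time()
request.infer({"input_ids": input_ids, "attention_mask": attention_mask})
print(f"encoder inference time: {time.time() - start:.4f}s")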

