Add exclude_input_in_output option to vllm backend #35
Conversation
Makes sense. Thank you so much @oandreeva-nv! Looking forward to the release.
@mkhludnev This will most likely be part of the 24.03 release, but you don't need to wait that long.
Nice work!
A little bit of context: triton-inference-server/server#6867
This PR adds an exclude_input_in_output flag to the vllm backend inputs. It only affects the non-streaming case.
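For reference, a minimal client-side sketch of setting the new flag on a non-streaming request (not part of this PR; the tensor names text_input, exclude_input_in_output, text_output and the model name "vllm_model" are assumptions based on the backend's example model configuration, so adjust them to your deployment):

```python
# Hedged sketch: send a non-streaming request with the new flag enabled.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prompt tensor (BYTES) holding a single prompt string.
text_input = httpclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(
    np.array(["The most dangerous animal is"], dtype=object)
)

# Optional BOOL input introduced by this PR: strip the prompt from the response.
exclude_input = httpclient.InferInput("exclude_input_in_output", [1], "BOOL")
exclude_input.set_data_from_numpy(np.array([True], dtype=bool))

result = client.infer(
    model_name="vllm_model",
    inputs=[text_input, exclude_input],
    outputs=[httpclient.InferRequestedOutput("text_output")],
)

# With exclude_input_in_output=True, text_output should contain only the
# generated continuation, not the echoed prompt.
print(result.as_numpy("text_output"))
```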
For the streaming case, I refactored the code to return only diffs, e.g.:

Prompt = "The most dangerous animal is"

Response:

The above case will be the only possible response in streaming mode. Open to discussions.
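To illustrate the diff-only streaming behavior, here is a hedged sketch of how a client might reassemble the full text from the streamed responses (the chunk values below are hypothetical, not actual model output):

```python
# Each streamed response carries only the newly generated text, so a client
# simply concatenates the chunks to rebuild the full generation.
def assemble_stream(chunks):
    """Concatenate per-response diffs into the full generated text."""
    full_text = ""
    for chunk in chunks:
        full_text += chunk  # each chunk is only the new piece, not prompt + text so far
    return full_text

# Hypothetical diffs streamed for the prompt "The most dangerous animal is":
print(assemble_stream([" the", " mosquito", "."]))  # -> " the mosquito."
```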
cc @mkhludnev