Support input for llama3.2 multi-modal model #69
Merged
Changes shown are from 4 of the 6 commits:

52e125f  Support modal model input
114c235  Update
4a8b737  support image list input
9897102  Remove best of metrics to align with latest vllm
c85d972  supply the images as individual tensors
566e0cc  Add vllm version check for compatibility
@@ -31,7 +31,9 @@

```python
import queue
import threading
from typing import Dict, List

import base64
from PIL import Image
from io import BytesIO
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
```
@@ -52,6 +54,12 @@ class TritonPythonModel:

```python
    def auto_complete_config(auto_complete_model_config):
        inputs = [
            {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]},
            {
                "name": "multi_modal_data",
                "data_type": "TYPE_STRING",
                "dims": [1],
                "optional": True,
            },
            {
                "name": "stream",
                "data_type": "TYPE_BOOL",
```
@@ -385,6 +393,28 @@ async def generate(self, request):

```python
        ).as_numpy()[0]
        if isinstance(prompt, bytes):
            prompt = prompt.decode("utf-8")

        multi_modal_data_input_tensor = pb_utils.get_input_tensor_by_name(
            request, "multi_modal_data"
        )
        if multi_modal_data_input_tensor:
            multi_modal_data = multi_modal_data_input_tensor.as_numpy()[0].decode("utf-8")
            multi_modal_data = json.loads(multi_modal_data)
            if "image" in multi_modal_data:
                image_list = []
                for image_base64_string in multi_modal_data["image"]:
                    if "base64," in image_base64_string:
                        image_base64_string = image_base64_string.split("base64,")[-1]
                    image_data = base64.b64decode(image_base64_string)
                    image = Image.open(BytesIO(image_data)).convert("RGB")
                    image_list.append(image)
                prompt = {
                    "prompt": prompt,
                    "multi_modal_data": {
                        "image": image_list
                    }
                }

        stream = pb_utils.get_input_tensor_by_name(request, "stream")
        if stream:
            stream = stream.as_numpy()[0]
```

Review comment on the `Image.open(...).convert("RGB")` line: NOTE: May need to expose image formats other than RGB in the future, but this seems like a sensible default / first support for now. We can probably defer exposing it until we have a use case requiring other formats.
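For context, here is a minimal client-side sketch of exercising this path. The model name, server URL, output tensor name, image path, and prompt template are assumptions for illustration, not values taken from this PR.

```python
import base64
import json

import numpy as np
import tritonclient.grpc as grpcclient

# Assumed local image; the backend accepts base64 with or without a data-URI prefix.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Serialized-JSON form parsed by the diff above: {"image": [<base64 string>, ...]}.
mm_json = json.dumps({"image": [image_b64]})

text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(
    # Prompt template is model-dependent; a llama3.2-vision style prompt is assumed here.
    np.array(["<|image|>Describe this picture.".encode("utf-8")], dtype=np.object_)
)

mm_input = grpcclient.InferInput("multi_modal_data", [1], "BYTES")
mm_input.set_data_from_numpy(np.array([mm_json.encode("utf-8")], dtype=np.object_))

client = grpcclient.InferenceServerClient(url="localhost:8001")
result = client.infer(model_name="vllm_model", inputs=[text_input, mm_input])
print(result.as_numpy("text_output"))
```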
@GuanLuo @krishung5 @kthui any concerns with passing a serialized JSON input vs. individual input tensors for "image", "audio", etc.?
Looks like this is currently mimicking the style of inputs vLLM itself expects, so it would be pretty intuitive to vLLM users:
Current serialized JSON form:
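Roughly, the serialized-JSON form looks like the following (reconstructed from the diff above; the base64 payload is a truncated placeholder):

```python
# The "multi_modal_data" input is a single BYTES element holding a JSON string like:
multi_modal_data_json = """
{
  "image": [
    "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
  ]
}
"""
# The backend json.loads() this, strips any "...base64," prefix, base64-decodes each
# entry into a PIL image, and passes vLLM:
#   {"prompt": prompt, "multi_modal_data": {"image": [<PIL.Image>, ...]}}
```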
Example tensor form:
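The individual-tensor form might look roughly like this (an illustrative sketch in the style of `auto_complete_config` above; the tensor name and dims are assumptions):

```python
# Each modality declared as its own optional input instead of one JSON blob,
# e.g. a variable-length BYTES tensor of base64-encoded images.
inputs = [
    {"name": "text_input", "data_type": "TYPE_STRING", "dims": [1]},
    {"name": "image", "data_type": "TYPE_STRING", "dims": [-1], "optional": True},
]
```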
I think the individual input tensors are cleaner in terms of what inputs are expected, and less prone to user error since they do not involve the additional JSON layer.
Given that we need to tear down the JSON and convert each base64 string into bytes, there is actually some work on the backend to verify the JSON is well-formed before the conversion can happen. I think it is easier to supply the image/audio as individual tensors, knowing they are already well-formed, and then convert each base64 string into bytes and format them correctly for vLLM.
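A minimal sketch of the backend-side handling that approach implies, assuming the same imports and `generate()` context as the diff above (the `image` tensor name and flattened shape are assumptions):

```python
# With a dedicated "image" tensor there is no JSON layer to validate: every element
# is already a standalone base64 string that only needs decoding.
image_tensor = pb_utils.get_input_tensor_by_name(request, "image")
if image_tensor:
    image_list = [
        Image.open(BytesIO(base64.b64decode(b))).convert("RGB")
        for b in image_tensor.as_numpy().reshape(-1)
    ]
    prompt = {"prompt": prompt, "multi_modal_data": {"image": image_list}}
```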
No real concerns off the top of my head. Agree with Jacky that the tensor form looks cleaner and could simplify some checks. Aligning the format with vLLM could slightly improve usability for vLLM backend users, but since the required input changes seem minimal, the impact on vLLM users should be limited.