Add OpenAI compatible v2 completion #1
base: main
Conversation
PR Summary
This PR adds OpenAI-compatible APIs for legacy completions and improves HTTP forwarding, focusing on enhancing performance, compatibility, and ease of testing. Key changes include:

- Added new completion endpoints in model-engine/model_engine_server/api/v2/completion.py for OpenAI compatibility
- Implemented async methods for predict forwarding in model-engine/model_engine_server/inference/forwarding/forwarding.py to address performance bottlenecks
- Updated data transfer objects in model-engine/model_engine_server/common/dtos/llms/completion.py to align with OpenAI's API structure
- Modified model-engine/model_engine_server/inference/forwarding/http_forwarder.py to fix route registration and dependency injection issues
- Made the post-inference handler optional in model-engine/model_engine_server/inference/forwarding/celery_forwarder.py for easier local testing
14 file(s) reviewed, 9 comment(s)
# if ttft is None and message.startswith("data"):
#     ttft = use_case_timer.lap()
#     ttft = timer.lap()
style: Remove or uncomment this code block
# note that this is ok because request will cache the body
body = await request.json()
style: Awaiting request.json() could impact performance. Consider caching this operation if used multiple times.
# if ttft is None and message.startswith("data"):
#     ttft = timer.lap()
style: This commented-out code seems unnecessary. Consider removing it if it's no longer needed.
@@ -288,7 +290,7 @@ def inter_token_latency(self) -> Optional[float]:  # Only for streaming requests
         return (self.total_duration - self.time_to_first_token) / (self.num_completion_tokens - 1)


-class CompletionV2Request(CreateCompletionRequest):
+class CompletionV2Request(CreateCompletionRequest, VLLMCompletionAdditionalParams):
logic: CompletionV2Request inherits from both CreateCompletionRequest and VLLMCompletionAdditionalParams. Ensure that there are no conflicting fields or unexpected behavior due to multiple inheritance.
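The resolution rule here is Python's MRO: when two bases define the same attribute, the first listed base wins, and pydantic collects duplicate field definitions along the same order. A minimal sketch with hypothetical stub classes (plain classes standing in for the real DTOs):

```python
class CreateCompletionRequestStub:
    # hypothetical stand-in for CreateCompletionRequest
    temperature = 1.0


class VLLMCompletionAdditionalParamsStub:
    # hypothetical stand-in, with a deliberately overlapping field
    temperature = 0.7
    top_k = -1


class CompletionV2RequestStub(
    CreateCompletionRequestStub, VLLMCompletionAdditionalParamsStub
):
    pass


# MRO: CompletionV2RequestStub -> CreateCompletionRequestStub
# -> VLLMCompletionAdditionalParamsStub, so the first base's
# temperature (1.0) shadows the second's (0.7)
```

Checking `CompletionV2RequestStub.__mro__` (or pydantic's `model_fields`) is a quick way to confirm which definition a combined request class actually ends up with.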
CompletionV2StreamResponse: TypeAlias = (
    EventSourceResponse  # EventSourceResponse[CompletionV2StreamChunk]
)
style: CompletionV2StreamResponse is defined as EventSourceResponse, but the comment suggests it should be EventSourceResponse[CompletionV2StreamChunk]. Clarify if this is intentional or if it should be properly typed.
# This is a version of CompletionV2Response that is used by pydantic to determine the response model
# Since EventSourceResponse isn't a pydantic model, we need to use a Union of the two response types
CompletionV2ResponseItem: TypeAlias = CompletionV2SyncResponse | CompletionV2StreamChunk
style: The comment explains the use of Union for CompletionV2ResponseItem. Consider adding this explanation as a docstring for better code documentation.
from typing import Any, Dict, List, Optional, TypeAlias

from model_engine_server.common.dtos.llms.vllm import VLLMCompletionAdditionalParams
from model_engine_server.common.pydantic_types import BaseModel, Field
from model_engine_server.common.types.gen.openai import (
    CreateCompletionRequest,
    CreateCompletionResponse,
)
from sse_starlette import EventSourceResponse
from typing_extensions import Annotated
style: Multiple imports from different modules. Consider organizing imports alphabetically for better readability.
if forwarder.post_inference_hooks_handler:
    forwarder.post_inference_hooks_handler.handle(request_params_pydantic, retval, task_id)  # type: ignore
logic: Consider adding a null check for forwarder.post_inference_hooks_handler before accessing its handle method to prevent a potential AttributeError.
def get_sync_forwarder(route=route):
    return sync_forwarders.get(route)


def get_stream_forwarder(route=route):
    return stream_forwarders.get(route)
style: Consider using more descriptive names for these functions, like get_sync_forwarder_for_route and get_stream_forwarder_for_route.
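Whatever the names, the `route=route` default argument in these getters is worth keeping: Python closures bind loop variables late, so without it every registered getter would see the loop's final route. A minimal sketch of the gotcha and the fix (hypothetical `build_getters` helper, not code from the PR):

```python
def build_getters(routes):
    """Build one getter per route, freezing each route via a default arg."""
    getters = {}
    for route in routes:
        # route=route captures the *current* loop value; a bare closure
        # over `route` would return the final value for every getter
        def get_forwarder(route=route):
            return f"forwarder-for-{route}"

        getters[route] = get_forwarder
    return getters
```

This is the standard workaround when registering per-route callbacks in a loop, e.g. with FastAPI's dependency injection.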
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.