
Add OpenAI compatible v2 completion #1

Open · wants to merge 10 commits into base: main

Conversation

lizard-boy

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

  • Add openai compatible APIs for legacy completions
  • HTTP forwarder improvements
    • Fix route registration logic - dependency injection wasn't working
    • Add async methods for predict forwarding - this was causing major bottlenecks
    • Make post inference handler optional - makes testing locally easier
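To illustrate why async predict forwarding relieves the bottleneck mentioned above: a blocking predict call serializes concurrent requests, while an awaited one lets the event loop interleave them. A minimal sketch (names hypothetical, not the actual forwarder API; the sleep stands in for upstream latency):

```python
import asyncio
import time

async def async_predict(payload: dict) -> dict:
    # Stand-in for an awaited upstream call; the sleep models network
    # latency without blocking the event loop.
    await asyncio.sleep(0.05)
    return {"output": payload["prompt"].upper()}

async def main() -> float:
    start = time.perf_counter()
    # Ten concurrent predicts take roughly one latency, not ten,
    # because the event loop interleaves the awaits.
    results = await asyncio.gather(
        *(async_predict({"prompt": f"req-{i}"}) for i in range(10))
    )
    assert all(r["output"].startswith("REQ-") for r in results)
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"10 concurrent predicts finished in {elapsed:.2f}s")
```

With a synchronous predict, the same ten requests would take ten times the per-request latency.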

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.


@greptile-apps bot left a comment


PR Summary

This PR adds OpenAI-compatible APIs for legacy completions and improves HTTP forwarding, focusing on enhancing performance, compatibility, and ease of testing. Key changes include:

  • Added new completion endpoints in model-engine/model_engine_server/api/v2/completion.py for OpenAI compatibility
  • Implemented async methods for predict forwarding in model-engine/model_engine_server/inference/forwarding/forwarding.py to address performance bottlenecks
  • Updated data transfer objects in model-engine/model_engine_server/common/dtos/llms/completion.py to align with OpenAI's API structure
  • Modified model-engine/model_engine_server/inference/forwarding/http_forwarder.py to fix route registration and dependency injection issues
  • Made post-inference handler optional in model-engine/model_engine_server/inference/forwarding/celery_forwarder.py for easier local testing

14 file(s) reviewed, 9 comment(s)

Comment on lines 136 to +137
# if ttft is None and message.startswith("data"):
# ttft = use_case_timer.lap()
# ttft = timer.lap()

style: Remove or uncomment this code block

Comment on lines +22 to +23
# note that this is ok because request will cache the body
body = await request.json()

style: Awaiting request.json() could impact performance. Consider caching this operation if used multiple times.
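For context on this comment: Starlette's `Request.json()` already memoizes the parsed body (which is what the inline `# note that this is ok` comment refers to), so repeated awaits do not re-parse. A toy sketch of that memoization pattern, with a hypothetical `CachedBodyRequest` standing in for the real request object:

```python
import json

class CachedBodyRequest:
    """Toy stand-in for a request whose body is parsed at most once,
    mirroring the memoization Starlette's Request.json() does internally."""

    def __init__(self, raw_body: bytes):
        self._raw_body = raw_body
        self._json = None  # parsed-body cache

    def json(self) -> dict:
        # Parse on first access, reuse afterwards.
        if self._json is None:
            self._json = json.loads(self._raw_body)
        return self._json

req = CachedBodyRequest(b'{"prompt": "hello", "max_tokens": 5}')
body = req.json()
assert body is req.json()  # second call hits the cache
print(body["prompt"])
```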

Comment on lines +134 to +135
# if ttft is None and message.startswith("data"):
# ttft = timer.lap()

style: This commented-out code seems unnecessary. Consider removing it if it's no longer needed.

@@ -288,7 +290,7 @@ def inter_token_latency(self) -> Optional[float]: # Only for streaming requests
return (self.total_duration - self.time_to_first_token) / (self.num_completion_tokens - 1)


- class CompletionV2Request(CreateCompletionRequest):
+ class CompletionV2Request(CreateCompletionRequest, VLLMCompletionAdditionalParams):

logic: CompletionV2Request inherits from both CreateCompletionRequest and VLLMCompletionAdditionalParams. Ensure that there are no conflicting fields or unexpected behavior due to multiple inheritance.
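One quick way to audit such a clash is to remember that Python resolves a shared attribute to the leftmost base in the MRO. A minimal sketch with plain stub classes (the `Stub` names and fields are hypothetical, not the actual DTOs; Pydantic's field collection follows the same base ordering):

```python
class CreateCompletionRequestStub:
    # stand-in for the generated OpenAI request model
    temperature: float = 1.0
    echo: bool = False

class VLLMAdditionalParamsStub:
    # stand-in for vLLM-specific extras; `echo` deliberately overlaps
    echo: bool = True
    top_k: int = -1

class CompletionV2RequestStub(CreateCompletionRequestStub, VLLMAdditionalParamsStub):
    pass

# The leftmost base wins for any shared attribute, so `echo` here
# comes from CreateCompletionRequestStub, not the vLLM extras.
assert CompletionV2RequestStub.echo is False
assert CompletionV2RequestStub.top_k == -1
print([c.__name__ for c in CompletionV2RequestStub.__mro__])
```

Inspecting the MRO like this makes it easy to verify which base supplies each overlapping field.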

Comment on lines +334 to +336
CompletionV2StreamResponse: TypeAlias = (
EventSourceResponse # EventSourceResponse[CompletionV2StreamChunk]
)

style: CompletionV2StreamResponse is defined as EventSourceResponse, but the comment suggests it should be EventSourceResponse[CompletionV2StreamChunk]. Clarify if this is intentional or if it should be properly typed.

Comment on lines +340 to +342
# This is a version of CompletionV2Response that is used by pydantic to determine the response model
# Since EventSourceResponse isn't a pydantic model, we need to use a Union of the two response types
CompletionV2ResponseItem: TypeAlias = CompletionV2SyncResponse | CompletionV2StreamChunk

style: The comment explains the use of Union for CompletionV2ResponseItem. Consider adding this explanation as a docstring for better code documentation.

Comment on lines +1 to 10
from typing import Any, Dict, List, Optional, TypeAlias

from model_engine_server.common.dtos.llms.vllm import VLLMCompletionAdditionalParams
from model_engine_server.common.pydantic_types import BaseModel, Field
from model_engine_server.common.types.gen.openai import (
CreateCompletionRequest,
CreateCompletionResponse,
)
from sse_starlette import EventSourceResponse
from typing_extensions import Annotated

style: Multiple imports from different modules. Consider organizing imports alphabetically for better readability.

Comment on lines +122 to +123
if forwarder.post_inference_hooks_handler:
forwarder.post_inference_hooks_handler.handle(request_params_pydantic, retval, task_id) # type: ignore

logic: Consider adding a null check for forwarder.post_inference_hooks_handler before accessing its handle method to prevent potential AttributeError.
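Note that the quoted snippet already guards with a truthiness check before calling `handle`. A sketch of the optional-handler pattern the PR introduces (class and payload names hypothetical):

```python
from typing import Callable, Optional

class ToyForwarder:
    """Toy forwarder whose post-inference hook is optional."""

    def __init__(self, post_inference_hooks_handler: Optional[Callable] = None):
        self.post_inference_hooks_handler = post_inference_hooks_handler

    def forward(self, request_params: dict) -> dict:
        retval = {"echo": request_params}  # stand-in for the prediction
        # Guard before calling: when no handler is configured (e.g.
        # local testing), forwarding still succeeds instead of raising.
        if self.post_inference_hooks_handler is not None:
            self.post_inference_hooks_handler(request_params, retval)
        return retval

calls = []
with_hook = ToyForwarder(lambda req, ret: calls.append(ret))
without_hook = ToyForwarder()  # no handler configured
with_hook.forward({"prompt": "hi"})
without_hook.forward({"prompt": "hi"})
print(len(calls))  # the hook ran exactly once
```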

Comment on lines +194 to +198
def get_sync_forwarder(route=route):
return sync_forwarders.get(route)

def get_stream_forwarder(route=route):
return stream_forwarders.get(route)

style: Consider using a more descriptive name for these functions, like get_sync_forwarder_for_route and get_stream_forwarder_for_route.
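Beyond naming, the `route=route` default argument in the snippet is doing real work: closures defined in a loop capture the loop variable itself, not its current value, so without the default every factory would return the last route. A minimal demonstration of both patterns:

```python
routes = ["/predict", "/stream"]

# Buggy pattern: each lambda closes over the loop variable itself, so
# every call sees the final value of `route`.
late_bound = [lambda: route for route in routes]

# Fixed pattern (what get_sync_forwarder(route=route) does): the default
# argument snapshots the current value at definition time.
early_bound = [lambda route=route: route for route in routes]

print([f() for f in late_bound])   # ['/stream', '/stream']
print([f() for f in early_bound])  # ['/predict', '/stream']
```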

2 participants