Is it possible to support multiple endpoints for one server? #271

arkohut · 2024-09-06T07:57:46Z

🚀 Feature

Multiple endpoints like /embedding or /vlm/predict or /ocr/predict.

Motivation

I would like to host multiple models on a single GPU for different purposes. It would be ideal to support numerous (small) models while maintaining high performance, such as through batching.

Additionally, I believe starting multiple litserve instances with different ports may introduce unnecessary complexity, compared to starting a single server with different endpoints.

bhimrazy · 2024-09-06T08:20:42Z

Hi @arkohut,

You can add an additional endpoint by implementing a LitSpec API, similar to the OpenAISpec. Currently, it only takes a single spec.

bhimrazy · 2024-09-06T08:38:24Z

Hi @aniketmaurya,

It seems litserve already handles multiple specs when registering routes, but the worker setup currently accepts only a single one. Do we have plans to support multiple specs (an array of specs)?

for spec in self._specs:
    spec: LitSpec
    # TODO: check for path conflicts
    for path, endpoint, methods in spec.endpoints:
        self.app.add_api_route(
            path, endpoint=endpoint, methods=methods, dependencies=[Depends(self.setup_auth())]
        )
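
The TODO in the loop above (checking for path conflicts) could be handled roughly as follows. This is a minimal sketch, not LitServe code: `find_path_conflicts` and the spec names are hypothetical, and the only assumption taken from the snippet is that each spec exposes `(path, endpoint, methods)` triples.

```python
from collections import defaultdict


def find_path_conflicts(specs):
    """Return routes claimed by more than one spec.

    `specs` maps a (hypothetical) spec name to a list of
    (path, endpoint, methods) triples, mirroring `spec.endpoints` above.
    """
    owners = defaultdict(list)
    for spec_name, endpoints in specs.items():
        for path, _endpoint, methods in endpoints:
            for method in methods:
                # A route is identified by its path plus HTTP method.
                owners[(path, method.upper())].append(spec_name)
    # Keep only routes that more than one spec tries to register.
    return {route: names for route, names in owners.items() if len(names) > 1}
```

Running a check like this before `add_api_route` would let the server fail early with a clear message instead of silently registering two handlers for the same route.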

bhimrazy · 2024-09-06T08:58:39Z

Hi @arkohut,

> You can add an additional endpoint by implementing a LitSpec API, similar to the OpenAISpec. Currently, it only takes a single spec.

Oh, sorry! It looks like we can currently only use one endpoint, either the default one or the one added from the spec.

There is also some discussion about multiple endpoints in issue #90. Feel free to check it out!

arkohut · 2024-09-09T06:46:11Z

Thanks for the reply. Issue #90 only covers customizing the endpoint, which I think is quite necessary. For example, I need an OpenAI-compatible embedding endpoint, which litserve does not support (it only supports the chat API).

But it does not cover multiple endpoints. The only way right now is to expose multiple servers on different ports.

bhimrazy · 2024-09-09T08:26:00Z

Hi @arkohut, agreed on the multi-endpoints feature, but not sure if it's in the plan.
I did a quick hack for this, though it’s not perfect since the extra endpoints are isolated from the main litserve engine.
Hope it helps!

# server.py

import litserve as ls
import numpy as np
from fastapi import Depends
from openai.types.embedding_create_params import EmbeddingCreateParams
from openai.types.create_embedding_response import (
    CreateEmbeddingResponse,
    Embedding,
    Usage,
)
from typing import Generator


class ChatAPI(ls.LitAPI):
    def setup(self, device: str) -> None:
        """Initialize the model and other required resources."""
        self.model = None  # Placeholder: Initialize or load your model here.

    def predict(self, prompt: str) -> Generator[str, None, None]:
        """Generator function to yield the model output step by step."""
        yield "This is a sample generated output"

    def encode_response(self, output: Generator[str, None, None]) -> Generator[dict, None, None]:
        """Format the response to fit the assistant's message structure."""
        for out in output:
            yield {"role": "assistant", "content": out}
        # Final token after finishing processing
        yield {"role": "assistant", "content": "This is the final msg."}


def embedding_fn(request: EmbeddingCreateParams) -> CreateEmbeddingResponse:
    """Generate a fake embedding for demonstration purposes."""
    # Placeholder: Cache the model here to avoid reloading for every request.
    embeddings = [
        Embedding(embedding=np.random.rand(512).tolist(), index=0, object="embedding")
    ]
    
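    # Token usage calculation (rough word count, not a real tokenizer).
    # "input" may be a single string or a list, so normalize it first.
    prompt_tokens = 20
    inputs = request["input"]
    texts = [inputs] if isinstance(inputs, str) else list(inputs)
    input_len = sum(len(str(text).split()) for text in texts)
    total_tokens = input_len + prompt_tokens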

    usage = Usage(prompt_tokens=prompt_tokens, total_tokens=total_tokens)
    
    # Return the response formatted as per OpenAI API structure
    return CreateEmbeddingResponse(
        data=embeddings,
        model=request["model"],
        object="list",
        usage=usage
    )


if __name__ == "__main__":
    # Initialize the API and server
    api = ChatAPI()
    server = ls.LitServer(api, spec=ls.OpenAISpec())

    # Add the embedding API route
    server.app.add_api_route(
        "/v1/embeddings",
        embedding_fn,
        methods=["POST"],
        tags=["embedding"],
        dependencies=[Depends(server.setup_auth())], 
    )

    # Run the server
    server.run(port=8000)
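
To try the extra route, a client only needs to POST an OpenAI-style payload to `/v1/embeddings`. The sketch below uses only the standard library; the helper names and the model name are made up for illustration, it assumes the server above is running on port 8000, and it does not send any authentication header.

```python
import json
import urllib.request


def build_embedding_request(text, model="demo-embedding-model",
                            base_url="http://localhost:8000"):
    """Build a POST request for the /v1/embeddings route added above."""
    payload = {"input": text, "model": model}  # model name is a placeholder
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text):
    """Send the request and return the first embedding vector."""
    with urllib.request.urlopen(build_embedding_request(text)) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]
```

With the server running, `fetch_embedding("hello world")` should return the fake vector produced by `embedding_fn`.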

lantiga · 2024-09-09T16:50:18Z

Thanks for the great discussion.

Yes, that's definitely something we want to enable. A spec is meant to make an API conform to a given API specification, so I wouldn't abuse it.

What I would rather do is create something to launch a collection of LitServers in the same server.

Initially we thought it would be simpler to pass a list or dict of LitAPIs to LitServer, but then all the arguments to LitServer would have to be specified per-API and things would get very murky.

The simpler thing to do is to have a function or class that takes a collection of LitServers, which you then run.

Could be something like

embed_server = ls.LitServer(embed_api, ...)
llm_server = ls.LitServer(llm_api, ...)

run_servers(embed_server, llm_server)

# or

run_servers({"/embed-prefix": embed_server, "/predict-prefix": llm_server})

or we could introduce another server collection class, but the concept doesn't change.

This is good because it would give you the ability to specify worker settings, batching settings, etc per endpoint, which you absolutely need to do.
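
To make the prefix idea concrete, here is a small sketch of how requests could be dispatched to per-prefix servers that each keep their own settings. Nothing here is LitServe API: `ServerConfig` and `resolve` are hypothetical stand-ins, and the field names are only chosen to suggest per-server worker and batching settings.

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Hypothetical stand-in for the settings of one server in the collection."""
    name: str
    workers_per_device: int = 1
    max_batch_size: int = 1


def resolve(path, servers):
    """Return (config, remaining_path) for the longest matching prefix, or None."""
    for prefix in sorted(servers, key=len, reverse=True):
        normalized = prefix.rstrip("/")
        # Match the prefix exactly or as a complete leading path segment.
        if path == normalized or path.startswith(normalized + "/"):
            return servers[prefix], path[len(normalized):] or "/"
    return None


servers = {
    "/embed-prefix": ServerConfig("embed", max_batch_size=32),
    "/predict-prefix": ServerConfig("llm", workers_per_device=2),
}
```

Here `resolve("/embed-prefix/predict", servers)` selects the embedding server's settings and leaves `/predict` as the path that server would see, so batching and worker counts stay independent per endpoint.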

bhimrazy · 2024-09-09T17:17:06Z

Thanks so much, @lantiga, for the great idea!
I’m excited about the direction and look forward to doing some research and making a contribution to it.

aniketmaurya · 2024-09-10T12:42:00Z

Hi @bhimrazy! If you're interested in contributing to this issue, you can try the following:

We define a run_all function that accepts a list of LitServer objects.
run_all will create the socket, as shown here, and then perform the rest of the operations of the LitServer.run method in a combined way.

Please let me know if you have any questions.

bhimrazy · 2024-09-10T15:23:12Z

Sure, @aniketmaurya! I'll start drafting a PR.

I do have a few points of confusion about it, but I'll review the details first to gain a clearer understanding and then get back to you. Thank you 🙂

aniketmaurya · 2024-09-19T11:09:09Z

Hi @arkohut, would you be available to chat more about this issue? We are doing some research on how best to offer this feature to users.

arkohut · 2024-09-20T01:25:57Z

> hi @arkohut, would you be available to chat more about this issue? we are doing some research to enable this feature to the users in the best manner.

OK, I will explain more about my use case.

I am working on a project that requires multiple models to run on a local PC. The reason for this is to ensure personal privacy is not compromised.

Specifically, this project is much like a current project called Rewind. I need to extract text from screenshots, use a multimodal model to describe the screenshots, and ultimately use an embedding model to store the extracted data into a vector database. Then I can use text to search the indexed data.

In this process, multiple models are involved:

OCR model
VLM model
Embedding model

I hope these models can be loaded on a local GPU, and preferably use a solution like litserve to ensure the operational efficiency of the models.

Currently, ollama seems to be a very good local model running solution, but it has the following issues:

Ollama's support for newer models is often slow or even non-existent, which makes it far less flexible than litserve.
Ollama mainly targets running LLMs and has relatively limited support for other kinds of models. litserve is more flexible here too and can support a richer variety of models.

I would like to emphasize that the models have a significant impact on the effectiveness of the project, so I am very keen on running the best possible models locally, even with limited computational power. At present, many excellent VLM models are not yet supported by ollama, such as Qwen-VL, Florence 2, and InternVL2.

Even if, in the end, due to model performance limitations, running the models locally isn’t feasible, it is still crucial to have a solution that allows multiple models to run on a single GPU with a fast inference speed (rather than using multiple GPUs, even if it’s an A100 or H100 GPU).

aceliuchanghong · 2024-09-25T14:13:35Z

good

akansal1 · 2024-09-26T08:46:30Z

Hi, I want to understand how to manage GPU memory in the case of multi-model serving.

aniketmaurya · 2024-09-26T11:55:30Z

> hi, i want to understand how to manage GPU memory in case of multi-model serving ?

Hi @akansal1, if you have multiple model instances, each instance takes up its own share of GPU memory.

If you are using multiple workers, you can call torch.cuda.set_per_process_memory_fraction in each worker process to cap how much memory PyTorch's caching allocator may use.
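
As a rough sketch of that suggestion: split the GPU evenly between workers, leave some headroom, and apply the cap in each worker process (for example at the start of the API's setup). The helper names and the 10% reserve are assumptions for illustration; only `torch.cuda.set_per_process_memory_fraction` is the real PyTorch call.

```python
def memory_fraction_per_worker(num_workers, reserve=0.1):
    """Split the GPU evenly between workers, keeping `reserve` as headroom."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0 <= reserve < 1:
        raise ValueError("reserve must be in [0, 1)")
    return (1.0 - reserve) / num_workers


def limit_gpu_memory(fraction, device=0):
    """Apply the cap in the current process; return True if it was applied."""
    try:
        import torch  # imported lazily so the helper is usable without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Caps the caching allocator for this process on the given device.
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return True
```

For three workers with the default reserve, each process would be capped at roughly 30% of the device memory. The cap applies to PyTorch's allocator only, so other per-process GPU overhead should be budgeted separately.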

arkohut added the enhancement and help wanted labels Sep 6, 2024

bhimrazy linked a pull request Sep 12, 2024 that will close this issue

Feat: multiple endpoints using a list of LitServer #276

Draft

4 tasks

bsergean mentioned this issue Sep 20, 2024

More complex model management (multiple models, model reloading etc...) #282

Closed

aniketmaurya added the question label Sep 26, 2024
