This document shows how you can serve a LitGPT model for deployment. The section below illustrates how to set up a minimal, highly scalable inference server for a phi-2 LLM using `litgpt serve`.
```bash
# 1) Download a pretrained model (alternatively, use your own finetuned model)
litgpt download microsoft/phi-2

# 2) Start the server
litgpt serve microsoft/phi-2
```
**Tip:** Use `litgpt serve --help` to display additional options, including the port, devices, LLM temperature setting, and more.
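For example, to run the server on a different port and with a lower sampling temperature, an invocation might look like the sketch below (the exact flag names are an assumption here; confirm them with `litgpt serve --help`):

```bash
# Serve phi-2 on a custom port with a lower temperature
# (flag names assumed; verify them with: litgpt serve --help)
litgpt serve microsoft/phi-2 --port 8001 --temperature 0.2
```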
You can now send requests to the inference server you started in step 2. For example, in a new Python session, query the server as follows:
```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"}
)
print(response.json()["output"])
```
Executing the code above prints the following output:
```
Example input.
```
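If you prefer to query the server from the command line instead of Python, the same request can be sent with curl. This is a minimal sketch assuming the `/predict` endpoint accepts the same JSON payload shown above:

```bash
# Send the same prompt to the /predict endpoint and print the raw JSON response
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Fix typos in the following sentence: Exampel input"}'
```

Unlike the Python example, this prints the raw JSON response rather than only the extracted `output` field.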
The two-step procedure described above returns the complete response all at once. If you want to stream the response token by token, start the server with the streaming option enabled:
```bash
litgpt serve microsoft/phi-2 --stream true
```
Then, use the following updated code to query the inference server:
```python
import requests, json

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
    stream=True
)

# Stream the response
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(json.loads(line)["output"], end="")
```
Executing the code above prints the streamed output:

```
Sure, here is the corrected sentence:
Example input
```
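The streaming endpoint can also be exercised directly from the command line. Below is a minimal sketch with curl, assuming the server emits newline-delimited JSON objects with an `output` field, as the Python example above suggests:

```bash
# -N disables curl's output buffering so lines are printed as they arrive
curl -N -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Fix typos in the following sentence: Exampel input"}'
```

Each printed line is a raw JSON object; the Python client above is what extracts and concatenates the `output` tokens into readable text.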