Serve and Deploy LLMs

This document shows how you can serve a LitGPT model for deployment.

 

Serve an LLM

This section illustrates how to set up a minimal and highly scalable inference server for a phi-2 LLM using litgpt serve.

 

Step 1: Start the inference server

# 1) Download a pretrained model (alternatively, use your own finetuned model)
litgpt download microsoft/phi-2

# 2) Start the server
litgpt serve microsoft/phi-2

Tip

Use litgpt serve --help to display additional options, including the port, devices, LLM temperature setting, and more.

 

Step 2: Query the inference server

You can now send requests to the inference server you started in step 1. For example, in a new Python session, we can send requests to the inference server as follows:

import requests

# Send a prompt to the /predict endpoint of the running server
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"}
)

print(response.json()["output"])

Executing the code above prints the following output:

Example input.
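 

If you plan to send many prompts, it can be convenient to wrap the request in a small helper function. The sketch below is built only on the /predict endpoint shown above and assumes the server from step 1 is running on the default address; the name query_server is an illustrative choice, not part of LitGPT.

import requests

def query_server(prompt, url="http://127.0.0.1:8000/predict"):
    # Send a single prompt to the /predict endpoint and return the generated text
    response = requests.post(url, json={"prompt": prompt})
    response.raise_for_status()  # fail loudly if the server returned an error
    return response.json()["output"]

print(query_server("Fix typos in the following sentence: Exampel input"))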

 

Optional streaming mode

The 2-step procedure described above returns the complete response all at once. If you want to stream the response on a token-by-token basis, start the server with the streaming option enabled:

litgpt serve microsoft/phi-2 --stream true

Then, use the following updated code to query the inference server:

import requests, json

response = requests.post(
    "http://127.0.0.1:8000/predict", 
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
    stream=True
)

# stream the response
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(json.loads(line)["output"], end="")
Executing the code above prints the streamed output as it is generated:

Sure, here is the corrected sentence:

Example input
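 

If you want the full text after streaming finishes, you can accumulate the chunks as they arrive. This is a minimal sketch that reuses the /predict endpoint and assumes the server was started with --stream true as shown above; the name collect_stream is illustrative, not a LitGPT API.

import requests, json

def collect_stream(prompt, url="http://127.0.0.1:8000/predict"):
    # Request a streamed response and join the chunks into one string
    response = requests.post(url, json={"prompt": prompt}, stream=True)
    chunks = []
    for line in response.iter_lines(decode_unicode=True):
        if line:
            chunks.append(json.loads(line)["output"])
    return "".join(chunks)

full_text = collect_stream("Fix typos in the following sentence: Exampel input")
print(full_text)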