Add instructions for running vLLM backend #8
@@ -0,0 +1,93 @@
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

# vLLM Backend

The Triton backend for [vLLM](https://github.com/vllm-project/vllm).
You can learn more about Triton backends in the [backend
repo](https://github.com/triton-inference-server/backend). Ask
questions or report problems on the [issues
page](https://github.com/triton-inference-server/server/issues).
This backend is designed to run vLLM's
[supported HuggingFace models](https://vllm.readthedocs.io/en/latest/models/supported_models.html).

Where can I ask general questions about Triton and Triton backends?
Be sure to read all the information below as well as the [general
Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
available in the main [server](https://github.com/triton-inference-server/server)
repo. If you don't find your answer there, you can ask questions on the
main Triton [issues page](https://github.com/triton-inference-server/server/issues).

## Build the vLLM Backend

As a Python-based backend, your Triton server just needs to have the [Python backend](https://github.com/triton-inference-server/python_backend) built under `/opt/tritonserver/backends/python`. After that, you can save this backend in your backends folder as `/opt/tritonserver/backends/vllm`. The `model.py` file in the `src` directory should be placed in the `vllm` folder and will function as your Python-based backend.

In other words, there are no build steps. You only need to copy this into your Triton backends folder. If you use the official Triton vLLM container, this is already set up for you.

There are a couple of options for how you can build the vLLM backend.
Option 2: you can install the vLLM backend directly into our NGC Triton container. In this case, please install vllm first.
Note: we should also mention the separate container at some point.
Thanks for drafting these instructions. Added!
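A minimal sketch of that container-based route, assuming you are inside an NGC Triton container that already ships the Python backend under `/opt/tritonserver/backends/python`; the exact install command is an assumption rather than something specified in this PR:

```
# Sketch only: install vLLM in the container, then place this repository's
# model.py where Triton discovers the "vllm" backend.
pip install vllm

mkdir -p /opt/tritonserver/backends/vllm
cp src/model.py /opt/tritonserver/backends/vllm/
```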

The backend repository should look like this:
```
/opt/tritonserver/backends/
|-- vllm
|   `-- model.py
`-- python
    |-- libtriton_python.so
    |-- triton_python_backend_stub
    |-- triton_python_backend_utils.py
```

## Using the vLLM Backend

You can see an example model_repository in the `samples` folder.
You can use it as-is and change the model by changing the `model` value in `model.json`.

You can change the GPU utilization and logging in that file as well.
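For example, a server launch against that sample repository might look like the following; the `samples/model_repository` path assumes you are running from a checkout of this repository, with the backend installed as described above:

```
# Sketch: serve the sample model repository. The vllm_opt model reads its
# engine settings (model name, GPU utilization, logging) from model.json.
tritonserver --model-repository samples/model_repository
```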

In the `samples` folder, you can also find a sample client, `client.py`.
This client is meant to function similarly to the Triton
[vLLM example](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM).
By default, it runs the prompts in `prompts.txt`, which is included in the `samples` folder.
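For instance, a typical run against a local server could look like this; every flag below is defined in `client.py`'s argument parser, and `localhost:8001` is simply the client's default gRPC endpoint:

```
# Sketch: stream completions for the bundled prompts and write them
# to results.txt.
python3 client.py --input-prompts prompts.txt --results-file results.txt --streaming-mode
```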

## Important Notes

* At present, Triton only supports one Python-based backend per server. If you try to start multiple vLLM models, you will get an error.

### Running Multiple Instances of Triton Server

Python-based backends use shared memory to transfer requests to the stub process. When multiple instances of Triton Server running on the same machine serve Python-based backend models, their shared memory region names can conflict and cause segmentation faults or hangs. To avoid this issue, specify a different `shm-region-prefix-name` for each instance using the `--backend-config` flag.

```
# Triton instance 1
tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1

# Triton instance 2
tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2
```

Note that the hangs only occur if `/dev/shm` is shared between the two instances of the server. If you run the servers in separate containers that do not share this location, you do not need to specify `shm-region-prefix-name`.
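As an illustration only (the image tag, host ports, and model path below are placeholders, not values taken from this repository), running the two servers in separate containers gives each one a private `/dev/shm`, so the prefixes are unnecessary:

```
# Placeholder image tag and paths: each container gets its own /dev/shm by
# default, so the Python backend's shared-memory names cannot collide.
docker run -d --gpus all -v /path/to/model_repository:/models -p 8001:8001 \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
    tritonserver --model-repository=/models

docker run -d --gpus all -v /path/to/model_repository:/models -p 9001:8001 \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
    tritonserver --model-repository=/models
```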
@@ -0,0 +1,202 @@
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import argparse
import asyncio
import json
import queue
import sys
from os import system

import numpy as np
import tritonclient.grpc.aio as grpcclient
from tritonclient.utils import *


def create_request(
    prompt,
    stream,
    request_id,
    sampling_parameters,
    model_name,
    send_parameters_as_tensor=True,
):
    inputs = []
    prompt_data = np.array([prompt.encode("utf-8")], dtype=np.object_)
    try:
        inputs.append(grpcclient.InferInput("PROMPT", [1], "BYTES"))
        inputs[-1].set_data_from_numpy(prompt_data)
    except Exception as e:
        print(f"Encountered an error {e}")

    stream_data = np.array([stream], dtype=bool)
    inputs.append(grpcclient.InferInput("STREAM", [1], "BOOL"))
    inputs[-1].set_data_from_numpy(stream_data)

    # Request parameters are not yet supported via BLS. Provide an
    # optional mechanism to send serialized parameters as an input
    # tensor until support is added.
    if send_parameters_as_tensor:
        sampling_parameters_data = np.array(
            [json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_
        )
        inputs.append(grpcclient.InferInput("SAMPLING_PARAMETERS", [1], "BYTES"))
        inputs[-1].set_data_from_numpy(sampling_parameters_data)

    # Add requested outputs
    outputs = []
    outputs.append(grpcclient.InferRequestedOutput("TEXT"))

    # Issue the asynchronous sequence inference.
    return {
        "model_name": model_name,
        "inputs": inputs,
        "outputs": outputs,
        "request_id": str(request_id),
        "parameters": sampling_parameters,
    }


async def main(FLAGS):
    model_name = "vllm_opt"
    sampling_parameters = {"temperature": "0.1", "top_p": "0.95"}
    stream = FLAGS.streaming_mode
    with open(FLAGS.input_prompts, "r") as file:
        print(f"Loading inputs from `{FLAGS.input_prompts}`...")
        prompts = file.readlines()

    results_dict = {}

    async with grpcclient.InferenceServerClient(
        url=FLAGS.url, verbose=FLAGS.verbose
    ) as triton_client:
        # Request iterator that yields the next request
        async def async_request_iterator():
            try:
                for iteration in range(FLAGS.iterations):
                    for i, prompt in enumerate(prompts):
                        prompt_id = FLAGS.offset + (len(prompts) * iteration) + i
                        results_dict[str(prompt_id)] = []
                        yield create_request(
                            prompt, stream, prompt_id, sampling_parameters, model_name
                        )
            except Exception as error:
                print(f"caught error in request iterator: {error}")

        try:
            # Start streaming
            response_iterator = triton_client.stream_infer(
                inputs_iterator=async_request_iterator(),
                stream_timeout=FLAGS.stream_timeout,
            )
            # Read response from the stream
            async for response in response_iterator:
                result, error = response
                if error:
                    print(f"Encountered error while processing: {error}")
                else:
                    output = result.as_numpy("TEXT")
                    for i in output:
                        results_dict[result.get_response().id].append(i)

        except InferenceServerException as error:
            print(error)
            sys.exit(1)

    with open(FLAGS.results_file, "w") as file:
        for id in results_dict.keys():
            for result in results_dict[id]:
                file.write(result.decode("utf-8"))
                file.write("\n")
            file.write("\n=========\n\n")
        print(f"Storing results into `{FLAGS.results_file}`...")

    if FLAGS.verbose:
        print(f"\nContents of `{FLAGS.results_file}` ===>")
        system(f"cat {FLAGS.results_file}")

    print("PASS: vLLM example")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        required=False,
        default=False,
        help="Enable verbose output",
    )
    parser.add_argument(
        "-u",
        "--url",
        type=str,
        required=False,
        default="localhost:8001",
        help="Inference server URL and its gRPC port. Default is localhost:8001.",
    )
    parser.add_argument(
        "-t",
        "--stream-timeout",
        type=float,
        required=False,
        default=None,
        help="Stream timeout in seconds. Default is None.",
    )
    parser.add_argument(
        "--offset",
        type=int,
        required=False,
        default=0,
        help="Add offset to request IDs used",
    )
    parser.add_argument(
        "--input-prompts",
        type=str,
        required=False,
        default="prompts.txt",
        help="Text file with input prompts",
    )
    parser.add_argument(
        "--results-file",
        type=str,
        required=False,
        default="results.txt",
        help="The file with output results",
    )
    parser.add_argument(
        "--iterations",
        type=int,
        required=False,
        default=1,
        help="Number of iterations through the prompts file",
    )
    parser.add_argument(
        "-s",
        "--streaming-mode",
        action="store_true",
        required=False,
        default=False,
        help="Enable streaming mode",
    )
    FLAGS = parser.parse_args()
    asyncio.run(main(FLAGS))
@@ -0,0 +1,5 @@
{
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5
}
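As an illustrative aside, swapping the served model only requires editing this file; the model name below is just an example of another HuggingFace checkpoint, and the target path depends on where your copy of the sample lives:

```
# Sketch: rewrite model.json to point the sample at a different model.
cat > model.json <<'EOF'
{
    "model": "facebook/opt-350m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5
}
EOF
```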
@@ -0,0 +1,75 @@
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "vllm_opt" | ||
backend: "vllm" | ||
|
||
# Disabling batching in Triton, let vLLM handle the batching on its own. | ||
max_batch_size: 0 | ||
|
||
# We need to use decoupled transaction policy for saturating | ||
# vLLM engine for max throughtput. | ||
# TODO [DLIS:5233]: Allow asychronous execution to lift this | ||
# restriction for cases there is exactly a single response to | ||
# a single request. | ||
model_transaction_policy { | ||
decoupled: True | ||
} | ||
|
||
input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "STREAM"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "SAMPLING_PARAMETERS"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
@@ -0,0 +1,4 @@
Hello, my name is
The most dangerous animal is
The capital of France is
The future of AI is
Would installing dependencies be part of the build? Or do we need a separate section on dependencies?
Good catch. I'll add this. I had made the assumption that this is using the vLLM backend, but we need to clarify/offer an independent build (e.g. adding these to a general Triton container).