Add instructions for running vLLM backend #8

Merged
merged 52 commits into from
Oct 18, 2023
Commits
1688a33
Draft README and samples
dyastremsky Oct 10, 2023
0ba6200
Run pre-commit
dyastremsky Oct 10, 2023
a4921c1
Remove unused queue.
dyastremsky Oct 10, 2023
92124bf
Fixes for README
dyastremsky Oct 10, 2023
aa8a105
Add client.py shebang
dyastremsky Oct 10, 2023
ed108d0
Add Conda instructions.
dyastremsky Oct 10, 2023
c5213f6
Spacing, title
dyastremsky Oct 10, 2023
2c6881c
Switch i/o to lowercase
dyastremsky Oct 10, 2023
ac33407
Switch i/o to lowercase
dyastremsky Oct 10, 2023
d2fdb3f
Switch i/o to lowercase
dyastremsky Oct 10, 2023
02c1167
Switch i/o to lowercase
dyastremsky Oct 10, 2023
d164dab
Change client code to use lowercase inputs/outputs
dyastremsky Oct 10, 2023
5ed4d0e
Merge branch 'main' of https://github.com/triton-inference-server/vll…
dyastremsky Oct 10, 2023
0cd3d91
Merge branch 'dyas-README' of https://github.com/triton-inference-ser…
dyastremsky Oct 10, 2023
45a531f
Update client to use iterable client class
dyastremsky Oct 11, 2023
1e27105
Rename vLLM model, add note to config
dyastremsky Oct 11, 2023
97417c5
Remove unused imports and vars
dyastremsky Oct 11, 2023
d943de2
Clarify whaat Conda parameter is doing.
dyastremsky Oct 11, 2023
99943cc
Add clarifying note to model config
dyastremsky Oct 11, 2023
b08f426
Run pre-commit
dyastremsky Oct 11, 2023
682ad0c
Remove limitation, model name
dyastremsky Oct 11, 2023
e7578f1
Fix gen vllm env script name
dyastremsky Oct 11, 2023
502f4db
Update wording for supported models
dyastremsky Oct 11, 2023
ea35a73
Merge branch 'dyas-README' of https://github.com/triton-inference-ser…
dyastremsky Oct 11, 2023
fe06416
Update capitalization
dyastremsky Oct 11, 2023
0144d33
Update wording around shared memory across servers
dyastremsky Oct 11, 2023
0f0f968
Remove extra note about shared memory hangs across servers
dyastremsky Oct 11, 2023
b81574d
Fix line lengths and clarify wording.
dyastremsky Oct 11, 2023
faa29a6
Add container steps
dyastremsky Oct 12, 2023
4259a7e
Add links to engine args, define model.json
dyastremsky Oct 12, 2023
76c2d89
Change verbiage around vLLM engine models
dyastremsky Oct 12, 2023
31f1733
Fix links
dyastremsky Oct 12, 2023
76d0652
Fix links, grammar
dyastremsky Oct 12, 2023
a50ae8d
Remove Conda references.
dyastremsky Oct 12, 2023
edaff54
Fix client I/O and model names
dyastremsky Oct 12, 2023
33dbaed
Remove model name in config
dyastremsky Oct 12, 2023
6575197
Add generate endpoint, switch to min container
dyastremsky Oct 12, 2023
9effb18
Change to min
dyastremsky Oct 12, 2023
8dc3f51
Apply suggestions from code review
oandreeva-nv Oct 13, 2023
bf0d905
Update README.md
oandreeva-nv Oct 13, 2023
7ec9b5f
Add example model args, link to multi-server behavior
dyastremsky Oct 17, 2023
3b64abc
Format client input, add upstream tag.
dyastremsky Oct 17, 2023
3a3b326
Fix links, grammar
dyastremsky Oct 17, 2023
48e08e7
Add quotes to shm-region-prefix-name
dyastremsky Oct 17, 2023
9b4a193
Update sentence ordering, remove extra issues link
dyastremsky Oct 17, 2023
45be0f6
Modify input text example, one arg per line
dyastremsky Oct 17, 2023
204ce5a
Remove line about CUDA version compatibility.
dyastremsky Oct 18, 2023
8c9c4e7
Wording of Triton container option
dyastremsky Oct 18, 2023
3ab4774
Update wording of pre-built Docker container option
dyastremsky Oct 18, 2023
757e2b2
Update README.md wording
dyastremsky Oct 18, 2023
aa9ec65
Update wording - add "the"
dyastremsky Oct 18, 2023
e0161f4
Standarize capitalization, headings
dyastremsky Oct 18, 2023
166 changes: 166 additions & 0 deletions README.md
@@ -0,0 +1,166 @@
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

# vLLM Backend

The Triton backend for [vLLM](https://github.com/vllm-project/vllm)
is designed to run
[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
on a
[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
You can learn more about Triton backends in the [backend
repo](https://github.com/triton-inference-server/backend).


This is a Python-based backend. When using this backend, all requests are placed on the
Contributor

Would be good to hyperlink "python-based backend" to the docs on it when triton-inference-server/backend#88 is merged.

Contributor

We should normalize our terms:

python-based or python based

Contributor Author

My preference would be towards the first, because I think that'd be clearer and more grammatically correct.

Sources: grammar website about -based words. APA more general rules around hyphenating (principles 1 and 3 seem to apply).
CC: @oandreeva-nv

Contributor Author

Also, Python should be capitalized in my opinion. We capitalize Python in the Python backend README. Capitalizing the "p" in Python also aligns with the capitalization guidelines in Python's style guide.

Contributor

Noted

Contributor Author

Doing a search through the Triton Inference Server GitHub organization for markdown files with the word "Python" in them, we are pretty consistent with using capitalization. There are a few documents in the tutorials where we use both versions that we could update for consistency, e.g. part 6 in tutorials and the new request cancellation document.

Contributor

Python-based backend it is

nnshah1 marked this conversation as resolved.
vLLM AsyncEngine as soon as they are received. In-flight batching and paged attention are handled
by the vLLM engine.

Where can I ask general questions about Triton and Triton backends?
Be sure to read all the information below as well as the [general
Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
available in the main [server](https://github.com/triton-inference-server/server)
repo. If you don't find your answer there you can ask questions on the
main Triton [issues page](https://github.com/triton-inference-server/server/issues).

## Building the vLLM Backend

There are several ways to install and deploy the vLLM backend.

### Option 1. Use the Pre-Built Docker Container

Pull the container with the vLLM backend from the [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) registry. This container has everything you need to run your vLLM model.

### Option 2. Build a Custom Container From Source
You can follow the steps described in the
[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
guide and use the
[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
script.

A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize flags according to your needs.

```
./build.py -v --enable-logging \
    --enable-stats \
    --enable-tracing \
    --enable-metrics \
    --enable-gpu-metrics \
    --enable-cpu-metrics \
    --enable-gpu \
    --filesystem=gcs \
    --filesystem=s3 \
    --filesystem=azure_storage \
    --endpoint=http \
    --endpoint=grpc \
    --endpoint=sagemaker \
    --endpoint=vertex-ai \
    --upstream-container-version=23.10 \
    --backend=python:r23.10 \
    --backend=vllm:r23.10
```

### Option 3. Add the vLLM Backend to the Default Triton Container

You can install the vLLM backend directly into the NGC Triton container.
In this case, please install vLLM first. You can do so by running
`pip install vllm==<vLLM_version>`. Then, set up the vLLM backend in the
container with the following commands:

```
mkdir -p /opt/tritonserver/backends/vllm
wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
```
Comment on lines +97 to +100
Contributor

@rmccorm4 Oct 17, 2023

Not an action item here, but a random food for thought that could be nice for both users and developers. If we standardize on a certain python-based-backend git repository structure, we can do something like:

`git clone https://github.com/triton-inference-server/vllm_backend.git /opt/tritonserver/backends`
  1. Single command
  2. Developers could iterate on the backend directly in the git repo and just reload Triton without copying files/builds around (developer experience)
  3. More support for multi-file implementations. The wget is nice, but won't scale past a single file. Ex: Imagine model.py implements TritonPythonModel but imports implementation.py that has all the gory details for certain features.

Just some random Tuesday ideas in my head. Core would just be updated to also look for src/model.py or whatever standard we set instead of just model.py.

Contributor

this will not work with git clone, since required model.py is in sub-directory of vllm_backend, plus clone will clone tests as well.

We can discuss the best solution at some point.

Contributor

For the ease of development, I think your earlier idea of symlinks makes more sense.

Contributor

> this will not work with git clone, since required model.py is in sub-directory of vllm_backend, plus clone will clone tests as well.

I know it won't work as-is and would require minor changes. Not necessarily asking for this feature at this time, just food for thought.

Contributor

We have a separate goal of improving python backend developer experience (more for things like debugging, ipdb, etc) somewhere in the pipeline, so this came to mind as a tangential idea.

Contributor

I see, by any chance, do you know in what ticket this is tracked? If you don't remember, then no worries


rmccorm4 marked this conversation as resolved.
## Using the vLLM Backend

You can see an example
[model_repository](samples/model_repository)
in the [samples](samples) folder.
You can use this as is and change the model by changing the `model` value in `model.json`.
`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model.
You can see supported arguments in vLLM's
[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py).
Specifically,
[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11)
and
[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).

For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in
[model.json](samples/model_repository/vllm_model/1/model.json).

Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
The sample model updates this behavior by setting gpu_memory_utilization to 50%.
You can tweak this behavior using fields like gpu_memory_utilization and other settings in
[model.json](samples/model_repository/vllm_model/1/model.json).
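As a rough illustration of the settings described above, the sketch below writes a `model.json` that sets `model`, `gpu_memory_utilization`, and `tensor_parallel_size`. The helper name `write_model_json`, the model name `facebook/opt-125m`, and the temporary output directory are assumptions made for this example; check vLLM's `arg_utils.py` for the arguments your vLLM version actually accepts.

```python
import json
import tempfile
from pathlib import Path


def write_model_json(version_dir, model, gpu_memory_utilization=0.5,
                     tensor_parallel_size=1):
    """Write a model.json whose key-value pairs are passed to the vLLM engine.

    Hypothetical helper for illustration; it is not part of the backend.
    """
    engine_args = {
        "model": model,
        # Fraction of GPU memory vLLM may claim (the sample uses 0.5).
        "gpu_memory_utilization": gpu_memory_utilization,
        # Number of GPUs to shard the model across.
        "tensor_parallel_size": tensor_parallel_size,
    }
    path = Path(version_dir) / "model.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(engine_args, indent=4) + "\n")
    return path


# A throwaway directory stands in for samples/model_repository/vllm_model/1.
demo_dir = Path(tempfile.mkdtemp()) / "vllm_model" / "1"
model_json = write_model_json(
    demo_dir, model="facebook/opt-125m", tensor_parallel_size=2
)
print(model_json.read_text())
```

Writing the file from a script like this is only a convenience; editing `model.json` by hand has the same effect.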

In the [samples](samples) folder, you can also find a sample client,
[client.py](samples/client.py).

## Running the Latest vLLM Version

To check the version of vLLM in the container, see the
[version_map](https://github.com/triton-inference-server/server/blob/85487a1e15438ccb9592b58e308a3f78724fa483/build.py#L83)
in [build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
for the Triton version you are using.
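If you already have a shell inside the container, a quick alternative is to ask Python which vLLM package is installed. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption for this example, and it reports only what is installed in the interpreter that runs it.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run this with the same Python environment that Triton uses.
print("vllm:", installed_version("vllm") or "not installed")
```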

If you would like to use a specific vLLM commit or the latest version of vLLM, you
will need to use a
[custom execution environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments).


## Sending Your First Inference

After you
[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
with the
[sample model_repository](samples/model_repository),
you can quickly run your first inference request with the
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

Try out the command below.

```
$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```
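The same request can be built from Python. The sketch below mirrors the curl command using only the standard library; the helper names `build_generate_request` and `send`, the default host `localhost:8000`, and the timeout are assumptions for this example. The shape of the response depends on the model's outputs, so the sketch returns the decoded JSON as-is.

```python
import json
import urllib.request


def build_generate_request(text_input, model="vllm_model",
                           host="localhost:8000", stream=False, temperature=0):
    """Build a POST request for Triton's generate endpoint without sending it."""
    url = f"http://{host}/v2/models/{model}/generate"
    payload = {
        "text_input": text_input,
        "parameters": {"stream": stream, "temperature": temperature},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


request = build_generate_request("What is Triton Inference Server?")
print(request.get_method(), request.full_url)
# With Triton running and the sample model loaded:
#   print(send(request))
```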

## Running Multiple Instances of Triton Server

If you are running multiple instances of Triton server with a Python-based backend,
you need to specify a different `shm-region-prefix-name` for each server. See
[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
for more information.
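To make the idea concrete, the sketch below generates one `tritonserver` command line per server, each with its own shared-memory prefix and its own ports. The helper name `triton_commands`, the `/models` path, the port spacing, and the `prefix<N>` naming are assumptions for this example. The `--backend-config=python,shm-region-prefix-name=...` form follows the linked Python backend documentation; confirm there that it applies to your setup before relying on it.

```python
def triton_commands(model_repository, count, base_port=8000):
    """Build argument lists for several Triton servers on one machine.

    Each server gets a unique shm-region-prefix-name and non-overlapping
    HTTP, gRPC, and metrics ports. Nothing is executed here.
    """
    commands = []
    for index in range(count):
        offset = index * 10  # keep the three ports of each server together
        commands.append([
            "tritonserver",
            f"--model-repository={model_repository}",
            f"--http-port={base_port + offset}",
            f"--grpc-port={base_port + offset + 1}",
            f"--metrics-port={base_port + offset + 2}",
            f"--backend-config=python,shm-region-prefix-name=prefix{index}",
        ])
    return commands


for command in triton_commands("/models", 2):
    print(" ".join(command))
```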

## Referencing the Tutorial

You can read further in the
[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM)
in the
[tutorials](https://github.com/triton-inference-server/tutorials/) repository.