SGLang doc user flow updates (#703)
- Expand examples in user docs to cover all supported features
- Default docs to targeting cluster application
- General cleanup and removal of unnecessary text
- Add k8s instructions for shortfin deployment

---------

Co-authored-by: saienduri <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
4 people authored Dec 23, 2024
1 parent 2776186 commit f5e9cb4
Showing 5 changed files with 276 additions and 135 deletions.
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
args: ['--allow-multiple-documents']
- id: check-added-large-files
- repo: https://github.com/psf/black
rev: 22.10.0
@@ -272,3 +272,32 @@ If you want to find the process again:
```bash
ps -f | grep shortfin
```

## Server Options

To see the flags available for configuring the server, run:

```bash
python -m shortfin_apps.llm.server --help
```

A full list of options can be found below:

| Argument | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--host HOST` | Specify the host to bind the server. |
| `--port PORT` | Specify the port to bind the server. |
| `--root-path ROOT_PATH` | Root path to use for installing behind a path-based proxy. |
| `--timeout-keep-alive TIMEOUT_KEEP_ALIVE` | Keep-alive timeout duration. |
| `--tokenizer_json TOKENIZER_JSON` | Path to a `tokenizer.json` file. |
| `--tokenizer_config_json TOKENIZER_CONFIG_JSON` | Path to a `tokenizer_config.json` file. |
| `--model_config MODEL_CONFIG` | Path to the model config file. |
| `--vmfb VMFB` | Model [VMFB](https://iree.dev/developers/general/developer-tips/#inspecting-vmfb-files) to load. |
| `--parameters [FILE ...]` | Parameter archives to load (supports: `gguf`, `irpa`, `safetensors`). |
| `--device {local-task,hip,amdgpu}` | Device to serve on (e.g., `local-task`, `hip`). Same options as [iree-run-module --list_drivers](https://iree.dev/guides/deployment-configurations/gpu-rocm/#get-the-iree-runtime). |
| `--device_ids [DEVICE_IDS ...]` | Device IDs visible to the system builder. Defaults to None (full visibility). Can be an index or a device ID like `amdgpu:0:0@0`. |
| `--isolation {none,per_fiber,per_call}` | Concurrency control: How to isolate programs. |
| `--amdgpu_async_allocations` | Enable asynchronous allocations for AMD GPU device contexts. |
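
For example, a typical invocation combining several of these flags might look like the following (a sketch only; every path below is a placeholder for the artifacts you generated during export):

```bash
# All paths are hypothetical -- point them at your exported artifacts.
python -m shortfin_apps.llm.server \
  --tokenizer_json=/path/to/tokenizer.json \
  --model_config=/path/to/config.json \
  --vmfb=/path/to/model.vmfb \
  --parameters=/path/to/model.irpa \
  --device=hip \
  --port=8000
```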
44 changes: 44 additions & 0 deletions docs/shortfin/llm/user/llama_serving_on_kubernetes.md
@@ -0,0 +1,44 @@
# Llama 8b GPU instructions on Kubernetes

## Setup

We will use `llama_8b_f16` as an example to describe the process of
exporting a model and deploying four instances of a shortfin LLM server
behind a load balancer on MI300X GPUs.

### Prerequisites

- A Kubernetes cluster available to use
- `kubectl` installed on your system and configured for the cluster of interest
  - To install kubectl, see [kubectl install](https://kubernetes.io/docs/tasks/tools/#kubectl),
and make sure to set the `KUBECONFIG` environment variable to point to your kubeconfig file
so that you are authorized to connect to the cluster.

### Deploy shortfin llama app service

To generate the artifacts required for this k8s deployment, follow [llama_serving.md](./llama_serving.md) until you have all of the files needed to run the shortfin LLM server.
Upload your artifacts to a storage location that your k8s cluster can pull from (NFS, S3, CSP).
Save [llama-app-deployment.yaml](../../../../shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml) locally, then edit it to reference the artifacts you just stored and to set the flags for your intended configuration.
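
As a rough orientation, the fields you will typically touch look something like the following (a hypothetical excerpt using assumed names and paths, not the shipped file; the flags mirror the server options documented above):

```yaml
# Hypothetical excerpt -- the shipped llama-app-deployment.yaml is the source
# of truth; only the fields you would typically edit are sketched here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shark-llama-app-deployment
spec:
  replicas: 4  # one pod per shortfin server instance behind the load balancer
  template:
    spec:
      containers:
        - name: shark-llama-app
          # Point these flags at the artifacts you uploaded to shared storage.
          command: ["python", "-m", "shortfin_apps.llm.server"]
          args:
            - --tokenizer_json=/data/tokenizer.json
            - --model_config=/data/config.json
            - --vmfb=/data/model.vmfb
            - --parameters=/data/model.irpa
            - --device=hip
```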

To deploy the llama app:

```bash
kubectl apply -f llama-app-deployment.yaml
```

To retrieve the external IP of the llama app load balancer:

```bash
kubectl get service shark-llama-app-service
```

Now you can use the external IP for SGLang integration or for sending text generation requests directly.
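
For example, a text generation request can be sent straight to the load balancer with `curl` (a minimal sketch: the `/generate` endpoint and payload shape assume the shortfin LLM server's SGLang-style API; adjust to your deployment):

```bash
# <EXTERNAL-IP> is the address reported by `kubectl get service` above.
# Endpoint and payload shape assume the SGLang-style /generate API.
curl http://<EXTERNAL-IP>/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}'
```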

### Delete shortfin llama app service

When you are done using the app, make sure to delete the deployment and service:

```bash
kubectl delete deployment shark-llama-app-deployment
kubectl delete service shark-llama-app-service
```