Add user instructions for converting safetensors to gguf (#772)
Adds a note to
[llama_serving.md](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md)
that instructs the user how to convert a collection of `.safetensors`
weight files to a single `.gguf` file that can be used in the
instructions that follow.
vinayakdsci authored Jan 7, 2025
1 parent c71a250 commit e2cbcb4
Showing 2 changed files with 14 additions and 2 deletions.
10 changes: 10 additions & 0 deletions docs/shortfin/llm/user/llama_serving.md
@@ -87,6 +87,16 @@ LLama3.1 8b f16 model.
python -m sharktank.utils.hf_datasets llama3_8B_fp16 --local-dir $EXPORT_DIR
```

> [!NOTE]
> If you have the model weights as a collection of `.safetensors` files (downloaded from HuggingFace Model Hub, for example), you can use the `convert_hf_to_gguf.py` script from the [llama.cpp repository](https://github.com/ggerganov/llama.cpp) to convert them to a single `.gguf` file.
> ```bash
> export WEIGHTS_DIR=/path/to/safetensors/weights_directory/
> git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
> cd llama.cpp
> python3 convert_hf_to_gguf.py $WEIGHTS_DIR --outtype f16 --outfile $EXPORT_DIR/<output_gguf_name>.gguf
> ```
> The resulting GGUF file can then be used in the instructions that follow.
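
As a quick sanity check, you can confirm the conversion produced a valid GGUF file by inspecting its size and magic bytes. This is a minimal sketch; the placeholder file name matches the command above:

```bash
# A valid GGUF file begins with the 4-byte magic "GGUF".
ls -lh $EXPORT_DIR/<output_gguf_name>.gguf
head -c 4 $EXPORT_DIR/<output_gguf_name>.gguf && echo   # should print: GGUF
```
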
### Define environment variables
We'll first define some environment variables that are shared between the
6 changes: 4 additions & 2 deletions docs/user_guide.md
@@ -78,8 +78,10 @@ To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python

### Llama 3.1

To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).
To get started with Llama 3.1, please follow the [Llama User Guide][1].

* Once you've set up the Llama server in the guide above, we recommend that you use the [SGLang Frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang` guide](shortfin/llm/user/shortfin_with_sglang_frontend_language.md).
* If you would like to deploy Llama on a Kubernetes cluster, we also provide a simple set of instructions and deployment configuration to do so [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant. In order to do this leverage the [HuggingFace](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) in combination with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s convert_hf_to_gguf.py. In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.
* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant (explained in the [user guide][1]; a rough sketch also follows below). In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.

[1]: shortfin/llm/user/llama_serving.md
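
For illustration, here is a rough sketch of that variant flow, assuming `huggingface-cli` is installed and you have access to the repository; the repository ID and file names below are placeholders, not a tested recipe:

```bash
# Download the .safetensors weights for a different Llama 3.1 variant
# (the repository ID here is illustrative and may be gated on HuggingFace).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir ./llama3.1-70b

# Convert the downloaded weights into a single GGUF file using llama.cpp.
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
python3 llama.cpp/convert_hf_to_gguf.py ./llama3.1-70b \
  --outtype f16 --outfile ./llama3.1-70b-instruct-f16.gguf
```
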
