Add user instructions for converting safetensors to gguf (#772)
Adds a note to
[llama_serving.md](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/llama_serving.md)
that instructs the user how to convert a collection of `.safetensors`
weight files to a single `.gguf` file that can be used in the
instructions that follow.
vinayakdsci authored Jan 7, 2025
1 parent c71a250 commit e2cbcb4
Showing 2 changed files with 14 additions and 2 deletions.
10 changes: 10 additions & 0 deletions docs/shortfin/llm/user/llama_serving.md
@@ -87,6 +87,16 @@ LLama3.1 8b f16 model.
python -m sharktank.utils.hf_datasets llama3_8B_fp16 --local-dir $EXPORT_DIR
```

> [!NOTE]
> If you have the model weights as a collection of `.safetensors` files (downloaded from HuggingFace Model Hub, for example), you can use the `convert_hf_to_gguf.py` script from the [llama.cpp repository](https://github.com/ggerganov/llama.cpp) to convert them to a single `.gguf` file.
> ```bash
> export WEIGHTS_DIR=/path/to/safetensors/weights_directory/
> git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
> cd llama.cpp
> python3 convert_hf_to_gguf.py $WEIGHTS_DIR --outtype f16 --outfile $EXPORT_DIR/<output_gguf_name>.gguf
> ```
> The resulting GGUF file can then be used in the instructions that follow.
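
As a quick sanity check, you can confirm the conversion produced a valid GGUF file by inspecting its size and magic bytes. This is a minimal sketch; the placeholder file name matches the command above:

```bash
# A valid GGUF file begins with the 4-byte magic "GGUF".
ls -lh $EXPORT_DIR/<output_gguf_name>.gguf
head -c 4 $EXPORT_DIR/<output_gguf_name>.gguf && echo   # should print: GGUF
```
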
### Define environment variables
We'll first define some environment variables that are shared between the
6 changes: 4 additions & 2 deletions docs/user_guide.md
@@ -78,8 +78,10 @@ To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python

### Llama 3.1

To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).
To get started with Llama 3.1, please follow the [Llama User Guide][1].

* Once you've set up the Llama server in the guide above, we recommend that you use the [SGLang Frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang` guide](shortfin/llm/user/shortfin_with_sglang_frontend_language.md).
* If you would like to deploy Llama on a Kubernetes cluster, we also provide a simple set of instructions and deployment configuration to do so [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant. In order to do this leverage the [HuggingFace](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) in combination with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s convert_hf_to_gguf.py. In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.
* Finally, if you'd like to leverage the instructions above to run against a different variant of Llama 3.1, it's supported. However, you will need to generate a gguf dataset for that variant (explained in the [user guide][1]; a rough sketch also follows below). In future releases, we plan to streamline these instructions to make it easier for users to compile their own models from HuggingFace.

[1]: shortfin/llm/user/llama_serving.md
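
For illustration, here is a rough sketch of that variant flow, assuming `huggingface-cli` is installed and you have access to the repository; the repository ID and file names below are placeholders, not a tested recipe:

```bash
# Download the .safetensors weights for a different Llama 3.1 variant
# (the repository ID here is illustrative and may be gated on HuggingFace).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir ./llama3.1-70b

# Convert the downloaded weights into a single GGUF file using llama.cpp.
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
python3 llama.cpp/convert_hf_to_gguf.py ./llama3.1-70b \
  --outtype f16 --outfile ./llama3.1-70b-instruct-f16.gguf
```
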
