Commit
Merge branch 'main' into patch-27
mikekgfb authored Jan 6, 2025
2 parents 2b97253 + 654bb03 commit 19a6ce0
Showing 5 changed files with 306 additions and 7 deletions.
17 changes: 17 additions & 0 deletions .ci/scripts/run-docs
@@ -125,3 +125,20 @@ if [ "$1" == "native" ]; then
bash -x ./run-native.sh
echo "::endgroup::"
fi

if [ "$1" == "distributed" ]; then

echo "::group::Create script to run distributed"
python3 torchchat/utils/scripts/updown.py --file docs/distributed.md > ./run-distributed.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-distributed.sh
echo "::endgroup::"

echo "::group::Run distributed"
echo "*******************************************"
cat ./run-distributed.sh
echo "*******************************************"
bash -x ./run-distributed.sh
echo "::endgroup::"
fi
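
For context, `.ci/scripts/run-docs` dispatches on its first argument, so the new group can be exercised like the existing ones. A minimal sketch of a local invocation, assuming a torchchat checkout with working GPUs and the repository root as the working directory:

```bash
# Convert docs/distributed.md into an executable script via updown.py and run it;
# the appended "exit 1" is a failsafe that only fires if updown.py silently
# produced a script that never reaches a successful exit.
bash .ci/scripts/run-docs distributed
```
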
7 changes: 7 additions & 0 deletions README.md
@@ -69,6 +69,13 @@ aliases.
|[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories42M`.|
|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories110M`.|
|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)||Best for `generate`. Alias to `open-llama`.|
| [ibm-granite/granite-3b-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) || Alias to `granite-code` and `granite-code-3b`.|
| [ibm-granite/granite-8b-code-instruct-128k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) || Alias to `granite-code-8b`.|
| [ibm-granite/granite-3.0-2b-instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) || Alias to `granite3-2b` and `granite3`.|
| [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) || Alias to `granite3-8b`.|
| [ibm-granite/granite-3.1-2b-instruct](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) || Alias to `granite3.1-2b` and `granite3.1`.|
| [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct) || Alias to `granite3.1-8b`.|
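
Any alias in the table can be passed to torchchat commands in place of the full model name. A minimal sketch using one of the new Granite aliases (assuming the weights have been downloaded and your Hugging Face account has access):

```bash
python3 torchchat.py generate granite3.1 --prompt "Write a haiku about tensors"
```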


## Installation
The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.
125 changes: 125 additions & 0 deletions docs/distributed.md
@@ -0,0 +1,125 @@
# Distributed Inference with torchchat

torchchat supports seamless distributed inference for large language models (LLMs) on GPUs.
At present, distributed inference is supported from Python only.

## Installation
The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

> [!TIP]
> torchchat uses the latest changes from various PyTorch projects, so it's highly recommended that you use a venv (via the commands below) or conda.
[skip default]: begin
```bash
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
```
[skip default]: end

[shell default]: ./install/install_requirements.sh

## Login to HF for Downloading Weights
Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account. Create a [Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens) with the write role.

Log into Hugging Face:

[prefix default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}"

```
huggingface-cli login
```

## Enabling Distributed torchchat Inference

To enable distributed inference, use the option `--distributed`. In addition, `--tp <num>` and `--pp <num>`
let you specify the degree of tensor parallelism (`tp`) and pipeline parallelism (`pp`) to use.
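
The degrees multiply: a run occupies `tp × pp` GPUs in total, which is why the 4-GPU examples below pass `--tp 2 --pp 2`. As a hedged sketch (assuming a host with 8 GPUs), splitting the model into 4 pipeline stages, each sharded across 2 GPUs, would look like:

[skip default]: begin
```bash
python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 4 --prompt "write me a story about a boy and his bear"
```
[skip default]: end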


## Generate Output with Distributed torchchat Inference

To generate output using distributed inference with 4 GPUs, you can use:
```
python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 2 --prompt "write me a story about a boy and his bear"
```


## Chat with Distributed torchchat Inference

This mode allows you to chat with an LLM interactively using distributed inference. The following example uses 4 GPUs:

[skip default]: begin
```bash
python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2
```
[skip default]: end


## A Server with Distributed torchchat Inference

This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.

In one terminal, start the server to run with 4 GPUs:

[skip default]: begin

```bash
python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2
```
[skip default]: end

<!--
[shell default]: python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 & server_pid=$! ; sleep 180 # wait for server to be ready to accept requests
-->

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `api/api.py` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking GitHub issue](https://github.com/pytorch/torchchat/issues/973).
<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.

**Example Input + Output**

```
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"stream": "true",
"max_tokens": 200,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
```
[skip default]: begin
```
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```

[skip default]: end

<!--
[shell default]: kill ${server_pid}
-->

</details>

[end default]: end
138 changes: 138 additions & 0 deletions docs/local-model.md
@@ -0,0 +1,138 @@
# Using Local Models in torchchat
Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on utilizing local copies of
model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.

## Prerequisites
To work with local models, you need:
1. **Model Weights**: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
2. **Tokenizer**: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
3. **Parameter File**: (a) a custom parameter file in JSON format, or (b) a pre-existing parameter file selected with `--params-path`
or `--params-table`, or (c) a pathname that is matched against known models by longest substring match on the configuration name, using the same algorithm as GPT-fast.

Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection.
Let’s use a local download of the stories15M tinyllama model as an example:

```
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
```


## Using Local Checkpoints
Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. Unless specified separately, the tokenizer
is loaded from the same directory as the checkpoint under the name `tokenizer.model`.
The following example obtains the model parameters by name matching against known models, because `stories15M` is one of the
models known to torchchat, with a configuration stored in `torchchat/model_params`:


### Example 1: Basic Text Generation


```
python3 torchchat.py generate \
--checkpoint-path stories15M/stories15M.pt \
--prompt "Hello, my name is"
```


### Example 2: Providing Additional Artifacts
The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:
```
python3 torchchat.py generate \
--prompt "Once upon a time" \
--checkpoint-path stories15M/stories15M.pt \
--params-path stories15M/model.json \
--tokenizer-path stories15M/tokenizer.model
```


Alternatively, for known models, we can use `--params-table` to select a particular architecture
configuration from `torchchat/model_params`:

```
python3 torchchat.py generate \
--prompt "Once upon a time" \
--checkpoint-path stories15M/stories15M.pt \
--params-table stories15M \
--tokenizer-path stories15M/tokenizer.model
```


## Using GGUF Models
Torchchat supports loading models in GGUF format via the `--gguf-path` flag. Refer to GGUF.md for additional
documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels such as F16, F32, Q4_0, and Q6_K. Model
configuration information is obtained directly from the GGUF file, simplifying setup and obviating the
need for a separate `model.json` model architecture specification.
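
As a sketch of loading a GGUF model, assuming a local `model.gguf` with a matching `tokenizer.model` next to it (see GGUF.md for the authoritative flag set):

```
# Model architecture is read from the GGUF header, so no --params-path is needed
python3 torchchat.py generate \
  --gguf-path model.gguf \
  --tokenizer-path tokenizer.model \
  --prompt "Once upon a time"
```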


## Using Local Models
Torchchat supports all commands, such as chat, browser, server, and export, with local models. (In fact,
known models simply download their weights and populate the same parameters you would otherwise specify for a local model.)
Here is an example setup for running a server with a local model:
Here is an example setup for running a server with a local model:


[skip default]: begin
```
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
```
[skip default]: end


[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests


In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.


> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `api/api.py` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking GitHub issue](https://github.com/pytorch/torchchat/issues/973).

<details>


<summary>Example Query</summary>
Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will
await the full response from the server.


**Example: using the server**
A model server backed by a local model works like any other torchchat server. You can test it by sending a request with `curl`:
```
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"stream": "true",
"max_tokens": 200,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
```


[shell default]: kill ${server_pid}


</details>


For more information about the different commands, see the root README.md, and refer to the Advanced Users Guide for details on advanced configurations and parameter tuning.


[end default]: end
26 changes: 19 additions & 7 deletions torchchat/utils/docs/evaluation.md
@@ -23,7 +23,7 @@ The evaluation mode of `torchchat.py` script can be used to evaluate your langua

## Examples

-### Evaluation example with model in Python
+### Evaluation example with model in Python environment

Running wikitext for 10 iterations
```
@@ -35,33 +35,45 @@ Running wikitext with torch.compile for 10 iterations
python3 torchchat.py eval stories15M --compile --tasks wikitext --limit 10
```

-Running multiple tasks and calling eval.py directly (with torch.compile):
+Running multiple tasks with torch.compile for evaluation and prefill:
```
-python3 torchchat.py eval stories15M --compile --tasks wikitext hellaswag
+python3 torchchat.py eval stories15M --compile --compile-prefill --tasks wikitext hellaswag
```

### Evaluation with model exported to PTE with ExecuTorch

-Running an exported model with ExecuTorch (as PTE)
+Running an exported model with ExecuTorch (as PTE). Because you can load an
+exported PTE model back into the Python environment with torchchat, you can
+run evaluation directly on the exported model!
```
python3 torchchat.py export stories15M --output-pte-path stories15M.pte
python3 torchchat.py eval stories15M --pte-path stories15M.pte
```

-Running multiple tasks and calling eval.py directly (with PTE):
+Running multiple tasks directly on the created PTE mobile model:
```
python3 torchchat.py eval stories15M --pte-path stories15M.pte --tasks wikitext hellaswag
```

Now let's evaluate the effect of quantization by exporting with `--quantize` and an example quantization configuration:
```
python3 torchchat.py export stories15M --output-pte-path stories15M.pte --quantize torchchat/quant_config/mobile.json
python3 torchchat.py eval stories15M --pte-path stories15M.pte --tasks wikitext hellaswag
```

Now try your own export options to explore different trade-offs between model size, evaluation speed and accuracy using model quantization!

### Evaluation with model exported to DSO with AOT Inductor (AOTI)

-Running an exported model with AOT Inductor (DSO model)
+Running an exported model with AOT Inductor (DSO model). Because you can load
+an exported DSO model back into the Python environment with torchchat, you can
+run evaluation directly on the exported model!
```
python3 torchchat.py export stories15M --dtype fast16 --output-dso-path stories15M.so
python3 torchchat.py eval stories15M --dtype fast16 --dso-path stories15M.so
```

-Running multiple tasks and calling eval.py directly (with AOTI):
+Running multiple tasks with AOTI:
```
python3 torchchat.py eval stories15M --dso-path stories15M.so --tasks wikitext hellaswag
```
