diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md
new file mode 100644
index 0000000000000..63f1dfe801e79
--- /dev/null
+++ b/docs/genai/api/c.md
@@ -0,0 +1,590 @@
+---
+title: C API
+description: C API reference for ONNX Runtime GenAI
+has_children: false
+parent: API docs
+grand_parent: Generative AI (Preview)
+nav_order: 3
+---
+
+# ONNX Runtime GenAI C API
+
+_Note: this API is in preview and is subject to change._
+
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+
+## Overview
+
+## Model API
+
+### Create model
+
+Creates a model from the given configuration directory.
+
+#### Parameters
+
+* Input: config_path The path to the model configuration directory. The path is expected to be encoded in UTF-8.
+* Output: out The created model.
+
+#### Returns
+
+`OgaResult` containing the error message if the model creation failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaModel** out);
+```
+
+### Destroy model
+
+Destroys the given model.
+
+#### Parameters
+
+* Input: model The model to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyModel(OgaModel* model);
+```
+
+### Generate
+
+Generates an array of token sequences by executing the model with the given generator params.
+
+#### Parameters
+
+* Input: model The model to use for generation.
+* Input: generator_params The parameters to use for generation.
+* Output: out The generated sequences of tokens. The caller is responsible for freeing the sequences with OgaDestroySequences when done with them.
+
+#### Returns
+
+`OgaResult` containing the error message if the generation failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerate(const OgaModel* model, const OgaGeneratorParams* generator_params, OgaSequences** out);
+```
+
+## Tokenizer API
+
+### Create Tokenizer
+
+Creates a tokenizer for the given model.
+
+#### Parameters
+
+* Input: model The model for which the tokenizer should be created.
+* Output: out The created tokenizer.
+
+#### Returns
+
+`OgaResult` containing the error message if the tokenizer creation failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizer(const OgaModel* model, OgaTokenizer** out);
+```
+
+### Destroy Tokenizer
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizer(OgaTokenizer*);
+```
+
+### Encode
+
+Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences when it is no longer needed.
+
+#### Parameters
+
+* Input: tokenizer The tokenizer to use for encoding.
+* Input: str The string to encode.
+* Input: sequences The OgaSequences to which the encoded sequence of tokens is added.
+
+#### Returns
+
+`OgaResult` containing the error message if the encoding failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncode(const OgaTokenizer*, const char* str, OgaSequences* sequences);
+```
+
+### Decode
+
+Decodes a single token sequence and returns a null-terminated UTF-8 string.
+`out_string` must be freed with OgaDestroyString when it is no longer needed.
+
+#### Parameters
+
+* Input: tokenizer The tokenizer to use for decoding.
+* Input: tokens The array of tokens to decode.
+* Input: token_count The number of tokens in the array.
+* Output: out_string The decoded string.
+
+#### Returns
+
+`OgaResult` containing the error message if the decoding failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecode(const OgaTokenizer*, const int32_t* tokens, size_t token_count, const char** out_string);
+```
+
+### Encode batch
+
+Encodes a batch of strings.
+
+#### Parameters
+
+* Input: tokenizer The tokenizer to use for encoding.
+* Input: strings The array of strings to encode.
+* Input: count The number of strings in the array.
+* Output: out The encoded sequences of tokens.
+
+#### Returns
+
+`OgaResult` containing the error message if the encoding failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncodeBatch(const OgaTokenizer*, const char** strings, size_t count, OgaSequences** out);
+```
+
+### Decode batch
+
+Decodes a batch of token sequences. The returned strings must be freed with OgaTokenizerDestroyStrings.
+
+#### Parameters
+
+* Input: tokenizer The tokenizer to use for decoding.
+* Input: tokens The sequences of tokens to decode.
+* Output: out_strings The decoded strings.
+
+#### Returns
+
+`OgaResult` containing the error message if the decoding failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecodeBatch(const OgaTokenizer*, const OgaSequences* tokens, const char*** out_strings);
+```
+
+### Destroy tokenizer strings
+
+Destroys the strings returned by OgaTokenizerDecodeBatch.
+
+#### Parameters
+
+* Input: strings The strings to be destroyed.
+* Input: count The number of strings.
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaTokenizerDestroyStrings(const char** strings, size_t count);
+```
+
+### Create tokenizer stream
+
+OgaTokenizerStream is used to decode token strings incrementally, one token at a time.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizerStream(const OgaTokenizer*, OgaTokenizerStream** out);
+```
+
+### Destroy tokenizer stream
+
+#### Parameters
+
+* Input: tokenizer_stream The tokenizer stream to be destroyed.
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizerStream(OgaTokenizerStream*);
+```
+
+### Decode stream
+
+Decodes a single token in the stream. If this results in a word being generated, it will be returned in `out`. The caller is responsible for concatenating each chunk together to produce the complete result.
+`out` is valid until the next call to OgaTokenizerStreamDecode, or until the OgaTokenizerStream is destroyed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerStreamDecode(OgaTokenizerStream*, int32_t token, const char** out);
+```
+
+## Generator Params API
+
+### Create Generator Params
+
+Creates an OgaGeneratorParams from the given model.
+
+#### Parameters
+
+* Input: model The model to use for generation.
+* Output: out The created generator params.
+
+#### Returns
+
+`OgaResult` containing the error message if the generator params creation failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out);
+```
+
+### Destroy Generator Params
+
+Destroys the given generator params.
+
+#### Parameters
+
+* Input: generator_params The generator params to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params);
+```
+
+### Set search option (number)
+
+Sets a search option where the option is a number.
+
+#### Parameters
+
+* generator_params: The generator params object on which to set the parameter
+* name: the name of the parameter
+* value: the value to set
+
+#### Returns
+
+`OgaResult` containing the error message if setting the search option failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetSearchNumber(OgaGeneratorParams* generator_params, const char* name, double value);
+```
+
+### Set search option (bool)
+
+Sets a search option where the option is a bool.
+
+#### Parameters
+
+* generator_params: The generator params object on which to set the parameter
+* name: the name of the parameter
+* value: the value to set
+
+#### Returns
+
+`OgaResult` containing the error message if setting the search option failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetSearchBool(OgaGeneratorParams* generator_params, const char* name, bool value);
+```
+
+### Set inputs
+
+Sets the input ids for the generator params. The input ids are used to seed the generation.
+
+#### Parameters
+
+* Input: generator_params The generator params on which to set the input ids.
+* Input: input_ids The input ids array of size input_ids_count = batch_size * sequence_length.
+* Input: input_ids_count The total number of input ids.
+* Input: sequence_length The sequence length of the input ids.
+* Input: batch_size The batch size of the input ids.
+
+#### Returns
+
+`OgaResult` containing the error message if setting the input ids failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputIDs(OgaGeneratorParams* generator_params, const int32_t* input_ids, size_t input_ids_count, size_t sequence_length, size_t batch_size);
+```
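+
+For example, a minimal sketch of seeding generation with raw input ids (the variables shown are illustrative, and error results should be checked as described above):
+
+```c
+/* Assumes `model` was created with OgaCreateModel, and that `input_ids`
+   holds batch_size * sequence_length tokens, e.g. from a tokenizer. */
+OgaGeneratorParams* params = NULL;
+OgaCreateGeneratorParams(model, &params);
+OgaGeneratorParamsSetSearchNumber(params, "max_length", 256);
+OgaGeneratorParamsSetSearchBool(params, "do_sample", false);
+OgaGeneratorParamsSetInputIDs(params, input_ids, input_ids_count, sequence_length, batch_size);
+```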
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_Top(OgaGenerator* generator);
+```
+
+### Generate next token with Top K sampling
+
+Generates the next token based on the computed logits using Top K sampling.
+
+#### Parameters
+
+* Input: generator The generator for which to generate the next token.
+* Input: k The number of highest-probability tokens to sample from.
+* Input: t The sampling temperature.
+
+#### Returns
+
+`OgaResult` containing the error message if the generation of the next token failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopK(OgaGenerator* generator, int k, float t);
+```
+
+### Generate next token with Top P sampling
+
+Generates the next token based on the computed logits using Top P sampling.
+
+#### Parameters
+
+* Input: generator The generator for which to generate the next token.
+* Input: p The cumulative probability threshold for sampling.
+* Input: t The sampling temperature.
+
+#### Returns
+
+`OgaResult` containing the error message if the generation of the next token failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopP(OgaGenerator* generator, float p, float t);
+```
+
+### Get number of tokens
+
+Returns the number of tokens in the sequence at the given index.
+
+#### Parameters
+
+* Input: generator The generator from which to get the token count.
+* Input: index The index of the sequence.
+
+#### Returns
+
+The number of tokens in the sequence at the given index.
+
+```c
+OGA_EXPORT size_t OGA_API_CALL OgaGenerator_GetSequenceLength(const OgaGenerator* generator, size_t index);
+```
+
+### Get sequence
+
+Returns a pointer to the sequence data at the given index. The number of tokens in the sequence is given by OgaGenerator_GetSequenceLength.
+
+#### Parameters
+
+* Input: generator The generator from which to get the sequence.
+* Input: index The index of the sequence.
+
+#### Returns
+
+A pointer to the token sequence. The sequence data is owned by the OgaGenerator and will be freed when the OgaGenerator is destroyed. The caller must copy the data if it needs to be used after the OgaGenerator is destroyed.
+
+```c
+OGA_EXPORT const int32_t* OGA_API_CALL OgaGenerator_GetSequence(const OgaGenerator* generator, size_t index);
+```
+
+## Enums and structs
+
+```c
+typedef enum OgaDataType {
+  OgaDataType_int32,
+  OgaDataType_float32,
+  OgaDataType_string, // UTF8 string
+} OgaDataType;
+```
+
+```c
+typedef struct OgaResult OgaResult;
+typedef struct OgaGeneratorParams OgaGeneratorParams;
+typedef struct OgaGenerator OgaGenerator;
+typedef struct OgaModel OgaModel;
+typedef struct OgaBuffer OgaBuffer;
+```
+
+
+## Utility functions
+
+### Get error message
+
+#### Parameters
+
+* Input: result OgaResult that contains the error message.
+
+#### Returns
+
+The error message contained in the OgaResult. The const char* is owned by the OgaResult and will be freed when the OgaResult is destroyed.
+
+```c
+OGA_EXPORT const char* OGA_API_CALL OgaResultGetError(OgaResult* result);
+```
+
+### Destroy result
+
+#### Parameters
+
+* Input: result OgaResult to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyResult(OgaResult*);
+```
+
+### Destroy string
+
+#### Parameters
+
+* Input: string The string to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyString(const char*);
+```
+
+### Destroy buffer
+
+#### Parameters
+
+* Input: buffer The buffer to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroyBuffer(OgaBuffer*);
+```
+
+### Get buffer type
+
+#### Parameters
+
+* Input: the buffer
+
+#### Returns
+
+The type of the buffer
+
+```c
+OGA_EXPORT OgaDataType OGA_API_CALL OgaBufferGetType(const OgaBuffer*);
+```
+
+### Get the number of dimensions of a buffer
+
+#### Parameters
+
+* Input: the buffer
+
+#### Returns
+
+The number of dimensions in the buffer
+
+```c
+OGA_EXPORT size_t OGA_API_CALL OgaBufferGetDimCount(const OgaBuffer*);
+```
+
+### Get buffer dimensions
+
+Get the dimensions of a buffer.
+
+#### Parameters
+
+* Input: the buffer
+* Output: dims The dimension array, with dim_count entries.
+
+#### Returns
+
+`OgaResult` containing the error message if getting the dimensions failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaBufferGetDims(const OgaBuffer*, size_t* dims, size_t dim_count);
+```
+
+### Get buffer data
+
+Get the data from a buffer.
+
+#### Parameters
+
+* Input: the buffer
+
+#### Returns
+
+A pointer to the buffer data.
+
+```c
+OGA_EXPORT const void* OGA_API_CALL OgaBufferGetData(const OgaBuffer*);
+```
+
+### Create sequences
+
+Creates an empty OgaSequences.
+
+#### Parameters
+
+* Output: out The created sequences.
+
+#### Returns
+
+`OgaResult` containing the error message if the creation failed.
+
+```c
+OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateSequences(OgaSequences** out);
+```
+
+### Destroy sequences
+
+#### Parameters
+
+* Input: sequences OgaSequences to be destroyed.
+
+#### Returns
+
+`void`
+
+```c
+OGA_EXPORT void OGA_API_CALL OgaDestroySequences(OgaSequences* sequences);
+```
+
+### Get number of sequences
+
+Returns the number of sequences in the OgaSequences.
+
+#### Parameters
+
+* Input: sequences
+
+#### Returns
+
+The number of sequences in the OgaSequences
+
+```c
+OGA_EXPORT size_t OGA_API_CALL OgaSequencesCount(const OgaSequences* sequences);
+```
+
+### Get the number of tokens in a sequence
+
+Returns the number of tokens in the sequence at the given index.
+
+#### Parameters
+
+* Input: sequences
+* Input: sequence_index The index of the sequence.
+
+#### Returns
+
+The number of tokens in the sequence at the given index
+
+```c
+OGA_EXPORT size_t OGA_API_CALL OgaSequencesGetSequenceCount(const OgaSequences* sequences, size_t sequence_index);
+```
+
+### Get sequence data
+
+Returns a pointer to the sequence data at the given index. The number of tokens in the sequence is given by OgaSequencesGetSequenceCount.
+
+#### Parameters
+
+* Input: sequences
+* Input: sequence_index The index of the sequence.
+
+#### Returns
+
+The pointer to the sequence data at the given index. The pointer is valid until the OgaSequences is destroyed.
+
+```c
+OGA_EXPORT const int32_t* OGA_API_CALL OgaSequencesGetSequenceData(const OgaSequences* sequences, size_t sequence_index);
+```
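+
+## Example usage
+
+Putting the pieces together, the following is a minimal sketch of the high-level generation flow. The header name, model path, and prompt are assumptions for illustration, and error handling is collapsed into a single macro.
+
+```c
+#include <stdio.h>
+#include "ort_genai_c.h" /* assumed header name */
+
+/* Print and destroy an OgaResult on failure, then bail out. */
+#define CHECK(expr)                                               \
+  do {                                                            \
+    OgaResult* result_ = (expr);                                  \
+    if (result_) {                                                \
+      fprintf(stderr, "error: %s\n", OgaResultGetError(result_)); \
+      OgaDestroyResult(result_);                                  \
+      return 1;                                                   \
+    }                                                             \
+  } while (0)
+
+int main(void) {
+  OgaModel* model = NULL;
+  OgaTokenizer* tokenizer = NULL;
+  OgaSequences* input = NULL;
+  OgaGeneratorParams* params = NULL;
+  OgaSequences* output = NULL;
+  const char* text = NULL;
+
+  CHECK(OgaCreateModel("./models/example", &model)); /* placeholder path */
+  CHECK(OgaCreateTokenizer(model, &tokenizer));
+
+  /* Tokenize the prompt. */
+  CHECK(OgaCreateSequences(&input));
+  CHECK(OgaTokenizerEncode(tokenizer, "Hello, world!", input));
+
+  /* Configure and run generation. */
+  CHECK(OgaCreateGeneratorParams(model, &params));
+  CHECK(OgaGeneratorParamsSetSearchNumber(params, "max_length", 64));
+  CHECK(OgaGeneratorParamsSetInputSequences(params, input));
+  CHECK(OgaGenerate(model, params, &output));
+
+  /* Decode and print the first output sequence. */
+  CHECK(OgaTokenizerDecode(tokenizer, OgaSequencesGetSequenceData(output, 0),
+                           OgaSequencesGetSequenceCount(output, 0), &text));
+  printf("%s\n", text);
+
+  OgaDestroyString(text);
+  OgaDestroySequences(output);
+  OgaDestroyGeneratorParams(params);
+  OgaDestroySequences(input);
+  OgaDestroyTokenizer(tokenizer);
+  OgaDestroyModel(model);
+  return 0;
+}
+```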
diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md
new file mode 100644
index 0000000000000..86b566f451cc2
--- /dev/null
+++ b/docs/genai/api/csharp.md
@@ -0,0 +1,157 @@
+---
+title: C# API
+description: C# API reference for ONNX Runtime GenAI
+has_children: false
+parent: API docs
+grand_parent: Generative AI (Preview)
+nav_order: 2
+---
+
+# ONNX Runtime GenAI C# API
+
+_Note: this API is in preview and is subject to change._
+
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+## Overview
+
+## Model class
+
+### Constructor
+
+```csharp
+public Model(string modelPath)
+```
+
+### Generate method
+
+```csharp
+public Sequences Generate(GeneratorParams generatorParams)
+```
+
+## Tokenizer class
+
+### Constructor
+
+```csharp
+public Tokenizer(Model model)
+```
+
+### Encode method
+
+```csharp
+public Sequences Encode(string str)
+```
+
+### Encode batch method
+
+```csharp
+public Sequences EncodeBatch(string[] strings)
+```
+
+### Decode method
+
+```csharp
+public string Decode(ReadOnlySpan<int> sequence)
+```
+
+### Decode batch method
+
+```csharp
+public string[] DecodeBatch(Sequences sequences)
+```
+
+### Create stream method
+
+```csharp
+public TokenizerStream CreateStream()
+```
+
+## TokenizerStream class
+
+### Decode method
+
+```csharp
+public string Decode(int token)
+```
+
+## GeneratorParams class
+
+### Constructor
+
+```csharp
+public GeneratorParams(Model model)
+```
+
+### Set search option (double) method
+
+```csharp
+public void SetSearchOption(string searchOption, double value)
+```
+
+### Set search option (bool) method
+
+```csharp
+public void SetSearchOption(string searchOption, bool value)
+```
+
+### Set input ids method
+
+```csharp
+public void SetInputIDs(ReadOnlySpan<int> inputIDs, ulong sequenceLength, ulong batchSize)
+```
+
+### Set input sequences method
+
+```csharp
+public void SetInputSequences(Sequences sequences)
+```
+
+## Generator class
+
+### Constructor
+
+```csharp
+public Generator(Model model, GeneratorParams generatorParams)
+```
+
+### Is done method
+
+```csharp
+public bool IsDone()
+```
+
+### Compute logits
+
+```csharp
+public void ComputeLogits()
+```
+
+### Generate next token method
+
+```csharp
+public void GenerateNextTokenTop()
+```
+
+## Sequences class
+
+### Num sequences member
+
+```csharp
+public ulong NumSequences { get { return _numSequences; } }
+```
+
+### [] operator
+
+```csharp
+public ReadOnlySpan<int> this[ulong sequenceIndex]
+```
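+
+## Example usage
+
+As a minimal sketch of how these classes compose (the namespace import and model path below are assumptions for illustration):
+
+```csharp
+using System;
+using Microsoft.ML.OnnxRuntimeGenAI; // assumed namespace of the NuGet package
+
+class Program
+{
+    static void Main()
+    {
+        // Load the model from a local folder (placeholder path).
+        var model = new Model("./models/example");
+        var tokenizer = new Tokenizer(model);
+
+        // Tokenize the prompt and configure generation.
+        var sequences = tokenizer.Encode("Hello, world!");
+        var generatorParams = new GeneratorParams(model);
+        generatorParams.SetSearchOption("max_length", 64);
+        generatorParams.SetInputSequences(sequences);
+
+        // Generate and decode the first output sequence.
+        var outputSequences = model.Generate(generatorParams);
+        Console.WriteLine(tokenizer.Decode(outputSequences[0]));
+    }
+}
+```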
diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md
new file mode 100644
index 0000000000000..1684099508fa4
--- /dev/null
+++ b/docs/genai/api/index.md
@@ -0,0 +1,9 @@
+---
+title: API docs
+description: API documentation for ONNX Runtime GenAI
+parent: Generative AI (Preview)
+has_children: true
+nav_order: 2
+---
+
+_Note: this API is in preview and is subject to change._
diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md
new file mode 100644
index 0000000000000..52adeac3cba69
--- /dev/null
+++ b/docs/genai/api/python.md
@@ -0,0 +1,299 @@
+---
+title: Python API
+description: Python API reference for ONNX Runtime GenAI
+has_children: false
+parent: API docs
+grand_parent: Generative AI (Preview)
+nav_order: 1
+---
+
+# Python API
+
+_Note: this API is in preview and is subject to change._
+
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+## Install and import
+
+The Python API is delivered by the onnxruntime-genai Python package.
+
+```bash
+pip install onnxruntime-genai
+```
+
+```python
+import onnxruntime_genai
+```
+
+## Model class
+
+### Load the model
+
+Loads the ONNX model(s) and configuration from a folder on disk.
+
+```python
+onnxruntime_genai.Model(model_folder: str) -> onnxruntime_genai.Model
+```
+
+#### Parameters
+
+- `model_folder`: (Required) Location of model and configuration on disk
+- `device`: (Optional) The device to run on. One of:
+  - `onnxruntime_genai.CPU`
+  - `onnxruntime_genai.CUDA`
+
+  Defaults to CPU if not specified.
+
+#### Returns
+
+`onnxruntime_genai.Model`
+
+### Generate method
+
+```python
+onnxruntime_genai.Model.generate(params: GeneratorParams) -> numpy.ndarray[int, int]
+```
+
+#### Parameters
+
+- `params`: (Required) An instance of the `GeneratorParams` class.
+
+#### Returns
+
+`numpy.ndarray[int, int]`: a two-dimensional numpy array whose dimensions are the size of the batch passed in and the maximum length of the generated token sequences.
+
+
+## GeneratorParams class
+
+### Create GeneratorParams object
+
+```python
+onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams
+```
+
+#### Parameters
+
+- `model`: (Required) The model that was loaded by onnxruntime_genai.Model()
+
+#### Returns
+
+`onnxruntime_genai.GeneratorParams`: The GeneratorParams object
+
+### Input_ids member
+
+```python
+onnxruntime_genai.GeneratorParams.input_ids = numpy.ndarray[numpy.int32, numpy.int32]
+```
+
+### Set search options method
+
+```python
+onnxruntime_genai.GeneratorParams.set_search_options(options: dict[str, Any])
+```
+
+## Tokenizer class
+
+### Create tokenizer object
+
+```python
+onnxruntime_genai.Tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer
+```
+
+#### Parameters
+
+- `model`: (Required) The model that was loaded by `Model()`
+
+#### Returns
+
+- `Tokenizer`: The tokenizer object
+
+### Encode
+
+```python
+onnxruntime_genai.Tokenizer.encode(text: str) -> numpy.ndarray[numpy.int32]
+```
+
+#### Parameters
+
+- `text`: (Required) The text to encode
+
+#### Returns
+
+`numpy.ndarray[numpy.int32]`: an array of tokens representing the prompt
+
+### Decode
+
+```python
+onnxruntime_genai.Tokenizer.decode(tokens: numpy.ndarray[numpy.int32]) -> str
+```
+
+#### Parameters
+
+- `tokens`: (Required) a sequence of generated tokens
+
+#### Returns
+
+`str`: the decoded generated tokens
+
+
+### Encode batch
+
+```python
+onnxruntime_genai.Tokenizer.encode_batch(texts: list[str]) -> numpy.ndarray[int, int]
+```
+
+#### Parameters
+
+- `texts`: A list of inputs
+
+#### Returns
+
+`numpy.ndarray[int, int]`: The batch of tokenized strings
+
+### Decode batch
+
+```python
+onnxruntime_genai.Tokenizer.decode_batch(tokens: list[list[numpy.int32]]) -> list[str]
+```
+
+#### Parameters
+
+- `tokens`: (Required) A batch of token sequences to decode
+
+#### Returns
+
+`list[str]`: a batch of decoded text
+
+
+### Create tokenizer decoding stream
+
+```python
+onnxruntime_genai.Tokenizer.create_stream() -> TokenizerStream
+```
+
+#### Parameters
+
+None
+
+#### Returns
+
+`onnxruntime_genai.TokenizerStream` The tokenizer stream object
+
+## TokenizerStream class
+
+This class accumulates the next displayable string (according to the tokenizer's vocabulary).
+
+### Decode method
+
+```python
+onnxruntime_genai.TokenizerStream.decode(token: int32) -> str
+```
+
+#### Parameters
+
+- `token`: (Required) A token to decode
+
+#### Returns
+
+`str`: If a displayable string has accumulated, this method returns it. If not, this method returns the empty string.
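+
+For example, a minimal decoding-loop sketch that prints words as they are produced, using the `Generator` class documented below (the model folder and search options are placeholders):
+
+```python
+import onnxruntime_genai as og
+
+model = og.Model("./example-models/phi2-int4-cpu")  # placeholder path
+tokenizer = og.Tokenizer(model)
+stream = tokenizer.create_stream()
+
+params = og.GeneratorParams(model)
+params.set_search_options({"max_length": 64})
+params.input_ids = tokenizer.encode("Hello, world!")
+
+generator = og.Generator(model, params)
+while not generator.is_done():
+    generator.compute_logits()
+    generator.generate_next_token()
+    # Print the accumulated display string, if any, for the latest token
+    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
+```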
+
+## Generator class
+
+### Create a Generator
+
+```python
+onnxruntime_genai.Generator(model: Model, params: GeneratorParams) -> Generator
+```
+
+#### Parameters
+
+- `model`: (Required) The model to use for generation
+- `params`: (Required) The set of parameters that control the generation
+
+#### Returns
+
+`onnxruntime_genai.Generator` The Generator object
+
+
+### Is generation done
+
+```python
+onnxruntime_genai.Generator.is_done() -> bool
+```
+
+#### Returns
+
+Returns true when all sequences are at max length, or have reached the end of sequence.
+
+
+### Compute logits
+
+Runs the model through one iteration.
+
+```python
+onnxruntime_genai.Generator.compute_logits()
+```
+
+### Generate next token
+
+Using the current set of logits and the specified generator parameters, calculates the next batch of tokens.
+
+```python
+onnxruntime_genai.Generator.generate_next_token()
+```
+
+### Generate next token with Top P sampling
+
+Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top P sampling.
+
+```python
+onnxruntime_genai.Generator.generate_next_token_top_p()
+```
+
+### Generate next token with Top K sampling
+
+Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top K sampling.
+
+```python
+onnxruntime_genai.Generator.generate_next_token_top_k()
+```
+
+### Generate next token with Top K and Top P sampling
+
+Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top K followed by Top P sampling.
+
+```python
+onnxruntime_genai.Generator.generate_next_token_top_k_top_p()
+```
+
+### Get next tokens
+
+```python
+onnxruntime_genai.Generator.get_next_tokens() -> numpy.ndarray[numpy.int32]
+```
+
+#### Returns
+
+`numpy.ndarray[numpy.int32]`: The most recently generated tokens
+
+### Get sequence
+
+```python
+onnxruntime_genai.Generator.get_sequence(index: int) -> numpy.ndarray[numpy.int32]
+```
+
+#### Parameters
+
+- `index`: (Required) The index of the sequence in the batch to return
+
+#### Returns
+
+`numpy.ndarray[numpy.int32]`: The sequence of tokens at the given index
\ No newline at end of file
diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md
new file mode 100644
index 0000000000000..71c345cd9d365
--- /dev/null
+++ b/docs/genai/howto/build-from-source.md
@@ -0,0 +1,93 @@
+---
+title: Build from source
+description: How to build ONNX Runtime GenAI from source
+has_children: false
+parent: How to
+grand_parent: Generative AI (Preview)
+nav_order: 2
+---
+
+# Build onnxruntime-genai from source
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+## Prerequisites
+
+`cmake`
+
+## Build steps
+
+1. Clone this repo
+
+   ```bash
+   git clone https://github.com/microsoft/onnxruntime-genai
+   cd onnxruntime-genai
+   ```
+
+2. Install ONNX Runtime
+
+   By default, the onnxruntime-genai build expects to find the ONNX Runtime include files and binaries in a folder called `ort` in the root directory of onnxruntime-genai. You can put the ONNX Runtime files in a different location and specify this location to the onnxruntime-genai build.
+   These instructions use `ORT_HOME` as the location.
+
+   * Install from release
+
+     These instructions are for the Linux GPU build of ONNX Runtime. Replace the archive with the one for your operating system and target of choice.
+
+     ```bash
+     cd $ORT_HOME
+     wget https://github.com/microsoft/onnxruntime/releases/download/v1.17.0/onnxruntime-linux-x64-gpu-1.17.0.tgz
+     tar xvzf onnxruntime-linux-x64-gpu-1.17.0.tgz
+     mv onnxruntime-linux-x64-gpu-1.17.0/include .
+     mv onnxruntime-linux-x64-gpu-1.17.0/lib .
+     ```
+
+   * Or build from source
+
+     ```bash
+     git clone https://github.com/microsoft/onnxruntime.git
+     cd onnxruntime
+     ```
+
+     Create include and lib folders in the ORT_HOME directory
+
+     ```bash
+     mkdir $ORT_HOME/include
+     mkdir $ORT_HOME/lib
+     ```
+
+     Build from source and copy the include files and libraries into ORT_HOME
+
+     On Windows
+
+     ```cmd
+     build.bat --build_shared_lib --skip_tests --parallel [--use_cuda]
+     copy include\onnxruntime\core\session\onnxruntime_c_api.h %ORT_HOME%\include
+     copy build\Windows\Debug\Debug\*.dll %ORT_HOME%\lib
+     ```
+
+     On Linux
+
+     ```bash
+     ./build.sh --build_shared_lib --skip_tests --parallel [--use_cuda]
+     cp include/onnxruntime/core/session/onnxruntime_c_api.h $ORT_HOME/include
+     cp build/Linux/RelWithDebInfo/libonnxruntime*.so* $ORT_HOME/lib
+     ```
+
+3. Build onnxruntime-genai
+
+   If you are building for CUDA, add the `--cuda_home` argument.
+
+   ```bash
+   cd ..
+   python build.py [--cuda_home <path_to_cuda_home>]
+   ```
+
+4. Install Python wheel
+
+   ```bash
+   cd build/wheel
+   pip install *.whl
+   ```
\ No newline at end of file
diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md
new file mode 100644
index 0000000000000..408710d34ed61
--- /dev/null
+++ b/docs/genai/howto/build-model.md
@@ -0,0 +1,139 @@
+---
+title: Build models
+description: How to build models with ONNX Runtime GenAI
+has_children: false
+parent: How to
+grand_parent: Generative AI (Preview)
+nav_order: 2
+---
+
+# Generate models using Model Builder
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+The model builder greatly accelerates creating optimized and quantized ONNX models that run with ONNX Runtime GenAI.
+
+## Current Support
+
+The tool currently supports the following model architectures.
+
+- Gemma
+- LLaMA
+- Mistral
+- Phi
+
+## Usage
+
+### Full Usage
+
+For all available options, please use the `-h/--help` flag.
+
+```bash
+# From wheel:
+python3 -m onnxruntime_genai.models.builder --help
+
+# From source:
+python3 builder.py --help
+```
+
+### Original PyTorch Model from Hugging Face
+
+This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
+
+```bash
+# From wheel:
+python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
+
+# From source:
+python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
+```
+
+### Original PyTorch Model from Disk
+
+This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk).
+``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved + +# From source: +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +``` + +### Customized or Finetuned PyTorch Model +This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider + +# From source: +python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider +``` + +### GGUF Model +This scenario is where your float16/float32 GGUF model is already on disk. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files + +# From source: +python3 builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files +``` + +### Extra Options +This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx + +# From source: +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx +``` +To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above. + +### Config Only +This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime GenAI. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true + +# From source: +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true +``` + +Afterwards, please open the `genai_config.json` file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well. + +### Unit Testing Models +This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it. 
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "your_model_name"
+cache_dir = "cache_dir_to_save_hf_files"
+
+model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
+model.save_pretrained(cache_dir)
+
+tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
+tokenizer.save_pretrained(cache_dir)
+```
+
+#### Option 1: Use the model builder tool directly
+
+This option is the simplest, but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers.
+
+```bash
+# From wheel:
+python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
+
+# From source:
+python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4
+```
+
+#### Option 2: Edit the config.json file on disk and then run the model builder tool
+
+1. Navigate to where the PyTorch model and its associated files are saved on disk.
+2. Modify `num_hidden_layers` in `config.json` to your desired target (e.g. 4 layers).
+3. Run the below command for the model builder tool.
+
+```bash
+# From wheel:
+python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
+
+# From source:
+python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved
+```
\ No newline at end of file
diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md
new file mode 100644
index 0000000000000..06847318ef626
--- /dev/null
+++ b/docs/genai/howto/index.md
@@ -0,0 +1,9 @@
+---
+title: How to
+description: How to perform specific tasks with ONNX Runtime GenAI
+parent: Generative AI (Preview)
+has_children: true
+nav_order: 3
+---
+
+_Note: this API is in preview and is subject to change._
diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md
new file mode 100644
index 0000000000000..f37151ed2374b
--- /dev/null
+++ b/docs/genai/howto/install.md
@@ -0,0 +1,56 @@
+---
+title: Install
+description: Instructions to install ONNX Runtime GenAI on your target platform in your environment
+has_children: false
+parent: How to
+grand_parent: Generative AI (Preview)
+nav_order: 1
+---
+
+# Install ONNX Runtime GenAI
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+## Python package release candidates
+
+```bash
+pip install numpy
+pip install onnxruntime-genai --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
+```
+
+Append `-cuda` to the package name for the library that is optimized for CUDA environments:
+
+```bash
+pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
+```
+
+## NuGet package release candidates
+
+To install the NuGet release candidates, add a new package source in Visual Studio. Go to `Project` -> `Manage NuGet Packages`.
+
+1. Click on the `Settings` cog icon
+
+2. Click the `+` button to add a new package source
+
+   - Change the Name to `onnxruntime-genai`
+   - Change the Source to `https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/nuget/v3/index.json`
+
+3. Check the `Include prerelease` checkbox
+
+4. Add the `Microsoft.ML.OnnxRuntimeGenAI` package
+
+5. Add the `Microsoft.ML.OnnxRuntime` package
+
+To run with CUDA, use the following packages instead:
+
+- `Microsoft.ML.OnnxRuntimeGenAI.Cuda`
+- `Microsoft.ML.OnnxRuntime.Gpu`
diff --git a/docs/genai/index.md b/docs/genai/index.md
new file mode 100644
index 0000000000000..57634b8f896e3
--- /dev/null
+++ b/docs/genai/index.md
@@ -0,0 +1,18 @@
+---
+title: Generative AI (Preview)
+description: Run generative models with ONNX Runtime GenAI
+has_children: true
+nav_order: 6
+---
+
+# Generative AI with ONNX Runtime
+
+_Note: this API is in preview and is subject to change._
+
+Run generative AI models with ONNX Runtime.
+
+This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
+
+Users can call a high level `generate()` method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop.
+
+It supports greedy and beam search, as well as Top P and Top K sampling, to generate token sequences, and provides built-in logits processing such as repetition penalties. You can also easily add custom scoring.
diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md
new file mode 100644
index 0000000000000..ce3dd138b8eeb
--- /dev/null
+++ b/docs/genai/reference/config.md
@@ -0,0 +1,172 @@
+---
+title: Config reference
+description: Reference for the ONNX Runtime Generative AI configuration file
+has_children: false
+parent: Reference
+grand_parent: Generative AI (Preview)
+nav_order: 1
+---
+
+# Configuration reference
+
+_Note: this API is in preview and is subject to change._
+
+A configuration file called genai_config.json is generated automatically when a model is built with the model builder. If you provide your own model, you can copy the example below and modify it for your scenario.
+
+{: .no_toc }
+
+* TOC placeholder
+{:toc}
+
+## Example file for phi-2
+
+```json
+{
+    "model": {
+        "bos_token_id": 50256,
+        "context_length": 2048,
+        "decoder": {
+            "session_options": {
+                "log_id": "onnxruntime-genai",
+                "provider_options": [
+                    {
+                        "cuda": {}
+                    }
+                ]
+            },
+            "filename": "model.onnx",
+            "head_size": 80,
+            "hidden_size": 2560,
+            "inputs": {
+                "input_ids": "input_ids",
+                "attention_mask": "attention_mask",
+                "position_ids": "position_ids",
+                "past_key_names": "past_key_values.%d.key",
+                "past_value_names": "past_key_values.%d.value"
+            },
+            "outputs": {
+                "logits": "logits",
+                "present_key_names": "present.%d.key",
+                "present_value_names": "present.%d.value"
+            },
+            "num_attention_heads": 32,
+            "num_hidden_layers": 32,
+            "num_key_value_heads": 32
+        },
+        "eos_token_id": 50256,
+        "pad_token_id": 50256,
+        "type": "phi",
+        "vocab_size": 51200
+    },
+    "search": {
+        "diversity_penalty": 0.0,
+        "do_sample": false,
+        "early_stopping": true,
+        "length_penalty": 1.0,
+        "max_length": 20,
+        "min_length": 0,
+        "no_repeat_ngram_size": 0,
+        "num_beams": 1,
+        "num_return_sequences": 1,
+        "past_present_share_buffer": true,
+        "repetition_penalty": 1.0,
+        "temperature": 1.0,
+        "top_k": 50,
+        "top_p": 1.0
+    }
+}
+```
+
+## Configuration
+
+### Model section
+
+#### General model config
+
+* **_type_**: The type of model. Can be phi, llama or gpt.
+
+* **_vocab_size_**: The size of the vocabulary that the model processes, i.e. the number of tokens in the vocabulary.
+
+* **_bos_token_id_**: The id of the beginning of sequence token.
+
+* **_eos_token_id_**: The id of the end of sequence token.
+
+* **_pad_token_id_**: The id of the padding token.
+
+* **_context_length_**: The maximum length of sequence that the model can process.
+
+
+#### Session options
+
+These are the options that are passed to ONNX Runtime, which runs the model on each token generation iteration.
+
+* **_provider_options_**: a prioritized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item.
+
+* **_log_id_**: a prefix to output when logging.
+
+
+There is then one section for each model in the pipeline, named after the model.
+
+#### Decoder model config
+
+* **_filename_**: The name of the model file.
+
+* **_inputs_**: The names of each of the inputs. Sequences of model inputs can contain a wildcard representing the index in the sequence.
+
+* **_outputs_**: The names of each of the outputs.
+
+* **_num_attention_heads_**: The number of attention heads in the model.
+
+* **_head_size_**: The size of the attention heads.
+
+* **_hidden_size_**: The size of the hidden layers.
+
+* **_num_key_value_heads_**: The number of key value heads.
+
+
+### Generation search section
+
+* **_max_length_**: The maximum length that the model will generate.
+
+* **_min_length_**: The minimum length that the model will generate.
+
+* **_do_sample_**: Enables Top P / Top K generation. When set to true, generation uses the configured `top_p` and `top_k` values. When set to false, generation uses beam search or greedy search.
+
+* **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. If num_beams > 1, then generation is performed using beam search.
+
+* **_early_stopping_**: Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false.
+
+* **_num_return_sequences_**: The number of sequences to generate. Returns the sequences with the highest scores in order.
+
+* **_top_k_**: Only includes tokens that fall within the list of the `K` most probable tokens. Range is 1 to the vocabulary size.
+
+* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. Range is 0 to 1, exclusive of 0.
+
+* **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect.
+
+* **_repetition_penalty_**: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`.
+
+* **_length_penalty_**: Controls the length of the output generated. Values less than `1` encourage the generation to produce shorter sequences; values greater than `1` encourage longer sequences. Defaults to `1`.
+
+* **_diversity_penalty_**: Not supported.
+
+* **_no_repeat_ngram_size_**: Not supported.
+
+* **_past_present_share_buffer_**: If set to true, the past and present buffers are shared for efficiency.
+
+## Search combinations
+
+1. Beam search
+
+   - num_beams > 1
+   - do_sample = False
+
+2. Greedy search
+
+   - num_beams = 1
+   - do_sample = False
+
+3. Top P / Top K
+
+   - do_sample = True
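+
+For example, to generate with Top P / Top K sampling, the search section of genai_config.json might look like the following sketch (the values shown are illustrative, not recommendations):
+
+```json
+"search": {
+    "do_sample": true,
+    "top_k": 50,
+    "top_p": 0.9,
+    "temperature": 0.7,
+    "max_length": 256
+}
+```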
\ No newline at end of file
diff --git a/docs/genai/reference/index.md b/docs/genai/reference/index.md
new file mode 100644
index 0000000000000..f34d266dedbaf
--- /dev/null
+++ b/docs/genai/reference/index.md
@@ -0,0 +1,9 @@
+---
+title: Reference
+description: Reference information for ONNX Runtime Generative AI
+parent: Generative AI (Preview)
+has_children: true
+nav_order: 4
+---
+
+_Note: this API is in preview and is subject to change._
diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md
new file mode 100644
index 0000000000000..c05d1a1797827
--- /dev/null
+++ b/docs/genai/tutorials/index.md
@@ -0,0 +1,10 @@
+---
+title: Tutorials
+description: Build your application with ONNX Runtime GenAI
+parent: Generative AI (Preview)
+has_children: true
+nav_order: 1
+---
+
+_Note: this API is in preview and is subject to change._
+
diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md
new file mode 100644
index 0000000000000..8318c5dfd06a7
--- /dev/null
+++ b/docs/genai/tutorials/phi2-python.md
@@ -0,0 +1,108 @@
+---
+title: Python phi-2 tutorial
+description: Learn how to write a language generation application with ONNX Runtime GenAI in Python using the phi-2 model
+has_children: false
+parent: Tutorials
+grand_parent: Generative AI (Preview)
+nav_order: 1
+---
+
+# Language generation in Python with phi-2
+
+## Setup and installation
+
+Install the ONNX Runtime GenAI Python package using the [installation instructions](../howto/install.md).
+
+## Build phi-2 ONNX model
+
+The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Hugging Face. The tool also allows you to download the weights from Hugging Face, load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-model.md).
+
+If using the `-m` option shown here, you will need to log in to Hugging Face.
+
+```bash
+pip install huggingface-hub
+huggingface-cli login
+```
+
+You can build the model in different precisions. This command uses int4 as it produces the smallest model and can run on a CPU.
+
+```bash
+python -m onnxruntime_genai.models.builder -m microsoft/phi-2 -e cpu -p int4 -o ./example-models/phi2-int4-cpu
+```
+
+You can replace the name of the output folder specified with the `-o` option with a folder of your choice.
+
+After you run the script, you will see a series of files generated in this folder. They include the Hugging Face configs for your reference, as well as the following generated files used by ONNX Runtime GenAI.
+
+- `model.onnx`: the phi-2 ONNX model
+- `model.onnx.data`: the phi-2 ONNX model weights
+- `genai_config.json`: the configuration used by ONNX Runtime GenAI
+
+You can view and change the values in the `genai_config.json` file. The model section should not be updated unless you have brought your own model and it has different parameters.
+
+The search parameters can be changed. For example, you might want to generate with a different temperature value. These values can also be set via the `set_search_options` method shown below.
+
+## Run the model with a sample prompt
+
+Run the model with the following Python script. You can change the prompt and other parameters as needed.
+
+```python
+import onnxruntime_genai as og
+
+prompt = '''def print_prime(n):
+    """
+    Print all primes between 1 and n
+    """'''
+
+model = og.Model('example-models/phi2-int4-cpu')
+
+tokenizer = model.create_tokenizer()
+
+tokens = tokenizer.encode(prompt)
+
+params = og.GeneratorParams(model)
+params.set_search_options({"max_length": 200})
+params.input_ids = tokens
+
+output_tokens = model.generate(params)[0]
+
+text = tokenizer.decode(output_tokens)
+
+print(text)
+```
+
+## Run batches of prompts
+
+You can also run batches of prompts through the model.
+
+```python
+prompts = [
+    "This is a test.",
+    "Rats are awesome pets!",
+    "The quick brown fox jumps over the lazy dog.",
+]
+
+inputs = tokenizer.encode_batch(prompts)
+
+params = og.GeneratorParams(model)
+params.input_ids = inputs
+
+outputs = model.generate(params)
+
+texts = tokenizer.decode_batch(outputs)
+```
+
+## Stream the output of the tokenizer
+
+If you are developing an application that requires tokens to be output to the user interface one at a time, you can use the streaming tokenizer.
+
+```python
+generator = og.Generator(model, params)
+tokenizer_stream = tokenizer.create_stream()
+
+print(prompt, end='', flush=True)
+
+while not generator.is_done():
+    generator.compute_logits()
+    generator.generate_next_token_top_p(0.7, 0.6)
+    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end='', flush=True)
+```