From 8b8a6fe1a8e98c9cd32e67230d21e1605b2ce279 Mon Sep 17 00:00:00 2001 From: natke Date: Fri, 16 Feb 2024 16:21:09 -0800 Subject: [PATCH 01/44] Add structure for GenAI --- docs/genai/api/c.md | 0 docs/genai/api/csharp.md | 0 docs/genai/api/python.md | 0 docs/genai/howto/build-from-source.md | 0 docs/genai/howto/build-model.md | 0 docs/genai/howto/index.md | 0 docs/genai/index.md | 0 docs/genai/tutorials/index.md | 0 8 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/genai/api/c.md create mode 100644 docs/genai/api/csharp.md create mode 100644 docs/genai/api/python.md create mode 100644 docs/genai/howto/build-from-source.md create mode 100644 docs/genai/howto/build-model.md create mode 100644 docs/genai/howto/index.md create mode 100644 docs/genai/index.md create mode 100644 docs/genai/tutorials/index.md diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/index.md b/docs/genai/index.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md new file mode 100644 index 0000000000000..e69de29bb2d1d From 8e24b439edc7414cfde6fa91f8b02dc2242a1dc3 Mon Sep 17 00:00:00 2001 From: natke Date: Sun, 18 Feb 2024 04:47:44 -0800 Subject: [PATCH 02/44] Start adding content for GenAI docs --- docs/genai/howto/install.md | 69 +++++++++++++++++++++++++++++++++++++ docs/genai/index.md | 9 +++++ 2 files changed, 78 insertions(+) create mode 100644 docs/genai/howto/install.md diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md new file mode 100644 index 0000000000000..1b972a939ace3 --- /dev/null +++ b/docs/genai/howto/install.md @@ -0,0 +1,69 @@ +--- +title: Install ONNX Runtime GenAI +description: Instructions to install ONNX Runtime GenAI on your target platform in your environment +has_children: false +nav_order: 1 +--- + +# Install ONNX Runtime GenAI + +## Python package + +(Coming soon) `pip install onnxruntime-genai` + +(Temporary) +1. Build from source + + Follow the instructions in [build-from-source.md] + +2. Install wheel + + ```bash + cd build/wheel + pip install onnxruntime-genai*.whl + ``` + +## C# package + +(Coming soon) `dotnet add package Microsoft.ML.OnnxRuntime.GenAI` + +(Temporary) +1. Build from source + + Follow the instructions in [build-from-source.md] + +2. Build nuget package + + ```cmd + nuget.exe pack Microsoft.ML.OnnxRuntimeGenAI.nuspec -Prop version=0.1.0 -Prop id="Microsoft.ML.OnnxRuntimeGenAI.Gpu" + ``` + +3. Install the nuget package + + ```cmd + dotnet add package .. local instructions + ``` + + +## C artifacts + +(Coming soon) Download release archive + +Unzip archive + +(Temporary) +1. Build from source + + Follow the instructions in [build-from-source.md] + + +2. Use the following include locations to build your C application + + * + +3. 
Use the following library locations to build your C application + + * + + + diff --git a/docs/genai/index.md b/docs/genai/index.md index e69de29bb2d1d..1634e6fdd7337 100644 --- a/docs/genai/index.md +++ b/docs/genai/index.md @@ -0,0 +1,9 @@ +# Generative AI with ONNX Runtime + +Run generative AI models with ONNX Runtime. + +This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. + +Users can call a high level `generate()` method, or run each iteration of the model in a loop, generating one token at a time, and optionally updating generation parameters inside the loop. + +It has support for greedy/beam search and TopP, TopK sampling to generate token sequences and built-in logits processing like repetition penalties. You can also easily add custom scoring. From 7d45eff5e79887b1bd9fe7b2e82ba4eb1b0a4823 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 04:27:43 -0800 Subject: [PATCH 03/44] Fill in some content for install and API --- docs/ecosystem/index.md | 2 +- docs/genai/api/c.md | 335 ++++++++++++++++++++++++++ docs/genai/api/csharp.md | 9 + docs/genai/api/index.md | 0 docs/genai/api/python.md | 183 ++++++++++++++ docs/genai/howto/build-from-source.md | 87 +++++++ docs/genai/howto/build-model.md | 125 ++++++++++ docs/genai/howto/index.md | 5 + docs/reference/index.md | 2 +- 9 files changed, 746 insertions(+), 2 deletions(-) create mode 100644 docs/genai/api/index.md diff --git a/docs/ecosystem/index.md b/docs/ecosystem/index.md index 1a0f95d77f4e0..1aa193c842cbf 100644 --- a/docs/ecosystem/index.md +++ b/docs/ecosystem/index.md @@ -1,7 +1,7 @@ --- title: Ecosystem description: See examples of how ONNX Runtime working end to end within the Azure AI and ML landscape and ecosystem -nav_order: 9 +nav_order: 10 redirect_from: /docs/tutorials/ecosystem --- # ORT Ecosystem diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index e69de29bb2d1d..72e6bbbcf90c2 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -0,0 +1,335 @@ +--- +title: C API +description: C API reference for ONNX Runtime GenAI +has_children: false +nav_order: 2 +--- + +# + +## Create model + +/* + * \brief Creates a model from the given configuration directory and device type. + * \param[in] config_path The path to the model configuration directory. The path is expected to be encoded in UTF-8. + * \param[in] device_type The device type to use for the model. + * \param[out] out The created model. + * \return OgaResult containing the error message if the model creation failed. + */ + +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaDeviceType device_type, OgaModel** out); +``` + +/* + * \brief Destroys the given model. + * \param[in] model The model to be destroyed. 
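 * Once destroyed, the model pointer must not be used again.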
+ */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyModel(OgaModel* model); +``` + +## Create Tokenizer + +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizer(const OgaModel* model, OgaTokenizer** out); +OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizer(OgaTokenizer*); +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncodeBatch(const OgaTokenizer*, const char** strings, size_t count, OgaSequences** out); +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecodeBatch(const OgaTokenizer*, const OgaSequences* tokens, const char*** out_strings); +OGA_EXPORT void OGA_API_CALL OgaTokenizerDestroyStrings(const char** strings, size_t count); +``` + + +/* OgaTokenizerStream is to decoded token strings incrementally, one token at a time. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizerStream(const OgaTokenizer*, OgaTokenizerStream** out); +OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizerStream(OgaTokenizerStream*); +``` + +/* + * Decode a single token in the stream. If this results in a word being generated, it will be returned in 'out'. + * The caller is responsible for concatenating each chunk together to generate the complete result. + * 'out' is valid until the next call to OgaTokenizerStreamDecode or when the OgaTokenizerStream is destroyed + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerStreamDecode(OgaTokenizerStream*, int32_t token, const char** out); +``` + + +## Create Generator + +/* + * \brief Creates a generator from the given model and generator params. + * \param[in] model The model to use for generation. + * \param[in] params The parameters to use for generation. + * \param[out] out The created generator. + * \return OgaResult containing the error message if the generator creation failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGenerator(const OgaModel* model, const OgaGeneratorParams* params, OgaGenerator** out); +``` + +/* + * \brief Destroys the given generator. + * \param[in] generator The generator to be destroyed. + */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyGenerator(OgaGenerator* generator); +``` + +## Create and set generator input parameters + +/* + * \brief Creates a OgaGeneratorParams from the given model. + * \param[in] model The model to use for generation. + * \param[out] out The created generator params. + * \return OgaResult containing the error message if the generator params creation failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); +``` + +/* + * \brief Destroys the given generator params. + * \param[in] generator_params The generator params to be destroyed. + */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); +``` + +/* + * \brief Sets the maximum length that the generated sequence can have. + * \param[in] params The generator params to set the maximum length on. + * \param[in] max_length The maximum length of the generated sequences. + * \return OgaResult containing the error message if the setting of the maximum length failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetMaxLength(OgaGeneratorParams* generator_params, size_t max_length); +``` + +/* + * \brief Sets the input ids for the generator params. The input ids are used to seed the generation. + * \param[in] generator_params The generator params to set the input ids on. + * \param[in] input_ids The input ids array of size input_ids_count = batch_size * sequence_length. 
+ * \param[in] input_ids_count The total number of input ids. + * \param[in] sequence_length The sequence length of the input ids. + * \param[in] batch_size The batch size of the input ids. + * \return OgaResult containing the error message if the setting of the input ids failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputIDs(OgaGeneratorParams* generator_params, const int32_t* input_ids, size_t input_ids_count, size_t sequence_length, size_t batch_size); +``` + +/* + * \brief Sets the input id sequences for the generator params. The input id sequences are used to seed the generation. + * \param[in] generator_params The generator params to set the input ids on. + * \param[in] sequences The input id sequences. + * \return OgaResult containing the error message if the setting of the input id sequences failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputSequences(OgaGeneratorParams* generator_params, const OgaSequences* sequences); +``` + +## Tokenize and decode tokens + +/* Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences + when it is no longer needed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncode(const OgaTokenizer*, const char* str, OgaSequences* sequences); +``` + +/* Decode a single token sequence and returns a null terminated utf8 string. out_string must be freed with OgaDestroyString + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecode(const OgaTokenizer*, const int32_t* tokens, size_t token_count, const char** out_string); +``` + + +## Generate output + +### High level API + +/* + * \brief Generates an array of token arrays from the model execution based on the given generator params. + * \param[in] model The model to use for generation. + * \param[in] generator_params The parameters to use for generation. + * \param[out] out The generated sequences of tokens. The caller is responsible for freeing the sequences using OgaDestroySequences + * after it is done using the sequences. + * \return OgaResult containing the error message if the generation failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerate(const OgaModel* model, const OgaGeneratorParams* generator_params, OgaSequences** out); +``` + +/* + * \brief Creates a OgaGeneratorParams from the given model. + * \param[in] model The model to use for generation. + * \param[out] out The created generator params. + * \return OgaResult containing the error message if the generator params creation failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); +``` + +/* + * \brief Destroys the given generator params. + * \param[in] generator_params The generator params to be destroyed. + */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); +``` + +### Low level API + + +/* + * \brief Returns true if the generator has finished generating all the sequences. + * \param[in] generator The generator to check if it is done with generating all sequences. + * \return True if the generator has finished generating all the sequences, false otherwise. + */ +```c +OGA_EXPORT bool OGA_API_CALL OgaGenerator_IsDone(const OgaGenerator* generator); +``` + +/* + * \brief Computes the logits from the model based on the input ids and the past state. The computed logits are stored in the generator. 
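 * In the low-level loop this is called once per iteration, before one of the OgaGenerator_GenerateNextToken functions.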
+ * \param[in] generator The generator to compute the logits for. + * \return OgaResult containing the error message if the computation of the logits failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_ComputeLogits(OgaGenerator* generator); +``` + +/* + * \brief Generates the next token based on the computed logits using the greedy search. + * \param[in] generator The generator to generate the next token for. + * \return OgaResult containing the error message if the generation of the next token failed. + */ +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_Top(OgaGenerator* generator); + +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopK(OgaGenerator* generator, int k, float t); +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopP(OgaGenerator* generator, float p, float t); +``` + +/* + * \brief Returns the number of tokens in the sequence at the given index. + * \param[in] generator The generator to get the count of the tokens for the sequence at the given index. + * \return The number tokens in the sequence at the given index. + */ +```c +OGA_EXPORT size_t OGA_API_CALL OgaGenerator_GetSequenceLength(const OgaGenerator* generator, size_t index); +``` + +/* + * \brief Returns a pointer to the sequence data at the given index. The number of tokens in the sequence + * is given by OgaGenerator_GetSequenceLength + * \param[in] generator The generator to get the sequence data for the sequence at the given index. + * \return The pointer to the sequence data at the given index. The sequence data is owned by the OgaGenerator + * and will be freed when the OgaGenerator is destroyed. The caller must copy the data if it needs to + * be used after the OgaGenerator is destroyed. + */ +```c +OGA_EXPORT const int32_t* OGA_API_CALL OgaGenerator_GetSequence(const OgaGenerator* generator, size_t index); +``` + + +## Enums and structs + +```c +typedef enum OgaDeviceType { + OgaDeviceTypeAuto, + OgaDeviceTypeCPU, + OgaDeviceTypeCUDA, +} OgaDeviceType; + +typedef enum OgaDataType { + OgaDataType_int32, + OgaDataType_float32, + OgaDataType_string, // UTF8 string +} OgaDataType; + +typedef struct OgaResult OgaResult; +typedef struct OgaGeneratorParams OgaGeneratorParams; +typedef struct OgaGenerator OgaGenerator; +typedef struct OgaModel OgaModel; +typedef struct OgaBuffer OgaBuffer; +``` + +// OgaSequences is an array of token arrays where the number of token arrays can be obtained using +// OgaSequencesCount and the number of tokens in each token array can be obtained using + +```c +OgaSequencesGetSequenceCount. +typedef struct OgaSequences OgaSequences; +typedef struct OgaTokenizer OgaTokenizer; +typedef struct OgaTokenizerStream OgaTokenizerStream; +``` + +## Utility functions + +/* + * \param[in] result OgaResult that contains the error message. + * \return Error message contained in the OgaResult. The const char* is owned by the OgaResult + * and can will be freed when the OgaResult is destroyed. + */ +```c +OGA_EXPORT const char* OGA_API_CALL OgaResultGetError(OgaResult* result); +``` + +/* + * \param[in] result OgaResult to be destroyed. 
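 * Read the message with OgaResultGetError before destroying the result, as the string is owned by the result.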
+ */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyResult(OgaResult*); +OGA_EXPORT void OGA_API_CALL OgaDestroyString(const char*); + +OGA_EXPORT void OGA_API_CALL OgaDestroyBuffer(OgaBuffer*); +OGA_EXPORT OgaDataType OGA_API_CALL OgaBufferGetType(const OgaBuffer*); +OGA_EXPORT size_t OGA_API_CALL OgaBufferGetDimCount(const OgaBuffer*); +OGA_EXPORT OgaResult* OGA_API_CALL OgaBufferGetDims(const OgaBuffer*, size_t* dims, size_t dim_count); +OGA_EXPORT const void* OGA_API_CALL OgaBufferGetData(const OgaBuffer*); + +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateSequences(OgaSequences** out); +``` + + +/* + * \param[in] sequences OgaSequences to be destroyed. + */ +```c +OGA_EXPORT void OGA_API_CALL OgaDestroySequences(OgaSequences* sequences); +``` + +/* + * \brief Returns the number of sequences in the OgaSequences + * \param[in] sequences + * \return The number of sequences in the OgaSequences + */ +```c +OGA_EXPORT size_t OGA_API_CALL OgaSequencesCount(const OgaSequences* sequences); +``` + +/* + * \brief Returns the number of tokens in the sequence at the given index + * \param[in] sequences + * \return The number of tokens in the sequence at the given index + */ +```c +OGA_EXPORT size_t OGA_API_CALL OgaSequencesGetSequenceCount(const OgaSequences* sequences, size_t sequence_index); +``` + +/* + * \brief Returns a pointer to the sequence data at the given index. The number of tokens in the sequence + * is given by OgaSequencesGetSequenceCount + * \param[in] sequences + * \return The pointer to the sequence data at the given index. The pointer is valid until the OgaSequences is destroyed. + */ +```c +OGA_EXPORT const int32_t* OGA_API_CALL OgaSequencesGetSequenceData(const OgaSequences* sequences, size_t sequence_index); + +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetWhisperInputFeatures(OgaGeneratorParams*, const int32_t* inputs, size_t count); +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetWhisperDecoderInputIDs(OgaGeneratorParams*, const int32_t* input_ids, size_t input_ids_count); +``` diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index e69de29bb2d1d..fbe74d504e5c3 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -0,0 +1,9 @@ +--- +title: C# API +description: C# API reference for ONNX Runtime GenAI +has_children: false +nav_order: 3 +--- + +# ONNX Runtime GenAI C# API + diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index e69de29bb2d1d..e68849654d891 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -0,0 +1,183 @@ +--- +title: Python API +description: Python API reference for ONNX Runtime GenAI +has_children: false +nav_order: 2 +--- + +# Python API +{: .no_toc } + +* TOC placeholder +{:toc} + +## Install and import + +The Python API is delivered by the onnxruntime-genai Python package. + +```bash +pip install onnxruntime-genai +``` + +```python +import onnxruntime_genai +``` + +## Model class + +### Load the model + +Loads the ONNX model(s) and configuration from a folder on disk. + +```python +onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) -> onnxruntime_genai.Model +``` + +#### Parameters + +- `model_folder`: (required) Location of model and configuration on disk +- `device`: (optional) The device to run on. 
One of: + - onnxruntime_genai.CPU + - onnxruntime_genai.CUDA + - onnxruntime_genai.CPU + If not specified, defaults to XXX + +#### Return value + +### GeneratorParameters class + +```python +params=onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams +``` + +#### Parameters + +- `model`: (required) The model that was loaded by onnxruntime_genai.Model() + +#### Return value + + +### Generate + +```python +onnxruntime_genai.Model.generate(params: GeneratorParams) -> XXXX +``` + +#### Parameters +- `params`: (Required) Created by the `GenerateParams` method. + +#### Return value + +### Generate sequence + +```python +onnxruntime_genai.Model.generate_sequence(input_ids: , params: **kwargs) +``` + +#### Parameters + +- `input_ids`: tokenized prompt +- `params`: dictionary of generation parameters + +## Tokenizer class + +### Create tokenizer + +```python +create_tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer +``` + +#### Parameters + +- `model`: (Required) The model that was loaded by the `Model()` + +#### Return value + +- `Tokenizer` + + +### Encode + +```python +onnxruntime_genai.Tokenizer.encode(XXXX) -> XXXX +``` + +#### Parameters + +#### Return value + +### Decode + +```python +onnxruntime_genai.StreamingTokenizer.decode(XXXX) -> XXXX +``` + +#### Parameters + +#### Return value + +### Encode batch + +```python +onnxruntime_genai.Tokenizer.encode_batch(XXXX) -> XXXX +``` + +#### Parameters + +#### Return value + +### Decode batch + +```python +onnxruntime_genai.decode_batch(XXXX) -> XXXX +``` + +#### Parameters + +#### Return value + + + +### Create streaming tokenizer + +```python +create_stream(model: onnxruntime_genai.Model) -> TokenizerStream +``` + +#### Parameters + +- `model`: (Required) The model that was loaded by the `Model()` + +#### Return value + +- TokenizerStream + +### Decode token stream + +```python +onnxruntime_genai.TokenizerStream.decode(token: ) -> token +``` + + + pybind11::class_(m, "Generator") + .def(pybind11::init()) + .def("is_done", &PyGenerator::IsDone) + .def("compute_logits", &PyGenerator::ComputeLogits) + .def("generate_next_token", &PyGenerator::GenerateNextToken) + .def("generate_next_token_top", &PyGenerator::GenerateNextToken_Top) + .def("generate_next_token_top_p", &PyGenerator::GenerateNextToken_TopP) + .def("generate_next_token_top_k", &PyGenerator::GenerateNextToken_TopK) + .def("generate_next_token_top_k_top_p", &PyGenerator::GenerateNextToken_TopK_TopP) + .def("get_next_tokens", &PyGenerator::GetNextTokens) + .def("get_sequence", &PyGenerator::GetSequence); + + m.def("is_cuda_available", []() { +#ifdef USE_CUDA + return true; +#else + return false; +#endif + }); +} + +} // namespace Generators \ No newline at end of file diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md index e69de29bb2d1d..2505ab304e0df 100644 --- a/docs/genai/howto/build-from-source.md +++ b/docs/genai/howto/build-from-source.md @@ -0,0 +1,87 @@ +--- +title: Build from source +description: How to build ONNX Runtime GenAI from source +has_children: false +nav_order: 2 +--- + +# Build onnxruntime-genai from source + +## Pre-requisites + +`cmake` + +## Build steps + +1. Clone this repo + + ```bash + git clone https://github.com/microsoft/onnxruntime-genai + cd onnxruntime-genai + ``` + +2. 
Install ONNX Runtime + + By default, the onnxruntime-genai build expects to find the ONNX Runtime include and binaries in a folder called `ort` in the root directory of onnxruntime-genai. You can put the ONNX Runtime files in a different location and specify this location to the onnxruntime-genai build. These instructions use ORT_HOME as the location. + + * Install from release + + These instructions are for the Linux GPU build of ONNX Runtime. Replace the location with the operating system and target of choice. + + ```bash + cd $ORT_HOME + wget https://github.com/microsoft/onnxruntime/releases/download/v1.17.0/onnxruntime-linux-x64-gpu-1.17.0.tgz + tar xvzf onnxruntime-linux-x64-gpu-1.17.0.tgz + mv onnxruntime-linux-x64-gpu-1.17.0/include . + mv onnxruntime-linux-x64-gpu-1.17.0/lib . + ``` + + * Or build from source + + ``` + git clone https://github.com/microsoft/onnxruntime.git + cd onnxruntime + ``` + + Create include and lib folders in the ORT_HOME directory + + ```bash + mkdir $ORT_HOME/include + mkdir $ORT_HOME/lib + ``` + + Build from source and copy the include and libraries into ORT_HOME + + On Windows + + ```cmd + build.bat --build_shared_lib --skip_tests --parallel [--use_cuda] + copy include\onnxruntime\core\session\onnxruntime_c_api.h $ORT_HOME\include + copy build\Windows\Debug\Debug\*.dll $ORT_HOME\lib + ``` + + On Linux + + ```cmd + ./build.sh --build_shared_lib --skip_tests --parallel [--use_cuda] + cp include/onnxruntime/core/session/onnxruntime_c_api.h $ORT_HOME/include + cp build/Linux/RelWithDebInfo/libonnxruntime*.so* $ORT_HOME/lib + ``` + +3. Build onnxruntime-genai + + If you are building for CUDA, add the cuda_home argument. + + ```bash + cd .. + python build.py [--cuda_home ] + ``` + + + +4. Install Python wheel + + ```bash + cd build/wheel + pip install *.whl + ``` \ No newline at end of file diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index e69de29bb2d1d..bac3ab700553a 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -0,0 +1,125 @@ +--- +title: Build models +description: How to build models with ONNX Runtime GenAI +has_children: false +nav_order: 2 +--- + + +# Generate models using Model Builder + +The model builder greatly accelerates creating optimized and quantized ONNX models that run with ONNX Runtime GenAI. + +## Current Support +The tool currently supports the following model architectures. + +- Gemma +- LLaMA +- Mistral +- Phi + +## Usage + +### Full Usage +For all available options, please use the `-h/--help` flag. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder --help + +# From source: +python3 builder.py --help +``` + +### Original Model From Hugging Face + +This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). + +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files + +# From source: +python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files +``` + +### Original Model From Disk + +This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). 
+ +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved + +# From source: +python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +``` + +### Customized or Finetuned Model + +This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face. + +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m path_to_local_folder_on_disk -o /path/to/output/folder -p precision -e execution_provider + +# From source: +python3 builder.py -m path_to_local_folder_on_disk -o /path/to/output/folder -p precision -e execution_provider +``` + +### Extra Options + +This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`. + +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files --extra_options filename=decoder.onnx + +# From source: +python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files --extra_options filename=decoder.onnx +``` + +To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above. + +### Unit Testing Models + +This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it. + +``` +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_name = "your_model_name" +cache_dir = "cache_dir_to_save_hf_files" + +model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir) +model.save_pretrained(cache_dir) + +tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir) +tokenizer.save_pretrained(cache_dir) +``` + +#### Option 1: Use the model builder tool directly + +This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers. + +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider --extra_options num_hidden_layers=4 + +# From source: +python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider --extra_options num_hidden_layers=4 +``` + +#### Option 2: Edit the config.json file on disk and then run the model builder tool + +1. Navigate to where the PyTorch model and its associated files are saved on disk. +2. Modify `num_hidden_layers` in `config.json` to your desired target (e.g. 4 layers). +3. Run the below command for the model builder tool. 
+ +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved + +# From source: +python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +``` + diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md index e69de29bb2d1d..e78f9c4088e8b 100644 --- a/docs/genai/howto/index.md +++ b/docs/genai/howto/index.md @@ -0,0 +1,5 @@ +--- +title: How to +description: How to perform specific tasks with ONNX Runtime GenAI +nav_order: 3 +--- \ No newline at end of file diff --git a/docs/reference/index.md b/docs/reference/index.md index b04fa85ef6110..c0e098a961257 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -1,7 +1,7 @@ --- title: Reference has_children: true -nav_order: 10 +nav_order: 11 redirect_from: /docs/resources --- From c8f6f78c6e1f45f8ce7e826a470024c65a86bbb2 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 04:38:03 -0800 Subject: [PATCH 04/44] Fix navigation --- docs/ecosystem/index.md | 2 +- docs/genai/api/c.md | 5 ++++- docs/genai/index.md | 8 +++++++- docs/reference/index.md | 2 +- 4 files changed, 13 insertions(+), 4 deletions(-) diff --git a/docs/ecosystem/index.md b/docs/ecosystem/index.md index 1aa193c842cbf..1a0f95d77f4e0 100644 --- a/docs/ecosystem/index.md +++ b/docs/ecosystem/index.md @@ -1,7 +1,7 @@ --- title: Ecosystem description: See examples of how ONNX Runtime working end to end within the Azure AI and ML landscape and ecosystem -nav_order: 10 +nav_order: 9 redirect_from: /docs/tutorials/ecosystem --- # ORT Ecosystem diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 72e6bbbcf90c2..bcbf2456058be 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -2,10 +2,13 @@ title: C API description: C API reference for ONNX Runtime GenAI has_children: false +grand_parent: Generative AI nav_order: 2 --- -# +# ONNX Runtime GenAI C API + + ## Create model diff --git a/docs/genai/index.md b/docs/genai/index.md index 1634e6fdd7337..126ee81f544ec 100644 --- a/docs/genai/index.md +++ b/docs/genai/index.md @@ -1,4 +1,10 @@ -# Generative AI with ONNX Runtime +--- +title: ONNX Runtime GenAI +description: Run generative models with ONNX Runtime GenAi +nav_order: 6 +--- + +# Generative AI Run generative AI models with ONNX Runtime. diff --git a/docs/reference/index.md b/docs/reference/index.md index c0e098a961257..b04fa85ef6110 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -1,7 +1,7 @@ --- title: Reference has_children: true -nav_order: 11 +nav_order: 10 redirect_from: /docs/resources --- From 6a50586ab145aa8aaf5f2f85c11c35afb995152b Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 04:44:00 -0800 Subject: [PATCH 05/44] Change title for GenAI nav --- docs/genai/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/genai/index.md b/docs/genai/index.md index 126ee81f544ec..5302fd87d6b9a 100644 --- a/docs/genai/index.md +++ b/docs/genai/index.md @@ -1,10 +1,10 @@ --- -title: ONNX Runtime GenAI +title: Generative AI description: Run generative models with ONNX Runtime GenAi nav_order: 6 --- -# Generative AI +# Generative AI with ONNX Runtime Run generative AI models with ONNX Runtime. 
From 658ac0b0394ebd4dd082fe18120bf4e6cb562a68 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 05:32:38 -0800 Subject: [PATCH 06/44] Add parents --- docs/build/index.md | 2 +- docs/genai/api/c.md | 1 + docs/genai/api/csharp.md | 1 + docs/genai/api/python.md | 95 +++++++++++++++++++-------- docs/genai/howto/build-from-source.md | 2 + docs/genai/howto/build-model.md | 2 +- docs/genai/howto/install.md | 2 + 7 files changed, 77 insertions(+), 28 deletions(-) diff --git a/docs/build/index.md b/docs/build/index.md index be906d8b9cfb1..5a1719bc317a7 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -1,5 +1,5 @@ --- -title: Build ONNX Runtime +title: Build from source has_children: true nav_order: 5 redirect_from: /docs/how-to/build diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index bcbf2456058be..2d681503adf54 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -2,6 +2,7 @@ title: C API description: C API reference for ONNX Runtime GenAI has_children: false +parent: API docs grand_parent: Generative AI nav_order: 2 --- diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index fbe74d504e5c3..bdc6607be5a5b 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -2,6 +2,7 @@ title: C# API description: C# API reference for ONNX Runtime GenAI has_children: false +grand_parent: Generative AI nav_order: 3 --- diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index e68849654d891..ffe5fc5757add 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -2,6 +2,7 @@ title: Python API description: Python API reference for ONNX Runtime GenAI has_children: false +grand_parent: Generative AI nav_order: 2 --- @@ -44,18 +45,20 @@ onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) #### Return value -### GeneratorParameters class +### Create a Generator ```python -params=onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams +onnxruntime_genai.Model.Generator(params: GeneratorParams) -> Generator ``` #### Parameters -- `model`: (required) The model that was loaded by onnxruntime_genai.Model() +- `params`: (Required) The set of parameters that control the generation #### Return value +- `onnxruntime_genai.Generator` + ### Generate @@ -71,7 +74,7 @@ onnxruntime_genai.Model.generate(params: GeneratorParams) -> XXXX ### Generate sequence ```python -onnxruntime_genai.Model.generate_sequence(input_ids: , params: **kwargs) +onnxruntime_genai.Model.generate_sequence(input_ids: , params: ) ``` #### Parameters @@ -79,6 +82,20 @@ onnxruntime_genai.Model.generate_sequence(input_ids: , params: **kwargs) - `input_ids`: tokenized prompt - `params`: dictionary of generation parameters + +### Create GeneratorParameters class + +```python +params=onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams +``` + +#### Parameters + +- `model`: (required) The model that was loaded by onnxruntime_genai.Model() + +#### Return value + + ## Tokenizer class ### Create tokenizer @@ -158,26 +175,52 @@ create_stream(model: onnxruntime_genai.Model) -> TokenizerStream onnxruntime_genai.TokenizerStream.decode(token: ) -> token ``` +## Generator class + +### Is generation done + +```python +onnxruntime_genai.Generator.is_done() -> bool +``` + +### Compute logits + +```python +onnxruntime_genai.Generator.compute_logits() -> +``` - pybind11::class_(m, "Generator") - .def(pybind11::init()) - .def("is_done", &PyGenerator::IsDone) - 
.def("compute_logits", &PyGenerator::ComputeLogits) - .def("generate_next_token", &PyGenerator::GenerateNextToken) - .def("generate_next_token_top", &PyGenerator::GenerateNextToken_Top) - .def("generate_next_token_top_p", &PyGenerator::GenerateNextToken_TopP) - .def("generate_next_token_top_k", &PyGenerator::GenerateNextToken_TopK) - .def("generate_next_token_top_k_top_p", &PyGenerator::GenerateNextToken_TopK_TopP) - .def("get_next_tokens", &PyGenerator::GetNextTokens) - .def("get_sequence", &PyGenerator::GetSequence); - - m.def("is_cuda_available", []() { -#ifdef USE_CUDA - return true; -#else - return false; -#endif - }); -} - -} // namespace Generators \ No newline at end of file +### Generate next token + +```python +onnxruntime_genai.Generator.generate_next_token() -> +``` + +### Generate next token with Top P sampling + +```python +onnxruntime_genai.Generator.generate_next_token_top_p() -> +``` + +### Generate next token with Top K sampling + +```python +onnxruntime_genai.Generator.generate_next_token_top_k() -> +``` + +### Generate next token with Top K and Top P sampling + +```python +onnxruntime_genai.Generator.generate_next_token_top_k_top_p() -> +``` + +### Get next tokens + +```python +onnxruntime_genai.Generator.generate_next_tokens() -> +``` + +### Get sequence + +```python +onnxruntime_genai.Generator.generate_next_token() -> +``` diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md index 2505ab304e0df..bfe35b5fccdf9 100644 --- a/docs/genai/howto/build-from-source.md +++ b/docs/genai/howto/build-from-source.md @@ -2,6 +2,8 @@ title: Build from source description: How to build ONNX Runtime GenAI from source has_children: false +parent: How to +grand_parent: Generative AI nav_order: 2 --- diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index bac3ab700553a..29651e0446fd5 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -2,10 +2,10 @@ title: Build models description: How to build models with ONNX Runtime GenAI has_children: false +grand_parent: Generative AI nav_order: 2 --- - # Generate models using Model Builder The model builder greatly accelerates creating optimized and quantized ONNX models that run with ONNX Runtime GenAI. 
diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 1b972a939ace3..e58999929131e 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -2,6 +2,8 @@ title: Install ONNX Runtime GenAI description: Instructions to install ONNX Runtime GenAI on your target platform in your environment has_children: false +parent: How to +grand_parent: Generative AI nav_order: 1 --- From 1e74e156f13e56c9c43e98cae883815401d27ceb Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 05:38:06 -0800 Subject: [PATCH 07/44] Add more parents --- docs/genai/api/index.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md index e69de29bb2d1d..8a851a90ced2b 100644 --- a/docs/genai/api/index.md +++ b/docs/genai/api/index.md @@ -0,0 +1,6 @@ +--- +title: API docs +description: API documentation for ONNX Runtime GenAI +parent: Generative AI +nav_order: 2 +--- \ No newline at end of file From 97887ab85de06c4f6c70063eccbafb41b7d8a5f9 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 05:48:39 -0800 Subject: [PATCH 08/44] Add children --- docs/genai/index.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/genai/index.md b/docs/genai/index.md index 5302fd87d6b9a..215dcd5e85546 100644 --- a/docs/genai/index.md +++ b/docs/genai/index.md @@ -1,6 +1,7 @@ --- title: Generative AI -description: Run generative models with ONNX Runtime GenAi +description: Run generative models with ONNX Runtime GenAI +has_children: true nav_order: 6 --- From 14e7261a88ff7522e6f70a08bf1a7d52999a3224 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 05:51:04 -0800 Subject: [PATCH 09/44] Header updates --- docs/genai/api/csharp.md | 1 + docs/genai/api/python.md | 1 + docs/genai/howto/build-model.md | 1 + docs/genai/howto/index.md | 1 + docs/genai/tutorials/index.md | 6 ++++++ 5 files changed, 10 insertions(+) diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index bdc6607be5a5b..9e8a4c46cb476 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -2,6 +2,7 @@ title: C# API description: C# API reference for ONNX Runtime GenAI has_children: false +parent: API docs grand_parent: Generative AI nav_order: 3 --- diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index ffe5fc5757add..c88bf8b59dedd 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -2,6 +2,7 @@ title: Python API description: Python API reference for ONNX Runtime GenAI has_children: false +parent: API docs grand_parent: Generative AI nav_order: 2 --- diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index 29651e0446fd5..10712e1852796 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -2,6 +2,7 @@ title: Build models description: How to build models with ONNX Runtime GenAI has_children: false +parent: How to grand_parent: Generative AI nav_order: 2 --- diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md index e78f9c4088e8b..9de912fe93073 100644 --- a/docs/genai/howto/index.md +++ b/docs/genai/howto/index.md @@ -1,5 +1,6 @@ --- title: How to description: How to perform specific tasks with ONNX Runtime GenAI +parent: Generative AI nav_order: 3 --- \ No newline at end of file diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md index e69de29bb2d1d..e7318222a1a1b 100644 --- a/docs/genai/tutorials/index.md +++ b/docs/genai/tutorials/index.md @@ -0,0 +1,6 @@ +--- +title: Tutorials 
+description: Build your application with ONNX Runtime GenAI +parent: Generative AI +nav_order: 1 +--- \ No newline at end of file From 3f4ac0fa47259eb99fd4db7b0ba0f48946837466 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 05:56:28 -0800 Subject: [PATCH 10/44] Add grand children --- docs/genai/tutorials/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md index e7318222a1a1b..4c6bdc87bba86 100644 --- a/docs/genai/tutorials/index.md +++ b/docs/genai/tutorials/index.md @@ -2,5 +2,6 @@ title: Tutorials description: Build your application with ONNX Runtime GenAI parent: Generative AI +has_children: true nav_order: 1 --- \ No newline at end of file From cce960e9dcd9d2a2e4e9274a876bf69932ae1435 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 06:02:55 -0800 Subject: [PATCH 11/44] Save grand children --- docs/genai/api/index.md | 1 + docs/genai/howto/index.md | 1 + 2 files changed, 2 insertions(+) diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md index 8a851a90ced2b..436faba7cf72b 100644 --- a/docs/genai/api/index.md +++ b/docs/genai/api/index.md @@ -2,5 +2,6 @@ title: API docs description: API documentation for ONNX Runtime GenAI parent: Generative AI +has_children: true nav_order: 2 --- \ No newline at end of file diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md index 9de912fe93073..3ad82074dc2de 100644 --- a/docs/genai/howto/index.md +++ b/docs/genai/howto/index.md @@ -2,5 +2,6 @@ title: How to description: How to perform specific tasks with ONNX Runtime GenAI parent: Generative AI +has_children: true nav_order: 3 --- \ No newline at end of file From 6f9320b59df94695bc9264fa8935ec5cfdcf66a2 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 06:09:46 -0800 Subject: [PATCH 12/44] Add table of contents --- docs/genai/api/c.md | 3 +++ docs/genai/api/csharp.md | 3 +++ docs/genai/howto/build-from-source.md | 4 ++++ docs/genai/howto/build-model.md | 4 ++++ docs/genai/howto/install.md | 4 ++++ 5 files changed, 18 insertions(+) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 2d681503adf54..5bc0df2b371d3 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -8,7 +8,10 @@ nav_order: 2 --- # ONNX Runtime GenAI C API +{: .no_toc } +* TOC placeholder +{:toc} ## Create model diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index 9e8a4c46cb476..f7a8fe327fc40 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -8,4 +8,7 @@ nav_order: 3 --- # ONNX Runtime GenAI C# API +{: .no_toc } +* TOC placeholder +{:toc} diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md index bfe35b5fccdf9..e7479f5a2aaf3 100644 --- a/docs/genai/howto/build-from-source.md +++ b/docs/genai/howto/build-from-source.md @@ -8,6 +8,10 @@ nav_order: 2 --- # Build onnxruntime-genai from source +{: .no_toc } + +* TOC placeholder +{:toc} ## Pre-requisites diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index 10712e1852796..dac5db965b80d 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -8,6 +8,10 @@ nav_order: 2 --- # Generate models using Model Builder +{: .no_toc } + +* TOC placeholder +{:toc} The model builder greatly accelerates creating optimized and quantized ONNX models that run with ONNX Runtime GenAI. 
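
To make the placeholder arguments in the model builder commands above concrete, a hypothetical invocation could look like the following (the model name, precision, and execution provider values are illustrative only; run the tool with `--help` for the values your install supports):

```bash
# Hypothetical example values; consult --help for the supported options
python3 -m onnxruntime_genai.models.builder \
  -m microsoft/phi-2 \
  -o ./phi-2-onnx \
  -p fp16 \
  -e cuda \
  -c ./hf_cache
```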
diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index e58999929131e..897ea9779876a 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -8,6 +8,10 @@ nav_order: 1 --- # Install ONNX Runtime GenAI +{: .no_toc } + +* TOC placeholder +{:toc} ## Python package From 39d10a4147b9edefd8afc80dae661a335432a0df Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 06:46:32 -0800 Subject: [PATCH 13/44] Fix link in install --- docs/genai/howto/install.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 897ea9779876a..4592a8d3c47c7 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -20,7 +20,7 @@ nav_order: 1 (Temporary) 1. Build from source - Follow the instructions in [build-from-source.md] + Follow the [build from source](./build-from-source.md) instructions. 2. Install wheel @@ -36,7 +36,7 @@ nav_order: 1 (Temporary) 1. Build from source - Follow the instructions in [build-from-source.md] + Follow the [build from source](./build-from-source.md) instructions. 2. Build nuget package @@ -60,7 +60,7 @@ Unzip archive (Temporary) 1. Build from source - Follow the instructions in [build-from-source.md] + Follow the [build from source](build-from-source.md) instructions. 2. Use the following include locations to build your C application From 61b5e898ca57e5c129b1a438a6e7548a537d4d6a Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 08:57:16 -0800 Subject: [PATCH 14/44] Update Python API --- docs/genai/api/python.md | 137 ++++++++++++++++++++++++++------------- 1 file changed, 91 insertions(+), 46 deletions(-) diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index c88bf8b59dedd..a19435f674ab1 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -41,27 +41,28 @@ onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) - `device`: (optional) The device to run on. One of: - onnxruntime_genai.CPU - onnxruntime_genai.CUDA - - onnxruntime_genai.CPU - If not specified, defaults to XXX + If not specified, defaults to CPU. 
#### Return value -### Create a Generator +`onnxruntime_genai.Model` + +### Create tokenizer object ```python -onnxruntime_genai.Model.Generator(params: GeneratorParams) -> Generator +onnxruntime_genai.Model.create_tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer ``` #### Parameters -- `params`: (Required) The set of parameters that control the generation +- `model`: (Required) The model that was loaded by the `Model()` #### Return value -- `onnxruntime_genai.Generator` +- `Tokenizer` -### Generate +### Generate method ```python onnxruntime_genai.Model.generate(params: GeneratorParams) -> XXXX @@ -72,22 +73,15 @@ onnxruntime_genai.Model.generate(params: GeneratorParams) -> XXXX #### Return value -### Generate sequence +`numpy.int32 [ batch_size, max_length]` -```python -onnxruntime_genai.Model.generate_sequence(input_ids: , params: ) -``` - -#### Parameters - -- `input_ids`: tokenized prompt -- `params`: dictionary of generation parameters +## GeneratorParams class -### Create GeneratorParameters class +### Create GeneratorParams object ```python -params=onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams +onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime_genai.GeneratorParams ``` #### Parameters @@ -96,132 +90,183 @@ params=onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnx #### Return value +`onnxruntime_genai.GeneratorParams` ## Tokenizer class -### Create tokenizer +Tokenizer objects are created from a Model. + +### Encode ```python -create_tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer +onnxruntime_genai.Tokenizer.encode(prompt: str) -> numpy.int32 ``` #### Parameters -- `model`: (Required) The model that was loaded by the `Model()` +- `prompt`: (Required) #### Return value -- `Tokenizer` - +`numpy.int32`: an array of tokens representing the prompt -### Encode +### Decode ```python -onnxruntime_genai.Tokenizer.encode(XXXX) -> XXXX +onnxruntime_genai.Tokenizer.decode(numpy.int32) -> str ``` #### Parameters -#### Return value +`numpy.int32`: (Required) a sequence of generated tokens -### Decode -```python -onnxruntime_genai.StreamingTokenizer.decode(XXXX) -> XXXX -``` +#### Return value -#### Parameters +str: the decoded generated tokens -#### Return value ### Encode batch ```python -onnxruntime_genai.Tokenizer.encode_batch(XXXX) -> XXXX +onnxruntime_genai.Tokenizer.encode_batch(texts: list[str]) -> ``` #### Parameters +- `texts`: a list of inputs + #### Return value +`[[numpy.int32]]`: a 2 D array of tokens + ### Decode batch ```python -onnxruntime_genai.decode_batch(XXXX) -> XXXX +onnxruntime_genai.Tokenize.decode_batch(tokens: [[numpy.int32]]) -> list[str] ``` #### Parameters +- tokens + #### Return value +`texts`: a batch of decoded text + +### Create tokenizer decoding stream -### Create streaming tokenizer +Decodes one token at a time to allow for responsive user interfaces. ```python -create_stream(model: onnxruntime_genai.Model) -> TokenizerStream +onnxruntime_genai.Tokenizer.create_stream() -> TokenizerStream ``` #### Parameters -- `model`: (Required) The model that was loaded by the `Model()` +None #### Return value - TokenizerStream -### Decode token stream +## TokenizerStream class + +This class keeps track of the generated token sequence, returning the next displayable string (according to the tokenizer's vocabulary) when decode is called. Explain empty string ... 
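
In practice, `decode` buffers incoming tokens and returns the empty string until they form a complete displayable chunk, such as a whole word. A minimal sketch of streaming output built on this behavior (`generated_tokens` is a hypothetical token sequence):

```python
# Print text as soon as each displayable chunk is complete (sketch; API unreleased)
stream = tokenizer.create_stream()
for token in generated_tokens:
    chunk = stream.decode(token)
    if chunk:  # empty string until a displayable chunk has accumulated
        print(chunk, end="", flush=True)
```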
+ +### Decode method ```python -onnxruntime_genai.TokenizerStream.decode(token: ) -> token +onnxruntime_genai.TokenizerStream.decode(token: int32) -> str ``` +#### Parameters + +- `token`: (Required) A token to decode + +#### Returns + +`str`: Next displayable text, if at the end of displayable block?, otherwise empty string + ## Generator class +### Create a Generator + +```python +onnxruntime_genai.Generator(model: Model, params: GeneratorParams) -> Generator +``` + +#### Parameters + +- `model`: (Required) The model to use for generation +- `params`: (Required) The set of parameters that control the generation + +#### Return value + +- `onnxruntime_genai.Generator` + + ### Is generation done +Returns true when all sequences are at max length, or have reached the end of sequence. + ```python onnxruntime_genai.Generator.is_done() -> bool ``` ### Compute logits +Runs the model through one iteration. + ```python -onnxruntime_genai.Generator.compute_logits() -> +onnxruntime_genai.Generator.compute_logits() ``` ### Generate next token +Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top P sampling. + ```python -onnxruntime_genai.Generator.generate_next_token() -> +onnxruntime_genai.Generator.generate_next_token() ``` ### Generate next token with Top P sampling +Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top P sampling. + ```python -onnxruntime_genai.Generator.generate_next_token_top_p() -> +onnxruntime_genai.Generator.generate_next_token_top_p() ``` ### Generate next token with Top K sampling +Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using Top K sampling. + ```python -onnxruntime_genai.Generator.generate_next_token_top_k() -> +onnxruntime_genai.Generator.generate_next_token_top_k() ``` ### Generate next token with Top K and Top P sampling +Using the current set of logits and the specified generator parameters, calculates the next batch of tokens, using both Top K then Top P sampling. + ```python -onnxruntime_genai.Generator.generate_next_token_top_k_top_p() -> +onnxruntime_genai.Generator.generate_next_token_top_k_top_p() ``` ### Get next tokens +Returns the most recently generated tokens. + ```python -onnxruntime_genai.Generator.generate_next_tokens() -> +onnxruntime_genai.Generator.get_next_tokens() -> [numpy.int32] ``` ### Get sequence ```python -onnxruntime_genai.Generator.generate_next_token() -> +onnxruntime_genai.Generator.get_sequence(index: int) -> [numpy.int32] ``` + +- `index`: (Required) The index of the sequence in the batch to return \ No newline at end of file From ee83d2d653697418354a31078dd4327b612d7fa8 Mon Sep 17 00:00:00 2001 From: natke Date: Tue, 20 Feb 2024 09:50:50 -0800 Subject: [PATCH 15/44] More edits to Python API docs --- docs/genai/api/python.md | 72 ++++++++++++++++++++++------------------ 1 file changed, 39 insertions(+), 33 deletions(-) diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index a19435f674ab1..1d4a298486792 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -43,7 +43,7 @@ onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) - onnxruntime_genai.CUDA If not specified, defaults to CPU. 
-#### Return value +#### Returns `onnxruntime_genai.Model` @@ -57,23 +57,23 @@ onnxruntime_genai.Model.create_tokenizer(model: onnxruntime_genai.Model) -> onnx - `model`: (Required) The model that was loaded by the `Model()` -#### Return value +#### Returns -- `Tokenizer` +- `Tokenizer`: The tokenizer object ### Generate method ```python -onnxruntime_genai.Model.generate(params: GeneratorParams) -> XXXX +onnxruntime_genai.Model.generate(params: GeneratorParams) -> numpy.ndarray[int, int] ``` #### Parameters - `params`: (Required) Created by the `GenerateParams` method. -#### Return value +#### Returns -`numpy.int32 [ batch_size, max_length]` +`numpy.ndarray[int, int]`: a two dimensional numpy array with dimensions equal to the size of the batch passed in and the maximum length of the sequence of tokens. ## GeneratorParams class @@ -88,9 +88,9 @@ onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime - `model`: (required) The model that was loaded by onnxruntime_genai.Model() -#### Return value +#### Returns -`onnxruntime_genai.GeneratorParams` +`onnxruntime_genai.GeneratorParams`: The GeneratorParams object ## Tokenizer class @@ -99,16 +99,16 @@ Tokenizer objects are created from a Model. ### Encode ```python -onnxruntime_genai.Tokenizer.encode(prompt: str) -> numpy.int32 +onnxruntime_genai.Tokenizer.encode(text: str) -> numpy.ndarray[numpy.int32] ``` #### Parameters -- `prompt`: (Required) +- `text`: (Required) -#### Return value +#### Returns -`numpy.int32`: an array of tokens representing the prompt +`numpy.ndarray[numpy.int32]`: an array of tokens representing the prompt ### Decode @@ -118,27 +118,27 @@ onnxruntime_genai.Tokenizer.decode(numpy.int32) -> str #### Parameters -`numpy.int32`: (Required) a sequence of generated tokens +- `numpy.ndarray[numpy.int32]`: (Required) a sequence of generated tokens -#### Return value +#### Returns -str: the decoded generated tokens +`str`: the decoded generated tokens ### Encode batch ```python -onnxruntime_genai.Tokenizer.encode_batch(texts: list[str]) -> +onnxruntime_genai.Tokenizer.encode_batch(texts: list[str]) -> numpy.ndarray[int, int] ``` #### Parameters -- `texts`: a list of inputs +- `texts`: A list of inputs -#### Return value +#### Returns -`[[numpy.int32]]`: a 2 D array of tokens +`numpy.ndarray[int, int]`: The batch of tokenized strings ### Decode batch @@ -150,14 +150,13 @@ onnxruntime_genai.Tokenize.decode_batch(tokens: [[numpy.int32]]) -> list[str] - tokens -#### Return value +#### Returns `texts`: a batch of decoded text ### Create tokenizer decoding stream -Decodes one token at a time to allow for responsive user interfaces. ```python onnxruntime_genai.Tokenizer.create_stream() -> TokenizerStream @@ -167,16 +166,17 @@ onnxruntime_genai.Tokenizer.create_stream() -> TokenizerStream None -#### Return value +#### Returns -- TokenizerStream +`onnxruntime_genai.TokenizerStream` The tokenizer stream object ## TokenizerStream class -This class keeps track of the generated token sequence, returning the next displayable string (according to the tokenizer's vocabulary) when decode is called. Explain empty string ... +This class accumulates the next displayable string (according to the tokenizer's vocabulary). 
### Decode method + ```python onnxruntime_genai.TokenizerStream.decode(token: int32) -> str ``` @@ -187,7 +187,8 @@ onnxruntime_genai.TokenizerStream.decode(token: int32) -> str #### Returns -`str`: Next displayable text, if at the end of displayable block?, otherwise empty string +`str`: If a displayable string has accumulated, this method returns it. If not, this method returns the empty string. + ## Generator class @@ -202,19 +203,22 @@ onnxruntime_genai.Generator(model: Model, params: GeneratorParams) -> Generator - `model`: (Required) The model to use for generation - `params`: (Required) The set of parameters that control the generation -#### Return value +#### Returns -- `onnxruntime_genai.Generator` +`onnxruntime_genai.Generator` The Generator object ### Is generation done -Returns true when all sequences are at max length, or have reached the end of sequence. - ```python onnxruntime_genai.Generator.is_done() -> bool ``` +#### Returns + +Returns true when all sequences are at max length, or have reached the end of sequence. + + ### Compute logits Runs the model through one iteration. @@ -257,16 +261,18 @@ onnxruntime_genai.Generator.generate_next_token_top_k_top_p() ### Get next tokens -Returns the most recently generated tokens. - ```python -onnxruntime_genai.Generator.get_next_tokens() -> [numpy.int32] +onnxruntime_genai.Generator.get_next_tokens() -> numpy.ndarray[numpy.int32] ``` +Returns + +`numpy.ndarray[numpy.int32]`: The most recently generated tokens + ### Get sequence ```python -onnxruntime_genai.Generator.get_sequence(index: int) -> [numpy.int32] +onnxruntime_genai.Generator.get_sequence(index: int) -> numpy.ndarray[numpy.int32] ``` - `index`: (Required) The index of the sequence in the batch to return \ No newline at end of file From 882c23d972021dc581305ab86b92dc72ab0d1421 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 05:23:12 -0800 Subject: [PATCH 16/44] A API docs first version --- docs/genai/api/c.md | 607 +++++++++++++++++++++++++--------- docs/genai/api/index.md | 4 +- docs/genai/howto/index.md | 4 +- docs/genai/index.md | 4 +- docs/genai/tutorials/index.md | 4 +- 5 files changed, 462 insertions(+), 161 deletions(-) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 5bc0df2b371d3..fa96d10f27533 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -8,241 +8,427 @@ nav_order: 2 --- # ONNX Runtime GenAI C API + +_Note: this API is in preview and is subject to change._ + {: .no_toc } * TOC placeholder {:toc} -## Create model +## Overview + +## Functions + +### Create model + +Creates a model from the given configuration directory and device type. + +#### Parameters + * Input: config_path The path to the model configuration directory. The path is expected to be encoded in UTF-8. + * Input: device_type The device type to use for the model. + * Output: out The created model. + +#### Returns + OgaResult containing the error message if the model creation failed. + -/* - * \brief Creates a model from the given configuration directory and device type. - * \param[in] config_path The path to the model configuration directory. The path is expected to be encoded in UTF-8. - * \param[in] device_type The device type to use for the model. - * \param[out] out The created model. - * \return OgaResult containing the error message if the model creation failed. 
- */ +### Destroy model + +#### Parameters ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaDeviceType device_type, OgaModel** out); ``` -/* - * \brief Destroys the given model. - * \param[in] model The model to be destroyed. - */ +#### Parameters + + + Destroys the given model. + * Input: model The model to be destroyed. + ```c OGA_EXPORT void OGA_API_CALL OgaDestroyModel(OgaModel* model); ``` -## Create Tokenizer +### Create Tokenizer + +#### Parameters + +#### Returns ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizer(const OgaModel* model, OgaTokenizer** out); +``` + +### Destroy Tokenizer + +#### Parameters + +#### Returns + +```c OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizer(OgaTokenizer*); +``` + +### Encode batch + +#### Parameters + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncodeBatch(const OgaTokenizer*, const char** strings, size_t count, OgaSequences** out); +``` + +### Decode batch + +#### Parameters + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecodeBatch(const OgaTokenizer*, const OgaSequences* tokens, const char*** out_strings); +``` + +### Destroy tokenizer strings + +#### Parameters + +```c OGA_EXPORT void OGA_API_CALL OgaTokenizerDestroyStrings(const char** strings, size_t count); ``` +### Create tokenizer stream + + +#### Parameters + +```c +OgaTokenizerStream is to decoded token strings incrementally, one token at a time. +``` + +#### Parameters -/* OgaTokenizerStream is to decoded token strings incrementally, one token at a time. - */ ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizerStream(const OgaTokenizer*, OgaTokenizerStream** out); +``` + +### Destroy tokenizer stream + +#### Parameters + +```c OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizerStream(OgaTokenizerStream*); ``` -/* - * Decode a single token in the stream. If this results in a word being generated, it will be returned in 'out'. +### Decode stream + +Decode a single token in the stream. If this results in a word being generated, it will be + +#### Parameters + +returned in 'out'. * The caller is responsible for concatenating each chunk together to generate the complete result. * 'out' is valid until the next call to OgaTokenizerStreamDecode or when the OgaTokenizerStream is destroyed - */ + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerStreamDecode(OgaTokenizerStream*, int32_t token, const char** out); ``` -## Create Generator +### Create Generator + +Creates a generator from the given model and generator params. -/* - * \brief Creates a generator from the given model and generator params. - * \param[in] model The model to use for generation. - * \param[in] params The parameters to use for generation. - * \param[out] out The created generator. - * \return OgaResult containing the error message if the generator creation failed. - */ +#### Parameters + + * Input: model The model to use for generation. + * Input: params The parameters to use for generation. + * Output: out The created generator. + +#### Returns +OgaResult containing the error message if the generator creation failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGenerator(const OgaModel* model, const OgaGeneratorParams* params, OgaGenerator** out); ``` -/* - * \brief Destroys the given generator. - * \param[in] generator The generator to be destroyed. - */ +### Destroy generator + +Destroys the given generator. + +#### Parameters + +* Input: generator The generator to be destroyed. 
+ +#### Returns +`void` + ```c OGA_EXPORT void OGA_API_CALL OgaDestroyGenerator(OgaGenerator* generator); ``` -## Create and set generator input parameters +### Create generator params + +Creates a OgaGeneratorParams from the given model. + +#### Parameters + +* Input: model The model to use for generation. +* Output: out The created generator params. -/* - * \brief Creates a OgaGeneratorParams from the given model. - * \param[in] model The model to use for generation. - * \param[out] out The created generator params. - * \return OgaResult containing the error message if the generator params creation failed. - */ +#### Returns + +OgaResult containing the error message if the generator params creation failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); ``` -/* - * \brief Destroys the given generator params. - * \param[in] generator_params The generator params to be destroyed. - */ +### Destroy generator params + +Destroys the given generator params. + +#### Parameters + + * Input: generator_params The generator params to be destroyed. + +#### Returns +`void` + ```c OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); ``` -/* - * \brief Sets the maximum length that the generated sequence can have. - * \param[in] params The generator params to set the maximum length on. - * \param[in] max_length The maximum length of the generated sequences. - * \return OgaResult containing the error message if the setting of the maximum length failed. - */ +### Set maximum length + +Sets the maximum length that the generated sequence can have. + +#### Parameters + +* Input: params The generator params to set the maximum length on. +* Input: max_length The maximum length of the generated sequences. + +#### Returns + +`OgaResult` containing the error message if the setting of the maximum length failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetMaxLength(OgaGeneratorParams* generator_params, size_t max_length); ``` -/* - * \brief Sets the input ids for the generator params. The input ids are used to seed the generation. - * \param[in] generator_params The generator params to set the input ids on. - * \param[in] input_ids The input ids array of size input_ids_count = batch_size * sequence_length. - * \param[in] input_ids_count The total number of input ids. - * \param[in] sequence_length The sequence length of the input ids. - * \param[in] batch_size The batch size of the input ids. - * \return OgaResult containing the error message if the setting of the input ids failed. - */ +### Set inputs + +Sets the input ids for the generator params. The input ids are used to seed the generation. + +#### Parameters + + * Input: generator_params The generator params to set the input ids on. + * Input: input_ids The input ids array of size input_ids_count = batch_size * sequence_length. + * Input: input_ids_count The total number of input ids. + * Input: sequence_length The sequence length of the input ids. + * Input: batch_size The batch size of the input ids. + +#### Returns + + OgaResult containing the error message if the setting of the input ids failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputIDs(OgaGeneratorParams* generator_params, const int32_t* input_ids, size_t input_ids_count, size_t sequence_length, size_t batch_size); ``` -/* - * \brief Sets the input id sequences for the generator params. The input id sequences are used to seed the generation. 
- * \param[in] generator_params The generator params to set the input ids on. - * \param[in] sequences The input id sequences. - * \return OgaResult containing the error message if the setting of the input id sequences failed. - */ +### Set input sequence + +Sets the input id sequences for the generator params. The input id sequences are used to seed the generation. + +#### Parameters + + * Input: generator_params The generator params to set the input ids on. + * Input: sequences The input id sequences. + +#### Returns + + OgaResult containing the error message if the setting of the input id sequences failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputSequences(OgaGeneratorParams* generator_params, const OgaSequences* sequences); ``` -## Tokenize and decode tokens +### Encode + +Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences when it is no longer needed. + +#### Parameters + +#### Returns -/* Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences - when it is no longer needed. - */ ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncode(const OgaTokenizer*, const char* str, OgaSequences* sequences); ``` -/* Decode a single token sequence and returns a null terminated utf8 string. out_string must be freed with OgaDestroyString - */ +### Decode + +Decode a single token sequence and returns a null terminated utf8 string. out_string must be freed with OgaDestroyString + +#### Parameters + +#### Returns + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecode(const OgaTokenizer*, const int32_t* tokens, size_t token_count, const char** out_string); ``` -## Generate output +### Generate + +Generates an array of token arrays from the model execution based on the given generator params. + +#### Parameters -### High level API +* Input: model The model to use for generation. +* Input: generator_params The parameters to use for generation. +* Output: out The generated sequences of tokens. The caller is responsible for freeing the sequences using OgaDestroySequences after it is done using the sequences. -/* - * \brief Generates an array of token arrays from the model execution based on the given generator params. - * \param[in] model The model to use for generation. - * \param[in] generator_params The parameters to use for generation. - * \param[out] out The generated sequences of tokens. The caller is responsible for freeing the sequences using OgaDestroySequences - * after it is done using the sequences. - * \return OgaResult containing the error message if the generation failed. - */ +#### Returns + +OgaResult containing the error message if the generation failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerate(const OgaModel* model, const OgaGeneratorParams* generator_params, OgaSequences** out); ``` -/* - * \brief Creates a OgaGeneratorParams from the given model. - * \param[in] model The model to use for generation. - * \param[out] out The created generator params. - * \return OgaResult containing the error message if the generator params creation failed. - */ +### Create generator params + +Creates a OgaGeneratorParams from the given model. + +#### Parameters + +* Input: model The model to use for generation. +* Output: out The created generator params. + +#### Returns + +OgaResult containing the error message if the generator params creation failed. 
+ ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); ``` -/* - * \brief Destroys the given generator params. - * \param[in] generator_params The generator params to be destroyed. - */ +### Destroy generator params + +Destroys the given generator params. + +#### Parameters + +* Input: generator_params The generator params to be destroyed. + +#### Returns +`void` + ```c OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); ``` -### Low level API +### Check if generation has completed + +Returns true if the generator has finished generating all the sequences. + +#### Parameters +* Input: generator The generator to check if it is done with generating all sequences. -/* - * \brief Returns true if the generator has finished generating all the sequences. - * \param[in] generator The generator to check if it is done with generating all sequences. - * \return True if the generator has finished generating all the sequences, false otherwise. - */ +#### Returns + +True if the generator has finished generating all the sequences, false otherwise. + ```c OGA_EXPORT bool OGA_API_CALL OgaGenerator_IsDone(const OgaGenerator* generator); ``` -/* - * \brief Computes the logits from the model based on the input ids and the past state. The computed logits are stored in the generator. - * \param[in] generator The generator to compute the logits for. - * \return OgaResult containing the error message if the computation of the logits failed. - */ +### Run one iteration of the model + +Computes the logits from the model based on the input ids and the past state. The computed logits are stored in the generator. + +#### Parameters + +* Input: generator The generator to compute the logits for. + +#### Returns + +OgaResult containing the error message if the computation of the logits failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_ComputeLogits(OgaGenerator* generator); ``` -/* - * \brief Generates the next token based on the computed logits using the greedy search. - * \param[in] generator The generator to generate the next token for. - * \return OgaResult containing the error message if the generation of the next token failed. - */ +### Generate next token + +Generates the next token based on the computed logits using the greedy search. + +#### Parameters + + * Input: generator The generator to generate the next token for. + +#### Returns + +OgaResult containing the error message if the generation of the next token failed. + ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_Top(OgaGenerator* generator); +``` +### Generate next token with Top K sampling + +#### Parameters + +#### Returns + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopK(OgaGenerator* generator, int k, float t); +``` + +### Generate next token with Top P sampling + +#### Parameters + +#### Returns + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerator_GenerateNextToken_TopP(OgaGenerator* generator, float p, float t); ``` -/* - * \brief Returns the number of tokens in the sequence at the given index. - * \param[in] generator The generator to get the count of the tokens for the sequence at the given index. - * \return The number tokens in the sequence at the given index. - */ +### Get number of tokens + + Returns the number of tokens in the sequence at the given index. 
+ +#### Parameters + + * Input: generator The generator to get the count of the tokens for the sequence at the given index. + * Input: index. The index at which to return the tokens + +#### Returns + +The number tokens in the sequence at the given index. + ```c OGA_EXPORT size_t OGA_API_CALL OgaGenerator_GetSequenceLength(const OgaGenerator* generator, size_t index); ``` -/* - * \brief Returns a pointer to the sequence data at the given index. The number of tokens in the sequence - * is given by OgaGenerator_GetSequenceLength - * \param[in] generator The generator to get the sequence data for the sequence at the given index. - * \return The pointer to the sequence data at the given index. The sequence data is owned by the OgaGenerator - * and will be freed when the OgaGenerator is destroyed. The caller must copy the data if it needs to - * be used after the OgaGenerator is destroyed. - */ +### Get sequence + +Returns a pointer to the sequence data at the given index. The number of tokens in the sequence is given by OgaGenerator_GetSequenceLength. + +#### Parameters + +* Input: generator The generator to get the sequence data for the sequence at the given index. The pointer to the sequence data at the given index. The sequence data is owned by the OgaGenerator and will be freed when the OgaGenerator is destroyed. The caller must copy the data if it needs to be used after the OgaGenerator is destroyed. +* Input: index. The index at which to get the sequence. + +#### Returns + +A pointer to the token sequence + ```c OGA_EXPORT const int32_t* OGA_API_CALL OgaGenerator_GetSequence(const OgaGenerator* generator, size_t index); ``` - ## Enums and structs ```c @@ -251,13 +437,17 @@ typedef enum OgaDeviceType { OgaDeviceTypeCPU, OgaDeviceTypeCUDA, } OgaDeviceType; +``` +```c typedef enum OgaDataType { OgaDataType_int32, OgaDataType_float32, OgaDataType_string, // UTF8 string } OgaDataType; +``` +```c typedef struct OgaResult OgaResult; typedef struct OgaGeneratorParams OgaGeneratorParams; typedef struct OgaGenerator OgaGenerator; @@ -265,78 +455,181 @@ typedef struct OgaModel OgaModel; typedef struct OgaBuffer OgaBuffer; ``` -// OgaSequences is an array of token arrays where the number of token arrays can be obtained using -// OgaSequencesCount and the number of tokens in each token array can be obtained using - -```c -OgaSequencesGetSequenceCount. -typedef struct OgaSequences OgaSequences; -typedef struct OgaTokenizer OgaTokenizer; -typedef struct OgaTokenizerStream OgaTokenizerStream; -``` ## Utility functions -/* - * \param[in] result OgaResult that contains the error message. - * \return Error message contained in the OgaResult. The const char* is owned by the OgaResult - * and can will be freed when the OgaResult is destroyed. - */ +### Get error message + +#### Parameters + +* Input: result OgaResult that contains the error message. + +#### Returns + +Error message contained in the OgaResult. The const char* is owned by the OgaResult and can will be freed when the OgaResult is destroyed. + ```c OGA_EXPORT const char* OGA_API_CALL OgaResultGetError(OgaResult* result); ``` -/* - * \param[in] result OgaResult to be destroyed. - */ +### Destroy result + +#### Parameters + +* Input: result OgaResult to be destroyed. 
+ +#### Returns +`void` + ```c OGA_EXPORT void OGA_API_CALL OgaDestroyResult(OgaResult*); +``` + +### Destroy string + +#### Parameters +* Input: string to be destroyed + +#### Returns + +```c OGA_EXPORT void OGA_API_CALL OgaDestroyString(const char*); +``` + +### Destroy buffer + +#### Parameters +* Input: buffer to be destroyed + +#### Returns +`void` +```c OGA_EXPORT void OGA_API_CALL OgaDestroyBuffer(OgaBuffer*); +``` + +### Get buffer type + +#### Parameters +* Input: the buffer + +#### Returns + +The type of the buffer + +```c OGA_EXPORT OgaDataType OGA_API_CALL OgaBufferGetType(const OgaBuffer*); +``` + +### Get the number of dimensions of a buffer + +#### Parameters +* Input: the buffer + +#### Returns +The number of dimensions in the buffer + +```c OGA_EXPORT size_t OGA_API_CALL OgaBufferGetDimCount(const OgaBuffer*); +``` + +### Get buffer dimensions + +Get the dimensions of a buffer + +#### Parameters +* Input: the buffer +* Output: a dimension array + +#### Returns +`OgaResult` + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaBufferGetDims(const OgaBuffer*, size_t* dims, size_t dim_count); +``` + +### Get buffer data + +Get the data from a buffer + +#### Parameters + +#### Returns +`void` + +```c OGA_EXPORT const void* OGA_API_CALL OgaBufferGetData(const OgaBuffer*); +``` + +### Create sequences + +#### Parameters + +#### Returns + +```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateSequences(OgaSequences** out); ``` +### Destroy sequences + +#### Parameters + +* Input: sequences OgaSequences to be destroyed. + +#### Returns +`void` + +#### Returns -/* - * \param[in] sequences OgaSequences to be destroyed. - */ ```c OGA_EXPORT void OGA_API_CALL OgaDestroySequences(OgaSequences* sequences); ``` -/* - * \brief Returns the number of sequences in the OgaSequences - * \param[in] sequences - * \return The number of sequences in the OgaSequences - */ +### Get number of sequences + +Returns the number of sequences in the OgaSequences + +#### Parameters + +* Input: sequences + +#### Returns +The number of sequences in the OgaSequences + ```c OGA_EXPORT size_t OGA_API_CALL OgaSequencesCount(const OgaSequences* sequences); ``` -/* - * \brief Returns the number of tokens in the sequence at the given index - * \param[in] sequences - * \return The number of tokens in the sequence at the given index - */ +### Get the number of tokens in a sequence + +Returns the number of tokens in the sequence at the given index + +#### Parameters + +* Input: sequences + +#### Returns + +The number of tokens in the sequence at the given index + ```c OGA_EXPORT size_t OGA_API_CALL OgaSequencesGetSequenceCount(const OgaSequences* sequences, size_t sequence_index); ``` -/* - * \brief Returns a pointer to the sequence data at the given index. The number of tokens in the sequence - * is given by OgaSequencesGetSequenceCount - * \param[in] sequences - * \return The pointer to the sequence data at the given index. The pointer is valid until the OgaSequences is destroyed. - */ +### Get sequence data + +Returns a pointer to the sequence data at the given index. The number of tokens in the sequence is given by OgaSequencesGetSequenceCount + +#### Parameters +* Input: sequences + +#### Returns + +The pointer to the sequence data at the given index. The pointer is valid until the OgaSequences is destroyed. 
+ ```c OGA_EXPORT const int32_t* OGA_API_CALL OgaSequencesGetSequenceData(const OgaSequences* sequences, size_t sequence_index); - -OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetWhisperInputFeatures(OgaGeneratorParams*, const int32_t* inputs, size_t count); -OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetWhisperDecoderInputIDs(OgaGeneratorParams*, const int32_t* input_ids, size_t input_ids_count); ``` + diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md index 436faba7cf72b..c46162eca3544 100644 --- a/docs/genai/api/index.md +++ b/docs/genai/api/index.md @@ -4,4 +4,6 @@ description: API documentation for ONNX Runtime GenAI parent: Generative AI has_children: true nav_order: 2 ---- \ No newline at end of file +--- + +_Note: this API is in preview and is subject to change._ diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md index 3ad82074dc2de..f6442fdb83e03 100644 --- a/docs/genai/howto/index.md +++ b/docs/genai/howto/index.md @@ -4,4 +4,6 @@ description: How to perform specific tasks with ONNX Runtime GenAI parent: Generative AI has_children: true nav_order: 3 ---- \ No newline at end of file +--- + +_Note: this API is in preview and is subject to change._ diff --git a/docs/genai/index.md b/docs/genai/index.md index 215dcd5e85546..57634b8f896e3 100644 --- a/docs/genai/index.md +++ b/docs/genai/index.md @@ -1,5 +1,5 @@ --- -title: Generative AI +title: Generative AI (Preview) description: Run generative models with ONNX Runtime GenAI has_children: true nav_order: 6 @@ -7,6 +7,8 @@ nav_order: 6 # Generative AI with ONNX Runtime +_Note: this API is in preview and is subject to change._ + Run generative AI models with ONNX Runtime. This library provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. 
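+
+As an illustration, the whole loop can be driven from Python in a few lines (a sketch; the model folder name is a placeholder):
+
+```python
+import onnxruntime_genai as og
+
+model = og.Model("example-models/phi2-int4-cpu", og.DeviceType.CPU)
+tokenizer = model.create_tokenizer()
+
+params = og.GeneratorParams(model)
+params.input_ids = tokenizer.encode("Hello")
+
+# High-level API: one call runs the full generation loop
+output_tokens = model.generate(params)[0]
+print(tokenizer.decode(output_tokens))
+```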
diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md index 4c6bdc87bba86..a4479616b3bd0 100644 --- a/docs/genai/tutorials/index.md +++ b/docs/genai/tutorials/index.md @@ -4,4 +4,6 @@ description: Build your application with ONNX Runtime GenAI parent: Generative AI has_children: true nav_order: 1 ---- \ No newline at end of file +--- + +_Note: this API is in preview and is subject to change._ From c86881cb068ca870946b7bbf9d5902cac70a25c7 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 07:31:06 -0800 Subject: [PATCH 17/44] Update to preview --- docs/genai/api/c.md | 2 +- docs/genai/api/csharp.md | 2 +- docs/genai/api/index.md | 2 +- docs/genai/api/python.md | 2 +- docs/genai/howto/build-from-source.md | 2 +- docs/genai/howto/build-model.md | 2 +- docs/genai/howto/index.md | 2 +- docs/genai/howto/install.md | 2 +- docs/genai/tutorials/index.md | 2 +- 9 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index fa96d10f27533..24541598a7b3c 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -3,7 +3,7 @@ title: C API description: C API reference for ONNX Runtime GenAI has_children: false parent: API docs -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 2 --- diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index f7a8fe327fc40..93c4a49cf10ba 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -3,7 +3,7 @@ title: C# API description: C# API reference for ONNX Runtime GenAI has_children: false parent: API docs -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 3 --- diff --git a/docs/genai/api/index.md b/docs/genai/api/index.md index c46162eca3544..1684099508fa4 100644 --- a/docs/genai/api/index.md +++ b/docs/genai/api/index.md @@ -1,7 +1,7 @@ --- title: API docs description: API documentation for ONNX Runtime GenAI -parent: Generative AI +parent: Generative AI (Preview) has_children: true nav_order: 2 --- diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index 1d4a298486792..91a21f808278f 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -3,7 +3,7 @@ title: Python API description: Python API reference for ONNX Runtime GenAI has_children: false parent: API docs -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 2 --- diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md index e7479f5a2aaf3..71c345cd9d365 100644 --- a/docs/genai/howto/build-from-source.md +++ b/docs/genai/howto/build-from-source.md @@ -3,7 +3,7 @@ title: Build from source description: How to build ONNX Runtime GenAI from source has_children: false parent: How to -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 2 --- diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index dac5db965b80d..e238141a98afc 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -3,7 +3,7 @@ title: Build models description: How to build models with ONNX Runtime GenAI has_children: false parent: How to -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 2 --- diff --git a/docs/genai/howto/index.md b/docs/genai/howto/index.md index f6442fdb83e03..06847318ef626 100644 --- a/docs/genai/howto/index.md +++ b/docs/genai/howto/index.md @@ -1,7 +1,7 @@ --- title: How to description: How to perform specific tasks with ONNX Runtime GenAI -parent: Generative AI +parent: 
Generative AI (Preview) has_children: true nav_order: 3 --- diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 4592a8d3c47c7..7be960a6218eb 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -3,7 +3,7 @@ title: Install ONNX Runtime GenAI description: Instructions to install ONNX Runtime GenAI on your target platform in your environment has_children: false parent: How to -grand_parent: Generative AI +grand_parent: Generative AI (Preview) nav_order: 1 --- diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md index a4479616b3bd0..2527c3ff67d82 100644 --- a/docs/genai/tutorials/index.md +++ b/docs/genai/tutorials/index.md @@ -1,7 +1,7 @@ --- title: Tutorials description: Build your application with ONNX Runtime GenAI -parent: Generative AI +parent: Generative AI (Preview) has_children: true nav_order: 1 --- From 27f3ae3e973ca390e3e706d41f0d935ef9b733c7 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 07:51:30 -0800 Subject: [PATCH 18/44] C# API first draft --- docs/genai/api/csharp.md | 140 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 140 insertions(+) diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index 93c4a49cf10ba..8ba88fa650c33 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -12,3 +12,143 @@ nav_order: 3 * TOC placeholder {:toc} + +## Overview + +## Model class + +### Constructor + +```csharp +public Model(string modelPath, DeviceType deviceType) +``` + +### Generate method + +```csharp +public Sequences Generate(GeneratorParams generatorParams) +``` + +## Tokenizer class + +### Constructor + +```csharp +public Tokenizer(Model model) +``` + +### Encode method + +```csharp +public Sequences Encode(string str) +``` + +### Encode batch method + +```csharp +public Sequences EncodeBatch(string[] strings) +``` + +### Decode method + +```csharp +public string Decode(ReadOnlySpan sequence) +``` + +### Decode batch method + +```csharp +public string[] DecodeBatch(Sequences sequences) +``` + +### Create stream method + +```csharp +public TokenizerStream CreateStream() +``` + +## TokenizerStream class + +### Decode method + +```csharp +public string Decode(int token) +``` + +## GeneratorParams class + +### Constructor + +```csharp +public GeneratorParams(Model model) +``` + +### Set search option (double) + +```csharp +public void SetSearchOption(string searchOption, double value) +``` + +### Set search option (bool) method + +```csharp +public void SetSearchOption(string searchOption, bool value) +``` + +### Set input ids method + +```csharp +public void SetInputIDs(ReadOnlySpan inputIDs, ulong sequenceLength, ulong batchSize) +``` + +### Set input sequences method + +```csharp +public void SetInputSequences(Sequences sequences) +``` + + + + + +## Generator class + +### Constructor + +```csharp +public Generator(Model model, GeneratorParams generatorParams) +``` + +### Is done method + +```csharp +public bool IsDone() +``` + +### Compute logits + +```csharp +public void ComputeLogits() +``` + +### Generate next token method + +```csharp +public void GenerateNextTokenTop() +``` + + +## Sequences class + +### Num sequences member + +```csharp +public ulong NumSequences { get { return _numSequences; } } +``` + +### [] operator + +```csharp +public ReadOnlySpan this[ulong sequenceIndex] +``` + From 6d1395d5b94dd7ee8201c6f6e45c862f279027ae Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 07:52:49 -0800 Subject: [PATCH 19/44] Re-order APIs --- 
docs/genai/api/c.md | 2 +- docs/genai/api/csharp.md | 2 +- docs/genai/api/python.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 24541598a7b3c..97470ba79d202 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -4,7 +4,7 @@ description: C API reference for ONNX Runtime GenAI has_children: false parent: API docs grand_parent: Generative AI (Preview) -nav_order: 2 +nav_order: 3 --- # ONNX Runtime GenAI C API diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index 8ba88fa650c33..9008c43bb393b 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -4,7 +4,7 @@ description: C# API reference for ONNX Runtime GenAI has_children: false parent: API docs grand_parent: Generative AI (Preview) -nav_order: 3 +nav_order: 2 --- # ONNX Runtime GenAI C# API diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index 91a21f808278f..c4b2c08d4e9e4 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -4,7 +4,7 @@ description: Python API reference for ONNX Runtime GenAI has_children: false parent: API docs grand_parent: Generative AI (Preview) -nav_order: 2 +nav_order: 1 --- # Python API From b58e208c5ad7b4f0d2e7ed0d4820f480b9f4636e Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 12:54:45 -0800 Subject: [PATCH 20/44] Add phi-2 Python tutorial --- docs/genai/howto/install.md | 2 +- docs/genai/reference/config.md | 8 ++++ docs/genai/reference/index.md | 0 docs/genai/tutorials/phi2-python.md | 69 +++++++++++++++++++++++++++++ 4 files changed, 78 insertions(+), 1 deletion(-) create mode 100644 docs/genai/reference/config.md create mode 100644 docs/genai/reference/index.md create mode 100644 docs/genai/tutorials/phi2-python.md diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 7be960a6218eb..97adaee42c9e5 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -1,5 +1,5 @@ --- -title: Install ONNX Runtime GenAI +title: Install description: Instructions to install ONNX Runtime GenAI on your target platform in your environment has_children: false parent: How to diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md new file mode 100644 index 0000000000000..e2c2f9399d8a3 --- /dev/null +++ b/docs/genai/reference/config.md @@ -0,0 +1,8 @@ +--- +title: Configuration reference +description: Reference for the ONNX Runtime Generative AI configuration file +has_children: false +parent: Reference +grand_parent: Generative AI (Preview) +nav_order: 1 +--- diff --git a/docs/genai/reference/index.md b/docs/genai/reference/index.md new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md new file mode 100644 index 0000000000000..02314a8f2f162 --- /dev/null +++ b/docs/genai/tutorials/phi2-python.md @@ -0,0 +1,69 @@ +--- +title: Python phi-2 tutorial +description: Learn how to write a language generation application with ONNX Runtime GenAI in Python using the phi-2 model +has_children: false +parent: Tutorials +grand_parent: Generative AI (Preview) +nav_order: 1 +--- + +# Language generation in Python with phi-2 + +## Setup and installation + +Install the ONNX Runtime GenAI Python package using the [installation instructions](../install.md). + +## Build phi-2 ONNX model + +The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. 
The tool also allows you to load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-model.md).
+
+If using the `-m` option shown here, which downloads from HuggingFace, you will need to log in to HuggingFace.
+
+```bash
+pip install huggingface-hub
+huggingface-cli login
+```
+
+You can build the model in different precisions. This command uses int4 as it produces the smallest model and can run on a CPU.
+
+```python
+python -m onnxruntime_genai.models.builder -m microsoft/phi-2 -e cpu -p int4 -o ./example-models/phi2-int4-cpu
+```
+You can replace the name of the output folder specified with the `-o` option with a folder of your choice.
+
+After you run the script, you will see a series of files generated in this folder. They include the HuggingFace configs for your reference, as well as the following generated files used by ONNX Runtime GenAI.
+
+`genai_config.json`: the configuration used by ONNX Runtime GenAI
+`model.onnx`: the phi-2 ONNX model
+`model.onnx.data`: the phi-2 ONNX model weights
+
+## Run the model with a sample prompt
+
+Run the model with the following Python script. You can change the prompt and other parameters as needed.
+
+```python
+import onnxruntime_genai as og
+
+prompt = '''def print_prime(n):
+    """
+    Print all primes between 1 and n
+    """'''
+
+model = og.Model('example-models/phi2-int4-cpu', og.DeviceType.CPU)
+
+tokenizer = model.create_tokenizer()
+
+tokens = tokenizer.encode(prompt)
+
+params = og.GeneratorParams(model)
+params.set_search_options({"max_length":200})
+params.input_ids = tokens
+
+output_tokens = model.generate(params)[0]
+
+text = tokenizer.decode(output_tokens)
+
+print(text)
+```
+
+## \ No newline at end of file
From d30a3a44b67293ec0b4f4aea8bd60b5966663da4 Mon Sep 17 00:00:00 2001
From: natke
Date: Wed, 21 Feb 2024 12:59:29 -0800
Subject: [PATCH 21/44] Add config reference

---
 docs/genai/reference/config.md      | 50 +++++++++++++++++++++++++++++
 docs/genai/reference/index.md       |  7 ++++
 docs/genai/tutorials/index.md       |  1 +
 docs/genai/tutorials/phi2-python.md | 45 ++++++++++++++++++++++++--
 4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md
index e2c2f9399d8a3..7a7478fb77c48 100644
--- a/docs/genai/reference/config.md
+++ b/docs/genai/reference/config.md
@@ -6,3 +6,53 @@ parent: Reference
 grand_parent: Generative AI (Preview)
 nav_order: 1
 ---
+
+# Configuration reference
+
+## Example file for phi-2
+
+```
+{
+    "model": {
+        "bos_token_id": 50256,
+        "context_length": 2048,
+        "decoder": {
+            "filename": "model.onnx",
+            "head_size": 80,
+            "hidden_size": 2560,
+            "inputs": {
+                "input_ids": "input_ids",
+                "attention_mask": "attention_mask",
+                "position_ids": "position_ids",
+                "past_key_names": "past_key_values.%d.key",
+                "past_value_names": "past_key_values.%d.value"
+            },
+            "outputs": {
+                "logits": "logits",
+                "present_key_names": "present.%d.key",
+                "present_value_names": "present.%d.value"
+            },
+            "num_attention_heads": 32,
+            "num_hidden_layers": 32,
+            "num_key_value_heads": 32
+        },
+        "eos_token_id": 50256,
+        "pad_token_id": 50256,
+        "type": "phi",
+        "vocab_size": 51200
+    },
+    "search": {
+        "diversity_penalty": 0.0,
+        "length_penalty": 1.0,
+        "max_length": 20,
+        "min_length": 0,
+        "no_repeat_ngram_size": 0,
+        "num_beams": 1,
+        "num_return_sequences": 1,
+        "repetition_penalty": 1.0,
+        "temperature": 0.7,
+        "top_k": 50,
+        "top_p": 0.6
+    }
+}
+```
diff --git a/docs/genai/reference/index.md b/docs/genai/reference/index.md
index e69de29bb2d1d..5219e29015065 100644
--- a/docs/genai/reference/index.md
+++ b/docs/genai/reference/index.md
@@ -0,0 +1,7 @@
+---
+title: Reference
+description: Reference information for ONNX Runtime Generative AI
+parent: Generative AI (Preview)
+has_children: true
+nav_order: 1
+--- \ No newline at end of file
diff --git a/docs/genai/tutorials/index.md b/docs/genai/tutorials/index.md
index 2527c3ff67d82..c05d1a1797827 100644
--- a/docs/genai/tutorials/index.md
+++ b/docs/genai/tutorials/index.md
@@ -7,3 +7,4 @@ nav_order: 1
 ---
 
 _Note: this API is in preview and is subject to change._
+
diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md
index 02314a8f2f162..67b785676049b 100644
--- a/docs/genai/tutorials/phi2-python.md
+++ b/docs/genai/tutorials/phi2-python.md
@@ -26,16 +26,20 @@ huggingface-cli login
 
 You can build the model in different precisions. This command uses int4 as it produces the smallest model and can run on a CPU.
 
-```python
+```bash
 python -m onnxruntime_genai.models.builder -m microsoft/phi-2 -e cpu -p int4 -o ./example-models/phi2-int4-cpu
 ```
 You can replace the name of the output folder specified with the `-o` option with a folder of your choice.
 
 After you run the script, you will see a series of files generated in this folder. They include the HuggingFace configs for your reference, as well as the following generated files used by ONNX Runtime GenAI.
 
-`genai_config.json`: the configuration used by ONNX Runtime GenAI
 `model.onnx`: the phi-2 ONNX model
 `model.onnx.data`: the phi-2 ONNX model weights
+`genai_config.json`: the configuration used by ONNX Runtime GenAI
+
+You can view and change the values in the `genai_config.json` file. The model section should not be updated unless you have brought your own model and it has different parameters.
+
+The search parameters can be changed. For example, you might want to generate with a different temperature value. These values can also be set via the `set_search_options` method shown below.
 
 ## Run the model with a sample prompt
 
@@ -66,4 +70,39 @@ text = tokenizer.decode(output_tokens)
 print(text)
 ```
 
-## \ No newline at end of file
+## Run batches of prompts
+
+You can also run batches of prompts through the model.
+
+```python
+prompts = [
+    "This is a test.",
+    "Rats are awesome pets!",
+    "The quick brown fox jumps over the lazy dog.",
+    ]
+
+inputs = tokenizer.encode_batch(prompts)
+
+params = og.GeneratorParams(model)
+params.input_ids = inputs
+
+outputs = model.generate(params)
+
+text = tokenizer.decode_batch(outputs)
+```
+
+## Stream the output of the tokenizer
+
+If you are developing an application that requires tokens to be output to the user interface one at a time, you can use the streaming tokenizer.
+ +```python +generator=og.Generator(model, params) +tokenizer_stream=tokenizer.create_stream() + +print(prompt, end='', flush=True) + +while not generator.is_done(): + generator.compute_logits() + generator.generate_next_token_top_p(0.7, 0.6) + print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end='', flush=True) +``` From cd1c98916712a9508851dadcfaf13480f96c4fd6 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 13:29:47 -0800 Subject: [PATCH 22/44] Update Python API docs with recent changes --- docs/genai/api/python.md | 56 ++++++++++++++++++++++++++-------------- 1 file changed, 37 insertions(+), 19 deletions(-) diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index c4b2c08d4e9e4..341bf008eef3f 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -37,8 +37,8 @@ onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) #### Parameters -- `model_folder`: (required) Location of model and configuration on disk -- `device`: (optional) The device to run on. One of: +- `model_folder`: Location of model and configuration on disk +- `device`: The device to run on. One of: - onnxruntime_genai.CPU - onnxruntime_genai.CUDA If not specified, defaults to CPU. @@ -47,21 +47,6 @@ onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) `onnxruntime_genai.Model` -### Create tokenizer object - -```python -onnxruntime_genai.Model.create_tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer -``` - -#### Parameters - -- `model`: (Required) The model that was loaded by the `Model()` - -#### Returns - -- `Tokenizer`: The tokenizer object - - ### Generate method ```python @@ -94,7 +79,19 @@ onnxruntime_genai.GeneratorParams(model: onnxruntime_genai.Model) -> onnxruntime ## Tokenizer class -Tokenizer objects are created from a Model. +### Create tokenizer object + +```python +onnxruntime_genai.Model.Tokenizer(model: onnxruntime_genai.Model) -> onnxruntime_genai.Tokenizer +``` + +#### Parameters + +- `model`: (Required) The model that was loaded by the `Model()` + +#### Returns + +- `Tokenizer`: The tokenizer object ### Encode @@ -113,7 +110,7 @@ onnxruntime_genai.Tokenizer.encode(text: str) -> numpy.ndarray[numpy.int32] ### Decode ```python -onnxruntime_genai.Tokenizer.decode(numpy.int32) -> str +onnxruntime_genai.Tokenizer.decode(tokens: numpy.ndarry[numpy.int32]) -> str ``` #### Parameters @@ -189,6 +186,27 @@ onnxruntime_genai.TokenizerStream.decode(token: int32) -> str `str`: If a displayable string has accumulated, this method returns it. If not, this method returns the empty string. 
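+
+A typical pattern is to pass each newly generated token to `decode` and print whatever text it yields (a sketch that assumes a `generator` created with the Generator class below):
+
+```python
+stream = tokenizer.create_stream()
+while not generator.is_done():
+    generator.compute_logits()
+    generator.generate_next_token()
+    # decode returns the empty string until a displayable chunk accumulates
+    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
+```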
+## Generator Params class + +### Create a Generator Params + +```python +onnxruntime_genai.GeneratorParams(model: Model) -> GeneratorParams +``` + +### Input_ids member + +```python +onnxruntime_genai.GeneratorParams.input_ids = numpy.ndarray[numpy.int32, numpy.int32] +``` + +### Set search options method + +```python +onnxruntime_genai.GeneratorParams.set_search_options(options: dict[str, Any]) +``` + +### ## Generator class From 1011d7eeb432427a305772b34686a4b19c1140b5 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 13:31:10 -0800 Subject: [PATCH 23/44] This time with the changes --- docs/genai/api/python.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index 341bf008eef3f..6d842dd7d847b 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -110,7 +110,7 @@ onnxruntime_genai.Tokenizer.encode(text: str) -> numpy.ndarray[numpy.int32] ### Decode ```python -onnxruntime_genai.Tokenizer.decode(tokens: numpy.ndarry[numpy.int32]) -> str +onnxruntime_genai.Tokenizer.decode(tokens: numpy.ndarry[int]) -> str ``` #### Parameters From 30e9279e4eb6674110052728cc3b5f6443c89f2f Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 14:04:26 -0800 Subject: [PATCH 24/44] Add recent changes to the C API docs --- docs/genai/api/c.md | 222 +++++++++++++++++++------------------------- 1 file changed, 93 insertions(+), 129 deletions(-) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 97470ba79d202..2bbff6cde39cc 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -19,7 +19,7 @@ _Note: this API is in preview and is subject to change._ ## Overview -## Functions +## Model API ### Create model @@ -31,32 +31,55 @@ Creates a model from the given configuration directory and device type. * Output: out The created model. #### Returns - OgaResult containing the error message if the model creation failed. - + `OgaResult` containing the error message if the model creation failed. + +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaDeviceType device_type, OgaModel** out); +``` ### Destroy model +Destroys the given model. + + #### Parameters +* Input: model The model to be destroyed. + +#### Returns +`void` + ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaDeviceType device_type, OgaModel** out); +OGA_EXPORT void OGA_API_CALL OgaDestroyModel(OgaModel* model); ``` +### Generate + +Generates an array of token arrays from the model execution based on the given generator params. + #### Parameters +* Input: model The model to use for generation. +* Input: generator_params The parameters to use for generation. +* Output: out The generated sequences of tokens. The caller is responsible for freeing the sequences using OgaDestroySequences after it is done using the sequences. - Destroys the given model. - * Input: model The model to be destroyed. +#### Returns + +OgaResult containing the error message if the generation failed. ```c -OGA_EXPORT void OGA_API_CALL OgaDestroyModel(OgaModel* model); +OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerate(const OgaModel* model, const OgaGeneratorParams* generator_params, OgaSequences** out); ``` +## Tokenizer API + ### Create Tokenizer #### Parameters +* Input: model. The model for which the tokenizer should be created #### Returns +`OgaResult` containing the error message if the tokenizer creation failed. 
```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizer(const OgaModel* model, OgaTokenizer** out); @@ -64,25 +87,42 @@ OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizer(const OgaModel* model, Oga ### Destroy Tokenizer +```c +OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizer(OgaTokenizer*); +``` +### Encode + +Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences when it is no longer needed. + #### Parameters #### Returns ```c -OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizer(OgaTokenizer*); +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncode(const OgaTokenizer*, const char* str, OgaSequences* sequences); ``` -### Encode batch +### Decode + +Decode a single token sequence and returns a null terminated utf8 string. out_string must be freed with OgaDestroyString #### Parameters +#### Returns + ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncodeBatch(const OgaTokenizer*, const char** strings, size_t count, OgaSequences** out); +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecode(const OgaTokenizer*, const int32_t* tokens, size_t token_count, const char** out_string); ``` -### Decode batch +### Encode batch #### Parameters +* +```c +OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncodeBatch(const OgaTokenizer*, const char** strings, size_t count, TokenSequences** out); +``` + +### Decode batch ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecodeBatch(const OgaTokenizer*, const OgaSequences* tokens, const char*** out_strings); @@ -90,22 +130,13 @@ OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecodeBatch(const OgaTokenizer*, ### Destroy tokenizer strings -#### Parameters - ```c OGA_EXPORT void OGA_API_CALL OgaTokenizerDestroyStrings(const char** strings, size_t count); ``` ### Create tokenizer stream - -#### Parameters - -```c -OgaTokenizerStream is to decoded token strings incrementally, one token at a time. -``` - -#### Parameters +OgaTokenizerStream is used to decoded token strings incrementally, one token at a time. ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateTokenizerStream(const OgaTokenizer*, OgaTokenizerStream** out); @@ -121,98 +152,77 @@ OGA_EXPORT void OGA_API_CALL OgaDestroyTokenizerStream(OgaTokenizerStream*); ### Decode stream -Decode a single token in the stream. If this results in a word being generated, it will be - -#### Parameters - -returned in 'out'. - * The caller is responsible for concatenating each chunk together to generate the complete result. - * 'out' is valid until the next call to OgaTokenizerStreamDecode or when the OgaTokenizerStream is destroyed +Decode a single token in the stream. If this results in a word being generated, it will be returned in 'out'. The caller is responsible for concatenating each chunk together to generate the complete result. +'out' is valid until the next call to OgaTokenizerStreamDecode or when the OgaTokenizerStream is destroyed ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerStreamDecode(OgaTokenizerStream*, int32_t token, const char** out); ``` +## Generator Params API -### Create Generator +### Create Generator Params -Creates a generator from the given model and generator params. +Creates a OgaGeneratorParams from the given model. #### Parameters - * Input: model The model to use for generation. - * Input: params The parameters to use for generation. - * Output: out The created generator. +* Input: model The model to use for generation. +* Output: out The created generator params. 
#### Returns -OgaResult containing the error message if the generator creation failed. + +`OgaResult` containing the error message if the generator params creation failed. ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGenerator(const OgaModel* model, const OgaGeneratorParams* params, OgaGenerator** out); +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); ``` -### Destroy generator +### Destroy Generator Params -Destroys the given generator. +Destroys the given generator params. #### Parameters -* Input: generator The generator to be destroyed. +* Input: generator_params The generator params to be destroyed. #### Returns `void` ```c -OGA_EXPORT void OGA_API_CALL OgaDestroyGenerator(OgaGenerator* generator); -``` - -### Create generator params - -Creates a OgaGeneratorParams from the given model. - -#### Parameters - -* Input: model The model to use for generation. -* Output: out The created generator params. - -#### Returns - -OgaResult containing the error message if the generator params creation failed. - -```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); +OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); ``` -### Destroy generator params +### Set search option (number) -Destroys the given generator params. +Set a search option where the option is a number #### Parameters - - * Input: generator_params The generator params to be destroyed. +* generator_params: The generator params object to set the parameter on +* name: the name of the parameter +* value: the value to set #### Returns -`void` +`OgaResult` containing the error message if the generator params creation failed. ```c -OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetSearchNumber(OgaGeneratorParams* generator_params, const char* name, double value); ``` -### Set maximum length +### Set search option (bool) -Sets the maximum length that the generated sequence can have. +Set a search option where the option is a bool. #### Parameters - -* Input: params The generator params to set the maximum length on. -* Input: max_length The maximum length of the generated sequences. +* generator_params: The generator params object to set the parameter on +* name: the name of the parameter +* value: the value to set #### Returns +`OgaResult` containing the error message if the generator params creation failed. -`OgaResult` containing the error message if the setting of the maximum length failed. - ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetMaxLength(OgaGeneratorParams* generator_params, size_t max_length); +OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetSearchBool(OgaGeneratorParams* generator_params, const char* name, bool value); ``` ### Set inputs @@ -229,7 +239,7 @@ Sets the input ids for the generator params. The input ids are used to seed the #### Returns - OgaResult containing the error message if the setting of the input ids failed. +`OgaResult` containing the error message if the setting of the input ids failed. ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputIDs(OgaGeneratorParams* generator_params, const int32_t* input_ids, size_t input_ids_count, size_t sequence_length, size_t batch_size); @@ -252,79 +262,39 @@ Sets the input id sequences for the generator params. 
The input id sequences are OGA_EXPORT OgaResult* OGA_API_CALL OgaGeneratorParamsSetInputSequences(OgaGeneratorParams* generator_params, const OgaSequences* sequences); ``` -### Encode - -Encodes a single string and adds the encoded sequence of tokens to the OgaSequences. The OgaSequences must be freed with OgaDestroySequences when it is no longer needed. - -#### Parameters - -#### Returns - -```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerEncode(const OgaTokenizer*, const char* str, OgaSequences* sequences); -``` - -### Decode - -Decode a single token sequence and returns a null terminated utf8 string. out_string must be freed with OgaDestroyString - -#### Parameters - -#### Returns - -```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaTokenizerDecode(const OgaTokenizer*, const int32_t* tokens, size_t token_count, const char** out_string); -``` - -### Generate - -Generates an array of token arrays from the model execution based on the given generator params. - -#### Parameters - -* Input: model The model to use for generation. -* Input: generator_params The parameters to use for generation. -* Output: out The generated sequences of tokens. The caller is responsible for freeing the sequences using OgaDestroySequences after it is done using the sequences. - -#### Returns +## Generator API -OgaResult containing the error message if the generation failed. - -```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaGenerate(const OgaModel* model, const OgaGeneratorParams* generator_params, OgaSequences** out); -``` - -### Create generator params +### Create Generator -Creates a OgaGeneratorParams from the given model. +Creates a generator from the given model and generator params. #### Parameters -* Input: model The model to use for generation. -* Output: out The created generator params. + * Input: model The model to use for generation. + * Input: params The parameters to use for generation. + * Output: out The created generator. #### Returns - -OgaResult containing the error message if the generator params creation failed. +`OgaResult` containing the error message if the generator creation failed. ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGeneratorParams(const OgaModel* model, OgaGeneratorParams** out); +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateGenerator(const OgaModel* model, const OgaGeneratorParams* params, OgaGenerator** out); ``` -### Destroy generator params +### Destroy generator -Destroys the given generator params. +Destroys the given generator. #### Parameters -* Input: generator_params The generator params to be destroyed. +* Input: generator The generator to be destroyed. #### Returns `void` ```c -OGA_EXPORT void OGA_API_CALL OgaDestroyGeneratorParams(OgaGeneratorParams* generator_params); +OGA_EXPORT void OGA_API_CALL OgaDestroyGenerator(OgaGenerator* generator); ``` ### Check if generation has completed @@ -563,11 +533,6 @@ OGA_EXPORT const void* OGA_API_CALL OgaBufferGetData(const OgaBuffer*); ### Create sequences - -#### Parameters - -#### Returns - ```c OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateSequences(OgaSequences** out); ``` @@ -632,4 +597,3 @@ The pointer to the sequence data at the given index. 
The pointer is valid until ```c OGA_EXPORT const int32_t* OGA_API_CALL OgaSequencesGetSequenceData(const OgaSequences* sequences, size_t sequence_index); ``` - From 308c58f1bdc39b738eedbeb56ea7daaf5f38ecbe Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 14:07:07 -0800 Subject: [PATCH 25/44] Update model builder doc --- docs/genai/howto/build-model.md | 53 ++++++++++++++++----------------- 1 file changed, 25 insertions(+), 28 deletions(-) diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index e238141a98afc..5fd3d1307ad7f 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -35,58 +35,58 @@ python3 -m onnxruntime_genai.models.builder --help python3 builder.py --help ``` -### Original Model From Hugging Face - +### Original PyTorch Model from Hugging Face This scenario is where your PyTorch model is not downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). - ``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files # From source: -python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files ``` -### Original Model From Disk - +### Original PyTorch Model from Disk This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). - ``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved # From source: -python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved ``` -### Customized or Finetuned Model - +### Customized or Finetuned PyTorch Model This scenario is where your PyTorch model has been customized or finetuned for one of the currently supported model architectures and your model can be loaded in Hugging Face. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider + +# From source: +python3 builder.py -i path_to_local_folder_on_disk -o path_to_output_folder -p precision -e execution_provider +``` +### GGUF Model +This scenario is where your float16/float32 GGUF model is already on disk. 
``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m path_to_local_folder_on_disk -o /path/to/output/folder -p precision -e execution_provider +python3 -m onnxruntime_genai.models.builder -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files # From source: -python3 builder.py -m path_to_local_folder_on_disk -o /path/to/output/folder -p precision -e execution_provider +python3 builder.py -m model_name -i path_to_gguf_file -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files ``` ### Extra Options - This scenario is for when you want to have control over some specific settings. The below example shows how you can pass key-value arguments to `--extra_options`. - ``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files --extra_options filename=decoder.onnx +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx # From source: -python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_to_save_hf_files --extra_options filename=decoder.onnx +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options filename=decoder.onnx ``` - To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above. ### Unit Testing Models - This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it. ``` @@ -103,15 +103,13 @@ tokenizer.save_pretrained(cache_dir) ``` #### Option 1: Use the model builder tool directly - This option is the simplest but it will download another copy of the PyTorch model onto disk to accommodate the change in the number of hidden layers. 
- ``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider --extra_options num_hidden_layers=4 +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4 # From source: -python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider --extra_options num_hidden_layers=4 +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider --extra_options num_hidden_layers=4 ``` #### Option 2: Edit the config.json file on disk and then run the model builder tool @@ -122,9 +120,8 @@ python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execu ``` # From wheel: -python3 -m onnxruntime_genai.models.builder -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved # From source: -python3 builder.py -m model_name -o /path/to/output/folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved -``` - +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_where_hf_files_are_saved +``` \ No newline at end of file From 09c7647dcb1db1483004bbc02a25c8281e737b2e Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 14:14:15 -0800 Subject: [PATCH 26/44] Add subject to change note --- docs/genai/api/csharp.md | 3 +++ docs/genai/api/python.md | 5 ++++- docs/genai/reference/config.md | 3 +++ docs/genai/reference/index.md | 4 +++- 4 files changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index 9008c43bb393b..b5cd486c6bcb0 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -8,6 +8,9 @@ nav_order: 2 --- # ONNX Runtime GenAI C# API + +_Note: this API is in preview and is subject to change._ + {: .no_toc } * TOC placeholder diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index 6d842dd7d847b..b4d1490f6f591 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -8,6 +8,9 @@ nav_order: 1 --- # Python API + +_Note: this API is in preview and is subject to change._ + {: .no_toc } * TOC placeholder @@ -186,7 +189,7 @@ onnxruntime_genai.TokenizerStream.decode(token: int32) -> str `str`: If a displayable string has accumulated, this method returns it. If not, this method returns the empty string. 
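To make the streaming behaviour concrete, here is a minimal sketch of a token-by-token output loop built around the `TokenizerStream` described above. Only `decode(token) -> str` is taken from this reference; the `create_stream()` factory, the model folder, and the hard-coded token ids are assumptions for illustration.

```python
import onnxruntime_genai as og

model = og.Model("example-models/phi2-int4-cpu")  # folder name from the phi-2 tutorial
tokenizer = model.create_tokenizer()
stream = tokenizer.create_stream()                # assumed factory for a TokenizerStream

for token in [464, 2068, 7586, 21831]:            # stand-in token ids
    # decode() buffers tokens until a displayable string has accumulated,
    # so some iterations return the empty string and print nothing.
    chunk = stream.decode(token)
    if chunk:
        print(chunk, end="", flush=True)
```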
-## Generator Params class +## GeneratorParams class ### Create a Generator Params diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 7a7478fb77c48..b52d080f83611 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -9,6 +9,9 @@ nav_order: 1 # Configuration reference +_Note: this API is in preview and is subject to change._ + + ## Example file for phi-2 ``` diff --git a/docs/genai/reference/index.md b/docs/genai/reference/index.md index 5219e29015065..b418bfbf1966a 100644 --- a/docs/genai/reference/index.md +++ b/docs/genai/reference/index.md @@ -4,4 +4,6 @@ description: Reference information for ONNX Runtime Generative AI parent: Generative AI (Preview) has_children: true nav_order: 1 ---- \ No newline at end of file +--- + +_Note: this API is in preview and is subject to change._ From b0ee1e0e42a6a2fc9c05af0bd300eb61e14fcb23 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 14:15:30 -0800 Subject: [PATCH 27/44] Move config reference down --- docs/genai/reference/config.md | 2 +- docs/genai/reference/index.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index b52d080f83611..28cb146de5a7c 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -1,5 +1,5 @@ --- -title: Configuration reference +title: Config reference description: Reference for the ONNX Runtime Generative AI configuration file has_children: false parent: Reference diff --git a/docs/genai/reference/index.md b/docs/genai/reference/index.md index b418bfbf1966a..f34d266dedbaf 100644 --- a/docs/genai/reference/index.md +++ b/docs/genai/reference/index.md @@ -3,7 +3,7 @@ title: Reference description: Reference information for ONNX Runtime Generative AI parent: Generative AI (Preview) has_children: true -nav_order: 1 +nav_order: 4 --- _Note: this API is in preview and is subject to change._ From 45981c4aad40505b295ed0c0944ff8cb16ea0557 Mon Sep 17 00:00:00 2001 From: natke Date: Wed, 21 Feb 2024 14:16:10 -0800 Subject: [PATCH 28/44] Highlight json --- docs/genai/reference/config.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 28cb146de5a7c..5a9b15b4258ff 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -14,7 +14,7 @@ _Note: this API is in preview and is subject to change._ ## Example file for phi-2 -``` +```json { "model": { "bos_token_id": 50256, From 72f5cdcd49a1547cdb44ca0ec37fe71d925e0999 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Fri, 1 Mar 2024 21:28:52 -0800 Subject: [PATCH 29/44] Fix broken links --- docs/genai/tutorials/phi2-python.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md index 67b785676049b..faec3cc98e66e 100644 --- a/docs/genai/tutorials/phi2-python.md +++ b/docs/genai/tutorials/phi2-python.md @@ -11,11 +11,11 @@ nav_order: 1 ## Setup and installation -Install the ONNX Runtime GenAI Python package using the [installation instructions](../install.md). +Install the ONNX Runtime GenAI Python package using the [installation instructions](../howto/install.md). ## Build phi-2 ONNX model -The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to load locally stored weights, or convert from GGUF format. 
For more details, see [how to build models](../how-to/build-models.md)
+The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-models.md)

If using the `-m` option shown here, which downloads from HuggingFace, you will need to login into HuggingFace.

From 46ca857007c172c351bf18e433066cc30a44f5d1 Mon Sep 17 00:00:00 2001
From: Nat Kershaw
Date: Mon, 11 Mar 2024 10:29:53 -0700
Subject: [PATCH 30/44] Fix list

---
 docs/genai/tutorials/phi2-python.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md
index faec3cc98e66e..b368d0695f3df 100644
--- a/docs/genai/tutorials/phi2-python.md
+++ b/docs/genai/tutorials/phi2-python.md
@@ -33,9 +33,9 @@ You can replace the name of the output folder specified with the `-o` option wit

 After you run the script, you will see a series of files generated in this folder. They include the HuggingFace configs for your reference, as well as the following generated files used by ONNX Runtime GenAI.

-`model.onnx`: the phi-2 ONNX model
-`model.onnx.data`: the phi-2 ONNX model weights
-`genai_config.json`: the configuration used by ONNX Runtime GenAI
+- `model.onnx`: the phi-2 ONNX model
+- `model.onnx.data`: the phi-2 ONNX model weights
+- `genai_config.json`: the configuration used by ONNX Runtime GenAI

 You can view and change the values in the `genai_config.json` file. The model section should not be updated unless you have brought your own model and it has different parameters.


From 50016b428ba292e5385f08ffc99e426a1d3072c4 Mon Sep 17 00:00:00 2001
From: Nat Kershaw
Date: Mon, 11 Mar 2024 16:18:04 -0700
Subject: [PATCH 31/44] Update install instructions with RCs

---
 docs/genai/howto/install.md | 64 +++++++++++++------------------------
 1 file changed, 22 insertions(+), 42 deletions(-)

diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md
index 97adaee42c9e5..15e8445ea2809 100644
--- a/docs/genai/howto/install.md
+++ b/docs/genai/howto/install.md
@@ -13,63 +13,43 @@ nav_order: 1
 * TOC placeholder
 {:toc}

-## Python package
+## Python package release candidates

-(Coming soon) `pip install onnxruntime-genai`
+```bash
+pip install onnxruntime-genai --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
+```

-(Temporary)
-1. Build from source
-
-   Follow the [build from source](./build-from-source.md) instructions. 
+ - Change the Name to `onnxruntime-genai` + - Change the Source to `https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/nuget/v3/index.json` -2. Build nuget package +3. Check the `Include prerelease` button - ```cmd - nuget.exe pack Microsoft.ML.OnnxRuntimeGenAI.nuspec -Prop version=0.1.0 -Prop id="Microsoft.ML.OnnxRuntimeGenAI.Gpu" - ``` +4. Add the `Microsoft.ML.OnnxRuntimeGenAI` package -3. Install the nuget package +5. Add the `Microsoft.ML.OnnxRuntime` package - ```cmd - dotnet add package .. local instructions - ``` +To run with CUDA, use the following packages instead: +- `Microsoft.ML.OnnxRuntimeGenAI.Cuda` +- `Microsoft.ML.OnnxRuntime.Gpu` -## C artifacts - -(Coming soon) Download release archive - -Unzip archive - -(Temporary) -1. Build from source - - Follow the [build from source](build-from-source.md) instructions. - - -2. Use the following include locations to build your C application - - * - -3. Use the following library locations to build your C application - - * From 4f8c9af7bc42142ca8423413e7e67abc37a1921d Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Mon, 11 Mar 2024 16:52:41 -0700 Subject: [PATCH 32/44] Fix Build ONNX Runtime nav --- docs/build/index.md | 2 +- docs/genai/howto/install.md | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/build/index.md b/docs/build/index.md index 5a1719bc317a7..be906d8b9cfb1 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -1,5 +1,5 @@ --- -title: Build from source +title: Build ONNX Runtime has_children: true nav_order: 5 redirect_from: /docs/how-to/build diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 15e8445ea2809..f37151ed2374b 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -16,6 +16,7 @@ nav_order: 1 ## Python package release candidates ```bash +pip install numpy pip install onnxruntime-genai --pre --index-url= https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/` ``` From a6b5a84905aa4f50c3d4dfc26f1b2c52455b151b Mon Sep 17 00:00:00 2001 From: "Nat Kershaw (MSFT)" Date: Tue, 12 Mar 2024 15:03:15 -0700 Subject: [PATCH 33/44] Update docs/genai/tutorials/phi2-python.md Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> --- docs/genai/tutorials/phi2-python.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md index b368d0695f3df..2087edcceb487 100644 --- a/docs/genai/tutorials/phi2-python.md +++ b/docs/genai/tutorials/phi2-python.md @@ -15,7 +15,7 @@ Install the ONNX Runtime GenAI Python package using the [installation instructio ## Build phi-2 ONNX model -The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-models.md) +The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to download the weights from Hugging Face, load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-models.md) If using the `-m` option shown here, which downloads from HuggingFace, you will need to login into HuggingFace. 
From 6b9585108428d1c048f34fdca43470fc89cdcbbb Mon Sep 17 00:00:00 2001 From: "Nat Kershaw (MSFT)" Date: Tue, 12 Mar 2024 15:03:23 -0700 Subject: [PATCH 34/44] Update docs/genai/tutorials/phi2-python.md Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> --- docs/genai/tutorials/phi2-python.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md index 2087edcceb487..49d78c88b5faf 100644 --- a/docs/genai/tutorials/phi2-python.md +++ b/docs/genai/tutorials/phi2-python.md @@ -17,7 +17,7 @@ Install the ONNX Runtime GenAI Python package using the [installation instructio The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to download the weights from Hugging Face, load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-models.md) -If using the `-m` option shown here, which downloads from HuggingFace, you will need to login into HuggingFace. +If using the `-m` option shown here, you will need to login into Hugging Face. ```bash pip install huggingface-hub` From c3657d89953a10e1476f02efd3ee16c5cbb29b44 Mon Sep 17 00:00:00 2001 From: Maanav Dalal Date: Tue, 12 Mar 2024 17:13:39 -0700 Subject: [PATCH 35/44] Update phi2-python.md with fixed typo --- docs/genai/tutorials/phi2-python.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md index 49d78c88b5faf..dce7150643af2 100644 --- a/docs/genai/tutorials/phi2-python.md +++ b/docs/genai/tutorials/phi2-python.md @@ -15,7 +15,7 @@ Install the ONNX Runtime GenAI Python package using the [installation instructio ## Build phi-2 ONNX model -The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to download the weights from Hugging Face, load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-models.md) +The onnxruntime-genai package contains a model builder that generates the phi-2 ONNX model using the weights and config on Huggingface. The tools also allows you to download the weights from Hugging Face, load locally stored weights, or convert from GGUF format. For more details, see [how to build models](../howto/build-model.md) If using the `-m` option shown here, you will need to login into Hugging Face. From 42080cf345dc8ecab6cff84df0e1b0a3656bdc95 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Wed, 13 Mar 2024 19:45:14 -0700 Subject: [PATCH 36/44] Update config and remove device type from API docs. --- docs/genai/api/c.md | 11 +--- docs/genai/api/csharp.md | 2 +- docs/genai/api/python.md | 2 +- docs/genai/howto/build-model.md | 12 +++++ docs/genai/reference/config.md | 83 ++++++++++++++++++++++++++++- docs/genai/tutorials/phi2-python.md | 2 +- 6 files changed, 97 insertions(+), 15 deletions(-) diff --git a/docs/genai/api/c.md b/docs/genai/api/c.md index 2bbff6cde39cc..63f1dfe801e79 100644 --- a/docs/genai/api/c.md +++ b/docs/genai/api/c.md @@ -27,14 +27,13 @@ Creates a model from the given configuration directory and device type. #### Parameters * Input: config_path The path to the model configuration directory. The path is expected to be encoded in UTF-8. - * Input: device_type The device type to use for the model. 
* Output: out The created model. #### Returns `OgaResult` containing the error message if the model creation failed. ```c -OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaDeviceType device_type, OgaModel** out); +OGA_EXPORT OgaResult* OGA_API_CALL OgaCreateModel(const char* config_path, OgaModel** out); ``` ### Destroy model @@ -401,14 +400,6 @@ OGA_EXPORT const int32_t* OGA_API_CALL OgaGenerator_GetSequence(const OgaGenerat ## Enums and structs -```c -typedef enum OgaDeviceType { - OgaDeviceTypeAuto, - OgaDeviceTypeCPU, - OgaDeviceTypeCUDA, -} OgaDeviceType; -``` - ```c typedef enum OgaDataType { OgaDataType_int32, diff --git a/docs/genai/api/csharp.md b/docs/genai/api/csharp.md index b5cd486c6bcb0..86b566f451cc2 100644 --- a/docs/genai/api/csharp.md +++ b/docs/genai/api/csharp.md @@ -23,7 +23,7 @@ _Note: this API is in preview and is subject to change._ ### Constructor ```csharp -public Model(string modelPath, DeviceType deviceType) +public Model(string modelPath) ``` ### Generate method diff --git a/docs/genai/api/python.md b/docs/genai/api/python.md index b4d1490f6f591..52adeac3cba69 100644 --- a/docs/genai/api/python.md +++ b/docs/genai/api/python.md @@ -35,7 +35,7 @@ import onnxruntime_genai Loads the ONNX model(s) and configuration from a folder on disk. ```python -onnxruntime_genai.Model(model_folder: str, device: onnxruntime_genai.DeviceType) -> onnxruntime_genai.Model +onnxruntime_genai.Model(model_folder: str) -> onnxruntime_genai.Model ``` #### Parameters diff --git a/docs/genai/howto/build-model.md b/docs/genai/howto/build-model.md index 5fd3d1307ad7f..408710d34ed61 100644 --- a/docs/genai/howto/build-model.md +++ b/docs/genai/howto/build-model.md @@ -86,6 +86,18 @@ python3 builder.py -m model_name -o path_to_output_folder -p precision -e execut ``` To see all available options through `--extra_options`, please use the `help` commands in the `Full Usage` section above. +### Config Only +This scenario is for when you already have your optimized and/or quantized ONNX model and you need to create the config files to run with ONNX Runtime GenAI. +``` +# From wheel: +python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true + +# From source: +python3 builder.py -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_for_hf_files --extra_options config_only=true +``` + +Afterwards, please open the `genai_config.json` file in the output folder and modify the fields as needed for your model. You should store your ONNX model in the output folder as well. + ### Unit Testing Models This scenario is where your PyTorch model is already downloaded locally (either in the default Hugging Face cache directory or in a local folder on disk). If it is not already downloaded locally, here is an example of how you can download it. diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 5a9b15b4258ff..6c82d4903945e 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -11,6 +11,8 @@ nav_order: 1 _Note: this API is in preview and is subject to change._ +A configuration file called genai_config.json is generated automatically if the model is generated with the model builder. If you provide your own model, you can copy the example below and modify it for your scenario. 
+ ## Example file for phi-2 @@ -20,6 +22,14 @@ _Note: this API is in preview and is subject to change._ "bos_token_id": 50256, "context_length": 2048, "decoder": { + "session_options": { + "log_id": "onnxruntime-genai", + "provider_options": [ + { + "cuda": {} + } + ] + }, "filename": "model.onnx", "head_size": 80, "hidden_size": 2560, @@ -46,16 +56,85 @@ _Note: this API is in preview and is subject to change._ }, "search": { "diversity_penalty": 0.0, + "do_sample": false, + "early_stopping": true, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "no_repeat_ngram_size": 0, "num_beams": 1, "num_return_sequences": 1, + "past_present_share_buffer": true, "repetition_penalty": 1.0, - "temperature": 0.7, + "temperature": 1.0, "top_k": 50, - "top_p": 0.6 + "top_p": 1.0 } } ``` + +## Configuration + +### Model section + +#### General model config + +* _type_: The type of model. Can be phi, llama or gpt. + +* _vocab_size_: The size of the vocabulary that the model processes ie the number of tokens in the vocabulary. + +* _bos_token_id_: The id of the beginning of sequence token. + +* _eos_token_id_: The id of the end of sequence token. + +* _pad_token_: The id of the padding token. + +* _context_length_: The maxinum length of sequence that the model can process. + +* _pad_token_: The id of the padding token. + +#### Session options + +These are the options that are passed to ONNX Runtime, which runs the model on each token generation iteration. + +* _provider_options_: a priortized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item. + +* _log_id_: a prefix to output when logging + + +Then For each model in the pipeline there is one section, named by the model. + +#### Decoder model config + +* _filename_: The name of the model file. + +* _inputs_: The names of each of the inputs. Sequences of model inputs can contain a wildcard representing the index in the sequence. + +* _outputs_: The names of each of the outputs. + +* _num_attention_heads: The number of attention heads in the model. + +* _head_size_: The size of the attention heads. + +* _hidden_size_: The size of the hidden layers. + +* _num_key_value_heads_: The number of key value heads. + + +### Search section + +* _max_length_: The maximum length that the model will generate. +* _min_length_: The minimum length that the model will generate. +* _do_sample_: +* _num_beams_: The number of beams to apply when generating the output sequence using beam search. +* _num_sequences_: The number of sequences to generate. Returns the sequences with the highest scores in order. +words are repeated. 
+* _temperature_: +* _top_k_: +* _top_p_: +* _early_stopping_ : +* _repetition_penalty_: The penalty to apply when +* _length_penalty_: +* _diversity_penalty_: +* _no_repeat_ngram_size_: +* _past_present_share_buffer_: diff --git a/docs/genai/tutorials/phi2-python.md b/docs/genai/tutorials/phi2-python.md index b368d0695f3df..f245fe9ee5e50 100644 --- a/docs/genai/tutorials/phi2-python.md +++ b/docs/genai/tutorials/phi2-python.md @@ -53,7 +53,7 @@ prompt = '''def print_prime(n): Print all primes between 1 and n """''' -model=og.Model(f'example-models/phi2-int4-cpu', og.DeviceType.CPU) +model=og.Model(f'example-models/phi2-int4-cpu') tokenizer = model.create_tokenizer() From 31cd6653617fdb2c525f982e5a26ba3067eda5d9 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 11:17:35 -0700 Subject: [PATCH 37/44] Updated config definitions --- docs/genai/reference/config.md | 30 +++++++++++++++++++++--------- 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 6c82d4903945e..42d88f1b4f687 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -124,17 +124,29 @@ Then For each model in the pipeline there is one section, named by the model. ### Search section * _max_length_: The maximum length that the model will generate. + * _min_length_: The minimum length that the model will generate. + * _do_sample_: -* _num_beams_: The number of beams to apply when generating the output sequence using beam search. + +* _num_beams_: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. + +* _early_stopping_ : Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false. + * _num_sequences_: The number of sequences to generate. Returns the sequences with the highest scores in order. -words are repeated. -* _temperature_: -* _top_k_: -* _top_p_: -* _early_stopping_ : -* _repetition_penalty_: The penalty to apply when -* _length_penalty_: + +* _temperature_: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. + +* _top_k_: Only includes tokens that do fall within the list of the `K` most probable tokens. + +* _top_p_: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. + +* _repetition_penalty_: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`. + +* _length_penalty_: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. + * _diversity_penalty_: + * _no_repeat_ngram_size_: -* _past_present_share_buffer_: + +* _past_present_share_buffer_: If set to true, the past and present buffer are shared for efficiency. 
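Because these search values live in `genai_config.json`, they can be tuned without touching any code. A minimal sketch using only the standard library, assuming the output folder produced by the model builder:

```python
import json
from pathlib import Path

config_path = Path("example-models/phi2-int4-cpu") / "genai_config.json"
config = json.loads(config_path.read_text())

search = config["search"]
search["max_length"] = 256    # maximum length the model will generate
search["temperature"] = 0.7   # scales token probabilities
search["top_k"] = 40          # keep only the 40 most probable tokens
search["top_p"] = 0.9         # nucleus (top P) probability mass

config_path.write_text(json.dumps(config, indent=4))
```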
From a74585fe3be714a2e26f18be5d3178f8fd43ac64 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 11:21:55 -0700 Subject: [PATCH 38/44] Remove duplicate pad token definition --- docs/genai/reference/config.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 42d88f1b4f687..95b143c84489e 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -91,7 +91,6 @@ A configuration file called genai_config.json is generated automatically if the * _context_length_: The maxinum length of sequence that the model can process. -* _pad_token_: The id of the padding token. #### Session options From d8d4fb6cbc0b3369756f9884085549219e9d8a85 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 11:30:41 -0700 Subject: [PATCH 39/44] Bolden config items --- docs/genai/reference/config.md | 60 +++++++++++++++++----------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 95b143c84489e..19ee41fbf7bc4 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -79,73 +79,73 @@ A configuration file called genai_config.json is generated automatically if the #### General model config -* _type_: The type of model. Can be phi, llama or gpt. +* **_type_**: The type of model. Can be phi, llama or gpt. -* _vocab_size_: The size of the vocabulary that the model processes ie the number of tokens in the vocabulary. +* **_vocab_size_**: The size of the vocabulary that the model processes ie the number of tokens in the vocabulary. -* _bos_token_id_: The id of the beginning of sequence token. +* **_bos_token_id_**: The id of the beginning of sequence token. -* _eos_token_id_: The id of the end of sequence token. +* **_eos_token_id_**: The id of the end of sequence token. -* _pad_token_: The id of the padding token. +* **_pad_token_**: The id of the padding token. -* _context_length_: The maxinum length of sequence that the model can process. +* **_context_length_**: The maxinum length of sequence that the model can process. #### Session options These are the options that are passed to ONNX Runtime, which runs the model on each token generation iteration. -* _provider_options_: a priortized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item. +* **_provider_options_**: a priortized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item. -* _log_id_: a prefix to output when logging +* **_log_id_**: a prefix to output when logging Then For each model in the pipeline there is one section, named by the model. #### Decoder model config -* _filename_: The name of the model file. +* **_filename_**: The name of the model file. -* _inputs_: The names of each of the inputs. Sequences of model inputs can contain a wildcard representing the index in the sequence. +* **_inputs_**: The names of each of the inputs. Sequences of model inputs can contain a wildcard representing the index in the sequence. -* _outputs_: The names of each of the outputs. +* **_outputs_**: The names of each of the outputs. -* _num_attention_heads: The number of attention heads in the model. 
+* **_num_attention_heads_**: The number of attention heads in the model. -* _head_size_: The size of the attention heads. +* **_head_size_**: The size of the attention heads. -* _hidden_size_: The size of the hidden layers. +* **_hidden_size_**: The size of the hidden layers. -* _num_key_value_heads_: The number of key value heads. +* **_num_key_value_heads_**: The number of key value heads. -### Search section +### Generation search section -* _max_length_: The maximum length that the model will generate. +* **_max_length_**: The maximum length that the model will generate. -* _min_length_: The minimum length that the model will generate. +* **_min_length_**: The minimum length that the model will generate. -* _do_sample_: +* **_do_sample_**: -* _num_beams_: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. +* **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. -* _early_stopping_ : Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false. +* **_early_stopping_**: Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false. -* _num_sequences_: The number of sequences to generate. Returns the sequences with the highest scores in order. +* **_num_sequences_**: The number of sequences to generate. Returns the sequences with the highest scores in order. -* _temperature_: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. +* **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. -* _top_k_: Only includes tokens that do fall within the list of the `K` most probable tokens. +* **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. -* _top_p_: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. +* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. -* _repetition_penalty_: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`. +* **_repetition_penalty_**: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`. -* _length_penalty_: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. +* **_length_penalty_**: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. -* _diversity_penalty_: +* **_diversity_penalty_**: -* _no_repeat_ngram_size_: +* **_no_repeat_ngram_size_**: -* _past_present_share_buffer_: If set to true, the past and present buffer are shared for efficiency. 
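The decoder fields above are related to one another: in the phi-2 example, `head_size` (80) multiplied by the number of attention heads (32) equals `hidden_size` (2560). The sketch below checks that common decoder-only convention; the file path is an assumption, and the relation is a convention rather than something the runtime enforces.

```python
import json

with open("example-models/phi2-int4-cpu/genai_config.json") as f:  # assumed path
    decoder = json.load(f)["model"]["decoder"]

product = decoder["head_size"] * decoder["num_attention_heads"]
if product != decoder["hidden_size"]:
    print(f"warning: head_size * num_attention_heads = {product}, "
          f"hidden_size = {decoder['hidden_size']}")
else:
    print("decoder dimensions are consistent")
```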
+* **_past_present_share_buffer_**: If set to true, the past and present buffer are shared for efficiency. From debc89b71265dd16b06ed036062f8d6f59623914 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 13:19:38 -0700 Subject: [PATCH 40/44] Add more definitions --- docs/genai/reference/config.md | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 19ee41fbf7bc4..0f13fc84cdfda 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -126,7 +126,7 @@ Then For each model in the pipeline there is one section, named by the model. * **_min_length_**: The minimum length that the model will generate. -* **_do_sample_**: +* **_do_sample_**: Enables Top P / Top K generation. * **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. @@ -136,16 +136,33 @@ Then For each model in the pipeline there is one section, named by the model. * **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. -* **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. +* **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. Range is 1 to the vocabulary size. -* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. +* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. Range is 0 < top P <= 1. * **_repetition_penalty_**: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`. * **_length_penalty_**: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. -* **_diversity_penalty_**: +* **_diversity_penalty_**: Not implemented. -* **_no_repeat_ngram_size_**: +* **_no_repeat_ngram_size_**: Not implemented. -* **_past_present_share_buffer_**: If set to true, the past and present buffer are shared for efficiency. +* **_past_present_share_buffer_**: If set to true, the past and present buffer are shared for efficiency. + +## Search combinations + +1. Beam search + + - num beams > 1 + - do_sample = False + +2. Greedy search + + - num_beams = 1 + - do_sample = False + +3. 
Top P / Top K + + - do_sample = True + \ No newline at end of file From 354d211228ca28be234518885fe06cb68791e6b3 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 14:30:17 -0700 Subject: [PATCH 41/44] Add summary section for different search and sampling --- docs/genai/reference/config.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 0f13fc84cdfda..8cc6fc1c71026 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -11,8 +11,12 @@ nav_order: 1 _Note: this API is in preview and is subject to change._ -A configuration file called genai_config.json is generated automatically if the model is generated with the model builder. If you provide your own model, you can copy the example below and modify it for your scenario. +A configuration file called genai_config.json is generated automatically if the model is generated with the model builder. If you provide your own model, you can copy the example below and modify it for your scenario. +{: .no_toc } + +* TOC placeholder +{:toc} ## Example file for phi-2 @@ -126,20 +130,20 @@ Then For each model in the pipeline there is one section, named by the model. * **_min_length_**: The minimum length that the model will generate. -* **_do_sample_**: Enables Top P / Top K generation. +* **_do_sample_**: Enables Top P / Top K generation. When set to true, generation uses the top P and top K values. When set to false, generation uses beam search or greedy search. -* **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. +* **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. If num_beans > 1, then generation is performed using beam search. * **_early_stopping_**: Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false. * **_num_sequences_**: The number of sequences to generate. Returns the sequences with the highest scores in order. -* **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. - * **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. Range is 1 to the vocabulary size. * **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. Range is 0 < top P <= 1. +* **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. + * **_repetition_penalty_**: Discounts the scores of previously generated tokens if set to a value greater than `1`. Defaults to `1`. * **_length_penalty_**: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. 
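The three combinations above can be wrapped in a small helper so callers select a mode instead of remembering flag pairings. This is a sketch only: the keys mirror the `search` section of `genai_config.json`, and the sampling defaults are illustrative.

```python
def search_options(mode: str, num_beams: int = 4) -> dict:
    """Return search settings for one of the documented combinations."""
    if mode == "beam":      # beam search: more than one beam, no sampling
        return {"num_beams": num_beams, "do_sample": False}
    if mode == "greedy":    # greedy search: a single beam, no sampling
        return {"num_beams": 1, "do_sample": False}
    if mode == "sample":    # Top P / Top K sampling
        return {"do_sample": True, "top_k": 50, "top_p": 0.9}
    raise ValueError(f"unknown search mode: {mode}")

print(search_options("beam"))    # {'num_beams': 4, 'do_sample': False}
```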
From 35b6400b1626362679febdcd7810c8726d730193 Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 14:42:27 -0700 Subject: [PATCH 42/44] Spell check --- docs/genai/reference/config.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 8cc6fc1c71026..c8d7e34fcf4e4 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -93,19 +93,19 @@ A configuration file called genai_config.json is generated automatically if the * **_pad_token_**: The id of the padding token. -* **_context_length_**: The maxinum length of sequence that the model can process. +* **_context_length_**: The maximum length of sequence that the model can process. #### Session options These are the options that are passed to ONNX Runtime, which runs the model on each token generation iteration. -* **_provider_options_**: a priortized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item. +* **_provider_options_**: a prioritized list of execution targets on which to run the model. If running on CPU, this option is not present. A list of execution provider specific configurations can be specified inside the provider item. -* **_log_id_**: a prefix to output when logging +* **_log_id_**: a prefix to output when logging. -Then For each model in the pipeline there is one section, named by the model. +Then for each model in the pipeline there is one section, named by the model. #### Decoder model config @@ -130,7 +130,7 @@ Then For each model in the pipeline there is one section, named by the model. * **_min_length_**: The minimum length that the model will generate. -* **_do_sample_**: Enables Top P / Top K generation. When set to true, generation uses the top P and top K values. When set to false, generation uses beam search or greedy search. +* **_do_sample_**: Enables Top P / Top K generation. When set to true, generation uses the configured `top_p` and `top_k` values. When set to false, generation uses beam search or greedy search. * **_num_beams_**: The number of beams to apply when generating the output sequence using beam search. If num_beams=1, then generation is performed using greedy search. If num_beans > 1, then generation is performed using beam search. @@ -140,7 +140,7 @@ Then For each model in the pipeline there is one section, named by the model. * **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. Range is 1 to the vocabulary size. -* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. Range is 0 < top P <= 1. +* **_top_p_**: Only includes the most probable tokens with probabilities that add up to `P` or higher. Defaults to `1`, which includes all of the tokens. Range is 0 to 1, exclusive of 0. * **_temperature_**: The temperature value scales the probability of each token so that probable tokens become more likely while less probable ones become less likely. This value can have a range 0 < `temperature` ≤ 1. When temperature is equal to `1`, it has no effect. 
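To make the `temperature`, `top_k` and `top_p` definitions concrete, the self-contained sketch below applies them to a toy distribution. It mirrors the documented semantics only and is not the runtime's implementation.

```python
import math

def apply_options(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature scales the logits; values below 1 sharpen the distribution.
    scaled = [l / temperature for l in logits]
    total = sum(math.exp(l) for l in scaled)
    probs = [math.exp(l) / total for l in scaled]

    # Rank token indices from most to least probable.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]             # the K most probable tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for i in keep:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:           # smallest set whose mass reaches P
                break
        keep = kept
    mass = sum(probs[i] for i in keep)  # renormalize over the survivors
    return {i: probs[i] / mass for i in keep}

print(apply_options([2.0, 1.0, 0.5, 0.1], temperature=0.7, top_k=3, top_p=0.9))
```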
From 2219f02dd025842c7333124d6efc239b588a48eb Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 14:53:13 -0700 Subject: [PATCH 43/44] num_sequences -> num_return_sequences --- docs/genai/reference/config.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index c8d7e34fcf4e4..20b8673701e0a 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -136,7 +136,7 @@ Then for each model in the pipeline there is one section, named by the model. * **_early_stopping_**: Whether to stop the beam search when at least num_beams sentences are finished per batch or not. Defaults to false. -* **_num_sequences_**: The number of sequences to generate. Returns the sequences with the highest scores in order. +* **_num_return_sequences_**: The number of sequences to generate. Returns the sequences with the highest scores in order. * **_top_k_**: Only includes tokens that do fall within the list of the `K` most probable tokens. Range is 1 to the vocabulary size. From d3bc3d37d7f5ca53918cec73789f8115e7997e0a Mon Sep 17 00:00:00 2001 From: Nat Kershaw Date: Thu, 14 Mar 2024 14:55:26 -0700 Subject: [PATCH 44/44] Not implemented -> not supported --- docs/genai/reference/config.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/genai/reference/config.md b/docs/genai/reference/config.md index 20b8673701e0a..ce3dd138b8eeb 100644 --- a/docs/genai/reference/config.md +++ b/docs/genai/reference/config.md @@ -148,9 +148,9 @@ Then for each model in the pipeline there is one section, named by the model. * **_length_penalty_**: Controls the length of the output generated. Value less than `1` encourages the generation to produce shorter sequences. Values greater than `1` encourages longer sequences. Defaults to `1`. -* **_diversity_penalty_**: Not implemented. +* **_diversity_penalty_**: Not supported. -* **_no_repeat_ngram_size_**: Not implemented. +* **_no_repeat_ngram_size_**: Not supported. * **_past_present_share_buffer_**: If set to true, the past and present buffer are shared for efficiency.
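Pulling the series together, an end-to-end run might look like the sketch below. The model folder, `og.Model` and `create_tokenizer` come from the phi-2 tutorial; `encode`, `decode`, `input_ids`, `set_search_options` and the high-level `generate()` call are assumptions based on the API pages, and like everything here they are preview APIs subject to change.

```python
import onnxruntime_genai as og

model = og.Model("example-models/phi2-int4-cpu")   # builder output, per the tutorial
tokenizer = model.create_tokenizer()

prompt = '''def print_prime(n):
    """
    Print all primes between 1 and n
    """'''

tokens = tokenizer.encode(prompt)                  # assumed encode helper

params = og.GeneratorParams(model)                 # GeneratorParams per the Python API page
params.input_ids = tokens                          # assumed attribute name
params.set_search_options({"max_length": 200})     # assumed setter; keys per the config reference

output_tokens = model.generate(params)             # high-level generate() per the overview
print(tokenizer.decode(output_tokens))             # assumed decode helper
```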