Chatbot with Gradio, FastApi Endpoint, Langchain Integration (#1246)
* add a background server for RequestManager
* .
* make incr_decoding work
* make spec_infer work
* format
* update python inference
* fix python issues
* bug fix
* add a Legion future to capture the termination of the background server
* gradio finished
* chatbot gradio version 2
* chainlit1
* chainlit2
* fastapi done
* fastapi incr_decoding
* langchain example & wrapper class
* langchain example & wrapper class1
* added documentation
* entrypoint
* del apikey
* delete extra files
* rag search fixed some bugs
* fixed rag search issues
* updates before rebase
* minor changes
* reorganize files
* Add thread safety for background server.
* Simplify backend server design.
* resolve conflict.
* specinfer usecases with issues labeled
* specinfer usecases with issues labeled 2
* fixed issues with prompt template
* fix issues with rag specinfer
* Add server task timeout.
* register callbacks to terminate background worker at exit or termination
* [Python] enable decoding multiple requests
* update README.md and default configuration
* fix issues with gradio and prompt template
* fix issues with rag
* adjusted fastapi entrypoint
* update documentation
* resole conflicts
* issues fix
* adjustments on usecases and api entrypoints
* remove redundent changes
* testing CI
* Enable backtrace
* restore newlines
* version
* add back misdeleted line
* legion verion

---------

Co-authored-by: Zhihao Jia <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
1 parent d73bba1 · commit abf9fb8
Showing 23 changed files with 2,013 additions and 9 deletions.
@@ -0,0 +1,64 @@
:tocdepth: 1

********
Chatbot
********

The chatbot use case involves setting up a conversational AI model using FlexFlow Serve, capable of engaging in interactive dialogues with users.

Requirements
============

- FlexFlow Serve setup with required configurations.
- Gradio or any interactive interface tool.

Implementation
==============
1. FlexFlow Initialization
   Initialize FlexFlow Serve with the desired configurations and the specific LLM model (a minimal initialization and shutdown sketch follows this list).

2. Gradio Interface Setup
   Define a function that generates a response from the user input, then set up a Gradio Chat Interface for interaction.

   .. code-block:: python

      def generate_response(user_input):
          result = llm.generate(user_input)
          return result.output_text.decode('utf-8')

3. Running the Interface
   Launch the Gradio interface and interact with the model by entering text inputs.

   .. image:: /imgs/gradio_interface.png
      :alt: Gradio Chatbot Interface
      :align: center

4. Shutdown
   Stop the FlexFlow server after interaction.
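The snippets above assume an ``llm`` object that has already been initialized, compiled, and started. A minimal sketch of that lifecycle is shown below; the model name and configuration values are placeholders, and the parameter names follow the FlexFlow Serve Python examples, so adjust them to your setup.

.. code-block:: python

   import flexflow.serve as ff

   # Placeholder hardware configuration; adjust to your machine.
   ff.init(
       num_gpus=2,
       memory_per_gpu=14000,
       zero_copy_memory_per_node=30000,
       tensor_parallelism_degree=2,
       pipeline_parallelism_degree=1,
   )

   # Example model id; any supported model can be used here.
   llm = ff.LLM("meta-llama/Llama-2-7b-hf")

   generation_config = ff.GenerationConfig(
       do_sample=True, temperature=0.9, topp=0.8, topk=1
   )
   llm.compile(
       generation_config,
       max_requests_per_batch=1,
       max_seq_length=256,
       max_tokens_per_batch=64,
   )
   llm.start_server()

   # ... interact with the model through the Gradio interface ...

   llm.stop_server()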
Example
=======

A complete code example can be found here:

1. `Chatbot Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_incr.py>`__

2. `Chatbot Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_specinfer.py>`__

Example Implementation:
.. code-block:: python

   import gradio as gr
   import flexflow.serve as ff

   # Configure FlexFlow Serve; the remaining arguments are elided here.
   ff.init(num_gpus=2, memory_per_gpu=14000, ...)

   # 'llm' is an ff.LLM that has been compiled and started (see the sketch above).
   def generate_response(user_input):
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')

   iface = gr.ChatInterface(fn=generate_response)
   iface.launch()
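Note that ``gr.ChatInterface`` invokes its callback with both the latest message and the chat history, so depending on the Gradio version the function may need to accept a second parameter; a hedged variant:

.. code-block:: python

   # Variant assuming gr.ChatInterface passes (message, history) to the callback.
   def generate_response(message, history):
       result = llm.generate(message)
       return result.output_text.decode('utf-8')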
@@ -0,0 +1,55 @@
:tocdepth: 1

****************
Prompt Template
****************

Prompt templates guide the model's response generation. This use case demonstrates setting up FlexFlow Serve to integrate with Langchain and using prompt templates to handle dynamic prompts.

Requirements
============

- FlexFlow Serve setup with appropriate configurations.
- Langchain integration with templates for prompt management.

Implementation
==============
1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. LLM Setup
   Compile and start the server for text generation.

3. Prompt Template Setup
   Set up a prompt template to guide the model's responses.

4. Response Generation
   Use the LLM with the prompt template to generate a response (a sketch of wrapping the FlexFlow LLM for Langchain follows this list).

5. Shutdown
   Stop the FlexFlow server after generating the response.
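The linked examples wrap the FlexFlow LLM in a small Langchain-compatible class so it can be used inside chains together with prompt templates. The sketch below shows what such a wrapper might look like, assuming Langchain's classic ``LLM`` base class and an already compiled-and-started ``ff_llm``; the class and attribute names are illustrative rather than the exact ones used in the example files.

.. code-block:: python

   from typing import Any, List, Optional

   from langchain.chains import LLMChain
   from langchain.llms.base import LLM
   from langchain.prompts import PromptTemplate

   class FF_LLM_wrapper(LLM):
       """Minimal Langchain wrapper around a running FlexFlow LLM."""

       flexflow_llm: Any  # the compiled-and-started FlexFlow LLM

       @property
       def _llm_type(self) -> str:
           return "flexflow"

       def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
           result = self.flexflow_llm.generate(prompt)
           return result.output_text.decode("utf-8")

   # Wire the wrapper and a prompt template into a chain.
   template = "Question: {question}\nAnswer:"
   prompt = PromptTemplate(template=template, input_variables=["question"])
   llm_chain = LLMChain(prompt=prompt, llm=FF_LLM_wrapper(flexflow_llm=ff_llm))
   response = llm_chain.run("Who was the US president in 1997?")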
Example
=======

A complete code example can be found here:

1. `Prompt Template Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_incr.py>`__

2. `Prompt Template Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_specinfer.py>`__

Example Implementation:
.. code-block:: python

   import flexflow.serve as ff
   from langchain.prompts import PromptTemplate

   # FlexFlowLLM is the helper class defined in the linked examples.
   ff_llm = FlexFlowLLM(...)
   ff_llm.compile_and_start(...)

   template = "Question: {question}\nAnswer:"
   prompt = PromptTemplate(template=template, input_variables=["question"])

   # The template can be combined with the LLM through a chain (see the wrapper sketch above).
   response = ff_llm.generate("Who was the US president in 1997?")
@@ -0,0 +1,90 @@
:tocdepth: 1

********
RAG Q&A
********

Retrieval Augmented Generation (RAG) combines language models with external knowledge. This use case integrates RAG with FlexFlow Serve for Q&A with documents.

Requirements
============

- FlexFlow Serve setup.
- Retriever setup for RAG.

Implementation
==============
1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. Data Retrieval Setup
   Set up a retriever for sourcing information relevant to user queries.

3. RAG Integration
   Integrate the retriever with FlexFlow Serve.

4. Response Generation
   Use the LLM with RAG to generate responses based on the model's knowledge and the retrieved information.

5. Shutdown
   The FlexFlow server shuts down after generating the response; the example below stops it explicitly with ``stop_server()``.
Example
=======

A complete code example for a web-document Q&A using FlexFlow can be found here:

1. `RAG Q&A Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_incr.py>`__

2. `RAG Q&A Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_specinfer.py>`__

Example Implementation:
.. code-block:: python

   # imports (assuming the classic ``langchain`` package layout)
   import flexflow.serve as ff
   from langchain.chains import LLMChain
   from langchain.document_loaders import WebBaseLoader
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.prompts import PromptTemplate
   from langchain.text_splitter import RecursiveCharacterTextSplitter
   from langchain.vectorstores import Chroma
   # FlexFlowLLM and FF_LLM_wrapper are the helper classes defined in the linked examples.

   # compile and start server
   ff_llm = FlexFlowLLM(...)
   gen_config = ff.GenerationConfig(...)
   ff_llm.compile_and_start(...)
   ff_llm_wrapper = FF_LLM_wrapper(flexflow_llm=ff_llm)

   # Load web page content
   loader = WebBaseLoader("https://example.com/data")
   data = loader.load()

   # Split text
   text_splitter = RecursiveCharacterTextSplitter(...)
   all_splits = text_splitter.split_documents(data)

   # Initialize embeddings
   embeddings = OpenAIEmbeddings(...)

   # Create VectorStore
   vectorstore = Chroma.from_documents(all_splits, embeddings)

   # Use VectorStore as a retriever
   retriever = vectorstore.as_retriever()

   # Apply similarity search
   question = "Example Question"
   docs = vectorstore.similarity_search(question)
   max_chars_per_doc = 100
   docs_text = ''.join([docs[i].page_content[:max_chars_per_doc] for i in range(len(docs))])

   # Using a Prompt Template
   prompt_rag = PromptTemplate.from_template(
       "Summarize the main themes in these retrieved docs: {docs_text}"
   )

   # Build Chain
   llm_chain_rag = LLMChain(llm=ff_llm_wrapper, prompt=prompt_rag)

   # Run
   rag_result = llm_chain_rag(docs_text)

   # Stop the server
   ff_llm.stop_server()
@@ -0,0 +1,7 @@
**************************
FlexFlow Serve Python API
**************************

.. toctree::
   serve_fastapi
   serve_gradioapi
@@ -0,0 +1,106 @@
:tocdepth: 1

***********************
FlexFlow Serve FastAPI
***********************

Introduction
============

The Python API for FlexFlow Serve enables users to initialize, manage and interact with large language models (LLMs) via FastAPI or Gradio.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- FastAPI and Uvicorn for running the API server.

API Configuration
=================

Users can configure the API using FastAPI to handle requests and manage the model.
1. FastAPI Application Initialization
   Initialize the FastAPI application to create API endpoints.

2. Request Model Definition
   Define the model for API requests using Pydantic.

3. Global Variable for LLM Model
   Declare a global variable to store the LLM model.
Example
-------

.. code-block:: python

   from fastapi import FastAPI
   from pydantic import BaseModel
   import flexflow.serve as ff

   app = FastAPI()

   class PromptRequest(BaseModel):
       prompt: str

   # Placeholder for the compiled FlexFlow LLM; populated at startup.
   llm = None
Endpoint Creation
=================

Create API endpoints for LLM interactions to handle generation requests.
1. Initialize Model on Startup
   Use the FastAPI event handler to initialize and compile the LLM model when the API server starts.

2. Generate Response Endpoint
   Create a POST endpoint to generate responses based on the user's prompt.
Example
-------

.. code-block:: python

   @app.on_event("startup")
   async def startup_event():
       global llm
       # Initialize and compile the LLM model
       # (the full entrypoint code linked below constructs the ff.LLM object here)
       llm.compile(
           generation_config,
           # ... other params as needed
       )
       llm.start_server()

   @app.post("/generate/")
   async def generate(prompt_request: PromptRequest):
       # ... exception handling
       full_output = llm.generate([prompt_request.prompt])[0].output_text.decode('utf-8')
       # ... split prompt and response text for returning results
       return {"prompt": prompt_request.prompt, "response": full_output}
Running and Testing
===================

Instructions for running and testing the FastAPI server.

1. Run the FastAPI Server
   Use Uvicorn to run the FastAPI server with the specified host and port.

2. Testing the API
   Make requests to the API endpoints and verify the responses (a sample request is shown after the example below).
Example
-------

.. code-block:: bash

   # Running within the inference/python folder:
   uvicorn entrypoint.fastapi_incr:app --reload --port 3000
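To test the API (step 2 above), a request can be sent to the ``/generate/`` endpoint once the server is up; the prompt below is only an illustration:

.. code-block:: bash

   curl -X POST http://localhost:3000/generate/ \
        -H "Content-Type: application/json" \
        -d '{"prompt": "Who was the US president in 1997?"}'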
Full API Entrypoint Code
========================

The complete FastAPI entrypoint code can be found here:

1. `FastAPI Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_incr.py>`__

2. `FastAPI Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python//entrypoint/fastapi_specinfer.py>`__
@@ -0,0 +1,30 @@
:tocdepth: 1

*************************
FlexFlow Serve Gradio API
*************************

Introduction
============

Users can also set up API endpoints through a Gradio Chatbot Interface.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- A running Gradio chatbot interface.

Example
=======
In a running Gradio chatbot interface, click the "Use via API" button on the bottom left.

.. image:: /imgs/gradio_interface.png
   :alt: Gradio Chatbot Interface
   :align: center

Users can easily access an API endpoint for sending prompts to the model.

.. image:: /imgs/gradio_api.png
   :alt: Gradio API
   :align: center
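Programmatic access is also possible with the ``gradio_client`` package. The sketch below is only an assumption-based example: the server URL and the ``/chat`` route depend on how the interface was launched, so check the "Use via API" page for the exact endpoint names.

.. code-block:: python

   # A minimal sketch, assuming a local ChatInterface exposing the default "/chat" route.
   from gradio_client import Client

   client = Client("http://localhost:7860/")  # URL of the running Gradio app (assumed)
   result = client.predict(
       "What is FlexFlow Serve?",  # user message
       api_name="/chat",
   )
   print(result)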
@@ -0,0 +1,8 @@
*******************
Serving Usecases
*******************

.. toctree::
   chatbot
   prompt_template
   rag
@@ -3,3 +3,4 @@ weights
tokenizers
prompt
output
.env