Chatbot with Gradio, FastApi Endpoint, Langchain Integration (#1246)
* add a background server for RequestManager

* .

* make incr_decoding work

* make spec_infer work

* format

* update python inference

* fix python issues

* bug fix

* add a Legion future to capture the termination of the background server

* gradio finished

* chatbot gradio version 2

* chainlit1

* chainlit2

* fastapi done

* fastapi incr_decoding

* langchain example & wrapper class

* langchain example & wrapper class1

* added documentation

* entrypoint

* del apikey

* delete extra files

* rag search fixed some bugs

* fixed rag search issues

* updates before rebase

* minor changes

* reorganize files

* Add thread safety for background server.

* Simplify backend server design.

* resolve conflict.

* specinfer usecases with issues labeled

* specinfer usecases with issues labeled 2

* fixed issues with prompt template

* fix issues with rag specinfer

* Add server task timeout.

* register callbacks to terminate background worker at exit or termination

* [Python] enable decoding multiple requests

* update README.md and default configuration

* fix issues with gradio and prompt template

* fix issues with rag

* adjusted fastapi entrypoint

* update documentation

* resolve conflicts

* issues fix

* adjustments on usecases and api entrypoints

* remove redundant changes

* testing CI

* Enable backtrace

* restore newlines

* version

* add back misdeleted line

* legion version

---------

Co-authored-by: Zhihao Jia <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: zwang86 <[email protected]>
Co-authored-by: Zeyu Wang <[email protected]>
Co-authored-by: xinhaoc <[email protected]>
6 people authored Jan 26, 2024
1 parent d73bba1 commit abf9fb8
Showing 23 changed files with 2,013 additions and 9 deletions.
3 changes: 0 additions & 3 deletions SERVE.md
@@ -187,9 +187,6 @@ We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruct
FlexFlow Serve is still under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions.

* AMD benchmarking. We are actively working on benchmarking FlexFlow Serve on AMD GPUs and comparing it with the performance on NVIDIA GPUs.
* Chatbot prompt templates and Multi-round conversations
* Support for FastAPI server
* Integration with LangChain for document question answering

## Acknowledgements
This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as:
64 changes: 64 additions & 0 deletions docs/source/chatbot.rst
@@ -0,0 +1,64 @@
:tocdepth: 1
********
Chatbot
********

The chatbot use case sets up a conversational AI model with FlexFlow Serve that can engage in interactive dialogue with users.

Requirements
============

- FlexFlow Serve setup with required configurations.
- Gradio or any interactive interface tool.

Implementation
==============

1. FlexFlow Initialization
   Initialize FlexFlow Serve with the desired configuration and a specific LLM model.
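
   A hedged sketch of this step; the ``ff.init`` keyword values are placeholders, and the model name is an assumption used for illustration:

   .. code-block:: python

      import flexflow.serve as ff

      # Initialize the FlexFlow Serve runtime (values are placeholders)
      ff.init(num_gpus=2, memory_per_gpu=14000, zero_copy_memory_per_node=10000)

      # Create the LLM; the model name here is an assumption
      llm = ff.LLM("meta-llama/Llama-2-7b-hf")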

2. Gradio Interface Setup
   Define a function that generates responses from user input, and set up a Gradio ``ChatInterface`` for interaction.

   .. code-block:: python

      # gr.ChatInterface calls fn with the new message and the chat history
      def generate_response(user_input, history):
          result = llm.generate(user_input)
          return result.output_text.decode('utf-8')

3. Running the Interface
Launch the Gradio interface and interact with the model by entering text inputs.

.. image:: /imgs/gradio_interface.png
:alt: Gradio Chatbot Interface
:align: center

4. Shutdown
Stop the FlexFlow server after interaction.
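
   A minimal sketch of this step, assuming ``llm`` exposes the ``stop_server`` method used in the full examples:

   .. code-block:: python

      # Stop the FlexFlow server once the session is over
      llm.stop_server()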

Example
=======

Complete code examples can be found here:

1. `Chatbot Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_incr.py>`__

2. `Chatbot Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import gradio as gr
   import flexflow.serve as ff

   # Initialize FlexFlow Serve (remaining configuration elided)
   ff.init(num_gpus=2, memory_per_gpu=14000, ...)

   # ... create and start the LLM (see the full examples above)

   # gr.ChatInterface calls fn with the new message and the chat history
   def generate_response(user_input, history):
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')

   iface = gr.ChatInterface(fn=generate_response)
   iface.launch()

Binary file added docs/source/imgs/gradio_api.png
Binary file added docs/source/imgs/gradio_interface.png
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -18,6 +18,8 @@ Welcome to FlexFlow's documentation!
:caption: FlexFlow Serve

serve_overview
serve_usecases
serve_api

.. toctree::
:caption: FlexFlow Train
55 changes: 55 additions & 0 deletions docs/source/prompt_template.rst
@@ -0,0 +1,55 @@
:tocdepth: 1
****************
Prompt Template
****************

Prompt templates guide the model's response generation. This use case demonstrates setting up FlexFlow Serve to integrate with LangChain and using prompt templates to handle dynamic user queries.

Requirements
============

- FlexFlow Serve setup with appropriate configurations.
- LangChain integration with templates for prompt management.

Implementation
==============

1. FlexFlow Initialization
Initialize and configure FlexFlow Serve.

2. LLM Setup
Compile and start the server for text generation.

3. Prompt Template Setup
   Set up a prompt template to guide the model's responses.

4. Response Generation
Use the LLM with the prompt template to generate a response.

5. Shutdown
Stop the FlexFlow server after generating the response.

Example
=======

Complete code examples can be found here:

1. `Prompt Template Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_incr.py>`__

2. `Prompt Template Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import flexflow.serve as ff
   from langchain.prompts import PromptTemplate

   # FlexFlowLLM is a helper class defined in the full examples linked above
   ff_llm = FlexFlowLLM(...)
   ff_llm.compile_and_start(...)

   # Fill the template with the user's question before generating
   template = "Question: {question}\nAnswer:"
   prompt = PromptTemplate(template=template, input_variables=["question"])
   response = ff_llm.generate(prompt.format(question="Who was the US president in 1997?"))
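
The template can also be driven through LangChain's chain interface. A hedged sketch, assuming the ``FF_LLM_wrapper`` adapter used in the RAG example adapts the FlexFlow LLM to LangChain's LLM interface:

.. code-block:: python

   from langchain.chains import LLMChain

   # Wrap the FlexFlow LLM so LangChain can call it
   llm_chain = LLMChain(llm=FF_LLM_wrapper(flexflow_llm=ff_llm), prompt=prompt)
   answer = llm_chain.run(question="Who was the US president in 1997?")

   # Stop the server once done
   ff_llm.stop_server()
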
90 changes: 90 additions & 0 deletions docs/source/rag.rst
@@ -0,0 +1,90 @@
:tocdepth: 1
********
RAG Q&A
********

Retrieval Augmented Generation (RAG) combines language models with external knowledge. This use case integrates RAG with FlexFlow Serve for Q&A over documents.

Requirements
============

- FlexFlow Serve setup.
- Retriever setup for RAG.

Implementation
==============

1. FlexFlow Initialization
Initialize and configure FlexFlow Serve.

2. Data Retrieval Setup
   Set up a retriever for sourcing information relevant to user queries.

3. RAG Integration
Integrate the retriever with FlexFlow Serve.

4. Response Generation
   Use the LLM with RAG to generate responses based on the model's knowledge and the retrieved information.

5. Shutdown
The FlexFlow server automatically shuts down after generating the response.

Example
=======

Complete code examples for web-document Q&A using FlexFlow can be found here:

1. `RAG Q&A Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_incr.py>`__

2. `RAG Q&A Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_specinfer.py>`__


Example Implementation:

.. code-block:: python

   # Imports below follow the classic (pre-0.1) LangChain layout
   import flexflow.serve as ff
   from langchain.chains import LLMChain
   from langchain.document_loaders import WebBaseLoader
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.prompts import PromptTemplate
   from langchain.text_splitter import RecursiveCharacterTextSplitter
   from langchain.vectorstores import Chroma

   # Compile and start the server; FlexFlowLLM and FF_LLM_wrapper are
   # helper classes defined in the full examples linked above
   ff_llm = FlexFlowLLM(...)
   gen_config = ff.GenerationConfig(...)
   ff_llm.compile_and_start(...)
   ff_llm_wrapper = FF_LLM_wrapper(flexflow_llm=ff_llm)

   # Load web page content
   loader = WebBaseLoader("https://example.com/data")
   data = loader.load()

   # Split text
   text_splitter = RecursiveCharacterTextSplitter(...)
   all_splits = text_splitter.split_documents(data)

   # Initialize embeddings
   embeddings = OpenAIEmbeddings(...)

   # Create a vector store
   vectorstore = Chroma.from_documents(all_splits, embeddings)

   # Use the vector store as a retriever
   retriever = vectorstore.as_retriever()

   # Apply similarity search
   question = "Example Question"
   docs = vectorstore.similarity_search(question)
   max_chars_per_doc = 100
   docs_text = ''.join([docs[i].page_content[:max_chars_per_doc] for i in range(len(docs))])

   # Build a prompt template over the retrieved text
   prompt_rag = PromptTemplate.from_template(
       "Summarize the main themes in these retrieved docs: {docs_text}"
   )

   # Build the chain
   llm_chain_rag = LLMChain(llm=ff_llm_wrapper, prompt=prompt_rag)

   # Run
   rag_result = llm_chain_rag(docs_text)

   # Stop the server
   ff_llm.stop_server()

7 changes: 7 additions & 0 deletions docs/source/serve_api.rst
@@ -0,0 +1,7 @@
**************************
FlexFlow Serve Python API
**************************

.. toctree::
serve_fastapi
serve_gradioapi
106 changes: 106 additions & 0 deletions docs/source/serve_fastapi.rst
@@ -0,0 +1,106 @@
:tocdepth: 1
***********************
FlexFlow Serve FastAPI
***********************

Introduction
============

The Python API for FlexFlow Serve enables users to initialize, manage, and interact with large language models (LLMs) via FastAPI or Gradio.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- FastAPI and Uvicorn for running the API server.

API Configuration
=================

Users can configure the API using FastAPI to handle requests and manage the model.

1. FastAPI Application Initialization
Initialize the FastAPI application to create API endpoints.

2. Request Model Definition
Define the model for API requests using Pydantic.

3. Global Variable for LLM Model
Declare a global variable to store the LLM model.

Example
-------

.. code-block:: python

   from fastapi import FastAPI
   from pydantic import BaseModel
   import flexflow.serve as ff

   app = FastAPI()

   class PromptRequest(BaseModel):
       prompt: str

   llm = None

Endpoint Creation
=================

Create API endpoints for LLM interactions to handle generation requests.

1. Initialize Model on Startup
Use the FastAPI event handler to initialize and compile the LLM model when the API server starts.

2. Generate Response Endpoint
Create a POST endpoint to generate responses based on the user's prompt.

Example
-------

.. code-block:: python

   @app.on_event("startup")
   async def startup_event():
       global llm
       # Initialize and compile the LLM model
       llm.compile(
           generation_config,
           # ... other params as needed
       )
       llm.start_server()

   @app.post("/generate/")
   async def generate(prompt_request: PromptRequest):
       # ... exception handling
       full_output = llm.generate([prompt_request.prompt])[0].output_text.decode('utf-8')
       # ... split prompt and response text for returning results
       return {"prompt": prompt_request.prompt, "response": full_output}

Running and Testing
===================

Instructions for running and testing the FastAPI server.

1. Run the FastAPI Server
Use Uvicorn to run the FastAPI server with specified host and port.

2. Testing the API
Make requests to the API endpoints and verify the responses.

Example
-------

.. code-block:: bash

   # Running within the inference/python folder:
   uvicorn entrypoint.fastapi_incr:app --reload --port 3000

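To verify the endpoint, send a request from any HTTP client. A minimal sketch using ``requests``, assuming the server from the previous step is listening on port 3000:

.. code-block:: python

   import requests

   # POST a prompt to the /generate/ endpoint defined above
   resp = requests.post(
       "http://localhost:3000/generate/",
       json={"prompt": "What is the capital of France?"},
   )
   print(resp.json())  # {"prompt": ..., "response": ...}
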
Full API Entrypoint Code
=========================

Complete code examples for the FastAPI entrypoints can be found here:

1. `FastAPI Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_incr.py>`__

2. `FastAPI Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python//entrypoint/fastapi_specinfer.py>`__
30 changes: 30 additions & 0 deletions docs/source/serve_gradioapi.rst
@@ -0,0 +1,30 @@
:tocdepth: 1
*************************
FlexFlow Serve Gradio API
*************************

Introduction
============

Users can also set up the API endpoints with a Gradio Chatbot Interface.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- A running Gradio chatbot interface.

Example
========

In a running Gradio chatbot interface, click the "Use via API" button at the bottom left.

.. image:: /imgs/gradio_interface.png
:alt: Gradio Chatbot Interface
:align: center

Users can easily access an API endpoint for sending prompts to the model.

.. image:: /imgs/gradio_api.png
:alt: Gradio API
:align: center
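
From there, prompts can also be sent programmatically. A hedged sketch using the ``gradio_client`` package, assuming the interface runs locally on Gradio's default port; the exact ``api_name`` is shown on the "Use via API" page:

.. code-block:: python

   from gradio_client import Client

   # Connect to the running Gradio interface (URL is an assumption)
   client = Client("http://127.0.0.1:7860/")

   # The api_name below is illustrative; copy the real one from the API page
   result = client.predict("What is machine learning?", api_name="/chat")
   print(result)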
8 changes: 8 additions & 0 deletions docs/source/serve_usecases.rst
@@ -0,0 +1,8 @@
*******************
Serving Use Cases
*******************

.. toctree::
chatbot
prompt_template
rag
1 change: 1 addition & 0 deletions inference/.gitignore
@@ -3,3 +3,4 @@ weights
tokenizers
prompt
output
.env