
Chatbot with Gradio, FastApi Endpoint, Langchain Integration #1246

Merged Jan 26, 2024 · 70 commits

Commits
c33ec8d
add a background server for RequestManager
jiazhihao Nov 2, 2023
9ec4cdb
.
jiazhihao Nov 4, 2023
8260fd8
make incr_decoding work
jiazhihao Nov 4, 2023
9bbc806
make spec_infer work
jiazhihao Nov 5, 2023
3b6f7a9
format
jiazhihao Nov 5, 2023
5ebc914
update python inference
jiazhihao Nov 5, 2023
e1d606f
resolve merge conflict
jiazhihao Nov 5, 2023
be42e20
fix python issues
jiazhihao Nov 5, 2023
400d5bd
bug fix
jiazhihao Nov 5, 2023
2a17173
Merge branch 'inference' into background_worker
goliaro Nov 6, 2023
56f9f2b
Merge branch 'inference' into background_worker
jiazhihao Nov 10, 2023
0713433
add a Legion future to capture the termination of the background server
jiazhihao Nov 10, 2023
499fab8
Merge branch 'inference' into background_worker
jiazhihao Nov 15, 2023
d908b1a
Merge branch 'inference' into background_worker
zwang86 Nov 17, 2023
938a2d6
Merge branch 'inference' into background_worker
zwang86 Nov 28, 2023
7125f95
Merge branch 'inference' into background_worker
zwang86 Dec 1, 2023
0219245
gradio finished
april-yyt Dec 4, 2023
7404652
chatbot gradio version 2
april-yyt Dec 8, 2023
48aa14d
chainlit1
april-yyt Dec 8, 2023
d7f9ed5
chainlit2
april-yyt Dec 8, 2023
9d0d3ec
fastapi done
april-yyt Dec 8, 2023
889cdf8
fastapi incr_decoding
april-yyt Dec 8, 2023
1b2eac7
langchain example & wrapper class
april-yyt Dec 9, 2023
ad0a42a
langchain example & wrapper class1
april-yyt Dec 9, 2023
f1f7e9d
added documentation
april-yyt Dec 10, 2023
91c7e94
Merge branch 'inference' into background_worker
zwang86 Dec 11, 2023
6cdd948
Merge branch 'inference' into background_worker
zwang86 Dec 13, 2023
b4fe796
entrypoint
april-yyt Dec 13, 2023
0d9c08e
del apikey
april-yyt Dec 13, 2023
bb3acdf
delete extra files
april-yyt Dec 13, 2023
efdb532
rag search fixed some bugs
april-yyt Dec 21, 2023
a1d6e5c
fixed rag search issues
april-yyt Dec 30, 2023
326d953
updates before rebase
april-yyt Jan 4, 2024
f97240a
minor changes
april-yyt Jan 4, 2024
8485edd
Merge branch 'inference' into background_worker
zwang86 Jan 5, 2024
e469f82
reorganize files
april-yyt Jan 5, 2024
c497ec2
Add thread safety for background server.
zwang86 Jan 5, 2024
99cc9ac
Simplify backend server design.
zwang86 Jan 5, 2024
4b4d1a9
resolve conflict.
zwang86 Jan 5, 2024
7b8fd28
specinfer usecases with issues labeled
april-yyt Jan 5, 2024
2496a15
specinfer usecases with issues labeled 2
april-yyt Jan 5, 2024
439696c
fixed issues with prompt template
april-yyt Jan 5, 2024
4568722
fix issues with rag specinfer
april-yyt Jan 5, 2024
3bd11ae
merge background worker
april-yyt Jan 5, 2024
70212f6
Merge branch 'inference' into background_worker
zwang86 Jan 12, 2024
a58aa6d
Add server task timeout.
zwang86 Jan 12, 2024
1725c81
Merge branch 'inference' of https://github.com/flexflow/FlexFlow into…
jiazhihao Jan 12, 2024
4dd98bb
Merge branch 'inference' of https://github.com/flexflow/FlexFlow into…
jiazhihao Jan 12, 2024
0bce49a
register callbacks to terminate background worker at exit or termination
jiazhihao Jan 12, 2024
058308c
[Python] enable decoding multiple requests
jiazhihao Jan 13, 2024
37feea4
update README.md and default configuration
jiazhihao Jan 13, 2024
d8a4988
fix issues with gradio and prompt template
april-yyt Jan 13, 2024
4c2acbb
fix issues with rag
april-yyt Jan 13, 2024
e451b30
adjusted fastapi entrypoint
april-yyt Jan 13, 2024
e275958
update documentation
april-yyt Jan 13, 2024
33279b7
Merge remote-tracking branch 'origin/background_worker' into chatbot-2
april-yyt Jan 13, 2024
a232328
resole conflicts
april-yyt Jan 13, 2024
c10bb08
merge background-worker branch
april-yyt Jan 13, 2024
8fcb40d
issues fix
april-yyt Jan 13, 2024
f38165c
resolve conflicts from inference
april-yyt Jan 14, 2024
9d1a901
adjustments on usecases and api entrypoints
april-yyt Jan 14, 2024
437577e
remove redundent changes
april-yyt Jan 14, 2024
59c3e9c
testing CI
april-yyt Jan 19, 2024
05a2907
Merge branch 'inference' into chatbot-2
april-yyt Jan 19, 2024
ed3cf46
Enable backtrace
april-yyt Jan 19, 2024
56c923e
restore newlines
goliaro Jan 19, 2024
57c2e22
version
xinhaoc Jan 25, 2024
7892e40
Merge branch 'inference' into chatbot-2
april-yyt Jan 25, 2024
93cf72f
add back misdeleted line
april-yyt Jan 25, 2024
587a0d2
legion verion
xinhaoc Jan 26, 2024
3 changes: 0 additions & 3 deletions SERVE.md
@@ -187,9 +187,6 @@ We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruct
FlexFlow Serve is still under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions.

* AMD benchmarking. We are actively working on benchmarking FlexFlow Serve on AMD GPUs and comparing it with the performance on NVIDIA GPUs.
* Chatbot prompt templates and Multi-round conversations
* Support for FastAPI server
* Integration with LangChain for document question answering

## Acknowledgements
This project is initiated by members from CMU, Stanford, and UCSD. We will be continuing developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as:
2 changes: 1 addition & 1 deletion deps/legion
Submodule legion updated from 626b55 to 2b7248
64 changes: 64 additions & 0 deletions docs/source/chatbot.rst
@@ -0,0 +1,64 @@
:tocdepth: 1
********
Chatbot
********

The chatbot use case sets up a conversational AI model with FlexFlow Serve that can engage in interactive dialogues with users.

Requirements
============

- FlexFlow Serve setup with required configurations.
- Gradio or any interactive interface tool.

Implementation
==============

1. FlexFlow Initialization
   Initialize FlexFlow Serve with the desired configurations and a specific LLM model.

2. Gradio Interface Setup
   Define a function that generates responses from user inputs, then set up the Gradio chat interface for interaction.

.. code-block:: python

   def generate_response(user_input):
       # `llm` is the FlexFlow LLM object created during initialization
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')


3. Running the Interface
   Launch the Gradio interface and interact with the model by entering text inputs.

.. image:: /imgs/gradio_interface.png
   :alt: Gradio Chatbot Interface
   :align: center

4. Shutdown
   Stop the FlexFlow server after the interaction.
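FlexFlow returns the generated text as bytes, which is why the decode step above is required. A dependency-free sketch of that step, using a stand-in result object (the class name is illustrative, not FlexFlow's):

```python
class FakeResult:
    """Stand-in for FlexFlow's generation result (illustrative only)."""
    def __init__(self, output_text: bytes):
        self.output_text = output_text

def generate_response_from(result) -> str:
    # Mirrors generate_response above: decode the raw bytes into a string
    return result.output_text.decode('utf-8')

print(generate_response_from(FakeResult(b"Hello!")))  # -> Hello!
```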

Example
=======

Complete code examples can be found here:

1. `Chatbot Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_incr.py>`__

2. `Chatbot Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import gradio as gr
   import flexflow.serve as ff

   ff.init(num_gpus=2, memory_per_gpu=14000, ...)

   # `llm` is created and compiled during FlexFlow initialization (omitted here)

   def generate_response(user_input):
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')

   iface = gr.ChatInterface(fn=generate_response)
   iface.launch()
Binary file added docs/source/imgs/gradio_api.png
Binary file added docs/source/imgs/gradio_interface.png
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -18,6 +18,8 @@ Welcome to FlexFlow's documentation!
   :caption: FlexFlow Serve

   serve_overview
   serve_usecases
   serve_api

.. toctree::
   :caption: FlexFlow Train
55 changes: 55 additions & 0 deletions docs/source/prompt_template.rst
@@ -0,0 +1,55 @@
:tocdepth: 1
****************
Prompt Template
****************

Prompt templates guide the model's response generation. This use case demonstrates setting up FlexFlow Serve with Langchain integration and using prompt templates to handle dynamic prompts.

Requirements
============

- FlexFlow Serve setup with appropriate configurations.
- Langchain integration with templates for prompt management.

Implementation
==============

1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. LLM Setup
   Compile the model and start the server for text generation.

3. Prompt Template Setup
   Set up a prompt template to guide the model's responses.

4. Response Generation
   Use the LLM with the prompt template to generate a response.

5. Shutdown
   Stop the FlexFlow server after generating the response.

Example
=======

Complete code examples can be found here:

1. `Prompt Template Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_incr.py>`__

2. `Prompt Template Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import flexflow.serve as ff
   from langchain.prompts import PromptTemplate

   # FlexFlowLLM is the wrapper class defined in the linked examples
   ff_llm = FlexFlowLLM(...)
   ff_llm.compile_and_start(...)

   template = "Question: {question}\nAnswer:"
   prompt = PromptTemplate(template=template, input_variables=["question"])

   # Render the template before passing the prompt to the model
   response = ff_llm.generate(prompt.format(question="Who was the US president in 1997?"))
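The substitution performed by PromptTemplate is plain "{name}"-style string formatting; a dependency-free sketch (the helper name is illustrative):

```python
def render(template: str, **variables) -> str:
    # Equivalent of PromptTemplate.format for simple {name} placeholders
    return template.format(**variables)

template = "Question: {question}\nAnswer:"
prompt_text = render(template, question="Who was the US president in 1997?")
print(prompt_text)
```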
90 changes: 90 additions & 0 deletions docs/source/rag.rst
@@ -0,0 +1,90 @@
:tocdepth: 1
********
RAG Q&A
********

Retrieval Augmented Generation (RAG) combines language models with external knowledge. This use case integrates RAG with FlexFlow Serve for question answering over documents.

Requirements
============

- FlexFlow Serve setup.
- Retriever setup for RAG.

Implementation
==============

1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. Data Retrieval Setup
   Set up a retriever for sourcing information relevant to user queries.

3. RAG Integration
   Integrate the retriever with FlexFlow Serve.

4. Response Generation
   Use the LLM with RAG to generate responses based on the model's knowledge and the retrieved information.

5. Shutdown
   Stop the FlexFlow server after generating the response.

Example
=======

Complete code examples for web-document Q&A using FlexFlow can be found here:

1. `RAG Q&A Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_incr.py>`__

2. `RAG Q&A Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_specinfer.py>`__


Example Implementation:

.. code-block:: python

   # Import paths follow the classic LangChain layout and may vary by version
   import flexflow.serve as ff
   from langchain.document_loaders import WebBaseLoader
   from langchain.text_splitter import RecursiveCharacterTextSplitter
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.vectorstores import Chroma
   from langchain.prompts import PromptTemplate
   from langchain.chains import LLMChain

   # Compile and start the server (FlexFlowLLM and FF_LLM_wrapper are
   # helper classes defined in the linked examples)
   ff_llm = FlexFlowLLM(...)
   gen_config = ff.GenerationConfig(...)
   ff_llm.compile_and_start(...)
   ff_llm_wrapper = FF_LLM_wrapper(flexflow_llm=ff_llm)

   # Load web page content
   loader = WebBaseLoader("https://example.com/data")
   data = loader.load()

   # Split text into chunks
   text_splitter = RecursiveCharacterTextSplitter(...)
   all_splits = text_splitter.split_documents(data)

   # Initialize embeddings
   embeddings = OpenAIEmbeddings(...)

   # Create a VectorStore from the splits
   vectorstore = Chroma.from_documents(all_splits, embeddings)

   # Use the VectorStore as a retriever
   retriever = vectorstore.as_retriever()

   # Apply similarity search, truncating each retrieved document
   question = "Example Question"
   docs = vectorstore.similarity_search(question)
   max_chars_per_doc = 100
   docs_text = ''.join(docs[i].page_content[:max_chars_per_doc] for i in range(len(docs)))

   # Build the prompt from a template
   prompt_rag = PromptTemplate.from_template(
       "Summarize the main themes in these retrieved docs: {docs_text}"
   )

   # Build and run the chain on the retrieved text
   llm_chain_rag = LLMChain(llm=ff_llm_wrapper, prompt=prompt_rag)
   rag_result = llm_chain_rag(docs_text)

   # Stop the server
   ff_llm.stop_server()
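The truncate-and-join step after the similarity search can be isolated as a small helper (the name is illustrative):

```python
def concat_docs(pages, max_chars_per_doc=100):
    # Truncate each retrieved document and concatenate the pieces,
    # mirroring the similarity-search post-processing step above
    return ''.join(page[:max_chars_per_doc] for page in pages)

print(concat_docs(["abcdef", "ghijkl"], max_chars_per_doc=3))  # -> abcghi
```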
7 changes: 7 additions & 0 deletions docs/source/serve_api.rst
@@ -0,0 +1,7 @@
**************************
FlexFlow Serve Python API
**************************

.. toctree::
   serve_fastapi
   serve_gradioapi
106 changes: 106 additions & 0 deletions docs/source/serve_fastapi.rst
@@ -0,0 +1,106 @@
:tocdepth: 1
***********************
FlexFlow Serve FastAPI
***********************

Introduction
============

The Python API for FlexFlow Serve enables users to initialize, manage, and interact with large language models (LLMs) via FastAPI or Gradio.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- FastAPI and Uvicorn for running the API server.

API Configuration
=================

Users can configure the API using FastAPI to handle requests and manage the model.

1. FastAPI Application Initialization
   Initialize the FastAPI application to create API endpoints.

2. Request Model Definition
   Define the model for API requests using Pydantic.

3. Global Variable for LLM Model
   Declare a global variable to store the LLM model.

Example
-------

.. code-block:: python

   from fastapi import FastAPI
   from pydantic import BaseModel
   import flexflow.serve as ff

   app = FastAPI()

   class PromptRequest(BaseModel):
       prompt: str

   llm = None

Endpoint Creation
=================

Create API endpoints for LLM interactions to handle generation requests.

1. Initialize Model on Startup
   Use the FastAPI event handler to initialize and compile the LLM model when the API server starts.

2. Generate Response Endpoint
   Create a POST endpoint to generate responses based on the user's prompt.

Example
-------

.. code-block:: python

   @app.on_event("startup")
   async def startup_event():
       global llm
       # Initialize and compile the LLM model
       llm = ff.LLM(...)  # model construction elided in this sketch
       llm.compile(
           generation_config,
           # ... other params as needed
       )
       llm.start_server()

   @app.post("/generate/")
   async def generate(prompt_request: PromptRequest):
       # ... exception handling
       full_output = llm.generate([prompt_request.prompt])[0].output_text.decode('utf-8')
       # ... split prompt and response text for returning results
       return {"prompt": prompt_request.prompt, "response": full_output}
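The "split prompt and response text" step is elided above; a sketch of one plausible implementation (the helper name and exact behavior are assumptions, since FlexFlow's `generate` returns the prompt concatenated with the generated text):

```python
def split_response(prompt: str, full_output: str) -> str:
    # Drop the echoed prompt prefix so only the model's answer is returned
    if full_output.startswith(prompt):
        return full_output[len(prompt):].lstrip()
    return full_output

print(split_response("Hi!", "Hi! How can I help?"))  # -> How can I help?
```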

Running and Testing
===================

Instructions for running and testing the FastAPI server.

1. Run the FastAPI Server
   Use Uvicorn to run the FastAPI server with the specified host and port.

2. Testing the API
   Make requests to the API endpoints and verify the responses.

Example
-------

.. code-block:: bash

   # Run from within the inference/python folder:
   uvicorn entrypoint.fastapi_incr:app --reload --port 3000

Full API Entrypoint Code
=========================

Complete FastAPI entrypoint code examples can be found here:

1. `FastAPI Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_incr.py>`__

2. `FastAPI Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_specinfer.py>`__
30 changes: 30 additions & 0 deletions docs/source/serve_gradioapi.rst
@@ -0,0 +1,30 @@
:tocdepth: 1
*************************
FlexFlow Serve Gradio API
*************************

Introduction
============

Users can also set up the API endpoints with a Gradio Chatbot Interface.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- A running Gradio chatbot interface.

Example
========

In a running Gradio chatbot interface, click the "Use via API" button at the bottom left.

.. image:: /imgs/gradio_interface.png
   :alt: Gradio Chatbot Interface
   :align: center

Users can easily access an API endpoint for sending prompts to the model.

.. image:: /imgs/gradio_api.png
   :alt: Gradio API
   :align: center
8 changes: 8 additions & 0 deletions docs/source/serve_usecases.rst
@@ -0,0 +1,8 @@
*******************
Serving Use Cases
*******************

.. toctree::
   chatbot
   prompt_template
   rag
1 change: 1 addition & 0 deletions inference/.gitignore
@@ -3,3 +3,4 @@ weights
tokenizers
prompt
output
.env