
Chatbot with Gradio, FastApi Endpoint, Langchain Integration #1246

Merged Jan 26, 2024 · 70 commits

Commits
c33ec8d
add a background server for RequestManager
jiazhihao Nov 2, 2023
9ec4cdb
.
jiazhihao Nov 4, 2023
8260fd8
make incr_decoding work
jiazhihao Nov 4, 2023
9bbc806
make spec_infer work
jiazhihao Nov 5, 2023
3b6f7a9
format
jiazhihao Nov 5, 2023
5ebc914
update python inference
jiazhihao Nov 5, 2023
e1d606f
resolve merge conflict
jiazhihao Nov 5, 2023
be42e20
fix python issues
jiazhihao Nov 5, 2023
400d5bd
bug fix
jiazhihao Nov 5, 2023
2a17173
Merge branch 'inference' into background_worker
goliaro Nov 6, 2023
56f9f2b
Merge branch 'inference' into background_worker
jiazhihao Nov 10, 2023
0713433
add a Legion future to capture the termination of the background server
jiazhihao Nov 10, 2023
499fab8
Merge branch 'inference' into background_worker
jiazhihao Nov 15, 2023
d908b1a
Merge branch 'inference' into background_worker
zwang86 Nov 17, 2023
938a2d6
Merge branch 'inference' into background_worker
zwang86 Nov 28, 2023
7125f95
Merge branch 'inference' into background_worker
zwang86 Dec 1, 2023
0219245
gradio finished
april-yyt Dec 4, 2023
7404652
chatbot gradio version 2
april-yyt Dec 8, 2023
48aa14d
chainlit1
april-yyt Dec 8, 2023
d7f9ed5
chainlit2
april-yyt Dec 8, 2023
9d0d3ec
fastapi done
april-yyt Dec 8, 2023
889cdf8
fastapi incr_decoding
april-yyt Dec 8, 2023
1b2eac7
langchain example & wrapper class
april-yyt Dec 9, 2023
ad0a42a
langchain example & wrapper class1
april-yyt Dec 9, 2023
f1f7e9d
added documentation
april-yyt Dec 10, 2023
91c7e94
Merge branch 'inference' into background_worker
zwang86 Dec 11, 2023
6cdd948
Merge branch 'inference' into background_worker
zwang86 Dec 13, 2023
b4fe796
entrypoint
april-yyt Dec 13, 2023
0d9c08e
del apikey
april-yyt Dec 13, 2023
bb3acdf
delete extra files
april-yyt Dec 13, 2023
efdb532
rag search fixed some bugs
april-yyt Dec 21, 2023
a1d6e5c
fixed rag search issues
april-yyt Dec 30, 2023
326d953
updates before rebase
april-yyt Jan 4, 2024
f97240a
minor changes
april-yyt Jan 4, 2024
8485edd
Merge branch 'inference' into background_worker
zwang86 Jan 5, 2024
e469f82
reorganize files
april-yyt Jan 5, 2024
c497ec2
Add thread safety for background server.
zwang86 Jan 5, 2024
99cc9ac
Simplify backend server design.
zwang86 Jan 5, 2024
4b4d1a9
resolve conflict.
zwang86 Jan 5, 2024
7b8fd28
specinfer usecases with issues labeled
april-yyt Jan 5, 2024
2496a15
specinfer usecases with issues labeled 2
april-yyt Jan 5, 2024
439696c
fixed issues with prompt template
april-yyt Jan 5, 2024
4568722
fix issues with rag specinfer
april-yyt Jan 5, 2024
3bd11ae
merge background worker
april-yyt Jan 5, 2024
70212f6
Merge branch 'inference' into background_worker
zwang86 Jan 12, 2024
a58aa6d
Add server task timeout.
zwang86 Jan 12, 2024
1725c81
Merge branch 'inference' of https://github.com/flexflow/FlexFlow into…
jiazhihao Jan 12, 2024
4dd98bb
Merge branch 'inference' of https://github.com/flexflow/FlexFlow into…
jiazhihao Jan 12, 2024
0bce49a
register callbacks to terminate background worker at exit or termination
jiazhihao Jan 12, 2024
058308c
[Python] enable decoding multiple requests
jiazhihao Jan 13, 2024
37feea4
update README.md and default configuration
jiazhihao Jan 13, 2024
d8a4988
fix issues with gradio and prompt template
april-yyt Jan 13, 2024
4c2acbb
fix issues with rag
april-yyt Jan 13, 2024
e451b30
adjusted fastapi entrypoint
april-yyt Jan 13, 2024
e275958
update documentation
april-yyt Jan 13, 2024
33279b7
Merge remote-tracking branch 'origin/background_worker' into chatbot-2
april-yyt Jan 13, 2024
a232328
resole conflicts
april-yyt Jan 13, 2024
c10bb08
merge background-worker branch
april-yyt Jan 13, 2024
8fcb40d
issues fix
april-yyt Jan 13, 2024
f38165c
resolve conflicts from inference
april-yyt Jan 14, 2024
9d1a901
adjustments on usecases and api entrypoints
april-yyt Jan 14, 2024
437577e
remove redundent changes
april-yyt Jan 14, 2024
59c3e9c
testing CI
april-yyt Jan 19, 2024
05a2907
Merge branch 'inference' into chatbot-2
april-yyt Jan 19, 2024
ed3cf46
Enable backtrace
april-yyt Jan 19, 2024
56c923e
restore newlines
goliaro Jan 19, 2024
57c2e22
version
xinhaoc Jan 25, 2024
7892e40
Merge branch 'inference' into chatbot-2
april-yyt Jan 25, 2024
93cf72f
add back misdeleted line
april-yyt Jan 25, 2024
587a0d2
legion verion
xinhaoc Jan 26, 2024
3 changes: 0 additions & 3 deletions SERVE.md
@@ -187,9 +187,6 @@ We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruct
FlexFlow Serve is still under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions.

* AMD benchmarking. We are actively working on benchmarking FlexFlow Serve on AMD GPUs and comparing it with the performance on NVIDIA GPUs.
* Chatbot prompt templates and Multi-round conversations
* Support for FastAPI server
* Integration with LangChain for document question answering

## Acknowledgements
This project is initiated by members from CMU, Stanford, and UCSD. We will be continuing developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as:
2 changes: 1 addition & 1 deletion deps/legion
Submodule legion updated from 626b55 to 2b7248
64 changes: 64 additions & 0 deletions docs/source/chatbot.rst
@@ -0,0 +1,64 @@
:tocdepth: 1
********
Chatbot
********

The chatbot use case sets up a conversational AI model with FlexFlow Serve that can engage in interactive dialogues with users.

Requirements
============

- FlexFlow Serve setup with required configurations.
- Gradio or any interactive interface tool.

Implementation
==============

1. FlexFlow Initialization
   Initialize FlexFlow Serve with the desired configurations and a specific LLM model.

2. Gradio Interface Setup
   Define a function that generates responses from user inputs, then set up the Gradio chat interface for interaction.

.. code-block:: python

   def generate_response(user_input):
       # `llm` is the FlexFlow LLM object created during initialization
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')


3. Running the Interface
   Launch the Gradio interface and interact with the model by entering text inputs.

.. image:: /imgs/gradio_interface.png
   :alt: Gradio Chatbot Interface
   :align: center

4. Shutdown
   Stop the FlexFlow server after the interaction.
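FlexFlow returns the generated text as bytes, which is why the decode step above is required. A dependency-free sketch of that step, using a stand-in result object (the class name is illustrative, not FlexFlow's):

```python
class FakeResult:
    """Stand-in for FlexFlow's generation result (illustrative only)."""
    def __init__(self, output_text: bytes):
        self.output_text = output_text

def generate_response_from(result) -> str:
    # Mirrors generate_response above: decode the raw bytes into a string
    return result.output_text.decode('utf-8')

print(generate_response_from(FakeResult(b"Hello!")))  # -> Hello!
```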

Example
=======

Complete code examples can be found here:

1. `Chatbot Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_incr.py>`__

2. `Chatbot Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/gradio_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import gradio as gr
   import flexflow.serve as ff

   ff.init(num_gpus=2, memory_per_gpu=14000, ...)

   # `llm` is created and compiled during FlexFlow initialization (omitted here)

   def generate_response(user_input):
       result = llm.generate(user_input)
       return result.output_text.decode('utf-8')

   iface = gr.ChatInterface(fn=generate_response)
   iface.launch()
Binary file added docs/source/imgs/gradio_api.png
Binary file added docs/source/imgs/gradio_interface.png
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -18,6 +18,8 @@ Welcome to FlexFlow's documentation!
   :caption: FlexFlow Serve

   serve_overview
   serve_usecases
   serve_api

.. toctree::
   :caption: FlexFlow Train
55 changes: 55 additions & 0 deletions docs/source/prompt_template.rst
@@ -0,0 +1,55 @@
:tocdepth: 1
****************
Prompt Template
****************

Prompt templates guide the model's response generation. This use case demonstrates setting up FlexFlow Serve with Langchain integration and using prompt templates to handle dynamic prompts.

Requirements
============

- FlexFlow Serve setup with appropriate configurations.
- Langchain integration with templates for prompt management.

Implementation
==============

1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. LLM Setup
   Compile the model and start the server for text generation.

3. Prompt Template Setup
   Set up a prompt template to guide the model's responses.

4. Response Generation
   Use the LLM with the prompt template to generate a response.

5. Shutdown
   Stop the FlexFlow server after generating the response.

Example
=======

Complete code examples can be found here:

1. `Prompt Template Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_incr.py>`__

2. `Prompt Template Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/prompt_template_specinfer.py>`__


Example Implementation:

.. code-block:: python

   import flexflow.serve as ff
   from langchain.prompts import PromptTemplate

   # FlexFlowLLM is the wrapper class defined in the linked examples
   ff_llm = FlexFlowLLM(...)
   ff_llm.compile_and_start(...)

   template = "Question: {question}\nAnswer:"
   prompt = PromptTemplate(template=template, input_variables=["question"])

   # Render the template before passing the prompt to the model
   response = ff_llm.generate(prompt.format(question="Who was the US president in 1997?"))
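The substitution performed by PromptTemplate is plain "{name}"-style string formatting; a dependency-free sketch (the helper name is illustrative):

```python
def render(template: str, **variables) -> str:
    # Equivalent of PromptTemplate.format for simple {name} placeholders
    return template.format(**variables)

template = "Question: {question}\nAnswer:"
prompt_text = render(template, question="Who was the US president in 1997?")
print(prompt_text)
```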
90 changes: 90 additions & 0 deletions docs/source/rag.rst
@@ -0,0 +1,90 @@
:tocdepth: 1
********
RAG Q&A
********

Retrieval Augmented Generation (RAG) combines language models with external knowledge. This use case integrates RAG with FlexFlow Serve for question answering over documents.

Requirements
============

- FlexFlow Serve setup.
- Retriever setup for RAG.

Implementation
==============

1. FlexFlow Initialization
   Initialize and configure FlexFlow Serve.

2. Data Retrieval Setup
   Set up a retriever for sourcing information relevant to user queries.

3. RAG Integration
   Integrate the retriever with FlexFlow Serve.

4. Response Generation
   Use the LLM with RAG to generate responses based on the model's knowledge and the retrieved information.

5. Shutdown
   Stop the FlexFlow server after generating the response.

Example
=======

Complete code examples for web-document Q&A using FlexFlow can be found here:

1. `RAG Q&A Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_incr.py>`__

2. `RAG Q&A Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/usecases/rag_specinfer.py>`__


Example Implementation:

.. code-block:: python

   # Import paths follow the classic LangChain layout and may vary by version
   import flexflow.serve as ff
   from langchain.document_loaders import WebBaseLoader
   from langchain.text_splitter import RecursiveCharacterTextSplitter
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.vectorstores import Chroma
   from langchain.prompts import PromptTemplate
   from langchain.chains import LLMChain

   # Compile and start the server (FlexFlowLLM and FF_LLM_wrapper are
   # helper classes defined in the linked examples)
   ff_llm = FlexFlowLLM(...)
   gen_config = ff.GenerationConfig(...)
   ff_llm.compile_and_start(...)
   ff_llm_wrapper = FF_LLM_wrapper(flexflow_llm=ff_llm)

   # Load web page content
   loader = WebBaseLoader("https://example.com/data")
   data = loader.load()

   # Split text into chunks
   text_splitter = RecursiveCharacterTextSplitter(...)
   all_splits = text_splitter.split_documents(data)

   # Initialize embeddings
   embeddings = OpenAIEmbeddings(...)

   # Create a VectorStore from the splits
   vectorstore = Chroma.from_documents(all_splits, embeddings)

   # Use the VectorStore as a retriever
   retriever = vectorstore.as_retriever()

   # Apply similarity search, truncating each retrieved document
   question = "Example Question"
   docs = vectorstore.similarity_search(question)
   max_chars_per_doc = 100
   docs_text = ''.join(docs[i].page_content[:max_chars_per_doc] for i in range(len(docs)))

   # Build the prompt from a template
   prompt_rag = PromptTemplate.from_template(
       "Summarize the main themes in these retrieved docs: {docs_text}"
   )

   # Build and run the chain on the retrieved text
   llm_chain_rag = LLMChain(llm=ff_llm_wrapper, prompt=prompt_rag)
   rag_result = llm_chain_rag(docs_text)

   # Stop the server
   ff_llm.stop_server()
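The truncate-and-join step after the similarity search can be isolated as a small helper (the name is illustrative):

```python
def concat_docs(pages, max_chars_per_doc=100):
    # Truncate each retrieved document and concatenate the pieces,
    # mirroring the similarity-search post-processing step above
    return ''.join(page[:max_chars_per_doc] for page in pages)

print(concat_docs(["abcdef", "ghijkl"], max_chars_per_doc=3))  # -> abcghi
```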
7 changes: 7 additions & 0 deletions docs/source/serve_api.rst
@@ -0,0 +1,7 @@
**************************
FlexFlow Serve Python API
**************************

.. toctree::
   serve_fastapi
   serve_gradioapi
106 changes: 106 additions & 0 deletions docs/source/serve_fastapi.rst
@@ -0,0 +1,106 @@
:tocdepth: 1
***********************
FlexFlow Serve FastAPI
***********************

Introduction
============

The Python API for FlexFlow Serve enables users to initialize, manage, and interact with large language models (LLMs) via FastAPI or Gradio.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- FastAPI and Uvicorn for running the API server.

API Configuration
=================

Users can configure the API using FastAPI to handle requests and manage the model.

1. FastAPI Application Initialization
   Initialize the FastAPI application to create API endpoints.

2. Request Model Definition
   Define the model for API requests using Pydantic.

3. Global Variable for LLM Model
   Declare a global variable to store the LLM model.

Example
-------

.. code-block:: python

   from fastapi import FastAPI
   from pydantic import BaseModel
   import flexflow.serve as ff

   app = FastAPI()

   class PromptRequest(BaseModel):
       prompt: str

   llm = None

Endpoint Creation
=================

Create API endpoints for LLM interactions to handle generation requests.

1. Initialize Model on Startup
   Use the FastAPI event handler to initialize and compile the LLM model when the API server starts.

2. Generate Response Endpoint
   Create a POST endpoint to generate responses based on the user's prompt.

Example
-------

.. code-block:: python

   @app.on_event("startup")
   async def startup_event():
       global llm
       # Initialize and compile the LLM model
       llm = ff.LLM(...)  # model construction elided in this sketch
       llm.compile(
           generation_config,
           # ... other params as needed
       )
       llm.start_server()

   @app.post("/generate/")
   async def generate(prompt_request: PromptRequest):
       # ... exception handling
       full_output = llm.generate([prompt_request.prompt])[0].output_text.decode('utf-8')
       # ... split prompt and response text for returning results
       return {"prompt": prompt_request.prompt, "response": full_output}
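The "split prompt and response text" step is elided above; a sketch of one plausible implementation (the helper name and exact behavior are assumptions, since FlexFlow's `generate` returns the prompt concatenated with the generated text):

```python
def split_response(prompt: str, full_output: str) -> str:
    # Drop the echoed prompt prefix so only the model's answer is returned
    if full_output.startswith(prompt):
        return full_output[len(prompt):].lstrip()
    return full_output

print(split_response("Hi!", "Hi! How can I help?"))  # -> How can I help?
```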

Running and Testing
===================

Instructions for running and testing the FastAPI server.

1. Run the FastAPI Server
   Use Uvicorn to run the FastAPI server with the specified host and port.

2. Testing the API
   Make requests to the API endpoints and verify the responses.

Example
-------

.. code-block:: bash

   # Run from within the inference/python folder:
   uvicorn entrypoint.fastapi_incr:app --reload --port 3000

Full API Entrypoint Code
=========================

Complete FastAPI entrypoint code examples can be found here:

1. `FastAPI Example with incremental decoding <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_incr.py>`__

2. `FastAPI Example with speculative inference <https://github.com/flexflow/FlexFlow/blob/chatbot-2/inference/python/entrypoint/fastapi_specinfer.py>`__
30 changes: 30 additions & 0 deletions docs/source/serve_gradioapi.rst
@@ -0,0 +1,30 @@
:tocdepth: 1
*************************
FlexFlow Serve Gradio API
*************************

Introduction
============

Users can also set up the API endpoints with a Gradio Chatbot Interface.

Requirements
------------

- FlexFlow Serve setup with necessary configurations.
- A running Gradio chatbot interface.

Example
========

In a running Gradio chatbot interface, click the "Use via API" button at the bottom left.

.. image:: /imgs/gradio_interface.png
   :alt: Gradio Chatbot Interface
   :align: center

Users can easily access an API endpoint for sending prompts to the model.

.. image:: /imgs/gradio_api.png
   :alt: Gradio API
   :align: center
8 changes: 8 additions & 0 deletions docs/source/serve_usecases.rst
@@ -0,0 +1,8 @@
*******************
Serving Use Cases
*******************

.. toctree::
   chatbot
   prompt_template
   rag
1 change: 1 addition & 0 deletions inference/.gitignore
@@ -3,3 +3,4 @@ weights
tokenizers
prompt
output
.env