This repo implements a production-ready, scalable Retrieval-Augmented Generation (RAG)-powered, LLM-based, context-aware, open generative (or extractive) Question-Answering (QA) App that:
- Takes a new `query` (or question) as input;
- Performs vector similarity search within the embedding space, looking up the contexts in the vector database that are most relevant to the incoming `query`;
- Passes the relevant contexts, together with the input `query`, to the LLM;
- The LLM then produces the `answer` to the input `query` while staying aware of the relevant contexts.
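To make the flow concrete, here is a minimal, hypothetical sketch of the retrieval-then-generation loop. The `sentence-transformers` embedder, the in-memory context store, and the prompt format are illustrative assumptions; the actual implementation uses LangChain and the Deta vector database (see `qa_agent.py`).

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedder, for illustration
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy in-memory context store; the repo keeps these vectors in the Deta vector database.
contexts = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]
context_vectors = embedder.encode(contexts)            # shape: (n_contexts, dim)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Vector similarity search: rank stored contexts against the query embedding."""
    query_vector = embedder.encode([query])            # shape: (1, dim)
    scores = cosine_similarity(query_vector, context_vectors)[0]
    top = np.argsort(scores)[::-1][:k]
    return [contexts[i] for i in top]

def build_prompt(query: str) -> str:
    """Pass the retrieved contexts together with the query to the LLM."""
    relevant = "\n".join(retrieve(query))
    return f"Context:\n{relevant}\n\nQuestion: {query}\nAnswer:"

# The resulting prompt is what gets sent to the fine-tuned LLM.
print(build_prompt("Where is the Eiffel Tower?"))
```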
This project also includes fine-tuning a 20B-parameter Large Language Model (LLM) in a multi-GPU cluster environment by leveraging the distributed training paradigm. Moreover, this repo develops the major scalable ML workloads for the contexts (load, embed, and index the contexts in the vector database) across multiple workers with different compute resources, and serves the LLM App in a highly robust and scalable manner.
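As a rough illustration of that distributed context pipeline, the sketch below fans the load/embed/index steps out over Ray workers. The shard layout, the embedding model, and the in-memory index are stand-ins for the repo's actual data loading and Deta indexing code, assumed here for illustration only.

```python
import ray

ray.init()  # connect to the multi-worker cluster (or start a local one)

@ray.remote
def embed_batch(batch: list[str]) -> list[tuple[str, list[float]]]:
    """Embed one shard of contexts; Ray schedules this on any available worker."""
    from sentence_transformers import SentenceTransformer  # assumed embedder
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return list(zip(batch, model.encode(batch).tolist()))

# 1. Load: split the corpus of contexts into shards.
corpus = ["context one ...", "context two ...", "context three ...", "context four ..."]
shards = [corpus[i::2] for i in range(2)]

# 2. Embed: one remote task per shard, executed in parallel across the cluster.
futures = [embed_batch.remote(shard) for shard in shards]

# 3. Index: collect the (context, vector) pairs and write them out.
index = {}
for shard_result in ray.get(futures):
    for context, vector in shard_result:
        index[context] = vector  # in this repo: an insert into the Deta vector database
print(f"indexed {len(index)} contexts")
```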
The diagram below shows the architectural design of this RAG-powered LLM App:
- `Python`
- `Streamlit`
- `PEFT` (for Parameter-Efficient Fine-Tuning)
- `Accelerate`
- `Ray` (for distributed LLM fine-tuning)
- `Datasets`
- `Transformers`
- `PyTorch`
- `NumPy`
- `Scikit-Learn`
- `Deta` (to access the Deta vector database)
- `LangChain`
- `FastAPI` (to serve the production-ready LLM App)
The SQuAD dataset is used to fine-tune EleutherAI's GPT-Neo 20B LLM model; the dataset comprises a `title`, `question`, `answers`, and `context` field for each of its 98.2k examples.
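For reference, the dataset and its per-example fields can be inspected with the Hugging Face `datasets` library (the split sizes below are those of the hub's `squad` dataset):

```python
from datasets import load_dataset

squad = load_dataset("squad")   # train: 87,599 rows; validation: 10,570 rows (~98.2k total)

example = squad["train"][0]
print(example["title"])     # article title
print(example["question"])  # the question to answer
print(example["context"])   # the passage the answer is drawn from
print(example["answers"])   # {'text': [...], 'answer_start': [...]}
```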
- The fine-tuning process for the `GPT-Neo` LLM model can be found in the `finetune.py` file (a hedged LoRA sketch follows this list).
- The code that creates the RAG-powered LLM agent for the QA task can be seen in the `qa_agent.py` file.
- To build the agent as a production-ready API for the QA task, it's worth delving into the `serve.py` file (see the FastAPI sketch after this list).
- To see how `Streamlit` can be used to deploy the LLM app, head to the `streamlit.py` file.
- All hyperparameters that control the fine-tuning of the model are provided in the `config.py` file.
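For a rough idea of what a parameter-efficient fine-tuning setup looks like, here is a minimal LoRA sketch built on `peft` and `transformers`. The smaller `EleutherAI/gpt-neo-1.3B` checkpoint, the prompt format, and every hyperparameter value are illustrative assumptions, not the settings used in `finetune.py` or `config.py`.

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumption: a small GPT-Neo checkpoint so the sketch fits on a single GPU.
base = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train a small set of adapter weights instead of all base parameters.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

def to_features(example):
    """Assumed prompt format: context + question, with the answer as the target."""
    text = (f"Context: {example['context']}\n"
            f"Question: {example['question']}\n"
            f"Answer: {example['answers']['text'][0]}")
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

train = load_dataset("squad", split="train[:1000]").map(to_features)

Trainer(
    model=model,
    args=TrainingArguments("gpt-neo-squad-lora",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=train,
).train()
```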
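Similarly, here is a minimal sketch of what serving the agent behind FastAPI can look like; the `/answer` route and the `qa_agent` placeholder are hypothetical, not the actual interface defined in `serve.py`.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="RAG-powered QA App")

class Query(BaseModel):
    question: str

def qa_agent(question: str) -> str:
    """Placeholder for the RAG agent: retrieve contexts, then prompt the LLM."""
    return f"(answer to: {question})"

@app.post("/answer")
def answer(query: Query) -> dict:
    # Retrieval and generation both happen inside the agent.
    return {"question": query.question, "answer": qa_agent(query.question)}

# Run with: uvicorn serve:app --host 0.0.0.0 --port 8000
```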
To learn more about how to use this RAG-powered LLM QA App, consider watching the following video: