Add new RAG pipeline project #97

Merged on Apr 9, 2024 (35 commits)
35 commits
02fc428
first commit, new llm project
strickvl Mar 25, 2024
603c597
add indexing functionality
strickvl Mar 26, 2024
05b488b
finish basic pipeline functionality
strickvl Mar 26, 2024
aedc529
Merge remote-tracking branch 'origin/main' into feature/new-rag-pipeline
strickvl Mar 26, 2024
9c50f80
update llm_utils and format
strickvl Mar 26, 2024
c85da66
refactored the url scraper + utils
strickvl Mar 26, 2024
362c572
refactoring part 2
strickvl Mar 26, 2024
998d593
fix DB update functionality
strickvl Mar 26, 2024
35cf8c4
add option to switch out the llm within the CLI
strickvl Mar 26, 2024
80e8286
use litellm and drop garbage logs
strickvl Mar 26, 2024
d19ce77
formatting
strickvl Mar 26, 2024
74bf3a8
remove unused title + url
strickvl Mar 26, 2024
642209b
rip out langchain completely
strickvl Mar 27, 2024
d4dd39d
error handling and debug statements
strickvl Mar 27, 2024
9f7cd9e
add code inspo acknowledgements
strickvl Mar 27, 2024
8964005
add and update docstrings
strickvl Mar 27, 2024
7ec4393
remove unused code and use zenml urls
strickvl Mar 27, 2024
aad2972
use smaller embedding model
strickvl Mar 27, 2024
cc320fc
update the dimensionality to match the new embedding model
strickvl Mar 27, 2024
830de9f
no cache for embeddings generation
strickvl Mar 27, 2024
0c7cd69
fix constant
strickvl Mar 27, 2024
fb8a2ea
visualise embeddings
strickvl Mar 27, 2024
5341331
tiny tweaks to params
strickvl Mar 27, 2024
18580d1
Merge branch 'feature/new-rag-pipeline' of github.com:zenml-io/zenml-…
strickvl Mar 27, 2024
4ed2552
add images
strickvl Apr 2, 2024
66824d8
update pipeline code to abstract out DB creds
strickvl Apr 8, 2024
a70f50b
add images
strickvl Apr 8, 2024
0705761
final README updates
strickvl Apr 8, 2024
b6c487d
add RAG pipeline image
strickvl Apr 8, 2024
5806d2d
Merge branch 'main' into feature/new-rag-pipeline
strickvl Apr 8, 2024
a11e4f4
formatting
strickvl Apr 8, 2024
3897798
add super simple RAG pipeline
strickvl Apr 8, 2024
6ba2874
even more basic RAG
strickvl Apr 8, 2024
0662ec5
add a third irrelevant question
strickvl Apr 8, 2024
6f342a5
Refactor preprocess_text and answer_question functions
strickvl Apr 8, 2024
Binary file added llm-complete-guide/.assets/tsne.png
Binary file added llm-complete-guide/.assets/umap.png
9 changes: 9 additions & 0 deletions llm-complete-guide/.dockerignore
@@ -0,0 +1,9 @@
*
!/pipelines/**
!/steps/**
!/materializers/**
!/evaluate/**
!/finetune/**
!/generate/**
!/lit_gpt/**
!/scripts/**
15 changes: 15 additions & 0 deletions llm-complete-guide/LICENSE
@@ -0,0 +1,15 @@
Apache Software License 2.0

Copyright (c) ZenML GmbH 2024. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
144 changes: 144 additions & 0 deletions llm-complete-guide/README.md
@@ -0,0 +1,144 @@
# 🦜 Production-ready RAG pipelines for chat applications

This project showcases how you can work up from a simple RAG pipeline to a more complex setup that
involves finetuning embeddings, reranking retrieved documents, and even finetuning the
LLM itself. We'll do this all for a use case relevant to ZenML: a question
answering system that can provide answers to common questions about ZenML. This
will help you understand how to apply the concepts covered in this guide to your
own projects.

![](.assets/rag-pipeline-zenml-cloud.png)

Contained within this project is all the code needed to run the full pipelines.
You can follow along [in our guide](https://docs.zenml.io/user-guide/llmops-guide/) to understand the decisions and tradeoffs
behind the pipeline and step code contained here. You'll build a solid understanding of how to leverage
LLMs in your MLOps workflows using ZenML, enabling you to build powerful,
scalable, and maintainable LLM-powered applications.

To follow along with the guide you'll need a PostgreSQL database to store the
embeddings; full
instructions are provided below for how to set that up.

## 🙏🏻 Inspiration and Credit

The RAG pipeline relies on code from [this Timescale
blog](https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/)
that showcased using PostgreSQL as a vector database. We adapted it for our use
case and to work with Supabase.

## 🏃 How to run

This project showcases production-ready pipelines so we use some cloud
infrastructure to manage the assets. You can run the pipelines locally using a
local PostgreSQL database, but we encourage you to use a cloud database for
production use cases.

### Connecting to ZenML Cloud

If you run the pipeline using ZenML Cloud you'll have access to the managed
dashboard which will allow you to get started quickly. We offer a free trial so
you can try out the platform without any cost. Visit the [ZenML Cloud
dashboard](https://cloud.zenml.io/) to get started.

### Setting up Supabase

[Supabase](https://supabase.com/) is a cloud provider that offers a hosted PostgreSQL database. It's simple to
use and has a free tier that should be sufficient for this project. Once you've
created a Supabase account and organisation, you'll need to create a new
project.

![](.assets/supabase-create-project.png)

You'll then want to connect to this database instance by getting the connection
string from the Supabase dashboard.

![](.assets/supabase-connection-string.png)

You'll then use these details to populate some environment variables where the pipeline code expects them:

```shell
export ZENML_SUPABASE_USER=<your-supabase-user>
export ZENML_SUPABASE_HOST=<your-supabase-host>
export ZENML_SUPABASE_PORT=<your-supabase-port>
```

You'll want to save the Supabase database password as a ZenML secret so that it
isn't stored in plaintext. You can do this by running the following command:

```shell
zenml secret create supabase_postgres_db --password="YOUR_PASSWORD"
```
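
As a rough illustration of how these settings might come together in Python, the sketch below collects the three environment variables and a password into keyword arguments for a PostgreSQL driver. The function name `supabase_connection_params` and the default database name `postgres` are assumptions made for this example; the pipeline's own connection code may look different.

```python
import os
from typing import Mapping


def supabase_connection_params(
    password: str,
    env: Mapping[str, str] = os.environ,
    dbname: str = "postgres",  # assumption: Supabase's default database name
) -> dict:
    """Collect connection settings from the variables exported above.

    Hypothetical helper for illustration only; the password is passed in
    explicitly so that it never has to live in an environment variable.
    """
    return {
        "user": env["ZENML_SUPABASE_USER"],
        "host": env["ZENML_SUPABASE_HOST"],
        "port": int(env["ZENML_SUPABASE_PORT"]),
        "dbname": dbname,
        "password": password,
    }
```

The resulting dictionary could then be unpacked into a driver call such as `psycopg2.connect(**params)`, with the password read from the `supabase_postgres_db` ZenML secret rather than typed in by hand.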

### Running the RAG pipeline

To run the pipeline, you can use the `run.py` script. This script will allow you
to run the pipelines in the correct order. You can run the script with the
following command:

```shell
python run.py --basic-rag
```

This will run the basic RAG pipeline, which scrapes the ZenML documentation and stores the embeddings in the Supabase database.

### Querying your RAG pipeline assets

Once the pipeline has run successfully, you can query the assets in the Supabase
database using the `--rag-query` flag as well as passing in the model you'd like
to use for the LLM.

In order to use the default LLM for this query, you'll need an account
and an API key from OpenAI specified as another environment variable:

```shell
export OPENAI_API_KEY=<your-openai-api-key>
```

When you're ready to make the query, run the following command:

```shell
python run.py --rag-query "how do I use a custom materializer inside my own zenml steps? i.e. how do I set it? inside the @step decorator?" --model=gpt4
```

Alternative options for LLMs to use include:

- `gpt4`
- `gpt35`
- `claude3`
- `claudehaiku`

Note that Claude will require a different API key from Anthropic. See [the
`litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set this up.
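
To make the short model names concrete, here is a minimal sketch of how a `--model` value could be mapped to a full model identifier. The mapping is copied from `constants.py`; the parser, the `resolve_model` helper and the `gpt35` default are assumptions for illustration, and the real `run.py` may define its flags differently.

```python
import argparse

# Copied from constants.py in this project.
MODEL_NAME_MAP = {
    "gpt4": "gpt-4-0125-preview",
    "gpt35": "gpt-3.5-turbo",
    "claude3": "claude-3-opus-20240229",
    "claudehaiku": "claude-3-haiku-20240307",
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the flag handling in run.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--rag-query", type=str, default=None)
    # Assumption: default to the short name for OPENAI_MODEL in constants.py.
    parser.add_argument(
        "--model", choices=sorted(MODEL_NAME_MAP), default="gpt35"
    )
    return parser


def resolve_model(short_name: str) -> str:
    """Translate a short CLI name into the full model identifier."""
    return MODEL_NAME_MAP[short_name]


args = build_parser().parse_args(
    ["--rag-query", "how do I set a custom materializer?", "--model=claudehaiku"]
)
print(resolve_model(args.model))  # prints "claude-3-haiku-20240307"
```

Restricting `--model` to the keys of the mapping means an unsupported name fails at argument parsing rather than partway through a query.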

## ☁️ Running with a remote stack

The basic RAG pipeline will run using a local stack, but if you want to improve
the speed of the embeddings step you might want to consider using a cloud
orchestrator. Please follow the instructions in [our basic cloud setup guides](https://docs.zenml.io/user-guide/cloud-guide)
(currently available for [AWS](https://docs.zenml.io/user-guide/cloud-guide/aws-guide) and [GCP](https://docs.zenml.io/user-guide/cloud-guide/gcp-guide)) to learn how you can run the pipelines on
a remote stack.

## 📜 Project Structure

The project loosely follows [the recommended ZenML project structure](https://docs.zenml.io/user-guide/starter-guide/follow-best-practices):

```
.
├── LICENSE # License file
├── README.md # This file
├── constants.py # Constants for the project
├── pipelines
│   ├── __init__.py
│   └── llm_basic_rag.py # Basic RAG pipeline
├── requirements.txt # Requirements file
├── run.py # Script to run the pipelines
├── steps
│   ├── __init__.py
│   ├── populate_index.py # Step to populate the index
│   ├── url_scraper.py # Step to scrape the URLs
│   ├── url_scraping_utils.py # Utilities for the URL scraper
│   └── web_url_loader.py # Step to load the URLs
└── utils
    ├── __init__.py
    └── llm_utils.py # Utilities related to the LLM
```
19 changes: 19 additions & 0 deletions llm-complete-guide/constants.py
@@ -0,0 +1,19 @@
# Vector Store constants
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBEDDING_DIMENSIONALITY = (
    384  # Update this to match the dimensionality of the new model
)

# Scraping constants
RATE_LIMIT = 5 # Maximum number of requests per second

# LLM Utils constants
OPENAI_MODEL = "gpt-3.5-turbo"
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L12-v2"
MODEL_NAME_MAP = {
    "gpt4": "gpt-4-0125-preview",
    "gpt35": "gpt-3.5-turbo",
    "claude3": "claude-3-opus-20240229",
    "claudehaiku": "claude-3-haiku-20240307",
}
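
The two vector store constants above describe a sliding window over each document. The sketch below shows one way such a window could work, assuming the sizes are measured in characters; the function `split_into_chunks` is hypothetical and is not taken from the project's steps.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def split_into_chunks(
    text: str,
    chunk_size: int = CHUNK_SIZE,
    chunk_overlap: int = CHUNK_OVERLAP,
) -> list:
    """Split text into windows of chunk_size characters.

    Hypothetical helper: each window starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap
    characters of context.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i : i + chunk_size] for i in range(0, last_start, step)]
```

With the defaults, a 1,000-character document yields three chunks starting at offsets 0, 450 and 900, and the last 50 characters of each chunk reappear at the start of the next.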
16 changes: 16 additions & 0 deletions llm-complete-guide/materializers/__init__.py
@@ -0,0 +1,16 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
87 changes: 87 additions & 0 deletions llm-complete-guide/most_basic_rag_pipeline.py
@@ -0,0 +1,87 @@
import os
import re
import string

from openai import OpenAI


def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return text


def tokenize(text):
    return preprocess_text(text).split()


def retrieve_relevant_chunks(query, corpus, top_n=2):
    query_tokens = set(tokenize(query))
    similarities = []
    for chunk in corpus:
        chunk_tokens = set(tokenize(chunk))
        similarity = len(query_tokens.intersection(chunk_tokens)) / len(
            query_tokens.union(chunk_tokens)
        )
        similarities.append((chunk, similarity))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in similarities[:top_n]]


def answer_question(query, corpus, top_n=2):
    relevant_chunks = retrieve_relevant_chunks(query, corpus, top_n)
    if not relevant_chunks:
        return "I don't have enough information to answer the question."

    context = "\n".join(relevant_chunks)
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": f"Based on the provided context, answer the following question: {query}\n\nContext:\n{context}",
            },
            {
                "role": "user",
                "content": query,
            },
        ],
        model="gpt-3.5-turbo",
    )

    return chat_completion.choices[0].message.content.strip()


# Sci-fi themed corpus about "ZenML World"
corpus = [
    "The luminescent forests of ZenML World are inhabited by glowing Zenbots that emit a soft, pulsating light as they roam the enchanted landscape.",
    "In the neon skies of ZenML World, Cosmic Butterflies flutter gracefully, their iridescent wings leaving trails of stardust in their wake.",
    "Telepathic Treants, ancient sentient trees, communicate through the quantum neural network that spans the entire surface of ZenML World, sharing wisdom and knowledge.",
    "Deep within the melodic caverns of ZenML World, Fractal Fungi emit pulsating tones that resonate through the crystalline structures, creating a symphony of otherworldly sounds.",
    "Near the ethereal waterfalls of ZenML World, Holographic Hummingbirds hover effortlessly, their translucent wings refracting the prismatic light into mesmerizing patterns.",
    "Gravitational Geckos, masters of anti-gravity, traverse the inverted cliffs of ZenML World, defying the laws of physics with their extraordinary abilities.",
    "Plasma Phoenixes, majestic creatures of pure energy, soar above the chromatic canyons of ZenML World, their fiery trails painting the sky in a dazzling display of colors.",
    "Along the prismatic shores of ZenML World, Crystalline Crabs scuttle and burrow, their transparent exoskeletons refracting the light into a kaleidoscope of hues.",
]

# Preprocess the corpus
corpus = [preprocess_text(sentence) for sentence in corpus]

# Ask questions
question1 = "What are Plasma Phoenixes?"
answer1 = answer_question(question1, corpus)
print(f"Question: {question1}")
print(f"Answer: {answer1}")

question2 = (
    "What kinds of creatures live on the prismatic shores of ZenML World?"
)
answer2 = answer_question(question2, corpus)
print(f"Question: {question2}")
print(f"Answer: {answer2}")

irrelevant_question_3 = "What is the capital of Panglossia?"
answer3 = answer_question(irrelevant_question_3, corpus)
print(f"Question: {irrelevant_question_3}")
print(f"Answer: {answer3}")
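
The retrieval in `retrieve_relevant_chunks` above scores each chunk by Jaccard similarity (shared tokens divided by all distinct tokens) and always returns the top `top_n` chunks, even when every score is zero. The sketch below isolates that score and adds an optional cut-off. `jaccard`, `retrieve_with_threshold` and `min_score` are hypothetical names introduced for illustration and are not part of the script.

```python
import re
import string


def tokenize(text):
    """Lowercase, strip punctuation and split on whitespace, as the script does."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(re.sub(r"\s+", " ", text).strip().split())


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_with_threshold(query, corpus, top_n=2, min_score=0.0):
    """Return up to top_n (chunk, score) pairs scoring strictly above min_score."""
    query_tokens = tokenize(query)
    scored = [(chunk, jaccard(query_tokens, tokenize(chunk))) for chunk in corpus]
    scored = [pair for pair in scored if pair[1] > min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Note that a threshold of zero only filters chunks with no shared tokens at all. Every sentence in the corpus above contains common words such as "the" and "of", so an unrelated question that also uses them still scores above zero; a stop-word list or a higher `min_score` would be needed to exclude those matches.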
17 changes: 17 additions & 0 deletions llm-complete-guide/pipelines/__init__.py
@@ -0,0 +1,17 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from pipelines.llm_basic_rag import llm_basic_rag
26 changes: 26 additions & 0 deletions llm-complete-guide/pipelines/llm_basic_rag.py
@@ -0,0 +1,26 @@
from steps.populate_index import (
    generate_embeddings,
    index_generator,
    preprocess_documents,
)
from steps.url_scraper import url_scraper
from steps.web_url_loader import web_url_loader
from zenml import pipeline


@pipeline
def llm_basic_rag() -> None:
    """Executes the pipeline that builds the index for a basic RAG setup.

    This function performs the following steps:
    1. Scrapes URLs using the url_scraper function.
    2. Loads documents from the scraped URLs using the web_url_loader function.
    3. Preprocesses the loaded documents using the preprocess_documents function.
    4. Generates embeddings for the preprocessed documents using the generate_embeddings function.
    5. Generates an index for the embeddings and documents using the index_generator function.
    """
    urls = url_scraper()
    docs = web_url_loader(urls=urls)
    processed_docs = preprocess_documents(documents=docs)
    embeddings = generate_embeddings(split_documents=processed_docs)
    index_generator(embeddings=embeddings, documents=docs)
13 changes: 13 additions & 0 deletions llm-complete-guide/requirements.txt
@@ -0,0 +1,13 @@
zenml
langchain-community
ratelimit
langchain>=0.0.325
langchain-openai
pgvector
psycopg2-binary
beautifulsoup4
unstructured
pandas
numpy
sentence-transformers
litellm