Enterprise-grade real-time RAG pipeline on Wikipedia 🌎

This project is part of a series of related GitHub projects; the back-end counterpart is described in the architecture section below.

Introduction

Wikipedia is an amazing source of information 🧠. With all the real-time additions and updates of articles, it's a valuable source of information about what's happening in the world 🌍. Perhaps even faster than the news 📰. And that's what this project is all about: Accessing the most relevant articles from Wikipedia to answer your questions.

Additionally, this project is a good example of how to build a rock-solid, scalable, and performant enterprise architecture 🚀. It makes use of the following proven technologies: Apache Pulsar (Astra Streaming) for real-time streaming, Astra DB (built on Apache Cassandra) for vector storage and search, Langflow for building the RAG flows, Streamlit for the front-end, and OpenAI for embeddings and LLM calls.

🤩 Notable concepts used in this project are:

  • Back-end ⏪
    • Publishing Wikipedia updates in real time to a Pulsar topic: fire-and-forget with delivery guarantees.
    • Pulsar Functions: Enriching the data and JSON structure of the Wikipedia articles.
    • Using a Pulsar Sink (function) to store the data in Astra DB using the Data API.
  • Front-end ⏩
    • Langflow
      • A simple flow to read any web page, extract the content, chunk it up, and store it in Astra DB including the embeddings to enable Vector Search.
      • A simple RAG flow for a Conversational Interface with Wikipedia.
      • A REST endpoint to access the RAG pipeline for easy integration into any front-end.
    • Streamlit
      • Using just Vector Search to classify data into news topics in real time, with no lag.
      • Using Instructor + an LLM to enrich the data further, including Sentiment Analysis (see the sketch after this list).
      • Subscribing to the Pulsar Topic showing real-time Wikipedia updates flowing in.
      • Astra Vector DB: A Forrester Wave Leader in the Vector Database category.
      • Astra Vectorize: Automatically generating embeddings server-side.
      • Providing a Chat with Wikipedia using an LLM.
      • Providing a Q&A interface with Wikipedia using the Langflow REST API.
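
As referenced in the list above, the Instructor-based enrichment can be pictured roughly as follows. This is a minimal sketch only: the model name and output fields are assumptions, not the project's exact schema.

# Minimal sketch of structured LLM enrichment with Instructor (assumed model and fields).
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ArticleEnrichment(BaseModel):
    news_topic: str   # e.g. "politics", "sports", "science"
    sentiment: str    # e.g. "positive", "neutral", "negative"
    summary: str

client = instructor.from_openai(OpenAI())  # expects OPENAI_API_KEY in the environment

def enrich(article_text: str) -> ArticleEnrichment:
    # Instructor coerces the LLM response into the Pydantic model above.
    return client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        response_model=ArticleEnrichment,
        messages=[{"role": "user", "content": f"Classify and summarize this Wikipedia update:\n{article_text}"}],
    )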

Why is real-time streaming so important?

A lot of people struggle to make the leap from RAG prototyping to production-hardened RAG pipelines. Streaming closes that gap by making ingestion reliable and continuous.

Streaming provides a no-more-sleepless-nights, fire-and-forget way of updating your data, with delivery guarantees.

Additionally, it fully decouples applications from the data backbone: each keeps working even if the other is temporarily unavailable.
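
To make the fire-and-forget pattern concrete, here is a minimal sketch of an asynchronous Pulsar producer using the pulsar-client package; the service URL, token, and topic name are placeholders for your own Astra Streaming settings.

# Minimal fire-and-forget producer sketch (placeholder service URL, token, and topic name).
import pulsar

client = pulsar.Client(
    "pulsar+ssl://<your-astra-streaming-host>:6651",
    authentication=pulsar.AuthenticationToken("<your-pulsar-token>"),
)
producer = client.create_producer("persistent://<tenant>/<namespace>/wikipedia-updates")

def on_delivery(result, msg_id):
    # Called back once the broker has acknowledged (or failed) the message.
    print(f"Delivery result: {result}, message id: {msg_id}")

# Fire-and-forget: send_async returns immediately, delivery is confirmed via the callback.
producer.send_async(b'{"title": "Example article", "text": "..."}', callback=on_delivery)

producer.flush()  # make sure pending messages are delivered before shutting down
client.close()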

Screenshots

[Screenshots: application interface]

The architecture

This application is the front-end for the Wikipedia - What's up in the world? project. The full project consists of two parts:

  1. A Pulsar Streaming project that consists of the following components:
    • A Pulsar producer that produces the Wikipedia articles to a Pulsar topic.
    • A Pulsar function that enriches the Wikipedia articles with an OpenAI LLM (sketched below).
    • A Pulsar sink that stores the enriched Wikipedia articles in an Astra DB collection.
  2. A Front-end with Langflow and Streamlit (THIS PROJECT) that allows you to search the Wikipedia articles and chat with them.
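
The enrichment step in part 1 can be pictured as a Pulsar function along these lines. This is a sketch only: the actual implementation lives in the back-end project, and the model name, prompt, and field names are assumptions.

# Sketch of a Pulsar function enriching an article with an OpenAI LLM
# (illustrative only; the real function lives in the back-end project).
import json
from pulsar import Function
from openai import OpenAI

class EnrichArticle(Function):
    def __init__(self):
        self.llm = OpenAI()  # expects OPENAI_API_KEY to be available

    def process(self, input, context):
        article = json.loads(input)
        completion = self.llm.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[{"role": "user", "content": f"Summarize in one sentence: {article.get('text', '')}"}],
        )
        article["summary"] = completion.choices[0].message.content
        return json.dumps(article)  # the enriched article flows on to the sink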

[Architecture diagram]

How to run Langflow

Configuration

This assumes you have already created an Astra DB account, a database and a Vectorize-enabled collection. See the steps here.

Browse to Langflow and click on Settings. Create two new variables of type credential:

  • ASTRA_DB_APPLICATION_TOKEN
  • OPENAI_API_KEY

You can configure these variables to be used automatically in the relevant fields of Langflow's components: [screenshot: creating a variable in Langflow]

Create the flows

Click on the 🔽 icon in the top-middle, click Import and select the langflow-ingest-rag-flows.json file in this repository.

Build the ingest flow

In the top flow, take the following steps:

  1. Paste a URL (like https://en.wikipedia.org/wiki/Retrieval-augmented_generation) into the URL field.
  2. Make sure that your OPENAI_API_KEY variable is set in the OpenAI Embeddings component.
  3. Make sure that your ASTRA_DB_APPLICATION_TOKEN variable is set in the Astra DB component and that the correct database and collection are selected.
  4. Now press the ▶️ button on the Astra DB component.

Check the results

Browse to Astra DB, select your database and check that the collection has been updated with the new documents.

Additionally, you can run a Semantic Search query like "What is RAG?" and see that the results are semantically similar to your question!
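
You can also run the same kind of check programmatically with the Data API Python client, astrapy. A minimal sketch, assuming placeholder credentials and a document field name that may differ from your collection:

# Sketch: semantic search against the Vectorize-enabled collection (placeholder credentials).
from astrapy import DataAPIClient

client = DataAPIClient("<ASTRA_DB_APPLICATION_TOKEN>")
db = client.get_database("<ASTRA_DB_API_ENDPOINT>")
collection = db.get_collection("<your-collection-name>")

# With Vectorize enabled, sorting on $vectorize embeds the query server-side.
results = collection.find(
    sort={"$vectorize": "What is RAG?"},
    limit=3,
    include_similarity=True,
)
for doc in results:
    # "content" is an assumed field name; check the documents in your collection.
    print(doc.get("$similarity"), str(doc.get("content", ""))[:80])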

Build the RAG flow

In the bottom flow, take the following steps:

  1. Make sure that your OPENAI_API_KEY variable is set in the OpenAI Embeddings component.
  2. Make sure that your ASTRA_DB_APPLICATION_TOKEN variable is set in the Astra DB component and that the correct database and collection are selected.
  3. Make sure that your OPENAI_API_KEY variable is set in the OpenAI component.
  4. Press the ▶️ button on the Chat Output component.
  5. Now, click the Playground button on the bottom-left and see the chat interface in action!

Get the REST API

You're now ready to use the RAG pipeline in your own projects!

To get the REST API, click on the REST API button on the bottom-left of the RAG flow and note down the URL for later use.
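
Once you have the URL, calling the flow from Python is straightforward. A minimal sketch; the URL, API key, and response handling below are placeholders to adapt to your own flow:

# Sketch: calling the Langflow run endpoint from Python (placeholder URL and API key).
import requests

LANGFLOW_URL = "http://localhost:7860/api/v1/run/<your-flow-id>"  # the URL noted above

payload = {
    "input_value": "What is Retrieval-Augmented Generation?",
    "input_type": "chat",
    "output_type": "chat",
}
headers = {"x-api-key": "<your-langflow-api-key>"}  # only needed when authentication is enabled

response = requests.post(LANGFLOW_URL, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json())  # drill into the JSON to extract the chat message for your UI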

How to run the Streamlit front-end

Configuration

First you need to configure the .streamlit/secrets.toml file with the correct values; the application reads them at runtime via Streamlit's st.secrets. Once that is done, set up and run the application as follows.
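
For illustration, this is how such secrets are typically read inside the app (the key names below are assumptions; check secrets.toml for the actual ones used):

# Illustration of reading secrets in a Streamlit app (key names are assumptions; see secrets.toml).
import streamlit as st

astra_token = st.secrets["ASTRA_DB_APPLICATION_TOKEN"]
astra_endpoint = st.secrets["ASTRA_DB_API_ENDPOINT"]
openai_api_key = st.secrets["OPENAI_API_KEY"]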

Create a virtual environment

python -m venv .venv
source .venv/bin/activate

Or use your favorite IDE's built-in function to create a virtual environment.

Install the dependencies

pip install -r requirements.txt

Run the application

Be sure to have the back-end producing some articles before running the front-end.

streamlit run app.py
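
Under the hood, the real-time updates view subscribes to the Pulsar topic described earlier. A minimal consumer sketch, with placeholder connection settings:

# Sketch: subscribing to the Wikipedia updates topic (placeholder URL, token, and topic name).
import pulsar

client = pulsar.Client(
    "pulsar+ssl://<your-astra-streaming-host>:6651",
    authentication=pulsar.AuthenticationToken("<your-pulsar-token>"),
)
consumer = client.subscribe(
    "persistent://<tenant>/<namespace>/wikipedia-updates",
    subscription_name="streamlit-front-end",
)

while True:
    msg = consumer.receive()              # blocks until an update arrives
    print(msg.data().decode("utf-8"))     # the app renders this in the UI instead of printing
    consumer.acknowledge(msg)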

Integrate with Langflow

You can use the Langflow RAG flow created previously to power the Streamlit application. To do so, paste the REST API URL into the respective field on the Chat with the World tab.
