ironhack-labs · danisiaj · Nov 8, 2024
diff --git a/Presentation RAG.pptx b/Presentation RAG.pptx
diff --git a/RAG langchain model report.pdf b/RAG langchain model report.pdf
diff --git a/README.md b/README.md
@@ -1,115 +1,55 @@
-![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)
-
-# Retrieval Augmented Generation (RAG) Challenge
-
-## Introduction
-Retrieval Augmented Generation (RAG) is a novel approach that combines the strengths of retrieval-based and generation-based models to provide accurate and contextually relevant responses. By leveraging a vector database to retrieve relevant documents and a large language model (LLM) to generate responses, RAG can significantly enhance the capabilities of applications in various domains such as customer support, knowledge management, and content creation.
-
-## Project Overview
-
-This project is structured to provide hands-on experience in implementing a RAG system. Students will work through stages from dataset selection to connection to external artefacts (VectorDB, APIs), gaining a comprehensive understanding of RAG’s components and their integration.
-
-### 1. Dataset Selection
-
-Select a dataset suitable for your RAG application. Possible options include:
-- **Learning Material**: A collection of books, slide decks on a specific topic
-- **News articles**: A dataset containing articles on various topics.
-- **Product Reviews**: Reviews of products along with follow-up responses.
-
-**Bonus:** Consider using Multimodal datasets like text+images or text+audio
-
-Check the end of this file for dataset examples
-
-### 2. Exploratory Data Analysis (EDA)
-Perform an EDA on the chosen dataset to understand its structure, content, and the challenges it presents. Document your findings and initial thoughts on how the data can be leveraged in a RAG system.
-
-### 3. Embedding and Storing Chunks
-
-#### 3.A Embed Your Chunks of Documents
-- **Objective**: Transform your chunks of documents into embeddings that can be stored in a VectorDB.
-- **Suggested Tool**: [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (for English content).
-
-**Bonus** Consider using the Embedding model from OpenAI, just be attentive to costs.
-
-#### 3.B Connection to Vector DB
-- **Objective**: Connect to a vector database to store and retrieve document embeddings.
-- **Suggested Tool**: [ChromaDB](https://www.trychroma.com/).
-- **Steps**:
-  1. Pre-process the dataset to generate embeddings for each document using a suitable model (e.g., Sentence Transformers).
-  2. Store these embeddings in ChromaDB.
-  3. Implement retrieval logic to fetch relevant documents based on a query.
-
-**Bonus:** Consider using a Cloud service to store your embeddings like Azure AI Search or Weaviate. Be attentive to potential costs.
-
-#### 3.C AI Frameworks
-- **Consider Using**: Frameworks like [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html) for easier integration.
-
-### 4. Connecting to LLM
-- **Objective**: Connect to a Large Language Model to generate responses based on retrieved documents.
-- **Suggested Tool**: [OpenAI API](https://platform.openai.com/docs/api-reference/introduction).
-- **Steps**:
-  1. Set up access to the OpenAI API or an alternative LLM API.
-  2. Develop the logic to combine retrieved documents with the query to generate a response.
-  3. Implement and test the end-to-end RAG pipeline.
-
-- **Bonus**: Connect to an API through a cloud service like AzureOpenAI, AWS Bedrock, or Google Vertex AI. Please note that the setup for this will be much more complex and not all might have a free tier model.
-
-### 5. Evaluation
-- **Objective**: Evaluate the performance of your RAG system in two ways.
-  1. **Yourself**: Test the system multiple times to understand its performance and usability.
-  2. **LLM as a judge (Bonus)**: Use an LLM as a judge to generate questions and evaluate your RAG's answers.
-- **Steps**:
-  1. Create a test set of queries and expected responses.
-  2. Measure the performance of your RAG system against these queries.
-  3. Analyze and document the strengths and weaknesses of your system.
-
-### 6. Deployment (Bonus)
-- **Objective**: Deploy the RAG system as a web application or API.
-- **Tools**: Use frameworks like Flask or FastAPI for the backend and Streamlit for the frontend.
-- **Steps**:
-  1. Develop a simple web interface to interact with your RAG system.
-  2. Deploy the application on a cloud platform such as AWS, GCP, or Heroku.
-
-## Resources
-- [ChromaDB Documentation](https://www.trychroma.com/docs)
-- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference/introduction)
-- [Sentence Transformers](https://www.sbert.net/)
-- [Flask](https://flask.palletsprojects.com/)
-- [Streamlit](https://streamlit.io/)
-
-## Deliverables
-1. **Python Code**: Provide well-documented Python code implementing the RAG system.
-2. **Report**: Submit a detailed report documenting your EDA findings, connection setups, evaluation metrics, and conclusions about the system's performance.
-3. **Presentation**: Prepare a short presentation covering the project, from dataset analysis to the final evaluation. Include visual aids such as charts and example responses.
-
-## Bonus
-- **Interactive Demo**: Provide an interactive demo of your RAG system during the presentation.
-
-This project will equip you with practical skills in implementing and evaluating a Retrieval Augmented Generation system, preparing you for advanced applications in the field of natural language processing.
-
----
-
-# Retrieval-Augmented Generation (RAG) Demo Project Datasets
-
-For this demo project, students will explore the capabilities of Retrieval-Augmented Generation (RAG) systems. Below is a curated list of datasets suitable for various RAG applications, including question-answering, semantic search, and response generation.
-
-## Datasets
-
-### 1. [Common Crawl (News and Web Data)](https://github.com/commoncrawl/)
-   - **Description**: This dataset comprises web-scraped data from a wide array of sources. It's excellent for general knowledge retrieval tasks and question-answering.
-
-### 2. [Paperswithcode Text Datasets](https://paperswithcode.com/datasets?mod=texts&page=1)
-   - **Description**: Portal with many datasets that can be applied to RAG.
-
-### 3. [Biology scientific papers](https://www.researchgate.net/topic/Biological-Science/publications)
-- **Description**: Download a few Biology papers to build a RAG system on Biology topics
-
-### 4. [Puerto Rico news articles](https://github.com/ironhack-labs/project-5-2-genai-rag/data)
-- **Description**: 15 years of crawled Puerto Rico news articles about the region.
-
-### 5. [Financial Laws Collection](https://github.com/ironhack-labs/project-5-2-genai-rag/data)
-- **Description**: Collection of 11 documents on Financial legistaltion in Europe.
-
----
-
-Each of these datasets provides a unique opportunity to experiment with RAG systems and explore how retrieval impacts the quality and relevance of generated responses.
+## **RAG LangChain Model with OpenAI API**
+
+### _Project Overview_
+This repository contains the code and documentation for a Retrieval Augmented Generation (RAG) model, developed by Dani Siaj and Carlos Rodríguez. This model enables users to upload a PDF document, ask questions, and receive coherent, complete, and relevant responses generated by an integrated large language model (LLM).
+
+The RAG model dynamically generates a prompt from the user's query, incorporating instructions, context, and restrictions to create specific, contextually aware responses.
+
+### _Content_
+The uploaded PDF is a 9-page document containing information on food allergies, symptoms, and management, sourced from the American College of Allergy, Asthma, and Immunology (ACAAI). This document includes only textual content—no tables or images are present.
+
+### _Model Architecture_
+#### Model Selection
+The architecture is centered on the OpenAIEmbeddings API, which converts text into embeddings. Key libraries used include:
+
+* LangChain: For text extraction and model chaining.
+* Chroma DB: To create and manage the vector store.
+
+#### Components
+* Document Loader: PyPDFLoader handles document uploads and text parsing.
+* Embeddings: OpenAIEmbeddings transforms text into vector representations.
+* Text Extraction: RecursiveCharacterTextSplitter and ChromaDB handle text chunking and vectorization.
+
+### _Chain Architecture_
+#### Retrieval of Information
+For each user query, the k most similar documents (k=3 in the code) are retrieved from the ChromaDB vector store using similarity_search().
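As a rough illustration of what top-k similarity retrieval does, here is a dependency-free sketch. It is not the project's `similarity_search()` call on Chroma: the `embed`, `cosine`, and `top_k` helpers and the sample chunks are hypothetical, and a toy bag-of-words vector stands in for OpenAI embeddings.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample chunks, only for illustration.
chunks = [
    "peanut allergy symptoms include hives and swelling",
    "use epinephrine immediately in a severe reaction",
    "children may outgrow milk and egg allergy",
    "read food labels to avoid allergens",
]
results = top_k("peanut allergy symptoms", chunks, k=3)
```

With a real vector store, the ranking would come from embedding distances rather than word overlap, but the shape of the step is the same: score every chunk against the query and keep the best k.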
+#### Prompt Engineering
+* A context is built using the selected documents.
+* This context is passed to the dynamic prompt-generating function.
+* A specific, context-aware prompt is created based on the user’s query.
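The three steps above might look roughly like the following. This is a minimal sketch, not the function in main.py: the name `build_prompt`, the wording of the instructions and restrictions, and the sample documents are all assumptions made for illustration.

```python
def build_prompt(query, documents):
    """Combine instructions, retrieved context, and restrictions into one prompt."""
    # Join the retrieved chunks into a single context block.
    context = "\n\n".join(doc.strip() for doc in documents)
    return (
        "Instructions: Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        "Restrictions: Use only the context. If the answer is not there, say so.\n\n"
        f"Question: {query}"
    )


# Hypothetical retrieved chunks, only for illustration.
prompt = build_prompt(
    "What is anaphylaxis?",
    [
        "Anaphylaxis is a severe, life-threatening allergic reaction.",
        "Epinephrine auto-injectors are used for emergencies.",
    ],
)
```

Placing the restrictions after the context keeps them close to the question, which is one common layout; the actual ordering used in the project is not shown in the README.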
+#### LLM Implementation
+The prompt is sent to the LLM via the OpenAI API to generate the desired response.
+### _Model Evaluation_
+A second LLM is used as a "judge" to evaluate the generated responses based on the following criteria:
+
+ * Relevance (0-5)
+ * Accuracy (0-5)
+ * Completeness (0-5)
+ * Clarity (0-5)
+
+Through prompt engineering, a dedicated evaluation prompt is used to assess the quality of each response.
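One way the judge step could be wired up is sketched below. The helper names (`build_judge_prompt`, `parse_scores`), the prompt wording, and the `Criterion: N` reply format are assumptions, not the project's actual evaluation prompt; only the four criteria and the 0-5 scale come from the README.

```python
import re

CRITERIA = ("Relevance", "Accuracy", "Completeness", "Clarity")


def build_judge_prompt(question, answer):
    """Ask the judge LLM to score an answer on each criterion from 0 to 5."""
    lines = "\n".join(f"{name}: <0-5>" for name in CRITERIA)
    return (
        "Rate the answer to the question on each criterion from 0 to 5.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reply in exactly this format:\n{lines}"
    )


def parse_scores(judge_reply):
    """Extract 'Criterion: N' scores from the judge's reply; skip missing ones."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*:\s*([0-5])\b", judge_reply, re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


# Hypothetical judge reply, only for illustration.
reply = "Relevance: 5\nAccuracy: 4\nCompleteness: 3\nClarity: 5"
scores = parse_scores(reply)
```

Asking for a fixed reply format makes the scores easy to parse and aggregate across a question set such as data/References for Evaluation.csv.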
+
+### _Streamlit App_
+The model is deployed on Streamlit, providing a user-friendly interface. Users can input questions and receive responses formatted in Markdown, followed by the LLM-based evaluation. This design enhances user experience by providing both a direct answer and an automated quality assessment.
+
+### _Conclusions_
+* Conclusion 1: The RAG model demonstrated high efficiency in terms of response time and relevance to user queries.
+* Conclusion 2: The limited size of the document restricts extensive testing. Future evaluations will include larger files for a more comprehensive assessment.
+
+### _Repository_
+* Data folder: stores the PDF documents and the Chroma DB
+* main.py: contains all the project code
+* Pptx presentation
+* README.md
+* Requirements.txt: lists all the necessary libraries for this project
+* Streamlit_RAG.py: deploys the code on Streamlit so it can be tested as an application
diff --git a/data/References for Evaluation.csv b/data/References for Evaluation.csv
@@ -0,0 +1,21 @@
+Question,Answer
+"What are the most common food allergens?","The most common food allergens include milk, eggs, peanuts, tree nuts, fish, shellfish, wheat, soy, and sesame."
+"Can you outgrow food allergies?","Yes, children may outgrow allergies to milk, egg, soy, and wheat, but peanut, tree nut, fish, and shellfish allergies often persist."
+"How is a food allergy diagnosed?","Diagnosis involves a medical history review, symptom documentation, skin or blood tests to check for food-specific IgE antibodies, and sometimes an oral food challenge."
+"What is anaphylaxis?","Anaphylaxis is a severe, life-threatening allergic reaction that can impair breathing, cause a drop in blood pressure, and may be fatal without prompt treatment."
+"How can I prevent food allergies?","Prevention strategies include delaying the introduction of solid foods to young infants and introducing peanut-containing foods to high-risk infants around 4-6 months."
+"What treatments are available for food allergies?","Currently, the main treatment is avoidance of allergenic foods. There are new therapies such as Palforzia for peanut allergies and a skin patch under FDA review."
+"Can food allergens remain on objects?","Yes, food allergens can remain on surfaces and may cause a skin reaction if touched, but severe reactions occur primarily from ingestion."
+"Can you develop food allergies as an adult?","Yes, food allergies can develop in adulthood, most commonly to shellfish, tree nuts, peanuts, and fish."
+"What symptoms indicate a food allergy?","Symptoms can range from hives, swelling, gastrointestinal distress, to more severe reactions like anaphylaxis."
+"How long do food allergy symptoms take to appear?","Symptoms often appear within minutes to two hours of ingestion but can be delayed in some cases, especially in children."
+"What is oral allergy syndrome?","Oral allergy syndrome is a reaction caused by cross-reactive allergens found in pollen and certain foods, leading to itchiness in the mouth or throat."
+"Is gluten allergy common?","There is no actual gluten allergy; however, wheat allergy and celiac disease are related conditions. Celiac disease is serious and requires strict gluten avoidance."
+"How can I manage food allergies?","Management involves strict avoidance of allergens, reading food labels, and using an epinephrine auto-injector for emergencies."
+"How do I use an epinephrine auto-injector?","Administer the auto-injector at the first sign of a severe allergic reaction. Ensure you're familiar with the device and have easy access to it."
+"How expensive is food allergy testing?","Costs for testing vary widely based on the procedure and insurance coverage. It’s typically conducted for individuals with a history of reactions."
+"Are there any dietary restrictions for allergens?","Yes, individuals must avoid foods known to cause allergic reactions and may need to relay this information in dining situations."
+"What are cross-reactive allergens?","Cross-reactive allergens are similar proteins that can cause a reaction in those allergic to a related food, such as tree nuts and peanuts."
+"Can food allergies cause gastrointestinal issues?","Yes, food allergies can lead to gastrointestinal reactions like vomiting, diarrhea, and abdominal pain as part of the allergic response."
+"What are precautionary labeling statements?","Precautionary labeling statements indicate potential allergen contamination but lack standard definitions, so their meanings can vary."
+"What should I do in case of a severe allergic reaction?","Use epinephrine immediately, call emergency services, and seek medical treatment even if symptoms seem to improve."
diff --git a/data/allergies-doc.pdf b/data/allergies-doc.pdf
diff --git a/data/allergies_ok.pdf b/data/allergies_ok.pdf
diff --git a/data/chroma_db/chroma.sqlite3 b/data/chroma_db/chroma.sqlite3