The idea behind RAG is to combine retrieval and generation mechanisms so that generated responses are grounded in relevant retrieved information. The pipeline breaks down as follows:
- Document Embedding: First, documents or pieces of information are embedded into high-dimensional vectors using a pre-trained encoder model. These embeddings represent the semantic content of the documents.
- Vector Store Creation: The document embeddings are stored in a vector database or vector store, indexed for fast similarity search (see the indexing sketch after this list).
- Query Embedding: When a user submits a query, it is also embedded into a high-dimensional vector using the same encoder model, so that queries and documents share the same vector space.
- Retrieval: The query embedding is then used to search the vector store for the most relevant document embeddings. This step returns the documents, chunks, or passages that are most semantically similar to the query (see the retrieval sketch below).
- Context Construction: The retrieved documents are combined into a context that provides additional information to the generative model.
- Generation with LLM: The context, along with the query, is fed into an LLM. If the model supports one, a system prompt can set the initial context or provide specific instructions (see the generation sketch below).
- Response Generation: The LLM reads the provided context and query and generates a coherent, contextually relevant response grounded in the retrieved information.
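To make the indexing steps concrete, here is a minimal sketch of document embedding and vector store creation. It assumes the `sentence-transformers` library with the `all-MiniLM-L6-v2` model as the encoder (both are illustrative choices, not requirements), and uses a plain in-memory matrix as the vector store:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pre-trained encoder model (illustrative choice; any text encoder works).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Mistral models support system prompts.",
    "Gemma models do not support system prompts.",
    "RAG combines retrieval with generation.",
]

# Document Embedding: each document becomes a high-dimensional vector.
# normalize_embeddings=True lets a dot product act as cosine similarity.
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

# Vector Store Creation: here just an in-memory matrix; a real system
# would use a vector database (e.g. FAISS, Chroma, pgvector).
vector_store = np.asarray(doc_vectors)
```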
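Continuing the sketch, query embedding and retrieval reduce to embedding the query with the same encoder and ranking the stored vectors by cosine similarity. The `retrieve` helper and its `top_k` parameter are names introduced here for illustration:

```python
def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query."""
    # Query Embedding: same encoder, so query and documents share a space.
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    # Retrieval: on unit vectors, cosine similarity is a dot product.
    scores = vector_store @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

retrieved = retrieve("Which models support system prompts?")
```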
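The last two steps, context construction and generation, amount to stitching the retrieved chunks and the query into a single prompt. The `llm_generate` call below is a hypothetical placeholder for whichever LLM client is in use; only the prompt layout matters:

```python
def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    # Context Construction: concatenate retrieved chunks into one block.
    context = "\n\n".join(retrieved_docs)
    # Generation with LLM: context and query go into a single prompt.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("Which models support system prompts?", retrieved)
# response = llm_generate(prompt)  # hypothetical call to your LLM client
```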
Gemma models do not support system prompts. This limitation can reduce their flexibility in scenarios where initial context-setting or explicit guidance is crucial for generating appropriate responses.
Mistral models support system prompts. This feature allows users to define an initial prompt that sets the context or provides specific instructions for the model.
Similar to Mistral, Camel models also support system prompts.
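In practice, this difference shows up in the chat-message format: a model that supports system prompts accepts a dedicated "system" turn, while with Gemma the same instructions typically have to be folded into the first user message. A minimal sketch, assuming an OpenAI-style messages list:

```python
instructions = "Answer using only the provided context."
user_turn = "Context:\n<retrieved chunks>\n\nQuestion: Which models support system prompts?"

# Mistral / Camel style: instructions live in a dedicated "system" turn.
messages_with_system = [
    {"role": "system", "content": instructions},
    {"role": "user", "content": user_turn},
]

# Gemma style (no system role): prepend the instructions to the user turn.
messages_without_system = [
    {"role": "user", "content": instructions + "\n\n" + user_turn},
]
```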