From: aidotengineer
Introduction to RAG Systems
Retrieval Augmented Generation (RAG) is an approach that improves the responses of Large Language Models (LLMs) by providing them with relevant, external context [04:08:00]. This guide presents a RAG stack based on learnings from 37 failed attempts, and aims to deliver as much return on investment (ROI) per minute of reading as possible [00:22:00].
The development process typically involves two phases:
- Prototyping: Usually conducted in Google Colab, leveraging its free hardware accelerators for ease of experimentation [00:53:00].
- Production deployment: Often uses Docker for on-premise or cloud deployment, which is crucial for organizations such as financial institutions that require data and processing to remain on-premise [01:04:00].
The RAG Pipeline
A basic RAG solution involves a user query, which is then embedded and compared to documents within a vector database [05:02:00]. Relevant documents are retrieved and passed along with the original query to an LLM, enabling it to generate a context-aware response [05:11:00].
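The basic flow can be sketched in a few lines of Python. This is a minimal illustration only, assuming the sentence-transformers package for embeddings and an in-memory document list; llm_generate() is a stand-in for whichever LLM client you actually use.

```python
# Minimal sketch of the basic RAG flow: embed the query, retrieve the most
# similar documents, then prompt the LLM with query + context.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Qdrant is a vector database for similarity search.",
    "RAG combines retrieval with LLM generation.",
]

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Embed the query and rank documents by cosine similarity.
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.argsort(descending=True)[:top_k]
    return [documents[int(i)] for i in top]

def llm_generate(prompt: str) -> str:
    # Placeholder: swap in any LLM client (OpenAI, Ollama, etc.).
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)

print(answer("What does Qdrant do?"))
```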
For a more sophisticated RAG solution, additional steps can be incorporated:
- Query Processing: An optional step to remove sensitive information (e.g., Personally Identifiable Information) from the query before it is sent to the RAG system [08:49:00] (a sketch follows this list).
- Post-Retrieval Step: This step improves the accuracy of the documents retrieved from the vector database, often by re-ranking [09:06:00].
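A rough sketch of the query-processing step, using simple regular expressions to mask obvious PII before the query enters the pipeline; the patterns are illustrative only, not a complete PII filter (the post-retrieval re-ranking step is shown in the re-ranking and encoder sections below).

```python
import re

# Illustrative patterns only - a real deployment would use a dedicated
# PII-detection library or service rather than two regexes.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    """Mask obvious e-mail addresses and phone numbers before retrieval."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", query))

print(scrub_pii("Email jane.doe@example.com or call +1 555 123 4567 about my loan."))
# -> "Email [EMAIL] or call [PHONE] about my loan."
```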
Core RAG Stack Components
The RAG stack is broken down into several key components:
1. Orchestration
The orchestration layer manages the flow and interaction between different RAG components.
- Prototyping: LlamaIndex or LangChain [01:25:00].
- Production: LlamaIndex [01:29:00].
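For instance, a prototype orchestration layer with LlamaIndex takes only a few lines. A minimal sketch, assuming the llama-index package, a ./knowledge_base folder of documents, and the default OpenAI-backed embedding and LLM settings (so OPENAI_API_KEY must be set):

```python
# Load documents, build an in-memory vector index, and query it.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does our refund policy say?")
print(response)
```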
2. Embedding Models
Embedding models convert text into numerical vectors for semantic search.
- Prototyping:
  - Closed models (APIs) for simplicity, e.g., OpenAI’s text-embedding-ada-002 or text-embedding-3-large [01:34:00].
  - Open models, e.g., from NVIDIA or BAAI (specifically the BGE Small model) [01:41:00].
- Production: Open models such as those from BAAI or NVIDIA, which can be downloaded and run locally [01:46:00].
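A short sketch contrasting the two options: the closed route calls OpenAI’s embeddings API, while the open route downloads BAAI’s BGE Small model via the sentence-transformers package and runs it locally. The API key handling, model names, and package choice are assumptions for illustration.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

text = "Retrieval Augmented Generation adds external context to LLM prompts."

# Closed model (API) - convenient for prototyping.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
closed_vec = client.embeddings.create(
    model="text-embedding-3-large", input=text
).data[0].embedding

# Open model - downloadable, can run fully on-premise in production.
open_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
open_vec = open_model.encode(text)

print(len(closed_vec), open_vec.shape)  # e.g. 3072 and (384,)
```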
3. Vector Database
The vector database stores document embeddings for efficient retrieval.
- Choice: Qdrant is recommended due to its excellent scalability, handling from a few to hundreds of thousands of documents [01:52:00].
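A minimal sketch of the Qdrant workflow, assuming the qdrant-client package and a local Qdrant instance (for example the official Docker image on port 6333); the 384-dimension vector size matches the BGE Small embeddings used above.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create (or reset) a collection sized for 384-dimensional embeddings.
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert one point; the dummy vector stands in for a real embedding.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example document"})],
)

# Retrieve the closest documents for a query embedding.
hits = client.search(collection_name="docs", query_vector=[0.0] * 384, limit=3)
print(hits)
```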
4. Large Language Models (LLMs)
LLMs generate the final response based on the query and retrieved context.
- Prototyping:
  - Closed models (APIs) for simplicity, e.g., OpenAI’s GPT-3.5 Turbo or GPT-4 [02:07:00].
  - Open models like Meta’s Llama models or Alibaba Cloud’s Qwen models [02:12:00].
- Production:
  - Open models (e.g., Llama 3.2 or Qwen 3 4B) served via Ollama or Hugging Face Text Generation Inference within a Docker environment [02:21:00].
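As an example of the production route, the sketch below calls a Llama 3.2 model served by Ollama over its local REST API; it assumes Ollama is reachable on the default port 11434 and that the model has already been pulled (e.g., with `ollama pull llama3.2`). Only the standard library is used.

```python
import json
import urllib.request

payload = {
    "model": "llama3.2",
    "prompt": "Summarise what a vector database does in one sentence.",
    "stream": False,  # return the full response in one JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```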
5. Monitoring and Tracing
Monitoring and tracing are critical for troubleshooting and understanding performance bottlenecks (e.g., where most time is spent) [02:41:00].
- Prototyping: LangSmith or Arize Phoenix [02:56:00].
- Production: Arize Phoenix, often used in a Docker container [03:00:00].
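A minimal tracing sketch for prototyping, assuming the arize-phoenix, arize-phoenix-otel, and openinference-instrumentation-llama-index packages; it launches the local Phoenix UI and instruments LlamaIndex so retrieval and LLM spans (and their latencies) show up there.

```python
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

session = px.launch_app()            # starts the local Phoenix UI
tracer_provider = register()         # OpenTelemetry provider pointed at Phoenix
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

print("Phoenix UI:", session.url)    # run RAG queries, then inspect the traces
```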
6. Re-ranking
Re-ranking improves the accuracy of the RAG solution by re-ordering retrieved documents based on their relevance to the query. This is typically done post-retrieval [11:33:00].
- Prototyping: Closed models (e.g., Cohere) [03:11:00].
- Production: Open solutions (e.g., from NVIDIA) [03:19:00].
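A prototyping sketch using Cohere’s rerank endpoint, assuming the cohere package and a COHERE_API_KEY environment variable; the model name and documents are examples.

```python
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

docs = [
    "Qdrant stores and searches embeddings.",
    "Phoenix traces RAG pipelines.",
    "BGE Small is an open embedding model.",
]

# Re-order the retrieved documents by relevance to the query.
results = co.rerank(
    model="rerank-english-v3.0",
    query="Which component stores embeddings?",
    documents=docs,
    top_n=2,
)
for r in results.results:
    print(docs[r.index], r.relevance_score)
```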
7. Evaluation
Evaluating RAG systems is essential to assess the quality of responses and retrieved documents.
- Framework: RAGAS is a recommended framework for RAG evaluation, working with LLMs to make the task “painless” [03:27:00]. It allows checking quality across different metrics [03:39:00].
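A minimal RAGAS sketch, assuming the ragas and datasets packages, an LLM backend for the metrics (e.g., an OpenAI key in the environment), and the classic question/answer/contexts/ground_truth column layout, which may differ across RAGAS versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One hand-written sample; in practice this would be a full evaluation set.
eval_data = Dataset.from_dict({
    "question": ["What does Qdrant do?"],
    "answer": ["Qdrant stores embeddings and retrieves similar documents."],
    "contexts": [["Qdrant is a vector database for similarity search."]],
    "ground_truth": ["Qdrant is a vector database used for similarity search."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores for the sample
```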
Encoder Types in RAG
- Cross-Encoder:
- Semantically compares a query with a document by sending both to a BERT model and then a classifier, yielding a similarity score between 0 and 1 [09:28:00].
- Offers additional accuracy but is slow and does not scale to large document collections, because the query and each document must be processed together [10:04:00].
- Best suited for post-retrieval re-ranking, as it works with a reduced set of documents [11:33:00].
- Bi-Encoder:
- Uses two separate encoders (BERT models), one for the query and one for the document, each passing through pooling and embedding layers [10:25:00].
- Compares the resulting embeddings using metrics like cosine similarity [10:47:00].
- A fast and scalable solution, excellent for initial information retrieval where the query is compared against multiple documents [10:54:00].
- Typically used where the vector data is stored, as part of the initial retrieval process [11:19:00].
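The contrast can be seen directly with the sentence-transformers package (model names below are examples): the bi-encoder embeds the query and document separately and compares the vectors, while the cross-encoder scores the (query, document) pair jointly.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I reset my password?"
doc = "To reset your password, open Settings and choose 'Forgot password'."

# Bi-encoder: fast and scalable, used for initial retrieval.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
q_vec, d_vec = bi_encoder.encode([query, doc], convert_to_tensor=True)
print("cosine similarity:", util.cos_sim(q_vec, d_vec).item())

# Cross-encoder: slower but more accurate, used for post-retrieval re-ranking.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder score:", cross_encoder.predict([(query, doc)])[0])
```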
Production Environment Example (Docker Compose)
A production environment often uses a docker-compose.yaml file to manage multiple Docker images as containers [01:14:00]:
- Data Ingestion Image: Connects to the knowledge base to pull in HTML files [17:57:00].
- Qdrant Image: For the vector database, pulled from Docker Hub [18:03:00].
- Front-end App Image: For the user interface [18:09:00].
- LLM Serving Image: Ollama or the Hugging Face Text Generation Inference engine to serve models [18:12:00].
- Arize Phoenix Image: For tracing [18:19:00].
- RAGAS Image: For model evaluation [18:22:00].
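A sketch of what such a docker-compose.yaml might look like; the image tags, ports, build paths, and service names are illustrative rather than a tested configuration.

```yaml
services:
  qdrant:                       # vector database
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
  ollama:                       # LLM serving
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
  phoenix:                      # tracing UI
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"
  data-ingestion:               # pulls HTML files from the knowledge base
    build: ./ingestion
    depends_on:
      - qdrant
  frontend:                     # user-facing app
    build: ./app
    ports:
      - "8501:8501"
    depends_on:
      - qdrant
      - ollama
  evaluation:                   # RAGAS evaluation jobs
    build: ./ragas
    depends_on:
      - qdrant
      - ollama
```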
This setup allows for robust and scalable deployment of RAG solutions.