From: aidotengineer

Introduction to RAG Systems

Retrieval Augmented Generation (RAG) is an approach that improves the responses of Large Language Models (LLMs) by providing them with relevant external context [04:08:00]. This guide presents a RAG stack distilled from the lessons of 37 failed attempts, with the aim of delivering the highest return on investment (ROI) per minute of reading [00:22:00].

The development process typically involves two phases:

  • Prototyping: Usually done in Google Colab, which provides free hardware accelerators and makes experimentation easy [00:53:00].
  • Production Deployment: Often uses Docker for on-premise or cloud deployment, which is crucial for organizations such as financial institutions that require data and processing to remain on-premise [01:04:00].

The RAG Pipeline

A basic RAG solution involves a user query, which is then embedded and compared to documents within a vector database [05:02:00]. Relevant documents are retrieved and passed along with the original query to an LLM, enabling it to generate a context-aware response [05:11:00].
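
As a rough illustration, the basic flow can be sketched in a few lines of Python using components from the stack below. The collection name, payload field, and exact model identifiers are assumptions for this sketch rather than details taken from the talk.

```python
# Minimal sketch of the basic RAG flow, assuming a Qdrant collection named
# "docs" already holds document embeddings produced with the same model.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # open embedding model
qdrant = QdrantClient(url="http://localhost:6333")         # vector database
llm = OpenAI()                                             # reads OPENAI_API_KEY

def answer(query: str) -> str:
    # Embed the query and retrieve the most similar documents.
    query_vector = embedder.encode(query).tolist()
    hits = qdrant.search(collection_name="docs", query_vector=query_vector, limit=3)
    context = "\n\n".join(hit.payload["text"] for hit in hits)

    # Pass the retrieved context together with the original query to the LLM.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    response = llm.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```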

For a more sophisticated RAG solution, additional steps can be incorporated:

  • Query Processing: A step to potentially remove sensitive information (e.g., Personally Identifiable Information) before the query is sent to the RAG system [08:49:00].
  • Post-Retrieval Step: This step improves the accuracy of the documents retrieved from the vector database, often by re-ranking [09:06:00].
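
To make these two extra steps concrete, here is a hedged sketch of where they sit around the basic pipeline. The regex and the lexical-overlap score are simple stand-ins; a real deployment would use a dedicated PII-detection library and a re-ranking model (see the Re-ranking section below).

```python
import re

def scrub_pii(query: str) -> str:
    # Query processing: mask obvious PII (here only email addresses) before
    # the query enters the RAG system.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", query)

def post_retrieval_rerank(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Post-retrieval step: re-order retrieved documents by relevance and keep
    # only the best ones. A cross-encoder would replace this toy score.
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:top_k]
```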

Core RAG Stack Components

The RAG stack is broken down into several key components:

1. Orchestration

The orchestration layer manages the flow and interaction between different RAG components.

2. Embedding Models

Embedding models convert text into numerical vectors for semantic search.

  • Prototyping:
    • Closed models (APIs) for simplicity, e.g., OpenAI’s text-embedding-ada-002 or text-embedding-3-large [01:34:00].
    • Open models, e.g., from NVIDIA or BAAI (specifically the BGE Small model) [01:41:00].
  • Production: Open models, such as those from BAAI or NVIDIA, which can be downloaded and run locally [01:46:00].
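
The two options look roughly like this in code; the model identifiers are the usual published names and may differ from the exact checkpoints used in the talk.

```python
# Closed embedding model via API (prototyping):
from openai import OpenAI
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-3-large", input="What is RAG?")
api_vector = resp.data[0].embedding

# Open embedding model run locally (prototyping or production):
from sentence_transformers import SentenceTransformer
bge = SentenceTransformer("BAAI/bge-small-en-v1.5")   # BGE Small from BAAI
local_vector = bge.encode("What is RAG?")
```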

3. Vector Database

The vector database stores document embeddings for efficient retrieval.

  • Choice: Qdrant is recommended for its excellent scalability, handling anything from a handful of documents to hundreds of thousands [01:52:00].
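
A minimal Qdrant interaction might look like the following; the collection name, vector size (384 matches BGE Small), and payload are illustrative.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")   # or ":memory:" while prototyping

# Create a collection sized for the embedding model in use.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store a document embedding with its text as payload.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example document"})],
)

# Retrieve the closest documents for a query embedding.
hits = client.search(collection_name="docs", query_vector=[0.0] * 384, limit=3)
```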

4. Large Language Models (LLMs)

LLMs generate the final response based on the query and retrieved context.

  • Prototyping:
    • Closed models (APIs) for simplicity, e.g., OpenAI’s GPT-3.5 Turbo or GPT-4 [02:07:00].
    • Open models like Meta’s Llama models or Alibaba Cloud’s Qwen models [02:12:00].
  • Production:
    • Open models (e.g., Llama 3.2 or Qwen 3 4B) served via Ollama or Hugging Face Text Generation Inference within a Docker environment [02:21:00].
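
As a sketch, querying an open model served by Ollama in Docker can be as simple as one HTTP call; this assumes Ollama is listening on its default port and the model has already been pulled.

```python
import requests

# Ollama exposes a local REST API on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Summarise what RAG is.", "stream": False},
)
print(resp.json()["response"])
```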

5. Monitoring and Tracing

Monitoring and tracing are critical for troubleshooting and understanding performance bottlenecks (e.g., where most time is spent) [02:41:00].

  • Prototyping: LangSmith or Arize Phoenix [02:56:00].
  • Production: Arize Phoenix, often used in a Docker container [03:00:00].
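
For prototyping, Phoenix can be launched directly from Python; in production the same UI typically runs as its own Docker container instead. This is a minimal sketch and assumes the phoenix package is installed.

```python
import phoenix as px

# Start the Phoenix UI locally (by default on port 6006) to collect traces.
session = px.launch_app()
print(session.url)   # open this URL to inspect traces and latency breakdowns
```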

6. Re-ranking

Re-ranking improves the accuracy of the RAG solution by re-ordering retrieved documents based on their relevance to the query. This is typically done post-retrieval [11:33:00].

  • Prototyping: Closed models (e.g., Cohere) [03:11:00].
  • Production: Open solutions (e.g., from NVIDIA) [03:19:00].
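
A prototyping sketch with Cohere's rerank endpoint is shown below; the model name follows Cohere's public documentation and the documents are synthetic.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")
results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the refund policy?",
    documents=["Doc about shipping.", "Doc about refunds.", "Doc about returns."],
    top_n=2,
)
# Each result carries the original document index and a relevance score.
for r in results.results:
    print(r.index, r.relevance_score)
```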

7. Evaluation

Evaluating RAG systems is essential to assess the quality of responses and retrieved documents.

  • Framework: RAGAS is the recommended framework for RAG evaluation; it works alongside LLMs to make evaluation “painless” [03:27:00] and lets you check quality across a range of metrics [03:39:00].
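
A RAGAS run looks roughly like this; the metric names follow the commonly documented RAGAS API (which may vary between versions), the sample data is synthetic, and an LLM API key is needed because RAGAS uses an LLM as judge.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One evaluation record: question, generated answer, retrieved contexts, reference.
data = Dataset.from_dict({
    "question":     ["What does RAG stand for?"],
    "answer":       ["Retrieval Augmented Generation."],
    "contexts":     [["RAG stands for Retrieval Augmented Generation."]],
    "ground_truth": ["Retrieval Augmented Generation"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
```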

Encoder Types in RAG

  • Cross-Encoder:
    • Semantically compares a query with a document by sending both to a BERT model and then a classifier, yielding a similarity score between 0 and 1 [09:28:00].
    • Offers additional accuracy but is slow and does not scale to large document sets, because the query and each document must be processed together [10:04:00].
    • Best suited for post-retrieval re-ranking, as it works with a reduced set of documents [11:33:00].
  • Bi-Encoder:
    • Uses two separate encoders (BERT models), one for the query and one for the document, each passing through pooling and embedding layers [10:25:00].
    • Compares the resulting embeddings using metrics like cosine similarity [10:47:00].
    • A fast and scalable solution, excellent for initial information retrieval where the query is compared against multiple documents [10:54:00].
    • Typically used where the vector data is stored, as part of the initial retrieval process [11:19:00].
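
The contrast between the two encoder types can be sketched with sentence-transformers; the checkpoints below are common public ones, not necessarily those used in the talk.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I reset my password?"
docs = ["Password reset instructions.", "Office opening hours."]

# Cross-encoder: scores each (query, document) pair jointly -- accurate but slow.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = cross.predict([(query, d) for d in docs])

# Bi-encoder: embeds query and documents separately, then compares embeddings.
bi = SentenceTransformer("BAAI/bge-small-en-v1.5")
similarities = util.cos_sim(bi.encode(query), bi.encode(docs))
```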

Production Environment Example (Docker Compose)

A production environment often uses a docker-compose.yaml file to run the stack’s Docker images as a set of cooperating containers [01:14:00].
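
An illustrative compose file is sketched below; the service names, images, and ports are assumptions showing how the components above could be wired together, not the exact file from the talk.

```yaml
services:
  app:
    build: .                     # the RAG application itself
    ports:
      - "8000:8000"
    depends_on:
      - qdrant
      - ollama
  qdrant:
    image: qdrant/qdrant         # vector database
    ports:
      - "6333:6333"
  ollama:
    image: ollama/ollama         # serves the open LLM
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix  # monitoring and tracing UI
    ports:
      - "6006:6006"
```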

This setup allows for robust and scalable deployment of RAG solutions.