From: aidotengineer
Introduction to RAG Systems
Retrieval Augmented Generation (RAG) is an approach that improves the responses of Large Language Models (LLMs) by providing them with relevant, external context [04:08:00]. This guide presents a RAG stack based on learnings from 37 failed attempts, and aims to deliver as much return on investment (ROI) per minute of reading as possible [00:22:00].
The development process typically involves two phases:
- Prototyping: Usually conducted in Google Colab, leveraging its free hardware accelerators for ease of experimentation [00:53:00].
- Production deployment: Often uses Docker for on-premise or cloud deployment, which is crucial for organizations such as financial institutions that require data and processing to remain on-premise [01:04:00].
The RAG Pipeline
A basic RAG solution involves a user query, which is then embedded and compared to documents within a vector database [05:02:00]. Relevant documents are retrieved and passed along with the original query to an LLM, enabling it to generate a context-aware response [05:11:00].
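The basic flow can be sketched in a few lines of Python. This is a minimal illustration only, assuming the sentence-transformers package for embeddings and an in-memory document list; llm_generate() is a stand-in for whichever LLM client you actually use.

```python
# Minimal sketch of the basic RAG flow: embed the query, retrieve the most
# similar documents, then prompt the LLM with query + context.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Qdrant is a vector database for similarity search.",
    "RAG combines retrieval with LLM generation.",
]

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Embed the query and rank documents by cosine similarity.
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.argsort(descending=True)[:top_k]
    return [documents[int(i)] for i in top]

def llm_generate(prompt: str) -> str:
    # Placeholder: swap in any LLM client (OpenAI, Ollama, etc.).
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)

print(answer("What does Qdrant do?"))
```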
For a more sophisticated RAG solution, additional steps can be incorporated:
- Query Processing: An optional step to remove sensitive information (e.g., Personally Identifiable Information) from the query before it is sent to the RAG system [08:49:00] (a sketch follows this list).
- Post-Retrieval Step: This step improves the accuracy of the documents retrieved from the vector database, often by re-ranking [09:06:00].
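A rough sketch of the query-processing step, using simple regular expressions to mask obvious PII before the query enters the pipeline; the patterns are illustrative only, not a complete PII filter (the post-retrieval re-ranking step is shown in the re-ranking and encoder sections below).

```python
import re

# Illustrative patterns only - a real deployment would use a dedicated
# PII-detection library or service rather than two regexes.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    """Mask obvious e-mail addresses and phone numbers before retrieval."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", query))

print(scrub_pii("Email jane.doe@example.com or call +1 555 123 4567 about my loan."))
# -> "Email [EMAIL] or call [PHONE] about my loan."
```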
Core RAG Stack Components
The RAG stack is broken down into several key components:
1. Orchestration
The orchestration layer manages the flow and interaction between different RAG components.
- Prototyping: LlamaIndex or LangChain [01:25:00].
- Production: LlamaIndex [01:29:00].
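For instance, a prototype orchestration layer with LlamaIndex takes only a few lines. A minimal sketch, assuming the llama-index package, a ./knowledge_base folder of documents, and the default OpenAI-backed embedding and LLM settings (so OPENAI_API_KEY must be set):

```python
# Load documents, build an in-memory vector index, and query it.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does our refund policy say?")
print(response)
```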
2. Embedding Models
Embedding models convert text into numerical vectors for semantic search.
- Prototyping:
  - Closed models (APIs) for simplicity, e.g., OpenAI’s text-embedding-ada-002 or text-embedding-3-large [01:34:00].
  - Open models, e.g., from NVIDIA or BAAI (specifically the BGE Small model) [01:41:00].
- Production: Open models such as those from BAAI or NVIDIA, which can be downloaded and run locally [01:46:00].
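A short sketch contrasting the two options: the closed route calls OpenAI’s embeddings API, while the open route downloads BAAI’s BGE Small model via the sentence-transformers package and runs it locally. The API key handling, model names, and package choice are assumptions for illustration.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

text = "Retrieval Augmented Generation adds external context to LLM prompts."

# Closed model (API) - convenient for prototyping.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
closed_vec = client.embeddings.create(
    model="text-embedding-3-large", input=text
).data[0].embedding

# Open model - downloadable, can run fully on-premise in production.
open_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
open_vec = open_model.encode(text)

print(len(closed_vec), open_vec.shape)  # e.g. 3072 and (384,)
```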
3. Vector Database
The vector database stores document embeddings for efficient retrieval.
- Choice: Qdrant is recommended due to its excellent scalability, handling from a few to hundreds of thousands of documents [01:52:00].
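A minimal sketch of the Qdrant workflow, assuming the qdrant-client package and a local Qdrant instance (for example the official Docker image on port 6333); the 384-dimension vector size matches the BGE Small embeddings used above.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create (or reset) a collection sized for 384-dimensional embeddings.
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert one point; the dummy vector stands in for a real embedding.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example document"})],
)

# Retrieve the closest documents for a query embedding.
hits = client.search(collection_name="docs", query_vector=[0.0] * 384, limit=3)
print(hits)
```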
4. Large Language Models (LLMs)
LLMs generate the final response based on the query and retrieved context.
- Prototyping:
  - Closed models (APIs) for simplicity, e.g., OpenAI’s GPT-3.5 Turbo or GPT-4 [02:07:00].
  - Open models like Meta’s Llama models or Alibaba Cloud’s Qwen models [02:12:00].
- Production:
  - Open models (e.g., Llama 3.2 or Qwen 3 4B) served via Ollama or Hugging Face Text Generation Inference within a Docker environment [02:21:00].
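As an example of the production route, the sketch below calls a Llama 3.2 model served by Ollama over its local REST API; it assumes Ollama is reachable on the default port 11434 and that the model has already been pulled (e.g., with `ollama pull llama3.2`). Only the standard library is used.

```python
import json
import urllib.request

payload = {
    "model": "llama3.2",
    "prompt": "Summarise what a vector database does in one sentence.",
    "stream": False,  # return the full response in one JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```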
5. Monitoring and Tracing
Monitoring and tracing are critical for troubleshooting and understanding performance bottlenecks (e.g., where most time is spent) [02:41:00].
- Prototyping: LangSmith or Arize Phoenix [02:56:00].
- Production: Arize Phoenix, often used in a Docker container [03:00:00].
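A minimal tracing sketch for prototyping, assuming the arize-phoenix, arize-phoenix-otel, and openinference-instrumentation-llama-index packages; it launches the local Phoenix UI and instruments LlamaIndex so retrieval and LLM spans (and their latencies) show up there.

```python
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

session = px.launch_app()            # starts the local Phoenix UI
tracer_provider = register()         # OpenTelemetry provider pointed at Phoenix
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

print("Phoenix UI:", session.url)    # run RAG queries, then inspect the traces
```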
6. Re-ranking
Re-ranking improves the accuracy of the RAG solution by re-ordering retrieved documents based on their relevance to the query. This is typically done post-retrieval [11:33:00].
- Prototyping: Closed models (e.g., Cohere) [03:11:00].
- Production: Open solutions (e.g., from NVIDIA) [03:19:00].
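A prototyping sketch using Cohere’s rerank endpoint, assuming the cohere package and a COHERE_API_KEY environment variable; the model name and documents are examples.

```python
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

docs = [
    "Qdrant stores and searches embeddings.",
    "Phoenix traces RAG pipelines.",
    "BGE Small is an open embedding model.",
]

# Re-order the retrieved documents by relevance to the query.
results = co.rerank(
    model="rerank-english-v3.0",
    query="Which component stores embeddings?",
    documents=docs,
    top_n=2,
)
for r in results.results:
    print(docs[r.index], r.relevance_score)
```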
7. Evaluation
Evaluating RAG systems is essential to assess the quality of responses and retrieved documents.
- Framework: RAGAS is a recommended framework for RAG evaluation, working with LLMs to make the task “painless” [03:27:00]. It allows checking quality across different metrics [03:39:00].
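A minimal RAGAS sketch, assuming the ragas and datasets packages, an LLM backend for the metrics (e.g., an OpenAI key in the environment), and the classic question/answer/contexts/ground_truth column layout, which may differ across RAGAS versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One hand-written sample; in practice this would be a full evaluation set.
eval_data = Dataset.from_dict({
    "question": ["What does Qdrant do?"],
    "answer": ["Qdrant stores embeddings and retrieves similar documents."],
    "contexts": [["Qdrant is a vector database for similarity search."]],
    "ground_truth": ["Qdrant is a vector database used for similarity search."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores for the sample
```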
Encoder Types in RAG
- Cross-Encoder:
- Semantically compares a query with a document by sending both to a BERT model and then a classifier, yielding a similarity score between 0 and 1 [09:28:00].
- Offers additional accuracy but is slow and does not scale to large document collections, because the query and each document must be processed together [10:04:00].
- Best suited for post-retrieval re-ranking, as it works with a reduced set of documents [11:33:00].
- Bi-Encoder:
- Uses two separate encoders (BERT models), one for the query and one for the document, each passing through pooling and embedding layers [10:25:00].
- Compares the resulting embeddings using metrics like cosine similarity [10:47:00].
- A fast and scalable solution, excellent for initial information retrieval where the query is compared against multiple documents [10:54:00].
- Typically used where the vector data is stored, as part of the initial retrieval process [11:19:00].
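The contrast can be seen directly with the sentence-transformers package (model names below are examples): the bi-encoder embeds the query and document separately and compares the vectors, while the cross-encoder scores the (query, document) pair jointly.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I reset my password?"
doc = "To reset your password, open Settings and choose 'Forgot password'."

# Bi-encoder: fast and scalable, used for initial retrieval.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
q_vec, d_vec = bi_encoder.encode([query, doc], convert_to_tensor=True)
print("cosine similarity:", util.cos_sim(q_vec, d_vec).item())

# Cross-encoder: slower but more accurate, used for post-retrieval re-ranking.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder score:", cross_encoder.predict([(query, doc)])[0])
```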
Production Environment Example (Docker Compose)
A production environment often uses a docker-compose.yaml file to manage multiple Docker images as containers [01:14:00]:
- Data Ingestion Image: Connects to the knowledge base to pull in HTML files [17:57:00].
- Qdrant Image: For the vector database, pulled from Docker Hub [18:03:00].
- Front-end App Image: For the user interface [18:09:00].
- LLM Serving Image: Ollama or the Hugging Face Text Generation Inference engine to serve models [18:12:00].
- Arize Phoenix Image: For tracing [18:19:00].
- RAGAS Image: For model evaluation [18:22:00].
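A sketch of what such a docker-compose.yaml might look like; the image tags, ports, build paths, and service names are illustrative rather than a tested configuration.

```yaml
services:
  qdrant:                       # vector database
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
  ollama:                       # LLM serving
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
  phoenix:                      # tracing UI
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"
  data-ingestion:               # pulls HTML files from the knowledge base
    build: ./ingestion
    depends_on:
      - qdrant
  frontend:                     # user-facing app
    build: ./app
    ports:
      - "8501:8501"
    depends_on:
      - qdrant
      - ollama
  evaluation:                   # RAGAS evaluation jobs
    build: ./ragas
    depends_on:
      - qdrant
      - ollama
```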
This setup allows for robust and scalable deployment of RAG solutions.