From: aidotengineer
The information in this article is based on the experiences of an independent AI engineer working to build and ship production-ready generative AI solutions [00:00:12]. The insights shared are derived from 37 failed attempts to build a RAG stack [00:00:35].

What is RAG?

Retrieval Augmented Generation (RAG) is a technique that combines retrieval of relevant information with the generation capabilities of large language models (LLMs) [00:04:11].

The process involves three main steps:

  1. Retrieval [00:04:18]: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:23].
  2. Augmentation [00:04:31]: The original query is combined with the information retrieved from the vector database. This combined information is then provided as context to the large language model [00:04:40].
  3. Generation [00:04:45]: With the necessary context and the original query, the large language model can now generate an informed response [00:04:52].
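
To make the augmentation step concrete, here is a minimal sketch in plain Python; retrieve and llm_generate are hypothetical placeholders standing in for the vector-database search and the LLM call described above:

```python
# Minimal sketch of the retrieve -> augment -> generate flow.
# `retrieve` and `llm_generate` are hypothetical placeholders for the
# vector-database search and the LLM call described in the steps above.

def build_augmented_prompt(query: str, documents: list[str]) -> str:
    # Augmentation: the retrieved documents become context for the LLM.
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def answer(query: str) -> str:
    documents = retrieve(query)                        # 1. Retrieval (semantic search)
    prompt = build_augmented_prompt(query, documents)  # 2. Augmentation
    return llm_generate(prompt)                        # 3. Generation
```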

Naive RAG Solution

A basic RAG solution follows a straightforward pipeline:

  1. Query [00:05:02]: The user’s query is received.
  2. Embedding [00:05:02]: The query is embedded (converted into a numerical vector).
  3. Comparison and Retrieval [00:05:07]: The embedded query is compared to documents in a vector database, and relevant documents are retrieved [00:05:11].
  4. Context to LLM [00:05:14]: The retrieved documents, along with the original query, are passed to the large language model as context [00:05:18].
  5. Response Generation [00:05:20]: The LLM generates a response based on the provided context [00:05:20].

While simple, a naive solution may not provide satisfactory responses, as demonstrated by a query about “help in London” yielding information about “wheelchair friendly taxis” from the knowledge base [00:08:30].
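
As a rough illustration, such a naive pipeline can be assembled in a few lines with LlamaIndex using its defaults (OpenAI’s text-embedding-ada-002 and GPT-3.5 Turbo); the import paths assume a recent llama-index release and may differ between versions:

```python
# Naive RAG pipeline with LlamaIndex defaults: load documents, embed them into
# an in-memory vector index, and answer a query with the retrieved context.
# Import paths assume a recent LlamaIndex release (llama_index.core).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./knowledge_base").load_data()  # load the knowledge base
index = VectorStoreIndex.from_documents(documents)                 # embed and store the documents

query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```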

Enhancing RAG Solutions

To achieve a more sophisticated RAG solution, additional components can be added:

  • Query Processing Step: This can involve removing personally identifiable information (PII) from the query before it is passed to the RAG system [00:08:52].
  • Post-Retrieval Step: This step aims to improve the accuracy of documents retrieved from the vector database [00:09:09].
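
As a simplified illustration of the query processing step, a pre-processing function might mask obvious PII (here just email addresses and phone-number-like digit runs) before the query is embedded; a production system would typically rely on a dedicated PII-detection model or service instead:

```python
import re

# Simplified PII scrubber for illustration only: masks email addresses and
# long digit sequences (e.g. phone numbers) before the query enters the RAG pipeline.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    query = EMAIL_RE.sub("[EMAIL]", query)
    query = PHONE_RE.sub("[PHONE]", query)
    return query

print(scrub_pii("I'm jane@example.com, call +44 20 7946 0958 - where can I get help in London?"))
```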

Cross-encoders vs. Bi-encoders

Understanding cross-encoders and bi-encoders is crucial for improving RAG accuracy and scalability:

  • Cross-encoder:

    • Purpose: Semantically compare a query with a document [00:09:30].
    • Mechanism: Sends both the query and the document to a BERT model (encoder from the original transformer model) [00:09:41], then through a classifier to get a similarity score between 0 and 1 [00:09:48].
    • Characteristics: Delivers excellent accuracy, but is slow and does not scale well, especially with larger documents, because the query and document must be processed together by a single model [00:10:07].
    • Placement in RAG: Best used post-retrieval as a re-ranker, where it only has to handle a few documents and high accuracy matters most [00:11:33].
  • Bi-encoder:

    • Purpose: Enable fast and scalable information retrieval [00:10:56].
    • Mechanism: Uses two separate encoder models—one for the query and one for the document. Each passes through its own BERT layer, pooling, and embedding layer [00:10:38]. The similarity between the query and document embeddings is then compared (e.g., using cosine similarity) [00:10:48].
    • Characteristics: Fast and scalable because the two models are separated [00:10:56]. Excellent for initial information retrieval [00:10:58].
    • Placement in RAG: Best placed at the vector database (initial retrieval) stage, where the query must be compared against many documents quickly [00:11:21].
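
The bi-encoder side can be sketched with the sentence-transformers library: the query and the documents are embedded independently and compared with cosine similarity, which is what makes it fast enough for initial retrieval (the model name is a common open example, not a recommendation from the talk; a matching cross-encoder sketch appears under Re-rankers below):

```python
# Bi-encoder: embed the query and the documents separately, then compare embeddings.
# The model name is a common open example; any bi-encoder embedding model works similarly.
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "Wheelchair-friendly taxis are available across the city.",
    "Our London support centre is open Monday to Friday.",
]
query = "Where can I get help in London?"

doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity per document
for doc, score in zip(documents, scores):
    print(f"{score.item():.3f}  {doc}")
```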

Components of the RAG Stack

The RAG stack typically includes several key components, with different choices recommended for prototyping versus production environments.

Development Environments

  • Prototyping: Google Colab is preferred due to free access to hardware accelerators [00:00:58].
  • Production: Docker is often used for on-premise or cloud deployments, particularly for financial institutions requiring data and processing to remain on-premise [00:01:14].

Orchestration

This layer manages the flow and interaction between different components of the RAG system.

Embedding Models

These models convert text into numerical vectors (embeddings) for semantic search.

  • Options: Closed models (accessed via APIs) or open models (like those from Nvidia or BAAI) [00:01:31].
  • Prototyping: Closed models like OpenAI’s text-embedding-ada-002 (the default in LlamaIndex) [00:12:15] or text-embedding-3-large [00:12:22]. Open models such as BAAI’s BGE small model can also be downloaded and used [00:12:46].
  • Production: Open models (BAAI, Nvidia) [00:01:48] [00:16:59].
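
As a sketch of how the embedding model might be swapped between prototyping and production in LlamaIndex (exact import paths and the extra llama-index-embeddings-* packages depend on the installed version):

```python
# Swap the embedding model used by LlamaIndex via its global Settings object.
# Import paths and package layout assume a recent LlamaIndex release.
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Prototyping: closed model accessed through an API.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# Production: open model downloaded and run locally (BAAI's BGE small).
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```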

Vector Database

Stores the document embeddings and enables efficient semantic search.

  • Recommendation: Qdrant is an excellent choice for its scalability, handling anything from a few documents to hundreds of thousands [00:01:58].
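
A minimal sketch with the qdrant-client Python package (shown in in-memory mode for local experiments; the collection name, vector size, and payload are illustrative, and the search API differs slightly between client versions):

```python
# Minimal Qdrant usage: create a collection, upsert one embedded document,
# and run a vector search. In-memory mode here; point the client at a real
# Qdrant container (e.g. http://localhost:6333) in production.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 dims matches BGE small
)

client.upsert(
    collection_name="documents",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example document"})],
)

hits = client.search(collection_name="documents", query_vector=[0.0] * 384, limit=3)
```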

Large Language Model (LLM)

Generates the final response based on the retrieved context and query.

  • Prototyping: Closed models are often preferred for their simplicity via APIs [00:02:09], such as OpenAI’s GPT-3.5 Turbo (the default) [00:13:25] or GPT-4 [00:13:39]. Open models such as Meta’s Llama models or Qwen 3 [00:02:16] can also be used.
  • Production: Open models such as Meta’s Llama 3.2 or the 4-billion-parameter Qwen 3 model from Alibaba Cloud, served using Ollama or Hugging Face’s Text Generation Inference [00:02:35].
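
For production serving, here is a minimal sketch of calling a model served by Ollama over its local REST API (the model tag and host are assumptions; Hugging Face’s Text Generation Inference exposes a different but similarly simple HTTP API):

```python
# Query a locally served open model through Ollama's REST API.
# Assumes Ollama is running on its default port with the "llama3.2" model pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Using the provided context, answer: Where can I get help in London?",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```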

Monitoring and Tracing

Essential for troubleshooting and understanding performance bottlenecks within the RAG solution [00:02:45].

Re-rankers (Post-Retrieval)

Used to improve the accuracy of the retrieved documents by re-ranking them based on semantic similarity to the query. This uses a cross-encoder approach [00:11:43].
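
A re-ranking step can be sketched with a cross-encoder from sentence-transformers: each retrieved document is scored jointly with the query and the list is re-ordered before being passed to the LLM (the model name is a common open example):

```python
# Cross-encoder re-ranking: score each (query, document) pair jointly and
# re-order the documents returned by the initial bi-encoder retrieval.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where can I get help in London?"
retrieved = [
    "Wheelchair-friendly taxis are available across the city.",
    "Our London support centre is open Monday to Friday.",
]

scores = reranker.predict([(query, doc) for doc in retrieved])
reranked = [doc for _, doc in sorted(zip(scores, retrieved), reverse=True)]
print(reranked[0])  # most relevant document first
```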

RAG Evaluation

Crucial for assessing the quality and performance of the RAG solution across various metrics.
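
As an illustration, evaluation with Ragas (listed below as part of the production stack) scores question, retrieved-context, and answer records against metrics such as faithfulness and answer relevancy; the column and metric names below follow older Ragas releases and may differ in newer versions:

```python
# RAG evaluation sketch with Ragas over a tiny sample of records.
# Column and metric names follow Ragas v0.1-style APIs and may have changed since.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

sample = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "contexts": [["Our London support centre is open Monday to Friday."]],
    "answer": ["You can visit the London support centre on weekdays."],
    "ground_truth": ["The London support centre offers help on weekdays."],
})

result = evaluate(sample, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```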

Production Environment Setup (Docker Compose)

For production, a docker-compose.yaml file integrates various Docker images to run the RAG solution as containers [00:17:51]:

  • Ingestion Image: Connects to the knowledge base to pull in data (e.g., HTML files) [00:18:01].
  • Qdrant Image: For the vector database, pulled directly from Docker Hub [00:18:07].
  • Front-end App Image: For the user interface of the solution [00:18:09].
  • LLM Serving Image: Uses Ollama or Hugging Face’s Text Generation Inference engine to serve the large language models [00:18:16].
  • Phoenix Image: For tracing and monitoring [00:18:22].
  • Ragas Image: For RAG evaluation [00:18:24].

This setup allows for running each component as a separate container within Docker Compose, ensuring a robust and manageable production environment [00:18:28].
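
A skeleton of such a docker-compose.yaml might look like the following; qdrant/qdrant, ollama/ollama, and arizephoenix/phoenix are real Docker Hub images, while the ingestion, front-end, and evaluation services stand in for custom images built from the project’s own Dockerfiles:

```yaml
# Skeleton docker-compose.yaml; service names, ports, and the custom builds
# are illustrative, not taken from the talk.
services:
  ingestion:
    build: ./ingestion           # pulls HTML files from the knowledge base
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant         # vector database, pulled from Docker Hub
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend            # user-facing app
    ports:
      - "8080:8080"
  ollama:
    image: ollama/ollama         # serves the open LLM (e.g. Llama 3.2)
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix  # tracing and monitoring
    ports:
      - "6006:6006"
  evaluation:
    build: ./ragas               # RAG evaluation jobs
    depends_on:
      - qdrant
```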