From: aidotengineer
The information in this article is based on the experiences of an independent AI engineer working to build and ship production-ready generative AI solutions [00:00:12]. The insights shared are derived from 37 failed attempts to build a RAG stack [00:00:35].
What is RAG?
Retrieval Augmented Generation (RAG) is a technique that combines retrieval of relevant information with the generation capabilities of large language models (LLMs) [00:04:11].
The process involves three main steps:
- Retrieval [00:04:18]: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:23].
- Augmentation [00:04:31]: The original query is combined with the information retrieved from the vector database. This combined information is then provided as context to the large language model [00:04:40].
- Generation [00:04:45]: With the necessary context and the original query, the large language model can now generate an informed response [00:04:52].
Naive RAG Solution
A basic RAG solution follows a straightforward pipeline:
- Query [00:05:02]: The user’s query is received.
- Embedding [00:05:02]: The query is embedded (converted into a numerical vector).
- Comparison and Retrieval [00:05:07]: The embedded query is compared to documents in a vector database, and relevant documents are retrieved [00:05:11].
- Context to LLM [00:05:14]: The retrieved documents, along with the original query, are passed to the large language model as context [00:05:18].
- Response Generation [00:05:20]: The LLM generates a response based on the provided context [00:05:20].
While simple, a naive solution may not provide satisfactory responses, as demonstrated by a query about “help in London” yielding information about “wheelchair friendly taxis” from the knowledge base [00:08:30].
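A minimal sketch of this naive pipeline, assuming LlamaIndex with its defaults (OpenAI embedding model and LLM, an in-memory vector store) and a local folder of knowledge-base files; the folder path and query are illustrative:

```python
# Naive RAG: load documents, embed them, retrieve by similarity,
# and pass the retrieved context plus the query to the LLM.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents)  # embeds and stores chunks

# Retrieval + augmentation + generation in one call.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Where can I get help in London?")
print(response)
```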
Enhancing RAG Solutions
To achieve a more sophisticated RAG solution, additional components can be added:
- Query Processing Step: This can involve removing personally identifiable information (PII) from the query before it is passed to the RAG system [00:08:52] (a simple sketch follows this list).
- Post-Retrieval Step: This step aims to improve the accuracy of documents retrieved from the vector database [00:09:09].
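As a sketch of the query-processing step, a simple regex-based scrub can mask obvious PII before the query enters the pipeline. The patterns and placeholder tokens below are illustrative; a production system would normally rely on a dedicated PII-detection library.

```python
import re

# Hypothetical pre-processing step: mask obvious PII (emails, phone numbers)
# in the user query before it reaches the RAG pipeline.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    query = EMAIL.sub("[EMAIL]", query)
    query = PHONE.sub("[PHONE]", query)
    return query

print(scrub_pii("I'm john@example.com, call me on +44 20 7946 0958 - where can I get help in London?"))
```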
Cross-encoders vs. Bi-encoders
Understanding cross-encoders and bi-encoders is crucial for improving RAG accuracy and scalability; a short sketch of both follows this list.
- Cross-encoder:
- Purpose: Semantically compare a query with a document [00:09:30].
- Mechanism: Sends both the query and the document to a BERT model (encoder from the original transformer model) [00:09:41], then through a classifier to get a similarity score between 0 and 1 [00:09:48].
- Characteristics: Excellent for additional accuracy but slow and not scalable, especially with larger documents, as both query and document are processed by a single model [00:10:07].
- Placement in RAG: Best used post-retrieval (as a re-ranker) because it works with a few documents and needs high accuracy [00:11:33].
- Bi-encoder:
- Purpose: Enable fast and scalable information retrieval [00:10:56].
- Mechanism: Uses two separate encoder models—one for the query and one for the document. Each passes through its own BERT layer, pooling, and embedding layer [00:10:38]. The similarity between the query and document embeddings is then compared (e.g., using cosine similarity) [00:10:48].
- Characteristics: Fast and scalable because the two models are separated [00:10:56]. Excellent for initial information retrieval [00:10:58].
- Placement in RAG: Ideal where the vector data is stored, as it needs to compare the query with multiple documents quickly [00:11:21].
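A short sketch of both encoder types using the sentence-transformers library; the model names and sample documents are illustrative choices, not taken from the talk:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

docs = [
    "Wheelchair-friendly taxis are available across London.",
    "The visitor centre near King's Cross offers free in-person help.",
    "Museum opening hours vary by season.",
]
query = "Where can I get help in London?"

# Bi-encoder: query and documents are embedded independently, so document
# embeddings can be precomputed and stored in a vector database.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_embs, top_k=3)[0]

# Cross-encoder: re-score only the retrieved candidates for higher accuracy.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
reranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)
print(reranked[0][1][1])  # best-matching document after re-ranking
```

The bi-encoder scales because document embeddings are computed once and stored; the cross-encoder only ever sees the handful of candidates the bi-encoder returns.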
Components of the RAG Stack
The RAG stack typically includes several key components, with different choices recommended for prototyping versus production environments.
Development Environments
- Prototyping: Google Colab is preferred due to free access to hardware accelerators [00:00:58].
- Production: Docker is often used for on-premise or cloud deployments, particularly for financial institutions requiring data and processing to remain on-premise [00:01:14].
Orchestration
This layer manages the flow and interaction between different components of the RAG system.
- Prototyping: LlamaIndex or LangGraph [00:01:27].
- Production: LlamaIndex [00:01:29].
Embedding Models
These models convert text into numerical vectors (embeddings) for semantic search.
- Options: Closed models (used via APIs) or open models (such as those from Nvidia or BAAI) [00:01:31].
- Prototyping: Closed models such as OpenAI's text-embedding-ada-002 (the default in LlamaIndex) [00:12:15] or text-embedding-3-large [00:12:22]. Open models such as BAAI's BGE small model can also be downloaded and used [00:12:46].
- Production: Open models (BAAI, Nvidia) [00:01:48] [00:16:59].
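For example, switching LlamaIndex from the default OpenAI embedding to BAAI's open BGE small model is a small change, assuming the llama-index-embeddings-huggingface integration is installed (the Hugging Face model identifier below is the standard one, not quoted from the talk):

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Download BGE small from Hugging Face and use it for all indexing and queries.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```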
Vector Database
Stores the document embeddings and enables efficient semantic search.
- Recommendation: Qdrant is an excellent choice for its scalability, handling anything from a few documents to hundreds of thousands [00:01:58].
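A sketch of pointing the LlamaIndex index at a Qdrant collection instead of the in-memory store, assuming the llama-index-vector-stores-qdrant integration and a Qdrant instance on localhost:6333 (the collection name and folder path are illustrative):

```python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to a running Qdrant instance and back the index with it.
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="knowledge_base")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```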
Large Language Model (LLM)
Generates the final response based on the retrieved context and query.
- Prototyping: Closed models are often preferred for their simplicity via APIs [00:02:09], such as OpenAI's GPT-3.5 Turbo (the default in LlamaIndex) [00:13:25] or GPT-4 [00:13:39]. Open models such as Meta's Llama models or Qwen 3 [00:02:16] can also be used.
- Production: Open models such as Meta's Llama 3.2 or Alibaba Cloud's Qwen 3 4B, served using Ollama or Hugging Face Text Generation Inference [00:02:35].
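A sketch of wiring a locally served open model into LlamaIndex via Ollama, assuming the llama-index-llms-ollama integration and that Ollama is already serving the model (the model tag is illustrative):

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Route all generation calls to the locally served open model.
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
```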
Monitoring and Tracing
Essential for troubleshooting and understanding performance bottlenecks within the RAG solution [00:02:45].
- Prototyping: LangSmith or Phoenix, used via its LlamaIndex integration [00:02:56] [00:17:15].
- Production: Arize Phoenix (can be easily run as a Docker container) [00:03:02] [00:17:20].
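A sketch of tracing a prototype with Phoenix, assuming the arize-phoenix package and LlamaIndex's Phoenix callback integration are installed (in production, the talk instead runs Phoenix as its own container):

```python
import phoenix as px
from llama_index.core import set_global_handler

px.launch_app()                      # local Phoenix UI for viewing traces
set_global_handler("arize_phoenix")  # send LlamaIndex spans to Phoenix
```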
Re-rankers (Post-Retrieval)
Used to improve the accuracy of the retrieved documents by re-ranking them based on semantic similarity to the query. This uses a cross-encoder approach [00:11:43].
- Prototyping: Closed models like Cohere’s re-ranker [00:03:13] [00:15:26].
- Production: Open solutions from Nvidia [00:03:17] [00:17:29].
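For example, Cohere's re-ranker can be attached to the LlamaIndex query engine as a post-retrieval node postprocessor. This sketch assumes the llama-index-postprocessor-cohere-rerank integration, a COHERE_API_KEY environment variable, and the index built in the earlier sketches; the parameter values are illustrative:

```python
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve a wider candidate set, then let the re-ranker keep the best three.
reranker = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=3)
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[reranker])
response = query_engine.query("Where can I get help in London?")
print(response)
```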
RAG Evaluation
Crucial for assessing the quality and performance of the RAG solution across various metrics.
- Framework: Ragas is recommended for evaluating the quality of RAG solutions [00:03:29] [00:16:53] [00:17:34]. It uses LLMs to drive the evaluation, which makes the task "painless" [00:17:45].
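A minimal Ragas run, assuming the ragas and datasets packages and an LLM available to Ragas as the judge (OpenAI by default); the single hand-written sample below is purely illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One toy evaluation record: the question, the generated answer,
# the retrieved contexts, and a reference answer.
sample = {
    "question": ["Where can I get help in London?"],
    "answer": ["You can get in-person help at the visitor centre near King's Cross."],
    "contexts": [["The visitor centre near King's Cross offers free in-person help."]],
    "ground_truth": ["In-person help is available at the visitor centre near King's Cross."],
}
results = evaluate(Dataset.from_dict(sample), metrics=[faithfulness, answer_relevancy])
print(results)
```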
Production Environment Setup (Docker Compose)
For production, a docker-compose.yaml file integrates various Docker images to run the RAG solution as containers [00:17:51]:
- Ingestion Image: Connects to the knowledge base to pull in data (e.g., HTML files) [00:18:01].
- Qdrant Image: For the vector database, pulled directly from Docker Hub [00:18:07].
- Front-end App Image: For the user interface of the solution [00:18:09].
- LLM Serving Image: Uses Ollama or Hugging Face's Text Generation Inference engine to serve the large language models [00:18:16].
- Phoenix Image: For tracing and monitoring [00:18:22].
- Ragas Image: For RAG evaluation [00:18:24].
This setup allows for running each component as a separate container within Docker Compose, ensuring a robust and manageable production environment [00:18:28].