From: aidotengineer
The information in this article is based on the experiences of an independent AI engineer working to build and ship production-ready generative AI solutions [00:00:12]. The insights shared are derived from 37 failed attempts to build a RAG stack [00:00:35].

What is RAG?

Retrieval Augmented Generation (RAG) is a technique that combines retrieval of relevant information with the generation capabilities of large language models (LLMs) [00:04:11].

The process involves three main steps:

  1. Retrieval [00:04:18]: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:23].
  2. Augmentation [00:04:31]: The original query is combined with the information retrieved from the vector database. This combined information is then provided as context to the large language model [00:04:40].
  3. Generation [00:04:45]: With the necessary context and the original query, the large language model can now generate an informed response [00:04:52].
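
To make the augmentation step concrete, here is a minimal sketch in plain Python; retrieve and llm_generate are hypothetical placeholders standing in for the vector-database search and the LLM call described above:

```python
# Minimal sketch of the retrieve -> augment -> generate flow.
# `retrieve` and `llm_generate` are hypothetical placeholders for the
# vector-database search and the LLM call described in the steps above.

def build_augmented_prompt(query: str, documents: list[str]) -> str:
    # Augmentation: the retrieved documents become context for the LLM.
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def answer(query: str) -> str:
    documents = retrieve(query)                        # 1. Retrieval (semantic search)
    prompt = build_augmented_prompt(query, documents)  # 2. Augmentation
    return llm_generate(prompt)                        # 3. Generation
```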

Naive RAG Solution

A basic RAG solution follows a straightforward pipeline:

  1. Query [00:05:02]: The user’s query is received.
  2. Embedding [00:05:02]: The query is embedded (converted into a numerical vector).
  3. Comparison and Retrieval [00:05:07]: The embedded query is compared to documents in a vector database, and relevant documents are retrieved [00:05:11].
  4. Context to LLM [00:05:14]: The retrieved documents, along with the original query, are passed to the large language model as context [00:05:18].
  5. Response Generation [00:05:20]: The LLM generates a response based on the provided context [00:05:20].

While simple, a naive solution may not provide satisfactory responses, as demonstrated by a query about “help in London” yielding information about “wheelchair friendly taxis” from the knowledge base [00:08:30].
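
As a rough illustration, such a naive pipeline can be assembled in a few lines with LlamaIndex using its defaults (OpenAI’s text-embedding-ada-002 and GPT-3.5 Turbo); the import paths assume a recent llama-index release and may differ between versions:

```python
# Naive RAG pipeline with LlamaIndex defaults: load documents, embed them into
# an in-memory vector index, and answer a query with the retrieved context.
# Import paths assume a recent LlamaIndex release (llama_index.core).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./knowledge_base").load_data()  # load the knowledge base
index = VectorStoreIndex.from_documents(documents)                 # embed and store the documents

query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```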

Enhancing RAG Solutions

To achieve a more sophisticated RAG solution, additional components can be added:

  • Query Processing Step: This can involve removing personally identifiable information (PII) from the query before it is passed to the RAG system [00:08:52].
  • Post-Retrieval Step: This step aims to improve the accuracy of documents retrieved from the vector database [00:09:09].
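
As a simplified illustration of the query processing step, a pre-processing function might mask obvious PII (here just email addresses and phone-number-like digit runs) before the query is embedded; a production system would typically rely on a dedicated PII-detection model or service instead:

```python
import re

# Simplified PII scrubber for illustration only: masks email addresses and
# long digit sequences (e.g. phone numbers) before the query enters the RAG pipeline.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    query = EMAIL_RE.sub("[EMAIL]", query)
    query = PHONE_RE.sub("[PHONE]", query)
    return query

print(scrub_pii("I'm jane@example.com, call +44 20 7946 0958 - where can I get help in London?"))
```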

Cross-encoders vs. Bi-encoders

Understanding cross-encoders and bi-encoders is crucial for improving RAG accuracy and scalability:

  • Cross-encoder:

    • Purpose: Semantically compare a query with a document [00:09:30].
    • Mechanism: Sends both the query and the document to a BERT model (encoder from the original transformer model) [00:09:41], then through a classifier to get a similarity score between 0 and 1 [00:09:48].
    • Characteristics: Delivers excellent accuracy, but is slow and does not scale well, especially with larger documents, because the query and document must be processed together by a single model [00:10:07].
    • Placement in RAG: Best used post-retrieval as a re-ranker, where it only has to handle a few documents and high accuracy matters most [00:11:33].
  • Bi-encoder:

    • Purpose: Enable fast and scalable information retrieval [00:10:56].
    • Mechanism: Uses two separate encoder models—one for the query and one for the document. Each passes through its own BERT layer, pooling, and embedding layer [00:10:38]. The similarity between the query and document embeddings is then compared (e.g., using cosine similarity) [00:10:48].
    • Characteristics: Fast and scalable because the two models are separated [00:10:56]. Excellent for initial information retrieval [00:10:58].
    • Placement in RAG: Best placed at the vector database (initial retrieval) stage, where the query must be compared against many documents quickly [00:11:21].
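
The bi-encoder side can be sketched with the sentence-transformers library: the query and the documents are embedded independently and compared with cosine similarity, which is what makes it fast enough for initial retrieval (the model name is a common open example, not a recommendation from the talk; a matching cross-encoder sketch appears under Re-rankers below):

```python
# Bi-encoder: embed the query and the documents separately, then compare embeddings.
# The model name is a common open example; any bi-encoder embedding model works similarly.
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "Wheelchair-friendly taxis are available across the city.",
    "Our London support centre is open Monday to Friday.",
]
query = "Where can I get help in London?"

doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity per document
for doc, score in zip(documents, scores):
    print(f"{score.item():.3f}  {doc}")
```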

Components of the RAG Stack

The RAG stack typically includes several key components, with different choices recommended for prototyping versus production environments.

Development Environments

  • Prototyping: Google Colab is preferred due to free access to hardware accelerators [00:00:58].
  • Production: Docker is often used for on-premise or cloud deployments, particularly for financial institutions requiring data and processing to remain on-premise [00:01:14].

Orchestration

This layer manages the flow and interaction between different components of the RAG system.

Embedding Models

These models convert text into numerical vectors (embeddings) for semantic search.

  • Options: Closed models (accessed via APIs) or open models (like those from Nvidia or BAAI) [00:01:31].
  • Prototyping: Closed models like OpenAI’s text-embedding-ada-002 (the default in LlamaIndex) [00:12:15] or text-embedding-3-large [00:12:22]. Open models such as BAAI’s BGE small model can also be downloaded and used [00:12:46].
  • Production: Open models (BAAI, Nvidia) [00:01:48] [00:16:59].
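
As a sketch of how the embedding model might be swapped between prototyping and production in LlamaIndex (exact import paths and the extra llama-index-embeddings-* packages depend on the installed version):

```python
# Swap the embedding model used by LlamaIndex via its global Settings object.
# Import paths and package layout assume a recent LlamaIndex release.
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Prototyping: closed model accessed through an API.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# Production: open model downloaded and run locally (BAAI's BGE small).
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```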

Vector Database

Stores the document embeddings and enables efficient semantic search.

  • Recommendation: Qdrant is an excellent choice for its scalability, handling anything from a few documents to hundreds of thousands [00:01:58].
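
A minimal sketch with the qdrant-client Python package (shown in in-memory mode for local experiments; the collection name, vector size, and payload are illustrative, and the search API differs slightly between client versions):

```python
# Minimal Qdrant usage: create a collection, upsert one embedded document,
# and run a vector search. In-memory mode here; point the client at a real
# Qdrant container (e.g. http://localhost:6333) in production.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 dims matches BGE small
)

client.upsert(
    collection_name="documents",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example document"})],
)

hits = client.search(collection_name="documents", query_vector=[0.0] * 384, limit=3)
```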

Large Language Model (LLM)

Generates the final response based on the retrieved context and query.

  • Prototyping: Closed models are often preferred for their simplicity via APIs [00:02:09], such as OpenAI’s GPT-3.5 Turbo (the default) [00:13:25] or GPT-4 [00:13:39]. Open models such as Meta’s Llama models or Qwen 3 [00:02:16] can also be used.
  • Production: Open models such as Meta’s Llama 3.2 or the 4-billion-parameter Qwen 3 model from Alibaba Cloud, served using Ollama or Hugging Face’s Text Generation Inference [00:02:35].
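
For production serving, here is a minimal sketch of calling a model served by Ollama over its local REST API (the model tag and host are assumptions; Hugging Face’s Text Generation Inference exposes a different but similarly simple HTTP API):

```python
# Query a locally served open model through Ollama's REST API.
# Assumes Ollama is running on its default port with the "llama3.2" model pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Using the provided context, answer: Where can I get help in London?",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```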

Monitoring and Tracing

Essential for troubleshooting and understanding performance bottlenecks within the RAG solution [00:02:45].

Re-rankers (Post-Retrieval)

Used to improve the accuracy of the retrieved documents by re-ranking them based on semantic similarity to the query. This uses a cross-encoder approach [00:11:43].
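
A re-ranking step can be sketched with a cross-encoder from sentence-transformers: each retrieved document is scored jointly with the query and the list is re-ordered before being passed to the LLM (the model name is a common open example):

```python
# Cross-encoder re-ranking: score each (query, document) pair jointly and
# re-order the documents returned by the initial bi-encoder retrieval.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where can I get help in London?"
retrieved = [
    "Wheelchair-friendly taxis are available across the city.",
    "Our London support centre is open Monday to Friday.",
]

scores = reranker.predict([(query, doc) for doc in retrieved])
reranked = [doc for _, doc in sorted(zip(scores, retrieved), reverse=True)]
print(reranked[0])  # most relevant document first
```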

RAG Evaluation

Crucial for assessing the quality and performance of the RAG solution across various metrics.
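
As an illustration, evaluation with Ragas (listed below as part of the production stack) scores question, retrieved-context, and answer records against metrics such as faithfulness and answer relevancy; the column and metric names below follow older Ragas releases and may differ in newer versions:

```python
# RAG evaluation sketch with Ragas over a tiny sample of records.
# Column and metric names follow Ragas v0.1-style APIs and may have changed since.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

sample = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "contexts": [["Our London support centre is open Monday to Friday."]],
    "answer": ["You can visit the London support centre on weekdays."],
    "ground_truth": ["The London support centre offers help on weekdays."],
})

result = evaluate(sample, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```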

Production Environment Setup (Docker Compose)

For production, a docker-compose.yaml file integrates various Docker images to run the RAG solution as containers [00:17:51]:

  • Ingestion Image: Connects to the knowledge base to pull in data (e.g., HTML files) [00:18:01].
  • Qdrant Image: For the vector database, pulled directly from Docker Hub [00:18:07].
  • Front-end App Image: For the user interface of the solution [00:18:09].
  • LLM Serving Image: Uses Ollama or Hugging Face’s Text Generation Inference engine to serve the large language models [00:18:16].
  • Phoenix Image: For tracing and monitoring [00:18:22].
  • Ragas Image: For RAG evaluation [00:18:24].

This setup allows for running each component as a separate container within Docker Compose, ensuring a robust and manageable production environment [00:18:28].
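
A skeleton of such a docker-compose.yaml might look like the following; qdrant/qdrant, ollama/ollama, and arizephoenix/phoenix are real Docker Hub images, while the ingestion, front-end, and evaluation services stand in for custom images built from the project’s own Dockerfiles:

```yaml
# Skeleton docker-compose.yaml; service names, ports, and the custom builds
# are illustrative, not taken from the talk.
services:
  ingestion:
    build: ./ingestion           # pulls HTML files from the knowledge base
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant         # vector database, pulled from Docker Hub
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend            # user-facing app
    ports:
      - "8080:8080"
  ollama:
    image: ollama/ollama         # serves the open LLM (e.g. Llama 3.2)
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix  # tracing and monitoring
    ports:
      - "6006:6006"
  evaluation:
    build: ./ragas               # RAG evaluation jobs
    depends_on:
      - qdrant
```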