From: aidotengineer
Retrieval Augmented Generation (RAG) is a technique for enhancing the capabilities of language models by enabling them to access and integrate information from external knowledge bases [00:04:06]. This approach aims to provide more accurate and contextually relevant responses [00:04:48].
Jonathan Fernandez, an independent AI engineer, has experience working with language models prior to ChatGPT’s emergence, helping companies build and ship production-ready generative AI solutions [00:00:04]. His insights into RAG solutions are based on lessons learned from “37 fails” [00:00:35].
How RAG Works [00:04:06]
A typical RAG process involves three main steps:
- Retrieval: When a user submits a query (e.g., “Where can I get help in London?”), the system performs a semantic search through a vector database to retrieve relevant documents [00:04:13].
- Augmentation: The original user query is combined with the retrieved information from the vector database [00:04:31]. This combined input is then provided as context to the large language model (LLM) [00:04:40].
- Generation: With the provided context and the original query, the LLM generates a coherent and informed response [00:04:43].
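As a rough sketch of these three steps, the snippet below assumes two hypothetical helpers, retrieve_documents (a semantic search over the vector database) and call_llm (a chat-completion call); only the prompt assembly reflects the flow described above.
```python
# Hypothetical helpers: retrieve_documents() searches the vector database,
# call_llm() sends a prompt to the language model. Both are placeholders.

def answer_with_rag(query: str, retrieve_documents, call_llm, top_k: int = 3) -> str:
    # 1. Retrieval: semantic search over the knowledge base.
    documents = retrieve_documents(query, top_k=top_k)

    # 2. Augmentation: combine the retrieved text with the original query.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generation: the LLM answers using the augmented prompt as input.
    return call_llm(prompt)
```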
Naive RAG Solution [00:04:56]
A basic RAG implementation follows this flow:
- The query is embedded [00:05:02].
- This embedded query is compared to documents in the vector database [00:05:04].
- Relevant documents are retrieved [00:05:10].
- The retrieved documents and the query are passed to the language model as context [00:05:13].
- The language model generates a response [00:05:18].
In a prototype using Llama Index, a basic RAG solution can be set up with just a few lines of code to load HTML files, store them in an in-memory vector database, and query it [00:07:57]. However, such a naive solution may not yield satisfactory results [00:08:27].
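A minimal sketch of such a prototype, assuming the HTML files sit in a local `data/` folder and an OpenAI API key is available for the default embedding model and LLM:
```python
# Naive RAG prototype with Llama Index: in-memory vector store, default OpenAI models.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # read the HTML files as documents
index = VectorStoreIndex.from_documents(documents)      # embed and store in memory

query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```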
Improving RAG Solutions [00:08:42]
To achieve a more sophisticated RAG solution, additional components are often added:
- Query Processing: This step involves preparing the user query, for example, by removing personally identifiable information (PII) before passing it to the RAG system [00:08:47].
- Post-Retrieval Processing: This step aims to improve the accuracy of the documents retrieved from the vector database, typically through a re-ranking mechanism [00:09:06].
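As a toy illustration of the query-processing step above (the talk does not prescribe a specific technique), a query could be scrubbed of obvious PII with simple patterns before it enters the pipeline; a production system would typically use a dedicated PII-detection tool.
```python
import re

# Illustrative only: naive regex patterns for emails and phone-like numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    """Replace obvious PII in the user query before retrieval."""
    query = EMAIL.sub("[EMAIL]", query)
    query = PHONE.sub("[PHONE]", query)
    return query

print(scrub_pii("I'm jane.doe@example.com, call me on +44 20 7946 0958 - where can I get help in London?"))
```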
Cross-Encoders vs. Bi-Encoders [00:09:21]
Understanding how encoders work is crucial for improving retrieval accuracy:
- Cross-Encoder: A cross-encoder takes both a query and a document, passes them to a BERT model (the encoder from the original transformer model), and then to a classifier [00:09:28]. It outputs a score between 0 and 1, indicating semantic similarity [00:09:45].
- Pros: Excellent for additional accuracy [00:10:04].
- Cons: Slow and not scalable, especially with large documents, because it processes the query and document together [00:09:53]. Best used post-retrieval for re-ranking a small set of documents [00:11:30].
- Bi-Encoder: A bi-encoder uses two separate encoder models—one for the query and one for the document [00:10:25]. Each encoder processes its input independently to produce an embedding, and then their similarity is compared using metrics like cosine similarity [00:10:42].
- Pros: Fast and scalable, as embeddings can be pre-computed [00:10:54]. Excellent for information retrieval [00:10:58].
- Cons: Less accurate than cross-encoders for fine-grained similarity.
- Application: Ideal for the vector database’s retrieval step, where many documents need to be compared against a query [00:11:16].
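A small sketch contrasting the two approaches with the sentence-transformers library (the model names here are illustrative choices, not the ones used in the talk):
```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "Where can I get help in London?"
docs = [
    "Are there any wheelchair friendly taxis in London?",
    "Are there any wheelchair friendly taxis in Paris?",
]

# Bi-encoder: embed query and documents independently, compare with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(docs)        # embeddings can be pre-computed and stored
query_embedding = bi_encoder.encode(query)
print(util.cos_sim(query_embedding, doc_embeddings))

# Cross-encoder: score each (query, document) pair jointly; slower but more accurate.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, d) for d in docs]))  # higher score = more relevant
```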
RAG Stack Components and Choices [00:00:40]
The components of a RAG stack often vary between prototyping and production environments [00:00:51]. Prototyping is typically done in Google Colab for its free hardware accelerators [00:00:53], while production environments, especially for financial institutions, often require on-premise deployments, typically built with Docker [00:01:04].
| Component | Prototyping Choice | Production Choice |
| --- | --- | --- |
| Orchestration | Llama Index or LangGraph | Llama Index |
| Embedding model | Closed (via API) or open (Nvidia, BAAI) | Open (BAAI, Nvidia) |
| Vector database | Qdrant | Qdrant |
| LLM | Closed (GPT-3.5 Turbo, GPT-4) or open (Llama 3, Qwen 3) | Open (Llama 3.2, Qwen 3 4B) via Ollama or Hugging Face TGI |
| Monitoring and tracing | LangSmith or Arize Phoenix | Arize Phoenix (Docker) |
| Re-ranking | Cohere (closed) | Nvidia (open) |
| Evaluation | Ragas | Ragas |
Orchestration [00:01:23]
- Prototyping: Llama Index or LangGraph [00:01:25].
- Production: Llama Index [00:01:29].
Embedding Models [00:01:31]
- Prototyping: Closed models (via API) or open models such as those from Nvidia or BAAI [00:01:34]. The default in Llama Index is `text-embedding-ada-002` from OpenAI, but it can be swapped for `text-embedding-3-large` or an open model such as BGE small from BAAI [00:12:13].
- Production: Open models (e.g., BAAI or Nvidia) [00:01:46] [00:16:56].
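A sketch of swapping the embedding model in Llama Index (assumes the `llama-index-embeddings-huggingface` package; the exact BGE model identifier is an assumption):
```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Replace the default OpenAI embedding model with BAAI's BGE small,
# downloaded locally from the Hugging Face Hub.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```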
Vector Database [00:01:51]
- Choice: Qdrant [00:01:52].
- Reason: Scales exceptionally well from a few documents to hundreds of thousands [00:01:53] [00:17:01].
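A sketch of pointing Llama Index at Qdrant, here using an in-memory instance for prototyping; in production the client would connect to the Qdrant container instead:
```python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# In-memory Qdrant for prototyping; point the client at a running Qdrant service in production.
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="knowledge_base")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```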
Large Language Model (LLM) [00:02:04]
- Prototyping: Closed models (e.g., OpenAI’s GPT-3.5 Turbo or GPT-4) for the simplicity of their APIs [00:02:07] [00:13:22]. Open models like Meta’s Llama 3 or Qwen 3 are also options [00:02:12].
- Production: Open models (e.g., Llama 3.2 or Qwen 3 4B from Alibaba Cloud) served via Ollama or Hugging Face Text Generation Inference [00:02:21].
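A sketch of using an open model served by Ollama from Llama Index (assumes the `llama-index-llms-ollama` package and a running Ollama instance with the model already pulled; the model tag is an assumption):
```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Point Llama Index at a model served by a local Ollama instance,
# e.g. after running `ollama pull llama3.2` on that machine.
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
```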
Monitoring and Tracing [00:02:41]
Essential for troubleshooting and identifying time-consuming components [00:02:44].
- Prototyping: LangSmith or Arize Phoenix [00:02:56].
- Production: Arize Phoenix (Docker container) [00:03:00] [00:17:18].
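A sketch of routing Llama Index traces to Arize Phoenix during prototyping (assumes the `arize-phoenix` and `llama-index-callbacks-arize-phoenix` packages; in production, traces would be sent to the Phoenix container rather than a locally launched app):
```python
import phoenix as px
from llama_index.core import set_global_handler

# Launch a local Phoenix app and route all Llama Index traces to it.
px.launch_app()
set_global_handler("arize_phoenix")
```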
Re-ranking (Post-Retrieval Improvement) [00:03:07]
Improves the accuracy of the RAG solution [00:03:08]. This component typically uses a cross-encoder [00:11:41].
- Prototyping: Closed model like Cohere’s re-ranker [00:03:11] [00:17:24].
- Production: Open solution from Nvidia [00:03:19] [00:17:26].
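A sketch of the prototyping choice, attaching Cohere’s re-ranker as a post-processor in Llama Index (assumes the `llama-index-postprocessor-cohere-rerank` package and a Cohere API key; `index` is the vector index built earlier):
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve a broader candidate set, then let the cross-encoder re-ranker
# keep only the top 2 most relevant nodes before generation.
reranker = CohereRerank(top_n=2)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
response = query_engine.query("Where can I get help in London?")
print(response)
```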
Evaluation [00:03:24]
Crucial for assessing the quality of your RAG solution [00:03:26].
- Framework: Ragas [00:03:29].
- Features: Enables testing across many documents and uses an LLM to judge quality, which makes the task relatively painless [00:16:48].
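A minimal sketch of a Ragas evaluation run (column names follow the classic Ragas dataset convention; the sample row is illustrative only, and the metrics rely on an LLM judge, by default via an OpenAI key):
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Each row pairs a question with the generated answer and the retrieved contexts.
eval_dataset = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "answer": ["You can get help at the station information desk."],
    "contexts": [["Help points are available at all London stations."]],
})

# Ragas scores each row with LLM-based metrics such as faithfulness and answer relevancy.
results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
print(results)
```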
Knowledge Base Example [00:03:35]
A single knowledge base was used for demonstration, consisting of HTML files for a train/railway company operating in London [00:03:52]. An example query is “Where can I get help in London?” [00:03:48]. The knowledge base includes articles like “Are there any wheelchair friendly taxis in London?” and “Are there any wheelchair friendly taxis in Paris?” [00:05:42].
Prototyping in Google Colab [00:05:22]
The process of building and refining the RAG solution was demonstrated in Google Colab [00:05:25].
- Files are copied to a data folder in Colab [00:06:19].
- Llama Index’s `SimpleDirectoryReader` is used to read the HTML files, which are treated as documents with IDs and metadata (e.g., file path) [00:06:44].
- Initially, a naive RAG solution using an in-memory vector database and OpenAI’s `gpt-3.5-turbo` yields an unhelpful response [00:08:00].
- Swapping the in-memory vector database for an in-memory Qdrant instance and changing the embedding model to BAAI’s BGE small (an open model downloaded locally) improves the response slightly, but it is still not ideal [00:11:53]. The LLM is also changed to GPT-4 [00:13:35].
- The `source_nodes` attribute on the response in Llama Index allows inspecting which files were used to generate the answer, which helps with troubleshooting [00:14:42].
- Finally, adding a re-ranker (a cross-encoder) that uses Cohere’s closed model to re-rank the top two results from the vector database significantly improves the answer, making it more accurate and specific [00:15:14].
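A sketch of inspecting these source nodes on a Llama Index response object (continuing from the prototype sketches above; the `file_path` metadata key follows the `SimpleDirectoryReader` default):
```python
# Inspect which documents the query engine used to produce its answer.
response = query_engine.query("Where can I get help in London?")
for source in response.source_nodes:
    print(source.node.metadata.get("file_path"), source.score)
```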
Production Environment with Docker [00:01:04]
For production, Docker is often used for on-premise or cloud deployments [00:01:14]. A `docker-compose.yaml` file defines the services:
- An image for data ingestion connected to the knowledge base [00:17:57].
- A Qdrant image for the vector database [00:18:03].
- A front-end application [00:18:09].
- Ollama or Hugging Face’s Text Generation Inference engine for serving LLMs [00:18:12].
- Phoenix for tracing [00:18:19].
- Ragas for model evaluation [00:18:22].
Each service runs as a container within Docker Compose, with configurations for embeddings, re-ranking, and LLMs [00:18:26].