From: aidotengineer
Retrieval Augmented Generation (RAG) is a technique for enhancing the capabilities of language models by enabling them to access and integrate information from external knowledge bases [00:04:06]. This approach aims to provide more accurate and contextually relevant responses [00:04:48].
Jonathan Fernandez, an independent AI engineer, has experience working with language models prior to ChatGPT’s emergence, helping companies build and ship production-ready generative AI solutions [00:00:04]. His insights into RAG solutions are based on lessons learned from “37 fails” [00:00:35].
How RAG Works [00:04:06]
A typical RAG process involves three main steps:
- Retrieval: When a user submits a query (e.g., “Where can I get help in London?”), the system performs a semantic search through a vector database to retrieve relevant documents [00:04:13].
- Augmentation: The original user query is combined with the retrieved information from the vector database [00:04:31]. This combined input is then provided as context to the large language model (LLM) [00:04:40].
- Generation: With the provided context and the original query, the LLM generates a coherent and informed response [00:04:43].
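As a rough sketch of these three steps, the snippet below assumes two hypothetical helpers, retrieve_documents (a semantic search over the vector database) and call_llm (a chat-completion call); only the prompt assembly reflects the flow described above.
```python
# Hypothetical helpers: retrieve_documents() searches the vector database,
# call_llm() sends a prompt to the language model. Both are placeholders.

def answer_with_rag(query: str, retrieve_documents, call_llm, top_k: int = 3) -> str:
    # 1. Retrieval: semantic search over the knowledge base.
    documents = retrieve_documents(query, top_k=top_k)

    # 2. Augmentation: combine the retrieved text with the original query.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generation: the LLM answers using the augmented prompt as input.
    return call_llm(prompt)
```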
Naive RAG Solution [00:04:56]
A basic RAG implementation follows this flow:
- The query is embedded [00:05:02].
- This embedded query is compared to documents in the vector database [00:05:04].
- Relevant documents are retrieved [00:05:10].
- The retrieved documents and the query are passed to the language model as context [00:05:13].
- The language model generates a response [00:05:18].
In a prototype using Llama Index, a basic RAG solution can be set up with just a few lines of code to load HTML files, store them in an in-memory vector database, and query it [00:07:57]. However, such a naive solution may not yield satisfactory results [00:08:27].
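A minimal sketch of such a prototype, assuming the HTML files sit in a local `data/` folder and an OpenAI API key is available for the default embedding model and LLM:
```python
# Naive RAG prototype with Llama Index: in-memory vector store, default OpenAI models.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # read the HTML files as documents
index = VectorStoreIndex.from_documents(documents)      # embed and store in memory

query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```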
Improving RAG Solutions [00:08:42]
To achieve a more sophisticated RAG solution, additional components are often added:
- Query Processing: This step involves preparing the user query, for example, by removing personally identifiable information (PII) before passing it to the RAG system [00:08:47].
- Post-Retrieval Processing: This step aims to improve the accuracy of the documents retrieved from the vector database, typically through a re-ranking mechanism [00:09:06].
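As a toy illustration of the query-processing step above (the talk does not prescribe a specific technique), a query could be scrubbed of obvious PII with simple patterns before it enters the pipeline; a production system would typically use a dedicated PII-detection tool.
```python
import re

# Illustrative only: naive regex patterns for emails and phone-like numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(query: str) -> str:
    """Replace obvious PII in the user query before retrieval."""
    query = EMAIL.sub("[EMAIL]", query)
    query = PHONE.sub("[PHONE]", query)
    return query

print(scrub_pii("I'm jane.doe@example.com, call me on +44 20 7946 0958 - where can I get help in London?"))
```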
Cross-Encoders vs. Bi-Encoders [00:09:21]
Understanding how encoders work is crucial for improving retrieval accuracy:
- Cross-Encoder: A cross-encoder takes both a query and a document, passes them to a BERT model (the encoder from the original transformer model), and then to a classifier [00:09:28]. It outputs a score between 0 and 1, indicating semantic similarity [00:09:45].
- Pros: Excellent for additional accuracy [00:10:04].
- Cons: Slow and not scalable, especially with large documents, because it processes the query and document together [00:09:53]. Best used post-retrieval for re-ranking a small set of documents [00:11:30].
- Bi-Encoder: A bi-encoder uses two separate encoder models—one for the query and one for the document [00:10:25]. Each encoder processes its input independently to produce an embedding, and then their similarity is compared using metrics like cosine similarity [00:10:42].
- Pros: Fast and scalable, as embeddings can be pre-computed [00:10:54]. Excellent for information retrieval [00:10:58].
- Cons: Less accurate than cross-encoders for fine-grained similarity.
- Application: Ideal for the vector database’s retrieval step, where many documents need to be compared against a query [00:11:16].
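A small sketch contrasting the two approaches with the sentence-transformers library (the model names here are illustrative choices, not the ones used in the talk):
```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "Where can I get help in London?"
docs = [
    "Are there any wheelchair friendly taxis in London?",
    "Are there any wheelchair friendly taxis in Paris?",
]

# Bi-encoder: embed query and documents independently, compare with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(docs)        # embeddings can be pre-computed and stored
query_embedding = bi_encoder.encode(query)
print(util.cos_sim(query_embedding, doc_embeddings))

# Cross-encoder: score each (query, document) pair jointly; slower but more accurate.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, d) for d in docs]))  # higher score = more relevant
```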
RAG Stack Components and Choices [00:00:40]
The components of a RAG stack often vary between prototyping and production environments [00:00:51]. Prototyping is typically done in Google Colab for its free hardware accelerators [00:00:53], while production environments, especially for financial institutions, often require on-premise deployments, typically built with Docker [00:01:04].
| Component | Prototyping Choice | Production Choice |
| --- | --- | --- |
| Orchestration | Llama Index or LangGraph | Llama Index |
| Embedding model | Closed (via API) or open (Nvidia, BAAI) | Open (BAAI, Nvidia) |
| Vector database | Qdrant | Qdrant |
| LLM | Closed (GPT-3.5 Turbo, GPT-4) or open (Llama 3, Qwen 3) | Open (Llama 3.2, Qwen 3 4B) via Ollama or Hugging Face TGI |
| Monitoring and tracing | LangSmith or Arize Phoenix | Arize Phoenix (Docker) |
| Re-ranking | Cohere (closed) | Nvidia (open) |
| Evaluation | Ragas | Ragas |
Orchestration [00:01:23]
- Prototyping: Llama Index or LangGraph [00:01:25].
- Production: Llama Index [00:01:29].
Embedding Models [00:01:31]
- Prototyping: Closed models (via API) or open models such as those from Nvidia or BAAI [00:01:34]. The default in Llama Index is `text-embedding-ada-002` from OpenAI, but it can be swapped for `text-embedding-3-large` or an open model such as BGE small from BAAI [00:12:13].
- Production: Open models (e.g., BAAI or Nvidia) [00:01:46] [00:16:56].
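A sketch of swapping the embedding model in Llama Index (assumes the `llama-index-embeddings-huggingface` package; the exact BGE model identifier is an assumption):
```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Replace the default OpenAI embedding model with BAAI's BGE small,
# downloaded locally from the Hugging Face Hub.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```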
Vector Database [00:01:51]
- Choice: Qdrant [00:01:52].
- Reason: Scales exceptionally well from a few documents to hundreds of thousands [00:01:53] [00:17:01].
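A sketch of pointing Llama Index at Qdrant, here using an in-memory instance for prototyping; in production the client would connect to the Qdrant container instead:
```python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# In-memory Qdrant for prototyping; point the client at a running Qdrant service in production.
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="knowledge_base")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```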
Large Language Model (LLM) [00:02:04]
- Prototyping: Closed models (e.g., OpenAI’s GPT-3.5 Turbo or GPT-4) for the simplicity of their APIs [00:02:07] [00:13:22]. Open models like Meta’s Llama 3 or Qwen 3 are also options [00:02:12].
- Production: Open models (e.g., Llama 3.2 or Qwen 3 4B from Alibaba Cloud) served via Ollama or Hugging Face Text Generation Inference [00:02:21].
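A sketch of using an open model served by Ollama from Llama Index (assumes the `llama-index-llms-ollama` package and a running Ollama instance with the model already pulled; the model tag is an assumption):
```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Point Llama Index at a model served by a local Ollama instance,
# e.g. after running `ollama pull llama3.2` on that machine.
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
```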
Monitoring and Tracing [00:02:41]
Essential for troubleshooting and identifying time-consuming components [00:02:44].
- Prototyping: LangSmith or Arize Phoenix [00:02:56].
- Production: Arize Phoenix (Docker container) [00:03:00] [00:17:18].
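A sketch of routing Llama Index traces to Arize Phoenix during prototyping (assumes the `arize-phoenix` and `llama-index-callbacks-arize-phoenix` packages; in production, traces would be sent to the Phoenix container rather than a locally launched app):
```python
import phoenix as px
from llama_index.core import set_global_handler

# Launch a local Phoenix app and route all Llama Index traces to it.
px.launch_app()
set_global_handler("arize_phoenix")
```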
Re-ranking (Post-Retrieval Improvement) [00:03:07]
Improves the accuracy of the RAG solution [00:03:08]. This component typically uses a cross-encoder [00:11:41].
- Prototyping: Closed model like Cohere’s re-ranker [00:03:11] [00:17:24].
- Production: Open solution from Nvidia [00:03:19] [00:17:26].
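A sketch of the prototyping choice, attaching Cohere’s re-ranker as a post-processor in Llama Index (assumes the `llama-index-postprocessor-cohere-rerank` package and a Cohere API key; `index` is the vector index built earlier):
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve a broader candidate set, then let the cross-encoder re-ranker
# keep only the top 2 most relevant nodes before generation.
reranker = CohereRerank(top_n=2)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)
response = query_engine.query("Where can I get help in London?")
print(response)
```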
Evaluation [00:03:24]
Crucial for assessing the quality of your RAG solution [00:03:26].
- Framework: Ragas [00:03:29].
- Features: Enables testing across many documents and uses an LLM to judge quality, which makes the task relatively painless [00:16:48].
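A minimal sketch of a Ragas evaluation run (column names follow the classic Ragas dataset convention; the sample row is illustrative only, and the metrics rely on an LLM judge, by default via an OpenAI key):
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Each row pairs a question with the generated answer and the retrieved contexts.
eval_dataset = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "answer": ["You can get help at the station information desk."],
    "contexts": [["Help points are available at all London stations."]],
})

# Ragas scores each row with LLM-based metrics such as faithfulness and answer relevancy.
results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
print(results)
```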
Knowledge Base Example [00:03:35]
A single knowledge base was used for demonstration, consisting of HTML files for a train/railway company operating in London [00:03:52]. An example query is “Where can I get help in London?” [00:03:48]. The knowledge base includes articles like “Are there any wheelchair friendly taxis in London?” and “Are there any wheelchair friendly taxis in Paris?” [00:05:42].
Prototyping in Google Colab [00:05:22]
The process of building and refining the RAG solution was demonstrated in Google Colab [00:05:25].
- Files are copied to a data folder in Colab [00:06:19].
- Llama Index’s `SimpleDirectoryReader` is used to read the HTML files, which are treated as documents with IDs and metadata (e.g., file path) [00:06:44].
- Initially, a naive RAG solution using an in-memory vector database and OpenAI’s `gpt-3.5-turbo` yields an unhelpful response [00:08:00].
- Swapping the in-memory vector database for an in-memory Qdrant instance and changing the embedding model to BAAI’s BGE small (an open model downloaded locally) improves the response slightly, but it is still not ideal [00:11:53]. The LLM is also changed to GPT-4 [00:13:35].
- The `source_nodes` attribute on the response in Llama Index allows inspecting which files were used to generate the answer, which helps with troubleshooting [00:14:42].
- Finally, adding a re-ranker (a cross-encoder) that uses Cohere’s closed model to re-rank the top two results from the vector database significantly improves the answer, making it more accurate and specific [00:15:14].
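A sketch of inspecting these source nodes on a Llama Index response object (continuing from the prototype sketches above; the `file_path` metadata key follows the `SimpleDirectoryReader` default):
```python
# Inspect which documents the query engine used to produce its answer.
response = query_engine.query("Where can I get help in London?")
for source in response.source_nodes:
    print(source.node.metadata.get("file_path"), source.score)
```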
Production Environment with Docker [00:01:04]
For production, Docker is often used for on-premise or cloud deployments [00:01:14]. A `docker-compose.yaml` file defines the services:
- An image for data ingestion connected to the knowledge base [00:17:57].
- A Qdrant image for the vector database [00:18:03].
- A front-end application [00:18:09].
- Ollama or Hugging Face’s Text Generation Inference engine for serving LLMs [00:18:12].
- Phoenix for tracing [00:18:19].
- Ragas for model evaluation [00:18:22].
Each service runs as a container within Docker Compose, with configurations for embeddings, re-ranking, and LLMs [00:18:26].