From: aidotengineer
This article provides insights into the evaluation and improvement of RAG (Retrieval Augmented Generation) solutions, drawing from lessons learned over many iterations [00:00:00]. The focus is on practical components used in a RAG stack for both prototyping and production environments [00:00:40].
Components of a RAG Solution
A typical RAG stack includes:
- Orchestration [00:00:44]
- Embedding models [00:00:46]
- Vector database [00:00:47]
- Large Language Model (LLM) [00:00:47]
- Monitoring and Tracing tools [00:02:41]
- Re-ranking solutions [00:03:07]
- Evaluation frameworks [00:03:24]
Solutions often differ between prototyping and production. Prototyping is typically done in Google Colab because it provides free hardware accelerators [00:00:53]. Production environments, especially for financial institutions, often require on-premise data processing, making Docker a common choice for deployment [00:01:04].
Orchestration
- Prototyping: LlamaIndex or LangChain/LangGraph [00:01:25]
- Production: LlamaIndex [00:01:29]
Embedding Models
- Prototyping: Closed models (via APIs) or open models (e.g., Nvidia, BAAI) [00:01:31]. The BAAI BGE small model is an example of an open embedding model [00:12:42].
- Production: Open models (e.g., BAAI, Nvidia) [00:01:46].
Vector Database
- Qdrant is recommended due to its scalability from a few documents to hundreds of thousands [00:01:51].
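As a rough illustration, the snippet below stores and queries embeddings with the qdrant-client Python package; the collection name, the 384-dimensional vector size (matching a small BGE embedding model), and the placeholder vectors are assumptions for the sketch, not details from the talk.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance for prototyping; in production, point the client at the Qdrant container.
client = QdrantClient(":memory:")

# The vector size must match the embedding model (384 here, as produced by a small BGE model).
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert document chunks with their embeddings (placeholder vectors for brevity).
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"text": "First document chunk"}),
        PointStruct(id=2, vector=[0.2] * 384, payload={"text": "Second document chunk"}),
    ],
)

# Retrieve the chunks most similar to a query embedding.
for hit in client.search(collection_name="docs", query_vector=[0.1] * 384, limit=2):
    print(hit.score, hit.payload["text"])
```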
Large Language Models (LLMs)
- Prototyping: Closed models (via APIs for simplicity) or open models (e.g., Meta, Qwen 3) [00:02:04].
- Production: Open models, served using Ollama or Hugging Face Text Generation Inference within a Docker environment (e.g., Llama 3.2 or Qwen 3 4B from Alibaba Cloud) [00:02:21].
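As a minimal sketch, a locally served model can be called through the ollama Python client roughly as follows; the model tag and messages are illustrative and assume the model has already been pulled into the running Ollama server.

```python
import ollama

# Assumes an Ollama server is running (e.g., in its Docker container) and the model
# has been pulled beforehand, e.g. `ollama pull llama3.2`.
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: What does the policy cover?"},
    ],
)
print(response["message"]["content"])
```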
Improving RAG Solution Accuracy
A naive RAG solution involves embedding a query, retrieving relevant documents from a vector database, and passing them to an LLM for response generation [00:04:59]. To enhance accuracy, two key steps can be added:
- Query Processing: Removing or refining information in the initial query (e.g., personal identifiable information) before it enters the RAG system [00:08:49].
- Post-Retrieval Processing (Re-ranking): Improving the accuracy of documents retrieved from the vector database before they are sent to the LLM [00:09:06].
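To make the flow concrete, here is a minimal structural sketch of that pipeline; every helper below is a hypothetical placeholder standing in for the components discussed above, not code from the talk.

```python
# Hypothetical placeholders: in a real stack these wrap the embedding model, the
# vector database, the re-ranker, and the LLM chosen above.
def scrub_pii(query): return query.replace("123-45-6789", "[redacted]")
def embed(text): return [0.0] * 384
def retrieve(vector, top_k): return ["chunk A", "chunk B", "chunk C"]
def rerank(query, chunks, top_n): return chunks[:top_n]
def generate(query, chunks): return f"Answer to '{query}' grounded in {len(chunks)} chunks"

def answer(query: str) -> str:
    clean_query = scrub_pii(query)                         # 1. query processing (e.g., remove PII)
    candidates = retrieve(embed(clean_query), top_k=20)    # 2. naive retrieval from the vector database
    top_chunks = rerank(clean_query, candidates, top_n=5)  # 3. post-retrieval re-ranking
    return generate(clean_query, top_chunks)               # 4. LLM response generation

print(answer("What does the policy cover?"))
```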
Cross-Encoders vs. Bi-Encoders
Understanding the difference between cross-encoders and bi-encoders is crucial for effective information retrieval and re-ranking:
Cross-Encoder:
- Semantically compares a query with a document by sending both to a BERT model (encoder from the original transformer model) [00:09:25].
- A classifier then provides a similarity score between 0 and 1 [00:09:45].
- Pros: Excellent for additional accuracy [00:10:04].
- Cons: Slow and not scalable, especially with large documents, as the query and document are processed together [00:09:53].
- Placement in RAG: Best used post-retrieval (as a re-ranker) because it works with a smaller number of documents retrieved from the vector database [00:11:25].
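As an illustration, a cross-encoder re-ranking step might look roughly like this with the sentence-transformers CrossEncoder class; the model name and example texts are assumptions for the sketch.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder trained for relevance scoring; query and document are scored together.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
retrieved_docs = [
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our office is open Monday to Friday.",
]

# One forward pass per (query, document) pair, which is why this step is reserved for
# the small candidate set coming back from the vector database.
scores = model.predict([(query, doc) for doc in retrieved_docs])
for doc, score in sorted(zip(retrieved_docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")  # higher score = more relevant
```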
Bi-Encoder:
- Uses two separate encoders (e.g., BERT models) for the query and the document [00:10:25]. Each encoder produces an embedding.
- Similarity is then compared using metrics like cosine similarity [00:10:42].
- Pros: Fast and scalable because query and document embeddings are generated independently, allowing for efficient comparison [00:10:51]. Excellent for information retrieval [00:10:58].
- Placement in RAG: Ideal for the vector database search where the query is compared against many documents [00:11:19].
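For comparison, here is a minimal bi-encoder sketch with sentence-transformers, using the BAAI BGE small model mentioned earlier; the example texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: query and documents are embedded independently, so document embeddings
# can be precomputed and stored in the vector database.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query_embedding = model.encode("How do I reset my password?")
doc_embeddings = model.encode([
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our office is open Monday to Friday.",
])

# Compare with cosine similarity; in production, Qdrant performs this comparison at scale.
print(util.cos_sim(query_embedding, doc_embeddings))  # shape (1, 2): one score per document
```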
Re-ranking with Cross-Encoders
Re-rankers improve the accuracy of a RAG solution [00:03:07].
- Prototyping: Closed models like Cohere’s re-ranker [00:03:11].
- Production: Open solutions like Nvidia’s re-ranker [00:03:17].
Evaluation of RAG Solutions
Evaluating how well a RAG solution performs is critical [00:03:24].
Monitoring and Tracing
Monitoring and tracing help troubleshoot issues and identify performance bottlenecks within the RAG pipeline [00:02:41].
- Prototyping: Arize Phoenix or LangSmith [00:02:56]
- Production: Arize Phoenix, which can be easily run in a Docker container [00:03:00].
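As a small sketch, Phoenix can be launched locally during prototyping roughly as follows; the framework-specific instrumentation call in the comment is an assumption and depends on your stack.

```python
import phoenix as px

# Launch the Phoenix collector and UI locally while prototyping; in production the
# article runs Phoenix from its Docker image instead.
session = px.launch_app()
print(f"Phoenix UI available at {session.url}")

# Tracing the RAG framework itself is wired up separately, e.g. (assuming the
# Arize Phoenix callback integration for LlamaIndex is installed):
# from llama_index.core import set_global_handler
# set_global_handler("arize_phoenix")
```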
RAG Evaluation Frameworks
While a single query can be tested manually, a robust RAG solution requires testing against a larger set of documents and queries [00:16:40].
- Ragas: A recommended framework for RAG evaluation; it can assess the quality of a solution along several dimensions and works well with LLMs, making the task relatively painless [00:03:27], [00:16:53].
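As a rough sketch, a Ragas evaluation run looks something like the following; the metric choice and tiny in-memory dataset are illustrative, the column names follow the classic Ragas schema (newer releases rename them), and an LLM judge must be configured for these metrics to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# A tiny evaluation set; in practice this is built from a larger pool of queries and documents.
eval_data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["The warranty covers manufacturing defects for a period of two years."]],
    "ground_truth": ["Manufacturing defects are covered for two years."],
})

# Ragas uses an LLM as a judge for these metrics, so an API key or local judge model is required.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```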
Production Environment Setup
A production RAG environment often uses a docker-compose.yaml file to manage various services as Docker containers [00:17:46]. Key images include:
- Data ingestion [00:17:57]
- Qdrant for the vector database [00:18:03]
- Front-end application [00:18:09]
- Ollama or Hugging Face Text Generation Inference for serving models [00:18:12]
- Phoenix for tracing [00:18:19]
- Ragas for evaluation [00:18:22]
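As a sketch of that layout, such a docker-compose.yaml might look roughly like this; the image tags, ports, and the build paths for the custom ingestion, front-end, and evaluation images are illustrative placeholders, not the exact configuration from the talk.

```yaml
services:
  ingestion:
    build: ./ingestion                  # custom data-ingestion image (placeholder path)
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant:latest         # vector database
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend                   # front-end application (placeholder path)
    depends_on:
      - ollama
      - qdrant
  ollama:
    image: ollama/ollama:latest         # serves the open LLM (or swap in HF Text Generation Inference)
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix:latest  # tracing collector and UI
    ports:
      - "6006:6006"
  evaluation:
    build: ./evaluation                 # runs the Ragas evaluation job (placeholder path)
    depends_on:
      - qdrant
      - ollama
```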