From: aidotengineer
This article provides insights into the evaluation and improvement of RAG (Retrieval Augmented Generation) solutions, drawing from lessons learned over many iterations [00:00:00]. The focus is on practical components used in a RAG stack for both prototyping and production environments [00:00:40].
Components of a RAG Solution
A typical RAG stack includes:
- Orchestration [00:00:44]
- Embedding models [00:00:46]
- Vector database [00:00:47]
- Large Language Model (LLM) [00:00:47]
- Monitoring and Tracing tools [00:02:41]
- Re-ranking solutions [00:03:07]
- Evaluation frameworks [00:03:24]
Solutions often differ between prototyping and production. Prototyping is typically done in Google Colab because it provides free hardware accelerators [00:00:53]. Production environments, especially for financial institutions, often require on-premise data processing, making Docker a common choice for deployment [00:01:04].
Orchestration
- Prototyping: LlamaIndex or LangChain/LangGraph [00:01:25]
- Production: LlamaIndex [00:01:29]
Embedding Models
- Prototyping: Closed models (via APIs) or open models (e.g., Nvidia, BAAI) [00:01:31]. The BAAI BGE small model is an example of an open embedding model [00:12:42].
- Production: Open models (e.g., BAAI, Nvidia) [00:01:46].
Vector Database
- Qdrant is recommended due to its scalability from a few documents to hundreds of thousands [00:01:51].
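As a rough illustration, the snippet below stores and queries embeddings with the qdrant-client Python package; the collection name, the 384-dimensional vector size (matching a small BGE embedding model), and the placeholder vectors are assumptions for the sketch, not details from the talk.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance for prototyping; in production, point the client at the Qdrant container.
client = QdrantClient(":memory:")

# The vector size must match the embedding model (384 here, as produced by a small BGE model).
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert document chunks with their embeddings (placeholder vectors for brevity).
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"text": "First document chunk"}),
        PointStruct(id=2, vector=[0.2] * 384, payload={"text": "Second document chunk"}),
    ],
)

# Retrieve the chunks most similar to a query embedding.
for hit in client.search(collection_name="docs", query_vector=[0.1] * 384, limit=2):
    print(hit.score, hit.payload["text"])
```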
Large Language Models (LLMs)
- Prototyping: Closed models (via APIs for simplicity) or open models (e.g., Meta, Qwen 3) [00:02:04].
- Production: Open models, served using Ollama or Hugging Face Text Generation Inference within a Docker environment (e.g., Llama 3.2 or Qwen 3 4B from Alibaba Cloud) [00:02:21].
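As a minimal sketch, a locally served model can be called through the ollama Python client roughly as follows; the model tag and messages are illustrative and assume the model has already been pulled into the running Ollama server.

```python
import ollama

# Assumes an Ollama server is running (e.g., in its Docker container) and the model
# has been pulled beforehand, e.g. `ollama pull llama3.2`.
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: What does the policy cover?"},
    ],
)
print(response["message"]["content"])
```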
Improving RAG Solution Accuracy
A naive RAG solution involves embedding a query, retrieving relevant documents from a vector database, and passing them to an LLM for response generation [00:04:59]. To enhance accuracy, two key steps can be added:
- Query Processing: Removing or refining information in the initial query (e.g., personal identifiable information) before it enters the RAG system [00:08:49].
- Post-Retrieval Processing (Re-ranking): Improving the accuracy of documents retrieved from the vector database before they are sent to the LLM [00:09:06].
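To make the flow concrete, here is a minimal structural sketch of that pipeline; every helper below is a hypothetical placeholder standing in for the components discussed above, not code from the talk.

```python
# Hypothetical placeholders: in a real stack these wrap the embedding model, the
# vector database, the re-ranker, and the LLM chosen above.
def scrub_pii(query): return query.replace("123-45-6789", "[redacted]")
def embed(text): return [0.0] * 384
def retrieve(vector, top_k): return ["chunk A", "chunk B", "chunk C"]
def rerank(query, chunks, top_n): return chunks[:top_n]
def generate(query, chunks): return f"Answer to '{query}' grounded in {len(chunks)} chunks"

def answer(query: str) -> str:
    clean_query = scrub_pii(query)                         # 1. query processing (e.g., remove PII)
    candidates = retrieve(embed(clean_query), top_k=20)    # 2. naive retrieval from the vector database
    top_chunks = rerank(clean_query, candidates, top_n=5)  # 3. post-retrieval re-ranking
    return generate(clean_query, top_chunks)               # 4. LLM response generation

print(answer("What does the policy cover?"))
```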
Cross-Encoders vs. Bi-Encoders
Understanding the difference between cross-encoders and bi-encoders is crucial for effective information retrieval and re-ranking:
Cross-Encoder:
- Semantically compares a query with a document by sending both to a BERT model (encoder from the original transformer model) [00:09:25].
- A classifier then provides a similarity score between 0 and 1 [00:09:45].
- Pros: Excellent for additional accuracy [00:10:04].
- Cons: Slow and not scalable, especially with large documents, as the query and document are processed together [00:09:53].
- Placement in RAG: Best used post-retrieval (as a re-ranker) because it works with a smaller number of documents retrieved from the vector database [00:11:25].
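As an illustration, a cross-encoder re-ranking step might look roughly like this with the sentence-transformers CrossEncoder class; the model name and example texts are assumptions for the sketch.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder trained for relevance scoring; query and document are scored together.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
retrieved_docs = [
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our office is open Monday to Friday.",
]

# One forward pass per (query, document) pair, which is why this step is reserved for
# the small candidate set coming back from the vector database.
scores = model.predict([(query, doc) for doc in retrieved_docs])
for doc, score in sorted(zip(retrieved_docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")  # higher score = more relevant
```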
Bi-Encoder:
- Uses two separate encoders (e.g., BERT models) for the query and the document [00:10:25]. Each encoder produces an embedding.
- Similarity is then compared using metrics like cosine similarity [00:10:42].
- Pros: Fast and scalable because query and document embeddings are generated independently, allowing for efficient comparison [00:10:51]. Excellent for information retrieval [00:10:58].
- Placement in RAG: Ideal for the vector database search where the query is compared against many documents [00:11:19].
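For comparison, here is a minimal bi-encoder sketch with sentence-transformers, using the BAAI BGE small model mentioned earlier; the example texts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: query and documents are embedded independently, so document embeddings
# can be precomputed and stored in the vector database.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query_embedding = model.encode("How do I reset my password?")
doc_embeddings = model.encode([
    "To reset your password, open Settings and choose 'Reset password'.",
    "Our office is open Monday to Friday.",
])

# Compare with cosine similarity; in production, Qdrant performs this comparison at scale.
print(util.cos_sim(query_embedding, doc_embeddings))  # shape (1, 2): one score per document
```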
Re-ranking with Cross-Encoders
Re-rankers improve the accuracy of a RAG solution [00:03:07].
- Prototyping: Closed models like Cohere’s re-ranker [00:03:11].
- Production: Open solutions like Nvidia’s re-ranker [00:03:17].
Evaluation of RAG Solutions
Evaluating how well a RAG solution performs is critical [00:03:24].
Monitoring and Tracing
Monitoring and tracing help troubleshoot issues and identify performance bottlenecks within the RAG pipeline [00:02:41].
- Prototyping: Arize Phoenix or LangSmith [00:02:56]
- Production: Arize Phoenix, which can be easily run in a Docker container [00:03:00].
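As a small sketch, Phoenix can be launched locally during prototyping roughly as follows; the framework-specific instrumentation call in the comment is an assumption and depends on your stack.

```python
import phoenix as px

# Launch the Phoenix collector and UI locally while prototyping; in production the
# article runs Phoenix from its Docker image instead.
session = px.launch_app()
print(f"Phoenix UI available at {session.url}")

# Tracing the RAG framework itself is wired up separately, e.g. (assuming the
# Arize Phoenix callback integration for LlamaIndex is installed):
# from llama_index.core import set_global_handler
# set_global_handler("arize_phoenix")
```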
RAG Evaluation Frameworks
While a single query can be tested manually, a robust RAG solution requires testing against a larger set of documents and queries [00:16:40].
- Ragas: A recommended framework for RAG evaluation; it can assess the quality of a solution along several dimensions and works well with LLMs, making the task relatively painless [00:03:27], [00:16:53].
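As a rough sketch, a Ragas evaluation run looks something like the following; the metric choice and tiny in-memory dataset are illustrative, the column names follow the classic Ragas schema (newer releases rename them), and an LLM judge must be configured for these metrics to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# A tiny evaluation set; in practice this is built from a larger pool of queries and documents.
eval_data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["The warranty covers manufacturing defects for two years."],
    "contexts": [["The warranty covers manufacturing defects for a period of two years."]],
    "ground_truth": ["Manufacturing defects are covered for two years."],
})

# Ragas uses an LLM as a judge for these metrics, so an API key or local judge model is required.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```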
Production Environment Setup
A production RAG environment often uses a docker-compose.yaml file to manage various services as Docker containers [00:17:46]. Key images include:
- Data ingestion [00:17:57]
- Qdrant for the vector database [00:18:03]
- Front-end application [00:18:09]
- Ollama or Hugging Face Text Generation Inference for serving models [00:18:12]
- Phoenix for tracing [00:18:19]
- Ragas for evaluation [00:18:22]
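As a sketch of that layout, such a docker-compose.yaml might look roughly like this; the image tags, ports, and the build paths for the custom ingestion, front-end, and evaluation images are illustrative placeholders, not the exact configuration from the talk.

```yaml
services:
  ingestion:
    build: ./ingestion                  # custom data-ingestion image (placeholder path)
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant:latest         # vector database
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend                   # front-end application (placeholder path)
    depends_on:
      - ollama
      - qdrant
  ollama:
    image: ollama/ollama:latest         # serves the open LLM (or swap in HF Text Generation Inference)
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix:latest  # tracing collector and UI
    ports:
      - "6006:6006"
  evaluation:
    build: ./evaluation                 # runs the Ragas evaluation job (placeholder path)
    depends_on:
      - qdrant
      - ollama
```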