From: aidotengineer

This article provides insights into the evaluation and improvement of RAG (Retrieval Augmented Generation) solutions, drawing from lessons learned over many iterations [00:00:00]. The focus is on practical components used in a RAG stack for both prototyping and production environments [00:00:40].

Components of a RAG Solution

A typical RAG stack includes an orchestration layer, an embedding model, a vector database, and a large language model, with an optional re-ranker for additional accuracy; each component is covered below.

Solutions often differ between prototyping and production. Prototyping is typically done in Google Colab because it provides free hardware accelerators [00:00:53]. Production environments, especially for financial institutions, often require on-premise data processing, making Docker a common choice for deployment [00:01:04].

Orchestration

Embedding Models

  • Prototyping: Closed models (via APIs) or open models (e.g., Nvidia, BAAI) [00:01:31]. The BAAI BGE small model is an example of an open embedding model [00:12:42].
  • Production: Open models (e.g., BAAI, Nvidia) [00:01:46].
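
As a minimal sketch of the open-model path, the snippet below embeds a few text chunks with the BAAI BGE small model via the sentence-transformers library; the checkpoint name and sample chunks are illustrative assumptions.

```python
# Minimal sketch: embedding text chunks with an open model (BAAI BGE small)
# using the sentence-transformers library.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

chunks = [
    "Qdrant is a vector database.",
    "Re-rankers improve retrieval accuracy.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) -- BGE small produces 384-dimensional vectors
```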

Vector Database

  • Qdrant is recommended because it scales from a handful of documents to hundreds of thousands [00:01:51]; a minimal indexing-and-search sketch follows below.
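
As a minimal sketch, the snippet below indexes a couple of documents in Qdrant (in-memory mode for prototyping) and runs a similarity search with the qdrant-client library; the collection name and documents are illustrative.

```python
# Minimal sketch: indexing and searching embeddings in Qdrant via qdrant-client.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
docs = [
    "Qdrant scales from a few documents to hundreds of thousands.",
    "Re-rankers improve retrieval accuracy.",
]

client = QdrantClient(":memory:")  # in-memory mode for prototyping; point at a server in production
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(docs, embedder.encode(docs, normalize_embeddings=True)))
    ],
)

query = embedder.encode("How do I improve retrieval accuracy?", normalize_embeddings=True)
hits = client.search(collection_name="docs", query_vector=query.tolist(), limit=2)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])
```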

Large Language Models (LLMs)

  • Prototyping: Closed models (via APIs for simplicity) or open models (e.g., Meta, Qwen 3) [00:02:04].
  • Production: Open models, served with Ollama or Hugging Face Text Generation Inference inside a Docker environment (e.g., Llama 3.2, or Alibaba Cloud's 4-billion-parameter Qwen 3) [00:02:21]. A sketch of querying an Ollama-served model follows below.
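
As a minimal sketch, assuming Ollama is running locally on its default port with a llama3.2 model already pulled, querying the served model can look like this:

```python
# Minimal sketch: querying an open LLM served locally by Ollama over its HTTP API.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",  # assumes this model has been pulled locally
        "messages": [{"role": "user", "content": "Summarise what a re-ranker does in RAG."}],
        "stream": False,
    },
    timeout=120,
)
print(response.json()["message"]["content"])
```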

Improving RAG Solution Accuracy

A naive RAG solution involves embedding a query, retrieving relevant documents from a vector database, and passing them to an LLM for response generation [00:04:59]. To enhance accuracy, two key steps can be added:

  1. Query Processing: Removing or refining information in the initial query (e.g., personally identifiable information) before it enters the RAG system [00:08:49].
  2. Post-Retrieval Processing (Re-ranking): Improving the accuracy of the documents retrieved from the vector database before they are sent to the LLM [00:09:06]. A minimal sketch of this enhanced flow follows below.
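
The sketch below shows the enhanced flow end to end; the retriever, reranker, and llm callables are hypothetical placeholders standing in for the components described in this article.

```python
# Minimal sketch of the enhanced RAG flow: scrub the query, retrieve candidates,
# re-rank them, then generate. All helpers here are hypothetical placeholders.
import re

def scrub_pii(query: str) -> str:
    # Illustrative only: redact e-mail addresses before the query enters the RAG system.
    return re.sub(r"\S+@\S+", "[REDACTED]", query)

def answer(query: str, retriever, reranker, llm, top_k: int = 20, top_n: int = 5) -> str:
    clean_query = scrub_pii(query)                      # 1. query processing
    candidates = retriever(clean_query, limit=top_k)    # bi-encoder search in the vector DB
    best = reranker(clean_query, candidates)[:top_n]    # 2. post-retrieval re-ranking
    context = "\n\n".join(best)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {clean_query}")
```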

Cross-Encoders vs. Bi-Encoders

Understanding the difference between cross-encoders and bi-encoders is crucial for effective information retrieval and re-ranking:

  • Cross-Encoder:

    • Semantically compares a query with a document by sending both together to a BERT model (the encoder half of the original transformer architecture) [00:09:25].
    • A classifier then provides a similarity score between 0 and 1 [00:09:45].
    • Pros: Excellent for additional accuracy [00:10:04].
    • Cons: Slow and not scalable, especially with large documents, as the query and document are processed together [00:09:53].
    • Placement in RAG: Best used post-retrieval (as a re-ranker) because it works with a smaller number of documents retrieved from the vector database [00:11:25].
  • Bi-Encoder:

    • Uses two separate encoders (e.g., BERT models) for the query and the document [00:10:25]. Each encoder produces an embedding.
    • Similarity is then compared using metrics like cosine similarity [00:10:42].
    • Pros: Fast and scalable because query and document embeddings are generated independently, allowing for efficient comparison [00:10:51]. Excellent for information retrieval [00:10:58].
    • Placement in RAG: Ideal for the vector database search where the query is compared against many documents [00:11:19].
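
To make the contrast concrete, the sketch below scores the same query-document pair with both encoder types using the sentence-transformers library; the checkpoint names are common open examples, not prescriptions from the talk, and higher scores indicate higher relevance.

```python
# Minimal sketch contrasting bi-encoder and cross-encoder scoring.
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim

query = "How do I improve retrieval accuracy?"
doc = "Re-rankers improve the accuracy of documents retrieved from the vector database."

# Bi-encoder: query and document are embedded independently, then compared.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
q_emb, d_emb = bi_encoder.encode([query, doc])
print("bi-encoder cosine similarity:", float(cos_sim(q_emb, d_emb)))

# Cross-encoder: query and document are scored together in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder score:", float(cross_encoder.predict([(query, doc)])[0]))
```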

Re-ranking with Cross-Encoders

Re-rankers improve the accuracy of a RAG solution [00:03:07].

  • Prototyping: Closed models like Cohere’s re-ranker [00:03:11].
  • Production: Open solutions like Nvidia’s re-ranker [00:03:17].
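
As a prototyping sketch, the snippet below re-ranks a few retrieved documents with Cohere's hosted re-ranker through its Python SDK; the placeholder API key, model name, and response handling are assumptions to verify against Cohere's current documentation.

```python
# Minimal sketch: post-retrieval re-ranking with Cohere's hosted re-ranker (prototyping).
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

docs = [
    "Qdrant is a vector database.",
    "Re-rankers improve the accuracy of retrieved documents.",
    "Docker is commonly used for on-premise deployment.",
]
reranked = co.rerank(
    model="rerank-english-v3.0",
    query="How can I make retrieval more accurate?",
    documents=docs,
    top_n=2,
)
for result in reranked.results:
    print(round(result.relevance_score, 3), docs[result.index])
```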

Evaluation of RAG Solutions

Evaluating how well a RAG solution performs is critical [00:03:24].

Monitoring and Tracing

Monitoring and tracing help troubleshoot issues and identify performance bottlenecks within the RAG pipeline [00:02:41].

  • Prototyping: Arize Phoenix or LangChain's LangSmith [00:02:56].
  • Production: Arize Phoenix, which runs easily in a Docker container [00:03:00]; a minimal local-launch sketch follows below.
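
For prototyping, a minimal sketch of launching Phoenix locally (via the arize-phoenix Python package) looks like this; instrumenting the pipeline itself is framework-specific and omitted here.

```python
# Minimal sketch: launching Arize Phoenix locally so RAG traces can be inspected in the browser.
import phoenix as px

session = px.launch_app()  # starts a local Phoenix UI
print(session.url)         # open this URL to view traces
```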

RAG Evaluation Frameworks

While a single query can be tested manually, a robust RAG solution requires testing against a larger set of documents and queries [00:16:40].

  • Ragas: A recommended framework for RAG evaluation; it checks the quality of a solution along several dimensions and works well with LLMs, making the task relatively painless [00:03:27], [00:16:53]. A small evaluation sketch follows below.
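
A minimal Ragas sketch, assuming the classic question/answer/contexts/ground_truth dataset schema (newer Ragas versions may rename these fields) and an LLM API key available for the metrics to use:

```python
# Minimal sketch: scoring a tiny set of RAG outputs with Ragas.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What does a re-ranker do?"],
    "answer": ["It re-orders retrieved documents so the most relevant ones reach the LLM."],
    "contexts": [["Re-rankers improve the accuracy of documents retrieved from the vector database."]],
    "ground_truth": ["A re-ranker improves the accuracy of retrieved documents before generation."],
})

# Ragas calls an LLM under the hood to compute these metrics.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```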

Production Environment Setup

A production RAG environment often uses a docker-compose.yaml file to manage various services as Docker containers [00:17:46]. Key images include: