From: aidotengineer

Retrieval Augmented Generation (RAG) is a technique used to improve the output of Large Language Models (LLMs) by giving them access to external knowledge bases [04:06:00]. This approach aims to provide more accurate and contextually relevant responses [04:48:00].

A typical RAG process involves:

  1. Retrieval: A user query, such as “Where can I get help in London?”, is used to perform a semantic search through a vector database to retrieve relevant documents [04:13:00].
  2. Augmentation: The retrieved information is combined with the original query to provide context to the LLM [04:31:00].
  3. Generation: The LLM uses this context and the original query to generate a response [04:45:00].
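As a rough sketch, steps 2 and 3 amount to assembling the retrieved passages into a prompt and sending it to a chat model. The snippet below assumes the retrieval step has already returned a list of passages and uses OpenAI's chat API purely as an example; any LLM client would work.

```python
# Minimal sketch of steps 2 and 3 (augmentation and generation).
# Assumes `retrieved_docs` came back from the retrieval step and that the
# openai package and an OPENAI_API_KEY are available; any LLM client works.
from openai import OpenAI

query = "Where can I get help in London?"
retrieved_docs = [
    "Citizens Advice offers free, confidential help across London boroughs.",
    "NHS 111 provides urgent medical advice 24/7.",
]

# Augmentation: combine the retrieved passages with the original query
context = "\n".join(f"- {doc}" for doc in retrieved_docs)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)

# Generation: let the LLM answer from the augmented prompt
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # example model; swap for any chat model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```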

Naive RAG Solution Limitations

A basic RAG solution involves embedding the query, comparing it to documents in a vector database, retrieving relevant documents, and passing them to an LLM for response generation [04:59:00]. However, such a “naive” solution may not always provide satisfactory results [08:27:00]. To improve accuracy, additional steps like query processing and post-retrieval processing are needed [08:42:00].
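For reference, a naive baseline like this can be stood up in a few lines with LlamaIndex (covered under the implementation notes below); the sketch assumes the llama-index package (0.10+) with its default embedding and LLM settings and a local `data/` folder of documents.

```python
# Naive RAG in a few lines with LlamaIndex.
# Assumes the llama-index package (0.10+) with its default embedding/LLM
# settings and a local ./data folder containing the documents to index.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # load and chunk documents
index = VectorStoreIndex.from_documents(documents)      # embed and store in a vector index
query_engine = index.as_query_engine()                  # retrieval + generation pipeline

response = query_engine.query("Where can I get help in London?")
print(response)
```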

Understanding Cross-Encoders

A cross-encoder is a model designed to semantically compare a query with a document [09:25:00].

Mechanism

The process involves:

  • Sending both the query and the document together to a BERT model (an encoder-only model derived from the original Transformer architecture) [09:36:00].
  • Passing the output to a classifier [09:45:00].
  • Receiving a score between 0 and 1, indicating the semantic similarity between the query and the document [09:46:00].
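A minimal sketch of this scoring step, using the CrossEncoder class from the sentence-transformers library; the STS-trained model below is an illustrative choice (not necessarily the one from the talk) that outputs scores in the 0–1 range.

```python
# Cross-encoder scoring: query and document are fed through the model together.
# Assumes the sentence-transformers package; the STS-trained model below is an
# illustrative choice that outputs a similarity score between 0 and 1.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

query = "Where can I get help in London?"
documents = [
    "Citizens Advice offers free help across London boroughs.",
    "The museum is open from 9am to 5pm on weekdays.",
]

# Each (query, document) pair is scored jointly by the BERT-style encoder + classifier head
scores = model.predict([(query, doc) for doc in documents])
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```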

Advantages and Disadvantages

  • Advantage: Excellent for achieving additional accuracy [10:04:00].
  • Disadvantage: As the document set grows, the solution becomes less scalable [09:53:00]: every query–document pair must be passed through the model together, which makes it slow and impractical for large document collections [10:07:00].

Placement in RAG Pipeline

Given its high accuracy but low scalability, a cross-encoder is best used post-retrieval [11:33:00]. It functions effectively as a “reranker” to refine the top few documents retrieved by an initial, more scalable method [11:41:00]. For example, Cohere’s closed-source reranker can be used, or an open solution from NVIDIA in production environments [15:22:00], [17:24:00].
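As an illustration, Cohere’s rerank endpoint wraps exactly this pattern; the snippet below assumes the cohere Python SDK and an API key, and the model name is only an example.

```python
# Reranking the top retrieved documents with Cohere's hosted reranker.
# Assumes the cohere Python SDK and an API key; the model name is an example.
import cohere

co = cohere.Client("YOUR_API_KEY")
results = co.rerank(
    model="rerank-english-v3.0",
    query="Where can I get help in London?",
    documents=[
        "Citizens Advice offers free help across London boroughs.",
        "The museum is open from 9am to 5pm on weekdays.",
        "NHS 111 provides urgent medical advice 24/7.",
    ],
    top_n=2,
)
for hit in results.results:
    print(hit.index, round(hit.relevance_score, 3))
```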

Understanding Bi-Encoders

Bi-encoders address the scalability issues of cross-encoders by splitting the encoding process [10:14:00].

Mechanism

  • Instead of one model, two separate encoders are used [10:24:00].
  • One encoder (BERT layer, pooling, embedding layer) processes the query [10:27:00].
  • Another separate encoder (BERT layer, pooling, embedding layer) processes the document [10:33:00].
  • The similarity between the query and document embeddings is then compared using metrics like cosine similarity [10:42:00].
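A minimal sketch of this independent encoding, using the sentence-transformers library and the open BGE small model mentioned later in these notes; in practice the two towers often share weights, and the key property is that queries and documents are embedded separately.

```python
# Bi-encoder: query and documents are embedded independently, so the document
# embeddings can be pre-computed once and stored in a vector database.
# Assumes the sentence-transformers package and the open BGE small model.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = [
    "Citizens Advice offers free help across London boroughs.",
    "The museum is open from 9am to 5pm on weekdays.",
]
doc_embeddings = encoder.encode(documents)   # typically pre-computed offline

query = "Where can I get help in London?"
query_embedding = encoder.encode(query)

# Compare the query and document embeddings with cosine similarity
similarities = util.cos_sim(query_embedding, doc_embeddings)
print(similarities)  # one score per document
```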

Advantages

  • Fast and Scalable: By separating the models, this approach allows for faster and more scalable information retrieval [10:54:00]. This is because document embeddings can be pre-computed and stored, enabling quick similarity searches.

Placement in RAG Pipeline

A bi-encoder is ideal for the initial retrieval step, particularly where the vector data is stored [11:19:00]. It efficiently compares the query against multiple documents to retrieve a set of relevant candidates [11:16:00]. An example of an open bi-encoder solution is the BGE small model from BAAI [12:42:00].

Integrating Encoders into RAG

A robust RAG solution typically combines both bi-encoders and cross-encoders:

  1. Bi-encoder for Initial Retrieval: The bi-encoder, often part of the vector database (e.g., Qdrant), quickly retrieves a larger set of potentially relevant documents based on semantic similarity [11:19:00], [11:51:00]. This is a fast and scalable step.
  2. Cross-encoder for Re-ranking: The smaller set of retrieved documents is then passed through a cross-encoder (reranker) for a more fine-grained similarity assessment [11:33:00]. This improves the accuracy of the final selection of documents provided to the LLM [09:08:00].
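Putting the two together, a sketch of the retrieve-then-rerank pattern might look like the following; the in-memory semantic search stands in for a real vector database such as Qdrant, and the model choices are illustrative.

```python
# Retrieve with a bi-encoder, then rerank the candidates with a cross-encoder.
# Assumes the sentence-transformers package; the in-memory semantic search
# stands in for a real vector database such as Qdrant.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
reranker = CrossEncoder("cross-encoder/stsb-distilroberta-base")

documents = [
    "Citizens Advice offers free help across London boroughs.",
    "NHS 111 provides urgent medical advice 24/7.",
    "The museum is open from 9am to 5pm on weekdays.",
    "Samaritans can be called free of charge at any time.",
]
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)

query = "Where can I get help in London?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Step 1: fast, scalable candidate retrieval with the bi-encoder
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]
candidates = [documents[hit["corpus_id"]] for hit in hits]

# Step 2: fine-grained reranking of the small candidate set with the cross-encoder
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for doc, score in reranked[:2]:   # final context passed to the LLM
    print(f"{score:.3f}  {doc}")
```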

Practical Implementation Notes

  • Prototyping: Google Colab is useful for prototyping RAG solutions because it offers free hardware accelerators [00:57:00].
  • Orchestration: LlamaIndex and LangGraph are options for the orchestration layer [01:25:00]. LlamaIndex can quickly set up a basic RAG solution [07:57:00].
  • Embedding Models: Both closed (e.g., OpenAI’s text-embedding-ada-002 or text-embedding-3-large) and open (e.g., NVIDIA, BAAI) embedding models can be used [01:31:00], [12:15:00].
  • Vector Database: Qdrant is highlighted for its excellent scalability from a few documents to hundreds of thousands [01:51:00], [17:01:00].
  • Language Models: Closed models like OpenAI’s GPT-3.5 Turbo or GPT-4 offer simplicity via APIs [02:04:00], [13:22:00]. Open models include those from Meta or Qwen [02:12:00]. For production with Docker, Ollama or Hugging Face’s Text Generation Inference (TGI) can serve models like Llama 3.2 or Qwen 3 4B [02:25:00].
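As one concrete piece of the list above, a minimal Qdrant ingestion-and-search sketch, assuming the qdrant-client and sentence-transformers packages; the 384-dimensional vector size matches the BGE small embedding model used earlier.

```python
# Minimal Qdrant ingestion and search, run fully in memory for prototyping.
# Assumes the qdrant-client and sentence-transformers packages; the 384-dim
# vector size matches the BGE small embedding model used above.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
client = QdrantClient(":memory:")  # swap for QdrantClient(url="http://localhost:6333") in production

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

documents = [
    "Citizens Advice offers free help across London boroughs.",
    "The museum is open from 9am to 5pm on weekdays.",
]
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=encoder.encode(doc).tolist(), payload={"text": doc})
        for i, doc in enumerate(documents)
    ],
)

hits = client.search(
    collection_name="docs",
    query_vector=encoder.encode("Where can I get help in London?").tolist(),
    limit=2,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])
```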

Monitoring, Tracing, and Evaluation

  • Monitoring and Tracing: Tools like LangSmith or Arize Phoenix are crucial for troubleshooting and understanding how each component of a RAG solution performs [02:41:00], [16:16:00].
  • Evaluation: Frameworks like Ragas are essential for testing the quality of RAG solutions across a larger set of questions; Ragas itself uses LLMs under the hood to make the evaluation process smoother [03:24:00], [16:45:00], [17:34:00].
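A hedged sketch of a Ragas evaluation run; the exact API has shifted between ragas versions, so this assumes the classic evaluate() entry point over a Hugging Face Dataset with question/answer/contexts columns and an LLM configured for the judge calls.

```python
# Evaluating RAG outputs with Ragas over a small test set.
# Assumes the ragas and datasets packages and an OpenAI key for the judge LLM;
# the exact column names and entry point vary between ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_data = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "answer": ["Citizens Advice offers free help across London boroughs."],
    "contexts": [["Citizens Advice offers free help across London boroughs."]],
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(results)  # per-metric scores across the question set
```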

For production, Docker Compose can be used to set up images for data ingestion, the vector database (Qdrant), the front-end app, LLM serving (Ollama or Hugging Face TGI), tracing (Phoenix), and evaluation (Ragas) [17:46:00].
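A skeleton of such a compose file might look like the following; the image names and ports are the commonly published ones and should be verified against the respective projects, and the application service is a stand-in for your own ingestion/front-end/evaluation image.

```yaml
# docker-compose.yml skeleton for a production RAG stack (verify images and tags).
services:
  qdrant:                      # vector database
    image: qdrant/qdrant
    ports: ["6333:6333"]
  ollama:                      # LLM serving (alternatively Hugging Face TGI)
    image: ollama/ollama
    ports: ["11434:11434"]
  phoenix:                     # tracing / observability
    image: arizephoenix/phoenix
    ports: ["6006:6006"]
  app:                         # data ingestion, front end, and Ragas evaluation
    build: .                   # your own image containing the RAG application
    depends_on: [qdrant, ollama, phoenix]
```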