From: aidotengineer
Retrieval Augmented Generation (RAG) is a technique used to improve the output of Large Language Models (LLMs) by giving them access to external knowledge bases [04:06:00]. This approach aims to provide more accurate and contextually relevant responses [04:48:00].
A typical RAG process involves:
- Retrieval: A user query, such as “Where can I get help in London?”, is used to perform a semantic search through a vector database to retrieve relevant documents [04:13:00].
- Augmentation: The retrieved information is combined with the original query to provide context to the LLM [04:31:00] (see the sketch after this list).
- Generation: The LLM uses this context and the original query to generate a response [04:45:00].
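To make the augmentation and generation steps concrete, here is a minimal Python sketch. It assumes the retrieval step has already returned relevant chunks and that OpenAI’s chat completions API is used; the prompt wording, sample chunk, and model choice are illustrative rather than taken from the talk.

```python
# Minimal augmentation + generation sketch (assumes OPENAI_API_KEY is set).
from openai import OpenAI

def augment(query: str, chunks: list[str]) -> str:
    # Combine the retrieved context with the original query into one prompt.
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved = ["Call 116 123 to speak to a Samaritan in London."]  # output of the retrieval step
prompt = augment("Where can I get help in London?", retrieved)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```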
Naive RAG Solution Limitations
A basic RAG solution involves embedding the query, comparing it to documents in a vector database, retrieving relevant documents, and passing them to an LLM for response generation [04:59:00]. However, such a “naive” solution may not always provide satisfactory results [08:27:00]. To improve accuracy, additional steps like query processing and post-retrieval processing are needed [08:42:00].
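As a reference point, a naive RAG pipeline like the one described above can be stood up in a few lines with LlamaIndex. This sketch assumes the documents sit in a local data/ folder (an illustrative path) and that an OpenAI API key is available, since LlamaIndex defaults to OpenAI models for embeddings and generation.

```python
# Naive RAG in a few lines with LlamaIndex (assumes OPENAI_API_KEY is set).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest raw files
index = VectorStoreIndex.from_documents(documents)     # embed and store in a vector index
query_engine = index.as_query_engine()                 # retrieval + generation in one object

print(query_engine.query("Where can I get help in London?"))
```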
Understanding Cross-Encoders
A cross-encoder is a model designed to semantically compare a query with a document [09:25:00].
Mechanism
The process involves:
- Sending both the query and the document together into a BERT model (an encoder-only model derived from the original transformer architecture) [09:36:00].
- Passing the output to a classifier [09:45:00].
- Receiving a score between 0 and 1 that indicates the semantic similarity between the query and the document [09:46:00], as sketched below.
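A minimal scoring sketch using the CrossEncoder class from sentence-transformers; the ms-marco model name is a common public cross-encoder, not one named in the talk, and the documents are illustrative.

```python
# Cross-encoder scoring: each (query, document) pair goes through the model together.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where can I get help in London?"
docs = [
    "Call 116 123 to speak to a Samaritan in London.",
    "The London Underground opened in 1863.",
]

scores = model.predict([(query, doc) for doc in docs])
print(scores)  # higher = more relevant; apply a sigmoid if you want values in [0, 1]
```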
Advantages and Disadvantages
- Advantage: Excellent for achieving additional accuracy [10:04:00].
- Disadvantage: It scales poorly as the number of documents grows [09:53:00]. Because every query-document pair must be passed through a single model at query time, scoring a large document set is slow [10:07:00].
Placement in RAG Pipeline
Given its high accuracy but low scalability, a cross-encoder is best used post-retrieval [11:33:00]. It works well as a “reranker” that refines the top few documents returned by an initial, more scalable retrieval method [11:41:00]. For example, Cohere’s closed-source reranker can be used, or an open solution from Nvidia in production environments [15:22:00], [17:24:00].
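A hedged sketch of reranking with Cohere’s hosted reranker via the cohere Python SDK; the API key placeholder, model name, and documents are assumptions for illustration.

```python
# Rerank a small candidate set with Cohere's hosted reranker.
import cohere

co = cohere.Client(api_key="YOUR_API_KEY")

query = "Where can I get help in London?"
docs = [
    "Call 116 123 to speak to a Samaritan in London.",
    "The London Underground opened in 1863.",
    "Citizens Advice offers free, confidential help across the UK.",
]

response = co.rerank(model="rerank-english-v3.0", query=query, documents=docs, top_n=2)
for result in response.results:
    print(result.relevance_score, docs[result.index])
```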
Understanding Bi-Encoders
Bi-encoders address the scalability issues of cross-encoders by splitting the encoding process [10:14:00].
Mechanism
- Instead of one model, two separate encoders are used [10:24:00].
- One encoder (BERT layer, pooling, embedding layer) processes the query [10:27:00].
- Another separate encoder (BERT layer, pooling, embedding layer) processes the document [10:33:00].
- The similarity between the query and document embeddings is then compared using metrics like cosine similarity [10:42:00] (see the sketch below).
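A minimal bi-encoder sketch with sentence-transformers. In practice a single shared-weight model is often used to encode queries and documents independently; here it is the open BAAI/bge-small-en-v1.5 model referenced later in these notes, and the documents are illustrative.

```python
# Bi-encoder retrieval: encode query and documents separately, compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

docs = [
    "Call 116 123 to speak to a Samaritan in London.",
    "The London Underground opened in 1863.",
]
doc_embeddings = model.encode(docs)  # can be pre-computed and stored offline
query_embedding = model.encode("Where can I get help in London?")

similarities = util.cos_sim(query_embedding, doc_embeddings)
print(similarities)  # one cosine-similarity score per document
```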
Advantages
- Fast and Scalable: By separating the models, this approach allows for faster and more scalable information retrieval [10:54:00]. This is because document embeddings can be pre-computed and stored, enabling quick similarity searches.
Placement in RAG Pipeline
A bi-encoder is ideal for the initial retrieval step, typically running where the vector data is stored (i.e., in the vector database) [11:19:00]. It efficiently compares the query against many documents to retrieve a set of relevant candidates [11:16:00]. An example of an open bi-encoder solution is the bge-small model from BAAI [12:42:00].
Integrating Encoders into RAG
A robust RAG solution typically combines both bi-encoders and cross-encoders (an end-to-end sketch follows this list):
- Bi-encoder for Initial Retrieval: The bi-encoder, often part of the vector database (e.g., Qdrant), quickly retrieves a larger set of potentially relevant documents based on semantic similarity [11:19:00], [11:51:00]. This is a fast and scalable step.
- Cross-encoder for Re-ranking: The smaller set of retrieved documents is then passed through a cross-encoder (reranker) for a more fine-grained similarity assessment [11:33:00]. This improves the accuracy of the final selection of documents provided to the LLM [09:08:00].
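An end-to-end sketch of this retrieve-then-rerank pattern, using an in-memory corpus in place of a full vector database; the model names are common public choices and the documents are illustrative, not taken from the talk.

```python
# Step 1: bi-encoder narrows the corpus to candidates. Step 2: cross-encoder reorders them.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Call 116 123 to speak to a Samaritan in London.",
    "Citizens Advice offers free, confidential help across the UK.",
    "The London Underground opened in 1863.",
]
corpus_embeddings = bi_encoder.encode(corpus)  # pre-computed once, stored in the vector DB

query = "Where can I get help in London?"

# Fast, scalable retrieval with the bi-encoder (top 2 candidates here).
hits = util.semantic_search(bi_encoder.encode([query]), corpus_embeddings, top_k=2)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Accurate reranking of the small candidate set with the cross-encoder.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # best document to pass to the LLM
```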
Practical Implementation Notes
- Prototyping: Google Colab is useful for prototyping RAG solutions due to its free hardware accelerators [00:57:00].
- Orchestration: LlamaIndex and LangGraph are options for the orchestration layer [01:25:00]. LlamaIndex can quickly set up a basic RAG solution [07:57:00].
- Embedding Models: Both closed (e.g., OpenAI’s text-embedding-ada-002 or text-embedding-3-large) and open (e.g., Nvidia, BAAI) embedding models can be used [01:31:00], [12:15:00].
- Vector Database: Qdrant is highlighted for its excellent scalability from a few documents to hundreds of thousands [01:51:00], [17:01:00] (a minimal usage sketch follows this list).
- Language Models: Closed models like OpenAI’s GPT-3.5 Turbo or GPT-4 offer simplicity via APIs [02:04:00], [13:22:00]. Open models include those from Meta or Qwen [02:12:00]. For production with Docker, Ollama or Hugging Face’s Text Generation Inference (TGI) can serve models such as Llama 3.2 or a small Qwen model in the 3–4B-parameter range [02:25:00].
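A minimal sketch of storing and querying embeddings with Qdrant in in-memory mode (swap in a server URL for production). It assumes the qdrant-client and sentence-transformers packages; the collection name and documents are illustrative.

```python
# Index documents in Qdrant and run a semantic search against them.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
client = QdrantClient(":memory:")  # in-memory mode for prototyping

docs = [
    "Call 116 123 to speak to a Samaritan in London.",
    "Citizens Advice offers free, confidential help across the UK.",
]

client.create_collection(
    collection_name="help_docs",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=Distance.COSINE,
    ),
)
client.upsert(
    collection_name="help_docs",
    points=[
        PointStruct(id=i, vector=encoder.encode(doc).tolist(), payload={"text": doc})
        for i, doc in enumerate(docs)
    ],
)

hits = client.search(
    collection_name="help_docs",
    query_vector=encoder.encode("Where can I get help in London?").tolist(),
    limit=2,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```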
Monitoring, Tracing, and Evaluation
- Monitoring and Tracing: Tools like LangSmith or Arize Phoenix are crucial for troubleshooting and understanding component performance in a RAG solution [02:41:00], [16:16:00].
- Evaluation: Frameworks like Ragas are essential for testing the quality of RAG solutions across a larger set of questions, using LLMs as judges to make the evaluation process smoother [03:24:00], [16:45:00], [17:34:00] (a minimal evaluation sketch follows this list).
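A minimal evaluation sketch with Ragas, assuming its 0.1-style evaluate API, a Hugging Face Dataset with the expected columns, and an OpenAI key for the judge LLM; the sample question, answer, and contexts are illustrative.

```python
# Score one RAG interaction with a few standard Ragas metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

data = {
    "question": ["Where can I get help in London?"],
    "answer": ["You can call the Samaritans on 116 123."],
    "contexts": [["Call 116 123 to speak to a Samaritan in London."]],
    "ground_truth": ["Samaritans can be reached on 116 123."],
}

# Ragas uses an LLM as the judge, so OPENAI_API_KEY (or another configured LLM) is required.
results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```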
For production, Docker Compose can be used to set up images for data ingestion, the vector database (Qdrant), the front-end app, LLM serving (Ollama or Hugging Face TGI), tracing (Phoenix), and evaluation (Ragas) [17:46:00].