From: aidotengineer

This article explores the roles of vector databases and embedding models within AI systems, particularly focusing on Retrieval Augmented Generation (RAG) solutions.

Understanding RAG and Component Breakdown

Retrieval Augmented Generation (RAG) is a process that involves an initial user query, followed by a retrieval step where a semantic search is performed through a vector database to retrieve relevant documents based on the query [04:08:00]. The original query is then combined with the retrieved information to provide context to the language model for the generation piece [04:31:00].

Components of a RAG stack typically include:

  • An orchestration framework (e.g., LlamaIndex or LangChain)
  • An embedding model (closed or open)
  • A vector database (e.g., Qdrant)
  • A reranker for post-retrieval accuracy
  • A language model for generation
  • Monitoring, tracing, and evaluation tooling

For prototyping, Google Colab is often used because it provides free hardware accelerators [00:57:00]. For production environments, especially at financial institutions that require on-premise data processing, Docker is a common solution [01:10:00].

Embedding Models

Embedding models are crucial for converting queries and documents into a format that allows for semantic comparison [09:28:00].

Types of Embedding Models

  • Closed Models: These are typically accessed via APIs, which makes them simple to use [01:34:00]. Examples include OpenAI’s text-embedding-ada-002 (the default embedding model in LlamaIndex) [01:15:00] and text-embedding-3-large [01:21:00].
  • Open Models: These can be downloaded and run locally, providing more control [01:40:00]. Examples include models from Nvidia and BAAI [01:41:00], notably BAAI’s BGE small model, available on Hugging Face [01:42:00]; a short usage sketch follows this list.
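
As a minimal sketch of the open-model route, BGE small can be loaded and run locally with the sentence-transformers library (the BAAI/bge-small-en-v1.5 checkpoint name is an assumption; the talk does not specify a version):

```python
from sentence_transformers import SentenceTransformer

# Download the BGE small model from Hugging Face and run it locally.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Embed a query and a document into the same vector space.
embeddings = model.encode(
    ["What is a vector database?", "Qdrant stores and searches vector embeddings."],
    normalize_embeddings=True,  # unit-length vectors: dot product == cosine similarity
)
print(embeddings.shape)  # (2, 384) -- BGE small produces 384-dimensional vectors
```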

Cross-Encoder vs. Bi-Encoder

The choice of encoder impacts accuracy and scalability within a RAG pipeline.

  • Cross-Encoder:

    • Semantically compares a query with a document by sending both to a BERT model and then to a classifier [09:30:00].
    • Produces a result between 0 and 1 indicating semantic similarity [09:46:00].
    • Excellent for additional accuracy [10:04:00], but is slow and not scalable, especially with larger documents [10:07:00].
    • Best used post-retrieval as a “reranker”: its accuracy pays off on a small candidate set, while its inability to scale rules out comparing the query against many documents [11:29:00] (see the combined sketch after this list).
    • An example of a closed model for reranking is Cohere’s solution [03:11:00] [15:22:00]. Nvidia offers an open solution for reranking [03:17:00].
  • Bi-Encoder:

    • Uses two separate encoders (BERT models), one for the query and one for the document, each with pooling and embedding layers [10:25:00].
    • Compares the query and document embeddings using metrics like cosine similarity [10:47:00].
    • This separation makes it a fast and scalable solution [10:54:00].
    • Excellent for information retrieval, making it suitable for the vector database component where multiple documents are compared [10:57:00].
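
The following sketch contrasts the two approaches on a toy corpus, assuming the sentence-transformers library and two public checkpoints (BAAI/bge-small-en-v1.5 as the bi-encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the cross-encoder; both are illustrative choices, not models named in the source):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I scale semantic search?"
docs = [
    "Bi-encoders embed queries and documents independently, so document vectors can be precomputed.",
    "Cross-encoders score each query-document pair jointly and are too slow for large corpora.",
    "Qdrant is a vector database for storing and searching embeddings.",
]

# Bi-encoder: encode query and documents separately, compare via cosine similarity.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_emb = bi_encoder.encode(docs, normalize_embeddings=True)
query_emb = bi_encoder.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_emb, doc_emb)[0]
top = scores.argsort(descending=True)[:2].tolist()  # fast, scalable first stage

# Cross-encoder: jointly score the query against the retrieved candidates only.
# Single-label models apply a sigmoid by default, so scores land between 0 and 1.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
reranked = cross_encoder.predict([(query, docs[i]) for i in top])
print(reranked)
```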

Vector Databases

A vector database stores vector embeddings of documents, allowing for semantic search and retrieval of relevant documents based on a query’s embedding [04:23:00].

Qdrant

Qdrant is highlighted as an excellent vector database solution because it scales very well, handling anything from a handful of documents to hundreds of thousands [01:52:00] [01:56:00]. It is a preferred choice for both prototyping and production environments [02:00:00] [17:01:00].
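
A minimal sketch of storing and searching embeddings with the qdrant-client Python package (the collection name and dummy vectors are illustrative; the 384-dimensional size matches the BGE small model mentioned above):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance for prototyping; point at a server URL in production.
client = QdrantClient(":memory:")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert precomputed document embeddings with their source text as payload.
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"text": "First document"}),
        PointStruct(id=2, vector=[0.2] * 384, payload={"text": "Second document"}),
    ],
)

# Semantic search: retrieve the documents closest to a query embedding.
hits = client.search(collection_name="documents", query_vector=[0.1] * 384, limit=3)
for hit in hits:
    print(hit.score, hit.payload["text"])
```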

Integration in RAG Pipeline

In a naive RAG solution, the query is embedded and compared against the document embeddings in a vector database; the most relevant documents are retrieved and passed, along with the query, to a language model for response generation [04:59:00].
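
A minimal end-to-end sketch of this naive flow, reusing the Qdrant collection from above and an OpenAI-style chat client (the model name, collection contents, and prompt are illustrative assumptions):

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
qdrant = QdrantClient(url="http://localhost:6333")  # assumes the collection built above
llm = OpenAI()  # illustrative; any chat-completion endpoint works here

query = "What does our refund policy say?"

# 1. Embed the query and retrieve the most relevant documents.
hits = qdrant.search(
    collection_name="documents",
    query_vector=embedder.encode(query, normalize_embeddings=True).tolist(),
    limit=3,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)

# 2. Combine the retrieved context with the original query for generation.
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```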

For a more sophisticated RAG solution, a query processing step can be added (e.g., to remove personally identifiable information) [08:49:00]. A post-retrieval step can also be used to improve the accuracy of retrieved documents [09:08:00].
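
As one illustrative query-processing step, a simple regex scrub can remove obvious PII before the query is embedded or logged (a hypothetical helper; production systems would use a dedicated PII-detection service):

```python
import re

# Hypothetical pre-retrieval step: strip obvious PII before the query is
# embedded, logged, or sent to any external API.
def scrub_pii(query: str) -> str:
    query = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", query)   # email addresses
    query = re.sub(r"\+?\d(?:[\s-]?\d){6,14}", "[PHONE]", query)   # phone numbers
    return query

print(scrub_pii("My email is jane.doe@example.com, call +1 555-123-4567"))
# -> "My email is [EMAIL], call [PHONE]"
```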

Component Workflow in RAG

  • Orchestration: Tools like LlamaIndex or LangChain (LangGraph) are used for orchestrating the RAG pipeline. LlamaIndex is suggested for both prototyping and production [01:25:00].
  • Embedding Models: An open model, such as BAAI’s BGE small, can be downloaded and set as the default embedding model in tools like LlamaIndex [01:40:00] [01:56:00].
  • Vector Database: Qdrant is integrated to store and retrieve document embeddings [01:51:00].
  • Reranking (Cross-Encoder): After initial retrieval from the vector database, a reranker (cross-encoder) like Cohere’s model or Nvidia’s open solution can be applied to improve accuracy by re-ranking the top results [03:07:00] [15:18:00]. A sketch wiring these pieces together follows this list.
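
A sketch of how these components might be wired together with LlamaIndex (import paths follow recent llama-index releases and vary by version; the Cohere reranker is one of the options mentioned above and requires an API key):

```python
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Open embedding model (BGE small) in place of the closed-model default.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Qdrant as the vector database.
client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="documents")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest documents and build the index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Retrieve a larger candidate set, then rerank the top results with a cross-encoder.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[CohereRerank(top_n=3)],  # requires a Cohere API key
)
print(query_engine.query("What does the onboarding document say about security?"))
```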

Monitoring and Evaluation

Monitoring and tracing solutions, such as LangSmith or Arize Phoenix, are crucial for troubleshooting and understanding performance (e.g., where most time is spent) [02:41:00] [03:29:00]. RAG evaluation frameworks like Ragas are used to test the quality of RAG solutions across many documents [03:24:00] [16:45:00].
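
A minimal sketch of a Ragas evaluation run (the sample row is invented for illustration; Ragas uses an LLM judge under the hood, by default via an OpenAI key):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One illustrative evaluation row; real runs cover many queries and documents.
data = Dataset.from_dict({
    "question": ["What is Qdrant used for?"],
    "answer": ["Qdrant stores document embeddings for semantic search."],
    "contexts": [["Qdrant is a vector database that stores and searches embeddings."]],
})

# Scores faithfulness (is the answer grounded in the retrieved context?)
# and answer relevancy (does the answer address the question?).
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```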

Production Environment

A production environment often uses Docker Compose with separate Docker images for the components of the stack, such as the application itself, the Qdrant vector database, the language model, and the monitoring and evaluation tooling.

Embedding models and reranking models are specified within this Docker setup [18:31:00].
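
As one common pattern (an assumption, not a detail from the source), the Compose file can pass the model choices into the application image as environment variables, so models can be swapped without rebuilding:

```python
import os

# Hypothetical startup code in the application image: read the embedding and
# reranking model names that docker-compose.yml sets as environment variables.
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
RERANK_MODEL = os.environ.get("RERANK_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")

print(f"Embedding with {EMBEDDING_MODEL}; reranking with {RERANK_MODEL}")
```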