From: aidotengineer

Embedding models and vector databases are crucial components in a Retrieval Augmented Generation (RAG) stack [00:00:46]. They enable the system to perform semantic searches and retrieve relevant information effectively.

Embedding Models

Embedding models transform text (or other data) into numerical vectors (embeddings) that capture their semantic meaning. This allows for efficient comparison and retrieval of semantically similar items.
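
As a toy illustration (not from the talk), two embeddings can be compared with cosine similarity; real models produce vectors with hundreds of dimensions, but the arithmetic is the same:

```python
import numpy as np

# Made-up 3-dimensional "embeddings" for illustration; real models output
# hundreds of dimensions (e.g. 384 for BGE small, 3072 for text-embedding-3-large).
query_vec = np.array([0.1, 0.8, 0.3])
doc_vec = np.array([0.2, 0.7, 0.4])

# Cosine similarity: close to 1.0 means the vectors point in a similar
# direction, i.e. the texts are semantically similar.
similarity = np.dot(query_vec, doc_vec) / (
    np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
)
print(f"cosine similarity: {similarity:.3f}")
```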

Types of Embedding Models

  • Closed Models: These models are often used via APIs, which simplifies their integration and use [00:01:34]. Examples mentioned are OpenAI’s text-embedding-ada-002 and text-embedding-3-large [00:12:15].
  • Open Models: These models can be downloaded and run locally, which gives more control and is often preferred for production environments [00:01:40]. Examples include models from Nvidia or BAAI [00:01:42]. The BGE small model from Hugging Face is specifically mentioned as an open solution [00:12:42]; a minimal sketch of loading both kinds of model follows this list.
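
As a rough sketch of the difference, both kinds of model can be loaded through LlamaIndex; this assumes the llama-index-embeddings-openai and llama-index-embeddings-huggingface integration packages are installed, and an OpenAI API key is available for the closed model:

```python
# Closed model, called via the OpenAI API (needs OPENAI_API_KEY set).
from llama_index.embeddings.openai import OpenAIEmbedding
closed_embed = OpenAIEmbedding(model="text-embedding-3-large")

# Open model, downloaded from Hugging Face and run locally.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
open_embed = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Both expose the same interface for turning text into a vector.
vector = open_embed.get_text_embedding("What is a vector database?")
print(len(vector))  # BGE small produces 384-dimensional embeddings
```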

How Embedding Models Work

Two main types of encoders are discussed in the context of semantic comparison; a minimal sketch of both appears after this list:

  • Cross-Encoder:
    • Objective: To semantically compare a query with a document [00:09:30].
    • Process: The query and the document are passed together through a BERT model (the encoder from the original transformer architecture) and then to a classifier [00:09:36].
    • Output: A score between 0 and 1 indicating semantic similarity [00:09:46].
    • Characteristics: Excellent for additional accuracy, but slow and not scalable, especially with larger documents [00:10:04]. It is best used post-retrieval (as a re-ranker) because it is only practical with a small number of documents [00:11:30].
  • Bi-Encoder:
    • Objective: To provide a fast and scalable solution for information retrieval [00:10:54].
    • Process: Uses two separate encoders (BERT layers) for the query and the document, each followed by pooling and embedding layers [00:10:25]. The similarity between the query and document embeddings is then compared using metrics like cosine similarity [00:10:42].
    • Characteristics: Fast and scalable, making it excellent for information retrieval [00:10:54]. It’s ideally placed where the vector data is, comparing the query with multiple documents [00:11:19].
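
A minimal sketch of the two architectures using the sentence-transformers library (the specific model names are illustrative, not taken from the talk):

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I deploy a vector database in production?"
docs = [
    "Qdrant can be run as a Docker container for production deployments.",
    "BERT is an encoder-only transformer model.",
]

# Bi-encoder: embed query and documents independently and compare with
# cosine similarity -- fast and scalable, suited to the retrieval step.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))

# Cross-encoder: score each (query, document) pair jointly -- more accurate
# but slower, so it is applied post-retrieval to re-rank a few candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, d) for d in docs]))
```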

Vector Databases

Vector databases store embeddings and allow for efficient similarity search, which is a core part of the RAG retrieval step [00:04:23].

Role in RAG

The first and most important step in RAG is the retrieval step, where a semantic search is performed through a vector database to retrieve relevant documents based on the user’s query [00:04:18]. The retrieved information is then combined with the original query as context for the language model [00:04:31].
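
In pseudocode, the flow looks roughly like this; the retriever and llm objects are hypothetical placeholders for whatever retrieval and generation components the stack uses:

```python
def answer(query: str, retriever, llm) -> str:
    # 1. Retrieval: semantic search over the vector database.
    docs = retriever.retrieve(query)  # hypothetical retriever interface

    # 2. Augmentation: combine the retrieved text with the original query.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generation: the language model answers grounded in that context.
    return llm.complete(prompt)  # hypothetical LLM interface
```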

Specific Vector Database Solutions

  • Qdrant: An excellent solution for vector databases because it scales very well, handling anything from a few documents to hundreds of thousands [00:01:51]. It can be used as an in-memory solution for prototyping [00:11:57], and in a production environment it can be deployed via Docker [00:18:03]; a minimal sketch of both modes follows this list.
  • In-memory vector database: Used for simple prototyping setups in environments like Google Colab [00:08:15].
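
A minimal sketch of both modes with the qdrant-client package (collection setup and indexing omitted):

```python
from qdrant_client import QdrantClient

# Prototyping: in-memory instance, nothing to deploy.
client = QdrantClient(":memory:")

# Production: point the client at a Qdrant server started with Docker, e.g.
#   docker run -p 6333:6333 qdrant/qdrant
client = QdrantClient(url="http://localhost:6333")
```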

Querying a Vector Database

In a naive RAG solution, the user’s query is embedded and compared against the document embeddings stored in the vector database [00:05:02]. Relevant documents are retrieved and passed along with the query to the large language model as context for generating a response [00:05:11]. LlamaIndex can be used to read HTML files into documents and store them in an in-memory vector database for basic RAG functionality [00:07:57].
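
A minimal sketch of that naive setup with LlamaIndex; the ./data directory and the HTML file filter are illustrative, and the defaults assume an OpenAI API key for embedding and generation:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Read HTML files into Document objects.
documents = SimpleDirectoryReader("./data", required_exts=[".html"]).load_data()

# Build an in-memory vector index; querying it embeds the query, retrieves
# similar documents, and sends both to the LLM as context.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What does the talk say about vector databases?"))
```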

Example of an improved RAG pipeline using these components:

  1. Read HTML files from a directory using LlamaIndex’s simple directory reader [00:12:05].
  2. Use Qdrant as the vector database [00:11:57].
  3. Set the default embedding model (e.g., an open model like BGE small) [00:12:56].
  4. Optionally, configure the large language model (LLM), e.g. GPT-4 [00:13:35].
  5. Introduce a re-ranker (cross-encoder) in the post-retrieval step to improve accuracy by re-ranking the top results [00:15:16]. Cohere’s re-ranker is mentioned as a closed model for prototyping, while Nvidia’s solution is preferred for production [00:17:22]. A sketch of the full pipeline follows this list.
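
A sketch of these five steps with LlamaIndex; it assumes the llama-index-vector-stores-qdrant, llama-index-embeddings-huggingface, and llama-index-llms-openai integration packages are installed, and it substitutes an open sentence-transformers cross-encoder for the Cohere/Nvidia re-rankers mentioned in the talk:

```python
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# 1. Read HTML files from a directory.
documents = SimpleDirectoryReader("./data", required_exts=[".html"]).load_data()

# 2. Use Qdrant as the vector database (in-memory here; use a Docker-hosted
#    server in production).
client = QdrantClient(":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 3. Default embedding model: the open BGE small model.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# 4. Optionally configure the LLM.
Settings.llm = OpenAI(model="gpt-4")

# 5. Re-rank the retrieved candidates with a cross-encoder before generation.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)
print(query_engine.query("How do I scale the retrieval step?"))
```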