From: aidotengineer

Jonathan Fernandez, an independent AI engineer, specializes in helping companies build and ship production-ready generative AI solutions [00:00:11]. His approach to developing AI systems, particularly Retrieval Augmented Generation (RAG) stacks, involves a distinct separation between prototyping and production environments [00:00:51].

Development Workflow

The typical product development process involves two main phases:

  • Prototyping: Often conducted in Google Colab, which provides free hardware accelerators and makes prototyping easy [00:00:53].
  • Production: Frequently uses Docker for deployment, allowing solutions to run either on-premise (common for financial institutions with data residency requirements) or in the cloud [00:01:04].

RAG Stack Components and Tooling

Based on learnings from 37 prior attempts, a robust RAG stack utilizes specific tools tailored for each phase [00:00:35].

| Component | Prototyping Tools | Production Tools | Notes |
| --- | --- | --- | --- |
| Orchestration Layer | LlamaIndex, LangChain / LangGraph | LlamaIndex | LlamaIndex is noted for its ability to create a basic RAG solution in a few lines of code [00:07:57] (see the sketch after this table). |
| Embedding Models | Closed (APIs, e.g., OpenAI) | Open (e.g., NVIDIA, BAAI) | Closed models simplify usage via APIs [00:01:34]. Open models such as BAAI's BGE small model can be downloaded from Hugging Face and used directly [00:12:30]. |
| Vector Database | Qdrant (in-memory for prototyping) | Qdrant | Qdrant is an excellent choice for its scalability, handling anything from a handful of documents to hundreds of thousands [00:01:51]. |
| Large Language Model | Closed (APIs, e.g., OpenAI) | Open (e.g., Meta Llama, Alibaba Cloud Qwen) served by Ollama or Hugging Face Text Generation Inference | Closed models offer simplicity via APIs [00:02:07]. In production, open models are served with tools like Ollama or Hugging Face’s Text Generation Inference engine, typically inside Docker [00:02:23]. |
| Monitoring and Tracing | LangSmith, Arize Phoenix | Arize Phoenix (Docker container) | Essential for troubleshooting, understanding where time is spent, and tracking the RAG pipeline [00:02:41]. Arize Phoenix has a good Docker solution [00:17:18]. |
| Re-ranking / Accuracy | Closed (e.g., Cohere) | Open (e.g., NVIDIA) | Re-ranking improves the accuracy of the RAG solution [00:03:07]. Cross-encoders, also known as re-rankers, are placed post-retrieval for additional accuracy, though they are slower and less scalable than bi-encoders [00:11:30]. |
| Evaluation Framework | Ragas | Ragas | Crucial for evaluating the quality of the RAG solution across various metrics; works well with Large Language Models, making evaluation relatively painless [00:03:24], [00:17:34]. |
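
For reference, the “few lines of code” claim looks roughly like the following LlamaIndex sketch. This is a minimal illustration, not the speaker’s exact code: it assumes a local data/ folder containing the knowledge base and an OpenAI API key for LlamaIndex’s default embedding model and LLM, and the import paths follow recent llama-index releases.

```python
# Minimal LlamaIndex RAG sketch (assumes `pip install llama-index` and OPENAI_API_KEY set).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # load the knowledge base
index = VectorStoreIndex.from_documents(documents)     # embed and index the documents
query_engine = index.as_query_engine()                 # retrieval + generation pipeline
print(query_engine.query("Where can I get help in London?"))
```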

Understanding RAG: Retrieval Augmented Generation

RAG is a process to enhance the generation capabilities of Large Language Models (LLMs) by providing them with relevant context from an external knowledge base [00:04:08].

The process involves:

  1. Retrieval: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:18].
  2. Augmentation: The original query is combined with the retrieved information from the vector database, providing crucial context to the LLM [00:04:31].
  3. Generation: With the additional context, the LLM can produce a more accurate and relevant response to the original question [00:04:45].

A naive RAG solution embeds the query, compares it to documents in a vector database, retrieves relevant documents, and passes them with the query to the LLM for response generation [00:04:58].
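
The naive pipeline can be sketched end to end as below. This is a minimal illustration under stated assumptions, not the speaker’s code: it uses the open BAAI/bge-small-en-v1.5 embedding model, an in-memory Qdrant instance, and the OpenAI API as a stand-in for whichever closed or open LLM is actually served; the sample documents are made up.

```python
# Naive RAG sketch: embed -> retrieve -> augment -> generate.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

documents = [
    "The London support centre is open Monday to Friday at 10 Example Street.",  # made-up sample data
    "Paris enquiries are handled by the rue Exemple office.",
]

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # open embedding model from Hugging Face
client = QdrantClient(":memory:")                         # in-memory Qdrant, as used for prototyping
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # bge-small produces 384-dim vectors
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=embedder.encode(doc).tolist(), payload={"text": doc})
        for i, doc in enumerate(documents)
    ],
)

# 1. Retrieval: embed the query and run a semantic search over the vector database
query = "Where can I get help in London?"
hits = client.search(collection_name="docs", query_vector=embedder.encode(query).tolist(), limit=2)

# 2. Augmentation: combine the query with the retrieved context
context = "\n".join(hit.payload["text"] for hit in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 3. Generation: let the LLM answer with the additional context
llm = OpenAI()  # requires OPENAI_API_KEY; swap in any other served model
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, purely illustrative
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```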

Improving RAG Accuracy

To enhance a RAG solution, additional steps can be incorporated:

  • Query Processing: Steps that refine the query before retrieval, such as removing Personally Identifiable Information (PII) [00:08:49]; a toy sketch follows this list.
  • Post-Retrieval Processing: Improving the accuracy of the retrieved documents before they are passed to the LLM [00:09:06], most often by adding a re-ranker (see Cross-Encoders vs. Bi-Encoders below).
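
As a toy illustration of the query-processing step, the sketch below masks obvious email addresses and phone numbers with regular expressions before the query is embedded. The patterns are simplistic assumptions; a production system would more likely rely on a dedicated PII-detection service.

```python
# Toy PII-scrubbing step applied to a user query before retrieval.
import re

def scrub_pii(query: str) -> str:
    """Mask obvious email addresses and phone-like numbers in a user query."""
    query = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", query)   # crude email pattern
    query = re.sub(r"\+?\d[\d\s-]{7,}\d", "[PHONE]", query)        # crude phone pattern
    return query

print(scrub_pii("I'm jane@example.com, call me on +44 20 7946 0958 - where can I get help in London?"))
```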

Cross-Encoders vs. Bi-Encoders

  • Cross-Encoder: Semantically compares a query with a document by sending both through a BERT model and then a classifier, yielding a similarity score between 0 and 1 [00:09:25]. It delivers excellent additional accuracy but is slow and does not scale to large document collections [00:10:04], so it is best used as a re-ranker post-retrieval on a limited number of documents [00:11:25].
  • Bi-Encoder: Uses two separate encoder models (e.g., two BERT-based encoders), one for the query and one for the document, and compares their embeddings with cosine similarity [00:10:14]. This approach is fast and scalable, making it an excellent fit for the initial information-retrieval step against the vector database [00:10:51]. A combined sketch of both follows this list.
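
The two approaches combine naturally: the bi-encoder does the fast first pass, and the cross-encoder re-scores only the few retrieved candidates. The sketch below uses the sentence-transformers library with illustrative model names (BAAI/bge-small-en-v1.5 and cross-encoder/ms-marco-MiniLM-L-6-v2), which are assumptions rather than the exact models from the talk.

```python
# Bi-encoder retrieval followed by cross-encoder re-ranking.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Where can I get help in London?"
docs = [
    "The London support centre is open Monday to Friday.",      # made-up sample data
    "Our Paris office handles French enquiries.",
    "Visit the help desk at the London office for assistance.",
]

# Bi-encoder: fast, scalable first-pass retrieval via cosine similarity
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
top_hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Cross-encoder: slower but more accurate, applied only to the retrieved few
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [docs[hit["corpus_id"]] for hit in top_hits]
scores = cross_encoder.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```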

Production Environment Setup with Docker

A typical production setup for an AI solution, particularly a RAG pipeline, uses a docker-compose.yaml file to orchestrate multiple Docker images as containers [00:17:46]; a minimal sketch appears after the list of images below.

Key Docker images for a RAG solution include:

  • Data Ingestion: An image for ingesting data from the knowledge base (e.g., HTML files) [00:17:57].
  • Vector Database: An image for Qdrant, pulled from Docker Hub [00:18:03].
  • Front-end Application: For user interaction [00:18:09].
  • LLM Serving: Ollama or Hugging Face’s Text Generation Inference engine to serve large language models [00:18:12].
  • Tracing/Monitoring: Arize Phoenix for tracking and troubleshooting [00:18:19].
  • Evaluation: Ragas for continuously evaluating model quality [00:18:22].
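
A minimal docker-compose.yaml along these lines might look as follows. Service names, image tags, and ports are illustrative assumptions rather than the exact configuration from the talk, and the data-ingestion and Ragas evaluation services are omitted for brevity (the local app service stands in for the front end).

```yaml
# Minimal compose sketch for a RAG stack: vector DB, LLM server, tracing, and the app itself.
services:
  qdrant:                        # vector database
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
  ollama:                        # serves the open LLM
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
  phoenix:                       # Arize Phoenix tracing and monitoring UI
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"
  app:                           # front-end / RAG API, built from a local Dockerfile (assumed)
    build: .
    depends_on:
      - qdrant
      - ollama
      - phoenix
```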

This structured approach allows AI solutions to scale robustly in production and supports the effective design of AI in products.