From: aidotengineer
Jonathan Fernandez, an independent AI engineer, specializes in helping companies build and ship production-ready generative AI solutions [00:00:11]. His approach to developing AI systems, particularly Retrieval Augmented Generation (RAG) stacks, involves a distinct separation between prototyping and production environments [00:00:51].
Development Workflow
The typical product development process involves two main phases:
- Prototyping: Often conducted in Google Colab, which provides free hardware accelerators and makes prototyping easy [00:00:53].
- Production: Frequently uses Docker for deployment, allowing solutions to run either on-premise (common for financial institutions with data residency requirements) or in the cloud [00:01:04].
RAG Stack Components and Tooling
Based on learnings from 37 prior attempts, a robust RAG stack utilizes specific tools tailored for each phase [00:00:35].
| Component | Prototyping Tools | Production Tools | Notes |
|---|---|---|---|
| Orchestration Layer | LlamaIndex, LangChain / LangGraph | LlamaIndex | LlamaIndex is noted for its ability to create a basic RAG solution in a few lines of code [00:07:57]. |
| Embedding Models | Closed (APIs, e.g., OpenAI) | Open (e.g., NVIDIA, BAAI) | Closed models simplify usage via APIs [00:01:34]. Open models such as the BAAI BGE small model can be downloaded from Hugging Face and run locally [00:12:30]. |
| Vector Database | Qdrant (in-memory for prototyping) | Qdrant | Qdrant is an excellent choice for its scalability, handling anything from a few documents to hundreds of thousands [00:01:51]. |
| Large Language Model | Closed (APIs, e.g., OpenAI) | Open (e.g., Meta Llama, Alibaba Cloud Qwen) served by Ollama or Hugging Face Text Generation Inference | Closed models offer simplicity via APIs [00:02:07]. In production, open models are served with tools like Ollama or Hugging Face's Text Generation Inference engine, typically inside Docker [00:02:23]. |
| Monitoring and Tracing | LangSmith, Arize Phoenix | Arize Phoenix (Docker container) | Essential for troubleshooting, understanding where time is spent, and tracking the RAG pipeline [00:02:41]. Arize Phoenix has a good Docker solution [00:17:18]. |
| Re-ranking / Accuracy | Closed (e.g., Cohere) | Open (e.g., NVIDIA) | Re-ranking improves the accuracy of the RAG solution [00:03:07]. Cross-encoders, also known as re-rankers, are applied post-retrieval for additional accuracy, though they are slower and less scalable than bi-encoders [00:11:30]. |
| Evaluation Framework | Ragas | Ragas | Crucial for evaluating the quality of the RAG solution across various metrics; it works well with Large Language Models, making the task relatively painless [00:03:24], [00:17:34]. |
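As noted in the orchestration row above, LlamaIndex can stand up a basic RAG pipeline in a few lines of code. The snippet below is a minimal prototype sketch, assuming the knowledge base sits in a local data/ directory (an illustrative path) and that an OpenAI API key is available for the default closed embedding and chat models:

```python
# Minimal LlamaIndex RAG prototype (sketch).
# Assumes: `pip install llama-index`, OPENAI_API_KEY set in the environment,
# and a local "data/" folder holding the knowledge-base files.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the knowledge base from disk.
documents = SimpleDirectoryReader("data").load_data()

# Embed the documents and build an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieval + augmentation + generation behind a single query call.
query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```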
Understanding RAG: Retrieval Augmented Generation
RAG is a process to enhance the generation capabilities of Large Language Models (LLMs) by providing them with relevant context from an external knowledge base [00:04:08].
The process involves:
- Retrieval: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:18].
- Augmentation: The original query is combined with the retrieved information from the vector database, providing crucial context to the LLM [00:04:31].
- Generation: With the additional context, the LLM can produce a more accurate and relevant response to the original question [00:04:45].
A naive RAG solution embeds the query, compares it to documents in a vector database, retrieves relevant documents, and passes them with the query to the LLM for response generation [00:04:58].
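A hand-rolled sketch of this naive pipeline, assuming an in-memory Qdrant instance, the BAAI BGE small embedding model, and an illustrative OpenAI chat model (model names, documents, and prompt are examples, not a prescribed setup):

```python
# Naive RAG: embed the query, retrieve similar documents from Qdrant,
# then pass query + retrieved context to an LLM. Sketch only.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings
qdrant = QdrantClient(":memory:")  # in-memory instance for prototyping
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Index a toy knowledge base.
docs = ["Citizens Advice offers free help in London.", "The museum opens at 9am."]
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=embedder.encode(d).tolist(), payload={"text": d})
        for i, d in enumerate(docs)
    ],
)

# Retrieval: semantic search with the embedded query.
query = "Where can I get help in London?"
hits = qdrant.search(
    collection_name="docs", query_vector=embedder.encode(query).tolist(), limit=2
)
context = "\n".join(hit.payload["text"] for hit in hits)

# Augmentation + generation: pass query and retrieved context to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
llm = OpenAI()  # requires OPENAI_API_KEY
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative closed model
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```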
Improving RAG Accuracy
To enhance a RAG solution, additional steps can be incorporated:
- Query Processing: Steps to refine the query, such as removing Personally Identifiable Information (PII) [00:08:49] (see the sketch after this list).
- Post-Retrieval Processing: Improving the accuracy of retrieved documents before passing them to the LLM [00:09:06]. This often involves re-rankers.
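A minimal sketch of such a query-processing step, assuming simple regular-expression scrubbing of email addresses and phone numbers before the query is embedded (the patterns are illustrative; production systems typically rely on dedicated PII-detection tooling):

```python
# Sketch of a query-processing step: strip obvious PII before embedding the query.
# Regex patterns are illustrative, not an exhaustive PII filter.
import re

def scrub_query(query: str) -> str:
    query = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", query)   # email addresses
    query = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", query)     # phone-like numbers
    return query

print(scrub_query("I'm jane.doe@example.com, call +44 20 7946 0958 - where can I get help in London?"))
```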
Cross-Encoders vs. Bi-Encoders
- Cross-Encoder: Semantically compares a query with a document by passing both through a BERT model and then a classifier, yielding a similarity score between 0 and 1 [00:09:25]. While excellent for additional accuracy, it is slow and does not scale to large document collections [00:10:04]. It is best used as a re-ranker post-retrieval, over a limited number of documents [00:11:25].
- Bi-Encoder: Uses two separate encoder models (e.g., two BERT models) to embed the query and the document independently, then compares the embeddings with cosine similarity [00:10:14]. This approach is fast and scalable, making it excellent for the information retrieval step against the vector database [00:10:51].
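The two-stage pattern can be sketched with the sentence-transformers library, using illustrative bi-encoder and cross-encoder checkpoints: the fast bi-encoder narrows the corpus to a few candidates, and the slower cross-encoder re-ranks only that short list.

```python
# Two-stage retrieval: bi-encoder for fast recall, cross-encoder for re-ranking.
# Sketch only; the model checkpoints are illustrative choices.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where can I get help in London?"
documents = [
    "Citizens Advice offers free help in London.",
    "The weather in Paris is mild in spring.",
    "London boroughs run local support centres.",
]

# Stage 1 (bi-encoder): embed query and documents independently, rank by cosine similarity.
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2].tolist()  # keep a small candidate set

# Stage 2 (cross-encoder): score each (query, candidate) pair jointly for accuracy.
candidates = [documents[i] for i in top_k]
rerank_scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)
print(reranked)
```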
Production Environment Setup with Docker
A typical production setup for an AI solution, particularly a RAG pipeline, uses a docker-compose.yaml file to orchestrate multiple Docker images as containers [00:17:46] (a minimal sketch appears after the list below).
Key Docker images for a RAG solution include:
- Data Ingestion: An image for ingesting data from the knowledge base (e.g., HTML files) [00:17:57].
- Vector Database: An image for Qdrant, pulled from Docker Hub [00:18:03].
- Front-end Application: For user interaction [00:18:09].
- LLM Serving: Ollama or Hugging Face’s Text Generation Inference engine to serve large language models [00:18:12].
- Tracing/Monitoring: Arize Phoenix for tracking and troubleshooting [00:18:19].
- Evaluation: Ragas for continuously evaluating the quality of the RAG solution [00:18:22].
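A minimal docker-compose.yaml sketch for these services, with illustrative image names, ports, and build paths (the ingestion, front-end, and evaluation images are placeholders for custom builds, not a definitive configuration):

```yaml
# Sketch of a docker-compose.yaml for the services above; names, ports,
# and build paths are illustrative assumptions.
services:
  ingestion:
    build: ./ingestion           # custom image that loads HTML files into Qdrant
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant         # vector database from Docker Hub
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend            # user-facing application
    ports:
      - "8080:8080"
  ollama:
    image: ollama/ollama         # serves the open LLM
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix  # tracing and monitoring
    ports:
      - "6006:6006"
  evaluation:
    build: ./evaluation          # runs Ragas quality checks against the pipeline
    depends_on:
      - qdrant
      - ollama
```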
This structured approach supports robust scaling of AI solutions in production and effective design of AI-powered products.