From: aidotengineer
Jonathan Fernandez, an independent AI engineer, specializes in helping companies build and ship production-ready generative AI solutions [00:00:11]. His approach to developing AI systems, particularly Retrieval Augmented Generation (RAG) stacks, involves a distinct separation between prototyping and production environments [00:00:51].
Development Workflow
The typical product development process involves two main phases:
- Prototyping: Often conducted in Google Colab, which provides free hardware accelerators and makes prototyping easy [00:00:53].
- Production: Frequently uses Docker for deployment, allowing solutions to run either on-premise (common for financial institutions with data residency requirements) or in the cloud [00:01:04].
RAG Stack Components and Tooling
Based on learnings from 37 prior attempts, a robust RAG stack utilizes specific tools tailored for each phase [00:00:35].
| Component | Prototyping Tools | Production Tools | Notes |
|---|---|---|---|
| Orchestration Layer | LlamaIndex, LangChain / LangGraph | LlamaIndex | LlamaIndex is noted for its ability to create a basic RAG solution in a few lines of code [00:07:57]. |
| Embedding Models | Closed (APIs, e.g., OpenAI) | Open (e.g., NVIDIA, BAAI) | Closed models simplify usage via APIs [00:01:34]. Open models such as the BAAI BGE small model can be downloaded from Hugging Face and run locally [00:12:30]. |
| Vector Database | Qdrant (in-memory for prototyping) | Qdrant | Qdrant is an excellent choice for its scalability, handling anything from a few documents to hundreds of thousands [00:01:51]. |
| Large Language Model | Closed (APIs, e.g., OpenAI) | Open (e.g., Meta Llama, Alibaba Cloud Qwen) served by Ollama or Hugging Face Text Generation Inference | Closed models offer simplicity via APIs [00:02:07]. In production, open models are served with tools like Ollama or Hugging Face's Text Generation Inference engine, typically inside Docker [00:02:23]. |
| Monitoring and Tracing | LangSmith, Arize Phoenix | Arize Phoenix (Docker container) | Essential for troubleshooting, understanding where time is spent, and tracking the RAG pipeline [00:02:41]. Arize Phoenix has a good Docker solution [00:17:18]. |
| Re-ranking / Accuracy | Closed (e.g., Cohere) | Open (e.g., NVIDIA) | Re-ranking improves the accuracy of the RAG solution [00:03:07]. Cross-encoders, also known as re-rankers, are applied post-retrieval for additional accuracy, though they are slower and less scalable than bi-encoders [00:11:30]. |
| Evaluation Framework | Ragas | Ragas | Crucial for evaluating the quality of the RAG solution across various metrics; it works well with Large Language Models, making the task relatively painless [00:03:24], [00:17:34]. |
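As noted in the orchestration row above, LlamaIndex can stand up a basic RAG pipeline in a few lines of code. The snippet below is a minimal prototype sketch, assuming the knowledge base sits in a local data/ directory (an illustrative path) and that an OpenAI API key is available for the default closed embedding and chat models:

```python
# Minimal LlamaIndex RAG prototype (sketch).
# Assumes: `pip install llama-index`, OPENAI_API_KEY set in the environment,
# and a local "data/" folder holding the knowledge-base files.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the knowledge base from disk.
documents = SimpleDirectoryReader("data").load_data()

# Embed the documents and build an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieval + augmentation + generation behind a single query call.
query_engine = index.as_query_engine()
response = query_engine.query("Where can I get help in London?")
print(response)
```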
Understanding RAG: Retrieval Augmented Generation
RAG is a process to enhance the generation capabilities of Large Language Models (LLMs) by providing them with relevant context from an external knowledge base [00:04:08].
The process involves:
- Retrieval: A user query (e.g., “Where can I get help in London?”) is used to perform a semantic search through a vector database to retrieve relevant documents [00:04:18].
- Augmentation: The original query is combined with the retrieved information from the vector database, providing crucial context to the LLM [00:04:31].
- Generation: With the additional context, the LLM can produce a more accurate and relevant response to the original question [00:04:45].
A naive RAG solution embeds the query, compares it to documents in a vector database, retrieves relevant documents, and passes them with the query to the LLM for response generation [00:04:58].
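A hand-rolled sketch of this naive pipeline, assuming an in-memory Qdrant instance, the BAAI BGE small embedding model, and an illustrative OpenAI chat model (model names, documents, and prompt are examples, not a prescribed setup):

```python
# Naive RAG: embed the query, retrieve similar documents from Qdrant,
# then pass query + retrieved context to an LLM. Sketch only.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings
qdrant = QdrantClient(":memory:")  # in-memory instance for prototyping
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Index a toy knowledge base.
docs = ["Citizens Advice offers free help in London.", "The museum opens at 9am."]
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=embedder.encode(d).tolist(), payload={"text": d})
        for i, d in enumerate(docs)
    ],
)

# Retrieval: semantic search with the embedded query.
query = "Where can I get help in London?"
hits = qdrant.search(
    collection_name="docs", query_vector=embedder.encode(query).tolist(), limit=2
)
context = "\n".join(hit.payload["text"] for hit in hits)

# Augmentation + generation: pass query and retrieved context to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
llm = OpenAI()  # requires OPENAI_API_KEY
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative closed model
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```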
Improving RAG Accuracy
To enhance a RAG solution, additional steps can be incorporated:
- Query Processing: Steps to refine the query, such as removing Personally Identifiable Information (PII) [00:08:49] (see the sketch after this list).
- Post-Retrieval Processing: Improving the accuracy of retrieved documents before passing them to the LLM [00:09:06]. This often involves re-rankers.
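A minimal sketch of such a query-processing step, assuming simple regular-expression scrubbing of email addresses and phone numbers before the query is embedded (the patterns are illustrative; production systems typically rely on dedicated PII-detection tooling):

```python
# Sketch of a query-processing step: strip obvious PII before embedding the query.
# Regex patterns are illustrative, not an exhaustive PII filter.
import re

def scrub_query(query: str) -> str:
    query = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", query)   # email addresses
    query = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", query)     # phone-like numbers
    return query

print(scrub_query("I'm jane.doe@example.com, call +44 20 7946 0958 - where can I get help in London?"))
```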
Cross-Encoders vs. Bi-Encoders
- Cross-Encoder: Semantically compares a query with a document by passing both through a BERT model and then a classifier, yielding a similarity score between 0 and 1 [00:09:25]. While excellent for additional accuracy, it is slow and does not scale to large document collections [00:10:04]. It is best used as a re-ranker post-retrieval, over a limited number of documents [00:11:25].
- Bi-Encoder: Uses two separate encoder models (e.g., two BERT models) to embed the query and the document independently, then compares the embeddings with cosine similarity [00:10:14]. This approach is fast and scalable, making it excellent for the information retrieval step against the vector database [00:10:51].
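The two-stage pattern can be sketched with the sentence-transformers library, using illustrative bi-encoder and cross-encoder checkpoints: the fast bi-encoder narrows the corpus to a few candidates, and the slower cross-encoder re-ranks only that short list.

```python
# Two-stage retrieval: bi-encoder for fast recall, cross-encoder for re-ranking.
# Sketch only; the model checkpoints are illustrative choices.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Where can I get help in London?"
documents = [
    "Citizens Advice offers free help in London.",
    "The weather in Paris is mild in spring.",
    "London boroughs run local support centres.",
]

# Stage 1 (bi-encoder): embed query and documents independently, rank by cosine similarity.
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2].tolist()  # keep a small candidate set

# Stage 2 (cross-encoder): score each (query, candidate) pair jointly for accuracy.
candidates = [documents[i] for i in top_k]
rerank_scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)
print(reranked)
```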
Production Environment Setup with Docker
A typical production setup for an AI solution, particularly a RAG pipeline, uses a docker-compose.yaml file to orchestrate multiple Docker images as containers [00:17:46] (a minimal sketch appears after the list below).
Key Docker images for a RAG solution include:
- Data Ingestion: An image for ingesting data from the knowledge base (e.g., HTML files) [00:17:57].
- Vector Database: An image for Qdrant, pulled from Docker Hub [00:18:03].
- Front-end Application: For user interaction [00:18:09].
- LLM Serving: Ollama or Hugging Face’s Text Generation Inference engine to serve large language models [00:18:12].
- Tracing/Monitoring: Arize Phoenix for tracking and troubleshooting [00:18:19].
- Evaluation: Ragas for continuously evaluating the quality of the RAG solution [00:18:22].
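A minimal docker-compose.yaml sketch for these services, with illustrative image names, ports, and build paths (the ingestion, front-end, and evaluation images are placeholders for custom builds, not a definitive configuration):

```yaml
# Sketch of a docker-compose.yaml for the services above; names, ports,
# and build paths are illustrative assumptions.
services:
  ingestion:
    build: ./ingestion           # custom image that loads HTML files into Qdrant
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant         # vector database from Docker Hub
    ports:
      - "6333:6333"
  frontend:
    build: ./frontend            # user-facing application
    ports:
      - "8080:8080"
  ollama:
    image: ollama/ollama         # serves the open LLM
    ports:
      - "11434:11434"
  phoenix:
    image: arizephoenix/phoenix  # tracing and monitoring
    ports:
      - "6006:6006"
  evaluation:
    build: ./evaluation          # runs Ragas quality checks against the pipeline
    depends_on:
      - qdrant
      - ollama
```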
This structured approach supports robust scaling of AI solutions in production and effective design of AI-powered products.