From: aidotengineer

Developing and deploying AI solutions built on Retrieval-Augmented Generation (RAG) stacks typically proceeds in two distinct phases: prototyping and production [00:00:51]. Each phase uses different tools and strategies to meet its specific requirements.

Prototyping Environment

The prototyping phase focuses on rapid iteration and experimentation to validate concepts and build initial models [00:00:51].

Key Characteristics & Tools

  • Platform: Google Colab is a preferred environment for prototyping because it is easy to access and provides a free hardware accelerator (GPU) [00:00:53].
  • Orchestration: LlamaIndex or LangGraph are suitable for orchestrating the RAG pipeline at the prototyping stage [00:01:25]; a minimal pipeline sketch follows this list.
  • Embedding Models: Both closed models (called via APIs for simplicity) and open models (e.g., from NVIDIA or BAAI) can be used [00:01:31]. BAAI's BGE-small is an example of an open embedding model that can be downloaded and run locally [00:12:42]. OpenAI's text-embedding-ada-002 and text-embedding-3-large are the built-in defaults in some systems [00:12:15].
  • Vector Database: Qdrant is a suitable choice; it can run entirely in memory for prototyping, and the same database later scales to production workloads [00:01:51], [00:11:57].
  • Large Language Models (LLMs): Closed models are often used for their API simplicity, such as OpenAI's GPT-3.5 Turbo (a common default) or GPT-4 [00:02:04], [00:13:22]. Open models, such as those from Meta or the Qwen 3 family, are also options [00:02:12].
  • Monitoring and Tracing: Solutions like LangSmith or Arize Phoenix track the performance of individual components and aid troubleshooting [00:02:53], [00:16:25].
  • Re-ranking: A closed re-ranking model, such as Cohere's, can be applied after retrieval to improve accuracy [00:03:11], [00:17:24].
  • Evaluation: The Ragas framework evaluates the quality of RAG solutions and supports testing across many documents [00:03:27], [00:16:47], [00:17:34].
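
To make these components concrete, the sketch below wires them together with LlamaIndex: BAAI's BGE-small as the open embedding model, an in-memory Qdrant collection, and the default closed OpenAI LLM. Package names follow the current llama-index distribution; the "data" directory and the query string are placeholder assumptions.

```python
# pip install llama-index llama-index-vector-stores-qdrant \
#             llama-index-embeddings-huggingface qdrant-client
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Open embedding model: BGE-small from BAAI, downloaded and run locally
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# In-memory Qdrant instance -- no server needed while prototyping
client = QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="prototype")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest a local folder of documents (placeholder path) and build the index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# The LLM defaults to OpenAI's closed model (requires OPENAI_API_KEY)
query_engine = index.as_query_engine()
print(query_engine.query("What does the knowledge base say about deployment?"))
```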

Production Environment

The production phase focuses on deploying robust, scalable, and secure AI solutions, with engineering teams ensuring that those solutions meet operational requirements.

Key Characteristics & Tools

  • Deployment: Docker is a common deployment solution, supporting both on-premise and cloud environments [00:01:14]. This matters for organizations, such as financial institutions, that must process data on-premise [00:01:06].
  • Orchestration: LlamaIndex is a suitable choice for orchestrating RAG pipelines in production [00:01:29].
  • Embedding Models: Open models, such as those from BAAI and NVIDIA, are favored in production environments [00:01:46], [00:16:56].
  • Vector Database: Qdrant is recommended because it scales from a handful of documents to hundreds of thousands, making it suitable for production workloads [00:01:51], [00:17:01].
  • Large Language Models (LLMs): Open models, such as Meta's Llama 3.2 or Alibaba Cloud's Qwen 3 (4B), are often used, served inside the Docker environment via Ollama or Hugging Face Text Generation Inference [00:02:21].
  • Monitoring and Tracing: Arize Phoenix is a robust monitoring and tracing solution for production, with a ready-made Docker image [00:03:00], [00:17:18].
  • Re-ranking: Open re-ranking models, such as those from NVIDIA, are preferred in production to enhance accuracy [00:03:17], [00:17:26].
  • Evaluation: The Ragas framework remains crucial for checking the quality of RAG solutions across various metrics in a production context [00:03:27], [00:17:34]; a minimal evaluation sketch follows this list.
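
As a sketch of what that evaluation can look like, the snippet below scores one question/answer/context sample with Ragas, assuming the 0.1-style evaluate API. The sample texts are invented placeholders, and the default metrics call an OpenAI judge model, so OPENAI_API_KEY must be set.

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One hypothetical sample; in practice these rows come from the pipeline's traces
samples = {
    "question": ["What does the policy say about data residency?"],
    "answer": ["Customer data must be processed on-premise."],
    "contexts": [["The policy requires all customer data to be processed on-premise."]],
    "ground_truth": ["All customer data must be processed on-premise."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores between 0 and 1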

Docker Compose Configuration for Production

A typical production setup uses Docker Compose with a compose.yaml file that runs each service as its own container [00:17:48]. This keeps the deployment of the RAG stack modular (a sketch follows this list):

  • Data Ingestion Image: Connects to the knowledge base to pull in HTML files [00:17:57].
  • Qdrant Image: For the vector database, often pulled from Docker Hub [00:18:03].
  • Front-end Application Image: For the user interface [00:18:09].
  • LLM Serving Image: Uses Ollama or Hugging Face Text Generation Inference to serve models [00:18:12].
  • Phoenix Image: For tracing and monitoring [00:18:19].
  • Ragas Image: For evaluating model quality [00:18:22].
  • Model Details: Configuration for embedding, re-ranking, and large language models [00:18:30].
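
A minimal compose.yaml along these lines is sketched below. The public images (qdrant/qdrant, ollama/ollama, arizephoenix/phoenix) and their default ports are real; the service layout, build paths, and dependency ordering are illustrative assumptions rather than the exact file described in the talk.

```yaml
services:
  qdrant:
    image: qdrant/qdrant            # vector database, pulled from Docker Hub
    ports: ["6333:6333"]
  ollama:
    image: ollama/ollama            # serves the open LLM (e.g., llama3.2)
    ports: ["11434:11434"]
  phoenix:
    image: arizephoenix/phoenix     # tracing and monitoring UI
    ports: ["6006:6006"]
  ingestion:
    build: ./ingestion              # custom image that pulls HTML from the knowledge base
    depends_on: [qdrant]
  frontend:
    build: ./frontend               # user-facing application
    depends_on: [qdrant, ollama]
```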

This structured approach helps in building and shipping production-ready generative AI solutions efficiently [00:00:12].