From: hu-po

Vision-based agents, particularly those leveraging Vision Language Models (VLMs), are central to the future of AI applications. Developing these models involves not only continuous improvement of the underlying components but also significant challenges in deploying them efficiently for real-world inference [00:00:10].

Evolution of Vision Encoders

Vision encoders are the initial neural network modules that process an image, serving as the VLM's visual cortex [00:04:08]. Recent research, such as Apple’s “Multimodal Auto-Regressive Pre-training of Large Vision Encoders,” demonstrates that these encoders can be pre-trained effectively with a simple auto-regressive objective, much like language models [00:04:27]. This contrasts with the contrastive losses traditionally used in models like CLIP [00:04:38].

The auto-regressive approach pairs a vision encoder with a multimodal decoder that generates raw image patches and text tokens [00:05:18]. The encoder converts an image into a vector representation, which the decoder then reconstructs patch by patch [00:05:25]. This method achieves high accuracy (for example, 89.5% on ImageNet-1K), indicating strong performance even with a frozen “trunk” (the main body of the encoder) [00:05:45].
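To make the objective concrete, below is a minimal PyTorch-style sketch of auto-regressive pre-training on image patches. It is illustrative only: the paper pairs the encoder with a separate multimodal decoder over patches and text tokens, whereas this sketch folds everything into one causal Transformer with a pixel-regression head, and all names and dimensions are assumptions.

```python
# Minimal sketch of auto-regressive vision pre-training (not the actual
# AIMv2 code): a causal Transformer predicts each raw image patch from
# the patches that precede it, mirroring next-token prediction in LLMs.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM = 16 * 16 * 3  # flattened 16x16 RGB patch
EMBED_DIM = 768

class ARVisionPretrainer(nn.Module):
    def __init__(self, num_patches: int = 196):
        super().__init__()
        self.patch_embed = nn.Linear(PATCH_DIM, EMBED_DIM)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, EMBED_DIM))
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=12, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=12)  # the "trunk"
        self.pixel_head = nn.Linear(EMBED_DIM, PATCH_DIM)         # predicts raw pixels

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, PATCH_DIM), in raster order
        x = self.patch_embed(patches) + self.pos_embed
        # Causal mask: each position attends only to itself and earlier patches.
        t = patches.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=patches.device), 1)
        h = self.trunk(x, mask=mask)
        pred = self.pixel_head(h[:, :-1])        # predict patch t+1 from the prefix
        return F.mse_loss(pred, patches[:, 1:])  # pixel-regression loss
```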

Crucially, the scaling properties observed in Large Language Models (LLMs) also apply to vision encoders trained with this auto-regressive objective [00:08:31]. Increasing the parameter count consistently lowers validation loss, suggesting that vision encoders will keep becoming more capable as pre-training scale increases [00:08:05].

Inference Challenges

Despite advances in pre-training, running inference with VLMs presents significant challenges, particularly on devices like mobile phones [00:09:59]. A primary issue is high latency caused by the large number of input tokens, predominantly from images [00:14:57]. When an image is processed, it is typically cut into many patches, each of which becomes a vision token that must be fed into the language model [00:15:17].
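To see where the token count comes from: a hypothetical 1024×1024 screenshot split into 16×16 patches yields (1024/16)² = 4,096 vision tokens before any reduction. A minimal sketch of the patchify step (all sizes are illustrative assumptions):

```python
# Illustrative patchify: how a single image explodes into many vision tokens.
import torch

def patchify(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into flattened (num_patches, C*patch*patch) patches."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    return x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

tokens = patchify(torch.rand(3, 1024, 1024))
print(tokens.shape)  # torch.Size([4096, 768]) -> 4,096 vision tokens for one image
```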

Optimization Strategies for Inference

To mitigate latency and improve efficiency for mobile deployments, several strategies are employed:

  • Downsizing LLMs or Reducing Visual Tokens: One can either decrease the size of the LLM or reduce the number of input image tokens [00:15:02]. Surprisingly, for visual reasoning tasks, the optimal strategy for inference is often to use the largest LLM that fits the budget while minimizing visual token count, sometimes even to a single token [00:16:34]. This means a single token might represent an entire image [00:16:50].

  • Pipeline Parallelism: This involves running the vision embedding module and the vision Transformer on different processing units simultaneously, such as the CPU and the NPU (Neural Processing Unit) respectively [00:13:50]. This parallel processing speeds up inference [00:13:53].

  • Batching Image Patches: Instead of sequentially processing each image patch, they can be processed in batches on the NPU, further accelerating the vision encoder inference [00:14:20]. Experiments suggest that processing four patches per batch can deliver the fastest performance [01:29:02].

  • Token Reduction Methods: To shrink the number of tokens coming out of the vision encoder, one can filter out visual tokens with low similarity to the class token (a token that summarizes the entire image), or use a “token packer” that compresses tokens via cross-attention [00:20:52]. A sketch of the similarity-filtering approach appears after this list.

  • Caching Text Input Tokens: In use cases where the same text instruction (e.g., “what is in this image?”) is given repeatedly, the corresponding text input tokens can be computed once and cached instead of being recalculated every time [00:22:55]. This removes redundant computation, although its effectiveness depends on the hardware and memory architecture [00:23:37]. See the prompt-caching sketch below.
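As a concrete illustration of the similarity-filtering idea from the token-reduction bullet, here is a minimal sketch; the function name, dimensions, and keep-count are assumptions rather than values from a specific paper:

```python
# Sketch of class-token similarity filtering: keep only the visual tokens
# most similar to the class token, i.e. drop low-similarity tokens.
import torch
import torch.nn.functional as F

def filter_tokens(cls_token: torch.Tensor, vis_tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """cls_token: (dim,) summary of the whole image; vis_tokens: (n, dim)."""
    sims = F.cosine_similarity(vis_tokens, cls_token.unsqueeze(0), dim=-1)  # (n,)
    idx = sims.topk(keep).indices.sort().values  # top-k, restored to spatial order
    return vis_tokens[idx]

vis_tokens = torch.randn(4096, 768)   # e.g. the 4,096 patches from above
cls_token = torch.randn(768)
reduced = filter_tokens(cls_token, vis_tokens, keep=64)  # 4096 -> 64 tokens for the LLM
```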
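And a minimal sketch of prompt caching using the Hugging Face transformers API, with "gpt2" as a stand-in for whatever on-device model is used. It assumes the fixed instruction forms the prefix of the token sequence; real VLMs may order vision and text tokens differently:

```python
# Sketch of caching a fixed text prompt's key/value states so they are
# not recomputed for every image.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Run the fixed instruction once and keep its key/value cache.
prompt_ids = tok("what is in this image?", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_cache = model(prompt_ids, use_cache=True).past_key_values

def forward_after_prompt(new_ids: torch.Tensor) -> torch.Tensor:
    # Copy the cache because the model extends it in place; only the new
    # tokens (e.g. per-image tokens) are actually computed.
    cache = copy.deepcopy(prompt_cache)
    with torch.no_grad():
        return model(new_ids, past_key_values=cache, use_cache=True).logits
```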

Task-Specific Optimizations

The optimal inference strategy varies significantly depending on the task [00:24:21]. For tasks like Optical Character Recognition (OCR), reducing the number of visual tokens can be detrimental to performance, as fine-grained detail is needed to identify letters [00:24:40]. In contrast, for general visual reasoning, fewer tokens might suffice, as the gist of the image is often enough [00:25:31].

Application: GUI Agents

A significant future use case for Vision Language Models is in Graphical User Interface (GUI) agents [00:26:49]. These agents interact with digital devices through human-like mouse and keyboard actions, observing the GUI through screenshots [00:27:31]. This approach bypasses the need for custom APIs for AI, as companies might prefer maintaining a single GUI for both humans and agents [00:28:48].

GUI agents often maintain an extensive context of historical screenshots, which can be stored as vision tokens rather than raw image frames to save computational cost [00:30:39]. This further incentivizes the use of fewer visual tokens [00:31:01].
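A minimal sketch of such a history buffer, with all sizes illustrative, shows why compressed vision tokens are far cheaper to retain than raw frames:

```python
# Sketch of a GUI-agent history buffer that stores compressed vision
# tokens for past screenshots instead of the raw frames.
from collections import deque
import torch

class ScreenshotHistory:
    def __init__(self, max_steps: int = 32):
        self.buffer = deque(maxlen=max_steps)  # oldest steps are dropped

    def add(self, screenshot: torch.Tensor, encode) -> None:
        # encode = vision encoder + token reduction, e.g. 4096 patches -> 64 tokens
        self.buffer.append(encode(screenshot))

    def context(self) -> torch.Tensor:
        # Concatenated token sequences form the agent's visual context.
        return torch.cat(list(self.buffer), dim=0)

# Rough per-frame cost (fp16): a raw 1024x1024 RGB frame is ~3.1 MB,
# while 64 tokens x 768 dims x 2 bytes is ~0.1 MB.
```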

However, current agents’ GUI abilities are inconsistent: an agent might fail at a simple task like updating a phone number in a Word document, yet succeed at a more complex sequence of actions in a video game with an unusual UI [00:31:56]. This highlights the “weirdness” of their current capabilities [00:34:01].

Future Directions and Implications

Self-Improvement

Models are increasingly capable of self-improvement. Techniques like minimum Bayes risk (MBR) decoding let a model sample multiple outputs, score each by its consistency with the others, and use those scores to select targets for supervised fine-tuning or preference optimization [00:43:47]. This approach can automatically generate high-quality reasoning-trace datasets, outperforming methods that rely on human-produced data [00:47:15]. It suggests that AI models can become “smarter” by training on their own outputs [00:49:26].
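A minimal sketch of the selection step, with a crude string-overlap score standing in for whatever consistency measure is actually used:

```python
# Sketch of minimum-Bayes-risk selection: sample several answers, score
# each by its average agreement with the rest, and keep the most
# consistent one as a fine-tuning target.
from difflib import SequenceMatcher

def mbr_select(samples: list[str]) -> str:
    def agreement(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()  # crude similarity stand-in
    scores = [
        sum(agreement(s, other) for other in samples if other is not s) / (len(samples) - 1)
        for s in samples
    ]
    return samples[max(range(len(samples)), key=scores.__getitem__)]

samples = ["Paris", "Paris.", "Lyon"]  # e.g. several sampled model outputs
best = mbr_select(samples)             # a "Paris" variant; kept as an SFT label
```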

Imagining Futures

Beyond structured reasoning, advanced agents may leverage generative models to imagine future observations or screen states [00:54:26]. By generating video sequences based on current observations, an agent can use these “hallucinated images” to update its internal beliefs and plan more effectively [00:54:26]. This “look-ahead” capability, while compute-intensive, can lead to better decision-making [00:53:24].
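A minimal sketch of such look-ahead, where world_model, value, and the action set are all hypothetical components rather than a specific system:

```python
# Sketch of look-ahead with a generative world model: imagine the screen
# after each candidate action, score the imagined future, act on the best.
def act_with_lookahead(observation, actions, world_model, value):
    def rollout_score(action):
        imagined = world_model.generate(observation, action)  # "hallucinated" frames
        return value(imagined)  # how promising does that imagined future look?
    return max(actions, key=rollout_score)
```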

The “Arms Race” of Inference

There’s an ongoing “arms race” between hardware and software in AI inference [01:05:50]. Hardware developers strive to increase tokens per second, while AI researchers push for longer reasoning traces and imagined outputs, which demand more tokens and compute [01:06:05]. The goal is to perform millions of token generations and complex chains of thought in the blink of an eye, yielding faster and more intelligent agent actions [01:06:42].

Data Collection and Privacy

The widespread adoption of vision-based agents, particularly GUI agents, raises concerns about privacy. Devices might send frequent screenshots to cloud servers for processing, allowing companies to gather extensive data on user interactions [01:12:21]. This trend of decreasing privacy appears to be accelerating [01:20:30]. While local, federated learning approaches could offer privacy, the competitive advantage of vast cloud-based data collection makes this path less certain [01:11:15].

The future of AI will involve a dynamic interplay between continuous improvements in Vision Language Models, innovative inference optimizations, and the evolving applications of visual reasoning in agents, potentially leading to self-improving AI that can even simulate future scenarios to enhance its decisions [01:25:23].