From: hu-po

AI agents are advancing rapidly: recent research shows them autonomously operating computer operating systems. This marks a significant step toward Artificial General Intelligence (AGI). Some argue that large language models (LLMs) already embody AGI and that the focus is now shifting toward Artificial Super Intelligence (ASI) [02:47:00].

Current State of OS Agents

Recent papers highlight the ability of AI agents to interact with and control various operating systems and applications:

OS-Copilot (macOS)

Released February 15, 2024, this paper describes an agent that combines a vision language model (VLM) and an LLM, specifically GPT-4 Vision and GPT-4 Turbo, to operate a macOS computer [04:21:00]. Capabilities include:

  • Calculating and drawing charts in Excel [05:00:00].
  • Creating and adjusting work layouts and themes for websites [05:06:00].
  • Playing music [05:10:00].
  • Generating files and building web pages by interacting with multiple programs [05:15:00].

UFO (UI-Focused Agent for Windows OS Interaction)

Released February 8, 2024, by Microsoft, UFO also utilizes GPT-4 Vision and GPT-4 to operate Windows OS [06:30:00]. It can span multiple applications, with users providing only high-level natural language commands [08:09:00]. Examples of tasks performed:

  • Extracting information from Word documents [07:39:00].
  • Scrolling through photo apps to find specific images [07:42:00].
  • Creating PowerPoint presentations [07:46:00].
  • Sending PowerPoints via email [07:50:00].
  • Driving Windows controls programmatically through pywinauto, with actions such as click, set text, summarize, get text, and scroll [34:01:00].

AGI in Practice

The speaker presents GPT-4's ability to operate both Mac and Windows computers with minimal prompt engineering as strong evidence that GPT-4 is already AGI [07:10:00].

Evolution to Large Action Models (LAMs)

The development of these agents, capable of comprehending natural language requests and autonomously interacting with UIs, suggests a transition from large language models to “large action models” [09:11:00]. The speaker, however, finds the term “LAM” somewhat “cringe”, since these agents are still existing LLMs and VLMs combined with prompt engineering [08:36:00].

Core Components and Concepts of Agent Frameworks

Sense-Plan-Act Loop

Most AI agent frameworks follow a “sense-plan-act” loop:

  1. Sensing (Observation): Involves consuming information like screenshots, user requests, and retrieving historical data from memory [09:19:00].
  2. Planning (Internal Chain of Thought): The LLM generates thoughts and explanations, explicitly thinking out loud to devise the next steps. These thoughts are usually emitted as part of the LLM's output [28:09:00].
  3. Acting: The agent executes specific actions within the operating system or application, often via APIs [35:35:00].
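
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the papers discussed: the names `Observation`, `Agent`, and `demo`, the `FINISH` sentinel, and the scripted planner that stands in for an LLM call are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Observation:
    screenshot: bytes    # raw screenshot bytes (empty placeholder in the demo)
    request: str         # the user's natural-language request
    history: list[str]   # actions already executed, retrieved from memory


@dataclass
class Agent:
    sense: Callable[[list[str]], Observation]
    plan: Callable[[Observation], tuple[str, str]]  # returns (thought, action)
    act: Callable[[str], None]
    memory: list[str] = field(default_factory=list)

    def run(self, max_steps: int = 10) -> list[str]:
        for _ in range(max_steps):
            obs = self.sense(self.memory)      # 1. sense
            thought, action = self.plan(obs)   # 2. plan (thought is the visible reasoning)
            if action == "FINISH":             # hypothetical stop signal
                break
            self.act(action)                   # 3. act
            self.memory.append(action)
        return self.memory


def demo() -> list[str]:
    # A scripted "planner" stands in for the LLM so the sketch runs offline.
    script = ["open_app('PowerPoint')", "click('New Slide')", "FINISH"]

    def sense(history: list[str]) -> Observation:
        return Observation(b"", "make a slide", list(history))

    def plan(obs: Observation) -> tuple[str, str]:
        step = len(obs.history)
        return f"thinking about step {step + 1}", script[step]

    def act(action: str) -> None:
        pass  # a real agent would call an OS or application API here

    return Agent(sense, plan, act).run()
```

Calling `demo()` returns the two executed actions in order; the `max_steps` cap is a design choice that keeps a confused planner from looping forever.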

Memory

Agents store previous states, actions, user requests, and screenshots. Beyond this history, agent memory can be categorized as:

  • Procedural Memory: Tools or functions (APIs) that define the action space [13:25:00].
  • Declarative Memory: More static information like user profiles or generic semantic knowledge [13:10:00].
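
One way to picture the split is a small container that keeps the two kinds of memory apart. This is a sketch under assumptions: the class `AgentMemory`, its field names, and the `history` log are hypothetical and do not come from the papers discussed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentMemory:
    # Declarative memory: relatively static facts (user profile, general knowledge).
    declarative: dict[str, str] = field(default_factory=dict)
    # Procedural memory: named tools; together they define the action space.
    procedural: dict[str, Callable[..., str]] = field(default_factory=dict)
    # Running log of (user request, action taken) pairs.
    history: list[tuple[str, str]] = field(default_factory=list)

    def register_tool(self, name: str, fn: Callable[..., str]) -> None:
        self.procedural[name] = fn

    def action_space(self) -> list[str]:
        return sorted(self.procedural)

    def call(self, request: str, name: str, *args: object) -> str:
        result = self.procedural[name](*args)
        rendered = f"{name}({', '.join(repr(a) for a in args)})"
        self.history.append((request, rendered))
        return result
```

For example, after `mem.register_tool("get_text", lambda control: "text of " + control)`, the call `mem.call("read the title", "get_text", "Title")` runs the tool and appends the pair to `mem.history`.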

Future of Memory and Context

The need for explicit memory management and retrieval-augmented generation (RAG) may diminish as LLM context windows grow (e.g., Gemini 1.5 handling 10 million tokens) [14:47:00]. If the full history fits in the context window, agents could become simpler and more direct.

Action Grounding

Action grounding improves VLM performance by annotating screenshots with visual aids (e.g., red rectangles, numbers, arrows) that highlight clickable elements or indicate directions, which helps the VLM understand the function and location of UI elements [18:36:00]. The approach is “anthropomorphic”, akin to a human annotating an image for another human, and it works surprisingly well with VLMs [21:15:00].
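
As a rough illustration of the annotation step, the sketch below draws a numbered red rectangle around each clickable element and returns a legend that could be pasted into the prompt. It emits an SVG overlay using only the standard library; a real system would draw onto the screenshot itself. The `UIElement` type and the `annotate` function are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    name: str
    x: int
    y: int
    width: int
    height: int


def annotate(elements: list[UIElement], canvas_w: int, canvas_h: int) -> tuple[str, dict[int, str]]:
    """Return (SVG overlay, legend mapping each numeric label to an element name)."""
    shapes: list[str] = []
    legend: dict[int, str] = {}
    for label, el in enumerate(elements, start=1):
        legend[label] = el.name
        # Red bounding box around the clickable element.
        shapes.append(
            f'<rect x="{el.x}" y="{el.y}" width="{el.width}" height="{el.height}" '
            'fill="none" stroke="red" stroke-width="2"/>'
        )
        # Numeric label in the box's top-left corner.
        shapes.append(
            f'<text x="{el.x + 2}" y="{el.y + 14}" fill="red" font-size="14">{label}</text>'
        )
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{canvas_w}" height="{canvas_h}">'
        + "".join(shapes)
        + "</svg>"
    )
    return svg, legend
```

With the legend in the prompt, the model can answer with a label number such as "click 2" instead of raw pixel coordinates.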

Action Tokens

The output of an agent can include “action tokens,” which are hardcoded tokens representing specific actions (e.g., clicking a button, moving a mouse in discrete bins, joystick actions) [29:50:00].
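
To make “moving a mouse in discrete bins” concrete, here is a minimal sketch that maps a pixel position to a hardcoded token and back. The token format `<MOUSE_row_col>` and the default of 10 bins per axis are assumptions for illustration, not taken from any specific model.

```python
from __future__ import annotations


def mouse_to_token(x: float, y: float, width: int, height: int, bins: int = 10) -> str:
    """Quantize a pixel position into one of bins x bins discrete action tokens."""
    col = min(int(x / width * bins), bins - 1)   # clamp the right edge into the last bin
    row = min(int(y / height * bins), bins - 1)  # clamp the bottom edge into the last bin
    return f"<MOUSE_{row}_{col}>"


def token_to_mouse(token: str, width: int, height: int, bins: int = 10) -> tuple[float, float]:
    """Decode a token back to the pixel at the center of its bin."""
    _, row, col = token.strip("<>").split("_")
    return ((int(col) + 0.5) * width / bins, (int(row) + 0.5) * height / bins)
```

The round trip is lossy: every position inside a bin decodes to the bin center, so precision is bounded by the bin size.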

Limitation of Action Tokens

A significant drawback is that these action spaces are task-dependent and must be predefined for each specific task or game. This limits the generality of the agent, unlike models where the action space is pure text (e.g., the Voyager agent for Minecraft outputs functions in text) [32:07:00]. The speaker predicts that future agents will output pure text for actions, making them truly general-purpose [33:33:00].
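
By contrast, a text action space needs no predefined token vocabulary: the model writes a function call as plain text and the framework parses it. The sketch below shows one possible way to do that safely with Python's `ast` module. The function names `parse_action` and `execute` are hypothetical, and the tool names in the usage note simply reuse the click and set text actions mentioned earlier.

```python
from __future__ import annotations

import ast
from typing import Any, Callable


def parse_action(text: str) -> tuple[str, list[Any], dict[str, Any]]:
    """Parse text such as "click('OK')" into (name, args, kwargs) without eval()."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a plain function call: {text!r}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("**kwargs unpacking is not allowed")
    # literal_eval only accepts literals, so arguments cannot run arbitrary code.
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs


def execute(text: str, tools: dict[str, Callable[..., Any]]) -> Any:
    """Dispatch a text action to a registered tool."""
    name, args, kwargs = parse_action(text)
    if name not in tools:
        raise KeyError(f"unknown action: {name}")
    return tools[name](*args, **kwargs)
```

For instance, with `tools = {"set_text": lambda control, text: f"{control} <- {text}"}`, the call `execute("set_text('Title', text='Q1 results')", tools)` invokes the registered function. Supporting a new task then means registering new functions, not retraining a token vocabulary.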

Broader Applications of AI Agents

Beyond operating systems, AI agents are being developed for diverse domains:

  • Robotics: Manipulation and navigation tasks [44:14:00].
  • Gaming: Playing video games and generating new ones through self-play loops [55:15:00].
  • Healthcare: Diagnosing situations from visual data, e.g., captioning patient status from a video [44:21:00].
  • 3D Design: Generating unconventional 3D objects in software like Blender using a “Chain of 3D Thought” approach, leveraging Python APIs [47:31:00].
  • Software Development: Meta (Facebook) uses bots to generate unit tests and push their own PRs [15:50:00].

Positive Transfer Learning

Pre-training agents on a mixture of data from domains like robotics and gaming can lead to “positive transfer” to unseen domains such as healthcare. This means an agent can become better at diagnosing medical situations simply by playing games [53:00:00]. This suggests that even if certain industries restrict access to their data (e.g., healthcare data), AI agents will still achieve superhuman levels of proficiency through self-play and diverse training [54:35:00].

Advancements in Agent Reasoning and Collaboration

Chain of Thought (CoT) Decoding

Traditional language models often use greedy decoding (picking the most probable next token) or temperature sampling (introducing randomness). However, exploring the alternative top-k tokens during decoding, instead of only the single most probable one, can reveal Chain of Thought paths that are already inherent in pre-trained LLMs, without explicit prompting [01:13:51].

  • Method: By investigating top alternative tokens and longer decoding paths, models can achieve substantially better performance, especially in mathematical benchmarks [01:23:42].
  • Benefit: Longer Chain of Thought paths often correlate with increased model confidence in the final answer [01:18:03]. This technique can reduce the reliance on prompt engineering, as the model’s intrinsic reasoning is uncovered through search in the token space [01:27:01].
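
The search can be sketched on a toy model. The sketch branches on the top-k first tokens, continues each branch greedily, and keeps the branch whose answer is most confident. Two assumptions are made here: confidence is scored as the probability margin between the top two candidates at the answer position, and the answer is taken to be the last token of each path. The `TOY` table is invented purely for the demo and stands in for a real model's next-token distribution.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical interface: a model maps a token prefix to next-token probabilities.
Model = Callable[[tuple], dict]


def greedy_continue(model: Model, prefix: tuple, max_len: int = 8) -> tuple[list, list]:
    """Greedy-decode from prefix; also record the top-1 minus top-2 margin per step."""
    tokens, margins = list(prefix), []
    while len(tokens) < max_len:
        dist = model(tuple(tokens))
        if not dist:  # an empty distribution marks the end of the sequence
            break
        ranked = sorted(dist.values(), reverse=True)
        margins.append(ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0))
        tokens.append(max(dist, key=dist.get))
    return tokens, margins


def cot_decode(model: Model, k: int, max_len: int = 8, answer_len: int = 1) -> tuple[list, float]:
    """Branch on the top-k first tokens and keep the most confident path."""
    first = model(())
    starts = sorted(first, key=first.get, reverse=True)[:k]
    best_tokens, best_conf = [], -1.0
    for start in starts:
        tokens, margins = greedy_continue(model, (start,), max_len)
        tail = margins[-answer_len:]  # margins at the assumed answer position
        conf = sum(tail) / len(tail) if tail else 0.0
        if conf > best_conf:
            best_tokens, best_conf = tokens, conf
    return best_tokens, best_conf


# Invented toy distribution: the greedy path blurts out "5", a lower-ranked
# first token leads to a reasoning path that ends in a confident "8".
TOY = {
    (): {"A:": 0.5, "Dad": 0.3, "We": 0.2},
    ("A:",): {"5": 0.45, "8": 0.40, "3": 0.15},
    ("Dad",): {"has": 1.0},
    ("Dad", "has"): {"5,": 0.9, "2,": 0.1},
    ("Dad", "has", "5,"): {"total": 1.0},
    ("Dad", "has", "5,", "total"): {"8": 0.95, "5": 0.05},
    ("We",): {"get": 1.0},
    ("We", "get"): {"5": 0.55, "8": 0.45},
}


def toy_model(prefix: tuple) -> dict:
    return TOY.get(prefix, {})
```

With `k=1` this reduces to ordinary greedy decoding and ends in "5"; with `k=3` the search selects the longer path that ends in "8". The cost is k full decodes instead of one.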

Ensemble of Agents

Another approach to improve performance is to use an ensemble of LLM agents in a debate format [01:38:49].

  • Method: Multiple LLMs are asked the same question, their answers are collected, and a majority vote (based on similarity of embeddings) determines the final answer [01:39:39].
  • Scalability: The accuracy of the LLM system scales with the number of agents in the ensemble [01:44:10]. An ensemble of smaller, less powerful models can outperform a single, larger model (e.g., an ensemble of Llama 2 13B models outperforms a single Llama 2 70B model) [01:45:01].
  • Heterogeneous Ensembles: The method supports ensembling diverse LLMs from different companies, potentially fostering a competitive environment where agent providers strive to offer the most helpful services [01:40:51].
  • Combination: Ensembling can be combined with Chain of Thought decoding for potentially even better results, albeit with increased computational cost [01:43:02].
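
A minimal sketch of similarity-based voting follows. Each answer is scored by its total similarity to all other answers, and the highest-scoring answer wins. To stay self-contained, the sketch uses bag-of-words cosine similarity as a stand-in for the learned embeddings a real system would use; the function names are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Stand-in for an embedding model (an assumption, not the original method).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vote(answers: list[str]) -> str:
    """Return the answer that is most similar, in total, to all the others."""
    if not answers:
        raise ValueError("need at least one answer")
    vecs = [bag_of_words(a) for a in answers]
    scores = [
        sum(cosine(v, w) for j, w in enumerate(vecs) if j != i)
        for i, v in enumerate(vecs)
    ]
    return answers[max(range(len(answers)), key=scores.__getitem__)]
```

For example, `vote(["the answer is 8", "the answer is 8", "it is 5", "answer: 8 apples"])` picks "the answer is 8", because two agents agree exactly and the outliers share few words with the rest. In practice the `answers` list would hold one response per agent, from the same or from different LLMs.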

Societal and Commercial Implications

User Interface Transformation

As AI agents become the primary users of applications, user interfaces (UIs) are expected to drastically change. UIs, currently optimized for human usability, will likely optimize for agent usability and inference cost efficiency. This could lead to UIs that are difficult for humans to use directly [00:36:55]. For example, a redesigned PowerPoint UI might enable agents to complete tasks in five actions instead of a hundred, leading to significant inference cost savings for companies like Microsoft [00:39:43].

Privacy and Control Concerns

The operation of these agents involves continuous screenshots and data being sent to developers (e.g., Microsoft data centers), raising significant privacy concerns [00:41:42].

  • A “safety model” continually assesses actions, potentially limiting what users can do if deemed “sketchy” or unapproved by the provider [00:41:48].
  • This could lead to a future where human control over computers is significantly reduced, resembling the scenario in “2001: A Space Odyssey” where the AI refuses commands [00:42:36].

The Gradual Singularity

The shift to a world dominated by AI agents is unlikely to be a singular, sudden event. Similar to the widespread adoption of cell phones, which fundamentally changed human behavior over a decade without a specific “singularity” moment, AI integration will likely be a gradual, background process [01:00:13]. This will transform daily tasks and human skills, with future generations potentially being more adept at voice commands than traditional typing [01:02:50].

Commercial Deployment

Companies like Microsoft are likely to integrate these AI agents directly into operating systems (e.g., Windows 12), offering them as built-in products [01:16:38]. The question remains whether other companies will offer third-party solutions, creating a competitive market, or if OS providers will maintain a “walled garden” approach [01:17:04].

Future Outlook

The rapid pace of research suggests that by the end of 2024, self-operating computers capable of performing daily tasks through natural language will be commonplace [01:15:31]. Continued scaling of LLMs, coupled with advancements in tokenization (e.g., Karpathy’s work on byte-pair encoding for more fundamental tokens) and self-play loops, will drive further progress toward ASI [01:59:16]. This will lead to a generalist superintelligence capable of performing virtually any task [01:59:54].