From: aidotengineer
Introduction
The speaker, a co-creator of PyTorch, the open-source framework that powers much of today's AI software and is largely funded by Meta, discusses the future of AI, focusing specifically on personal local agents [00:00:30]. Although the speaker also works on Llama, no secrets about its future releases are shared [00:00:54].
The Rise of AI Agents
The speaker’s interest in personal local agents stemmed from personal experience. Swyx’s AI News aggregator saved significant time by summarizing each day’s AI news, which boosted personal productivity and led to a deeper integration of AI into daily life [00:01:20]. Work in robotics, where robots inherently act as agents, reinforced this interest [00:02:09]. The ultimate goal is to build home robots that handle errands [00:02:17].
Defining an Agent
An agent is defined as something that can act in the world and has “agency” [00:03:05]. If a system can only gather context but cannot take action, it is not considered an agent [00:03:12].
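To make that distinction concrete, here is a minimal Python sketch; the `ContextSource`, `Tool`, and `Agent` names are illustrative assumptions, not anything from the talk. A system with only read access gathers context; only when it also has tools to act does it have agency.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ContextSource:
    """Read-only access to some part of the user's world (e.g. email, calendar)."""
    name: str
    read: Callable[[], str]

@dataclass
class Tool:
    """An action the system can take in the world (e.g. send a message, place an order)."""
    name: str
    act: Callable[[str], str]

class Agent:
    """Has agency: it can both gather context and act on it."""
    def __init__(self, sources: list[ContextSource], tools: list[Tool]):
        self.sources = sources
        self.tools = tools

    def gather_context(self) -> str:
        return "\n".join(f"[{s.name}] {s.read()}" for s in self.sources)

    def can_act(self) -> bool:
        # With sources but no tools, this is a retriever/summarizer, not an agent.
        return len(self.tools) > 0
```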
The Critical Need for Context
A highly intelligent agent without the right context is largely useless [00:03:21]. For example, an agent with access to Gmail, WhatsApp, and a calendar might lie about a prescription renewal if it doesn’t have access to iMessage where the text from CVS was received [00:03:32]. Similarly, an agent only accessing one bank account might miss funds transferred via Venmo [00:04:08]. Without sufficient context, a personal agent becomes irritating and unreliable, leading to a lack of trust from the user [00:04:17]. For an agent to be truly useful, it needs to reach a certain level of reliability and predictability [00:04:36].
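One way to read the reliability point is that the agent should know which context sources a question depends on and decline to answer confidently when one is missing, rather than guessing. The sketch below is purely illustrative; the topic-to-source mapping and source names are assumptions, not the speaker's design.

```python
# Illustrative only: refuse to answer confidently when a needed source is missing.
REQUIRED_SOURCES = {
    "prescription status": {"gmail", "imessage"},   # the pharmacy may text rather than email
    "account balance":     {"bank", "venmo"},        # transfers may bypass the bank feed
}

def answer(query_topic: str, connected: set[str]) -> str:
    needed = REQUIRED_SOURCES.get(query_topic, set())
    missing = needed - connected
    if missing:
        return f"I can't answer reliably: no access to {', '.join(sorted(missing))}."
    return "...answer computed from all relevant sources..."

# The agent below has Gmail, WhatsApp, and a calendar, but not iMessage.
print(answer("prescription status", {"gmail", "whatsapp", "calendar"}))
```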
Enabling Personal Local Agents
The ideal scenario for providing context to an AI agent is for it to see everything the user sees and hear everything the user hears, essentially having access to all “variables” of one’s life [00:05:17].
Challenges with Current Devices
- Wearable Devices (e.g., smart glasses): Battery life limitations make continuous context provision impractical [00:05:31].
- Smartphones: Ecosystem restrictions, particularly from companies like Apple, prevent applications from running asynchronously in the background and constantly monitoring screen activity, limiting the agent’s ability to gain context [00:05:53].
The Case for Dedicated Local Hardware
A feasible way to run a personal AI agent right now is a device like a Mac Mini [00:06:30]. It can sit at home, stay connected to the internet, and run agents asynchronously without battery-life concerns [00:06:33]. It also allows logging into various services and can access the Android ecosystem [00:06:40].
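The kind of always-on process this enables might look like the minimal asyncio sketch below; the watcher tasks are hypothetical placeholders for whatever services the agent monitors, not part of the talk.

```python
import asyncio

# Illustrative sketch of an always-on agent process that a home machine
# (e.g. a Mac Mini) can run without battery or background-execution limits.
async def watch_email():
    while True:
        # poll a mail API, summarize new threads, queue follow-up actions ...
        await asyncio.sleep(60)

async def watch_calendar():
    while True:
        # look ahead for conflicts, draft reminders ...
        await asyncio.sleep(300)

async def main():
    # run all watchers concurrently, forever
    await asyncio.gather(watch_email(), watch_calendar())

if __name__ == "__main__":
    asyncio.run(main())
```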
Why Local and Private? (Privacy and Control)
The speaker strongly advocates for running personal agents locally and privately, rather than relying on cloud services provided by large tech companies [00:07:11].
Trust, Predictability, and Action Space
Unlike existing digital services like cloud email, which have a simple and predictable mental model (email in, reply out), AI agents possess a powerful and unpredictable action space [00:08:01]. For example, if an email service suddenly offered to auto-reply on your behalf, users would become uncomfortable because they don’t understand the full scope of potential actions, such as sending a “nasty” reply to a boss [00:08:27]. This unpredictability makes users uneasy when they are not in full control [00:08:58]. Furthermore, cloud services might monetize in ways that compromise user interests, such as making an agent prioritize purchases from partners who offer kickbacks [00:09:09]. Ultimately, users want control over something as personal and intimate as their AI agent [00:09:30].
Avoiding Walled Gardens
Running a personal agent in a centralized cloud ecosystem risks locking users into “walled gardens” that restrict interoperability [00:09:53]. While this might be acceptable for compartmentalized services like maps or email, it poses a significant concern for an agent that can take a wide variety of actions across a user’s entire daily life [00:10:01]. Therefore, the speaker believes the world should move towards local personalized agents as the norm [00:10:32].
Protecting “Thought Crimes”
A highly personal AI agent acts as an extension of the user, potentially processing thoughts and queries that one would never voice aloud [00:11:00]. Relying on cloud providers for such intimate interactions carries the risk of legal or social repercussions due to mandated logging and safety checks, even under enterprise-grade contracts [00:11:16]. To avoid being “punished for thought crimes,” a local agent ensures privacy and control over highly sensitive personal data [00:11:42].
Technical Hurdles and Opportunities
While the speaker is convinced of the necessity of local and private AI agents, there are technical challenges to overcome [00:12:15].
Local Inference Performance
Running local models, the key components of these agents, is made practical by open-source projects such as vLLM and SGLang [00:12:28], both built on PyTorch [00:12:43]. However, local model inference remains slower and more limited than cloud services, even on powerful machines [00:13:13]. While a 20-billion-parameter or distilled model might run quickly locally, the latest unquantized models are still very slow [00:13:28]. This is expected to improve and “fix itself” over time, though users might not always be able to run the absolute latest and greatest models [00:13:45].
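As a rough illustration of local inference with one of these projects, here is a minimal vLLM example. The model name is a placeholder, and as the talk notes, a smaller or distilled checkpoint is what tends to run comfortably on local hardware today.

```python
# Minimal local inference with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize today's AI news in three bullet points."], params
)
print(outputs[0].outputs[0].text)
```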
Research and Product Gaps
The primary challenges are not infrastructural but rather in research and product development [00:14:00], presenting an open challenge for AI engineers:
- Open Multimodal Models: Current open multimodal models are good but not great, particularly for computer vision. Even closed models struggle with computer use and tend to break [00:14:20]. When asked to shop, they often return boring, generic results and struggle to visually identify a user’s specific tastes [00:14:42].
- Catastrophic Action Classifiers: A significant gap exists in agents’ ability to identify “catastrophic actions”: irreversible and harmful actions, such as purchasing a Tesla instead of Tide Pods [00:15:37]. More research is needed so agents can identify such actions and notify users before proceeding [00:16:22] (a rough sketch of such a guard follows this list).
- Open-Source Voice Mode: The state of open-source voice mode for personal agents is currently insufficient, yet it is crucial for a natural interaction experience [00:16:52].
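A catastrophic-action guard of the kind called for above could be wired in as a gate between the agent's proposed action and its execution. The sketch below is a stand-in using simple heuristics; the research gap the speaker describes is precisely that a good learned classifier for this does not yet exist.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    cost_usd: float
    reversible: bool

def is_catastrophic(action: ProposedAction, budget_usd: float = 100.0) -> bool:
    # Stand-in for a learned catastrophic-action classifier: here, anything
    # irreversible or far over budget is flagged. A real system would use a
    # model trained on labeled action traces, which is the missing piece.
    return (not action.reversible) or action.cost_usd > budget_usd

def execute_with_guard(action: ProposedAction, notify_user) -> None:
    if is_catastrophic(action):
        notify_user(f"Confirm before I proceed: {action.description}")
        return
    print(f"Executing: {action.description}")

execute_with_guard(ProposedAction("Buy Tide Pods ($24)", 24.0, True), print)
execute_with_guard(ProposedAction("Buy a Tesla ($42,000)", 42000.0, False), print)
```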
Optimism for the Future
Despite the challenges, the speaker is bullish on the future of personal local agents due to the rapid advancements in AI and the progress of open models.
Open Models’ Compounding Intelligence
Open models are compounding in intelligence faster than closed models because resources are pooled across a broad community [00:17:11]. Whereas companies like OpenAI or Anthropic can only improve their own proprietary models, open models benefit from coordinated effort around the globe [00:17:19]. This has been evident with the releases of Llama, Mistral, and GGUF [00:17:37]. Historically, open-source projects, once they reach critical mass, tend to win in unprecedented ways, as Linux did [00:18:05]. The speaker believes that open models will eventually surpass closed models in intelligence per dollar invested [00:18:22].
Current Efforts
- Ross Taylor is working on plugging the reasoning gap between open and closed models by releasing open reasoning data [00:19:14].
- PyTorch is actively working on enabling local agents and addressing the technical challenges, and is hiring AI and systems engineers [00:19:27].
- LlamaCon, an event focusing on Llama-related developments, is scheduled for April 29th [00:19:55].