From: aidotengineer

Anthropic, an AI safety and research company, focuses on building safe and effective large language models [00:01:26]. A distinguishing research direction for Anthropic is interpretability [00:02:34]. This involves reverse engineering the models to understand how and why they are thinking [00:02:39], and developing capabilities to steer them for specific use cases [00:02:46].

Stages of Interpretability Research

Interpretability research is still in its early stages [00:02:53]. It is approached in stages that build upon each other [00:03:04]:

  • Understanding: Grasping AI decision-making [00:03:07].
  • Detection: Understanding specific behaviors and assigning labels to them [00:03:10].
  • Steering: Influencing the AI’s input [00:03:15].
  • Explainability: Unlocking business value from interpretability methods [00:03:22].

Methods and Goals

Anthropic’s interpretability team uses methods to understand feature activations at the model level [00:03:38]. They have published research on this, including “Towards Monosemanticity” and “Scaling Monosemanticity” [00:03:44]. As the technology improves, it could make detection of model behaviors more reliable, providing a deeper grasp of model thinking and behavior [00:03:52], and could even help discover “sleeper agents” for safety reasons [00:04:02].

Interpretability is expected to provide significant improvements in AI safety, reliability, and usability in the long term [00:03:31].
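
To make "feature activations" concrete, here is a minimal toy sketch in Python. It is not Anthropic's actual method or code: it only illustrates the general idea of scoring an internal activation vector against a set of feature directions and keeping the positive scores, loosely in the spirit of the dictionary-learning approach in the Monosemanticity work. The feature names, vectors, and biases are all made-up assumptions.

```python
# Toy sketch: read "feature activations" from a model's internal activation vector.
# Every number and feature name here is hypothetical, chosen only for illustration.

def relu(x):
    """Keep positive scores, clamp negative ones to zero."""
    return max(0.0, x)

def feature_activations(activation, directions, biases):
    """Project the activation onto each feature direction, add a bias, apply ReLU."""
    result = {}
    for name, direction in directions.items():
        score = sum(a * d for a, d in zip(activation, direction)) + biases[name]
        result[name] = relu(score)
    return result

# Two hypothetical feature directions in a 4-dimensional activation space.
directions = {
    "famous_nba_players": [1.0, 0.0, 1.0, 0.0],
    "golden_gate_bridge": [0.0, 1.0, 0.0, 1.0],
}
biases = {"famous_nba_players": -0.5, "golden_gate_bridge": -0.5}

# Pretend this vector was recorded while the model answered a question about NBA scores.
activation = [0.9, 0.1, 0.8, 0.0]

acts = feature_activations(activation, directions, biases)
active = [name for name, value in acts.items() if value > 0]
print(active)
```

In this toy setup only the hypothetical "famous_nba_players" feature ends up with a positive activation. A real model has far more dimensions and features, and the feature directions are learned rather than written by hand.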

Examples

  • Feature Activation: When a model is asked about NBA scores and responds with “Steph Curry scored 30 points,” it activates a specific feature, like “feature number 304: famous NBA players” [00:04:09]. This represents a group of neurons activating in a recognizable pattern identified across all mentions of famous basketball players [00:04:27].
  • Model Steering (Golden Gate Claude): “Golden Gate Claude” was an example of steering the model [00:04:41]. With the activation amplified in the “Golden Gate” direction, a user who asked “What should I paint my bedroom?” would get suggestions from Claude like “paint it red like the Golden Gate Bridge” [00:04:43].
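
The steering example above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure Anthropic used for Golden Gate Claude: it treats a feature as a fixed direction in activation space and "amplifies" it by adding a scaled copy of that direction to the activation vector. The direction, the activation values, and the `strength` parameter are hypothetical.

```python
# Toy sketch: steer an activation vector toward a hypothetical feature direction.

def steer(activation, direction, strength):
    """Return a new activation with `strength` times the direction added to it."""
    return [a + strength * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Dot product: how strongly the activation points along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical "Golden Gate" direction in a 4-dimensional activation space.
golden_gate_direction = [0.0, 1.0, 0.0, 1.0]

# Pretend this vector was recorded for the prompt "What should I paint my bedroom?"
activation = [0.5, 0.1, 0.5, 0.0]

steered = steer(activation, golden_gate_direction, strength=5.0)

before = projection(activation, golden_gate_direction)
after = projection(steered, golden_gate_direction)
print(before, after)
```

After steering, the activation points much more strongly along the hypothetical "Golden Gate" direction, while the components outside that direction are unchanged. In a real model, the modified activation would be passed on through the remaining layers, which is how the amplified feature could end up shaping the response.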