From: aidotengineer
Anthropic, an AI safety and research company, focuses on building safe and effective large language models [00:01:26]. A distinguishing research direction for Anthropic is interpretability [00:02:34]. This involves reverse engineering the models to understand how and why they are thinking [00:02:39], and developing capabilities to steer them for specific use cases [00:02:46].
Stages of Interpretability Research
Interpretability research is still in its early stages [00:02:53]. It is approached in stages that build upon each other [00:03:04]:
- Understanding: Grasping AI decision-making [00:03:07].
- Detection: Understanding specific behaviors and assigning labels to them [00:03:10].
- Steering: Influencing the AI’s input [00:03:15].
- Explainability: Unlocking business value from interpretability methods [00:03:22].
Methods and Goals
Anthropic’s interpretability team studies feature activations inside the model [00:03:38]. The team has published research on “Monosemanticity” and “Scaling Monosemanticity” [00:03:44]. As the technology improves, it could enable better detection, providing a deeper grasp of model thinking and behavior [00:03:52], and could even uncover “sleeper agents” for safety reasons [00:04:02].
Interpretability is expected to provide significant improvements in AI safety, reliability, and usability in the long term [00:03:31].
Examples
- Feature Activation: When a model is asked about NBA scores and responds with “Steph Curry scored 30 points,” it activates a specific feature, like “feature number 304: famous NBA players” [00:04:09]. This represents a group of neurons activating in a recognizable pattern identified across all mentions of famous basketball players [00:04:27].
- Model Steering (Golden Gate Claude): “Golden Gate Claude” is an example of steering a model [00:04:41]. With the activation amplified in the “Golden Gate” direction, a user who asked, “What should I paint my bedroom?”, would get suggestions from Claude like “paint it red like the Golden Gate Bridge” [00:04:43].
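
The two examples above can be illustrated with a toy sketch. It is not Anthropic's implementation: it assumes a feature can be treated as a direction in a model's hidden state, that its activation is the (non-negative) projection of the hidden state onto that direction, and that steering means adding a scaled copy of the direction. The 4-dimensional vectors, the `golden_gate` direction, and the function names are all hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def feature_activation(hidden, direction):
    """Projection of the hidden state onto the feature direction, floored at zero."""
    return max(0.0, dot(hidden, normalize(direction)))


def steer(hidden, direction, strength):
    """Amplify a feature by adding `strength` times its unit direction."""
    unit = normalize(direction)
    return [h + strength * u for h, u in zip(hidden, unit)]


# Hypothetical feature direction and hidden state (real models use
# thousands of dimensions and learned, not hand-written, directions).
golden_gate = [0.0, 3.0, 0.0, 4.0]
hidden = [1.0, 0.5, -2.0, 0.25]

before = feature_activation(hidden, golden_gate)
steered = steer(hidden, golden_gate, strength=10.0)
after = feature_activation(steered, golden_gate)

print(f"activation before steering: {before:.2f}")
print(f"activation after steering:  {after:.2f}")
```

In this sketch, steering raises the feature's activation by exactly the chosen strength while leaving the parts of the hidden state orthogonal to the feature direction unchanged, which is one simple way to picture how amplifying a single direction could bias every answer toward one topic.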