From: jimruttshow8596
Melanie Mitchell, a professor at the Santa Fe Institute and author of “Artificial Intelligence: A Guide for Thinking Humans,” discusses the performance of artificial intelligence (AI) systems, particularly large language models (LLMs) like GPT-3.5 and GPT-4, on standardized tests and the implications for understanding AI capabilities [00:00:32].
Initial Assessments of GPT-3.5
In early 2023, there was debate over whether AI, specifically ChatGPT running GPT-3.5, had genuinely passed graduate-level exams [00:01:36]. Mitchell’s own analysis raised concerns about the validity of such claims [00:01:42].
A key issue with testing LLMs is determining if they are genuinely understanding concepts or merely “memorizing or compressing” previous text from their training data [00:04:32]. Standardized tests designed for humans assume that the test-taker has not memorized vast amounts of information like “all of Wikipedia” or “all of GitHub code” [00:03:31].
Sensitivity to Prompts
GPT-3.5 exhibited significant sensitivity to prompts [00:05:32]. For instance, when an exam question from a Wharton MBA professor, one that GPT-3.5 had originally answered at an “A plus” level, was rephrased as a different “word scenario,” the model performed poorly [00:05:52]. This raises the question of whether the model understands the underlying concepts or is relying on specific linguistic patterns [00:05:49].
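One way to probe this sensitivity directly is to pose the same underlying problem in two different word scenarios and compare the answers. The sketch below is only a minimal illustration, assuming the openai Python client (v1.x) and an API key in the environment; the two paraphrased prompts are invented placeholders, not the actual Wharton exam items.

```python
# Minimal sketch: probe prompt sensitivity by asking the same underlying
# question in two different word scenarios and comparing the answers.
# Assumes the openai Python client (v1.x) with OPENAI_API_KEY set; the
# prompts below are illustrative placeholders, not the real exam items.
from openai import OpenAI

client = OpenAI()

PARAPHRASES = [
    "A factory runs two machines. Machine A produces 100 units/hour and "
    "Machine B 150 units/hour, but Machine B needs 1 hour of setup. How "
    "should a fixed 8-hour shift be split to maximize output?",
    "A bakery has two ovens. Oven A bakes 100 loaves/hour and Oven B 150 "
    "loaves/hour, but Oven B needs 1 hour to preheat. How should an "
    "8-hour day be allocated to maximize loaves baked?",
]

def ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Return the model's answer at temperature 0 for repeatability."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    for prompt in PARAPHRASES:
        print(f"PROMPT: {prompt[:60]}...")
        print(f"ANSWER: {ask(prompt)}\n{'-' * 40}")
    # A model that grasps the underlying concept should give structurally
    # equivalent answers; large divergence suggests reliance on surface
    # linguistic patterns.
```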
Vocabulary IQ Test
A vocabulary IQ test (VIQT), designed to assess vocabulary knowledge and general intelligence, yielded an IQ score of 119 for GPT-3.5, roughly the level of a “four-year college grad from a third-tier State University.” Mitchell notes that this is an ideal task for a language model, given its extensive linguistic training [00:22:32]. Even so, it did not “totally ace it,” getting 38 of 45 questions correct [00:22:46].
GPT-4 Performance
OpenAI’s technical report for GPT-4 claims significant improvements, with the model performing well on various standardized exams [00:02:29]. However, it notably did not score well on AP English [00:02:40].
A key challenge with evaluating GPT-4 is the lack of transparency from OpenAI [00:06:31]. Researchers do not have full access to the model used for testing or the exact test materials, making independent probing and scientific verification difficult [00:06:21]. This has led some to jokingly call OpenAI “Closed AI” [00:06:44].
GPT-4 appears to be more accurate and less prone to “hallucinations” than GPT-3.5 [00:10:11]. For example, when asked to list the 10 most prominent guests on “The Jim Rutt Show,” GPT-3.5 hallucinated 8 out of 10, whereas GPT-4 correctly identified 9 out of 10, with the one hallucination being highly plausible [00:13:20].
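The guest-list test amounts to measuring a hallucination rate against known ground truth. Below is a minimal sketch of that bookkeeping; the names are placeholders rather than the real guest lists discussed in the episode.

```python
# Minimal sketch of the guest-list hallucination check: compare names the
# model claims appeared on the show against a ground-truth set of actual
# guests. All names here are hypothetical placeholders.

def hallucination_rate(claimed: list[str], actual: set[str]) -> float:
    """Fraction of claimed guests who never actually appeared."""
    def norm(name: str) -> str:
        return name.strip().lower()
    actual_norm = {norm(n) for n in actual}
    fabricated = [n for n in claimed if norm(n) not in actual_norm]
    return len(fabricated) / len(claimed) if claimed else 0.0

# Placeholder ground truth and model output (not the real names).
ACTUAL_GUESTS = {"Alice Adams", "Bob Brown", "Carol Clark", "Dan Davis"}
MODEL_CLAIMED = ["Alice Adams", "Bob Brown", "Eve Evans", "Frank Ford"]

print(f"hallucination rate: {hallucination_rate(MODEL_CLAIMED, ACTUAL_GUESTS):.0%}")
# -> 50% in this toy example; in the episode's informal test GPT-3.5 was
#    around 80% (8 of 10) and GPT-4 around 10% (1 of 10).
```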
Challenges in Assessing AI Understanding
The term “understanding” itself is “not well understood” in the context of AI and human cognition [00:31:21]. Melanie Mitchell and David Krakauer’s paper, “The Debate Over Understanding in AI’s Large Language Models,” explores this [00:31:12].
There are two main perspectives:
- AI as “understanders”: Proponents argue LLMs can understand human language and the world in a similar way to humans [00:30:06].
- AI as “stochastic parrots”: Critics argue LLMs merely “parrot” or sophisticatedly compute the probability of the next word, without true comprehension [00:30:32].
Differences from Human Cognition
Humans develop “compressed models of the world” [00:35:38]. This is partly due to constraints like small working memory, which forces the creation of abstractions and compressions [00:36:21]. LLMs, with their massive “context window” (e.g., 32k tokens for GPT-4), do not face the same evolutionary pressure for compression [00:36:08].
Another key difference is the lack of “long-term memory” in LLMs that resembles human episodic memory, which contributes to a “sense of self” [00:38:54]. LLMs also lack “grounding” in bodily sensations and physical experience, which underpins human language function [00:45:49]. This raises the empirical question of whether language alone is rich enough to convey intuitive physics or psychology models [00:46:53].
The Problem of Hallucinations
LLMs, even when incorrect, are as “confident about their mistakes as they are about their correct answers” [00:19:50]. Unlike humans, who often exhibit linguistic differences when lying, LLMs do not “know they’re lying” because they lack a model of truth [00:20:50]. Their responses are based purely on statistics, meaning “the statistics of something that’s untruthful to them is equal to the statistics that something’s truthful” [00:21:03].
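This point can be made concrete by inspecting per-token probabilities: the model can assign probability to a fabricated continuation that is comparable to a correct one, because probability reflects training-text statistics rather than truth. The sketch below assumes the openai Python client (v1.x), which exposes token log probabilities for chat completions; the model name and example questions are illustrative.

```python
# Minimal sketch: inspect per-token probabilities to see that a model can
# be just as "confident" about a fabricated answer as a correct one.
# Assumes the openai Python client (v1.x); model and questions are
# illustrative assumptions, not from the episode.
import math
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str, model: str = "gpt-4") -> None:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,
        logprobs=True,      # return per-token log probabilities
        max_tokens=30,
    )
    choice = resp.choices[0]
    print("ANSWER:", choice.message.content)
    for tok in choice.logprobs.content:
        # Convert log probability to probability for readability.
        print(f"  {tok.token!r}: p={math.exp(tok.logprob):.2f}")

# Compare a question the model likely answers correctly with one that
# invites fabrication; the per-token probabilities can look similar
# either way, since neither is checked against a model of truth.
answer_with_confidence("Who wrote 'Artificial Intelligence: A Guide for Thinking Humans'?")
answer_with_confidence("List three peer-reviewed papers proving GPT-4 is conscious.")
```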
Future Directions for AI Assessment
There is a pressing need to develop “right assessments” for AI systems that can predict their abilities in real-world tasks [00:25:02]. Initiatives like Stanford’s “Holistic Evaluation of Language Models” (HELM), which involved “about 40 co-authors,” are working towards this goal [00:25:40].
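To illustrate what “holistic” evaluation means in practice, the toy harness below scores a model separately on several scenarios, each with its own metric, instead of reporting a single headline number. This is only a conceptual sketch, not HELM’s actual API; the scenarios, metrics, and stub model are invented placeholders.

```python
# Toy illustration of holistic evaluation (not HELM's actual API): run one
# model over several scenarios, each with its own metric, and report a
# per-scenario score rather than one aggregate number.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    examples: list[tuple[str, str]]        # (prompt, reference answer)
    metric: Callable[[str, str], float]    # (prediction, reference) -> score

def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

def evaluate(model: Callable[[str], str],
             scenarios: list[Scenario]) -> dict[str, float]:
    report = {}
    for s in scenarios:
        scores = [s.metric(model(prompt), ref) for prompt, ref in s.examples]
        report[s.name] = sum(scores) / len(scores)
    return report

# Usage: plug in any callable mapping a prompt to a completion.
scenarios = [
    Scenario("arithmetic", [("2+2=", "4"), ("7*6=", "42")], exact_match),
    Scenario("vocabulary", [("Synonym of 'happy':", "glad")], exact_match),
]
print(evaluate(lambda p: "4", scenarios))  # stub model for demonstration
```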
Some ideas for future testing and research include:
- A “College Board for LLMs”: A company dedicated to testing language models [00:25:58].
- Open-source models for scientific research: Projects like the joint venture between EleutherAI and Stability AI aim to release open-source models, software, and datasets, allowing for more rigorous scientific experimentation and the study of “phase changes” or emergent properties at different scales [00:07:35].
- AI bootstrapping: Using AI to refine its own prompts or critique its own answers could potentially lead to better performance (see the sketch after this list) [00:11:47].
- External memory and intentional mechanisms: Researchers are exploring ways to augment LLMs with external memory hierarchies and intentional processes to make them more human-like [00:38:00].
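As a rough illustration of the bootstrapping idea above, the sketch below has a model draft an answer, critique its own draft, and then revise it. It assumes the openai Python client (v1.x); the prompts and the draft-critique-revise structure are illustrative assumptions, not a procedure described in the episode.

```python
# Minimal sketch of "AI bootstrapping": the model drafts an answer,
# critiques its own draft, then revises. Assumes the openai Python client
# (v1.x); prompts and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def draft_critique_revise(question: str) -> str:
    draft = chat(question)
    critique = chat(
        "Critique the following answer for factual errors, unsupported "
        f"claims, and gaps.\n\nQuestion: {question}\n\nAnswer: {draft}"
    )
    revised = chat(
        "Rewrite the answer, fixing every problem raised in the critique.\n\n"
        f"Question: {question}\n\nAnswer: {draft}\n\nCritique: {critique}"
    )
    return revised

print(draft_critique_revise(
    "Explain why standardized tests designed for humans can overstate "
    "the capabilities of large language models."))
```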
Ultimately, LLMs represent a technology akin to the personal computer (PC) era, opening up “a huge number of applications” that will require human creativity to realize [00:26:23]. For the foreseeable future, the most valuable applications will likely involve “humans in the loop,” especially in fields like legal and accounting where human oversight can mitigate the risk of unpredictable errors [00:34:00].