From: aidotengineer
Travis Fry Singinger, technical director of AI ETHLite, presents insights into why prompting Large Language Models (LLMs) “feels like magic but isn’t” [00:00:06]. His perspective, developed through a journey of experiments, analysis, and theory formulation, accounts for the effectiveness of LLMs despite their lack of true intelligence or intent [00:00:14].
Early Experiences with LLMs
In November 2022, the release of GPT-3.5 generated significant hype [00:00:37]. While it showed advances in areas like improving emails, it was often “brittle in its understanding”: its surface-level fluency collapsed at edge cases, and it was highly sensitive to prompt wording [00:01:02]. The speaker initially found it overhyped [00:01:17].
However, the subsequent release of GPT-4 in early 2023 marked a significant shift [00:01:22]. Its outputs felt like genuine understanding, going beyond mere text generation to display something akin to comprehension [00:02:17]. This observation was shared by others, including Microsoft Research, which published “Sparks of Artificial General Intelligence: Early Experiments with GPT-4” [00:01:37]. The experience prompted further exploration into the capabilities of these large language models [00:02:38].
Experimental Approach
Driven by an engineering and scientific background, the speaker embarked on a series of experiments to understand this perceived intelligence [00:02:43]. The journey began unconventionally with a live stream, chosen for its low friction [00:03:01].
AI-Assisted Programming (Vibe Coding)
The initial experiments focused on paired programming, termed “chat-assisted programming” or “vibe coding” [00:03:24]. Producing a few hundred lines of usable code took considerable effort [00:03:34], but these sessions served as prototypes for the AI pair-programming techniques still used today in tools like Cursor [00:03:49]. One useful utility that came out of this work was Webcat: a Python Azure Function that scrapes web pages for content, giving early GPT-4 models (which lacked internet access) access to external information [00:04:00]. This tool significantly enhanced the chatbot’s usefulness for specific problem-solving tasks [00:04:25].
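The talk does not show Webcat’s source, but a minimal sketch of such a utility might look like the following; the function name, query parameter, and parsing choices are assumptions for illustration, not the actual Webcat code:

```python
# Hypothetical sketch of a Webcat-style scraper: an HTTP-triggered Azure Function
# that fetches a URL and returns its readable text for pasting into a chat prompt.
import azure.functions as func
import requests
from bs4 import BeautifulSoup


def main(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get("url")
    if not url:
        return func.HttpResponse("Missing 'url' parameter", status_code=400)

    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")

    # Strip non-content tags and collapse the page down to visible text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())

    # Truncate so the result fits comfortably in a model's context window.
    return func.HttpResponse(text[:8000], mimetype="text/plain")
```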
Collaborative Experiments: AIBuddy.software
Building on this, deeper collaborative experiments led to the creation of the blog AIBuddy.software [00:04:33]. The objective was to build the blog collaboratively with AI, leveraging its capabilities for content generation [00:04:47]. The AI assisted in selecting the platform (Ghost) and provided installation instructions. During live streams, Webcat was used to pull in article snippets, and the AI helped develop blog concepts, which led to a successful thought leadership blog [00:05:11].
Creative Collaboration: Feline Metal Album
To push the boundaries further, a creative project was undertaken: producing a concept album titled “Mr. Fluff’s Reign of Tiny Terror” [00:06:09]. This involved using ChatGPT for lyrics and music composition, alongside image editing to maintain visual consistency [00:06:23]. A key aspect was the ability of ChatGPT’s image generation to refine outputs, a new feature at the time [00:06:33]. Despite initial skepticism about producing anything of value from “cat metal” generated by AI, the project garnered over 3,000 views and significant positive feedback within a month of being uploaded to YouTube [00:07:01]. This demonstrated the AI’s capability to assist in creating valuable content beyond the user’s individual capabilities [00:07:22].
Analysis of AI Interactions: The AI Decision Loop
The success across various domains prompted a deeper investigation into the behaviors underlying effective AI interaction [00:07:45]. To examine “decision intelligence” and “pairing behavior,” the speaker applied the same “vibe coding” skills to build an analysis tool [00:08:11]. The tool processed chat history exported from ChatGPT, applying prompts to extract qualitative and quantitative metrics related to those behaviors [00:08:24]. The findings were documented in a 21-page research paper [00:08:50].
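The paper’s actual instrument is not reproduced in the talk, but the tool’s core move (feeding exported chat history back through a model with a scoring prompt) could be sketched roughly as follows; the rubric wording, metric names, and export handling are assumptions:

```python
# Rough sketch of an analysis pass over an exported ChatGPT history (conversations.json).
# The rubric and metric fields are illustrative, not the paper's actual instrument.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are scoring a human-AI pairing session. Return JSON with fields "
    "'decision_intelligence' and 'pairing_behavior', each scored 1-5 with a one-line rationale."
)

def score_conversation(conversation_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": conversation_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

with open("conversations.json") as f:
    conversations = json.load(f)

# The real export nests messages; here each conversation is naively flattened and truncated.
for convo in conversations[:10]:
    print(score_conversation(json.dumps(convo)[:20000]))
```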
This analysis led to the formulation of the AI Decision Loop:
- Frame: Define the problem and context (akin to prompt engineering) [00:09:35].
- Generate: Produce outputs based on the prompt [00:09:45].
- Judge: Evaluate the quality and fit of the output [00:09:57].
- Validate: (Optional, for higher rigor) Check against external requirements [00:10:04].
- Iterate: Refine the prompt and re-engage the model based on what was right or wrong in the previous output [00:10:15].
A simplified version, the “nudge and iterate” framework, condenses this to: Frame, Generate, Judge, Iterate [00:10:38]. Following this iterative process leads to more reliable outputs and is crucial for successful interactions with large language models [00:10:55].
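As a rough illustration, the simplified “frame, generate, judge, iterate” loop could be wired up as below; the judging criterion, model choice, and iteration limit are assumptions for the sketch, and in practice the judge could be a human or a test suite rather than another model call:

```python
# Minimal sketch of the frame / generate / judge / iterate loop.
from openai import OpenAI

client = OpenAI()

def generate(prompt: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return result.choices[0].message.content

def judge(task: str, output: str) -> str:
    # Returns "PASS" or a short critique used to nudge the next iteration.
    verdict = generate(
        f"Task: {task}\n\nCandidate output:\n{output}\n\n"
        "Reply PASS if the output fully satisfies the task, "
        "otherwise give one sentence on what to fix."
    )
    return verdict.strip()

def decision_loop(task: str, max_iterations: int = 4) -> str:
    prompt = f"Frame: {task}"           # Frame: define the problem and context
    output = ""
    for _ in range(max_iterations):
        output = generate(prompt)       # Generate: produce an output
        critique = judge(task, output)  # Judge: evaluate quality and fit
        if critique.upper().startswith("PASS"):
            return output
        # Iterate: fold the critique back into the prompt and try again
        prompt = f"Frame: {task}\nPrevious attempt:\n{output}\nFix this issue: {critique}"
    return output

print(decision_loop("Write a haiku about coherence in language models."))
```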
Understanding LLM Coherence
Even with effective interaction mechanics, the question remained of why LLMs work so well without being intelligent [00:11:12]. The answer lies in coherence, understood as a system property rather than a cognitive one [00:11:59].
Key properties of coherence:
- Relevant: Outputs feel topical, connected, and purposeful [00:12:24].
- Consistent: The model maintains tone, terminology, and structure across multiple interactions [00:12:31].
- Stable: The model withstands pressure, questioning, or competing theories without collapsing, instead firming up or course correcting [00:12:51]. This stability was a significant improvement over earlier models like GPT-3.5 [00:13:15].
- Emergent Property: Coherence is not explicitly trained but arises from the model’s structure. For example, GPT-4o, without specific training, can diagnose swine disease or certain cancers through “coherent pattern alignment” [00:13:21].
Mechanics of Coherence: Superposition and Force Vectors
Unlike traditional neural networks where a concept might be stored in a single neuron, LLMs utilize superposition [00:14:13]. This allows the network to represent complex ideas with fewer parameters, packing more nuance into the same space [00:14:16]. As context accumulates, the network resolves ambiguity into a coherent output [00:14:23]. Meaning is thus “constructed on demand from distributed sparks of possibility,” rather than simply retrieved [00:14:49].
Prompts are considered force vectors in the high-dimensional latent space of an AI model [00:14:59]. Each prompt sets a specific direction, aligning patterns within the model [00:15:10]. When a model is prompted, external context activates conceptual clouds (sub-networks) within its latent space, which then merge to create a new, coherent idea [00:15:43]. This process results in “essence reconstruction” – the ability to recreate the core of an idea or combine multiple essences to form something new [00:17:08]. This is why hallucinations, though factually incorrect, often still “feel correct” – they are compelling patterns rather than intelligent assertions [00:17:20].
Engineering Implications of Coherence
Framing systems as coherent rather than intelligent changes the approach to AI engineering [00:17:35]:
- Hallucinations as Indicators: Hallucinations are a system feature, an emergent behavior indicating that the model is trying to complete a pattern [00:17:45].
- RAG as Factual Anchors: Retrieval Augmented Generation (RAG) fragments act as “factual anchors,” providing contextual gravity that pulls the model’s output towards reality [00:18:11].
- Three-Layer Model:
  - Layer 1 (Latent Space): The internal model structure of concepts, weights, and activations [00:18:43].
  - Layer 2 (Execution Layer): Tools, APIs, and retrieval mechanisms that bring external context into Layer 1 [00:18:51].
  - Layer 3 (Conversational Interface): Where human intent and thought pass to the machine, grounding Layers 1 and 2 in actionable value [00:19:00].
Guidelines for building for coherence:
- Prompts as Interfaces: Treat prompts as components within a larger system, not one-off interactions [00:19:19].
- Use RAG to Ground: Leverage dense, relevant context as “coherency anchors” to steer generation [00:19:26] (see the sketch after this list).
- Design for Emergence: Accept that the system is not deterministic and build around the “frame, generate, judge, iterate” loop [00:19:34].
- Avoid Fragile Chains: Long reasoning chains can break coherence; keep chains modular and reinforce context at each point [00:19:42].
- Watch for Breakdowns: Early signs of coherence loss (e.g., changes in tone, structure, or flow) indicate the model is losing context and needs debugging or adjustment [00:19:53].
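To make the “coherency anchors” guideline concrete, a minimal sketch of grounding a prompt with retrieved fragments might look like this; the retriever, prompt template, and model choice are assumptions, not a specific library’s API:

```python
# Minimal sketch of RAG-style grounding: retrieved fragments are injected into the
# prompt as "factual anchors" so generation is pulled toward the provided context.
from openai import OpenAI

client = OpenAI()

def retrieve(query: str) -> list[str]:
    # Placeholder retriever: a real system would query a vector store or search index.
    return [
        "Fragment 1: ... text pulled from your knowledge base ...",
        "Fragment 2: ... another relevant passage ...",
    ]

def grounded_answer(question: str) -> str:
    anchors = "\n\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below. "
        "If the context does not cover the answer, say so.\n\n"
        f"Context:\n{anchors}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

print(grounded_answer("What does the internal design doc say about retry limits?"))
```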
In conclusion, LLMs are best understood as “high-dimensional mirrors” [00:20:17]. They don’t think but resonate through structure, sometimes reflecting back a sharper result than the input [00:20:26]. Their superpower is coherence, and the perceived “magic” lies in the collaborative dance between human intent and the model’s ability to create compelling, consistent patterns [00:20:34]. The focus should be on designing for “structured resonance” rather than chasing artificial intelligence [00:20:40].