From: aidotengineer

The AI SDK is a toolkit designed for building agents [00:00:07] and developing AI applications [00:00:02]. This article explores its core functionalities, focusing on how language models can be leveraged to generate text, perform actions via tools, and produce structured data.

Fundamentals

The AI SDK offers fundamental building blocks necessary for interacting with language models [00:00:17]. The project typically involves a single index.ts file, runnable via pnpm run dev [00:00:46].

generateText Function

The generateText function is the primary way to call a large language model to generate text [00:01:06].

It requires specifying a model, such as OpenAI’s GPT-4o mini, and a prompt [00:01:41]. For instance, prompting “hello world” to GPT-4o mini might return “Hello, how can I assist you today?” [00:01:48]. This function can accept either a direct string prompt or an array of messages, where each message has a role and content [00:02:07].
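A minimal sketch of that call, assuming the ai and @ai-sdk/openai packages are installed and an OPENAI_API_KEY is set in the environment:

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Prompt as a plain string.
const { text } = await generateText({
  model: openai('gpt-4o-mini'),
  prompt: 'hello world',
});
console.log(text); // e.g. "Hello! How can I assist you today?"

// The same call with an array of messages, each with a role and content.
const { text: reply } = await generateText({
  model: openai('gpt-4o-mini'),
  messages: [{ role: 'user', content: 'hello world' }],
});
```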

Unified Interface

A core feature of the AI SDK is its unified interface, allowing developers to switch between different language models by changing a single line of code [00:02:37]. This flexibility is beneficial for reasons such as cost-efficiency, speed, or performance in specific use cases [00:02:50].

Models like GPT-4o mini have training data cutoffs (e.g., 2024), limiting their knowledge of events after that date [00:03:16]. For information requiring real-time data, models with built-in web search capabilities, such as Perplexity’s Sonar Pro, can be used [00:03:52]. Perplexity’s responses include references or sources, accessible via the sources property in the AI SDK [00:04:40]. Google’s Gemini 1.5 Flash also supports search grounding for similar functionality [00:05:33]. The SDK supports a wide range of providers, detailed in its documentation [00:05:10].
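As a rough sketch of that one-line model swap (the @ai-sdk/perplexity provider and the sources property are assumptions based on current SDK versions):

```ts
import { generateText } from 'ai';
import { perplexity } from '@ai-sdk/perplexity';

// Only the model line changes compared to the OpenAI example above.
const { text, sources } = await generateText({
  model: perplexity('sonar-pro'), // was: openai('gpt-4o-mini')
  prompt: 'When was the AI Engineer Summit in 2025?',
});

console.log(text);
console.log(sources); // references Perplexity used to ground its answer
```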

Tools and Function Calling

Beyond generating text, language models can interact with the outside world and perform actions through “tools” or “function calling” [00:06:15].

How Tools Work

The fundamental idea behind tools is simple: the model is given a prompt along with a list of available tools [00:06:26]. Each tool includes:

  • A name [00:06:38]
  • A description of what it does, which the model uses to decide when to invoke it [00:06:40]
  • parameters (data required to use the tool) [00:06:47]
  • An execute function, which is arbitrary asynchronous JavaScript code run when the language model calls the tool [00:08:33].

When the model decides to use a tool, it generates a “tool call” (the tool’s name and arguments parsed from the conversation context) instead of text [00:06:53]. The developer then parses and runs this tool call [00:07:13]. The AI SDK automatically parses tool calls, invokes the execute function, and returns the result in a toolResults array [00:09:24].

The tool utility function provides type safety between the defined parameters and the arguments passed to the execute function [00:08:06]. The tool’s name is the key under which it is registered in the tools object passed to generateText (e.g., addNumbers: tool({ ... })).
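A sketch of an addNumbers tool wired into generateText, assuming the same OpenAI provider and Zod for the parameters schema:

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const result = await generateText({
  model: openai('gpt-4o-mini'),
  prompt: "What's 10 + 5?",
  tools: {
    // The key ("addNumbers") is the tool's name.
    addNumbers: tool({
      description: 'Add two numbers together',
      parameters: z.object({ a: z.number(), b: z.number() }),
      // The SDK parses the tool call and invokes execute with typed arguments.
      execute: async ({ a, b }) => a + b,
    }),
  },
});

console.log(result.toolResults); // contains the addNumbers result, e.g. 15
```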

maxSteps for Autonomous Agents

By default, generation stops after a tool call, so the language model returns the tool result without producing a final text answer [00:10:05]. To get the model to incorporate tool results into a text answer, the maxSteps property can be used [00:11:35]. When maxSteps is set, if the model generates a tool call, the tool result is sent back to the model along with the previous conversation context, triggering another generation [00:11:46]. This continues until the model generates plain text or the maxSteps threshold is reached [00:12:05]. This enables the model to run autonomously, picking the next step without explicit developer logic [00:12:16].

For example, asking “what’s 10 + 5?” with an “add numbers” tool will result in a tool call and then a subsequent text generation of “10 + 5 equals 15” if maxSteps is configured [00:13:00].

Models can even make parallel tool calls within a single step [00:16:42]. For instance, getting the weather in two cities (“San Francisco” and “New York”) and then adding the two temperatures together can involve two parallel “get weather” calls followed by an “add numbers” call [00:15:45]. The language model can infer parameters like latitude and longitude from the city names provided in the prompt, using its training data [00:15:05].
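A sketch combining maxSteps with the two tools described above; the Open-Meteo endpoint inside getWeather is an illustrative choice, not something the SDK prescribes:

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const { text, steps } = await generateText({
  model: openai('gpt-4o-mini'),
  prompt:
    'Get the weather in San Francisco and New York, then add the two temperatures together.',
  maxSteps: 3, // feed tool results back to the model for further generations
  tools: {
    getWeather: tool({
      description: 'Get the current temperature for a location',
      // The model infers latitude/longitude from the city names in the prompt.
      parameters: z.object({
        city: z.string(),
        latitude: z.number(),
        longitude: z.number(),
      }),
      execute: async ({ city, latitude, longitude }) => {
        const res = await fetch(
          `https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current=temperature_2m`,
        );
        const data = await res.json();
        return { city, temperature: data.current.temperature_2m };
      },
    }),
    addNumbers: tool({
      description: 'Add two numbers together',
      parameters: z.object({ a: z.number(), b: z.number() }),
      execute: async ({ a, b }) => a + b,
    }),
  },
});

console.log(text); // the final text answer, produced once no more tools are called
console.log(steps.length); // how many generations the loop actually took
```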

Structured Outputs

The AI SDK provides ways to generate structured data, also known as structured outputs [00:18:38].

Methods for Structured Output

There are two primary ways to generate structured outputs:

  1. Using generateText with its experimental_output option [00:18:46].
  2. Using the dedicated generateObject function [00:18:55]. The streamObject function is also available for streaming structured output [00:19:00].
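A minimal generateObject sketch, assuming the same OpenAI provider; the “definitions of AI agents” prompt is only an example:

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const { object } = await generateObject({
  model: openai('gpt-4o-mini'),
  prompt: 'Please come up with 10 definitions for AI agents.',
  schema: z.object({
    definitions: z.array(z.string()),
  }),
});

console.log(object.definitions); // a typed string[] rather than raw text
```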

Zod for Schema Definition

Language models can be guided to output structured data using schema definitions [00:19:46]. Zod, a TypeScript validation library, is highly recommended for defining these schemas in the AI SDK [00:19:47]. Zod schemas allow for type-safe objects as output, ensuring data integrity [00:20:42].

The .describe() function in Zod can be chained onto any key or value within the schema to provide detailed instructions to the language model about the desired output for that specific value [00:23:14]. This allows for fine-grained control over the generated text within the structured output [00:23:28].
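For example, the schema from the previous sketch can carry per-field instructions via .describe() (the instruction text here is illustrative):

```ts
import { z } from 'zod';

// Same shape as before, but each array element now carries guidance for the model.
const schema = z.object({
  definitions: z.array(
    z
      .string()
      .describe('Use as much jargon as possible; keep each definition under 20 words.'),
  ),
});
```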

generateObject can also use an enum mode for schemas with a limited set of discrete values (e.g., “relevant” or “irrelevant”), simplifying the schema definition [00:40:08]. This constraint makes it easier for the language model to produce the correct output [00:40:31].
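A sketch of enum mode, anticipating the relevance check used later in the deep research clone (the prompt and sample text are made up for illustration):

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';

const searchResultText = 'San Francisco will be sunny with a high of 65°F tomorrow.';

const { object: evaluation } = await generateObject({
  model: openai('gpt-4o-mini'),
  prompt: `Is the following search result relevant to the query "weather in San Francisco"?\n${searchResultText}`,
  output: 'enum',
  enum: ['relevant', 'irrelevant'],
});

console.log(evaluation); // "relevant" or "irrelevant"
```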

Practical Project: Building a Deep Research Clone

This section demonstrates how to combine the AI SDK’s functions to build complex AI systems, specifically a “deep research clone” in Node.js [00:25:24]. Deep research tools generally take a topic, search the web, aggregate resources, and return a comprehensive report [00:26:12].

Workflow Breakdown

The deep research workflow involves several steps [00:26:50]:

  1. Input Query: Start with a general prompt [00:26:53].
  2. Generate Subqueries: Generate multiple specific search queries from the initial prompt [00:26:57].
  3. Search Web: For each subquery, search the web for relevant results [00:27:15].
  4. Analyze Results: Analyze search results for “learnings” (insights) and “follow-up questions” [00:27:19].
  5. Recursive Research: Take follow-up questions and existing research to generate new queries, recursively continuing the process while accumulating information [00:27:26].

This recursive process explores “webs of thought” by adjusting “depth” (levels of inquiry) and “breadth” (number of queries at each step) [00:27:45].

Implementation Details

generateSearchQueries

This function uses generateObject to create an array of search queries based on an initial prompt and a desired number of queries [00:29:41]. It defines a Zod schema for an array of strings to ensure structured output [00:30:38].
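A sketch of generateSearchQueries along those lines (the prompt wording and default count are assumptions, not verbatim from the talk):

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const generateSearchQueries = async (query: string, n = 3) => {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'),
    prompt: `Generate ${n} search queries for the following topic: ${query}`,
    schema: z.object({
      queries: z.array(z.string()).min(1).max(5),
    }),
  });
  return object.queries;
};

// e.g. await generateSearchQueries('What are the latest approaches to building AI agents?')
```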

searchWeb with Exa

The searchWeb function uses the Exa service to perform web searches and retrieve page content [00:32:53]. It fetches a specified number of results and can use Exa’s livecrawl option to ensure up-to-date information [00:33:59]. Importantly, it maps over the results and returns only the relevant fields (e.g., URL, title, content) to reduce token count and improve language model effectiveness [00:34:49].
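A sketch of searchWeb on top of the exa-js client; the searchAndContents call and its numResults/livecrawl options follow Exa’s documented API, but treat the exact shapes as assumptions:

```ts
import Exa from 'exa-js';

const exa = new Exa(process.env.EXA_API_KEY);

export type SearchResult = { title: string; url: string; content: string };

const searchWeb = async (query: string): Promise<SearchResult[]> => {
  const { results } = await exa.searchAndContents(query, {
    numResults: 1,
    livecrawl: 'always', // fetch fresh content rather than a cached copy
  });

  // Return only the fields the language model needs, keeping token counts down.
  return results.map((r) => ({
    title: r.title ?? '',
    url: r.url,
    content: r.text,
  }));
};
```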

searchAndProcess (Agentic Component)

This is the most complex and agentic part of the workflow [00:36:10]. It employs generateText with two tools:

  • searchWeb: Performs a web search [00:38:57]. Search results are temporarily stored in pendingSearchResults [00:39:21].
  • evaluate: Evaluates the relevance of the latest search result using generateObject in enum mode (relevant or irrelevant) [00:39:41]. If relevant, it moves the result to finalSearchResults; otherwise, it discards it [00:40:41].

The maxSteps parameter ensures this agentic loop continues until a relevant source is found or the step limit is reached [00:38:27]. The model receives feedback (“Search results are irrelevant, please search again with a more specific query”) to guide its next action [00:41:01]. To prevent redundant searches, the evaluate tool also receives a list of accumulatedSources and marks already-used URLs as irrelevant [00:52:50].
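A sketch of searchAndProcess, building on the searchWeb helper above (prompts, the step limit, and variable names are illustrative):

```ts
import { generateText, generateObject, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const searchAndProcess = async (
  query: string,
  accumulatedSources: SearchResult[],
) => {
  const pendingSearchResults: SearchResult[] = [];
  const finalSearchResults: SearchResult[] = [];

  await generateText({
    model: openai('gpt-4o-mini'),
    prompt: `Search the web for information about: ${query}`,
    system:
      'You are a researcher. For each query, search the web and then evaluate whether the results are relevant.',
    maxSteps: 5, // keep searching until a relevant source is found or the limit is hit
    tools: {
      searchWeb: tool({
        description: 'Search the web for information about a given query',
        parameters: z.object({ query: z.string() }),
        execute: async ({ query }) => {
          const results = await searchWeb(query);
          pendingSearchResults.push(...results); // hold results until evaluated
          return results;
        },
      }),
      evaluate: tool({
        description: 'Evaluate the most recent search result',
        parameters: z.object({}),
        execute: async () => {
          const pending = pendingSearchResults.pop()!;
          const { object: evaluation } = await generateObject({
            model: openai('gpt-4o-mini'),
            prompt: `Evaluate whether this search result is relevant to the query "${query}".
If the URL already appears in the existing sources, mark it as irrelevant.
<searchResult>${JSON.stringify(pending)}</searchResult>
<existingUrls>${JSON.stringify(accumulatedSources.map((s) => s.url))}</existingUrls>`,
            output: 'enum',
            enum: ['relevant', 'irrelevant'],
          });
          if (evaluation === 'relevant') finalSearchResults.push(pending);
          // The returned string is fed back to the model as the tool result.
          return evaluation === 'relevant'
            ? 'Search results are relevant. End the research for this query.'
            : 'Search results are irrelevant. Please search again with a more specific query.';
        },
      }),
    },
  });

  return finalSearchResults;
};
```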

generateLearnings

This function uses generateObject to analyze relevant search results and extract key “learnings” (insights) and “follow-up questions” [00:43:39]. It defines a Zod schema for a string learning and an array of string follow-up questions [00:44:15].
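A sketch of generateLearnings, reusing the SearchResult type from the searchWeb sketch:

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

export type Learning = { learning: string; followUpQuestions: string[] };

const generateLearnings = async (
  query: string,
  searchResult: SearchResult,
): Promise<Learning> => {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'),
    prompt: `The user is researching "${query}". The following search result was judged relevant.
Extract a key learning and follow-up questions from it.
<searchResult>${JSON.stringify(searchResult)}</searchResult>`,
    schema: z.object({
      learning: z.string(),
      followUpQuestions: z.array(z.string()),
    }),
  });
  return object;
};
```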

Deep Research Recursion

The overall deepResearch function handles the recursion, managing the “accumulated research” state [00:47:10]. This state includes the original query, active queries, search results, learnings, and completed queries [00:48:10]. The function decrements depth and breadth parameters in each recursive call to ensure termination and control the extent of the research [00:50:01].
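A sketch of the recursion and the accumulated state, built from the helpers above; the default depth and breadth values and the halving of breadth on each recursion are illustrative choices:

```ts
type Research = {
  query: string | undefined;
  queries: string[];
  searchResults: SearchResult[];
  learnings: Learning[];
  completedQueries: string[];
};

const accumulatedResearch: Research = {
  query: undefined,
  queries: [],
  searchResults: [],
  learnings: [],
  completedQueries: [],
};

const deepResearch = async (prompt: string, depth = 2, breadth = 3): Promise<Research> => {
  if (depth === 0) return accumulatedResearch; // base case: stop recursing

  accumulatedResearch.query ??= prompt; // remember the original query

  const queries = await generateSearchQueries(prompt, breadth);
  accumulatedResearch.queries.push(...queries);

  for (const query of queries) {
    const results = await searchAndProcess(query, accumulatedResearch.searchResults);
    accumulatedResearch.searchResults.push(...results);

    for (const result of results) {
      const learnings = await generateLearnings(query, result);
      accumulatedResearch.learnings.push(learnings);
      accumulatedResearch.completedQueries.push(query);

      // Recurse on the follow-up questions with reduced depth and breadth.
      const nextPrompt = `Overall research goal: ${prompt}
Previous queries: ${accumulatedResearch.completedQueries.join(', ')}
Follow-up questions: ${learnings.followUpQuestions.join(', ')}`;
      await deepResearch(nextPrompt, depth - 1, Math.ceil(breadth / 2));
    }
  }

  return accumulatedResearch;
};
```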

generateReport

Finally, the generateReport function takes the accumulatedResearch and feeds it into a large language model (e.g., a reasoning model such as o3-mini) to synthesize all the gathered information [00:54:32]. A detailed system prompt can be provided to guide the model on formatting (e.g., Markdown) and tone (e.g., “expert researcher”), ensuring the output meets specific requirements and is more structured [00:57:22]. The final report can then be written to the file system [00:55:49].
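A sketch of generateReport and the final write to disk; the choice of o3-mini and the wording of the system prompt are assumptions:

```ts
import fs from 'node:fs';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

const SYSTEM_PROMPT = `You are an expert researcher.
Write a detailed report in Markdown based only on the research provided.
Today's date is ${new Date().toISOString().slice(0, 10)}.`;

const generateReport = async (research: Research) => {
  const { text } = await generateText({
    model: openai('o3-mini'), // any capable model works; a reasoning model is assumed here
    system: SYSTEM_PROMPT,
    prompt: `Generate a report from the following research:\n${JSON.stringify(research, null, 2)}`,
  });
  return text;
};

const research = await deepResearch('What are the latest approaches to building AI agents?');
const report = await generateReport(research);
fs.writeFileSync('report.md', report);
console.log('Report written to report.md');
```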