From: hu-po

Optimization is a high-level, generic term used to describe a process by which something is improved [00:02:43]. In various fields, particularly in machine learning, different optimization methods are employed to achieve desired outcomes.

Traditional Optimization: Derivative-Based Algorithms

Most optimization processes, especially in a machine learning context, use variants of gradient descent [00:03:43]. These methods involve calculating the derivative, gradient, or slope to determine the direction of improvement [00:03:49]. For example, when training a neural network with billions of parameters, optimization involves iteratively adjusting parameters based on how small changes affect a loss function, guiding the process towards a minimum value [00:03:54].

A crucial requirement for these derivative-based algorithms is that every part of the computational pipeline must be differentiable, enabling backpropagation through the chain rule [00:04:59]. This reliance on gradients poses significant challenges for real-world applications where a gradient cannot be easily found [00:02:26].

The Complicated Loss Landscape

When optimizing a neural network, the loss landscape can be very complicated, potentially featuring multiple local minima that are not the global minimum [00:06:59]. Traditional gradient-based methods might get stuck in a shallow local minimum, requiring strategies for exploration to find deeper, global minima [00:07:31].

Discrete vs. Continuous Optimization Spaces

The nature of the optimization space profoundly impacts the applicability of traditional gradient-based methods:

Continuous Optimization

In a continuous space, variables can take on any real value within a range.

  • Linear Regression is a classic example of continuous optimization [00:50:37]. Here, the goal is to find continuous coefficients (like slope ‘m’ and y-intercept ‘b’ in y=mx+b) that best fit a set of points [00:08:15].
  • Because the parameters are continuous, their derivatives can be calculated, making them suitable for gradient-based methods [00:50:51].

Discrete Optimization

In a discrete space, variables can only take on distinct, separate values.

  • The Traveling Salesman Problem (TSP) is a classic example of discrete optimization [00:51:12]. It involves finding the shortest route visiting a set of cities exactly once and returning to the origin [01:00:38]. The “solutions” are permutations of cities, which are discrete arrangements.
  • Prompt Optimization is another discrete optimization task [00:09:42]. The goal is to find specific natural language instructions (prompts) that maximize task accuracy for a language model [00:09:42]. The space of all possible sentences is very large but discrete, as it’s composed of a limited vocabulary of tokens [00:29:32].
  • Challenge: For discrete spaces, it’s impossible to calculate a derivative, as there’s no continuous “slope” to follow [00:30:19]. This makes traditional gradient-based methods unusable [00:30:35].

Large Language Models (LLMs) as Optimizers (OPRO)

Recent research proposes using LLMs as optimizers through a method called Optimization by Prompting (OPRO) [00:05:37]. This approach leverages the LLM’s ability to understand natural language [00:22:06] and generate new solutions iteratively based on problem descriptions and previously found solutions [00:06:10].

OPRO Performance in Different Spaces

  • Continuous (Linear Regression): LLMs like Palm 2 models and GPT-4 were able to find optimal solutions for linear regression problems with a relatively small number of steps [00:53:04]. The underlying intuition suggests that the LLM performs a “pseudo slope derivative gradient” calculation in its “brain,” implicitly recognizing patterns for movement in the right direction [01:11:52].
  • Discrete (Traveling Salesman Problem): LLMs found the TSP to be significantly harder to solve, with some models (e.g., Text Bison) unable to find optimal solutions for larger node counts (e.g., 15, 20, or 50 nodes) [01:07:25]. This contrasts with their performance in continuous problems, possibly because the discrete nature offers no implicit “slope” to follow [01:07:01].
  • Discrete (Prompt Optimization): Despite challenges in some discrete mathematical problems, OPRO proved effective for prompt optimization. LLMs optimized prompts for tasks like grade school math word problems (GSM8K) and other Big-Bench Hard (BBH) tasks [00:09:42]. The best LLM-optimized prompts often outperformed human-designed prompts by a significant margin (e.g., improving accuracy from 71% to 80% on GSM8K by adding “take a deep breath and work on this problem step by step”) [00:10:55].

General Challenges and Limitations of LLM-based Optimization

  1. Sensitivity to Initial Conditions: LLM-based optimization is highly sensitive to the initial conditions or starting points of the optimization process. Different initial prompts or random seeds can lead to varied convergence speeds and outcomes [00:47:52].
  2. Exploration vs. Exploitation: A fundamental challenge in optimization is balancing exploration (searching new regions of the solution space) and exploitation (refining existing promising solutions) [00:40:15]. LLMs need to manage this balance; for instance, higher “temperature” settings can encourage more aggressive exploration [00:49:50].
  3. Context Window Limitations: Current LLMs have limited context windows, making it hard to fit large-scale optimization problem descriptions (e.g., all parameters of a large neural network) into the prompt [01:09:37].
  4. “Bumpy” Loss Landscapes: For objective functions with very “bumpy” or complex loss landscapes, LLMs might struggle to propose a correct “descending direction,” leading to the optimization getting stuck [01:10:13].
  5. Lack of Error Case Utilization: The LLM optimizers do not effectively utilize information from specific error cases in the training set to infer promising directions [01:46:11].
  6. Hyperparameter Sensitivity: The performance of OPRO, like many machine learning methods, is sensitive to hyperparameters such as the number of instructions included in the metaprompt, the order of instructions (ascending order of scores often works best), and the way accuracy scores are presented (e.g., bucketing scores into 20 bins can be more effective than 100 bins or no scores) [01:22:00].

Despite these challenges, the ability of LLMs to perform optimization, especially in natural language contexts like prompt engineering, suggests a new frontier in applying these powerful models to complex problem-solving [01:45:00].