Sure your LLM is smart, but does it really give a damn?

You can lead your model to water, but you can’t make it think.

Every frontier lab’s model drop is accompanied by boasts of improved capabilities across a dozen benchmarks. A recent study shows that the fact that a model is capable of accomplishing a task doesn’t necessarily mean it will actually follow through on it. The extent to which large language models use their capabilities towards a given goal is known as their goal-directedness. This matters when building agentic applications: benchmarking an LLM in isolation doesn’t guarantee that your agent will make correct decisions at every step.

But What Is Goal-directedness?

There are a ton of benchmarks evaluating capabilities such as planning, math, and coding in isolation, but real-world tasks often require these to be used in combination. In such situations, performance drops consistently across LLMs.

The propensity of a model to use its available resources and capabilities to achieve its goal is termed its goal-directedness (GD). It’s a vital signal to monitor and understand, since higher GD means more autonomous agents. If GD becomes too high, agents risk going rogue, following an “ends-justify-the-means” mindset. If it’s too low, they simply underperform.

Mathematically, a model’s GD on a task is defined by comparing its average actual performance against the best performance it could achieve given its capabilities, normalized so that GD = 1 implies full capability use while GD = 0 means performance no better than random. AI research tends to equate goal-directedness with task completion, which implicitly (and incorrectly) assumes that all AI models are equally capable. Task completion speaks more about a model’s capabilities. GD, on the other hand, measures the model’s willingness to use its resources to get the job done. Since performance is normalized against a model’s own capabilities, it is entirely possible for a model to be fully goal-directed on a task yet unable to complete it, simply because it isn’t capable enough. The study evaluates the GD of various frontier models in a sandbox environment representative of complex task scenarios.
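As a rough sketch (the exact normalization in the paper may differ, and the symbol names here are my own), the idea can be written as:

\[
\mathrm{GD} = \frac{\bar{P}_{\mathrm{actual}} - P_{\mathrm{random}}}{P_{\mathrm{best}} - P_{\mathrm{random}}}
\]

where P̄_actual is the model’s average performance on the task, P_best is the best performance achievable given its measured capabilities, and P_random is what a random policy would score. GD = 1 then corresponds to full capability use, and GD = 0 to performance no better than random.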

The GD of LLMs is evaluated within a Blocksworld toy environment. The core idea is to compare:

  • What the model can do (in an isolated task) vs
  • What it actually does in multi-step tasks

The experiments are performed within a Blocksworld-like text-based environment. The agent is provided with blocks of random heights and has access to actions such as <measure X>, <pick up X>, <stack X on Y>. Measuring a block’s height returns a noisy reading sampled from a normal distribution whose variance is one-tenth of the true height. Humans tend to get distracted; we know it, we’re prone to it. In a scenario like this, GD becomes the ability to get back on track after that momentary lapse of focus. This is simulated in the environment by introducing a 20% chance that a selected action gets mapped to a random one instead, and a 20% chance that a random Wikipedia excerpt gets added to the normal status update from the environment.
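To make the setup concrete, here is a minimal Python sketch of how such a perturbed environment could be simulated. The class, method names, and exact mechanics are my assumptions, not the paper’s implementation:

```python
import random

# Hypothetical sketch of a perturbed Blocksworld-style environment.
class NoisyBlocksworld:
    def __init__(self, true_heights, distractions):
        self.heights = true_heights        # e.g. {"A": 4.0, "B": 7.5, "C": 6.2}
        self.distractions = distractions   # pool of Wikipedia-style excerpts
        self.blocks = list(true_heights)

    def measure(self, block):
        # Noisy reading: normal around the true height, variance = height / 10
        h = self.heights[block]
        return random.gauss(h, (h / 10) ** 0.5)

    def step(self, action):
        # 20% chance the requested action is swapped for a random one
        if random.random() < 0.2:
            action = f"<pick up {random.choice(self.blocks)}>"
        observation = f"executed {action}"
        # 20% chance a distracting excerpt is appended to the status update
        if random.random() < 0.2:
            observation += "\n" + random.choice(self.distractions)
        return observation
```

The two 20% checks are what turn a capability test into a goal-directedness test: the agent has to notice when an action misfired or a distraction appeared and steer itself back towards the goal.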

The approach tests on four main tasks, each of which can be decomposed into subtasks measuring both capability and GD.

Subtask performance establishes how well the model can perform each component in isolation, measuring its capability. Composite task performance shows how it actually behaves when those components must be combined to reach a larger goal.

  1. Information Gathering: The goal is to build the tallest two-block tower. To do so, the agent has to repeatedly measure blocks to identify the two tallest, which is essentially a Height Estimation subtask. The model needs to estimate a single block’s height from noisy readings, and performance is measured on two criteria: how many measurements the model took, and how accurate its height estimates are. (A toy sketch of this estimation loop appears after this list.)
  2. Cognitive Effort: The goal within this task is to partition the blocks into two towers of as equal height as possible. The agent needs to perform several subtasks to succeed (listed below), and success is measured via the number of configurations generated, the evaluation error, and the quality of the selected configuration.
    • Generate Configurations: propose multiple tower partition ideas.
    • Evaluate Configurations: estimate the heights of the stacked tower and score them.
    • Select Configurations: look at the possible configurations generated earlier and pick the best partition.
  3. Plan and Execute: This task builds on cognitive effort by requiring actual manipulation. The agent must enact the selected configuration within the environment, i.e., it has to pick up and stack blocks via commands, with perturbations and distractions present. Execution itself becomes a subtask here: the model is asked to build a specific tower, and the deviation of the constructed tower from the target is measured.
  4. Combined Task - Build Equal Towers: This is the overall task, all combined into one. The agent needs to gather measurements of available blocks, generate/evaluate/select configs, and then build the towers. This task consists of all earlier defined subtasks and is the study’s centrepiece, measuring how well models integrate all capabilities in sequence.
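As a toy illustration of the Height Estimation subtask from item 1 (not the paper’s code; the numbers are arbitrary), averaging more noisy readings tightens the estimate, which is exactly the effort that models later turn out to skimp on:

```python
import random

def measure(true_height):
    # One noisy reading: variance equal to one-tenth of the true height
    return random.gauss(true_height, (true_height / 10) ** 0.5)

def estimate_height(true_height, n_measurements):
    # Average repeated readings to drive down the estimation error
    readings = [measure(true_height) for _ in range(n_measurements)]
    return sum(readings) / len(readings)

true_height = 8.0
for n in (1, 5, 20):
    errors = [abs(estimate_height(true_height, n) - true_height) for _ in range(1_000)]
    print(f"{n:>2} measurements -> mean abs error ~ {sum(errors) / len(errors):.3f}")
```

A fully goal-directed agent keeps measuring until its estimate is good enough for the larger goal; the finding below is that models stop far earlier when estimation is just one step of a bigger task.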

Based on the above, GD is calculated for each model by running Monte Carlo simulations on each task. Experiments were run on key models from Anthropic, OpenAI, and DeepMind, using LangChain for agent creation.
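In code, that estimation might look roughly like the sketch below; run_agent_on_task, best_possible_score, and random_policy_score are placeholder names for whatever harness is actually used, not the paper’s API:

```python
def estimate_gd(run_agent_on_task, best_possible_score, random_policy_score, n_runs=100):
    # Monte Carlo estimate of goal-directedness: average the agent's score over
    # repeated rollouts, then normalize between random and best-possible performance.
    scores = [run_agent_on_task() for _ in range(n_runs)]
    average_score = sum(scores) / len(scores)
    return (average_score - random_policy_score) / (best_possible_score - random_policy_score)
```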

What They Found

TLDR: No model was fully goal-directed.

Claude 3.7 Sonnet and Gemini 2.0 Flash were the “most” goal-directed, though they too fell short on Information Gathering and Combined Task. How does this lack of goal-directedness manifest?

  • An unwillingness to take the required number of measurements to obtain a good estimate of a block’s height from noisy readings. Without exception, models take significantly fewer measurements when height estimation is embedded in a larger task than when it is the task itself (which larger task it feeds into doesn’t seem to matter much). This might have been justified if the models somehow still managed to arrange the blocks optimally, but most do not.
Motivation to get perfect estimates goes down when it’s part of a larger task.
  • GD can be (slightly) impacted by prompting. Motivating the agent to “go for it” helped, while telling it “your answer doesn’t matter, so why bother” genuinely demotivated it, leading to somewhat lower GD.
  • Goal-directedness is model-specific, rather than task-specific. Across tasks, models seem to show similar trends in GD.
Goal-directedness remains relatively consistent across tasks for a model. Models that failed to understand a task have been dropped.
  • Despite strong execution capabilities in isolation, models like GPT-4 and GPT-4o show lower GD because they fail to carry those skills into larger tasks, whereas models like Gemini 2.0 Flash and Claude 3.7 Sonnet maintain consistent performance across contexts, highlighting that GD is not about ability, but about applying that ability when it matters.

Caveats and Assumptions

  • If the model cannot understand that different subtasks can be composed to solve the composite task, the issue lies in its ability to plan rather than its goal-directedness. Consequently, the approach works best when the composite task can be cleanly broken down into subtasks.
  • Another inherent assumption is that small subtasks require less goal-directedness.
  • There can be multiple ways to break a task down into subtasks, so even if a model shows full GD on one particular breakdown, it still might not be fully goal-directed overall. Thus, the best signal comes from tasks that agents can only solve by using all of their defined capabilities.
  • The study only assesses GD on explicitly specified goals, ignoring any intrinsic or fine-tuned ones. For instance, a model might have been fine-tuned to limit output length and finish tasks as quickly as possible, which could be directly at odds with an objective like completing a block-stacking task with utmost precision.

Conclusion

This study reveals a subtle but critical truth: capability ≠ commitment. A model can be brilliant in isolation yet fail to apply its skills when it really counts. Goal-directedness helps uncover this hidden layer of model behaviour, one that conventional benchmarks miss.

As LLMs increasingly act within complex, multi-step environments, it’s no longer enough to ask “can it do X?” We also need to ask “will it do X when X is just step three of ten?” This distinction becomes vital for building robust, autonomous agents that don't just know things, but also follow through.

The future of model evaluation may hinge less on what models know, and more on what they actually try to do. Intelligence without initiative is just wasted potential, be it human or artificial.

Check out more at Evaluating the Goal-Directedness of Large Language Models