Most agent teams ship based on vibes. Eval-driven development — treating evaluations as the inner loop of agent engineering — is the single highest-leverage practice for building reliable agent systems. This post explains why and outlines the practices that make it work.
Key Findings
- Without evals, agent development degrades into prompt-twiddling guided by anecdotes. Evals give you a loss function to optimize against.
- The best eval suites are fast, representative, and versioned alongside the agent code they test.
- Evals should run on every change, not as a quarterly audit. Feedback-loop speed determines how fast you can improve.
- Start with 20 hand-labeled examples. That's enough to catch most regressions and guide early development.
The problem with vibes-based development
Here's how most teams build agents today: a developer changes a prompt, runs a few examples by hand, eyeballs the output, and ships it. When something breaks in production, they add another patch to the prompt and repeat the cycle.
This works until it doesn't. And it stops working fast — usually the moment you have more than one person touching the agent, more than one task type to support, or more than one model version to evaluate.
Agent changes are non-local. A prompt update that improves one behavior can silently break another — tool use, state transitions, or task completion. Because LLM behavior is context-sensitive and partly opaque, isolated unit tests cannot validate an agent system the way they validate traditional software.
The answer is familiar from ML: develop evaluation-first. Each capability is defined by an eval and a bar. Progress is measured by test cases and metrics. The north star is an evaluation that proxies real-world impact — user satisfaction, retention, revenue. That evaluation guides development.
Why evals are the highest-leverage investment
They give you a loss function
Without evals, you're optimizing against intuition. With evals, you have a number that goes up or down. This changes agent development from an art into an engineering discipline. You can now ask: "Did this change actually help?" and get an answer that isn't "it seemed better when I tried it three times."
They compound over time
Every bug you catch, every edge case you discover, every failure mode you encounter — these become test cases. Your eval suite is institutional memory. Six months from now, when a new team member changes the retrieval strategy, the eval suite catches the regression you forgot was possible.
They enable parallelism
When two developers are working on different aspects of the agent, evals let them merge confidently. Without evals, concurrent agent development is a coordination nightmare where every change might interact with every other change in unpredictable ways.
They make model upgrades tractable
New model version drops. Is it better for your use case? Without evals, you run a handful of examples and guess. With evals, you swap the model config, run the suite, and get a clear comparison across your full task distribution in minutes.
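A minimal sketch of what that comparison can look like. The `run_agent` and `check` callables and the `evals` schema are hypothetical stand-ins for whatever harness you already have, not a specific library's API:

```python
# Sketch: score two model versions over the same eval suite.
# `run_agent(model, input)` runs the agent; `check(output, case)` returns
# True on a pass. Both are assumptions about your harness.

def score_suite(model, evals, run_agent, check):
    """Return the fraction of eval cases the model passes."""
    passed = sum(1 for case in evals if check(run_agent(model, case["input"]), case))
    return passed / len(evals)

def compare_models(old, new, evals, run_agent, check):
    old_score = score_suite(old, evals, run_agent, check)
    new_score = score_suite(new, evals, run_agent, check)
    print(f"{old}: {old_score:.0%} -> {new}: {new_score:.0%}")
    return new_score >= old_score
```

The point is that "is the new model better for us?" becomes a single function call over your full task distribution, not a vibe check on five examples.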
Best practices
1. Start small, start now
You don't need a perfect eval suite to start. Twenty hand-labeled input-output pairs covering your main use cases is enough. The first eval you write will teach you more about your agent's failure modes than a week of manual testing.
```python
# This is enough to start
evals = [
    {"input": "Book a flight to NYC next Tuesday", "expected": "calls search_flights with correct date"},
    {"input": "What meetings do I have tomorrow?", "expected": "calls get_calendar with tomorrow's date"},
    # ... 18 more
]
```

Don't let perfect be the enemy of running. A flawed eval that runs on every commit beats a perfect eval that lives in a planning doc.
2. Eval on every change
Evals that run quarterly are audits. Evals that run on every PR are development tools. The value of evals scales with how often they run, because the faster you get feedback, the faster you can iterate.
Set up your CI pipeline to run evals on every pull request. If cost is a concern, split your suite into a fast "smoke test" set (runs in under a minute, catches obvious regressions) and a full set (runs nightly or on-demand).
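One lightweight way to implement the split is to tag each case with a tier and filter at run time. The `tier` field here is an assumption about your case schema, not a standard:

```python
# Sketch: tag eval cases with a tier, run only the fast "smoke" subset
# on every PR, and run everything nightly. The `tier` field is illustrative.

SMOKE = "smoke"

evals = [
    {"input": "Book a flight to NYC next Tuesday", "tier": SMOKE},
    {"input": "Rebook my trip but keep the hotel", "tier": "full"},
]

def select(evals, on_pull_request):
    """Return the subset of cases to run for this trigger."""
    if on_pull_request:
        return [case for case in evals if case["tier"] == SMOKE]
    return evals  # nightly or on-demand: run the full suite
```

The same idea works with test-framework markers if your evals already run under one; the mechanism matters less than the guarantee that something runs on every PR.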
3. Version evals with code
Your eval suite should live in the same repository as your agent code and change in the same PRs. When you add a new capability, you add evals for it in the same commit. When you change behavior intentionally, you update the expected outputs.
This sounds obvious but most teams keep evals in a separate spreadsheet or notebook, disconnected from the development workflow. That's where evals go to die.
4. Test the trajectory, not just the output
Agent evals that only check the final answer miss most failure modes. An agent can get the right answer through a broken process — calling unnecessary tools, retrying in loops, hallucinating intermediate steps that happen to cancel out.
Eval the full trajectory: which tools were called, in what order, with what arguments, and how many steps it took.
```python
def eval_trajectory(result, expected_answer):
    assert result.final_answer == expected_answer
    assert result.tool_calls[0].name == "search_flights"  # right tool first
    assert result.num_steps <= 4  # didn't loop
    assert "hallucinated_tool" not in [t.name for t in result.tool_calls]
```

5. Use multiple eval methods at the right layer
Not everything needs LLM-as-judge. Layer your eval approach:
- Deterministic checks for structured outputs: Did the agent call the right tool? Did it return valid JSON? Did it stay within the token budget?
- Fuzzy matching for semi-structured outputs: Is the answer semantically equivalent to the reference? Does it contain the required information?
- LLM-as-judge for open-ended quality: Is the response helpful? Is the reasoning sound? This is your most expensive and noisiest signal — use it only where cheaper methods can't work.
The cheapest, fastest, most reliable eval method that can detect the failure mode you care about is the right one.
6. Track regressions, not just averages
A mean score going up can hide individual cases getting worse. Track per-case results over time. When your overall score improves from 82% to 85% but three specific cases flip from pass to fail, investigate those three cases. They're often early warnings of a systematic issue.
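A per-case diff between two runs is a few lines of code. The dict-of-booleans representation here is an assumption; any store of per-case pass/fail results works the same way:

```python
# Sketch: compare per-case results across two runs. A rising mean can
# hide individual cases flipping from pass to fail.

def find_regressions(before, after):
    """before/after map case id -> passed? Return ids that flipped to fail."""
    return sorted(cid for cid, passed in before.items()
                  if passed and not after.get(cid, False))

before = {"case_1": True, "case_2": False, "case_3": True, "case_4": False}
after  = {"case_1": True, "case_2": True, "case_3": False, "case_4": True}
# The mean improved (2/4 -> 3/4), but case_3 regressed and deserves a look.
```

Surfacing `find_regressions` output in your CI summary, next to the headline score, makes the flipped cases impossible to miss.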
The eval-driven workflow
Put it all together and the development loop looks like this:
- Observe a problem — from production logs, user feedback, or your own testing
- Write an eval case that captures the problem before you fix it
- Make the change — prompt edit, tool update, model swap, whatever
- Run the suite — confirm the fix works without regressions
- Ship with confidence — the eval suite is your proof that the change is net-positive
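Steps 1 and 2 of the loop can be as small as appending the observed failure to the suite before touching the agent. The field names below are illustrative, not a fixed schema:

```python
# Sketch: capture a production failure as an eval case before fixing it,
# with provenance so future readers know where the case came from.

def add_regression_case(evals, user_input, expected_behavior, source):
    case = {"input": user_input, "expected": expected_behavior, "source": source}
    evals.append(case)  # the fix isn't done until this case passes
    return case
```

Recording the source alongside the case is what turns the suite into the institutional memory described earlier.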
This is just test-driven development applied to a stochastic system. The principles aren't new. What's new is that most agent teams haven't internalized them yet, because the field is young and moving fast and "just ship it" feels productive until it isn't.
The teams that invest in evals early will compound that advantage. The teams that don't will keep firefighting the same classes of bugs, wondering why their agent feels fragile despite constant effort.
Start with 20 examples. Run them on every change. Go from there.
Open Questions
- What's the right ratio of deterministic to LLM-judged evals for a typical agent system? Our experience suggests heavy deterministic, light LLM-judged, but this likely varies by domain.
- How should eval suites handle non-deterministic agent behavior? Running the same eval multiple times and requiring a pass rate threshold works but is expensive.
- When does it make sense to generate eval cases synthetically vs. curating them by hand? Synthetic generation scales, but hand-curated cases tend to cover the failure modes that actually matter.
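The pass-rate approach mentioned in the second question is simple to state in code, even though the cost question remains open. `run_once` is a hypothetical callable that executes one stochastic eval run and returns True on a pass:

```python
# Sketch: run a stochastic eval n times and require a minimum
# fraction of passes rather than a single-run pass/fail.

def passes_with_threshold(run_once, n=5, threshold=0.8):
    passes = sum(run_once() for _ in range(n))
    return passes / n >= threshold
```

The open trade-off is choosing `n` and `threshold`: higher values buy statistical confidence at a linear multiple of your eval cost.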