Shuo Qiu’s Notes on AI Agent Evaluation

Benchmark

Synthetic Eval Datasets: Skip the Framework, Just Prompt

We tested DeepEval's Synthesizer and promptfoo-style templates against plain prompts across three LLMs. Neither framework helped: a basic prompt matched or beat both every time. What actually moved the needle was which model generated the data.
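To make the comparison concrete, here is a minimal sketch of the "basic prompt" approach: ask the model directly for JSON test cases, with no synthesizer framework in between. The `call_llm` function is a hypothetical stand-in for your provider's client (stubbed here so the sketch runs without API keys); the prompt wording is illustrative, not the exact prompt from our tests.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical LLM call; swap in your provider's client.
    # Stubbed response so this sketch runs offline.
    return json.dumps([
        {"input": "What is the return policy?",
         "expected": "Returns are accepted within 30 days with a receipt."},
        {"input": "Can I return an item without a receipt?",
         "expected": "No, a receipt is required for returns."},
    ])

def generate_eval_cases(doc: str, n: int = 2) -> list[dict]:
    # The plain-prompt approach: one direct request for structured cases.
    prompt = (
        f"Generate {n} evaluation cases for an agent grounded in the "
        f"document below. Return only a JSON list of objects with keys "
        f"'input' (a user question) and 'expected' (the grounded answer).\n\n"
        f"Document:\n{doc}"
    )
    return json.loads(call_llm(prompt))

cases = generate_eval_cases("Returns accepted within 30 days with receipt.")
print(len(cases), cases[0]["input"])
```

In practice you would add retry-on-parse-failure and light deduplication, but the core of the approach really is this small.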

Opinion

Eval-Driven Agent Development

Most agent teams ship based on vibes. Eval-driven development, treating evaluations as the inner loop of agent engineering, is the single highest-leverage practice for building reliable agent systems. This post explains why and outlines the practices that make it work.