Shuo Qiu’s Notes on AI Agent Evaluation
Benchmark
Synthetic Eval Datasets: Skip the Framework, Just Prompt
We tested DeepEval's Synthesizer and promptfoo-style templates against plain prompts across three LLMs. Neither framework helped: a basic prompt matched or beat both every time. What actually moved the needle was which model generated the data.
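A minimal sketch of the plain-prompt approach, assuming the OpenAI Python SDK; the prompt wording, JSON schema, and model name are illustrative placeholders, not taken from our benchmark:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt: ask for eval cases directly, no framework layer.
PROMPT = """Generate {n} evaluation cases for a customer-support agent.
Return a JSON array; each element must have "input" (a realistic user
message) and "expected" (the ideal agent response). Vary tone, length,
and difficulty. Return only the JSON array, no prose."""

def generate_cases(n: int = 10, model: str = "gpt-4o-mini") -> list[dict]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
    )
    # Production code should strip code fences and validate the schema
    # before trusting the output; kept bare here for brevity.
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    for case in generate_cases(5):
        print(case["input"], "->", case["expected"])
```

Swapping the `model` argument is the lever that mattered in our runs; everything else is boilerplate.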
Opinion
Eval-Driven Agent Development
Most agent teams ship based on vibes. Eval-driven development treats evaluations as the inner loop of agent engineering, and it is the single highest-leverage practice for building reliable agent systems. This post explains why and outlines the practices that make it work.
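To make "inner loop" concrete, here is a hypothetical sketch of the shape such a loop takes: a tiny suite you run on every agent change, like unit tests. The `EvalCase` structure, the graders, and the toy agent are assumptions for illustration, not the post's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    check: Callable[[str], bool]  # grader: does the agent's output pass?

def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Run every case through the agent and report the pass rate.
    passed = sum(case.check(agent(case.input)) for case in cases)
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({rate:.0%})")
    return rate

# Toy example: a stub agent and two string-match graders.
cases = [
    EvalCase("What is 2+2?", lambda out: "4" in out),
    EvalCase("Refund policy?", lambda out: "refund" in out.lower()),
]
run_suite(lambda q: "4" if "2+2" in q else "Our refund policy...", cases)
```

Wired into CI, a suite like this turns every prompt or tool change into a pass-rate number instead of a vibe check.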