Shuo Qiu’s Notes

Benchmark

Don’t Shop for Evaluators. Let Your Coding Agent Build One.

Don't shop for pre-built LLM judges. Have a coding agent read your real task material (code, docs, traces) and write the judge. It's faster than shopping, and on tau-bench telecom, agreement with ground truth more than doubled.

Opinion

Your Agent and Harness Aren't the Asset, Your Eval Is

The durable asset in agent development is not the prompt or the harness. It is the eval: the specification of what good looks like, where agents fail, and what customers actually need.