About
Casual notes on agent evaluation by Shuo Qiu. Posts include code and datasets where possible. Nothing here is meant to be the final word on anything — just notes from someone who spends a lot of time thinking about evals.