About

Casual notes on agent evaluation by Shuo Qiu. Posts include code and datasets where possible. Nothing here is meant to be the final word on anything — just notes from someone who spends a lot of time thinking about evals.