Benchmark

Synthetic Eval Datasets: Skip the Framework, Just Prompt

By Shuo Qiu · 8 min read

We tested DeepEval's Synthesizer and promptfoo-style templates against plain prompts across 3 LLMs. Neither framework helped — a basic prompt matched or beat both every time. What actually moved the needle was which model you used to generate.

  • Neither DeepEval nor promptfoo showed an advantage over a plain prompt. A basic custom prompt scored highest overall.
  • The generator model matters ~2x more than the method. Swapping models moved ELO twice as much as swapping methods (233 vs 121 spread).
  • LLM judges are biased toward their own model's output. ~40% of the apparent quality gap between models was judge self-preference.

Why synthetic eval datasets?

If you're building an LLM-powered pipeline — RAG, agents, tool-use chains — you need evaluation data to catch regressions and measure quality. Real user queries are ideal but slow to collect, expensive to label, and often unavailable early in development. Synthetic datasets let you generate hundreds of targeted test cases on demand, covering edge cases and failure modes you'd wait months to see organically.

The open question isn't whether to use synthetic data, but how to generate it. Most teams either reach for a framework (DeepEval, promptfoo) or write a quick prompt. We wanted to know if that choice matters.

Method

Human annotation processes typically end with a quality review step — a separate reviewer inspects the data, compares samples, and flags what doesn't hold up. We mimic this process by using LLMs as judges: given two datasets, the judge reads both and picks the better one on each quality criterion. This "pairwise comparison" approach measures the dataset directly — unlike running queries through answer models and scoring responses, which mixes up dataset quality with how good the answer model is.

We crossed 3 generation methods with 3 generator models to produce 9 datasets of 20 queries each. Every pair of datasets (36 pairs) was evaluated by all 3 models as judges — 108 comparisons total. To avoid judges favoring whichever dataset appears first, we swapped the presentation order.
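The comparison grid is small enough to enumerate directly. A sketch with illustrative dataset labels (the study's internal naming may differ):

```python
from itertools import combinations, product

methods = ["custom", "deepeval", "promptfoo"]
models = ["gpt-5.4", "sonnet-4.6", "deepseek-v3.2"]

# 3 methods x 3 models = 9 datasets
datasets = [f"{method}_{model}" for method, model in product(methods, models)]

# every unordered pair of datasets: C(9, 2) = 36 pairs
pairs = list(combinations(datasets, 2))

# each pair judged by all 3 models: 36 * 3 = 108 comparisons
comparisons = [(a, b, judge) for (a, b) in pairs for judge in models]

print(len(datasets), len(pairs), len(comparisons))  # 9 36 108
```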

Generation methods

| Method    | Approach                                                                   |
| --------- | -------------------------------------------------------------------------- |
| Custom    | Hand-crafted system prompt targeting practical developer/ops scenarios      |
| DeepEval  | DeepEval's Synthesizer class with StylingConfig for structured generation   |
| Promptfoo | Structured prompt templates targeting clear, directly answerable questions  |

Generator models

| Model         | Provider     |
| ------------- | ------------ |
| GPT-5.4       | Azure OpenAI |
| Sonnet 4.6    | Anthropic    |
| DeepSeek V3.2 | DeepInfra    |

All three models also served as judges, creating a 3x3 judge matrix that lets us measure self-preference bias.

Judging criteria

Each judge receives the full 20 queries from both datasets and scores them on a 5-point comparative scale — from A>> (strongly prefer A) through tie to B>> (strongly prefer B). The judge scores five dimensions:

  1. Diversity — topic spread, lexical variety, structural variety
  2. Difficulty — challenges the model, exposes failure modes and edge cases
  3. Validity — well-formed, unambiguous, actually testable by an LLM
  4. Realism — reflects real-world scenarios practitioners would encounter
  5. Overall — a holistic judgment of which dataset is the better eval benchmark

The overall score is a direct LLM judgment, not a weighted average of the four criteria. The judge is free to weigh trade-offs as it sees fit — which is exactly what makes the self-preference analysis interesting later. All judges run at temperature=0.0 and return structured JSON with a reasoning field explaining their verdict. The full judge prompt is included at the bottom of this post.
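Since the judge returns structured JSON, malformed replies are worth catching before they enter the analysis. A minimal validator, using the field names and verdict labels from the judge prompt reproduced at the end of this post:

```python
import json

VALID_VERDICTS = {"A>>", "A>", "tie", "B>", "B>>"}
CRITERIA = ["diversity", "difficulty", "validity", "realism", "overall"]

def parse_judgment(raw: str) -> dict:
    """Parse one judge reply and check that every criterion carries a
    valid verdict label and that the reasoning field is non-empty."""
    data = json.loads(raw)
    if not data.get("reasoning"):
        raise ValueError("missing reasoning field")
    for criterion in CRITERIA:
        if data.get(criterion) not in VALID_VERDICTS:
            raise ValueError(f"bad verdict for {criterion}: {data.get(criterion)!r}")
    return data
```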

Results: generator model matters more than method

ELO rankings

ELO ratings — borrowed from chess rankings — give each dataset a single score based on its head-to-head wins and losses. Higher is better; 1000 is the starting point.

Figure: ELO ratings of the 9 synthetic eval datasets, computed with Bradley-Terry (a standard ranking model) and shown with bootstrap 95% confidence intervals, color-coded by generator model. Clear tier separation between GPT-5.4/Sonnet 4.6-generated datasets and DeepSeek V3.2-generated datasets.

Two clear tiers: GPT-5.4/Sonnet 4.6-generated datasets cluster at 1065–1143 ELO, while DeepSeek V3.2-generated datasets fall to 784–920.

The model you use to generate matters more than how you prompt it. The gap between the best and worst generator (233 ELO) is nearly double the gap between the best and worst method (121 ELO). If you're investing in eval dataset quality, upgrade your generator model first.
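For readers who want to reproduce such ratings, here is a minimal sketch of a Bradley-Terry fit (the standard MM iteration) mapped onto an ELO-like scale centered at 1000. The win counts are invented for illustration, not the study's data:

```python
import math

def bradley_terry(items, wins, iters=200):
    """Fit Bradley-Terry strengths p_i from pairwise win counts
    wins[(i, j)] = number of times i beat j, via the MM algorithm."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            w_i = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items if j != i
            )
            new_p[i] = w_i / denom if denom > 0 else p[i]
        # renormalize so the geometric mean of strengths stays at 1
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {i: v / g for i, v in new_p.items()}
    return p

def to_elo(p):
    """Map strengths to an ELO-like scale: 1000 at the mean,
    400 points per factor of 10 in strength (logistic convention)."""
    return {i: 1000 + 400 * math.log10(v) for i, v in p.items()}

# toy example: A beats B 3-1, B beats C 3-1, A beats C 4-0
wins = {("A", "B"): 3, ("B", "A"): 1,
        ("B", "C"): 3, ("C", "B"): 1,
        ("A", "C"): 4, ("C", "A"): 0}
ratings = to_elo(bradley_terry(["A", "B", "C"], wins))
```

The geometric-mean normalization pins the average rating at 1000, matching the "1000 is the starting point" convention above.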

Per-criterion breakdown

Each dataset appears in 8 matchups (against every other dataset), each judged by all 3 models — 24 judgments total. For each criterion, the judge's verdict maps to a score: strongly prefer (+2), slightly prefer (+1), tie (0), slightly prefer opponent (−1), strongly prefer opponent (−2). The per-criterion score is the mean across all 24 judgments, so +1.5 means judges consistently preferred that dataset on that dimension.
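That verdict-to-score mapping takes only a few lines. A sketch from one dataset's perspective (verdicts are assumed to be pre-oriented so "A" refers to that dataset):

```python
# score from dataset A's perspective; negate for dataset B
VERDICT_SCORE = {"A>>": 2, "A>": 1, "tie": 0, "B>": -1, "B>>": -2}

def criterion_score(verdicts):
    """Mean pairwise score for one dataset on one criterion, across
    all of its judgments on that criterion."""
    return sum(VERDICT_SCORE[v] for v in verdicts) / len(verdicts)

# e.g. strongly preferred in 2 of 4 judgments, tied in the rest
print(criterion_score(["A>>", "A>>", "tie", "tie"]))  # 1.0
```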

Breaking it down by criterion reveals where each approach actually wins and loses:

  • Custom prompting dominates realism — the dimension most tied to practical usefulness
  • DeepEval's one clear advantage is difficulty — it generates harder, more failure-provoking questions
  • Diversity is a wash — no method or model separates meaningfully
  • DeepSeek V3.2 is the validity bottleneck regardless of which method wraps it
Figure: Heatmap of average pairwise score per criterion (−2 to +2) across the 9 eval datasets. Green means judges preferred that dataset; rose means they preferred opponents. The Overall column is the judge's direct holistic judgment, not a mean of the four criteria. Rows sorted by ELO.

But can we trust the judges?

Every judge in this experiment also generated one of the datasets it's evaluating. If a model is biased toward its own output, its datasets get inflated scores.

Across all 108 comparisons, judges agree on the winner 73% of the time — but agreement doesn't rule out shared bias. The judge × generator matrix reveals the tilt:

| Judge \ Generator | GPT-5.4      | Sonnet 4.6   | DeepSeek V3.2 |
| ----------------- | ------------ | ------------ | ------------- |
| GPT-5.4           | +0.88 (self) | +0.04        | -0.92         |
| Sonnet 4.6        | +0.42        | +0.58 (self) | -1.00         |
| DeepSeek V3.2     | +0.33        | 0.00         | -0.33 (self)  |

GPT-5.4-as-judge shows the strongest self-preference: it rates its own model's datasets +0.88 on average but rates other models' datasets -0.44 — a gap of 1.31 points on the 5-point scale. Sonnet 4.6 shows a similar but smaller gap of 0.87. DeepSeek V3.2 doesn't show positive self-preference, but that's because its datasets are genuinely weaker — all judges agree on that.
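The gaps can be recomputed directly from the rounded matrix above. Note the GPT-5.4 gap comes out to 1.32 here because the table entries are rounded; the article's 1.31 reflects the unrounded data:

```python
# rounded judge x generator scores, copied from the table above
matrix = {
    "GPT-5.4":       {"GPT-5.4": 0.88, "Sonnet 4.6": 0.04, "DeepSeek V3.2": -0.92},
    "Sonnet 4.6":    {"GPT-5.4": 0.42, "Sonnet 4.6": 0.58, "DeepSeek V3.2": -1.00},
    "DeepSeek V3.2": {"GPT-5.4": 0.33, "Sonnet 4.6": 0.00, "DeepSeek V3.2": -0.33},
}

def self_preference_gap(judge):
    """Self score minus the mean score given to the other generators."""
    row = matrix[judge]
    others = [score for gen, score in row.items() if gen != judge]
    return row[judge] - sum(others) / len(others)

print(round(self_preference_gap("GPT-5.4"), 2))     # 1.32
print(round(self_preference_gap("Sonnet 4.6"), 2))  # 0.87
```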

What happens when we remove self-judgments?

If the bias is real, removing self-judgments should change the rankings. It does — but not as much as you'd expect.

Figure: ELO ratings after removing all comparisons where the judge shared a model with either dataset (63 of 108 removed). Faded bars show original ratings; GPT-5.4 and Sonnet 4.6 datasets drop while DeepSeek V3.2 datasets rise.

GPT-5.4 and Sonnet 4.6 datasets drop 55–64 ELO — their scores were partly inflated by self-preference. DeepSeek V3.2 datasets rise 26–83 ELO — penalized by lacking a friendly self-judge. The ranking order mostly holds, but the spread shrinks from 359 to 213 ELO. Roughly 40% of the apparent quality gap between models was judge bias, not actual quality difference.
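Removing self-judgments amounts to a simple filter over the comparison records. A sketch with hypothetical field names:

```python
def drop_self_judgments(comparisons):
    """Keep only comparisons where the judge's model differs from the
    generator model of BOTH datasets in the pair (fields hypothetical)."""
    return [c for c in comparisons
            if c["judge"] not in (c["gen_a"], c["gen_b"])]

# Per judge: of 36 pairs, only the C(6, 2) = 15 pairs among the other
# two models' 6 datasets survive -> 15 * 3 judges = 45 kept, 63 removed.
```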

Practical takeaways for building eval datasets

  1. Generator model > generation method. If you're building eval datasets, invest in the best generator model you can afford. The prompting strategy matters less — though method choice still shapes which quality dimension you optimize for (DeepEval produces harder questions, custom prompting produces more realistic ones).

  2. LLM judges have measurable self-preference. GPT-5.4-as-judge inflates GPT-5.4-generated content by ~1.3 points on a 5-point scale. Always use cross-model judges or remove self-judgments from analysis.

  3. A coding agent's one-shot prompt matched or outperformed both frameworks. No multi-step synthesis, no evolution loops — just a prompt generated by Claude Code. It scored highest on realism, the criterion most tied to practical utility.

Appendix: all 9 datasets with per-judge scores

Each card below contains the full set of 20 queries and a breakdown of how each judge scored the dataset across all pairwise matchups.

Legend: positive = preferred · negative = not preferred · (self) = judge evaluated its own model's output

custom_gpt-5.4 · ELO 1143.2 · realism +1.21

| Judge          | Overall | Diversity | Difficulty | Validity | Realism |
| -------------- | ------- | --------- | ---------- | -------- | ------- |
| GPT-5.4 (self) | +1.25   | +0.62     | +0.38      | +1.25    | +1.00   |
| Sonnet 4.6     | +0.88   | -0.12     | +0.62      | +0.62    | +1.25   |
| DeepSeek V3.2  | +0.38   | +0.38     | +0.00      | +0.38    | +1.38   |

Queries (20)

  1. What are the most useful metrics for evaluating a multi-turn LLM agent beyond simple task success rate, and how should I measure things like coherence, memory consistency, and recovery from mistakes across a conversation?
  2. How should I design a benchmark to compare human evaluation, rule-based automated checks, and LLM-as-judge scoring for the same agent workflow without introducing obvious bias?
  3. What is a good methodology for generating synthetic evaluation datasets for agent systems that still reflect realistic user behavior, tool failures, and ambiguous instructions?
  4. How do I evaluate whether a function-calling agent is choosing the right tool, passing the correct arguments, and deciding when not to call a tool at all?
  5. When evaluating a RAG-based assistant, which metrics should I use for retrieval quality, faithfulness, and answer relevance, and how do I interpret tradeoffs between them?
  6. How can I set up regression tests for an LLM application in CI/CD so that prompt changes, model upgrades, or retrieval tweaks do not silently degrade quality?
  7. What are the practical differences between promptfoo, DeepEval, Ragas, Inspect AI, and LangSmith for evaluating agentic workflows, and how should I choose among them for a production team?
  8. How should I measure cost and latency for a production multi-turn agent when the workflow includes retries, tool calls, retrieval, and fallback models rather than just a single model response?
  9. What is the best way to run pairwise comparison experiments between two agent versions so that the ranking is statistically meaningful and not overly sensitive to prompt wording or judge variance?
  10. How can I build adversarial and edge-case test sets for LLM agents that specifically target long-horizon planning errors, prompt injection, tool misuse, and context-window failures?
  11. What are good best practices for evaluating safety and alignment in agent systems, including red-teaming, jailbreak resistance, harmful tool invocation, and policy compliance over multiple turns?
  12. How do I evaluate a multi-agent system where several specialized agents collaborate, and what metrics capture coordination quality, unnecessary handoffs, and final outcome correctness?
  13. When using LLM-as-judge for open-ended outputs, how can I calibrate or validate the judge so that evaluation results correlate with human preferences and remain stable across model updates?
  14. What is a sound experimental design for comparing multiple foundation models inside the same agent architecture without confounding model quality with prompt tuning, context length, or tool schemas?
  15. How should I score partial success in complex agent tasks where the system completes some subgoals correctly but makes a critical mistake late in the interaction?
  16. What techniques are available for detecting bias or unfair behavior in conversational agents, especially when harmful patterns only emerge across multiple turns or through tool-mediated decisions?
  17. How can I evaluate whether an agent is over-relying on retrieval or tools when a direct answer would be sufficient, and is there a standard efficiency metric for unnecessary actions?
  18. What should a robust evaluation suite for a customer-support agent include if I want to measure resolution quality, policy adherence, escalation behavior, user satisfaction, and operational cost together?
  19. How do I create reusable evaluation cases for agents that involve nondeterministic outputs, so tests are strict enough to catch regressions but flexible enough to avoid false failures?
  20. What ranking and aggregation methods work best when comparing several agent variants across heterogeneous metrics like accuracy, latency, safety, faithfulness, and token cost?

Scores are average pairwise comparison scores across all 8 matchups per judge. Bars scale from -2 to +2.

The "custom" prompt that beat the frameworks

Since the whole point of this post is that a plain prompt works as well as a framework, here's the exact prompt we used — generated by Claude Code in one shot. No multi-step synthesis, no evolution loops, no schema validation.

```
Generate exactly 20 diverse questions that a developer or researcher would ask
about evaluating LLM-based agents and multi-turn AI systems. These are questions
about evaluation methodology, metrics, frameworks, and best practices — NOT
generic trivia or capability tests.

Cover these topic areas:
- Evaluation metrics for multi-turn agents (task completion, coherence, tool use accuracy, etc.)
- Benchmarking methodologies (human eval vs automated, LLM-as-judge, pairwise comparison)
- Dataset generation for agent evaluation (synthetic data, adversarial testing, edge cases)
- Framework comparison (promptfoo, deepeval, ragas, inspect-ai, langsmith, etc.)
- Safety & alignment evaluation (red-teaming, guardrail testing, bias detection)
- Cost & latency measurement for production agent deployments
- Evaluating tool-use and function-calling agents
- Evaluating RAG pipelines (retrieval quality, faithfulness, answer relevance)
- Regression testing and CI/CD for LLM applications
- Multi-model comparison and ranking strategies

Requirements:
- Mix of difficulty levels (beginner, intermediate, advanced)
- Each question should be self-contained
- Questions should sound natural, like real developer queries
- Cover both practical "how-to" questions and deeper methodological questions

Return ONLY a JSON array of strings, no other text.
```

That's it. This prompt, fed to GPT-5.4, produced the highest-ranked dataset in the experiment. The same prompt fed to Sonnet 4.6 produced the second-highest. The model did the work — the prompt just pointed it in the right direction.

The DeepEval Synthesizer setup

DeepEval's Synthesizer generates test cases from a StylingConfig that describes the scenario, task, and expected I/O format. Under the hood it runs multi-step synthesis with an evolution loop to diversify the outputs.

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

STYLING = StylingConfig(
    scenario="A developer or researcher asking questions about how to evaluate "
    "LLM-based agents and multi-turn AI systems — covering evaluation metrics, "
    "benchmarking methodologies, framework comparisons (promptfoo, deepeval, ragas), "
    "LLM-as-judge approaches, RAG pipeline evaluation, safety/alignment testing, "
    "tool-use agent evaluation, cost/latency measurement, and CI/CD for LLM apps.",
    task="Generate diverse questions about LLM agent evaluation methodology, "
    "metrics, frameworks, and best practices. Cover topics like multi-turn agent "
    "metrics, synthetic dataset generation, red-teaming, regression testing, "
    "multi-model ranking, and production monitoring. Mix beginner and advanced levels.",
    input_format="A natural-language question from a developer or researcher about "
    "LLM/agent evaluation, sounding like a real query on a forum or Slack channel.",
    expected_output_format="A detailed, practical answer about LLM agent evaluation "
    "methodology, with concrete recommendations and trade-offs.",
)

# `llm` is the generator model wrapper, configured elsewhere
synthesizer = Synthesizer(model=llm, styling_config=STYLING)
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=20)
```

The promptfoo-style prompt

The promptfoo method uses a simpler structured prompt that asks for a JSON array directly — similar to how promptfoo's built-in dataset generation works.

```
Generate exactly 20 diverse questions that a developer or researcher would ask
about evaluating LLM-based agents and multi-turn AI systems. These are questions
about evaluation methodology, metrics, frameworks, and best practices — NOT
generic trivia or capability tests.

Return ONLY a JSON array of question strings. No other text.
Example: ["Question 1?", "Question 2?", ...]
```

The judge prompt

Each pairwise comparison uses the following prompt. The judge sees all 20 queries from both datasets, scores 4 criteria plus an overall judgment, and returns structured JSON with reasoning. Position bias is mitigated by deterministically swapping which dataset appears as "A" vs "B" based on a SHA-256 hash of the pair names.

```
You are an expert evaluator comparing two evaluation query datasets for testing
LLM agent systems on evaluation methodology topics.

**Dataset A** (20 queries):
{dataset_a_queries}

**Dataset B** (20 queries):
{dataset_b_queries}

Compare these two datasets as evaluation benchmarks. Score each criterion from
the perspective of: "which dataset would be more effective for evaluating an
LLM agent's knowledge of evaluation methodology?"

Criteria:
1. **Diversity** — Topic spread, lexical variety, structural variety
2. **Difficulty** — Challenges the model, exposes failure modes and edge cases
3. **Validity** — Well-formed, unambiguous, actually testable by an LLM
4. **Realism** — Reflects real-world scenarios practitioners would encounter

Rate each criterion on a 5-point scale:
- **A>>** — Dataset A is significantly better
- **A>** — Dataset A is slightly better
- **tie** — No meaningful difference
- **B>** — Dataset B is slightly better
- **B>>** — Dataset B is significantly better

Return your evaluation as JSON:
{
  "reasoning": "<detailed explanation>",
  "diversity": "<A>>|A>|tie|B>|B>>",
  "difficulty": "<A>>|A>|tie|B>|B>>",
  "validity": "<A>>|A>|tie|B>|B>>",
  "realism": "<A>>|A>|tie|B>|B>>",
  "overall": "<A>>|A>|tie|B>|B>>"
}
```
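The deterministic order swap can be sketched with a few lines of hashing. The exact key format and parity rule below are assumptions, not the study's published code:

```python
import hashlib

def present_order(name_x, name_y):
    """Decide which dataset appears as A vs B based on the SHA-256 hash
    of the sorted pair names, so the assignment is stable regardless of
    argument order but roughly balanced across pairs (parity rule and
    key format are assumed details)."""
    first, second = sorted([name_x, name_y])
    digest = hashlib.sha256(f"{first}|{second}".encode()).digest()
    if digest[0] % 2:  # odd first hash byte: swap the presentation order
        first, second = second, first
    return first, second
```

Because the hash key is the sorted pair, the function returns the same (A, B) assignment no matter which order the caller passes the names in.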

Note that the overall field is a direct holistic judgment by the LLM — it's not computed from the four criteria. This matters: when we analyze self-preference bias, the overall score reveals how judges weigh trade-offs differently, not just how they rate individual dimensions.