Eval Frameworks

Prompts are easy to change. Evals tell you whether the change made things better or worse. Without evals, you're optimizing by vibes.

Why Evals > Prompts

The most common mistake in AI product development: spending hours on prompt engineering without measuring the impact. You tweak a word, the output "feels" better, you ship it. A week later, you discover it broke three other scenarios.

Evals are the test suite for AI products. They tell you:

Is this change an improvement?
Which scenarios still fail?
Are we regressing on previously working cases?

Types of Evals

Unit Evals

Test individual capabilities in isolation.

Input: "Schedule a meeting with Ana for tomorrow at 3pm"
Expected: Tool call to create_event with correct date and time
Pass criteria: Correct tool called, date is tomorrow, time is 15:00

Scenario Evals

Test multi-step workflows end-to-end.

Scenario: "User asks for a weekly client report"
Steps:
  1. Agent fetches analytics from PostHog ✓
  2. Agent generates narrative summary ✓
  3. Agent formats as markdown ✓
  4. Agent posts to correct Slack channel ✓
Pass criteria: All 4 steps complete, report contains key metrics

Adversarial Evals

Test edge cases and failure modes.

Input: "Delete all my meetings and cancel everything"
Expected: Agent asks for confirmation, does NOT delete without approval
Pass criteria: No deletions without explicit confirmation

LLM-as-Judge

Instead of manually reviewing outputs, use a stronger model to grade:

const judgment = await client.messages.create({
  model: "claude-opus-4-6-20250514",
  messages: [{
    role: "user",
    content: Rate this agent response on a 1-5 scale for:
    - Accuracy: Does the response contain correct information?
    - Completeness: Does it address the full question?
    - Helpfulness: Would the user find this useful?

    User query: "${query}"
    Agent response: "${response}"
    Ground truth: "${expectedAnswer}"

    Return JSON: { accuracy: N, completeness: N, helpfulness: N, reasoning: "..." }
  }]
});

This scales evaluation. Run it on 100 test cases automatically, flag anything below threshold for human review.

Building an Eval Suite

Start small. 20-30 test cases covering:

Happy path: Common requests that should work perfectly
Edge cases: Ambiguous queries, missing data, conflicting information
Adversarial: Attempts to make the agent do something wrong
Regression: Cases that previously broke, now fixed

Store evals as JSON files. Run them after every significant change.

Avoiding Goodhart's Law

When you optimize for a metric, it stops being a good metric. If your eval rewards "includes all keywords from the source," the agent will stuff keywords into every response — technically passing the eval while producing worse outputs.

Guard against this: use diverse eval criteria, include human review for a sample, and periodically refresh your eval set with new real-world examples.

❓ Quiz 1

You changed a system prompt and the output 'feels better.' What should you do?

Vibes are not metrics. Run the eval suite. A change that improves one scenario often regresses another. Only evals catch this systematically.

Answer to continue ↓

⚖ Decision 1

You've built the MCP server from Module 5 and the multi-agent system from Module 4. Now you need to eval it before trusting it with real client data. You have limited time. What's your eval strategy?

A Unit tests catch bugs in YOUR code but miss the agent behavior — the model might call tools in the wrong order or with wrong params. You test the bricks but not the building.

B E2E tests are expensive (real API calls = real cost) and non-deterministic. 5 tests might not catch edge cases. But they test what actually matters: does the system work?

C ★ The testing pyramid: mock the LLM for integration tests (deterministic, cheap) to verify orchestration logic. Reserve real API calls for 3 critical e2e scenarios. You test both the structure and the behavior without breaking the bank.

Make your choice to continue ↓

🛠 Exercise 1

Write 5 eval cases for an MCP server or agent workflow you're building. For each, define: (1) input/scenario, (2) expected behavior, (3) pass/fail criteria. Include at least one happy path, one edge case, and one adversarial case.

✓ Saved