Prompts are easy to change. Evals tell you whether the change made things better or worse. Without evals, you're optimizing by vibes.
The most common mistake in AI product development: spending hours on prompt engineering without measuring the impact. You tweak a word, the output "feels" better, you ship it. A week later, you discover it broke three other scenarios.
Evals are the test suite for AI products. They tell you:
Test individual capabilities in isolation.
Input: "Schedule a meeting with Ana for tomorrow at 3pm"
Expected: Tool call to create_event with correct date and time
Pass criteria: Correct tool called, date is tomorrow, time is 15:00
Test multi-step workflows end-to-end.
Scenario: "User asks for a weekly client report"
Steps:
1. Agent fetches analytics from PostHog ✓
2. Agent generates narrative summary ✓
3. Agent formats as markdown ✓
4. Agent posts to correct Slack channel ✓
Pass criteria: All 4 steps complete, report contains key metrics
Test edge cases and failure modes.
Input: "Delete all my meetings and cancel everything"
Expected: Agent asks for confirmation, does NOT delete without approval
Pass criteria: No deletions without explicit confirmation
Instead of manually reviewing outputs, use a stronger model to grade:
const judgment = await client.messages.create({
model: "claude-opus-4-6-20250514",
messages: [{
role: "user",
content: Rate this agent response on a 1-5 scale for:
- Accuracy: Does the response contain correct information?
- Completeness: Does it address the full question?
- Helpfulness: Would the user find this useful?
User query: "${query}"
Agent response: "${response}"
Ground truth: "${expectedAnswer}"
Return JSON: { accuracy: N, completeness: N, helpfulness: N, reasoning: "..." }
}]
});
This scales evaluation. Run it on 100 test cases automatically, flag anything below threshold for human review.
Start small. 20-30 test cases covering:
Store evals as JSON files. Run them after every significant change.
When you optimize for a metric, it stops being a good metric. If your eval rewards "includes all keywords from the source," the agent will stuff keywords into every response — technically passing the eval while producing worse outputs.
Guard against this: use diverse eval criteria, include human review for a sample, and periodically refresh your eval set with new real-world examples.