AI products are non-deterministic. The same input can produce different outputs. This breaks traditional testing assumptions — but doesn't make testing impossible. It requires a different approach.
The AI Testing Pyramid
             /\
            /  \      E2E: Full agent workflows
           /    \     (real API calls, expensive)
          /──────\
         /        \   Integration: Tool chains
        /          \  (mock LLM, real tools)
       /────────────\
      /              \    Unit: Individual functions
     /                \   (no LLM calls, fast)
    /──────────────────\
Unit Tests (bottom layer)
Test individual functions without any LLM calls:
Does the chunking function split correctly?
Does the tool schema validate properly?
Does the progress tracker save and load correctly?
These are fast, deterministic, and cheap. Run hundreds on every commit.
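As a minimal sketch of what a unit test at this layer looks like, here is a test for a chunking function. `chunk_text` is an illustrative stand-in for your own implementation, not a real library call:

```python
# Hypothetical chunking function, stubbed here so the test is self-contained.
def chunk_text(text: str, max_len: int) -> list[str]:
    """Split text into chunks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def test_chunking_splits_correctly():
    chunks = chunk_text("a" * 250, max_len=100)
    assert len(chunks) == 3                      # 100 + 100 + 50
    assert all(len(c) <= 100 for c in chunks)    # no chunk exceeds the limit
    assert "".join(chunks) == "a" * 250          # nothing lost or duplicated

test_chunking_splits_correctly()
```

No model, no network, no cost: this is why you can run hundreds of these on every commit.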
Integration Tests (middle layer)
Test chains of tool calls with a mocked LLM:
Given a mock tool_use response, does the handler execute correctly?
Does the orchestrator pass the right context between agents?
Does error handling recover properly?
Mock the LLM responses to test the orchestration logic. This isolates your code from model non-determinism.
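A sketch of the pattern, assuming a hypothetical `handle_tool_use` orchestrator and a response shape loosely modeled on a tool-use API — the client, response fields, and handler are all assumptions to adapt to your own code:

```python
from unittest.mock import MagicMock

def handle_tool_use(client, registry, user_message):
    """Ask the (mocked) LLM which tool to call, then execute it.

    Hypothetical orchestration logic; the response shape is an assumption.
    """
    response = client.create(messages=[{"role": "user", "content": user_message}])
    block = response.content[0]
    if block["type"] == "tool_use":
        return registry[block["name"]](**block["input"])
    return block.get("text", "")

def test_handler_executes_tool():
    # The LLM is fully mocked: we script its tool_use decision.
    client = MagicMock()
    client.create.return_value = MagicMock(
        content=[{"type": "tool_use", "name": "get_weather", "input": {"city": "Oslo"}}]
    )
    registry = {"get_weather": lambda city: f"Sunny in {city}"}
    assert handle_tool_use(client, registry, "Weather in Oslo?") == "Sunny in Oslo"

test_handler_executes_tool()
```

Because the mock scripts the model's decision, the test exercises only your dispatch logic and passes or fails deterministically.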
E2E Tests (top layer)
Test full agent workflows with real API calls:
"Prepare a meeting brief" produces a reasonable output
"Generate a client report" includes actual data and correct formatting
These are expensive (real API calls = real cost) and non-deterministic. Run them sparingly — nightly, before releases, after major changes.
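One way to keep E2E tests out of the per-commit path is to gate them behind an environment variable. A sketch, where `run_agent` is a hypothetical entry point and the assertions illustrate checking structure over wording (with pytest you would use `@pytest.mark.skipif` instead of the early return):

```python
import os

def e2e_enabled() -> bool:
    # Only scheduled jobs (nightly, pre-release) set this variable.
    return os.environ.get("RUN_E2E_TESTS") == "1"

def test_meeting_brief_is_reasonable():
    if not e2e_enabled():
        return  # treated as skipped outside scheduled runs
    from my_agent import run_agent  # hypothetical agent entry point
    output = run_agent("Prepare a meeting brief for the Acme quarterly review")
    assert "Acme" in output    # key concept appears
    assert len(output) > 200   # non-trivial output, not an exact match
```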
Handling Non-Determinism
The same prompt can produce different outputs. Strategies:
Structural assertions: Check that the output is valid JSON, has the required fields, and calls the right tools, not that the text matches exactly.
Fuzzy matching: Check that key concepts appear, not exact wording. "The report mentions revenue" rather than "The report says revenue was $X."
Statistical testing: Run the same test 5 times. If it passes 4/5 times, the test passes. This handles random variation while catching systematic failures.
Snapshot testing: Record a "known good" output, then flag when the new output deviates significantly. Not for exact matching — for detecting unexpected changes.
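The first three strategies can be sketched together. The report fields and the 4-of-5 threshold below are illustrative choices, not a fixed recipe:

```python
import json

def assert_valid_report(raw: str):
    """Structural + fuzzy check: valid JSON, required fields, key concept."""
    report = json.loads(raw)                       # must be valid JSON
    for field in ("client", "period", "summary"):  # required fields (illustrative)
        assert field in report, f"missing {field}"
    assert "revenue" in report["summary"].lower()  # key concept, not exact text

def passes_statistically(test_fn, runs=5, required=4):
    """Run a non-deterministic test several times; pass if it succeeds often enough."""
    successes = 0
    for _ in range(runs):
        try:
            test_fn()
            successes += 1
        except AssertionError:
            pass
    return successes >= required
```

In practice `test_fn` would regenerate the output on each run, so random variation is tolerated while a systematic failure (0/5 or 1/5 passes) still fails the suite.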
Cost-Conscious Testing
Every real API call costs money. Budget your testing:
Unit tests: free (no API calls)
Integration tests: nearly free (mock LLM)
E2E tests: allocate a monthly budget
For development, use cheaper models (Haiku) for initial testing, then validate with the production model (Sonnet/Opus) before shipping.
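A minimal sketch of tier switching; the model names and environment variable are placeholders, not real model IDs:

```python
import os

MODEL_TIERS = {
    "dev": "claude-haiku-model-id",      # cheap, fast iteration (placeholder ID)
    "release": "claude-sonnet-model-id", # validate before shipping (placeholder ID)
}

def pick_model() -> str:
    """Choose the model tier from the environment, defaulting to dev."""
    return MODEL_TIERS[os.environ.get("TEST_TIER", "dev")]
```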
What to Test in Your Agent
For any agent or MCP server you build, test:
Tool selection: Given a query, does the agent pick the right tool?
Parameter extraction: Are tool parameters correct?
Error recovery: When a tool fails, does the agent handle it?
Output quality: Is the final response useful and accurate?
Safety: Does the agent refuse unsafe actions?
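Tool selection lends itself to a checklist-style test. A sketch, where `choose_tool` stands in for whatever in your agent maps a query to a tool name — here it is stubbed with simple keyword routing so the example runs on its own:

```python
# Hypothetical stub: in a real agent this would be the (possibly mocked)
# model's tool choice, not keyword matching.
def choose_tool(query: str) -> str:
    q = query.lower()
    if "weather" in q:
        return "get_weather"
    if "report" in q:
        return "generate_report"
    return "search"

CASES = [
    ("What's the weather in Oslo?", "get_weather"),
    ("Generate the weekly client report", "generate_report"),
    ("Who won the 1998 World Cup?", "search"),
]

def test_tool_selection():
    for query, expected in CASES:
        assert choose_tool(query) == expected, query

test_tool_selection()
```

The same table-of-cases shape works for parameter extraction and safety refusals: add cases, keep one loop.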
❓ Quiz 1
Why should you mock the LLM for integration tests?
The main reason is determinism. Mocking the LLM lets you test YOUR code — the orchestration, error handling, state management — without model randomness causing flaky tests. Cost savings are a bonus.
Answer to continue ↓
Review
Time to consolidate what you learned.
🎮 Test Pyramid Sorter
Click each test to move it between layers. Place all items, then click Check.
Full report generation with real API
Mock LLM + handler executes tool
Error recovery retries failed tool
Meeting prep produces Obsidian file
Orchestrator passes context between agents
Date parser handles edge cases
Tool schema validates
Chunking function splits correctly
Client workflow: query to Slack
Unit
Integration
E2E
Complete to continue ↓
🛠 Exercise 1
Write a test plan for an agent that generates weekly client reports. Cover all three layers: 2 unit tests (specific functions), 2 integration tests (tool chains with mocked LLM), and 1 E2E test (full workflow). Specify what you mock and what's real in each.