AI products are non-deterministic. The same input can produce different outputs. This breaks traditional testing assumptions — but doesn't make testing impossible. It requires a different approach.
The AI Testing Pyramid
             /\
            /  \      E2E: Full agent workflows
           /    \     (real API calls, expensive)
          /──────\
         /        \   Integration: Tool chains
        /          \  (mock LLM, real tools)
       /────────────\
      /              \    Unit: Individual functions
     /                \   (no LLM calls, fast)
    /──────────────────\
Unit Tests (bottom layer)
Test individual functions without any LLM calls:
Does the chunking function split correctly?
Does the tool schema validate properly?
Does the progress tracker save and load correctly?
These are fast, deterministic, and cheap. Run hundreds on every commit.
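As a minimal sketch of what a unit test at this layer looks like, here is a test for a chunking function. `chunk_text` is an illustrative stand-in for your own implementation, not a real library call:

```python
# Hypothetical chunking function, stubbed here so the test is self-contained.
def chunk_text(text: str, max_len: int) -> list[str]:
    """Split text into chunks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def test_chunking_splits_correctly():
    chunks = chunk_text("a" * 250, max_len=100)
    assert len(chunks) == 3                      # 100 + 100 + 50
    assert all(len(c) <= 100 for c in chunks)    # no chunk exceeds the limit
    assert "".join(chunks) == "a" * 250          # nothing lost or duplicated

test_chunking_splits_correctly()
```

No model, no network, no cost: this is why you can run hundreds of these on every commit.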
Integration Tests (middle layer)
Test chains of tool calls with a mocked LLM:
Given a mock tool_use response, does the handler execute correctly?
Does the orchestrator pass the right context between agents?
Does error handling recover properly?
Mock the LLM responses to test the orchestration logic. This isolates your code from model non-determinism.
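A sketch of the pattern, assuming a hypothetical `handle_tool_use` orchestrator and a response shape loosely modeled on a tool-use API — the client, response fields, and handler are all assumptions to adapt to your own code:

```python
from unittest.mock import MagicMock

def handle_tool_use(client, registry, user_message):
    """Ask the (mocked) LLM which tool to call, then execute it.

    Hypothetical orchestration logic; the response shape is an assumption.
    """
    response = client.create(messages=[{"role": "user", "content": user_message}])
    block = response.content[0]
    if block["type"] == "tool_use":
        return registry[block["name"]](**block["input"])
    return block.get("text", "")

def test_handler_executes_tool():
    # The LLM is fully mocked: we script its tool_use decision.
    client = MagicMock()
    client.create.return_value = MagicMock(
        content=[{"type": "tool_use", "name": "get_weather", "input": {"city": "Oslo"}}]
    )
    registry = {"get_weather": lambda city: f"Sunny in {city}"}
    assert handle_tool_use(client, registry, "Weather in Oslo?") == "Sunny in Oslo"

test_handler_executes_tool()
```

Because the mock scripts the model's decision, the test exercises only your dispatch logic and passes or fails deterministically.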
E2E Tests (top layer)
Test full agent workflows with real API calls:
"Prepare a meeting brief" produces a reasonable output
"Generate a client report" includes actual data and correct formatting
These are expensive (real API calls = real cost) and non-deterministic. Run them sparingly — nightly, before releases, after major changes.
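One way to keep E2E tests out of the per-commit path is to gate them behind an environment variable. A sketch, where `run_agent` is a hypothetical entry point and the assertions illustrate checking structure over wording (with pytest you would use `@pytest.mark.skipif` instead of the early return):

```python
import os

def e2e_enabled() -> bool:
    # Only scheduled jobs (nightly, pre-release) set this variable.
    return os.environ.get("RUN_E2E_TESTS") == "1"

def test_meeting_brief_is_reasonable():
    if not e2e_enabled():
        return  # treated as skipped outside scheduled runs
    from my_agent import run_agent  # hypothetical agent entry point
    output = run_agent("Prepare a meeting brief for the Acme quarterly review")
    assert "Acme" in output    # key concept appears
    assert len(output) > 200   # non-trivial output, not an exact match
```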
Handling Non-Determinism
The same prompt can produce different outputs. Strategies:
Structural assertions: Check that the output is valid JSON, has the required fields, and calls the right tools, not that the text matches exactly.
Fuzzy matching: Check that key concepts appear, not exact wording. "The report mentions revenue" rather than "The report says revenue was $X."
Statistical testing: Run the same test 5 times. If it passes 4/5 times, the test passes. This handles random variation while catching systematic failures.
Snapshot testing: Record a "known good" output, then flag when the new output deviates significantly. Not for exact matching — for detecting unexpected changes.
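The first three strategies can be sketched together. The report fields and the 4-of-5 threshold below are illustrative choices, not a fixed recipe:

```python
import json

def assert_valid_report(raw: str):
    """Structural + fuzzy check: valid JSON, required fields, key concept."""
    report = json.loads(raw)                       # must be valid JSON
    for field in ("client", "period", "summary"):  # required fields (illustrative)
        assert field in report, f"missing {field}"
    assert "revenue" in report["summary"].lower()  # key concept, not exact text

def passes_statistically(test_fn, runs=5, required=4):
    """Run a non-deterministic test several times; pass if it succeeds often enough."""
    successes = 0
    for _ in range(runs):
        try:
            test_fn()
            successes += 1
        except AssertionError:
            pass
    return successes >= required
```

In practice `test_fn` would regenerate the output on each run, so random variation is tolerated while a systematic failure (0/5 or 1/5 passes) still fails the suite.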
Cost-Conscious Testing
Every real API call costs money. Budget your testing:
Unit tests: free (no API calls)
Integration tests: nearly free (mock LLM)
E2E tests: allocate a monthly budget
For development, use cheaper models (Haiku) for initial testing, then validate with the production model (Sonnet/Opus) before shipping.
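A minimal sketch of tier switching; the model names and environment variable are placeholders, not real model IDs:

```python
import os

MODEL_TIERS = {
    "dev": "claude-haiku-model-id",      # cheap, fast iteration (placeholder ID)
    "release": "claude-sonnet-model-id", # validate before shipping (placeholder ID)
}

def pick_model() -> str:
    """Choose the model tier from the environment, defaulting to dev."""
    return MODEL_TIERS[os.environ.get("TEST_TIER", "dev")]
```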
What to Test in Your Agent
For any agent or MCP server you build, test:
Tool selection: Given a query, does the agent pick the right tool?
Parameter extraction: Are tool parameters correct?
Error recovery: When a tool fails, does the agent handle it?
Output quality: Is the final response useful and accurate?
Safety: Does the agent refuse unsafe actions?
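Tool selection lends itself to a checklist-style test. A sketch, where `choose_tool` stands in for whatever in your agent maps a query to a tool name — here it is stubbed with simple keyword routing so the example runs on its own:

```python
# Hypothetical stub: in a real agent this would be the (possibly mocked)
# model's tool choice, not keyword matching.
def choose_tool(query: str) -> str:
    q = query.lower()
    if "weather" in q:
        return "get_weather"
    if "report" in q:
        return "generate_report"
    return "search"

CASES = [
    ("What's the weather in Oslo?", "get_weather"),
    ("Generate the weekly client report", "generate_report"),
    ("Who won the 1998 World Cup?", "search"),
]

def test_tool_selection():
    for query, expected in CASES:
        assert choose_tool(query) == expected, query

test_tool_selection()
```

The same table-of-cases shape works for parameter extraction and safety refusals: add cases, keep one loop.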
❓ Quiz 1
Why should you mock the LLM for integration tests?
The main reason is determinism. Mocking the LLM lets you test YOUR code — the orchestration, error handling, state management — without model randomness causing flaky tests. Cost savings are a bonus.
Answer to continue ↓
Review
Time to consolidate what you learned.
🎮 Test Pyramid Sorter
Click each test to move it between layers. Place all items, then click Check.
Full report generation with real API
Mock LLM + handler executes tool
Error recovery retries failed tool
Meeting prep produces Obsidian file
Orchestrator passes context between agents
Date parser handles edge cases
Tool schema validates
Chunking function splits correctly
Client workflow: query to Slack
Unit
Integration
E2E
Complete to continue ↓
🛠 Exercise 1
Write a test plan for an agent that generates weekly client reports. Cover all three layers: 2 unit tests (specific functions), 2 integration tests (tool chains with mocked LLM), and 1 E2E test (full workflow). Specify what you mock and what's real in each.