
Monitoring and Observability

You shipped your agent. Now what? Without monitoring, you won't know when it breaks, when costs spike, or when quality degrades. Observability for AI products differs from observability for traditional software.

What to Track

Latency

  • Time per tool call: Is any tool consistently slow?
  • Total task time: How long does the full agent workflow take?
  • Time to first token: For streaming responses, how quickly does the user see output?

Cost

  • Tokens per request: Input + output tokens. Spikes indicate prompt bloat or runaway loops.
  • Cost per task type: "Generate report" costs $X, "Search meetings" costs $Y. Budget accordingly.
  • Daily/weekly totals: Catch unexpected spikes before they become expensive.
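Cost per request follows directly from token counts. A minimal sketch; the per-million-token prices below are placeholders, so substitute your model's actual pricing:

```python
# Assumed prices -- replace with your model's real rates.
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (placeholder)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A sudden jump in this number is the "prompt bloat or runaway loop" signal.
print(round(request_cost(12_000, 2_000), 4))  # 0.066
```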

Quality

  • Task completion rate: What percentage of agent runs complete successfully?
  • Error rate by tool: Which tools fail most? Network errors, bad inputs, rate limits?
  • User satisfaction: Did the user accept the output, edit it, or redo it from scratch?

Safety

  • Refused actions: How often does the agent correctly refuse unsafe requests?
  • Escalation rate: How often does the agent need human intervention?
  • Out-of-scope requests: How often do users ask for things the agent can't do?
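The quality metrics above fall out of raw run and tool-call records. A minimal aggregation sketch; the field names are illustrative, not a standard schema:

```python
from collections import Counter

def summarize(runs, tool_calls):
    """Compute completion rate and per-tool error rate.

    runs: list of dicts like {"status": "completed" | "failed"}
    tool_calls: list of dicts like {"tool": str, "success": bool}
    """
    completion_rate = sum(r["status"] == "completed" for r in runs) / len(runs)
    errors, totals = Counter(), Counter()
    for call in tool_calls:
        totals[call["tool"]] += 1
        if not call["success"]:
            errors[call["tool"]] += 1
    error_rate_by_tool = {t: errors[t] / totals[t] for t in totals}
    return completion_rate, error_rate_by_tool
```

The same records feed escalation rate and refusal rate; only the grouping key changes.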

Tracing Agent Execution

Traditional logging shows you isolated log lines. Agent tracing shows you the reasoning chain:

Trace: generate_weekly_report
├── [0ms] Start: user request received
├── [120ms] Tool: fetch_posthog_analytics → success (3 metrics)
├── [450ms] Tool: search_meetings → success (5 meetings found)
├── [800ms] Thinking: analyzing data patterns...
├── [1200ms] Tool: create_notion_page → success
├── [1500ms] Tool: send_slack_message → failed (channel not found)
├── [1600ms] Recovery: retrying with corrected channel ID
├── [1800ms] Tool: send_slack_message → success
└── [1900ms] Complete: report delivered

Each step records a timestamp, the tool called, an input/output summary, and a success/failure status. This lets you debug agent failures by replaying the decision chain.
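A trace like the one above can be produced by a small recorder. A sketch, not a production tracer; the class and method names are illustrative:

```python
import time

class Trace:
    """Minimal agent-execution trace: one timestamped entry per step."""

    def __init__(self, name):
        self.name = name
        self.start = time.monotonic()
        self.steps = []

    def record(self, kind, detail, success=True):
        elapsed_ms = int((time.monotonic() - self.start) * 1000)
        self.steps.append({"ms": elapsed_ms, "kind": kind,
                           "detail": detail, "success": success})

    def render(self):
        lines = [f"Trace: {self.name}"]
        for s in self.steps:
            status = "" if s["success"] else " (failed)"
            lines.append(f"├── [{s['ms']}ms] {s['kind']}: {s['detail']}{status}")
        return "\n".join(lines)

trace = Trace("generate_weekly_report")
trace.record("Tool", "search_meetings → 5 meetings found")
trace.record("Tool", "send_slack_message", success=False)
```

Calling `record` around every tool invocation is enough to replay a failed run step by step.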

PostHog for AI Products

You already use PostHog. For AI products, track custom events:

  • agent_task_started — with task_type, model, estimated_complexity
  • agent_tool_called — with tool_name, latency_ms, success
  • agent_task_completed — with total_tokens, total_cost, quality_score
  • agent_error — with error_type, tool_name, recovery_attempted
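In production these events would go through the PostHog SDK (its `capture` call takes a distinct ID, an event name, and a properties dict). The sketch below uses an in-memory list in place of the SDK so the event shapes are visible; all property values are illustrative:

```python
events = []

def capture(distinct_id, event, properties):
    # Stand-in for the real PostHog client; appends instead of sending.
    events.append({"distinct_id": distinct_id,
                   "event": event, "properties": properties})

capture("user_42", "agent_task_started",
        {"task_type": "generate_report", "model": "claude",
         "estimated_complexity": "medium"})
capture("user_42", "agent_tool_called",
        {"tool_name": "search_meetings", "latency_ms": 330, "success": True})
capture("user_42", "agent_task_completed",
        {"total_tokens": 14_000, "total_cost": 0.066, "quality_score": 0.9})
```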

Build dashboards that answer:

  • What are the most common task types?
  • What's the average cost per task type?
  • Which tools fail most frequently?
  • What's the trend in task completion rate?

Alerting

Set alerts for:

  • Cost spike: Daily cost > 2x average → alert
  • Quality drop: Completion rate < 90% → alert
  • Error rate: Any tool > 10% error rate → alert
  • Latency: Average task time > 2x baseline → alert
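The four conditions above reduce to a handful of comparisons against a baseline. A sketch, assuming metrics arrive as plain dicts with these (illustrative) keys:

```python
def check_alerts(metrics, baseline):
    """Return the names of alert conditions that fire."""
    alerts = []
    if metrics["daily_cost"] > 2 * baseline["daily_cost"]:
        alerts.append("cost_spike")
    if metrics["completion_rate"] < 0.90:
        alerts.append("quality_drop")
    for tool, rate in metrics["tool_error_rates"].items():
        if rate > 0.10:
            alerts.append(f"error_rate:{tool}")
    if metrics["avg_task_time_s"] > 2 * baseline["avg_task_time_s"]:
        alerts.append("latency")
    return alerts
```

Run it on a schedule (hourly or daily) and route any non-empty result to your alerting channel.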

The Feedback Loop

The most valuable monitoring signal is user behavior:

  • User accepts agent output → positive signal
  • User edits agent output → partial success, log the delta
  • User redoes from scratch → failure, log the original output for eval

Over time, these signals become your eval dataset. Real-world failures are the best test cases.
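The three behaviors above map to a simple classification. A sketch with hypothetical names; `final` is whatever output the user actually kept:

```python
def feedback_signal(original, final):
    """Map user behavior to a monitoring signal.

    final unchanged -> accepted; changed -> edited;
    None (user redid from scratch) -> failure.
    """
    if final is None:
        return "failure"   # log `original` as an eval test case
    if final == original:
        return "positive"
    return "partial"       # log the delta for review
```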

❓ Quiz 1
What's the most important monitoring metric for catching runaway agent loops?
Runaway loops generate massive token usage. A spike in tokens per request is the fastest signal that something is wrong — the agent is repeating actions or generating excessive content.
🛠 Exercise 1
Design the monitoring setup for an agent you'd build for Muno Labs. List: (1) 5 custom events you'd track in PostHog, (2) 3 alert conditions with thresholds, (3) The key dashboard you'd build. Be specific to your actual use case.