
Monitoring and Observability

You shipped your agent. Now what? Without monitoring, you won't know when it breaks, when costs spike, or when quality degrades. Observability for AI products differs from observability for traditional software.

What to Track

Latency

  • Time per tool call: Is any tool consistently slow?
  • Total task time: How long does the full agent workflow take?
  • Time to first token: For streaming responses, how quickly does the user see output?

Cost

  • Tokens per request: Input + output tokens. Spikes indicate prompt bloat or runaway loops.
  • Cost per task type: "Generate report" costs $X, "Search meetings" costs $Y. Budget accordingly.
  • Daily/weekly totals: Catch unexpected spikes before they become expensive.
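Cost per request follows directly from token counts. A minimal sketch; the per-million-token prices below are placeholders, so substitute your model's actual pricing:

```python
# Assumed prices -- replace with your model's real rates.
PRICE_PER_M_INPUT = 3.00    # USD per 1M input tokens (placeholder)
PRICE_PER_M_OUTPUT = 15.00  # USD per 1M output tokens (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A sudden jump in this number is the "prompt bloat or runaway loop" signal.
print(round(request_cost(12_000, 2_000), 4))  # 0.066
```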

Quality

  • Task completion rate: What percentage of agent runs complete successfully?
  • Error rate by tool: Which tools fail most? Network errors, bad inputs, rate limits?
  • User satisfaction: Did the user accept the output, edit it, or redo it from scratch?

Safety

  • Refused actions: How often does the agent correctly refuse unsafe requests?
  • Escalation rate: How often does the agent need human intervention?
  • Out-of-scope requests: How often do users ask for things the agent can't do?
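The quality metrics above fall out of raw run and tool-call records. A minimal aggregation sketch; the field names are illustrative, not a standard schema:

```python
from collections import Counter

def summarize(runs, tool_calls):
    """Compute completion rate and per-tool error rate.

    runs: list of dicts like {"status": "completed" | "failed"}
    tool_calls: list of dicts like {"tool": str, "success": bool}
    """
    completion_rate = sum(r["status"] == "completed" for r in runs) / len(runs)
    errors, totals = Counter(), Counter()
    for call in tool_calls:
        totals[call["tool"]] += 1
        if not call["success"]:
            errors[call["tool"]] += 1
    error_rate_by_tool = {t: errors[t] / totals[t] for t in totals}
    return completion_rate, error_rate_by_tool
```

The same records feed escalation rate and refusal rate; only the grouping key changes.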

Tracing Agent Execution

Traditional logging shows you isolated log lines. Agent tracing shows you the reasoning chain:

Trace: generate_weekly_report
├── [0ms] Start: user request received
├── [120ms] Tool: fetch_posthog_analytics → success (3 metrics)
├── [450ms] Tool: search_meetings → success (5 meetings found)
├── [800ms] Thinking: analyzing data patterns...
├── [1200ms] Tool: create_notion_page → success
├── [1500ms] Tool: send_slack_message → failed (channel not found)
├── [1600ms] Recovery: retrying with corrected channel ID
├── [1800ms] Tool: send_slack_message → success
└── [1900ms] Complete: report delivered

Each step records a timestamp, the tool called, an input/output summary, and a success/failure status. This lets you debug agent failures by replaying the decision chain.
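A trace like the one above can be produced by a small recorder. A sketch, not a production tracer; the class and method names are illustrative:

```python
import time

class Trace:
    """Minimal agent-execution trace: one timestamped entry per step."""

    def __init__(self, name):
        self.name = name
        self.start = time.monotonic()
        self.steps = []

    def record(self, kind, detail, success=True):
        elapsed_ms = int((time.monotonic() - self.start) * 1000)
        self.steps.append({"ms": elapsed_ms, "kind": kind,
                           "detail": detail, "success": success})

    def render(self):
        lines = [f"Trace: {self.name}"]
        for s in self.steps:
            status = "" if s["success"] else " (failed)"
            lines.append(f"├── [{s['ms']}ms] {s['kind']}: {s['detail']}{status}")
        return "\n".join(lines)

trace = Trace("generate_weekly_report")
trace.record("Tool", "search_meetings → 5 meetings found")
trace.record("Tool", "send_slack_message", success=False)
```

Calling `record` around every tool invocation is enough to replay a failed run step by step.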

PostHog for AI Products

You already use PostHog. For AI products, track custom events:

  • agent_task_started — with task_type, model, estimated_complexity
  • agent_tool_called — with tool_name, latency_ms, success
  • agent_task_completed — with total_tokens, total_cost, quality_score
  • agent_error — with error_type, tool_name, recovery_attempted
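In production these events would go through the PostHog SDK (its `capture` call takes a distinct ID, an event name, and a properties dict). The sketch below uses an in-memory list in place of the SDK so the event shapes are visible; all property values are illustrative:

```python
events = []

def capture(distinct_id, event, properties):
    # Stand-in for the real PostHog client; appends instead of sending.
    events.append({"distinct_id": distinct_id,
                   "event": event, "properties": properties})

capture("user_42", "agent_task_started",
        {"task_type": "generate_report", "model": "claude",
         "estimated_complexity": "medium"})
capture("user_42", "agent_tool_called",
        {"tool_name": "search_meetings", "latency_ms": 330, "success": True})
capture("user_42", "agent_task_completed",
        {"total_tokens": 14_000, "total_cost": 0.066, "quality_score": 0.9})
```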

Build dashboards that answer:

  • What are the most common task types?
  • What's the average cost per task type?
  • Which tools fail most frequently?
  • What's the trend in task completion rate?

Alerting

Set alerts for:

  • Cost spike: Daily cost > 2x average → alert
  • Quality drop: Completion rate < 90% → alert
  • Error rate: Any tool > 10% error rate → alert
  • Latency: Average task time > 2x baseline → alert
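The four conditions above reduce to a handful of comparisons against a baseline. A sketch, assuming metrics arrive as plain dicts with these (illustrative) keys:

```python
def check_alerts(metrics, baseline):
    """Return the names of alert conditions that fire."""
    alerts = []
    if metrics["daily_cost"] > 2 * baseline["daily_cost"]:
        alerts.append("cost_spike")
    if metrics["completion_rate"] < 0.90:
        alerts.append("quality_drop")
    for tool, rate in metrics["tool_error_rates"].items():
        if rate > 0.10:
            alerts.append(f"error_rate:{tool}")
    if metrics["avg_task_time_s"] > 2 * baseline["avg_task_time_s"]:
        alerts.append("latency")
    return alerts
```

Run it on a schedule (hourly or daily) and route any non-empty result to your alerting channel.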

The Feedback Loop

The most valuable monitoring signal is user behavior:

  • User accepts agent output → positive signal
  • User edits agent output → partial success, log the delta
  • User redoes from scratch → failure, log the original output for eval

Over time, these signals become your eval dataset. Real-world failures are the best test cases.
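The three behaviors above map to a simple classification. A sketch with hypothetical names; `final` is whatever output the user actually kept:

```python
def feedback_signal(original, final):
    """Map user behavior to a monitoring signal.

    final unchanged -> accepted; changed -> edited;
    None (user redid from scratch) -> failure.
    """
    if final is None:
        return "failure"   # log `original` as an eval test case
    if final == original:
        return "positive"
    return "partial"       # log the delta for review
```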

❓ Quiz 1
What's the most important monitoring metric for catching runaway agent loops?
Runaway loops generate massive token usage. A spike in tokens per request is the fastest signal that something is wrong — the agent is repeating actions or generating excessive content.
🛠 Exercise 1
Design the monitoring setup for an agent you'd build for Muno Labs. List: (1) 5 custom events you'd track in PostHog, (2) 3 alert conditions with thresholds, (3) The key dashboard you'd build. Be specific to your actual use case.