Detect
Monitoring catches most incidents before customers report. When customer reports beat monitoring, that’s a monitoring gap — fix it in the retro. Detection latency is a measured outcome.
Effective AI detection layers four signals. Output-quality drift via continuous LLM-judge eval against a frozen reference set, alerting on >5% degradation. Tool-call failure rate per minute, alerting at 3 sigma above baseline. CSAT or thumbs-down spike per intent in 15-minute windows. Cost-per-interaction anomaly catching runaway loops. Wire all four into PagerDuty with severity routing — Sev1 (production-wide), Sev2 (single intent), Sev3 (degraded but functioning).
Contain
Kill switch for agents must exist and be one-click. Fallback to human or simpler system. Don’t debug while customers hit broken agent. Contain first; investigate second.
Build kill switches at three granularities. Per-prompt-version: roll back to previous prompt without redeploy. Per-intent: disable a single broken flow while others run. Per-system: route all traffic to human queue or scripted fallback. Salesforce Agentforce supports per-topic disable in Setup; LangGraph and CrewAI exposed runtime config for the same. Test kill switches monthly — an untested switch is no switch.
Communicate
Status page updates within minutes. Affected users notified appropriately. Leadership and support teams briefed. Silence during incidents erodes trust more than the incident itself.
Templates accelerate response. Template 1 (initial): “We’re investigating reports of [symptom] affecting [scope]. Customers may experience [impact]. Next update in 30 minutes.” Template 2 (mitigation): “We’ve routed affected interactions to human agents while we investigate. Most customers should see normal service.” Template 3 (resolution): “Service restored at HH:MM UTC. Root cause was [X]. Full retro to follow.” Statuspage, Atlassian Statuspage, or instatus all support template libraries.
Learn
Blameless retro within 5 business days. Timeline, root cause, contributing factors, action items with owners. Track action item completion — unfollowed actions mean the learning didn’t stick.
A useful retro structure. Timeline with timestamps from first signal to all-clear. Detection: how was it noticed, what didn’t fire that should have. Containment: how long, what worked, what was clumsy. Communication: how did affected stakeholders find out. Root cause analysis using “five whys” or a fishbone — for AI incidents, common roots are prompt regressions, tool-call schema drift, knowledge-source staleness, model-version changes from the provider. Action items with owners, dates, and a 30-day completion review.
Common Failure Modes
Five recurring patterns. Silent model upgrades from the provider changing behavior overnight. Knowledge base updated without invalidating embeddings. Tool schema changed by an upstream team without notifying agent owners. Cost guardrails missing, agent loops to budget exhaustion. Kill switch lives in code only, requiring a deploy to fire — the incident outlasts the response.
What to Do This Week
Test your one-click agent kill switch in production today and document the exact RTO from decision to traffic stopped.