Incident Response for AI Failure

[object Object]

Detect

Monitoring catches most incidents before customers report. When customer reports beat monitoring, that’s a monitoring gap — fix it in the retro. Detection latency is a measured outcome.

Effective AI detection layers four signals. Output-quality drift via continuous LLM-judge eval against a frozen reference set, alerting on >5% degradation. Tool-call failure rate per minute, alerting at 3 sigma above baseline. CSAT or thumbs-down spike per intent in 15-minute windows. Cost-per-interaction anomaly catching runaway loops. Wire all four into PagerDuty with severity routing — Sev1 (production-wide), Sev2 (single intent), Sev3 (degraded but functioning).

Contain

Kill switch for agents must exist and be one-click. Fallback to human or simpler system. Don’t debug while customers hit broken agent. Contain first; investigate second.

Build kill switches at three granularities. Per-prompt-version: roll back to previous prompt without redeploy. Per-intent: disable a single broken flow while others run. Per-system: route all traffic to human queue or scripted fallback. Salesforce Agentforce supports per-topic disable in Setup; LangGraph and CrewAI exposed runtime config for the same. Test kill switches monthly — an untested switch is no switch.

Communicate

Status page updates within minutes. Affected users notified appropriately. Leadership and support teams briefed. Silence during incidents erodes trust more than the incident itself.

Templates accelerate response. Template 1 (initial): “We’re investigating reports of [symptom] affecting [scope]. Customers may experience [impact]. Next update in 30 minutes.” Template 2 (mitigation): “We’ve routed affected interactions to human agents while we investigate. Most customers should see normal service.” Template 3 (resolution): “Service restored at HH:MM UTC. Root cause was [X]. Full retro to follow.” Statuspage, Atlassian Statuspage, or instatus all support template libraries.

Learn

Blameless retro within 5 business days. Timeline, root cause, contributing factors, action items with owners. Track action item completion — unfollowed actions mean the learning didn’t stick.

A useful retro structure. Timeline with timestamps from first signal to all-clear. Detection: how was it noticed, what didn’t fire that should have. Containment: how long, what worked, what was clumsy. Communication: how did affected stakeholders find out. Root cause analysis using “five whys” or a fishbone — for AI incidents, common roots are prompt regressions, tool-call schema drift, knowledge-source staleness, model-version changes from the provider. Action items with owners, dates, and a 30-day completion review.

Common Failure Modes

Five recurring patterns. Silent model upgrades from the provider changing behavior overnight. Knowledge base updated without invalidating embeddings. Tool schema changed by an upstream team without notifying agent owners. Cost guardrails missing, agent loops to budget exhaustion. Kill switch lives in code only, requiring a deploy to fire — the incident outlasts the response.

What to Do This Week

Test your one-click agent kill switch in production today and document the exact RTO from decision to traffic stopped.

[object Object]

Detect

Contain

Communicate

Learn

Common Failure Modes

What to Do This Week

Get one CRM read per week.

Next articles to explore →

Agent Deployment: The Phased Rollout Playbook

On-Call for AI Agents: SRE Patterns

Cost-Per-Resolution: The AI Ops KPI to Track

AI Shifts from Assistive to Operational

SRE for AI Agents: The 2026 Playbook

LLMOps for CRM AI in 2026