
Alerting

Alert on: agent error rate spikes, Trust Layer block rate rises, cost anomalies, latency SLA breaches, and user handoff rate jumps. Each signals a different class of problem.

Five signal classes with thresholds. A tool-call error rate above 2% over a 5-minute window indicates a tool, schema, or auth failure. A guardrail block rate above 5% suggests a prompt injection campaign or a new failure pattern. A cost burn rate exceeding 200% of the trailing-7-day baseline catches runaway loops. A p95 latency above 8 seconds breaches the user-facing SLA on most CRM channels. A human-handoff rate jumping 3+ standard deviations means agent quality has regressed. Wire each signal into PagerDuty with severity routing distinct from infrastructure alerts.
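The five thresholds can be expressed as one check over a metrics window. This is a minimal sketch, not a reference implementation: the `WindowMetrics` fields and the `breached_signals` function are hypothetical names, and it assumes the handoff baseline is a list of recent per-window handoff rates.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List


@dataclass
class WindowMetrics:
    """Hypothetical snapshot of one 5-minute window (field names are assumptions)."""
    tool_calls: int
    tool_call_errors: int
    guardrail_checks: int
    guardrail_blocks: int
    cost_per_hour: float
    baseline_cost_per_hour: float  # trailing-7-day average burn rate
    p95_latency_s: float
    handoff_rate: float
    handoff_rate_history: List[float]  # recent per-window handoff rates


def breached_signals(m: WindowMetrics) -> List[str]:
    """Return the names of the signals whose thresholds are breached."""
    breaches = []
    if m.tool_calls and m.tool_call_errors / m.tool_calls > 0.02:
        breaches.append("tool_call_error_rate")
    if m.guardrail_checks and m.guardrail_blocks / m.guardrail_checks > 0.05:
        breaches.append("guardrail_block_rate")
    # "Exceeding 200% of baseline" is read here as more than 2x the baseline.
    if m.baseline_cost_per_hour > 0 and m.cost_per_hour > 2.0 * m.baseline_cost_per_hour:
        breaches.append("cost_burn_rate")
    if m.p95_latency_s > 8.0:
        breaches.append("p95_latency")
    # A z-score needs at least two history points and non-zero spread.
    if len(m.handoff_rate_history) >= 2:
        mu = mean(m.handoff_rate_history)
        sigma = stdev(m.handoff_rate_history)
        if sigma > 0 and (m.handoff_rate - mu) / sigma >= 3.0:
            breaches.append("handoff_rate")
    return breaches
```

Each returned name would then map to its own PagerDuty routing rule, kept separate from infrastructure alerts.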

On-Call Roles

Primary: responds to alerts within SLA. Secondary: escalation when the primary is stuck. AI Ops lead: escalation for systemic issues needing architectural change. Define the roles clearly; don’t improvise them mid-incident.

Role definitions. Primary on-call: 15-minute acknowledge SLA, full kill-switch authority, runbook execution, customer-comms drafting. Rotates weekly. Secondary: 30-minute escalation SLA, supports primary on multi-system incidents. AI Ops lead: business hours, owns repeated-incident root cause and architectural fixes. AI product owner: communications stakeholder, brought in for any customer-impacting incident over 30 minutes. Document which role pages whom; automate the escalation chain in PagerDuty schedules.
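One way to document "which role pages whom" is as a small, testable function before encoding it in PagerDuty schedules. The sketch below is one reading of the role definitions above, under stated assumptions: the secondary is paged when the primary has not acknowledged within 15 minutes, and the AI Ops lead is paged only for repeated incidents during business hours. The function and role names are hypothetical.

```python
from typing import List


def roles_to_page(
    minutes_since_alert: float,
    acknowledged: bool,
    customer_impacting: bool,
    repeated_incident: bool,
    business_hours: bool,
) -> List[str]:
    """Return the roles that should have been paged by this point in the incident."""
    roles = ["primary"]  # the primary is always paged first
    # Assumption: a missed 15-minute acknowledge SLA escalates to the secondary.
    if not acknowledged and minutes_since_alert >= 15:
        roles.append("secondary")
    # The AI Ops lead owns repeated-incident root cause, business hours only.
    if repeated_incident and business_hours:
        roles.append("ai_ops_lead")
    # The product owner joins customer-impacting incidents that run over 30 minutes.
    if customer_impacting and minutes_since_alert > 30:
        roles.append("ai_product_owner")
    return roles
```

Writing the chain down this way makes gaps visible, for example who is paged for a repeated incident outside business hours.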

Playbooks

Per alert type. Cost spike -> identify runaway agent, throttle, investigate. Block rate spike -> audit recent prompt changes, revert if needed. Latency breach -> scale model endpoint, fall back to smaller model if possible.

Eight working playbooks. Cost spike: gateway rate-limit, identify caller via metadata, rollback prompt or kill agent. Hallucination cluster: check knowledge base freshness, verify embeddings, lower temperature, kill if continued. Tool-call failure: confirm upstream tool unchanged, rotate auth tokens, route to fallback. Model regression after vendor upgrade: pin to previous version, file vendor ticket. Latency spike: check vendor status page, switch to fallback model in Portkey, scale Bedrock provisioned throughput. Prompt injection campaign: enable strict guardrails, inspect inputs, block IP range, notify security. PII leak: kill switch, preserve logs, page legal and DPO. Eval regression: rollback prompt to last green, file bug.
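The eight playbooks above can be kept as data so that an alert payload links directly to its steps. This is an illustrative sketch: the alert-type keys and the `playbook_for` helper are hypothetical, and the step text is condensed from the list above.

```python
from typing import Dict, List

# Ordered steps per alert type, condensed from the playbooks above.
PLAYBOOKS: Dict[str, List[str]] = {
    "cost_spike": ["apply gateway rate limit", "identify caller via metadata",
                   "roll back prompt or kill agent"],
    "hallucination_cluster": ["check knowledge base freshness", "verify embeddings",
                              "lower temperature", "kill agent if it continues"],
    "tool_call_failure": ["confirm upstream tool is unchanged", "rotate auth tokens",
                          "route to fallback"],
    "model_regression": ["pin to previous model version", "file vendor ticket"],
    "latency_spike": ["check vendor status page", "switch to fallback model",
                      "scale provisioned throughput"],
    "prompt_injection": ["enable strict guardrails", "inspect inputs",
                         "block IP range", "notify security"],
    "pii_leak": ["fire kill switch", "preserve logs", "page legal and DPO"],
    "eval_regression": ["roll back prompt to last green", "file bug"],
}


def playbook_for(alert_type: str) -> List[str]:
    """Return the ordered steps for an alert type, or an escalation default."""
    # Assumption: an unknown alert type escalates rather than failing silently.
    return PLAYBOOKS.get(alert_type, ["escalate to secondary on-call"])
```

Keeping the steps in one structure also makes it easy to check that every alert type wired into paging has a playbook.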

Post-Incident

Every incident gets a retro. Push the root cause beyond “the AI was wrong”: what in our system allowed the wrong output to cause user impact? Keep it blameless, with concrete action items and owners.

Useful retro structure. Timeline with first signal, escalation, mitigation, all-clear. Customer impact quantified (interactions affected, revenue exposure, whether regulatory disclosure is required). Root cause categorized: prompt regression, knowledge staleness, vendor model change, tool drift, infrastructure, human error. Contributing factors: alert thresholds, runbook gaps, kill-switch effectiveness. Action items with owners and a 30-day completion review. Track action-item completion rate as an SRE health metric: under 70% completion within 30 days indicates the on-call function is overloaded.
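The completion-rate metric can be computed from a plain list of action items. The sketch below assumes that only items at least 30 days old count toward the rate and that an item is "complete" if it closed within 30 days of being opened; the `ActionItem` type and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

REVIEW_WINDOW = timedelta(days=30)


@dataclass
class ActionItem:
    owner: str
    opened: date
    closed: Optional[date] = None  # None means still open


def completion_rate_30d(items: List[ActionItem], today: date) -> Optional[float]:
    """Fraction of action items closed within 30 days, among items old enough to judge."""
    due = [i for i in items if today - i.opened >= REVIEW_WINDOW]
    if not due:
        return None  # nothing has reached its 30-day review yet
    done = [
        i for i in due
        if i.closed is not None and i.closed - i.opened <= REVIEW_WINDOW
    ]
    return len(done) / len(due)
```

A returned value below 0.70 is the overload signal described above; `None` means there is not yet enough history to judge.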

Common Failure Modes

Five recurring patterns. Alerts routed to a Slack channel nobody owns. A kill switch that lives in code and requires a deploy to fire, so the incident outlasts the response. No single incident commander, leading to conflicting decisions across teams. A vendor model upgrade causing a silent regression because pin discipline broke. Action items from the last retro still open when the next incident hits.
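For the second pattern, one alternative to a deploy-gated kill switch is a flag read at request time. The sketch below uses a JSON file as the flag store purely for illustration; the file layout, the `killed_agents` key, and the `agent_enabled` function are assumptions, and a real setup would more likely use a feature-flag service or a config store. It also assumes a fail-closed policy when the flag file is unreadable.

```python
import json
from pathlib import Path


def agent_enabled(flag_path: Path, agent_id: str) -> bool:
    """Check a runtime flag file on every request; no deploy is needed to flip it."""
    try:
        flags = json.loads(flag_path.read_text())
    except FileNotFoundError:
        return True  # no flag file means no kill has been requested
    except json.JSONDecodeError:
        return False  # assumption: fail closed on a corrupt flag file
    if not isinstance(flags, dict):
        return False  # unexpected shape is treated like a corrupt file
    return agent_id not in flags.get("killed_agents", [])
```

The on-call fires the switch by writing the agent's id into the flag store, which takes effect on the next request.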

What to Do This Week

Run a 30-minute tabletop on a cost-spike scenario with your AI on-call rotation and time the kill-switch fire from page to mitigation.
