SLO Definition
What does “working” mean for an agent? The 2026 SRE community has converged on a multi-dimensional SLO set: resolution rate above 70% for triage agents and above 50% for complex-issue agents, p95 latency under 3 seconds for chat (under 800ms for voice), error rate under 2%, cost per resolution under a use-case target ($0.30-1.50 typical), and a quality SLO measured against an offline eval set (pass rate above 85%). Pick four to five SLOs per agent, publish them, and report against them weekly. Salesforce’s Agentforce dashboards expose much of this natively; Datadog, New Relic, Langfuse, LangSmith, and Honeycomb all added agent-specific SLO templates in 2025.
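One way to make the SLO set concrete is to keep it as data and check a week's observed metrics against it. The sketch below is illustrative only: the `Slo` class, the metric names, and the $0.40 cost target are assumptions chosen for the example, while the other thresholds come from the figures above for a chat triage agent.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Slo:
    """One objective: the observed value must sit above or below the target."""
    name: str
    target: float
    direction: Literal["above", "below"]

    def is_met(self, observed: float) -> bool:
        if self.direction == "above":
            return observed > self.target
        return observed < self.target


# Hypothetical metric names; thresholds follow the text for a chat triage agent.
# The cost target is an assumed value inside the $0.30-1.50 range.
TRIAGE_CHAT_SLOS = [
    Slo("resolution_rate", 0.70, "above"),
    Slo("latency_p95_seconds", 3.0, "below"),
    Slo("error_rate", 0.02, "below"),
    Slo("cost_per_resolution_usd", 0.40, "below"),
    Slo("eval_pass_rate", 0.85, "above"),
]


def weekly_report(observed: dict[str, float]) -> dict[str, bool]:
    """Map each SLO name to whether this week's observed value met it."""
    return {slo.name: slo.is_met(observed[slo.name]) for slo in TRIAGE_CHAT_SLOS}
```

Calling `weekly_report` with one value per metric returns a pass/fail flag per SLO, which is enough to drive a one-page weekly summary.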
Error Budgets
A 99% availability SLO yields a 1% error budget: roughly 7.2 hours over a 30-day month. When the budget is healthy, ship aggressively. When the budget is burning, freeze non-critical changes until it recovers. The classic Google SRE discipline applies cleanly to agent operations, with the additional twist that quality regressions consume budget the same way uptime issues do: a model upgrade that drops eval pass rate from 88% to 81% burns the quality budget even if uptime is perfect. Wire the error budget burn rate into the deployment system so a fast burn pages the on-call and pauses rollouts.
April error budget snapshot: Triage Agent EMEA
SLO            Target       Current       Status
Availability   99.5%        Burn 0.18%    Healthy
Latency p95    <2.5s        2.1s          Healthy
Quality        >85% pass    83%           Burning
Cost           <$0.40/r     $0.31         Healthy
Action: freeze prompt edits, investigate quality regression
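The budget arithmetic and the freeze rule can be sketched in a few lines. This is a minimal illustration, not a standard formula for quality budgets: it assumes the quality budget is the allowed failure fraction (1 minus the pass-rate target), and the function names and the `freeze_at` threshold are hypothetical.

```python
def error_budget_hours(slo: float, window_hours: float = 30 * 24) -> float:
    """Allowed downtime in the window for an availability SLO given as a fraction."""
    return (1.0 - slo) * window_hours


def budget_consumed(target: float, observed: float) -> float:
    """Fraction of the error budget used; above 1.0 means the budget is exhausted.

    Both arguments are "good" fractions, e.g. availability or eval pass rate.
    Assumption: the budget is the allowed bad fraction, 1 - target.
    """
    allowed_bad = 1.0 - target
    observed_bad = 1.0 - observed
    return observed_bad / allowed_bad


def rollout_allowed(consumption: dict[str, float], freeze_at: float = 1.0) -> bool:
    """Deploy gate: block rollouts once any SLO has used its budget."""
    return all(used < freeze_at for used in consumption.values())


# A 99% SLO over a 30-day month leaves about 7.2 hours of budget.
monthly_budget = round(error_budget_hours(0.99), 2)

# The model upgrade from the text: pass rate falls from 88% to 81% against an 85% target.
before = budget_consumed(0.85, 0.88)  # under 1.0, still within budget
after = budget_consumed(0.85, 0.81)   # over 1.0, budget exhausted
```

Under these assumptions a deployment pipeline would call `rollout_allowed` before each release and refuse to proceed while any budget is spent, which is the enforcement step the freeze action above relies on.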
Chaos Testing
Inject failures to verify resilience. LLM provider unavailable — does the agent fail over to a secondary (Anthropic Claude as backup to OpenAI GPT-5, or vice versa)? Retrieval system slow — does the agent degrade gracefully or hang? Vector DB returns empty results — does the agent acknowledge uncertainty or hallucinate? Tool API rate-limited — does the orchestrator back off correctly? Run these scenarios in a chaos schedule (Gremlin, Litmus, AWS Fault Injection Simulator); do not discover them in production. Build a chaos game day quarterly with the on-call team in the room.
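The provider-outage scenario can be rehearsed in a unit test before it reaches a chaos tool. The sketch below uses stand-in provider functions rather than real SDK calls; `ProviderDown`, `complete_with_failover`, and `inject_outage` are hypothetical names introduced for illustration.

```python
from typing import Callable

Provider = Callable[[str], str]


class ProviderDown(Exception):
    """Raised by a provider wrapper when the upstream LLM is unavailable."""


def complete_with_failover(
    prompt: str, providers: list[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderDown as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def inject_outage(_provider: Provider) -> Provider:
    """Chaos wrapper: swap a provider for one that always fails."""
    def broken(prompt: str) -> str:
        raise ProviderDown("injected outage")
    return broken


# Stand-ins for real clients; a real test would wrap the actual SDK calls.
def primary(prompt: str) -> str:
    return f"primary answer to {prompt!r}"


def secondary(prompt: str) -> str:
    return f"secondary answer to {prompt!r}"


def run_failover_experiment() -> str:
    """Take the primary down and report which provider served the request."""
    providers = [("primary", inject_outage(primary)), ("secondary", secondary)]
    name, _reply = complete_with_failover("reset my password", providers)
    return name
```

The experiment passes only if the secondary serves the request. The same wrapper pattern extends to the other scenarios, for example a retrieval stub that sleeps or returns an empty list.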
On-Call Rotation
Put ops engineers on a rotation backed by an incident response playbook specific to agent failures. The playbook should answer: How do I see the last 50 conversations? How do I roll back the last prompt change? How do I disable the agent for one segment without disabling it for all? How do I rotate the OpenAI key if it leaked? Pre-provision tools and access so the on-call is not assembling permissions during the incident. Page on SLO burn-rate violations rather than raw alerts to reduce noise; PagerDuty, Opsgenie, and incident.io all support multi-window burn-rate alerts.
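A multi-window burn-rate check pages only when both a long and a short window show the budget burning fast, which filters out brief spikes that have already subsided. The sketch below assumes a one-hour long window and a five-minute short window with a threshold of 14.4, the pairing described in the Google SRE Workbook for a 30-day budget. The function names are hypothetical, and the error ratios would come from your own metrics store.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 uses exactly the whole budget over the SLO window.
    """
    return error_ratio / (1.0 - slo)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )
```

With a 99.5% SLO, an 8% error ratio in both windows is a burn rate of about 16 and pages. The same 8% over the last hour with a quiet last five minutes does not, because the incident has likely already ended.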
Common Failure Modes
The recurring failures: SLOs defined only on uptime, missing quality and cost; error budgets tracked but not enforced (the team ships through the burn); chaos testing skipped because “we test in prod naturally”; an on-call rotation without a runbook, leading to 90-minute response times that should be 10 minutes. The most expensive failure: discovering during the incident that nobody knows the kill switch.
What Changed in 2026
Three shifts: quality SLOs joined uptime SLOs as first-class metrics, multi-provider failover became standard architecture (no team wants single-vendor dependency after the 2024 OpenAI outages), and burn-rate alerting matured beyond the original Google playbook to handle agent-specific signals.
Implementation Sequence
A defensible 60-day plan: weeks 1-2, define SLOs with the product owner; weeks 3-4, instrument and dashboard; weeks 5-6, set error budget policies and wire to deploys; weeks 7-8, run the first chaos game day. Then institutionalize as a quarterly rhythm.
What to do this week
Write down the SLOs for one agent on one page and circulate it to engineering, product, and the budget owner. If the three groups disagree, that disagreement is the most important SRE conversation to resolve.