Pre-Prod
A staged environment with realistic data — production data masked or synthesized for compliance. Faker-generated names, hashed identifiers, and PII redacted to satisfy GDPR Article 32 and your own DLP policy. Run the full evaluation suite: golden-set regression (200–500 known-good cases with reference outputs), adversarial probes (Garak, PyRIT, and your hand-curated jailbreak corpus), latency and cost measurement at p50/p95/p99, and a scenario battery covering the long tail.
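As a rough illustration of the masking step, the sketch below pseudonymizes identifiers with a keyed hash and redacts email addresses. The record shape, the `PSEUDONYM_KEY` placeholder, and the email pattern are all assumptions made for the example; a real pipeline would cover many more PII types and follow your own DLP policy.

```python
import hashlib
import hmac
import re

# Hypothetical placeholder: load the real key from a secrets manager.
PSEUDONYM_KEY = b"replace-me"

# Deliberately simple email pattern; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(identifier: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Keyed hash: the same input always maps to the same token,
    so joins still work, but the token is not reversible without the key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("[EMAIL]", text)


def mask_record(record: dict) -> dict:
    """Mask one conversation record (assumed keys: user_id, message)."""
    return {
        "user_id": pseudonymize(record["user_id"]),
        "message": redact_emails(record["message"]),
    }
```

A keyed hash (HMAC) is used rather than a bare SHA-256 so that someone with a list of candidate IDs cannot rebuild the mapping without the key.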
Pass criteria, written before the run: golden-set accuracy >= baseline minus 1 point, no new high-severity safety failures, p95 latency within budget, cost per resolution within 110% of forecast. Only after pre-prod hits all four do we proceed.
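The four pass criteria can be encoded as a promotion gate so the decision is mechanical rather than argued after the fact. This is a minimal sketch: the `EvalResult` and `GateConfig` names and fields are hypothetical, and the numbers in the usage note are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    golden_accuracy: float        # percent, 0-100
    new_high_sev_failures: int    # safety failures not seen on the baseline
    p95_latency_ms: float
    cost_per_resolution: float


@dataclass(frozen=True)
class GateConfig:
    baseline_accuracy: float         # percent, 0-100
    p95_budget_ms: float
    forecast_cost: float
    accuracy_tolerance: float = 1.0  # "baseline minus 1 point"
    cost_ceiling_ratio: float = 1.10 # "within 110% of forecast"


def failed_criteria(result: EvalResult, cfg: GateConfig) -> list[str]:
    """Return the names of the criteria that failed; empty means promote."""
    failures = []
    if result.golden_accuracy < cfg.baseline_accuracy - cfg.accuracy_tolerance:
        failures.append("golden_accuracy")
    if result.new_high_sev_failures > 0:
        failures.append("safety")
    if result.p95_latency_ms > cfg.p95_budget_ms:
        failures.append("p95_latency")
    if result.cost_per_resolution > cfg.forecast_cost * cfg.cost_ceiling_ratio:
        failures.append("cost")
    return failures
```

For example, with a baseline of 92.0, a p95 budget of 3000 ms, and a forecast of 0.50 per resolution, a run scoring 91.2 accuracy at 0.54 per resolution passes, while one scoring 90.5 at 0.60 fails on both accuracy and cost.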
Limited Prod (Canary)
1–5% of live traffic to the new agent, routed by a feature flag (LaunchDarkly, Unleash, or Statsig) keyed on user ID hash so the same user gets a stable experience. Monitor in real time: error rate, latency, business outcome (resolved vs. escalated vs. abandoned), explicit user feedback (thumbs, CSAT), and cost. Compare side by side to the baseline cohort on the prior agent or human-only path.
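Feature-flag services handle the routing for you, but the underlying idea of hash-keyed assignment is simple enough to sketch. The version below is an illustration of the technique, not the algorithm any particular vendor uses; the salt value and the 10,000-bucket resolution are assumptions.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01 percentage points


def bucket(user_id: str, salt: str = "agent-v2-canary") -> int:
    """Map a user ID to a stable bucket in [0, BUCKETS).

    The salt is a hypothetical per-rollout name; changing it reshuffles
    users so the same people are not always first in every canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_canary(user_id: str, percent: float, salt: str = "agent-v2-canary") -> bool:
    """True if this user falls inside the first `percent` of buckets."""
    return bucket(user_id, salt) < round(percent * BUCKETS / 100)
```

Because the bucket depends only on the salt and the user ID, a user's assignment is stable across sessions, and everyone in the cohort at 5% is still in it when the percentage is raised to 25%.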
Soak for 1–2 weeks. Resist the urge to expand early — week-one numbers are noisy. Watch for slow-burn issues: tone drift on long conversations, accuracy degradation on edge personas, escalation handoff quality. Only after the canary cohort meets pre-defined exit criteria (statistically significant non-inferiority on three of four metrics) do we scale.
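One way to make "statistically significant non-inferiority" concrete for a higher-is-better rate such as resolution rate is a one-sided z-test on the difference in proportions. The sketch below assumes a normal approximation, a margin you choose in advance, and alpha of 0.05; none of those choices come from a specific standard, and latency or cost metrics would need a different test.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def non_inferiority_p_value(
    canary_successes: int,
    canary_n: int,
    baseline_successes: int,
    baseline_n: int,
    margin: float,
) -> float:
    """One-sided p-value for H0: canary rate <= baseline rate - margin."""
    p_c = canary_successes / canary_n
    p_b = baseline_successes / baseline_n
    se = math.sqrt(p_c * (1 - p_c) / canary_n + p_b * (1 - p_b) / baseline_n)
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_c - p_b + margin) / se
    return 1.0 - normal_cdf(z)


def is_non_inferior(
    canary_successes: int,
    canary_n: int,
    baseline_successes: int,
    baseline_n: int,
    margin: float,
    alpha: float = 0.05,
) -> bool:
    """True if we can reject 'canary is worse by more than the margin'."""
    p = non_inferiority_p_value(
        canary_successes, canary_n, baseline_successes, baseline_n, margin
    )
    return p < alpha
```

With a 2-point margin, a canary at 820/1,000 resolved against a baseline at 8,300/10,000 is inconclusive, while the same rates at ten times the sample size clear the bar. That is the statistical reason to let the canary soak rather than expand on week-one numbers.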
Scaled Prod
Gradual expansion: 5% → 25% → 50% → 100%, with at least 48 hours between steps and tighter alerting thresholds as traffic grows. At 100% scale, a 1% error rate produces 20 to 100 times the absolute error volume it did at a 1–5% canary. Auto-rollback hooks are mandatory: an error-rate breach, cost overrun, or safety incident should pull the flag without a human in the loop. The manual rollback path also stays one click away. Rolling back a broken scaled-prod deployment shouldn’t require a Sev-1 incident bridge.
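An auto-rollback hook can be as small as a function that compares a metrics snapshot to thresholds and flips the flag when any of them is breached. In this sketch the `Snapshot` and `Thresholds` shapes are hypothetical, and `disable_flag` stands in for whatever kill-switch call your flag provider exposes; it is not a real SDK function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float     # fraction, e.g. 0.01 for 1%
    max_cost_per_hour: float


@dataclass(frozen=True)
class Snapshot:
    error_rate: float
    cost_per_hour: float
    safety_incidents: int


def breaches(snapshot: Snapshot, limits: Thresholds) -> list[str]:
    """List every rollback trigger that fired for this snapshot."""
    reasons = []
    if snapshot.error_rate > limits.max_error_rate:
        reasons.append("error_rate")
    if snapshot.cost_per_hour > limits.max_cost_per_hour:
        reasons.append("cost")
    if snapshot.safety_incidents > 0:
        reasons.append("safety")
    return reasons


def enforce(
    snapshot: Snapshot,
    limits: Thresholds,
    disable_flag: Callable[[str], None],
) -> bool:
    """Pull the flag if anything breached. Returns True on rollback.

    `disable_flag` is a placeholder for your provider's kill switch;
    it receives the reasons so the rollback is self-documenting.
    """
    reasons = breaches(snapshot, limits)
    if reasons:
        disable_flag(", ".join(reasons))
        return True
    return False
```

Run `enforce` on a short interval from the monitoring pipeline, with a separate `Thresholds` instance per rollout step so the limits can tighten as traffic grows.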
Common Failure Modes
- Skipping pre-prod because “we tested it on staging.” Staging is not pre-prod unless it has production-shape data, production-scale traffic replay, and an eval suite that gates promotion.
- Canary cohort unrepresentative — routing only US-English users to canary, then pushing to a global audience and discovering the agent fails in Spanish.
- No rollback plan past 50%. Once you’re at 100% on 10K conversations/hour, ripping the agent out is itself a customer event.
- Treating model version updates as zero-risk. Anthropic and OpenAI can ship model updates under the same name (claude-sonnet-4.7, gpt-5), so behavior can change silently. Pin model versions and treat any change as a new deployment.
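A cheap guard for the last failure mode is a CI check that rejects unpinned model identifiers in deployment config. The sketch below assumes pinned identifiers end in a dated snapshot suffix (`-YYYYMMDD` or `-YYYY-MM-DD`). That is a heuristic, and naming conventions differ by provider, so confirm it against your provider's model documentation before relying on it. The config shape is hypothetical.

```python
import re

# Assumption: a pinned model ID ends with a snapshot date.
SNAPSHOT_SUFFIX = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")


def is_pinned(model_id: str) -> bool:
    """True if the identifier carries a dated snapshot suffix."""
    return SNAPSHOT_SUFFIX.search(model_id) is not None


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the config keys whose model ID looks like a floating alias.

    `config` is a hypothetical mapping of agent role to model ID.
    """
    return sorted(role for role, model in config.items() if not is_pinned(model))
```

Failing the build when `unpinned_models` returns anything forces every model change through the same pre-prod and canary gates as a code change.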
Post-Deployment
Continuous monitoring, not deploy-and-forget. Agents drift. Model versions change. User behavior shifts seasonally and with marketing campaigns. Knowledge bases age. The regression suite runs nightly; outcome tracking runs continuously; weekly review of the worst 1% of conversations stays on the calendar.
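Pulling the worst 1% for the weekly review is easy to automate once each conversation has some quality score. The sketch below assumes a numeric `score` field where lower is worse; how that score is produced (CSAT, an evaluator model, escalation flags) is up to you and is not specified here.

```python
import heapq


def worst_one_percent(conversations: list[dict]) -> list[dict]:
    """Return the lowest-scoring 1% of conversations, worst first.

    Rounds up so that any non-empty batch yields at least one item.
    Assumes each dict has a numeric 'score' key (lower is worse).
    """
    n = len(conversations)
    if n == 0:
        return []
    k = (n + 99) // 100  # integer ceiling of n / 100
    return heapq.nsmallest(k, conversations, key=lambda c: c["score"])
```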
Plan for re-evaluation every 90 days at minimum. Treat the agent as a living production system with the same operational discipline as a payments service — on-call rotation, runbooks, post-incident reviews, and an SLO you’d be willing to publish.
What to Do This Week
If you don’t have a written canary exit-criteria document for your last agent deployment, write one now. Then audit whether your current production agents would pass it.