Why Chaos Testing
Production AI fails in ways that don’t show up in unit or integration tests. Model endpoints slow under load, retrieval returns empty result sets after an index rebuild, embedding-service certificates expire, cost rate-limits trip silently mid-conversation, prompt-cache hit rates collapse after a system-prompt edit. Chaos engineering — controlled fault injection in a real environment — verifies the system degrades gracefully rather than failing catastrophically when these conditions arrive on their own schedule, which is always at peak traffic.
The 2025 incident pattern was consistent: silent dependency failures (Azure OpenAI throttling, Pinecone latency spikes, knowledge-base reindex windows) cascaded into customer-facing errors because no fallback path had actually been tested. Game day exercises can surface these gaps months before the real outage.
Failure Scenarios Worth Testing
A starter catalogue:
- Primary LLM unavailable (HTTP 5xx, timeout, rate-limit): does the agent fail over to a secondary model? Within what latency budget?
- Vector DB slow or empty: does the agent degrade to keyword search such as BM25, or to a cached answer, rather than confidently hallucinating?
- Cost rate-limit hit: a graceful throttle with a user-facing “checking back in a moment” message, or a cascading failure with cryptic 500s?
- System prompt or tool definition rolled back to a previous version mid-traffic: does versioned routing handle the cutover?
- One tool in a chain fails: does the agent recognize the failure and retry, escalate, or proceed with bad data?
- Embedding model upgraded: does the existing index still answer correctly, or does drift cause silent retrieval degradation?
- Knowledge base partially deleted (one source goes offline): does the agent admit the gap or invent an answer?
- Identity provider slow: does session auth degrade gracefully instead of looping?
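For the first scenario in the catalogue, one way to make failover and its latency budget testable is sketched below. Everything named here is an assumption for illustration: the `call_with_failover` helper, the `(prompt, timeout_s)` call signature, and the static fallback message are hypothetical and not taken from any specific SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (prompt, timeout_s) -> completion text.
ModelCall = Callable[[str, float], str]

STATIC_FALLBACK = "Sorry, I can't answer right now. Please try again shortly."


@dataclass
class FailoverResult:
    text: str
    served_by: str    # name of the model that answered, or "static-fallback"
    elapsed_s: float
    degraded: bool    # True when anything other than the primary answered


def call_with_failover(
    prompt: str,
    models: Sequence[Tuple[str, ModelCall]],
    budget_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> FailoverResult:
    """Try each model in order until one answers or the budget runs out."""
    start = clock()
    for index, (name, call) in enumerate(models):
        remaining = budget_s - (clock() - start)
        if remaining <= 0:
            break  # budget exhausted: stop trying and degrade
        try:
            # Each call gets only the time left in the overall budget.
            text = call(prompt, remaining)
        except Exception:
            # Broad catch for the sketch; real code would narrow this to
            # timeouts, 5xx responses, and rate-limit errors.
            continue
        return FailoverResult(text, name, clock() - start, degraded=index > 0)
    return FailoverResult(STATIC_FALLBACK, "static-fallback", clock() - start, True)
```

During an experiment, the catalogue's two questions become measurable: the share of results with `degraded=True`, and whether `elapsed_s` stays inside the budget.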
Implementation
Schedule chaos experiments during business hours but route them to a small percentage of traffic, or run them in a production-shaped staging environment with synthetic traffic. Tools: AWS Fault Injection Service (formerly Fault Injection Simulator), Gremlin, Chaos Mesh (Kubernetes), Chaos Monkey for the orchestration layer. For AI-specific faults, use custom middleware that intercepts model calls and synthesizes failures (latency injection, error responses, partial responses, contradictory tool outputs).
```python
# Example: inject a 30% timeout rate into model calls for 10 minutes.
chaos = ChaosController(target="model:claude-sonnet-4.7")
chaos.inject(fault="timeout", probability=0.3, duration_min=10)
# Observe: fallback rate, escalation rate, customer-visible errors.
```
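The `ChaosController` in the example above is not defined in the text, so what follows is only a minimal sketch of what such middleware might look like. It assumes a model call is a plain function from prompt to text, supports only the `timeout` fault, and keeps the same `inject(fault, probability, duration_min)` shape. The `wrap` method, the counters, and the injectable `rng` and `clock` are hypothetical additions for testability.

```python
import random
import time
from typing import Callable, Optional


class ChaosController:
    """Hypothetical fault injector that wraps a model-call function."""

    def __init__(
        self,
        target: str,
        rng: Optional[random.Random] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.target = target
        self._rng = rng or random.Random()
        self._clock = clock
        self._fault: Optional[str] = None
        self._probability = 0.0
        self._expires_at = 0.0
        self.injected = 0  # calls that received a synthetic fault
        self.passed = 0    # calls forwarded to the real model

    def inject(self, fault: str, probability: float, duration_min: float) -> None:
        if fault != "timeout":
            raise ValueError(f"unsupported fault in this sketch: {fault!r}")
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        self._fault = fault
        self._probability = probability
        self._expires_at = self._clock() + duration_min * 60

    def stop(self) -> None:
        """Kill switch: end the experiment immediately."""
        self._fault = None

    def active(self) -> bool:
        return self._fault is not None and self._clock() < self._expires_at

    def wrap(self, call_model: Callable[[str], str]) -> Callable[[str], str]:
        def wrapped(prompt: str) -> str:
            if self.active() and self._rng.random() < self._probability:
                self.injected += 1
                raise TimeoutError(f"chaos: injected timeout for {self.target}")
            self.passed += 1
            return call_model(prompt)

        return wrapped
```

Wrapping the client once at startup (`call_model = chaos.wrap(call_model)`) keeps the injection point in one place, and the experiment expires on its own if nobody remembers to call `stop()`.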
Observability should capture behavior in real time while the fault is active. Hold a post-experiment retro within 48 hours: did the system respond as designed? Where were the gaps? What’s the fix and when does it ship? Without the retro, chaos is just chaos.
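To give the retro something concrete to argue about, the three signals named in the example (fallback rate, escalation rate, customer-visible errors) can be compared against limits agreed before the experiment. The sketch below assumes a hypothetical per-request record, `RequestOutcome`; the field names and the limits are illustrative, not drawn from any particular observability stack.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RequestOutcome:
    used_fallback: bool
    escalated: bool
    customer_visible_error: bool


def summarize(outcomes: Iterable[RequestOutcome]) -> Dict[str, float]:
    """Turn per-request outcomes into the three rates watched during a run."""
    items = list(outcomes)
    total = len(items)
    if total == 0:
        return {
            "fallback_rate": 0.0,
            "escalation_rate": 0.0,
            "customer_visible_error_rate": 0.0,
        }
    return {
        "fallback_rate": sum(o.used_fallback for o in items) / total,
        "escalation_rate": sum(o.escalated for o in items) / total,
        "customer_visible_error_rate": sum(
            o.customer_visible_error for o in items
        ) / total,
    }


def gaps(summary: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return the names of metrics that exceeded their pre-agreed limit."""
    return [name for name, limit in limits.items() if summary[name] > limit]
```

Writing the limits down before the run keeps "did the system respond as designed?" from being answered after the fact.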
Game Day Pattern
Run a game day every quarter, with about two hours per session. Staff it with a cross-functional team: agent ops, engineering, on-call, a customer support lead, and an executive observer. Keep the pre-defined scenario script secret from the responders. Run, observe, debrief. Mature programs run game days against the full incident response process (paging, war-room formation, customer comms), not just the technical recovery.
Common Failure Modes
- Running chaos against staging that doesn’t match prod (different model versions, different traffic shapes, different scale).
- “Fault injection” that’s actually just turning off the service — too coarse to find subtle degradations.
- No blast radius limit, so the chaos experiment causes the incident.
- Skipping the retro because everything “worked” — the retro is where learning happens.
- Treating chaos as one-time validation instead of standing practice.
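The blast-radius failure mode in the list above is the one most amenable to a code-level guard. A minimal sketch, assuming sessions carry a stable string ID: only a fixed fraction of sessions is ever exposed to the fault, and the experiment shuts itself off if customer-visible errors pass a threshold. `BlastRadiusGuard` and its parameters are hypothetical names chosen for this example.

```python
import hashlib


class BlastRadiusGuard:
    """Hypothetical guard: caps exposure and trips an automatic kill switch."""

    def __init__(
        self,
        traffic_fraction: float,
        max_error_rate: float,
        min_samples: int = 20,
    ) -> None:
        if not 0.0 <= traffic_fraction <= 1.0:
            raise ValueError("traffic_fraction must be between 0 and 1")
        self.traffic_fraction = traffic_fraction
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.killed = False
        self._total = 0
        self._errors = 0

    def in_experiment(self, session_id: str) -> bool:
        """Deterministically assign a session to the experiment or not."""
        if self.killed:
            return False
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        # Map the hash to one of 10,000 buckets so a session always gets
        # the same answer for the lifetime of the experiment.
        bucket = int.from_bytes(digest[:8], "big") % 10_000
        return bucket < round(self.traffic_fraction * 10_000)

    def record(self, customer_visible_error: bool) -> None:
        """Record one exposed request; trip the kill switch if needed."""
        self._total += 1
        self._errors += int(customer_visible_error)
        if (
            self._total >= self.min_samples
            and self._errors / self._total > self.max_error_rate
        ):
            self.killed = True
```

Hashing the session ID instead of sampling per request keeps a single conversation from flipping in and out of the experiment midway.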
Cultural Shift
Chaos engineering feels scary in low-trust environments. Normalize it through small, well-bounded experiments: a single non-critical agent in a test environment with an explicit kill switch. Build confidence iteratively. Expand scope as the team matures. The first game day is uncomfortable; by the tenth, it is routine and the team is meaningfully more capable in real incidents.
What to Do This Quarter
Pick the dependency you’re most afraid to lose. Fail it on purpose, in a controlled window. Find out what breaks before the universe finds out for you.