Latency Budgets
Voice demands sub-second time-to-first-audio. Pauses over 2 seconds feel broken; the typical 500-800ms target sits at the edge of what users perceive as conversational. Build streaming response synthesis so speech starts before the full answer is computed — the LLM emits tokens, the TTS engine speaks them as they arrive, and the user hears the agent thinking in real time. Users tolerate short pauses mid-speech better than long lead-in silence; a one-second silence at turn start kills perceived intelligence in a way a one-second pause mid-sentence does not. Streaming ASR (Deepgram, AssemblyAI, Whisper streaming) on the input side and streaming TTS (ElevenLabs Turbo, OpenAI tts-1, Cartesia Sonic) on the output side are the table-stakes choices.
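One way to picture the streaming handoff between the LLM and the TTS engine is a small buffer that releases text at clause boundaries, so the first phrase can be spoken while later tokens are still being generated. The sketch below is illustrative only: `speakable_chunks`, `BOUNDARIES`, and the `min_chars` threshold are hypothetical names and values, not part of any provider SDK, and a production version would need to handle decimals and abbreviations that contain punctuation.

```python
from typing import Iterable, Iterator

# Punctuation that marks a natural place to hand text to the TTS engine.
# Assumption: clause-level flushing; real systems may flush on other cues.
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into short phrases for a streaming TTS engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.rstrip()
        # Flush at a clause boundary once the phrase is long enough to sound natural.
        if len(stripped) >= min_chars and stripped.endswith(BOUNDARIES):
            yield stripped.lstrip()
            buffer = ""
    if buffer.strip():
        # Flush whatever remains when the token stream ends.
        yield buffer.strip()
```

Each yielded phrase would be sent to the TTS engine immediately, so audio for the first clause can start while the model is still producing the rest of the answer.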
Voice latency budget: answer to "what's my order status?"
ASR final transcript after the user stops speaking: ~120ms
LLM first token (Claude 4 Haiku, Groq, Fireworks): ~280ms
TTS first audio frame: ~110ms
Network jitter buffer: ~80ms
Total, time to first audible word: ~590ms
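The budget is simple addition, which makes it easy to check in code. The sketch below sums the four stage budgets from the example and flags any stage whose measured latency exceeds its allotment. The stage keys, `TARGET_MS`, and both function names are illustrative assumptions, not an established schema.

```python
from typing import Dict

# Stage budgets in milliseconds, taken from the example budget above.
BUDGET_MS: Dict[str, int] = {
    "asr_final_transcript": 120,
    "llm_first_token": 280,
    "tts_first_audio_frame": 110,
    "network_jitter_buffer": 80,
}
TARGET_MS = 800  # assumption: upper end of the 500-800ms target


def time_to_first_word(stages: Dict[str, int]) -> int:
    """Total latency from end of user speech to first audible word."""
    return sum(stages.values())


def over_budget(measured: Dict[str, int], budget: Dict[str, int]) -> Dict[str, int]:
    """Return the overshoot in ms for every stage that exceeded its budget."""
    return {
        stage: measured[stage] - limit
        for stage, limit in budget.items()
        if measured.get(stage, 0) > limit
    }
```

With the budgeted values, `time_to_first_word(BUDGET_MS)` returns 590, which leaves about 210ms of headroom under the 800ms ceiling.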
Prosody Design
Over 2024-2025, synthesized voices became hard for most listeners to tell apart from human voices in short utterances; ElevenLabs v3, OpenAI tts-1-hd, and Cartesia Sonic all reach this bar. Match prosody to context: an empathetic tone for complaint handling, a crisp and brisk one for transactional confirmations, and a slightly slower pace for older demographics or accessibility-mandated deployments. Provider APIs typically expose tone, speed, and stability controls; set them deliberately rather than accepting the defaults. The 2026 best practice is per-intent voice profiles: a “billing dispute” voice profile differs from an “order status” voice profile in pace and warmth.
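Per-intent voice profiles can be kept as plain configuration that is looked up before each synthesis call. The sketch below is a minimal illustration: the field names (`speed`, `warmth`, `stability`), their scales, and the numeric values are hypothetical and would need to be mapped onto whatever controls your TTS provider actually exposes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceProfile:
    speed: float      # 1.0 = provider default pace (assumed scale)
    warmth: float     # 0.0 neutral to 1.0 most empathetic (hypothetical control)
    stability: float  # 0.0 expressive to 1.0 steady (hypothetical control)


# Illustrative values only; tune against real recordings.
PROFILES = {
    "billing_dispute": VoiceProfile(speed=0.95, warmth=0.8, stability=0.7),
    "order_status": VoiceProfile(speed=1.1, warmth=0.4, stability=0.8),
}
DEFAULT_PROFILE = VoiceProfile(speed=1.0, warmth=0.5, stability=0.75)


def profile_for(intent: str, slow_down: bool = False) -> VoiceProfile:
    """Pick the profile for an intent; optionally slow the pace by 10%."""
    base = PROFILES.get(intent, DEFAULT_PROFILE)
    if slow_down:
        # For example, accessibility-mandated deployments.
        return replace(base, speed=round(base.speed * 0.9, 2))
    return base
```

Keeping profiles in data rather than scattered through call sites makes it straightforward to review them with the people who own tone of voice.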
Interruption Handling
Users interrupt: they hear the agent start a wrong answer and cut in. The agent must detect user speech while it is speaking, stop within 200ms, listen, and respond to the new turn. Poor interruption handling feels rude and trains customers to wait silently for openings, which makes the conversation feel formal. Most voice-native platforms (Sierra, Decagon, Vapi, Retell) handle this well by default; TTS bolted onto a chat agent typically does not. Verify the behavior on your shortlist by deliberately interrupting during the demo.
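The barge-in logic described above can be sketched as a two-state controller fed by a voice activity detector (VAD). Everything here is an assumption for illustration: the 20ms frame size, the three-frame debounce, the class name, and the `stop_playback` callback are hypothetical and do not describe how any of the named platforms implement interruption.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


FRAME_MS = 20            # assumed VAD frame size
STOP_DEADLINE_MS = 200   # stop target from the text
MIN_SPEECH_FRAMES = 3    # 60ms of continuous speech, to ignore clicks and coughs


class BargeInController:
    def __init__(self, stop_playback: Callable[[], None]) -> None:
        self.state = State.LISTENING
        self._stop_playback = stop_playback
        self._speech_frames = 0

    def start_speaking(self) -> None:
        self.state = State.SPEAKING
        self._speech_frames = 0

    def on_frame(self, user_is_speaking: bool) -> bool:
        """Feed one VAD decision per frame; returns True when a barge-in fires."""
        if self.state is not State.SPEAKING:
            return False
        # Count consecutive speech frames; any silent frame resets the count.
        self._speech_frames = self._speech_frames + 1 if user_is_speaking else 0
        if self._speech_frames >= MIN_SPEECH_FRAMES:
            self._stop_playback()          # cut the agent's audio immediately
            self.state = State.LISTENING   # hand the turn back to the user
            return True
        return False
```

With these values detection takes 60ms, which leaves roughly 140ms of the 200ms allowance for playback to actually go quiet.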
Error Recovery
When the agent fails to understand, do not loop endlessly. After two failed attempts on the same intent, offer escalation: “Let me connect you to someone who can help.” A third attempt nearly always frustrates the user and burns trust. Escalation must preserve context — the human agent sees the transcript, the resolved customer ID, and the failed intent without the customer repeating themselves. The pattern that works: handoff includes a one-sentence AI summary the human reads in two seconds before joining the call.
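The two-strikes rule and the context-preserving handoff can be sketched together. In the example below, `TurnTracker`, `Handoff`, and their fields are hypothetical names, and the summary is a plain string template standing in for the AI-generated sentence the text describes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_FAILED_ATTEMPTS = 2  # escalate after two failures on the same intent


@dataclass
class Handoff:
    customer_id: str
    failed_intent: str
    summary: str            # one sentence the human reads before joining
    transcript: List[str]


@dataclass
class TurnTracker:
    customer_id: str
    transcript: List[str] = field(default_factory=list)
    failures: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, intent: str, utterance: str) -> Optional[Handoff]:
        """Log a misunderstood turn; return a Handoff once the limit is reached."""
        self.transcript.append(utterance)
        self.failures[intent] = self.failures.get(intent, 0) + 1
        if self.failures[intent] < MAX_FAILED_ATTEMPTS:
            return None  # one more attempt is still allowed
        # Placeholder summary; a real system would generate this with the LLM.
        summary = (
            f"Customer {self.customer_id} needs help with '{intent}'; "
            f"the agent failed {self.failures[intent]} times."
        )
        return Handoff(self.customer_id, intent, summary, list(self.transcript))
```

Failures are counted per intent, so one misunderstanding about billing followed by one about shipping does not trigger escalation, but two on the same intent does.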
Common Failure Modes
The recurring failures are long silent thinking pauses at turn start; robotic monotone TTS that signals “I am a bot” before the agent says anything; missing barge-in support, so the customer cannot interrupt a wrong answer; endless retry loops on misunderstood intents; and missing AI disclosure under EU AI Act Article 50 (for example, “This call may be handled by an AI agent.”), which now applies to most EU-bound voice deployments.
What Changed in 2026
Three shifts: TTS quality crossed the bar where most users cannot tell synthetic from human in short utterances; interruption handling became table stakes for voice-native platforms; and the EU AI Act Article 50 disclosure obligation became enforced practice in EU markets, requiring a clear statement at call open.
Implementation Sequence
A defensible build order: pick the voice-native platform (Sierra, Decagon, Vapi, Retell, or a hyperscaler offering); design per-intent prosody profiles; instrument latency at every stage; build the human-handoff context bridge; run a 100-call eval with internal participants before any external traffic; soft-launch on one skill at one site; then expand and tune.
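For the "instrument latency at every stage" step, a small timer wrapped around each pipeline stage is often enough to start with. The sketch below uses only the Python standard library; the `StageTimer` class and the stage names are hypothetical, and the injectable `clock` parameter exists only to make the timer easy to test.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class StageTimer:
    """Record wall-clock duration in milliseconds for each named stage of a turn."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures still show up.
            self.durations_ms[name] = (self._clock() - start) * 1000.0

    def total_ms(self) -> float:
        return sum(self.durations_ms.values())
```

A turn handler would wrap each step, for example `with timer.stage("llm_first_token"): ...`, and emit `timer.durations_ms` to whatever metrics system is already in place.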
What to do this week
Listen to one recording of a real customer-AI voice interaction. Time the silences, note the interruption attempts, log the misunderstood intents. Three minutes of careful listening reveals the next sprint better than any dashboard.
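If the recording comes with a timestamped, speaker-labelled transcript, the silence timing can be roughed out in a few lines. This sketch assumes a hypothetical `Segment` structure with start and end times in seconds; real diarization output will differ by vendor. It supplements listening and does not replace it.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str   # "user" or "agent" (assumed labels)
    start: float   # seconds from call start
    end: float


def lead_in_silences(segments: List[Segment]) -> List[float]:
    """Gap in seconds before each agent turn that directly follows a user turn.

    A negative value means the agent started before the user finished.
    """
    gaps = []
    for prev, cur in zip(segments, segments[1:]):
        if prev.speaker == "user" and cur.speaker == "agent":
            gaps.append(round(cur.start - prev.end, 3))
    return gaps


def interruption_attempts(segments: List[Segment]) -> int:
    """Count user turns that begin while the preceding agent turn is still playing."""
    return sum(
        1
        for prev, cur in zip(segments, segments[1:])
        if prev.speaker == "agent" and cur.speaker == "user" and cur.start < prev.end
    )
```

Gaps above roughly one second at turn start are the ones worth investigating first, since the text identifies lead-in silence as the most damaging kind.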