The Capability
Multimodal AI interprets visual, audio, and text inputs together. Long, complex issues that previously required handoff between tools — photo of damaged product + customer explanation + purchase history — resolve in one interaction.
Native multimodal models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Llama 4) share a unified token space across modalities, so the model reasons jointly rather than translating each input to text first. This matters for CRM workloads where context is split across formats: a customer screenshot plus their typed complaint plus the agent’s CRM notes plus a voice memo. Single-shot reasoning replaces orchestrated handoffs and reduces cumulative error from intermediate translation steps.
CX Use Cases
Damage claims: photo + description + policy details. Technical support: screenshot + verbal description + system logs. Returns: photo of item + receipt scan + reason. Each replaces 3-5 separate interactions with 1.
Production examples. Allianz handles roughly 40% of property-damage first-notice-of-loss interactions through a multimodal flow: photo upload + structured intake + policy lookup + immediate triage decision. Best Buy’s Geek Squad uses screenshot-driven support reducing average handle time from 14 to 8 minutes on tier-1 tickets. Shopify merchants deploy multimodal return triage that compares the returned-item photo against the original product image to detect bait-and-switch returns. Each case eliminated 2-3 round-trips that previously occupied both customer and agent.
Provider Capabilities
GPT-4o, Claude Sonnet 4.5, Gemini all multimodal. Quality varies per modality — Gemini strong on image understanding, Claude on reasoning across modalities, GPT on fast turnaround. Pick per use case.
Concrete benchmark guidance from public evals. Image grounding and OCR: Gemini 2.5 Pro leads, with Claude Sonnet 4.5 close behind. Cross-modal reasoning (combining image + structured data + text): Claude Sonnet 4.5 holds a 5-8 point edge. Speed at acceptable quality: GPT-5 mini and Gemini Flash both serve sub-second responses. Audio understanding with speaker diarization: Gemini leads natively; Claude and GPT typically pair with a separate transcription step. Test all three on your specific intent mix — synthetic benchmarks often diverge from real CRM workloads.
Deployment Considerations
Image processing costs more than text. Latency higher for multimodal. Privacy implications of processing customer photos (biometric data, surrounding environment). Governance policy must address these before deployment.
Cost reality. A 1024x1024 image consumes roughly 1,500-3,000 tokens depending on the provider’s tiling scheme. A 30-second voice clip on Gemini transcribes-and-reasons at a similar token count. Budget 3-5x text-only cost per interaction when adding modalities. Privacy: customer photos may include faces of bystanders, license plates, medical records, or location data. Apply automatic blurring or face-detection redaction before sending to the LLM, retain raw images only as long as case-handling requires, and document the processing in the privacy notice and FRIA. The EU AI Act treats biometric categorization as high-risk — confirm your image-processing flow doesn’t accidentally cross that line.
Common Failure Modes
Five recurring patterns. Sending raw customer photos with bystander faces to the LLM, creating a GDPR exposure. Failing to set max image dimensions, paying for unnecessary tokens on phone-camera 12MP uploads. Treating audio transcription as deterministic when accents and background noise produce 5-15% WER. Misinterpreting low-resolution screenshots as ground truth and acting on hallucinated UI text. Not budgeting for 3-5x cost increase when adding multimodal flows.
What to Do This Week
Audit one CX flow that currently requires 3+ message round-trips and prototype a multimodal single-shot replacement on your incumbent model.