Why CRM Is Vulnerable
Customers ask agents about their own data: order status, account balance, policy terms, ticket history. Fabricated answers are plausible because the model states them confidently and in the same shape as a correct answer. The downstream costs are churn (the customer leaves after one bad interaction), refunds (the agent promised one), brand damage (the screenshot circulates), and legal exposure (in Moffatt v. Air Canada, 2024, a Canadian tribunal held the airline liable for its chatbot's misstatement of fare policy). CRM hallucinations are 5–10x more expensive per occurrence than generic content hallucinations because they touch the customer relationship directly.
Baseline measurement: Vectara’s Hallucination Leaderboard shows even frontier models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) hallucinate on 1.5–4% of summarization tasks. That benchmark hands the model the source document, so treat it as a floor: production traffic without controls will do worse.
Grounded Generation First
Never let the model answer from parametric knowledge about customer data. Force retrieval from authoritative sources (Salesforce records, ServiceNow tables, HubSpot objects, the policy KB). If no source exists for the claim, decline or escalate — don’t guess.
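# Anti-pattern: the model answers from parametric knowledge.
answer = model.complete(f"What is the order status for {order_id}?")

# Grounded pattern: fetch the record first, escalate if there is none.
def answer_order_status(order_id):
    # order_id must be validated upstream; it is interpolated into the query.
    record = sf.query(f"SELECT Status, ShipDate FROM Order WHERE Id='{order_id}'")
    if not record:
        return escalate("no order record")
    return model.complete(prompt_with_grounding(record))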
Architectural rule: agent prompts include grounded source data inline; the model is instructed to use only that source and refuse otherwise.
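One way to apply this rule is sketched below. The source names `prompt_with_grounding` but does not define it, so the body, the `question` parameter, the `<SOURCE>` delimiters, and the refusal wording are all assumptions made for illustration.

```python
import json

# Hypothetical refusal text; a real agent would likely hand off to escalation instead.
REFUSAL = "I can't confirm that from your records, so I'm routing you to a specialist."

def prompt_with_grounding(record: dict, question: str = "What is the status of my order?") -> str:
    """Build a prompt that carries the source record inline and limits the model to it."""
    # Serialize deterministically so the same record always yields the same prompt.
    source = json.dumps(record, indent=2, sort_keys=True, default=str)
    return (
        "Answer the customer's question using ONLY the record between the SOURCE tags.\n"
        f"If the record does not contain the answer, reply exactly: {REFUSAL}\n"
        "Do not infer, estimate, or promise anything the record does not state.\n\n"
        f"<SOURCE>\n{source}\n</SOURCE>\n\n"
        f"Customer question: {question}"
    )

prompt = prompt_with_grounding({"Status": "Shipped", "ShipDate": "2025-03-03"})
```

Using a fixed refusal string makes declines easy to detect in code, which helps when counting decline and false refusal rates later.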
Post-Generation Verification
Entity-extract the claims from the model’s output, then verify each against the source data programmatically. For “Your order shipped March 3 via FedEx,” extract the date and carrier and cross-check them against the Order record. Mismatches block the response or trigger a revision pass.
Tools to compose:
- NLI-based fact-checking: a smaller model (DeBERTa-NLI, Vectara’s HHEM-2.1) judges whether each claim is entailed by the source.
- LLM-as-judge: a separate model call checks the response against retrieved context, producing a faithfulness score 0–1.
- Programmatic verifiers for structured claims: dates, IDs, amounts, statuses. The cheapest and highest-precision check.
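A programmatic verifier for the shipping example above might look like the following minimal sketch. The `Carrier` field, the carrier vocabulary, and the choice to take the year from the record (the response text has none) are assumptions, not part of any particular CRM schema.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
CARRIERS = ("FedEx", "UPS", "USPS", "DHL")  # assumed carrier vocabulary

def extract_claims(text: str, year: int) -> dict:
    """Pull a ship date and a carrier out of a response, if present."""
    claims = {}
    m = re.search(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?\b", text)
    if m:
        try:
            claims["ShipDate"] = date(year, MONTHS[m.group(1)], int(m.group(2)))
        except ValueError:
            claims["ShipDate"] = None  # impossible date, e.g. "February 31"; never matches
    for carrier in CARRIERS:
        if re.search(rf"\b{carrier}\b", text, re.IGNORECASE):
            claims["Carrier"] = carrier
            break
    return claims

def verify(text: str, record: dict) -> list[str]:
    """Return the fields whose extracted claim disagrees with the record."""
    claims = extract_claims(text, record["ShipDate"].year)
    return [field for field, value in claims.items() if record.get(field) != value]

record = {"ShipDate": date(2025, 3, 3), "Carrier": "FedEx"}
mismatches = verify("Your order shipped March 5 via UPS.", record)
```

An empty result means only that no extracted claim contradicted the record. A response with no recognizable claims also returns an empty list, so this check complements the NLI or judge pass and does not replace it.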
Confidence Thresholds
Modern model APIs return logprobs (OpenAI), token-level confidence (Vertex), or response-level confidence proxies. Route low-confidence responses to a human. The signal is imperfect, since overconfident hallucinations exist, but low confidence is a strong negative signal worth acting on.
Calibrate the threshold per use case and per signal. A customer-facing transactional response might escalate below an entailment score of 0.85; internal summarization might tolerate 0.6.
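Threshold routing can be sketched as below. The use-case names are hypothetical, the two cutoffs are the illustrative values from the text, and averaging token log-probabilities is only one possible confidence proxy; an entailment or judge score can be passed to `route` the same way.

```python
import math

# Illustrative cutoffs from the text; recalibrate against labeled data per model version.
THRESHOLDS = {"customer_transactional": 0.85, "internal_summary": 0.60}

def mean_token_confidence(logprobs: list[float]) -> float:
    """Geometric mean of token probabilities: exp of the average log-probability."""
    if not logprobs:
        return 0.0  # no signal is treated as lowest confidence
    return math.exp(sum(logprobs) / len(logprobs))

def route(score: float, use_case: str) -> str:
    """Send the response only if its score clears the use case's threshold."""
    return "send" if score >= THRESHOLDS[use_case] else "escalate"

score = mean_token_confidence([math.log(0.9), math.log(0.5)])
decision = route(score, "customer_transactional")
```

Keeping the thresholds in one table makes it easier to recalibrate them when the underlying model changes.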
Measurement
Track in production:
- Faithfulness rate: percent of responses fully grounded in source data, measured by human review of a sample, with an LLM judge as a pre-filter to prioritize what reviewers see.
- Hallucination escape rate: hallucinations that reached the customer (caught by callback, complaint, or audit).
- Decline/escalate rate: how often the agent correctly refuses to answer rather than guessing.
- False refusal rate: how often the agent declines when it should have answered.
Faithfulness above 98%, escape rate below 0.1%, and a decline rate that is not bought with a high false refusal rate define a usable production agent.
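The four metrics can be computed from a reviewed sample roughly as follows. The `Reviewed` record and the denominators are assumptions: the text does not say what each rate is divided by, so this sketch uses answered responses for faithfulness, all responses for escape and decline rates, and declined responses for false refusals.

```python
from dataclasses import dataclass

@dataclass
class Reviewed:
    answered: bool    # False if the agent declined or escalated
    grounded: bool    # reviewer: every claim is supported by source data
    answerable: bool  # reviewer: the source data contained the answer

def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

def production_metrics(rows: list[Reviewed]) -> dict:
    answered = [r for r in rows if r.answered]
    declined = [r for r in rows if not r.answered]
    return {
        "faithfulness": ratio(sum(r.grounded for r in answered), len(answered)),
        "escape_rate": ratio(sum(not r.grounded for r in answered), len(rows)),
        "decline_rate": ratio(len(declined), len(rows)),
        "false_refusal_rate": ratio(sum(r.answerable for r in declined), len(declined)),
    }

rows = ([Reviewed(True, True, True)] * 8
        + [Reviewed(True, False, True)]
        + [Reviewed(False, False, True)])
metrics = production_metrics(rows)
```

Reporting the decline rate next to the false refusal rate shows whether the agent is refusing for the right reasons.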
Common Failure Modes
- “RAG fixes hallucinations” claimed without measurement. RAG reduces hallucinations but doesn’t eliminate them; without verification, the residual is still customer-visible.
- LLM-as-judge using the same model that generated the response. The judge shares the generator’s blind spots and tends to favor its own output; use a different model or a different vendor.
- Confidence thresholds set once and never recalibrated as model versions change.
- Treating verification as optional latency overhead — the cost of one bad answer dwarfs the cost of 10,000 verification calls.
What to Do This Week
Sample 50 production agent responses. Manually fact-check each against source data. Whatever your hallucination rate is, that’s your baseline. Build the verification harness from there.
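A sample of 50 gives a coarse estimate, so it helps to report an interval with the baseline. The sketch below draws a reproducible sample and computes a 95% Wilson score interval; the function names and the list of response IDs are hypothetical.

```python
import math
import random

def sample_responses(response_ids: list[str], k: int = 50, seed: int = 0) -> list[str]:
    """Draw a reproducible uniform sample of responses for manual review."""
    rng = random.Random(seed)
    return rng.sample(response_ids, min(k, len(response_ids)))

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives about 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

to_review = sample_responses([f"resp-{i}" for i in range(2000)])
low, high = wilson_interval(failures=2, n=50)
```

With 2 hallucinations in 50 responses the interval runs from roughly 1% to 13%, and even 0 in 50 leaves an upper bound near 7%. A sample this size sets a starting point; confirming a sub-1% rate needs a much larger sample.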