Traditional RAG is a one-shot lookup with a confident-sounding response stapled on top. Agentic RAG is a small reasoning loop that decides what to retrieve, decides whether it has enough, and decides when to try a different store. The cost difference is 4–10x. The accuracy difference, on CRM workloads, is often the entire business case.

The decision isn’t binary. The right answer is usually a triage: vanilla for the easy questions, agentic for the hard ones, hybrid+rerank for the middle.

Why this matters now

Two forces converged in early 2026. First, models with native tool use (Claude Opus 4.5, GPT-5, Gemini 2.5) made the agentic loop cheap enough to run on real workloads. Second, CRM customers got tired of confident wrong answers and started measuring grounding accuracy directly. Agentic RAG is the architectural answer to both.

Vendors are responding. Salesforce Agentforce 3 ships agentic retrieval as the default for Atlas Reasoning Engine. Microsoft Copilot Studio added “generative orchestration” mid-2025 — same idea, different name. HubSpot Breeze added multi-step retrieval in Q4 2025. ServiceNow Now Assist has had it since the start. The pattern is everywhere.

What traditional RAG actually is

Vector-embed the query. Top-k search. Stuff results into the prompt. Generate. Done.

This works when the question maps cleanly to one knowledge artifact. “What’s the refund policy” → policy doc. “How do I reset my password” → KB article. The Salesforce Knowledge + Einstein search stack, HubSpot Knowledge Base + Breeze answers, and Zendesk Answer Bot all run this pattern in production. It’s cheap, predictable, and good enough for 60% of tier-1 support.
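The four steps above fit in a few lines. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder for whatever LLM call you use, and all function names are assumptions rather than any vendor's API.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Embed the query once and return the k most similar documents."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def vanilla_rag(query: str, docs: list[str], generate, k: int = 2) -> str:
    """One retrieval, one generation. `generate` is a hypothetical LLM callable."""
    return generate(build_prompt(query, top_k(query, docs, k)))
```

Note that nothing here checks whether the retrieved chunks actually answer the question; that missing check is the gap the rest of the article is about.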

A historical aside

The “agentic RAG” name became dominant in 2024. The pattern existed earlier — IBM Watson did it in 2011, retrieval-augmented question answering systems in academia did it in 2018, even RAG itself was always supposed to include a retrieval-deciding step. What changed in 2024 was that LLMs got good enough at tool use that the loop became reliable. What changed in 2025 was that the cost dropped low enough that running it in production became affordable.

In 2026, the question isn’t “do we use it” — the question is “where in our retrieval stack does it sit.”

Where it breaks in CRM

CRM questions are rarely single-hop. “Why is this account’s renewal at risk” requires: account record, recent cases, last QBR notes, product usage telemetry, billing history, and the AE’s last email thread. Six stores. Vanilla RAG embeds the question once, hits one index, and returns chunks that look plausible but answer 1/6 of the question.

The model then either hallucinates the rest or stops short. Either way, it fails in front of the rep.

What agentic RAG adds

A planner — usually the same LLM, sometimes a smaller cheaper one — that:

  1. Decomposes the query into sub-questions.
  2. Picks the right store(s) per sub-question (CRM record API, vector KB, SQL warehouse, ticketing system, email archive).
  3. Issues retrievals in parallel where possible.
  4. Evaluates retrieved evidence against the sub-question — is this enough?
  5. Re-queries with refined terms if confidence is low.
  6. Synthesizes only when coverage is sufficient.

It’s more expensive per query. It’s also the difference between a useful response and a confident wrong one.

Pseudocode for the loop

def agentic_rag(question, stores, max_iters=4):
    plan = planner.decompose(question)           # list of sub-questions
    evidence = {}
    for step in range(max_iters):
        gaps = [sq for sq in plan if not sufficient(evidence.get(sq))]
        if not gaps:
            break
        for sq in gaps:
            store = router.pick(sq, stores)      # crm_record | kb | sql | email
            hits = store.search(sq, top_k=8)
            evidence[sq] = verifier.score(sq, hits)
        plan = planner.refine(plan, evidence)    # may add sub-questions
    if not coverage_ok(plan, evidence):
        return escalate("insufficient evidence")
    return synthesizer.answer(question, evidence)

Four moving parts: planner, router, verifier, synthesizer. Each can be the same model or different models. In production at scale, the router is usually a small finetuned classifier (cheap, sub-50ms) and everything else is the main LLM.
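The pseudocode above retrieves one sub-question at a time, while step 3 of the planner list calls for parallel retrieval where possible. One way to sketch that with `asyncio` is shown below. `search_store` is a hypothetical stand-in for a real store client (the sleep simulates network latency), and the timeout value is an arbitrary assumption.

```python
import asyncio

async def search_store(name: str, sub_question: str, delay: float) -> tuple[str, list[str]]:
    """Hypothetical store call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return name, [f"{name} hit for: {sub_question}"]

async def retrieve_parallel(routed: dict[str, str], timeout: float = 2.0) -> dict[str, list[str]]:
    """`routed` maps sub-question -> store name. All searches run concurrently."""
    tasks = [
        asyncio.wait_for(search_store(store, sq, delay=0.05), timeout)
        for sq, store in routed.items()
    ]
    # return_exceptions=True keeps one slow or failing store from aborting the batch.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    evidence: dict[str, list[str]] = {}
    for sq, result in zip(routed, results):
        evidence[sq] = [] if isinstance(result, Exception) else result[1]
    return evidence
```

Calling `asyncio.run(retrieve_parallel({"usage trend?": "sql", "open cases?": "crm_record"}))` returns one evidence list per sub-question; a store that times out contributes an empty list, which the verifier can then treat as a gap.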

Picking the planner model

The planner can be the same model as the synthesizer. It usually shouldn’t be. Patterns we see:

  • Synthesizer = Claude Sonnet 4.7 / GPT-5; planner = Claude Haiku / GPT-5-mini / a 7B finetune. Cheap, fast, and surprisingly good at decomposition.
  • Same model for both during early prototyping; split later when cost grows.
  • Specialized planner finetune for high-volume systems — best quality, most ops burden.

A 10x cost reduction on the planning step turns the agentic-vs-vanilla cost ratio from 7x to ~3x. That changes which workloads can afford agentic.
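It helps to see what that claim implies about where the money goes. The arithmetic below is a sketch with costs expressed in units of one vanilla call. The function names are illustrative, and "planner-side units" means whatever share of the agentic cost moves to the cheaper model, which is an assumption about how the bill splits.

```python
def ratio_after_cheaper_planner(ratio_before: float, planner_units: float, reduction: float) -> float:
    """New agentic-vs-vanilla ratio once planner-side cost is divided by `reduction`."""
    return ratio_before - planner_units * (1 - 1 / reduction)

def implied_planner_units(ratio_before: float, ratio_after: float, reduction: float) -> float:
    """How many cost units must sit on the planner side for the ratio to move this far."""
    return (ratio_before - ratio_after) / (1 - 1 / reduction)

# Going from 7x to ~3x with a 10x cheaper planner implies that roughly
# 4.4 of the 7 units were being spent on planner-side calls.
units = implied_planner_units(7, 3, 10)
```

If planner-side calls are a smaller share of your bill than that, the saving from splitting models will be correspondingly smaller.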

Cost: the honest accounting

A vanilla RAG call costs roughly 1 retrieval + 1 generation. Call it $0.003 at GPT-4o-mini-class prices, $0.02 at Claude Sonnet 4.7 prices on a long CRM context.

An agentic RAG call averages 3–5 retrievals + 2–4 model calls (plan, verify, refine, synthesize). That works out to 4–10x the cost and 2–4x the latency: usually 3–8 seconds versus 1–2.

This is not a tax to swallow on every query. It’s a tax to apply selectively.

Single-hop vs multi-hop in numbers

A worked example from a sales-assist deployment we audited last quarter:

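| Question type | Vanilla accuracy | Agentic accuracy | Cost ratio (vanilla vs agentic) |
| --- | --- | --- | --- |
| Policy lookup | 94% | 95% | 1x vs 5x |
| Single-record summary | 89% | 91% | 1x vs 4x |
| Account 360 question | 47% | 86% | 1x vs 7x |
| Renewal risk | 38% | 81% | 1x vs 9x |
| Cross-record investigation | 22% | 78% | 1x vs 8x |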

Read this carefully. Agentic is wasted on the top two rows. It is transformational on the bottom three. Triage decides which row each question lands in.
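To make the triage argument concrete, the sketch below blends the table's accuracy and cost figures under a traffic mix. The per-row numbers come from the table above; the traffic mix is a made-up assumption for illustration, and the calculation ignores the cost and error rate of the triage step itself.

```python
# (vanilla accuracy, agentic accuracy, agentic cost ratio) from the table above
ROWS = {
    "policy_lookup": (0.94, 0.95, 5),
    "single_record_summary": (0.89, 0.91, 4),
    "account_360": (0.47, 0.86, 7),
    "renewal_risk": (0.38, 0.81, 9),
    "cross_record": (0.22, 0.78, 8),
}

# ASSUMPTION: hypothetical share of traffic per question type (sums to 1.0).
MIX = {
    "policy_lookup": 0.40,
    "single_record_summary": 0.30,
    "account_360": 0.15,
    "renewal_risk": 0.10,
    "cross_record": 0.05,
}

AGENTIC_CLASSES = {"account_360", "renewal_risk", "cross_record"}

def blended(use_agentic) -> tuple[float, float]:
    """Return (blended accuracy, blended cost in vanilla-call units)."""
    accuracy = cost = 0.0
    for name, share in MIX.items():
        vanilla_acc, agentic_acc, agentic_cost = ROWS[name]
        if use_agentic(name):
            accuracy += share * agentic_acc
            cost += share * agentic_cost
        else:
            accuracy += share * vanilla_acc
            cost += share * 1
    return accuracy, cost

all_vanilla = blended(lambda name: False)
all_agentic = blended(lambda name: True)
triaged = blended(lambda name: name in AGENTIC_CLASSES)
```

Under this assumed mix, triage lands within about a point of all-agentic accuracy (roughly 0.89 versus 0.90) at a little over half the cost (about 3x versus 5.5x a vanilla call). Your own mix will shift those numbers, which is why it is worth measuring.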

The routing decision

You don’t need agentic RAG on every question. You need a triage step:

| Query class | Pattern |
| --- | --- |
| Simple factual (“what’s the policy?”) | Vanilla RAG |
| Single-record summary (“summarize this case”) | Direct record fetch, no retrieval |
| Multi-source synthesis (“why is this at risk?”) | Agentic RAG |
| Action-taking (“update opportunity, send email”) | Agentic + tool use |
| Compliance-sensitive (“can we offer this discount?”) | Agentic RAG with mandatory citation |

The triage step is itself a model call — but a tiny one. A 7B finetune on labeled CRM queries does this for sub-cent cost.
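The shape of the triage step can be sketched without a model at all. The snippet below uses keyword rules purely as a placeholder for the finetuned classifier described above. The keyword lists and the `Pattern` names are assumptions, and a real deployment would swap `triage` for a model call with the same signature.

```python
from enum import Enum

class Pattern(Enum):
    VANILLA = "vanilla_rag"
    DIRECT_FETCH = "direct_record_fetch"
    AGENTIC = "agentic_rag"
    AGENTIC_TOOLS = "agentic_plus_tool_use"
    AGENTIC_CITED = "agentic_rag_mandatory_citation"

# Ordered rules: the first match wins, so the strictest classes come first.
# ASSUMPTION: these keywords are illustrative, not a tested taxonomy.
RULES = [
    (("discount", "compliance", "can we offer"), Pattern.AGENTIC_CITED),
    (("update ", "send ", "create ", "schedule "), Pattern.AGENTIC_TOOLS),
    (("why ", "at risk", "ever ", "across"), Pattern.AGENTIC),
    (("summarize this", "summary of this"), Pattern.DIRECT_FETCH),
]

def triage(query: str) -> Pattern:
    """Map a query to a retrieval pattern; defaults to the cheapest one."""
    q = query.lower().strip() + " "
    for keywords, pattern in RULES:
        if any(keyword in q for keyword in keywords):
            return pattern
    return Pattern.VANILLA
```

Defaulting to vanilla keeps the cheap path cheap. Whether a miss should instead default upward to agentic depends on how costly a wrong answer is for that query class.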

A pitfall: confusing agentic RAG with general agentic behavior

Agentic RAG is a retrieval pattern. Agentic execution is everything an agent does — tool calls, writes, decisions. They’re often conflated. The retrieval loop’s reliability is a function of the verifier; the execution loop’s reliability is a function of policy gating, audit, and tool scopes. Build them separately. Each has its own evaluation suite, its own failure modes, its own observability needs.

Mixing them in one codebase without that separation is how you end up with a system that retrieves perfectly and writes garbage, or vice versa.

When agentic earns its keep in CRM

  • Account 360 questions. Five+ stores, the answer is the join.
  • Renewal risk scoring. Usage data, support history, contract terms, sentiment all matter.
  • Cross-record investigations. “Has this customer ever asked about X before?” across cases, emails, calls.
  • Compliance-bound responses. Where you must cite source, agentic verifies before answering.
  • Multi-step workflows. Where the answer triggers a tool call that needs to be right.

The same patterns map onto broader agentic workflows, and in each of them the verification step is the one that prevents production hallucinations.

Pre-computation: the under-used optimization

For predictable high-value queries — “renewal risk for top 200 accounts”, “account health for tier-1 customers” — don’t run agentic RAG at query time. Run it nightly, cache the results, serve from cache during the day. Cost drops 50x. Latency drops to under a second. Accuracy stays the same.

This works for any query that doesn’t need real-time freshness. Most CRM “intelligence” questions don’t. Worth auditing your top 50 queries and asking which are real-time and which are not.
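A minimal sketch of the pre-compute-and-serve pattern follows. The class name, the 24-hour freshness window, and the `compute` / `fallback` callables are all assumptions. In practice `compute` would be the full agentic pipeline and the store would be a shared cache rather than an in-process dict.

```python
import time

class PrecomputedAnswers:
    """Serve answers computed by a batch job; fall back to live compute when stale."""

    def __init__(self, max_age_seconds: float = 24 * 3600, clock=time.time):
        self._store = {}              # key -> (answer, computed_at)
        self._max_age = max_age_seconds
        self._clock = clock           # injectable so freshness is testable

    def refresh(self, keys, compute) -> None:
        """Batch job (for example nightly): run the expensive pipeline once per key."""
        now = self._clock()
        for key in keys:
            self._store[key] = (compute(key), now)

    def get(self, key, fallback):
        """Return the cached answer if fresh, otherwise compute it live."""
        entry = self._store.get(key)
        if entry is not None and self._clock() - entry[1] <= self._max_age:
            return entry[0]
        return fallback(key)
```

The fallback path matters: a query outside the pre-computed set, or one whose cache entry has aged out, still gets answered, just at full agentic cost and latency.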

Where vanilla still wins

  • Tier-1 deflection on a flat KB.
  • Single-domain chatbots (“ask my docs”).
  • Anything under 200ms latency budget.
  • High-volume / low-stakes use (FAQ surfacing inside a UI).

If your accuracy ceiling on vanilla is already 92% and your customers don’t punish the other 8% — don’t move to agentic. The cost differential is real and the marginal accuracy may not be worth it.

Reranking changes the math

One pattern that bridges the gap: add a reranker after vanilla retrieval. Hybrid (BM25 + vector + cross-encoder rerank) closes 30–50% of the accuracy gap to agentic on multi-hop questions, at 1.5x cost rather than 7x. Cohere Rerank v3, Voyage rerankers, and bge-reranker-v2-m3 are all production-ready.

Reranking does not replace agentic for cross-record investigations, but it makes a strong middle tier. Many teams should be on hybrid+rerank before they consider full agentic.
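The hybrid tier can be sketched as two stages: fuse the BM25 and vector rankings, then rerank the fused candidates with a pairwise scorer. The article does not say how the two rankings are combined; the sketch below uses reciprocal rank fusion (with the commonly used constant k=60) as one reasonable choice. `score_pair` is a hypothetical stand-in for a cross-encoder call, and the two input rankings are assumed to come from your existing BM25 and vector indexes.

```python
from collections import defaultdict
from typing import Callable

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on doc id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

def hybrid_rerank(
    query: str,
    bm25_ranking: list[str],
    vector_ranking: list[str],
    docs: dict[str, str],
    score_pair: Callable[[str, str], float],   # hypothetical cross-encoder scorer
    top_n: int = 3,
    candidates: int = 20,
) -> list[str]:
    """Fuse both rankings, then rerank only the top candidates (the expensive step)."""
    fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])[:candidates]
    reranked = sorted(fused, key=lambda d: score_pair(query, docs[d]), reverse=True)
    return reranked[:top_n]
```

Limiting the reranker to a small candidate set is what keeps this tier cheap: the pairwise scorer runs on tens of documents per query, not the whole index.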

The verification step is the moat

Most agentic RAG implementations skip the verifier and just chain retrievals. That’s not agentic — that’s iterative RAG with extra steps. The verifier is what makes it work:

# verifier prompt scaffold
inputs:
  sub_question: "What's the customer's current usage trend?"
  retrieved_evidence:
    - source: telemetry.usage_monthly
      content: "Jan: 12,400 / Feb: 11,200 / Mar: 9,800"
  acceptance_criteria:
    - must_have_timeframe: true
    - must_have_metric: true
    - max_age_days: 60
output:
  sufficient: true | false | partial
  missing: ["..."]
  confidence: 0.0-1.0

Without this layer the agent keeps retrieving until it runs out of budget. With it, the agent stops when coverage is real.
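Parts of the scaffold's acceptance criteria can be checked with plain rules before any model is involved. The sketch below implements the three criteria from the scaffold (`must_have_timeframe`, `must_have_metric`, `max_age_days`) deterministically. The `as_of` field, the regexes, and the `Verdict` shape are assumptions. The checks are deliberately crude, and the scaffold's `confidence` score is left out because it would need a model.

```python
import re
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Evidence:
    source: str
    content: str
    as_of: date   # ASSUMPTION: freshness timestamp, not part of the scaffold above

@dataclass
class Verdict:
    sufficient: str                                  # "true" | "false" | "partial"
    missing: list[str] = field(default_factory=list)

# Crude heuristics: a month name or abbreviation counts as a timeframe,
# any digit counts as a metric.
MONTH = re.compile(
    r"\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|"
    r"sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\b",
    re.IGNORECASE,
)
NUMBER = re.compile(r"\d")

def verify(evidence: list[Evidence], today: date, max_age_days: int = 60) -> Verdict:
    """Apply the three acceptance criteria and report what is missing."""
    text = " ".join(e.content for e in evidence)
    missing = []
    if not MONTH.search(text):
        missing.append("timeframe")
    if not NUMBER.search(text):
        missing.append("metric")
    if not evidence or any((today - e.as_of).days > max_age_days for e in evidence):
        missing.append("fresh_data")
    if not missing:
        return Verdict("true")
    return Verdict("false" if len(missing) == 3 else "partial", missing)
```

The `missing` list is what makes the loop converge: it tells the planner what to re-query for, instead of leaving it to retrieve blindly.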

Implementation gotchas

  • Don’t let the planner re-decompose forever. Cap at 2 refinements. Hallucinated sub-questions are a real failure mode.
  • Cache aggressively. Sub-question → evidence pairs cache well; the planner output less so.
  • Log everything as a trace. OpenTelemetry GenAI spans per retrieval, per verification, per refinement. You will need this when accuracy regresses.
  • Eval the pipeline, not the model. LangSmith, Promptfoo, OpenAI Evals all support multi-step eval suites. Evaluate end-to-end on real CRM tasks, not on MMLU.
  • Set a hard cost budget per query. A $0.50 cap on agentic queries, enforced in dollars or the equivalent tokens, prevents runaway loops.
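The refinement cap and the per-query spending cap from the list above can live in one small guard object that every model call goes through. This is a sketch: the class and method names are invented here, and per-1k-token prices are passed in as parameters because they vary by model and change over time.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a query would exceed its spending or refinement cap."""

class QueryBudget:
    def __init__(self, max_usd: float = 0.50, max_refinements: int = 2):
        self.max_usd = max_usd
        self.max_refinements = max_refinements
        self.spent_usd = 0.0
        self.refinements = 0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_input: float, usd_per_1k_output: float) -> None:
        """Record a model call; refuse it if it would push spend past the cap."""
        cost = (input_tokens / 1000) * usd_per_1k_input \
             + (output_tokens / 1000) * usd_per_1k_output
        if self.spent_usd + cost > self.max_usd:
            raise BudgetExceeded(f"cap ${self.max_usd:.2f} would be exceeded")
        self.spent_usd += cost

    def note_refinement(self) -> None:
        """Count a planner refinement; refuse once the cap is reached."""
        if self.refinements >= self.max_refinements:
            raise BudgetExceeded(f"refinement cap {self.max_refinements} reached")
        self.refinements += 1
```

Catching `BudgetExceeded` at the top of the loop and escalating with whatever evidence has been gathered is usually better than failing silently.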

Vendor implementations to know

  • Salesforce Agentforce Atlas Reasoning Engine. Implements an agentic retrieval loop over Data Cloud + Knowledge. Sub-question decomposition + grounding evaluation are built-in but not deeply tunable.
  • Microsoft Copilot Studio Generative Orchestration. Agentic retrieval across Graph, Dataverse, and connected tools. Trade-off: heavy on plugin discovery, lighter on verification.
  • HubSpot Breeze. Native agentic patterns over Smart CRM. Smaller surface, fewer knobs, ships with sane defaults.
  • ServiceNow Now Assist Knowledge Graph. Built around explicit entity relationships; agentic retrieval traverses the graph rather than re-querying a vector store.
  • LangGraph / CrewAI / AutoGen. DIY agentic RAG, full control, more rope. Pair with LangSmith for eval.

The vendor implementations save you ~60% of the build effort. They cost you tuning depth and portability.

Failure modes you’ll actually see

  • Over-decomposition. The planner breaks “what’s the renewal risk” into 12 sub-questions, each cheaply answered, then synthesizes mush. Fix: cap decomposition to 4 sub-questions; force the planner to justify each.
  • Verifier confirmation bias. The verifier model is the same model that generated the sub-question — it tends to say “yes, this is sufficient” when it isn’t. Fix: use a different model for verification, or use deterministic rules where possible.
  • Citation drift. The synthesizer cites evidence that doesn’t say what the synthesis claims. Fix: post-hoc citation check, refuse to ship without it.
  • Store-routing miss. The router sends a question to the wrong store, gets nothing, the planner gives up. Fix: fallback chain across stores, log routing decisions, retrain the router monthly.
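The fix for the store-routing miss, a fallback chain with logged routing decisions, might look like the sketch below. The store callables and the logger name are hypothetical, and `preferred_order` is assumed to come from the router, best guess first.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("rag.routing")

Search = Callable[[str], list[str]]

def search_with_fallback(
    sub_question: str,
    stores: dict[str, Search],
    preferred_order: list[str],
) -> tuple[Optional[str], list[str]]:
    """Try stores in router-preferred order; return the first non-empty result."""
    for name in preferred_order:
        try:
            hits = stores[name](sub_question)
        except Exception:
            # A failing store should not end the search; log it and move on.
            logger.exception("store %s failed for %r", name, sub_question)
            continue
        # Logged routing decisions double as training data for the router.
        logger.info("routing sub_question=%r store=%s hits=%d", sub_question, name, len(hits))
        if hits:
            return name, hits
    return None, []
```

Returning the name of the store that actually answered, not just the hits, is what lets you compare the router's first choice against what worked.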

Latency budget — where the seconds go

A typical agentic RAG call in production (median):

| Step | Latency |
| --- | --- |
| Triage / decompose | 600ms |
| Retrieve (parallel, 3 stores) | 400ms |
| Verify | 500ms |
| Refine + retrieve (if needed) | 900ms |
| Synthesize | 1500ms |
| Citation check | 300ms |
| Total p50 | ~4.2s |

p95 runs 8–12s. If you need sub-2s, agentic RAG is the wrong tool; cache aggressively or pre-compute.

The pattern that works

  • Triage every query first — most don’t need agentic.
  • Reserve agentic RAG for multi-source CRM questions where the join is the answer.
  • The verifier is the load-bearing component — without it, you have expensive vanilla RAG.
  • Cap iterations and token budget hard. The agent will keep going otherwise.
  • Trace every step with OTel — debugging agentic without traces is malpractice.