Why RAG in Agents
Agents need current, correct data. Retrieval-augmented generation (RAG) fetches grounding passages before the model generates a response. Without it, agents hallucinate, especially on customer data. With good RAG, agents ground responses in authoritative sources.
The 2026 shift: agents call retrieval as a tool rather than receiving pre-fetched context. This lets the agent decide when, what, and how much to retrieve based on the user’s intent. Salesforce’s Agentforce, Microsoft Copilot Studio, and ServiceNow Now Assist all expose retrieval as a first-class agent tool. The trade-off: agentic retrieval adds latency and cost compared to a single pre-fetch, so design retrieval tools with budget caps and rerank within the agent.
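One way to picture a budget-capped retrieval tool is the sketch below. It is a minimal illustration, not any vendor's API: `RetrievalBudget`, `make_retrieval_tool`, the default caps, and the whitespace-based token estimate are all assumptions made for this example, and `search` stands in for whatever retriever you already have.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalBudget:
    """Hypothetical per-conversation caps on retrieval calls and returned tokens."""
    max_calls: int = 3
    max_tokens: int = 2000
    calls_used: int = 0
    tokens_used: int = 0


def make_retrieval_tool(
    search: Callable[[str, int], List[str]],
    budget: RetrievalBudget,
) -> Callable[..., List[str]]:
    """Wrap a search function so the agent cannot exceed the budget."""

    def retrieve(query: str, top_k: int = 5) -> List[str]:
        # Call cap reached: return nothing so the agent answers with what it has.
        if budget.calls_used >= budget.max_calls:
            return []
        budget.calls_used += 1

        kept: List[str] = []
        for passage in search(query, top_k):
            # Crude estimate (assumption): one token per whitespace-separated word.
            cost = len(passage.split())
            if budget.tokens_used + cost > budget.max_tokens:
                break
            budget.tokens_used += cost
            kept.append(passage)
        return kept

    return retrieve
```

Registering `retrieve` as the agent's tool keeps the "when, what, and how much" decision with the agent while the wrapper enforces the ceiling. A real deployment would likely swap the word count for the model's own tokenizer.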
Chunking
Chunks too small = missing context. Too large = irrelevant content in prompt. Sweet spot for CRM knowledge: 500-800 tokens with 20% overlap. Preserve heading and section boundaries where possible.
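The size-plus-overlap idea can be sketched as a sliding window over tokens. This is a minimal illustration under stated assumptions: `chunk_tokens` is a hypothetical helper, the input is a pre-tokenized list (whitespace tokens are a stand-in for real model tokens), and it ignores heading boundaries, which a production splitter should respect.

```python
from typing import List


def chunk_tokens(
    tokens: List[str],
    chunk_size: int = 600,
    overlap_ratio: float = 0.2,
) -> List[List[str]]:
    """Split tokens into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be > 0 and overlap_ratio in [0, 1)")

    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far each window advances

    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=600` and `overlap_ratio=0.2`, each chunk repeats the last 120 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk.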
Specific guidance by content type. Knowledge-base articles: 500-800 tokens, split at H2/H3 headers, preserve section context as metadata. Email threads: per-message chunks with a thread summary in metadata. Case histories: per-case-update chunks with case context. Product docs: 600-1000 tokens, never split code blocks or tables. Prefer structure-aware and sentence-aware splitters (Unstructured.io, LlamaIndex SentenceSplitter) over naive character splits. When the embedding model changes, re-embed every chunk and revisit chunk size: vectors from different models are not comparable, and mixing old and new embeddings in one index degrades retrieval.
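from llama_index.core.node_parser import MarkdownNodeParser

# `docs` is assumed to be a list of LlamaIndex Document objects loaded elsewhere.
# MarkdownNodeParser splits at Markdown headers, so chunks follow section boundaries.
parser = MarkdownNodeParser(include_metadata=True, include_prev_next_rel=True)
nodes = parser.get_nodes_from_documents(docs)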
Hybrid Search
Pure vector similarity misses exact-match needs. Hybrid search (BM25 plus vector, followed by reranking) wins consistently. Weaviate supports hybrid search natively; stacks without native support need a layered approach.
Hybrid wins because CRM users search for exact identifiers (account numbers, error codes, product SKUs) that BM25 catches and vector misses, plus conceptual queries that vector catches and BM25 misses. Implement with Weaviate native, or layer pgvector + Postgres full-text, or Pinecone + Elasticsearch with rank fusion. Rerank top-50 hybrid results with a cross-encoder (Cohere Rerank, Jina Reranker, BGE Reranker) and pass top-5 to the agent. Rerankers add 100-300ms latency but typically improve precision@5 by 15-30 points.
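For the layered setups, the fusion step can be sketched with reciprocal rank fusion (RRF), one common rank-fusion method; the text above does not say which method to use, so treat this choice as an assumption. The function name is hypothetical, and `k = 60` is the constant commonly used with RRF, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by id so output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: BM25 ranks the exact SKU match first, vector search ranks it third.
bm25_hits = ["sku-123", "a", "b"]
vector_hits = ["c", "a", "sku-123"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
candidates = fused[:50]  # this slice would go to the cross-encoder reranker
```

RRF uses only rank positions, so it needs no score normalization between BM25 and cosine similarity, which is why it is a convenient default when the two retrievers live in different systems.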
Measurement
Retrieval quality gates generation quality. Build an eval set of queries paired with ideal passages. Recall@5 and MRR track performance. Re-baseline when you change embedding model or chunking strategy.
Build the eval. Sample 200-500 production queries across intent classes. Annotate each with its ideal passages (the chunks that should be retrieved). Compute Recall@5, Recall@10, MRR (mean reciprocal rank), and nDCG. Targets for production CRM RAG: Recall@5 above 0.85, MRR above 0.7. Run nightly via ragas or LlamaIndex evaluation. When a metric drops, the cause is almost always one of: knowledge base updated without re-embedding, chunking changed, embedding model changed, or query distribution shifted (new product launched, new policy active).
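Recall@k and MRR are simple enough to compute by hand before adopting a framework. The sketch below assumes a hypothetical data shape: `gold` maps each query to the set of chunk ids that should be retrieved, and `runs` maps each query to the ranked chunk ids the retriever actually returned.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Average Recall@k and MRR over every query in the gold set."""
    if not gold:
        raise ValueError("gold set is empty")
    n = len(gold)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}
```

A query missing from `runs` counts as a total miss rather than being skipped, so a retriever that times out cannot inflate the averages.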
Common Failure Modes
Five patterns recur: (1) embedding the entire knowledge base once and never re-embedding after updates; (2) splitting chunks mid-sentence, which destroys meaning; (3) relying on vector-only search, which misses exact-match queries; (4) skipping reranking and dumping 20 mediocre chunks into the prompt, which exhausts context; (5) skipping retrieval evaluation and blaming the LLM for quality issues that originate upstream.
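The first failure mode, a stale index, can be caught with a content-hash comparison. This is a minimal sketch under stated assumptions: the helper names are hypothetical, `documents` maps document ids to their current text, and `indexed_hashes` is a record you keep of the hash each document had when it was last embedded.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents: Dict[str, str], indexed_hashes: Dict[str, str]) -> List[str]:
    """Ids of documents that are new or changed since they were last embedded."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


def find_removed(documents: Dict[str, str], indexed_hashes: Dict[str, str]) -> List[str]:
    """Ids still in the index whose source document no longer exists."""
    return sorted(set(indexed_hashes) - set(documents))
```

Running both checks on a schedule gives a re-embed list and a delete list, which is usually cheaper than re-embedding the whole knowledge base.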
What to Do This Week
Build a 100-query eval set for your top RAG use case and measure Recall@5 today; the number will surprise you and guide every next investment.