
How It Works

Providers such as Anthropic and OpenAI cache the static prefix of a prompt. Subsequent requests with the same prefix hit the cache, cutting input cost by 50-90% and latency by a factor of 2-4. Structure matters: the prefix must be identical byte for byte.

Anthropic charges 1.25x the base rate for cache writes and 0.1x the base rate for cache reads (a 90% discount). OpenAI’s automatic prefix caching gives a 50% discount on cached input tokens with no write surcharge. Gemini bills cached tokens at 25% of the base rate. Bedrock exposes the same models with caching semantics equivalent to the underlying provider’s. The cache key is the byte-exact prefix; a single changed character invalidates everything from that point on. Place static content (system prompt, tool definitions, persona) at the start; place dynamic content (user turn, retrieved chunks) at the end.

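import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_PERSONA, LONG_POLICY, and user_query are assumed to be defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-7-20260315",
    max_tokens=1024,  # required by the Messages API
    system=[
        {"type": "text", "text": LONG_PERSONA},
        # The marker on the last static block caches the prefix up to and including it.
        {"type": "text", "text": LONG_POLICY,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)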
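
The byte-exact requirement described above can be checked locally before a request is ever sent. The following is a minimal sketch, not how any provider computes its cache key: `prefix_fingerprint` is a hypothetical helper that hashes a deterministic serialization of the static blocks, so two requests whose fingerprints differ are certain to miss each other's cache.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks, tools):
    """Hash the static part of a prompt so drift between requests is visible.

    Hypothetical diagnostic helper; it does not reproduce a provider's key.
    """
    # sort_keys and fixed separators make the serialization deterministic, so
    # the hash changes only when the content (or list order) changes.
    canonical = json.dumps(
        {"tools": tools, "system": system_blocks},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


policy = "Refunds are processed within 14 days."
baseline = prefix_fingerprint([{"type": "text", "text": policy}], tools=[])
same = prefix_fingerprint([{"type": "text", "text": policy}], tools=[])
drifted = prefix_fingerprint([{"type": "text", "text": policy + " "}], tools=[])

print(baseline == same)     # identical content gives an identical fingerprint
print(baseline == drifted)  # one trailing space gives a different fingerprint
```

Logging the fingerprint next to each request makes an unexpected drop in hit rate easy to trace back to the deploy or template change that altered the prefix.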

Prefix Structure

Put stable content first: persona, tool definitions, policy statements, unchanging context. Put variable content last: user turn, retrieved passages. The more stable content at the front, the more caching pays off.

Recommended ordering. (1) System message with persona, role, and unchanging instructions. (2) Tool definitions, kept in a stable order from request to request (for example, ordered by when each was last updated). (3) Long-lived knowledge: policies, glossaries, reference data. (4) Conversation history (caches up to the most recent user turn). (5) Newly retrieved RAG context (varies per call, so it goes after the cache breakpoints). (6) Current user message. Place cache_control markers at the boundaries you want cached. Avoid injecting a timestamp at the top: even “Today is 2026-04-28” at position 0 changes the prefix daily and invalidates everything cached after it.
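
The ordering above can be sketched as a small request builder. This is an illustration under assumptions: `build_request` and its parameters are hypothetical names, the tool dicts are abbreviated (real definitions also carry a description and an input schema), and sorting tools by name is just one way to keep their order fixed. The function only assembles keyword arguments in the shape used by the earlier `client.messages.create` call; it does not send anything.

```python
def build_request(persona, policy, tools, history, retrieved_chunks, user_message):
    """Assemble request keyword arguments with stable content first."""
    system = [
        {"type": "text", "text": persona},
        # Breakpoint on the last static block: the prefix up to here is cacheable.
        {"type": "text", "text": policy, "cache_control": {"type": "ephemeral"}},
    ]
    # Per-call material goes into the final user turn, after the breakpoint.
    context = "\n\n".join(retrieved_chunks)
    final_turn = {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {user_message}",
    }
    return {
        # Assumption: sorting by name keeps tool order identical across requests.
        "tools": sorted(tools, key=lambda t: t["name"]),
        "system": system,
        "messages": [*history, final_turn],
    }


req = build_request(
    persona="You are a support agent.",
    policy="Refunds are processed within 14 days.",
    tools=[{"name": "lookup_order"}, {"name": "create_ticket"}],
    history=[],
    retrieved_chunks=["Order 1001 shipped on Monday."],
    user_message="Where is my order?",
)
print([t["name"] for t in req["tools"]])  # ['create_ticket', 'lookup_order']
```

Keeping retrieved context and the user message together in the last turn means nothing that varies per call can appear ahead of a breakpoint.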

Cache TTL

Anthropic caches for 5 minutes by default; extended TTL options exist on higher tiers. Low-traffic agents (fewer than one call per 5 minutes) see little benefit, because the entry expires before the next request arrives. High-traffic agents keep their own cache warm, since each hit resets the TTL.
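
To see why call frequency matters, here is a rough simulation of a single cached prefix. It assumes that every call, hit or miss, leaves a fresh entry with a full TTL, and it ignores everything else about real traffic; `count_cache_hits` is a hypothetical helper.

```python
def count_cache_hits(call_times_s, ttl_s=300):
    """Count calls that land within ttl_s seconds of the previous call.

    Assumes every call, hit or miss, leaves a fresh cache entry with a full TTL.
    """
    hits = 0
    previous = None
    for t in sorted(call_times_s):
        if previous is not None and t - previous <= ttl_s:
            hits += 1
        previous = t
    return hits


busy = [0, 60, 120, 180, 240]    # one call per minute
quiet = [0, 600, 1200, 1800]     # one call every ten minutes
print(count_cache_hits(busy))    # 4: every call after the first is a hit
print(count_cache_hits(quiet))   # 0: each call arrives after expiry
```

Feeding real request timestamps from logs into a function like this gives a quick estimate of whether a longer TTL would change the picture.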

Anthropic’s 1-hour extended TTL costs 2x the base rate to write but provides cache hits across longer idle periods, which is useful for support workflows with long human review pauses. OpenAI’s automatic cache lasts approximately 5-10 minutes with no extended option as of Q1 2026. Gemini caches for up to 1 hour with explicit opt-in. The economic break-even follows from the multipliers: each cache read saves 0.9x the base rate, so the 0.25x write surcharge on Anthropic’s standard tier is recouped by a single reuse within the TTL window, and the 1.0x surcharge on the 1-hour tier by two reuses. Plan for more reuses than that if some requests are likely to miss the cache.
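
A quick way to sanity-check any break-even figure is to work it out from the multipliers themselves. The sketch below counts only prefix input tokens, in units of one uncached pass over the prefix, and ignores output tokens and the uncached suffix. Under those assumptions the quoted rates (1.25x or 2x to write, 0.1x to read) are repaid after one and two reuses respectively; workloads with partial hits or early expiry will need more. Both function names are hypothetical.

```python
import math


def reuses_to_break_even(write_multiplier, read_multiplier=0.1):
    """Smallest number of cache reads that repays the write surcharge."""
    surcharge = write_multiplier - 1.0       # extra paid on the first request
    saving_per_read = 1.0 - read_multiplier  # saved on each later hit
    return math.ceil(surcharge / saving_per_read)


def relative_cost(n_requests, write_multiplier, read_multiplier=0.1):
    """Cached prefix cost of n requests divided by the uncached cost."""
    cached = write_multiplier + read_multiplier * (n_requests - 1)
    return cached / n_requests


print(reuses_to_break_even(1.25))  # 1 (5-minute TTL)
print(reuses_to_break_even(2.0))   # 2 (1-hour TTL)
print(f"{relative_cost(10, 1.25):.1%} of uncached prefix cost over 10 requests")
```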

CRM Implications

Customer-service agents with long policy prompts benefit disproportionately. Orchestrator agents with consistent tool definitions benefit too. Measure the cache hit rate; if it is below 80%, restructure your prompt.

Realistic CRM cost reductions. A Salesforce Service Cloud agent with an 8K-token policy and tool prefix and 200-token user turns sees a 75-85% cost reduction at production traffic levels. A sales-research agent making bulk account enrichment calls with consistent system prompts sees an 80-90% reduction. A one-off prompt-engineering chatbot with no repeated prefix sees no benefit. Track cache-hit rate via the token counters providers return with each response (on Anthropic, cache_creation_input_tokens and cache_read_input_tokens in the response usage object; on OpenAI, cached_tokens under prompt_tokens_details) and surface it in dashboards alongside cost.
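
As a sketch of the tracking step, the function below computes a token-weighted hit rate from the three input counters that Anthropic's Messages API reports in each response's `usage` object (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`). It assumes the usage objects have already been converted to plain dicts; `cache_hit_rate` is a hypothetical helper, and the sample numbers reuse the 8K-token prefix and 200-token turn from the example above.

```python
def cache_hit_rate(usages):
    """Share of input tokens served from cache across a batch of responses.

    Each item is a dict holding the three input counters from a `usage`
    object; missing counters are treated as zero.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


usages = [
    # First call writes the 8K prefix; the next two read it back.
    {"input_tokens": 200, "cache_creation_input_tokens": 8000, "cache_read_input_tokens": 0},
    {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
    {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
]
print(f"{cache_hit_rate(usages):.1%}")  # 65.0%
```

With only three calls the initial write still dominates; the rate climbs toward the prefix's share of the prompt as the same prefix is reused more often.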

Common Failure Modes

Five recurring patterns. (1) Injecting a timestamp or session ID at the top of the prompt, which kills all cache hits. (2) Reordering tool definitions on each request, which invalidates the cache from that point onward. (3) Updating a policy daily without staging the change, so cache rebuilds land during peak traffic instead of low-traffic hours. (4) Using prefixes shorter than 1024 tokens, which fall under provider minimums and are never cached. (5) Putting RAG-retrieved context before the static prompt and wondering why the hit rate is 5%.
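
Some of these patterns can be caught with a simple check before deployment. The sketch below is a heuristic, not a complete validator: the regular expressions, the `lint_static_prefix` name, and the four-characters-per-token estimate are all assumptions made for illustration, and a real tokenizer should be used for anything near the minimum.

```python
import re

# Hypothetical patterns for volatile content; extend for your own prompts.
VOLATILE_PATTERNS = {
    "ISO date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "clock time": re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
    "UUID": re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
    ),
}


def lint_static_prefix(prefix_text, min_tokens=1024):
    """Return warnings for content likely to defeat prefix caching."""
    warnings = []
    for label, pattern in VOLATILE_PATTERNS.items():
        match = pattern.search(prefix_text)
        if match:
            warnings.append(f"{label} in static prefix: {match.group(0)!r}")
    # Crude estimate (about 4 characters per token); use a real tokenizer
    # for anything close to the threshold.
    approx_tokens = len(prefix_text) // 4
    if approx_tokens < min_tokens:
        warnings.append(f"prefix is about {approx_tokens} tokens, below {min_tokens}")
    return warnings


for warning in lint_static_prefix("Today is 2026-04-28. You are a support agent."):
    print(warning)
```

Running a check like this in CI against the rendered system prompt flags an injected date or an undersized prefix before it reaches production traffic.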

What to Do This Week

Pull your AI usage report and calculate the cache-hit rate for each major prompt. Anything below 70% means a large share of input tokens is still billed at the full rate, and that prompt is worth restructuring first.
