The Metric
Total agent cost divided by number of successful resolutions equals cost per resolution (CPR). For HubSpot’s outcome-priced agents — Customer Agent at $0.50/resolved, Prospecting Agent at $1.00/qualified lead — CPR is the line item on the invoice. For everyone else (token-billed Anthropic, OpenAI, Bedrock, Azure OpenAI), you compute it from a usage export. CPR is the honest measure of AI value because it normalizes across pricing models, model swaps, and prompt rewrites.
The denominator definition matters more than the numerator. “Resolution” must mean a closed case with no human handoff and no callback within 7 days. Loose definitions inflate the metric by 30–50%.
CPR = (Σ tokens_in × $/Mtoken_in
+ Σ tokens_out × $/Mtoken_out
+ Σ tool_call_cost
+ amortized_platform_fee)
/ count(resolved_cases_7d_clean)
Benchmark Ranges
Drawn from cross-industry data Q1 2026:
- Highly automated internal workflows (data lookup, ticket routing, knowledge retrieval): $0.05–$0.30 per resolution. Heavy prompt caching, small models, narrow scope.
- Customer-facing tier-1 conversations (order status, password reset, billing inquiry): $0.30–$1.50. Larger models, longer context, identity verification overhead.
- Complex multi-step workflows (claims triage, technical troubleshooting, sales qualification): $2–$10. Multi-agent orchestration, tool calls, retries.
- High-stakes advisory (financial planning, medical pre-screen, legal intake): $10–$50. Larger context, validation chains, human review.
Compare to fully loaded human cost per equivalent resolution: tier-1 voice $4–$8, tier-2 chat $6–$12, complex case $25–$80.
Optimizing the Metric
Two levers, attacked in parallel:
Improve resolution rate. Better grounding (retrieval over your KB instead of model parametric memory cuts hallucinations 60%+), better prompts with explicit acceptance criteria, fine-tuning on closed cases, model swap to a stronger reasoner for the cases the small model fails. Each percentage point of resolution rate improvement is worth roughly an equal percentage drop in CPR.
Reduce cost. Prompt caching (Anthropic, OpenAI, and Gemini all support it — typical 70–90% discount on cached prefix tokens), routing easy cases to smaller cheaper models (Haiku 4.5, GPT-5-mini, Gemini Flash), eliminating redundant tool calls, response streaming with early termination on confident answers, structured output to cut output tokens.
Common Failure Modes
- Counting deflections (no human) as resolutions — inflates CPR by 30–50% downward.
- Ignoring tool-call costs (each Salesforce SOQL through MCP is ~$0.0005 in compute) — small per call, large at scale.
- Excluding platform/license fees from the numerator. A $50K/year Agentforce platform fee on 100K cases is $0.50/resolution before any token costs.
- Comparing CPR across teams with different denominator definitions.
Reporting Cadence
Monthly to ops leadership with rolling 30-day trend. Quarterly to CFO with year-over-year and unit-economics framing. When the metric degrades (cost up, resolution flat or down), dig within 48 hours — usually a model version change, prompt regression, or upstream data drift. When it improves, understand why and replicate across other agents.
What to Track Alongside CPR
CPR alone is gameable. Pair it with: 7-day repeat-contact rate, customer-effort score, escalation rate, and cost-per-attempted-resolution (includes failed attempts). The four numbers together tell the truth.