The Surprise
Token-based pricing looks cheap per call: $3/M input and $15/M output for Claude Sonnet 4.5, with GPT-5 and Gemini 2.5 Pro in the same order of magnitude. In production, calls compound. A single customer-facing agent conversation averages 15K input tokens (system prompt + retrieval context + history) and 1.5K output tokens. That is $0.045 of input plus $0.0225 of output, or about $0.07 per conversation. At 100K conversations/day, that is roughly $7,000 daily and $210K monthly, before tool-call costs and platform fees.
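The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that uses only the prices and token counts quoted above; the function name and the 30-day month are assumptions made for illustration.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the inference cost in USD for one conversation."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


per_conversation = conversation_cost(15_000, 1_500)
daily = per_conversation * 100_000   # 100K conversations per day
monthly = daily * 30                 # assumes a 30-day month

print(f"per conversation: ${per_conversation:.4f}")
print(f"daily: ${daily:,.0f}  monthly: ${monthly:,.0f}")
```

The unrounded per-conversation cost is $0.0675, so the exact totals come out slightly below the rounded $7,000 and $210K.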
The 2025 cost-blowout pattern was consistent across enterprises: pilot bills of $400/month, scaled bills of $40K/month after a marketing push, and a finance escalation at month-end because no one had set a ceiling. Monthly bills 10x forecast became routine in late 2025.
Budget Alerts
Set budget thresholds per agent, per team, per business unit, and per environment (dev/stage/prod). Alert at 50%, 80%, and 100% of monthly budget; auto-throttle at 110%; hard-stop at 120% with explicit override. Native budget tooling exists in AWS Bedrock (Cost Anomaly Detection), Azure OpenAI (Cost Management + budget actions), Google Vertex (Quota and Budget), and Anthropic Console (organization-level spend caps as of January 2026). Wire alerts into Slack and PagerDuty, not just email: finance review cycles run monthly, while AI costs accrue by the hour.
# example FinOps budget config
agent: customer-resolver
budget_monthly_usd: 25000
alerts:
  - threshold: 0.5   # fraction of monthly budget
    action: notify_slack
  - threshold: 0.8
    action: notify_pagerduty
  - threshold: 1.0
    action: notify_pagerduty
  - threshold: 1.1
    action: throttle_to_50_percent
  - threshold: 1.2
    action: hard_stop   # resumes only with explicit override
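The escalation ladder described in this section (alerts at 50%, 80%, and 100%, throttle at 110%, hard stop at 120%) can be sketched as a lookup that returns the action for the highest threshold crossed. The action names and the `action_for` helper are hypothetical placeholders, not the API of any particular budgeting tool.

```python
from typing import Optional

# (fraction of monthly budget, action), ordered from lowest to highest.
# Action names are illustrative placeholders, not a real tool's API.
ALERTS = [
    (0.5, "notify_slack"),
    (0.8, "notify_pagerduty"),
    (1.0, "notify_pagerduty"),
    (1.1, "throttle_to_50_percent"),
    (1.2, "hard_stop"),
]


def action_for(spend_usd: float, budget_usd: float) -> Optional[str]:
    """Return the action for the highest threshold crossed, or None."""
    ratio = spend_usd / budget_usd
    crossed = [action for threshold, action in ALERTS if ratio >= threshold]
    return crossed[-1] if crossed else None


print(action_for(28_000, 25_000))  # 112% of budget
```

A real enforcement loop would also need to remember which actions have already fired, so that a notification is not re-sent on every evaluation.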
Quotas
Per-user quotas (max agent calls per day) cap the blast radius of misuse: automated test scripts, prompt-bombing, infinite-retry bugs. Per-agent quotas (max conversations per hour) protect against runaway loops in agent-to-agent orchestration. Per-tenant quotas matter in multi-tenant architectures. Soft limits that notify before the hard limit is reached give users a chance to course-correct.
Users learn to use agents efficiently when they see budget consumed against quota. Hidden cost is hidden until it’s catastrophic; visible cost shapes behavior in days.
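A per-user quota with a soft warning ahead of the hard limit might look like the sketch below. The class name, the 80% warning point, and the in-memory counter are all assumptions; a production version would use a shared store and reset the counters daily.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DailyQuota:
    """Per-user daily call quota with a soft warning before the hard limit."""
    hard_limit: int
    soft_fraction: float = 0.8  # assumed warning point; tune per deployment
    used: dict = field(default_factory=lambda: defaultdict(int))

    def check(self, user_id: str) -> str:
        """Record one call and return 'ok', 'warn', or 'blocked'."""
        if self.used[user_id] >= self.hard_limit:
            return "blocked"  # the call is rejected and not counted
        self.used[user_id] += 1
        if self.used[user_id] >= self.soft_fraction * self.hard_limit:
            return "warn"
        return "ok"

    def remaining(self, user_id: str) -> int:
        """Calls left today, suitable for showing to the user."""
        return self.hard_limit - self.used[user_id]


quota = DailyQuota(hard_limit=5)
results = [quota.check("alice") for _ in range(6)]
print(results, quota.remaining("alice"))
```

Exposing `remaining` in the user interface is one way to make consumption visible against quota.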
Model Routing for Cost
Route easy queries to cheaper models. A classifier (under $0.001 per call with Haiku 4.5 or GPT-5-nano) decides whether to invoke the cheap path or the expensive path. Typical savings: 40–70% on aggregate inference cost, with no measurable quality loss when the router is well calibrated. Combine with prompt caching (70–90% discount on cached prefix tokens, supported across Anthropic, OpenAI, and Vertex) for compounding gains.
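To see where the savings come from, the blended cost per query can be written out directly. The sketch below is illustrative only: the 70% cheap-path share and the cheap-path cost of $0.0225 (one third of the expensive path) are assumed values, not measurements.

```python
def blended_cost(cheap_share: float, cheap_usd: float,
                 expensive_usd: float, classifier_usd: float) -> float:
    """Average cost per query when a classifier routes between two paths.

    Every query pays for the classifier; cheap_share of queries then take
    the cheap path and the rest take the expensive path.
    """
    return (classifier_usd
            + cheap_share * cheap_usd
            + (1 - cheap_share) * expensive_usd)


def savings(cheap_share: float, cheap_usd: float,
            expensive_usd: float, classifier_usd: float) -> float:
    """Fractional saving versus sending every query to the expensive path."""
    routed = blended_cost(cheap_share, cheap_usd, expensive_usd, classifier_usd)
    return 1 - routed / expensive_usd


# Assumed inputs: 70% of queries are easy, cheap path costs a third as much.
cost = blended_cost(0.7, 0.0225, 0.0675, 0.001)
saved = savings(0.7, 0.0225, 0.0675, 0.001)
print(f"blended: ${cost:.4f}  saving: {saved:.1%}")
```

Under these assumptions the saving lands near the low end of the 40–70% range; a larger easy-query share or a cheaper small model pushes it higher.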
Outcome Pricing Alignment
HubSpot’s outcome-based pricing (April 2026) aligns vendor spend with customer value at the contract level. Even if your vendor doesn’t offer outcome pricing, internal outcome accounting does the same job. Calculate cost per resolved case, per qualified lead, per completed task — that’s the procurement-ready number. Token cost is an input; outcome cost is the metric that frames the right CFO conversation.
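Internal outcome accounting reduces to one division once spend and outcomes are counted over the same period. The sketch below uses hypothetical figures ($42,000 of inference spend and 60,000 resolved cases) purely to show the shape of the calculation.

```python
def cost_per_outcome(total_inference_usd: float, outcomes: int) -> float:
    """Return inference spend per outcome (resolved case, lead, task)."""
    if outcomes <= 0:
        raise ValueError("no outcomes recorded for this period")
    return total_inference_usd / outcomes


# Hypothetical month: $42,000 of inference spend, 60,000 resolved cases.
per_case = cost_per_outcome(42_000.0, 60_000)
print(f"cost per resolved case: ${per_case:.2f}")
```

The hard part in practice is usually the denominator: outcomes have to be counted per agent and per period so they line up with the bill.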
Common Failure Modes
- Single shared API key across all agents — no attribution, no granular quotas.
- No dev/prod budget separation — a runaway integration test eats the production budget.
- Budgets set once and never reviewed against changing volume.
- Cost dashboards no one looks at because they’re not on the team’s standing agenda.
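The first two failure modes come down to missing attribution. Once every usage record carries an agent and environment tag, per-agent and per-environment spend is a simple aggregation. The record fields and the `lead-qualifier` agent below are hypothetical; real billing exports will have their own schema.

```python
from collections import defaultdict

# Hypothetical usage records; real billing exports use their own schema.
usage_records = [
    {"agent": "customer-resolver", "env": "prod", "cost_usd": 120.0},
    {"agent": "customer-resolver", "env": "dev", "cost_usd": 15.5},
    {"agent": "lead-qualifier", "env": "prod", "cost_usd": 80.0},
    {"agent": "customer-resolver", "env": "prod", "cost_usd": 30.0},
]


def spend_by_agent_env(records):
    """Sum cost per (agent, environment) pair."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["agent"], record["env"])] += record["cost_usd"]
    return dict(totals)


totals = spend_by_agent_env(usage_records)
for (agent, env), cost in sorted(totals.items()):
    print(f"{agent:20s} {env:5s} ${cost:,.2f}")
```

With a single shared API key there is no tag to group by, which is why separate keys or mandatory request metadata per agent and environment come first.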
What to Do This Week
Pull last month’s inference bill. Divide by the number of resolved cases, qualified leads, or completed tasks. If you can’t do that calculation in 10 minutes, your cost governance is incomplete.