What Scaled
Internal-facing, narrow-scope, high-volume agents with clear escalation paths scaled fastest across the Agentforce installed base by early 2026. Password resets, case summaries, knowledge-base drafting, meeting prep, expense triage. The pattern is consistent: a single, narrow job; a measurable savings per execution; a fast escalation when the agent cannot answer; and an audit trail. Salesforce’s own published examples — Wiley reducing Tier 1 ticket resolution times by 40-50%, Heathrow handling traveler queries during peak — share that profile. The teams that scaled fast moved an internal HR or IT use case to production in 6-8 weeks before touching anything customer-facing.
What Stalled
Broad customer-facing agents launched without rigorous evaluation stalled or were rolled back. Autonomous agents inserted into high-stakes workflows — refund issuance, contract amendments, lead disqualification — generated trust events that took quarters to recover from. The Air Canada chatbot ruling from 2024, where the airline was held liable for a chatbot’s bad bereavement-policy advice, kept legal teams awake; teams that did not have an evaluation harness and a kill-switch were told no by counsel before launch. Agents deployed without observability accumulated invisible regressions: the model provider pushed a minor version, prompts drifted, the same use case quietly degraded from 84% pass rate to 71% over six weeks.
Common Failure Modes
Prompt hygiene neglect tops the list — prompts ship, get edited in production for a one-off bug, and never get version-controlled. Trust erosion from un-audited changes is the second; an admin tweaks a system instruction and a regression appears two weeks later in a different region. Cost surprise from unmetered rollouts hit budgets hard in late 2025; one Fortune 500 reported a $480K Agentforce overage in a single month after an internal demo went viral inside the company. Integration fragility — an MCP server crashes and 14 dependent agent flows fail simultaneously — exposed teams that had not modeled blast radius.
Prompt drift example
v1 (Jan): "Summarize this case in 3 sentences."
v2 (Mar, edited live): "Summarize this case briefly."
v3 (Apr, reverted): "Summarize this case in 3-5 sentences focusing on root cause."
No commit history, no eval re-run, no owner.
Generalizable Principles
Start narrow. Instrument thoroughly. Measure outcomes before scaling. Audit everything customer-facing. Human-in-loop for high-stakes actions; autonomous only for deflectable, reversible ones. Phase trust carefully — a six-tier rollout (sandbox, internal pilot, opt-in beta, soft launch, expanded launch, full launch) caught problems for the customers who scaled successfully. The orgs that hit the 90-day mark with no trust events almost universally had a written eval set of 100-300 cases per agent and a weekly review meeting.
What Changed in 2026
The August 2026 EU AI Act high-risk obligations forced eval and documentation discipline that customers had previously deferred. Outcome-based pricing (Salesforce’s $2-per-conversation Agentforce model, Sierra’s per-resolution pricing) made cost-per-resolution a board metric, which in turn made eval rigor a budget question. Multi-vendor strategies emerged — fewer customers go all-in on one provider, more pair Agentforce with Anthropic Claude or OpenAI GPT-5 via the Models API.
What to do this week
Pick the narrowest, highest-volume internal use case you can find. Write the evaluation set first, before any prompt. If the eval set is hard to write, the use case is not narrow enough — pick a smaller one.