[object Object]

Phase 1: Pilot

Pick a narrow use case, a small user group, and tight measurement before anything else. A useful starting filter: high volume, low blast radius, reversible decisions. Service-team meeting prep, sales call summary generation, internal knowledge-base search. State the hypothesis as a number — “we believe Agentforce case summaries will reduce average handle time by at least 90 seconds per case for the Tier 1 EMEA queue” — and write the kill criteria before launch. Common kill criteria: pass rate below 80% on the eval set after two prompt iterations, average customer-handling minutes worsening, or any compliance flag from privacy review. Time-box at 30-60 days. Pilots that quietly extend to six months are failed pilots that the team did not have the courage to call.

Phase 2: Measure

Establish a baseline before the pilot starts. Track outcomes against baseline weekly. Verify the hypothesis holds with statistical sanity — a sample size that gives 95% confidence on the headline metric, not n=8 anecdotes. Use control groups when possible; randomly assign half of the eligible reps to the AI feature and compare. Honest assessment is the harder discipline: if the pilot disappointed, document why before designing the next iteration. Common rationalizations to refuse: “users would have liked it more with training”, “the integration was unfair to the AI”, “next quarter when the model is better”. Disappointing pilots that get rationalized into scale create the trust events that derail the whole AI program two quarters later.

Phase 3: Scale

If the pilot meets criteria, scale in stages — never go from 50 users to 5,000. A defensible cadence: 50 -> 250 -> 1,000 -> 4,000 -> all, with a one-week soak at each step and a written go/no-go before the next ramp. At each stage, monitor pass rate on a fixed eval set, cost per resolution, escalation rate, and a free-text user satisfaction survey. Invest in change management — Spekit or WalkMe in-app guidance, office-hours sessions, a Slack channel for bug reports. Expect friction; the second wave of users always exposes assumptions the pilot wave did not.

Stage     Users   Soak    Eval pass    CPR     Esc rate    Decision
Pilot     50      30d     86%          $0.32   12%         Proceed
Ring 1    250     7d      83%          $0.31   13%         Proceed
Ring 2    1000    7d      81%          $0.34   16%         Pause
Ring 3    -       -       -            -       -           After fix

Phase 4: Govern

Ongoing oversight is not a project phase; it is permanent operating discipline. Regression testing on a versioned eval set every model upgrade, every prompt edit, every retrieval-corpus refresh. Cost monitoring through Salesforce Digital Wallet, Anthropic Admin API, OpenAI usage exports, and a FinOps dashboard the budget owner sees weekly. Policy updates as the threat landscape evolves — when a new prompt-injection technique appears on the OWASP LLM Top 10, the red-team gets a sprint to test it. EU AI Act Article 17 quality management obligations require documented procedures for change control on high-risk systems, which translates to: every prompt change is a pull request with an eval result attached.

Common Failure Modes

Skipping baseline measurement and then claiming success is the most common failure. The second is scaling on enthusiasm rather than data. The third is treating governance as a one-time launch checklist rather than an operating cadence. A fourth, increasingly common in 2026: declaring the pilot a success because the model provider’s marketing case study made it sound like one, without independent verification.

What to do this week

Write the kill criteria for your most prominent pilot, in numbers, and circulate it for sign-off. If the team resists writing numerical kill criteria, the pilot is not yet ready to start.

[object Object]
Share