
Every vendor pitch in 2026 opens with a leaderboard chart and a number. 87% on Tau-Bench. 92% on CRMArena. Top of the airline category. The number is real. What it measures is rarely what your CRM workload actually does, and the gap between leaderboard score and production performance is where projects fail. Read leaderboards like you read pharma trial results — useful, narrow, easy to misinterpret.

What leaderboards exist for CRM agents

The active ones, mid-2026:

  • Tau-Bench (Sierra). Tool-using agents in airline / retail / customer service settings. Closest to CRM workloads of any public benchmark.
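  • CRMArena (Salesforce Research). Nine customer-service tasks across three personas (service agent, analyst, manager) in a synthetic Salesforce environment. The CRMArena-Pro follow-up adds sales and configure-price-quote tasks.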
  • SWE-bench / SWE-bench Verified (Princeton + OpenAI). Code agents, not CRM, but the leaderboard winners often correlate with general agentic competence.
  • AgentBench (Tsinghua). General agentic tasks across 8 environments.
  • HELM Agent (Stanford). Holistic evaluation, slower to update, more rigorous methodology.
  • Vendor self-reports. Agentforce, Copilot, Now Assist all publish their own internal eval results. Caveat emptor.

Tau-Bench and CRMArena are the most relevant. Both have well-known holes.

Tau-Bench: what it actually tests

Tau-Bench evaluates an agent’s ability to use tools, follow domain policies, and interact with a simulated user across multi-turn tasks. Two domains in the public version (airline, retail). Tasks like “user wants to cancel a flight, check policy, process refund.”

What it does well:

  • Multi-turn, tool-using, policy-bound. That matches real CRM service work.
  • The pass^k metric (did the agent complete the task correctly on all k independent trials). Catches flakiness. Not the same as pass@k, which only asks for at least one success in k tries.
  • Programmatic verification. The simulator knows ground truth.
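
The difference between "succeeded at least once in k tries" and "succeeded on every one of k tries" is easy to blur, so here is a minimal sketch of both estimators. It assumes you ran a task n times and saw c successes; the function names are illustrative, not taken from any benchmark's codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that a random size-k subset of trials contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials (pass^k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that a random size-k subset of trials contains only successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # A task solved in 6 of 8 trials looks perfect under pass@4
    # and unreliable under pass^4.
    print(pass_at_k(8, 6, 4), pass_hat_k(8, 6, 4))
```

For a service desk, the second number is the one that matches what customers experience: the same request has to work every time, not once in four attempts.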

What it doesn’t test:

  • Your CRM schema. Your sharing rules. Your custom fields.
  • Real customer messiness — typos, sentiment, multi-language, abuse.
  • Long-context grounding over thousands of records.
  • Latency. Cost. Governance.
  • Edge cases your business knows about but the benchmark doesn’t.

A model scoring 70% on Tau-Bench airline is not 70% likely to handle your refund desk. It’s “demonstrated capability on a stripped-down version of similar work.”

CRMArena: the closer one

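CRMArena runs agents against a synthetic Salesforce org. The original release has nine tasks across three personas (service agent, analyst, manager); CRMArena-Pro extends the task set to sales and configure-price-quote work. Tasks are CRM-shaped: lookup, update, escalate, draft, summarize.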

What it does well:

  • Salesforce-shaped data and APIs. Lookups, SOQL-ish queries, record updates.
  • Multiple roles, not just service.
  • Open to vendor submissions across model + framework combinations.

What it doesn’t test:

  • Your data quality. The synthetic data is clean. Yours isn’t.
  • Custom objects, custom validation rules, custom approval processes.
  • Real production scale (record counts, sharing complexity).
  • Multi-org or cross-system flows.

CRMArena is useful for comparing models on a like-for-like task set. It is not useful as a procurement scorecard on its own.

The leaderboard reading checklist

Before you let a leaderboard number influence a decision, answer:

  1. What metric? pass@1, pass@k (at least one success in k tries), pass^k (success on all k tries), or partial credit. They answer different questions.
  2. What tasks? Read the task list. Are they your tasks?
  3. What agent harness? Same model + different scaffolding can swing 20 points. The number is the system, not the model.
  4. What tool set? Did the agent get tools your production agent won’t have?
  5. How many runs? A pass rate from a single run is noise. You want an average over N runs or a reliability metric such as pass^k.
  6. Public eval set or held-out? If contestants saw the eval, the number is inflated.
  7. Last updated when? A score from a 2025 leaderboard is on a model that may be deprecated.
  8. Cost? Most leaderboards report accuracy without cost. A 92% agent that costs $4/task is worse than an 85% agent at $0.40 for many workloads.

If you can’t answer half of these, the leaderboard is decoration.
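
To put a number on "a single run is noise," one option is a Wilson score interval around the reported pass rate. This is a sketch under the assumption that tasks are independent pass/fail trials, which real benchmark tasks only approximate; the function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Two systems scored on the same 50-task set, one run each.
    for name, passed in [("system_a", 46), ("system_b", 43)]:
        low, high = wilson_interval(passed, 50)
        print(f"{name}: {passed / 50:.0%} (interval {low:.1%} to {high:.1%})")
```

On 50 tasks, 46 passes gives an interval of roughly 81% to 97% and 43 passes gives roughly 74% to 93%. The intervals overlap heavily, so a 92% vs 86% headline on a set that size does not separate the two systems.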

Vendor self-reports: how to discount them

Vendor evals are not lies, but they’re not neutral. Common patterns:

  • The vendor picks tasks where their agent is strong.
  • The vendor uses their internal harness, tuned for their agent.
  • The vendor reports the best run of N, not the average.
  • The vendor compares to competitors using older versions or default settings.

Adjust by:

  • Halving the headline number for procurement modeling.
  • Demanding the raw eval traces.
  • Running the same eval on a competing vendor’s agent under your supervision.
  • Building an eval slice from your own tasks and refusing to sign without seeing performance.
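
The "best run of N" pattern can be quantified. The sketch below computes how much a reported score rises when a vendor picks the best of several runs, assuming every task has the same true success probability and tasks are independent. Both are simplifications, and the function name is illustrative.

```python
from math import comb


def expected_best_of_n(p: float, tasks: int, runs: int) -> float:
    """Expected best pass rate across `runs` runs of `tasks` tasks each.

    p is the assumed true per-task success probability.
    """
    if not 0.0 <= p <= 1.0 or tasks <= 0 or runs <= 0:
        raise ValueError("need 0 <= p <= 1, tasks > 0, runs > 0")
    cdf = 0.0
    prev_cdf_pow = 0.0
    expected_successes = 0.0
    for x in range(tasks + 1):
        # Binomial probability of exactly x successes in a single run.
        cdf += comb(tasks, x) * p**x * (1 - p) ** (tasks - x)
        # P(max over runs <= x) is the single-run CDF raised to the number of runs.
        cdf_pow = min(cdf, 1.0) ** runs
        expected_successes += x * (cdf_pow - prev_cdf_pow)
        prev_cdf_pow = cdf_pow
    return expected_successes / tasks


if __name__ == "__main__":
    for runs in (1, 5, 20):
        print(runs, round(expected_best_of_n(0.80, 50, runs), 3))
```

With one run the expected score equals the true rate; as the number of runs grows, the best-of-N figure drifts upward without the agent getting any better. Asking for the per-run scores, not just the headline, removes this inflation.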

What to evaluate instead — your own bench

Public benchmarks tell you whether a system is in the right league. They don’t tell you whether it works for you. Build a private eval suite:

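# crm_eval_suite.yaml
# Illustrative suite definition; adapt field names to your eval tool's schema.
suite: sales_assist_v1
tasks:
  # Read-only question: graded on answer content and citations.
  - id: account_summary_001
    input:
      type: question
      account_id: 0014x000ABC
      question: "What's the renewal risk?"
    must_include: ["usage trend", "support history", "contract term"]
    must_not_include: ["hallucinated_metric_names"]
    citations_required: true
  # Write action: graded on the tool call made, not on the prose.
  - id: opportunity_update_001
    input:
      type: instruction
      text: "Update opp 0064x...DEF stage to Closed Won, amount 124000"
    tool_calls_required:
      - tool: crm.opportunity.update
        args_contain: {stage: "Closed Won", amount: 124000}
    side_effects_forbidden:   # the task fails if any of these tools is called
      - tool: email.send
metrics:
  - pass_rate
  - p95_latency_ms
  - cost_per_task_usd
  - tool_call_count_p95
  - human_override_rate
runs: 25       # repeat each task to measure flakiness, not just a single pass
seed: stable   # keep sampling settings fixed so weekly runs are comparable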

Run weekly. Tools like LangSmith, Promptfoo, OpenAI Evals, and Braintrust support this shape. The bench evolves with the system.
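
If you want to see what grading against that suite involves before picking a tool, here is a minimal sketch of a scorer. The task fields mirror the YAML above; the shape of the agent result (`text`, `citations`, `tool_calls`) is an assumption for illustration, not the output format of any listed framework.

```python
def score_task(task: dict, result: dict) -> list[str]:
    """Return failure reasons for one task; an empty list means the task passed."""
    failures = []
    text = result.get("text", "").lower()
    calls = result.get("tool_calls", [])

    # Content checks on the agent's answer (case-insensitive substring match).
    for phrase in task.get("must_include", []):
        if phrase.lower() not in text:
            failures.append(f"missing required phrase: {phrase}")
    for phrase in task.get("must_not_include", []):
        if phrase.lower() in text:
            failures.append(f"forbidden phrase present: {phrase}")
    if task.get("citations_required") and not result.get("citations"):
        failures.append("citations required but none returned")

    # Each required tool call must appear with at least the listed argument values.
    for required in task.get("tool_calls_required", []):
        wanted = required.get("args_contain", {})
        matched = any(
            call["tool"] == required["tool"]
            and all(call.get("args", {}).get(k) == v for k, v in wanted.items())
            for call in calls
        )
        if not matched:
            failures.append(f"required tool call missing: {required['tool']}")

    # Any call to a forbidden tool fails the task, even if the rest is correct.
    forbidden = {item["tool"] for item in task.get("side_effects_forbidden", [])}
    for call in calls:
        if call["tool"] in forbidden:
            failures.append(f"forbidden side effect: {call['tool']}")
    return failures
```

Substring matching is the crudest possible content check. It is enough to catch missing sections and unwanted tool calls; judging whether a renewal-risk summary is actually right needs a human or model grader on top.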

Pareto frontier: cost vs accuracy

A single accuracy number is misleading. Plot accuracy vs cost-per-task for every system you’re considering. The right model lives on the Pareto frontier, not at the top of the leaderboard.

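System                   | Accuracy | $/task | Latency p95
-------------------------|----------|--------|------------
Vendor A (Opus-class)    | 91%      | $0.42  | 8.1s
Vendor A (Sonnet-class)  | 87%      | $0.09  | 3.4s
Vendor B (default)       | 89%      | $0.31  | 5.2s
Vendor B (small model)   | 76%      | $0.04  | 1.8s
In-house (Llama 4 + RAG) | 82%      | $0.02  | 2.1s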

There is no winner here. There’s a workload-dependent decision. Tier-1 deflection with volume? Bottom of the table. High-stakes account research? Top.
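
Finding the frontier by eye works for five rows and stops working at fifteen. A minimal sketch of the dominance check, using the figures from the table above; the class and function names are illustrative.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    accuracy: float     # fraction of tasks passed, higher is better
    cost: float         # USD per task, lower is better
    latency_p95: float  # seconds, lower is better


SYSTEMS = [
    System("Vendor A (Opus-class)", 0.91, 0.42, 8.1),
    System("Vendor A (Sonnet-class)", 0.87, 0.09, 3.4),
    System("Vendor B (default)", 0.89, 0.31, 5.2),
    System("Vendor B (small model)", 0.76, 0.04, 1.8),
    System("In-house (Llama 4 + RAG)", 0.82, 0.02, 2.1),
]


def pareto_frontier(systems: list[System], use_latency: bool = True) -> list[System]:
    """Return the systems that no other system beats or ties on every objective."""

    def objectives(s: System) -> tuple[float, ...]:
        # Negate cost and latency so that "higher is better" holds for every entry.
        base = (s.accuracy, -s.cost)
        return base + ((-s.latency_p95,) if use_latency else ())

    def dominates(a: System, b: System) -> bool:
        oa, ob = objectives(a), objectives(b)
        return all(x >= y for x, y in zip(oa, ob)) and oa != ob

    return [s for s in systems if not any(dominates(o, s) for o in systems)]


if __name__ == "__main__":
    print([s.name for s in pareto_frontier(SYSTEMS)])
    print([s.name for s in pareto_frontier(SYSTEMS, use_latency=False)])
```

With all three objectives, every row survives, which is why the table has no winner. Drop latency and Vendor B's small model falls off: the in-house system is both more accurate and cheaper. Which objectives you include is itself a workload decision.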

What public leaderboards DO tell you

They are not useless. They are useful for:

  • Trend tracking. Is the field improving? Is the gap closing on certain task types?
  • Sanity checking. A model that trails the state of the art by 30 points or more on every relevant benchmark is probably not the choice.
  • Vendor honesty checking. If a vendor claims top performance and doesn’t appear on any public leaderboard, ask why.

Treat them as field signal, not procurement spec.

Vendor benchmark theater

A familiar pattern at vendor events: the slide says “92% on industry benchmark X.” The footnote says “internal eval against our own harness with our recommended configuration.” That’s not a benchmark — that’s a marketing artifact.

Red flags to spot:

  • Benchmark name not linked to a public leaderboard.
  • “Internal” or “proprietary” evaluation.
  • Comparison to competitors using “default settings.” The competitor was probably sandbagged.
  • No release of eval traces.
  • A leaderboard the vendor built themselves and runs themselves.

When you see these, treat the number as opinion, not data.

The contamination problem

Public eval sets leak. Models train on the web. Leaderboard task descriptions get scraped. By the time a benchmark is six months old, the headline numbers may be inflated by memorization, not capability. Curated subsets do not fix this: SWE-bench Verified was built to remove underspecified or unfairly tested tasks from the original SWE-bench, and its issues still come from public repositories a model may have seen in training.

Held-out evals — where the test set is private and submissions run in a controlled environment — partially fix this. They’re slower to update and have fewer submissions, but the numbers are more honest.

Linking to broader evaluation practice

A leaderboard is one slice of agent evaluation. Online evaluation on production traffic, instrumented with tracing such as OpenTelemetry, is the other slice, and the one that matters more. Leaderboards say “can this thing do task X.” Production traces say “did this thing do task X for my customers.” Both. Always both.

Practical procurement workflow

  1. Shortlist 3 vendors / models from public leaderboards (in-league check).
  2. Build a 25–100 task private eval from your actual workflows.
  3. Run all 3 on your bench. Report accuracy, cost, latency, tool-call efficiency.
  4. Pick top 2. Pilot with real users for 4 weeks.
  5. Measure production: human override rate, escalation rate, CSAT, resolution.
  6. Decide on production metrics, not leaderboard metrics.

The metrics nobody reports but you need

Vendor leaderboards and most public benchmarks under-report:

  • Refusal rate. How often the agent says “I can’t do that.” Sometimes the right answer. Sometimes a quality-killer.
  • Hallucination-in-citation rate. Of tasks that required citations, how often was the cited source wrong or fabricated.
  • Off-policy tool calls. Did the agent call tools outside the allowed scope.
  • Recovery rate. When a tool call failed, did the agent retry sensibly or give up.
  • Steering response. When a human corrects the agent mid-task, does it incorporate the correction.

These differentiate good agents from leaderboard-tuned agents. Bake them into your private eval.
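
Three of these fall straight out of trace data if you log it. The sketch below assumes a hypothetical trace shape: each trace is a dict with a `refused` flag and an ordered `tool_calls` list whose entries carry `tool` and `ok`. Your tracing tool will use different field names.

```python
from typing import Optional


def summarize_traces(traces: list[dict], allowed_tools: set[str]) -> dict[str, Optional[float]]:
    """Compute refusal, off-policy, and recovery rates from agent traces."""
    total = len(traces)
    if total == 0:
        raise ValueError("no traces to summarize")

    refusals = 0
    off_policy = 0
    failed_calls = 0
    recovered_calls = 0
    for trace in traces:
        calls = trace.get("tool_calls", [])
        if trace.get("refused", False):
            refusals += 1
        # A trace is off-policy if it touched any tool outside the allowed scope.
        if any(call["tool"] not in allowed_tools for call in calls):
            off_policy += 1
        # A failed call counts as recovered if the same tool later succeeds in that trace.
        for i, call in enumerate(calls):
            if call["ok"]:
                continue
            failed_calls += 1
            if any(later["tool"] == call["tool"] and later["ok"] for later in calls[i + 1:]):
                recovered_calls += 1

    return {
        "refusal_rate": refusals / total,
        "off_policy_rate": off_policy / total,
        # None when nothing failed: there was nothing to recover from.
        "recovery_rate": recovered_calls / failed_calls if failed_calls else None,
    }
```

"Recovered" here means the same tool eventually succeeded, which is a deliberately narrow definition. An agent that sensibly switches to a different tool after a failure would be undercounted, so treat the number as a floor.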

Eval framework comparison

Framework         | Strengths                                                               | Watch-outs
------------------|-------------------------------------------------------------------------|------------------------------------------------
LangSmith         | Tight integration with LangChain/LangGraph; trace + eval in one place   | Vendor lock-in to LangChain ecosystem
Promptfoo         | Lightweight, YAML-driven, CI-friendly                                   | Less polish on multi-turn eval
OpenAI Evals      | Reference implementation, broad model support                           | OpenAI-centric; some patterns awkward elsewhere
Braintrust        | Strong UI, dataset versioning, online eval                              | Smaller community
Inspect (UK AISI) | Rigorous methodology, safety-eval lineage                               | Steeper learning curve
Custom in-house   | Exactly matches your workflows                                          | Maintenance burden, easy to drift

Pick the one that fits your stack. Don’t pick three.

Online evaluation: the missing half

Offline eval (public benchmarks + private suite) is necessary but insufficient. Online eval — running evaluators against production traffic — catches drift, distribution shift, and emergent failures that offline never sees.

Pattern: sample 1–5% of production traces, run them through an LLM-as-judge or a heuristic evaluator, track quality metrics over time. Alert on drift. This is how you catch a model regression two days after a vendor silently updates a model behind a stable-named endpoint.

Vendors are silently updating models more in 2026 than in 2025. The online eval is now load-bearing.
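
The sampling-and-alerting pattern is small enough to sketch. This version assumes each sampled trace has already been scored 1 (pass) or 0 (fail) by whatever evaluator you use; the function names, the 5-point tolerance, and the 30-sample minimum are illustrative choices, not recommendations from any tool.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select about `rate` of traces for online evaluation."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def drift_alert(
    baseline_pass_rate: float,
    recent_scores: list[int],
    tolerance: float = 0.05,
    min_samples: int = 30,
) -> bool:
    """Flag when the recent pass rate falls more than `tolerance` below baseline."""
    if len(recent_scores) < min_samples:
        return False  # too few scored traces to say anything
    recent_rate = sum(recent_scores) / len(recent_scores)
    return baseline_pass_rate - recent_rate > tolerance


if __name__ == "__main__":
    sampled = [f"trace-{i}" for i in range(10_000) if in_sample(f"trace-{i}", 0.03)]
    print(len(sampled), "of 10000 traces selected")
    print(drift_alert(0.90, [1] * 24 + [0] * 16))
```

Hashing the trace ID instead of calling a random number generator means the same trace is always in or out of the sample, so re-running the evaluator after a judge-prompt change scores the same traffic.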

The pattern that works

  • Public leaderboards say “is this thing in the league.” They do not say “does this thing work for me.”
  • Always read the harness, the task set, the metric, and the date before trusting a number.
  • Halve every vendor self-report for budgeting.
  • Build a private eval suite from your tasks. Run it weekly. Make it the procurement gate.
  • Pareto frontier (accuracy vs cost vs latency) — not a single number — drives the choice.
  • Production traces beat any benchmark once you’re live. Instrument early.