Virtual Agent Confidence Thresholds: Tuning Without Breaking Trust

Two failure modes destroy Virtual Agent trust faster than any other. One: the bot answers a confident-sounding nothing, the user reads it, gets nothing useful, and rage-types until a human appears. Two: the bot routes everything to a human in 30 seconds, agents complain about volume, and finance asks why you bought the license. Both are confidence-threshold problems and both are tunable.

The threshold is not one number

Virtual Agent’s confidence handling is often discussed as if there were a single threshold above which the bot answers and below which it escalates. An honest tuning pass works with at least three:

Auto-resolve floor: the score above which the bot will answer without asking the user to confirm intent.
Confirm-intent floor: the score above which (up to the auto-resolve floor) the bot asks “Did you mean X?” before proceeding. Below it, down to the handoff floor, the bot asks an open clarifying question instead.
Handoff floor: the score below which the bot stops trying and escalates to a live agent.

The space between these three numbers is where almost all the user-experience value lives. Collapse them to a single threshold and you give up the most useful tool you have.

Start from the data, not from the dial

Do not tune by guessing. Pull a sample of the last 30 days of conversations with their NLU scores and actual outcomes (resolved, abandoned, escalated, surveyed CSAT), then look at the score distribution by outcome.

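// Background script: survey the conversation log.
// Field names (outcome, nlu_top_intent_score) are the ones used in this article;
// confirm them against your instance's dictionary before running.
var ga = new GlideAggregate('sys_cs_conversation');
ga.addQuery('sys_created_on', '>=', gs.daysAgoStart(30));
ga.addAggregate('COUNT');
ga.addAggregate('AVG', 'nlu_top_intent_score');
ga.groupBy('outcome');
ga.query();
while (ga.next()) {
    // getValue() returns the plain string rather than a GlideElement
    gs.info(ga.getValue('outcome') + ' | count=' + ga.getAggregate('COUNT') +
            ' | avg_score=' + ga.getAggregate('AVG', 'nlu_top_intent_score'));
}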

What you are looking for is the valley between outcomes. In a well-trained model you will often see a roughly bimodal distribution: a hump of high-scoring conversations that resolved successfully, and a separate hump of low-scoring conversations that got escalated. The valley between the humps is roughly where your handoff floor wants to sit.
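The per-outcome average from the script above hides the shape of the distribution. A minimal sketch of the histogram step follows, assuming the conversations have already been exported into a plain array of { score, outcome } objects (that shape, and both function names, are hypothetical). The crossover heuristic is deliberately crude and only suggests where to start looking.

```javascript
// Bucket scores into equal-width bins, one histogram per outcome.
// rows: hypothetical array of { score: Number in [0, 1], outcome: String }.
function histogramByOutcome(rows, binCount) {
    var hist = {};
    rows.forEach(function (row) {
        if (!hist[row.outcome]) {
            hist[row.outcome] = [];
            for (var i = 0; i < binCount; i++) hist[row.outcome].push(0);
        }
        // Clamp so that a score of exactly 1.0 lands in the last bin.
        var bin = Math.floor(row.score * binCount);
        bin = Math.max(0, Math.min(binCount - 1, bin));
        hist[row.outcome][bin]++;
    });
    return hist;
}

// Crude heuristic: lower edge of the first bin in which the "high" outcome
// outnumbers the "low" outcome. Returns null if that never happens.
function crossoverScore(hist, binCount, lowOutcome, highOutcome) {
    var low = hist[lowOutcome] || [];
    var high = hist[highOutcome] || [];
    for (var i = 0; i < binCount; i++) {
        if ((high[i] || 0) > (low[i] || 0)) return i / binCount;
    }
    return null;
}
```

With sparse data a single stray conversation can move the crossover, so treat the result as a place to inspect the histogram by eye, not as the handoff floor itself.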

A working starting point

For a model trained on at least 200 utterances per intent, with at least 30 days of production traffic to learn from, these defaults survive contact with reality better than the platform’s out-of-box numbers:

Auto-resolve floor: 0.78
Confirm-intent floor: 0.55
Handoff floor: 0.40

Below 0.40, do not try to negotiate — escalate. Between 0.40 and 0.55, ask a clarifying question. Between 0.55 and 0.78, confirm the intent before executing. Above 0.78, act.
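The four bands can be sketched as a single decision function. This is an illustration only: the function name, the action labels, and the choice to treat each floor as inclusive (score at or above the floor) are assumptions, not platform behavior.

```javascript
// Starting-point thresholds from this article.
var thresholds = { autoResolve: 0.78, confirmIntent: 0.55, handoff: 0.40 };

// Map an NLU confidence score to one of four actions.
// Assumption: a score exactly on a floor counts as being above it.
function decideAction(score, t) {
    if (score >= t.autoResolve) return 'act';        // answer without confirming
    if (score >= t.confirmIntent) return 'confirm';  // "Did you mean X?"
    if (score >= t.handoff) return 'clarify';        // ask a clarifying question
    return 'escalate';                               // hand off to a live agent
}
```

Keeping the three numbers in one object makes it straightforward to swap in a different set per tenant, channel, or test arm.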

These are starting points, not gospel. The right numbers for your tenant depend on the cost of a false positive (acting on the wrong intent) versus the cost of a false negative (escalating a resolvable conversation). For internal IT use cases, false positives are cheap and false negatives are expensive — bias the thresholds down. For customer-facing flows, false positives are expensive — bias them up.
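One way to make the false-positive versus false-negative tradeoff concrete is to score candidate thresholds against labeled conversations. The sketch below assumes a hypothetical sample format of { score, correct }, where correct means the top intent was the right one, and assumes you can put a relative cost on each kind of error; both functions are illustrative.

```javascript
// Total cost of using `threshold` as the act/escalate cut-off.
// samples: hypothetical array of { score: Number, correct: Boolean }.
function totalCost(samples, threshold, costFalsePositive, costFalseNegative) {
    var cost = 0;
    samples.forEach(function (s) {
        var acted = s.score >= threshold;
        if (acted && !s.correct) cost += costFalsePositive;  // acted on the wrong intent
        if (!acted && s.correct) cost += costFalseNegative;  // escalated a resolvable one
    });
    return cost;
}

// Pick the candidate threshold with the lowest total cost.
// Ties go to the earliest candidate in the list.
function cheapestThreshold(samples, candidates, costFalsePositive, costFalseNegative) {
    var best = candidates[0];
    var bestCost = totalCost(samples, best, costFalsePositive, costFalseNegative);
    candidates.forEach(function (candidate) {
        var cost = totalCost(samples, candidate, costFalsePositive, costFalseNegative);
        if (cost < bestCost) {
            best = candidate;
            bestCost = cost;
        }
    });
    return best;
}
```

Running the same samples with the two cost weights swapped shows the bias described above: cheap false positives pull the chosen threshold down, expensive ones push it up.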

Tune the handoff message, not just the threshold

Once you decide to escalate, the message itself matters as much as the threshold. The right pattern:

"I want to make sure you get the right help. Connecting you to someone who
can sort this out. Here's what I've got so far:

Issue: [first user message]
Tried: [intents the bot attempted]
"

Three elements: a non-defensive framing, the original message verbatim (so the user does not have to retype), and the bot’s attempted path (so the human agent does not waste 90 seconds re-asking). The handoff transcript should land in the conversation context on the agent workspace side, not in a separate audit log nobody reads.
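A minimal sketch of assembling that message, assuming the first user message and the list of attempted intent labels are already available as plain values (how they are read from the conversation is not shown, and the function name is hypothetical):

```javascript
// Build the handoff message from the user's first message and the
// labels of the intents the bot attempted.
function buildHandoffMessage(firstUserMessage, attemptedIntents) {
    var tried = attemptedIntents.length > 0
        ? attemptedIntents.join(', ')
        : 'nothing yet';
    return [
        'I want to make sure you get the right help. Connecting you to someone who',
        'can sort this out. Here\'s what I\'ve got so far:',
        '',
        'Issue: ' + firstUserMessage,   // verbatim, so the user does not retype
        'Tried: ' + tried               // so the agent does not re-ask
    ].join('\n');
}
```

The same two values can be written to whatever context the agent workspace displays, so the user and the agent see the same summary.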

The clarifying question that is worth asking

Anywhere between the handoff floor and the auto-resolve floor, the bot has options. The cheap option is a yes/no confirmation: “Did you mean to reset your password?” The expensive-but-better option is multi-choice:

I think you're asking about one of these. Which is closest?
  1. Reset my password
  2. Unlock my account
  3. Update my security questions
  4. None of the above

Option 4 is non-negotiable. Without it, users force their question into one of the bot’s categories, the bot acts on a wrong intent, and CSAT tanks. With it, the bot learns from its own mistakes — log every “none of the above” as a training signal for the next NLU model rebuild.
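The rule that “none of the above” is always present is easy to enforce in code rather than in topic design. The sketch below assumes the NLU result has been reduced to a ranked array of { label, score } objects; that shape and the function names are hypothetical.

```javascript
var NONE_OF_THE_ABOVE = 'None of the above';

// Take the top-N intent labels and always append the escape hatch.
// rankedIntents: hypothetical array of { label, score }, best first.
function buildClarifyingChoices(rankedIntents, maxChoices) {
    var labels = rankedIntents.slice(0, maxChoices).map(function (intent) {
        return intent.label;
    });
    labels.push(NONE_OF_THE_ABOVE);
    return labels;
}

// Render the choices as the numbered prompt shown to the user.
function renderChoices(choices) {
    var lines = ['I think you\'re asking about one of these. Which is closest?'];
    choices.forEach(function (label, index) {
        lines.push('  ' + (index + 1) + '. ' + label);
    });
    return lines.join('\n');
}

// True when the user's pick should be logged as a training signal.
function isNoneOfTheAbove(choice) {
    return choice === NONE_OF_THE_ABOVE;
}
```

Because the last option is appended unconditionally, a prompt built from an empty intent list still gives the user a way out.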

The retraining feedback loop

Confidence thresholds without a retraining loop are a static guard against a moving target. Every escalated conversation below the auto-resolve floor is a free training example. Capture them:

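// Business Rule on sys_cs_conversation: after, when escalated.
// Field names follow this article; u_va_training_candidate is a custom table.
(function executeRule(current, previous /*null when async*/) {
    // previous can be null (insert, async), so guard before reading it
    var wasEscalated = previous && previous.escalated == true;
    if (current.escalated == true && !wasEscalated) {
        var entry = new GlideRecord('u_va_training_candidate');
        entry.initialize();
        // Copy plain values rather than GlideElement references
        entry.setValue('u_conversation', current.getUniqueValue());
        entry.setValue('u_first_user_message', current.getValue('first_user_message'));
        entry.setValue('u_top_intent', current.getValue('nlu_top_intent'));
        entry.setValue('u_top_intent_score', current.getValue('nlu_top_intent_score'));
        entry.setValue('u_escalation_reason', current.getValue('escalation_reason'));
        entry.insert();
    }
})(current, previous);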

Review the candidate table weekly. Promote the high-quality examples into the training corpus. This is the single most reliable way to improve the model over time, and it beats any one-time tuning pass.

Channel-specific thresholds

The same NLU model serves Teams, Slack, the portal widget, and the mobile app. Should it use the same thresholds? Usually no. Mobile users type less; their average message length is shorter and their NLU scores are systematically lower. If you use the same threshold across channels, mobile users hit the handoff floor far more often.

Carry channel as a context variable and apply a per-channel offset:

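// channel is assumed to have been read from the conversation context, for example 'mobile'
var baseHandoff = 0.40;
var channelOffset = {
    'mobile': -0.05,
    'teams': 0.0,
    'portal': 0.0,
    'slack': -0.02
};
// Unknown channels fall back to no offset
var effectiveHandoff = baseHandoff + (channelOffset[channel] || 0);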

The numbers should come from your data, not from this article. The point is the structure: do not pretend channels are identical.
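One simple way to derive those offsets from data is to compare each channel’s mean score against a reference channel. This is an assumed heuristic, not a platform feature; the input shape (an object mapping channel name to an array of scores) and the function names are hypothetical.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
    var sum = values.reduce(function (acc, v) { return acc + v; }, 0);
    return sum / values.length;
}

// Offset per channel = that channel's mean score minus the reference
// channel's mean score. A channel that scores lower gets a negative offset.
// scoresByChannel: hypothetical { channelName: [score, score, ...] }.
function channelOffsets(scoresByChannel, referenceChannel) {
    var reference = mean(scoresByChannel[referenceChannel]);
    var offsets = {};
    Object.keys(scoresByChannel).forEach(function (channel) {
        offsets[channel] = mean(scoresByChannel[channel]) - reference;
    });
    return offsets;
}
```

A mean difference is sensitive to outliers and to differences in intent mix between channels, so compare like-for-like intents where you can, and round the result to a sensible step before using it.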

UI signals that build trust

When the bot is in the confirm-intent band, the UI should show it. A subtle “I’m not 100 percent sure — is this right?” framing, with the suggested intent rendered as a chip rather than a sentence, signals to the user that the bot is being honest about uncertainty. Users forgive uncertainty; they do not forgive false confidence. The portal widget designers got this right when they introduced the suggestion-chip pattern in Vancouver; copy it.

For the conversation-flow side of this same problem, our piece on knowledge taxonomy rebuilds covers how the underlying content shapes what the bot can confidently answer.

What to measure weekly

Five metrics on one dashboard, refreshed weekly:

Auto-resolve rate (above auto-resolve floor, no escalation, no user-initiated abandon)
Confirm-intent success rate (user confirmed the suggested intent)
Escalation rate (handoff floor or user-requested)
CSAT among auto-resolved conversations specifically
Volume of “none of the above” responses

The fifth one is the early-warning indicator. When “none of the above” volume rises, your intent coverage is degrading even if the other numbers look fine.
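The five metrics can be computed in one pass over a week of conversations. The sketch below assumes each conversation has been flattened into a plain object with the hypothetical fields score, escalated, abandoned, confirmShown, confirmAccepted, csat (a number, or absent when not surveyed), and noneOfTheAbove.

```javascript
// Compute the five weekly metrics from a flat list of conversations.
// Rates are null when their denominator is zero.
function weeklyMetrics(conversations, autoResolveFloor) {
    var ratio = function (numerator, denominator) {
        return denominator === 0 ? null : numerator / denominator;
    };
    var autoResolved = conversations.filter(function (c) {
        return c.score >= autoResolveFloor && !c.escalated && !c.abandoned;
    });
    var confirmShown = conversations.filter(function (c) { return c.confirmShown; });
    var confirmAccepted = confirmShown.filter(function (c) { return c.confirmAccepted; });
    var escalated = conversations.filter(function (c) { return c.escalated; });
    // CSAT is measured among auto-resolved conversations only.
    var rated = autoResolved.filter(function (c) { return typeof c.csat === 'number'; });
    var csatSum = rated.reduce(function (sum, c) { return sum + c.csat; }, 0);
    var noneOfTheAbove = conversations.filter(function (c) { return c.noneOfTheAbove; });

    return {
        autoResolveRate: ratio(autoResolved.length, conversations.length),
        confirmSuccessRate: ratio(confirmAccepted.length, confirmShown.length),
        escalationRate: ratio(escalated.length, conversations.length),
        autoResolvedCsat: ratio(csatSum, rated.length),
        noneOfTheAboveCount: noneOfTheAbove.length
    };
}
```

Returning null instead of 0 for an empty denominator keeps a week with no confirm prompts from looking like a week where every prompt failed.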

Intent overlap as the silent killer

Two intents with similar training utterances will produce confidence scores within a few hundredths of each other on almost every input. The bot picks one; the other is the right answer half the time. The threshold cannot rescue you from intent overlap because the top score still passes the auto-resolve floor.

Detect overlap by looking at the gap between the top intent’s score and the second intent’s score:

// Background script: find conversations where the intent gap was small.
var gr = new GlideRecord('sys_cs_conversation');
gr.addQuery('sys_created_on', '>=', gs.daysAgoStart(14));
gr.addEncodedQuery('nlu_top_intent_score>0.6');
gr.query();

var narrow = 0, total = 0;
while (gr.next()) {
    total++;
    var top = parseFloat(gr.getValue('nlu_top_intent_score'));
    // A GlideElement is always truthy, so apply the default after parsing:
    // an empty second score parses to NaN and falls back to 0.
    var second = parseFloat(gr.getValue('nlu_second_intent_score')) || 0;
    if (top - second < 0.08) narrow++;
}
gs.info('Narrow-margin conversations: ' + narrow + ' of ' + total);

A high narrow-margin rate means the model is guessing among siblings. The fix is not threshold tuning; the fix is to merge the overlapping intents or to rebuild the training utterances so the model can tell them apart. Threshold tuning on top of intent overlap is treating symptoms.

A/B testing changes the right way

When you change thresholds, do not change them everywhere at once. The platform supports conversation-context-based routing; use it to A/B test by routing a fraction of conversations through new thresholds while the rest stay on the old ones. Two-week minimum per test; one full business cycle is better. Compare auto-resolve CSAT, handoff rate, and “none of the above” rate across the two arms.
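Whatever routing mechanism you use, the arm assignment should be deterministic so a conversation never switches thresholds midway. A minimal sketch, assuming the conversation’s sys_id (or any stable string id) is available; the hash is a simple non-cryptographic one, and all function names are hypothetical.

```javascript
// Simple deterministic string hash (djb2-style, xor variant), kept in
// the unsigned 32-bit range. Not cryptographic.
function hashString(text) {
    var hash = 5381;
    for (var i = 0; i < text.length; i++) {
        hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0;
    }
    return hash;
}

// Assign a conversation to 'treatment' or 'control'.
// treatmentFraction is between 0 and 1, for example 0.2 for 20 percent.
function assignArm(conversationId, treatmentFraction) {
    var bucket = hashString(conversationId) % 1000;  // 0..999
    return bucket < treatmentFraction * 1000 ? 'treatment' : 'control';
}

// Pick the threshold set for a conversation based on its arm.
function thresholdsForConversation(conversationId, treatmentFraction, control, treatment) {
    return assignArm(conversationId, treatmentFraction) === 'treatment' ? treatment : control;
}
```

Store the assigned arm on the conversation record so the comparison at the end of the test does not depend on recomputing the hash, and check the actual split on your own ids before trusting it, since a simple hash is not guaranteed to spread them evenly.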

Untested threshold changes are how teams break the bot for a week and only find out in the Monday metrics review.

Tradeoffs to be honest about

Higher thresholds mean more escalations, which costs agent time. Lower thresholds mean more bot answers, some of which will be wrong, which costs CSAT and creates work-rework cycles. There is no setting that minimizes both. The right balance is the one that minimizes the combined cost in your environment, and the relative cost of a wrong answer versus an unnecessary escalation differs between IT, HR, and customer service.

Do not tune in isolation. Whatever you do to the thresholds, the agents on the other end of the handoff will feel it. Tell them before, not after.

Bottom line

Treat confidence as three thresholds, not one — auto-resolve, confirm-intent, handoff.
Tune from data, not from the dial. Histogram scores by outcome before touching a number.
Always offer “none of the above” in clarifying questions; log every selection as a training signal.
Use per-channel offsets; mobile users systematically score lower and deserve different math.
Track auto-resolve CSAT specifically. A high auto-resolve rate with low CSAT is worse than a lower auto-resolve rate with high CSAT.
