Tuning Virtual Agent NLU for Real Conversations, Not Demos

Intent accuracy of 95% in a demo collapses to 78% in the first week real employees use Virtual Agent. The gap is not the model. The gap is utterance distribution and intent confusability, both of which are fixable in a fortnight.

Sample utterances, not synthetic ones

The default intent training set ships with five to ten utterances per intent, all written by the implementer. Production utterances look nothing like them. Pull the last 30 days of sys_cs_conversation_message rows where direction='inbound' and use the raw text as your candidate set.

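var gr = new GlideRecord('sys_cs_conversation_message');
gr.addQuery('direction', 'inbound');                  // user-to-agent messages only
gr.addQuery('sys_created_on', '>=', gs.daysAgo(30));  // last 30 days
gr.setLimit(5000);                                    // cap the candidate set
gr.query();
var utterances = [];
while (gr.next()) {
    // 'payload' is an assumed field name; use the column that holds the raw message text
    utterances.push(gr.getValue('payload'));
}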

Cluster these with the platform’s built-in similarity action and you will discover three things: the misspellings you never anticipated, two intents nobody bothered to train, and one intent that should be deleted.

Confusability is the silent killer

Two intents with overlapping vocabulary degrade each other's accuracy. Run the NLU Inference dashboard and look at the confusion matrix. Anywhere two intents trade misclassifications above 8%, you should either merge them or split them with a clarifier prompt.
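If you export the confusion matrix as raw counts, the 8% check can be scripted. The following is a minimal sketch, not a platform API: the function name, the `confusion[actual][predicted]` shape, and the intent names are all hypothetical, and "trade" is read here as both directions exceeding the threshold.

```javascript
// Hypothetical helper: confusion[actual][predicted] = number of test utterances.
function confusablePairs(confusion, threshold) {
    var intents = Object.keys(confusion);
    var pairs = [];

    // Share of utterances labelled `actual` that the model predicted as `predicted`.
    function rate(actual, predicted) {
        var row = confusion[actual];
        var total = 0;
        Object.keys(row).forEach(function (k) { total += row[k]; });
        return total === 0 ? 0 : (row[predicted] || 0) / total;
    }

    for (var i = 0; i < intents.length; i++) {
        for (var j = i + 1; j < intents.length; j++) {
            var aToB = rate(intents[i], intents[j]);
            var bToA = rate(intents[j], intents[i]);
            // Assumption: flag the pair only when both directions exceed the threshold.
            if (aToB > threshold && bToA > threshold) {
                pairs.push({ a: intents[i], b: intents[j], aToB: aToB, bToA: bToA });
            }
        }
    }
    return pairs;
}

// Made-up counts for illustration only.
var exampleConfusion = {
    reset_password: { reset_password: 80, unlock_account: 15, order_laptop: 5 },
    unlock_account: { reset_password: 12, unlock_account: 85, order_laptop: 3 },
    order_laptop:   { reset_password: 1,  unlock_account: 1,  order_laptop: 98 }
};
var flagged = confusablePairs(exampleConfusion, 0.08);
```

With these illustrative counts, only the reset_password / unlock_account pair is flagged as a merge-or-clarify candidate.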

The 50/15/35 split

Train on 50% of mined utterances, hold out 15% for validation in the NLU workbench, and reserve 35% for an offline regression suite you re-run before every model publish. Without the 35% reserve you will overfit to the validation set within three iterations.
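One way to make the split reproducible is a seeded shuffle followed by fixed cut points. This is a sketch under assumptions: `splitUtterances` and its seed parameter are hypothetical names, and the small linear congruential generator is there only so the same seed always yields the same partition, not because the platform requires it.

```javascript
// Hypothetical helper: deterministic 50/15/35 split of mined utterances.
function splitUtterances(utterances, seed) {
    var items = utterances.slice();   // copy so the caller's array is not mutated
    var state = seed >>> 0;

    // Small seeded generator (LCG) returning a value in [0, 1).
    function nextRandom() {
        state = (state * 1664525 + 1013904223) % 4294967296;
        return state / 4294967296;
    }

    // Fisher-Yates shuffle driven by the seeded generator.
    for (var i = items.length - 1; i > 0; i--) {
        var j = Math.floor(nextRandom() * (i + 1));
        var tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }

    var trainEnd = Math.floor(items.length * 0.50);
    var validEnd = trainEnd + Math.floor(items.length * 0.15);
    return {
        train: items.slice(0, trainEnd),              // 50%: training
        validation: items.slice(trainEnd, validEnd),  // 15%: validation
        regression: items.slice(validEnd)             // remainder (~35%): regression suite
    };
}
```

Keeping the seed fixed across iterations matters: if the partition is reshuffled on every run, regression utterances leak into training and the reserve stops protecting you from overfitting.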

Publish on a schedule, not on a whim

Adopt a two-week publish cadence. Treat the model like code: PR review of new utterances, validation report attached, rollback plan if intent confidence on the regression suite drops more than 3 points.
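The rollback rule can be expressed as a small gate that runs before each publish. The sketch below assumes "3 points" means percentage points of mean confidence across the regression suite, with confidence on a 0 to 100 scale; the function names and the result shape are hypothetical.

```javascript
// Hypothetical result shape: [{ utterance: "...", confidence: 87.5 }, ...]
function meanConfidence(results) {
    var sum = 0;
    results.forEach(function (r) { sum += r.confidence; });
    return results.length === 0 ? 0 : sum / results.length;
}

// Returns true when the candidate model's mean confidence on the regression
// suite has dropped by more than maxDropPoints versus the published baseline.
function shouldRollBack(baselineResults, candidateResults, maxDropPoints) {
    var drop = meanConfidence(baselineResults) - meanConfidence(candidateResults);
    return drop > maxDropPoints;
}
```

A per-intent version of the same check is a natural extension, since a healthy average can hide one intent that has regressed badly.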

Watch the fallback rate

The single best health metric is the fallback (no-match) rate. Anything above 12% means either an intent is missing or your routing is wrong. Anything below 4% likely means the model is overfitting: it is matching almost everything, including utterances it should reject.
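As a rough illustration, the two bounds can be turned into a status check on whatever conversation counts you export. The function name and status labels below are hypothetical; only the 12% and 4% bounds come from the text.

```javascript
// Hypothetical helper: classify the fallback (no-match) rate for a period.
function fallbackHealth(totalUtterances, fallbackCount) {
    if (totalUtterances <= 0) {
        throw new Error('totalUtterances must be positive');
    }
    var ratePercent = (fallbackCount * 100) / totalUtterances;
    var status = 'healthy';
    if (ratePercent > 12) {
        status = 'too_high';   // missing intent or wrong routing
    } else if (ratePercent < 4) {
        status = 'too_low';    // likely overfitting
    }
    return { ratePercent: ratePercent, status: status };
}
```

For example, 150 fallbacks out of 1,000 utterances is a 15% rate and would be classed as `too_high`.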

What to do this week

Export the last 30 days of inbound utterances, cluster them, and add the five highest-volume unmatched clusters as new training examples. Republish, then watch fallback rate in the analytics dashboard for the next week.
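Picking the five highest-volume unmatched clusters is a filter and a sort once the clusters are exported. This sketch assumes a hypothetical cluster shape (`label`, `matched`, `utterances`); the real output of your clustering step will likely differ.

```javascript
// Hypothetical cluster shape: { label: "...", matched: false, utterances: ["...", ...] }
function topUnmatchedClusters(clusters, limit) {
    return clusters
        .filter(function (c) { return !c.matched; })   // keep clusters no intent matched
        .sort(function (a, b) {                        // largest volume first
            return b.utterances.length - a.utterances.length;
        })
        .slice(0, limit);
}
```

Calling `topUnmatchedClusters(clusters, 5)` returns the candidates to review and add as training examples.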
