Now Assist Token Budgets: Capping LLM Spend Per Skill

Six weeks after Now Assist went GA in production, finance forwarded the bill. The Resolution Notes skill alone had burned through more tokens than the rest of the platform combined, because a well-meaning admin had enabled it on every closed incident, including a flood of 14,000 duplicates from a misconfigured monitoring source. No cap, no circuit breaker, no warning. This is the playbook to make sure that never happens to you.

The problem with default token settings

Now Assist skills ship with reasonable defaults, but “reasonable” assumes well-formed inputs and human-paced invocation. The minute you wire a skill into a Business Rule, a Flow, or a bulk job, the volume profile changes and the per-skill defaults stop protecting you. Token cost scales directly with input length, and long incident work notes plus a verbose system prompt routinely push individual calls past 3,000 input tokens before a single output token is produced.

Worse, most teams measure cost monthly, after the spend has happened. By then a runaway skill has already eaten the quarter’s discretionary budget.

What a real budget looks like

Treat each Now Assist skill like a microservice with its own SLO and cost ceiling. Three numbers matter:

Token cap per invocation — hard upper bound on input + output. If a record’s context blows past this, summarize or truncate before calling.
Calls per hour per skill — circuit breaker. If a skill suddenly fires 10x its baseline, something is wrong.
Cost per successful outcome — the only metric finance cares about. Track tokens against the business event the skill is supposed to drive (resolved incident, accepted suggestion, deflected chat).

The first two are guardrails. The third is the conversation you have with leadership.

A budget table you can actually use

Create a custom table to hold skill budgets and a related usage log. The pattern is dull and that is the point — no clever metadata, just numbers you can query.

// Table: u_now_assist_budget
//   u_skill_name (String)
//   u_max_input_tokens (Integer)
//   u_max_output_tokens (Integer)
//   u_calls_per_hour (Integer)
//   u_daily_token_cap (Integer)
//   u_enabled (Boolean)

// Script Include: NowAssistBudgetGuard
var NowAssistBudgetGuard = Class.create();
NowAssistBudgetGuard.prototype = {
    initialize: function() {},

    canInvoke: function(skillName, estimatedInputTokens) {
        var budget = new GlideRecord('u_now_assist_budget');
        budget.addQuery('u_skill_name', skillName);
        budget.addQuery('u_enabled', true);
        budget.setLimit(1);
        budget.query();
        if (!budget.next()) {
            gs.warn('NowAssist: no budget for ' + skillName + ', blocking');
            return false;
        }

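        // Read field values as integers; GlideElement objects are not plain numbers.
        var maxInput = parseInt(budget.getValue('u_max_input_tokens'), 10);
        if (estimatedInputTokens > maxInput) return false;

        // Count this skill's logged calls in the trailing hour.
        var hourAgo = new GlideDateTime();
        hourAgo.addSeconds(-3600);
        var agg = new GlideAggregate('u_now_assist_usage');
        agg.addQuery('u_skill_name', skillName);
        agg.addQuery('sys_created_on', '>=', hourAgo);
        agg.addAggregate('COUNT');
        agg.query();
        var recentCalls = agg.next() ? parseInt(agg.getAggregate('COUNT'), 10) : 0;

        return recentCalls < parseInt(budget.getValue('u_calls_per_hour'), 10);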
    },

    type: 'NowAssistBudgetGuard'
};

Every skill invocation goes through canInvoke() first. If you cannot bring yourself to put a guard in front of every call, at minimum put one in front of the skills that fire from automation rather than from a human click.
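
The budget table carries a u_daily_token_cap column that canInvoke() above never consults. One way to fold it in is to keep the decision itself as a pure function, so it can be tested off-platform and fed from whatever query you prefer. This is a minimal sketch, not the platform's own mechanism: the checkBudget name, the row shape, and the choice of a rolling 24-hour window (rather than a calendar day) are all assumptions.

```javascript
// Pure decision function: no Glide APIs, so it can be unit tested anywhere.
// usageRows: array of { tokens: Number, createdMs: Number } for ONE skill (hypothetical shape).
// budget:    { callsPerHour: Number, dailyTokenCap: Number }
function checkBudget(usageRows, budget, estimatedTokens, nowMs) {
    var HOUR_MS = 60 * 60 * 1000;
    var DAY_MS = 24 * HOUR_MS;
    var callsLastHour = 0;
    var tokensLastDay = 0;

    usageRows.forEach(function(row) {
        var age = nowMs - row.createdMs;
        if (age < 0 || age >= DAY_MS) return;   // outside the trailing 24 hours
        tokensLastDay += row.tokens;
        if (age < HOUR_MS) callsLastHour += 1;  // also inside the trailing hour
    });

    if (callsLastHour >= budget.callsPerHour) {
        return { allowed: false, reason: 'calls_per_hour' };
    }
    if (tokensLastDay + estimatedTokens > budget.dailyTokenCap) {
        return { allowed: false, reason: 'daily_token_cap' };
    }
    return { allowed: true, reason: null };
}
```

Returning a reason alongside the verdict means the wrapper can log why a call was blocked instead of a bare false.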

Estimating tokens before you spend them

You cannot afford to discover a skill is too expensive after you have paid for the call. Rough token estimation costs nothing:

// Rough estimator — English averages ~4 chars per token
function estimateTokens(text) {
    if (!text) return 0;
    return Math.ceil(text.length / 4);
}

// Before invoking the Resolution Notes skill:
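var inc = new GlideRecord('incident');
if (inc.get(sysId)) { // get() returns false when the record does not exist
    // getJournalEntry(-1) returns every work note, so this is the worst case
    var ctxChars = (inc.short_description + '\n' +
                    inc.description + '\n' +
                    inc.work_notes.getJournalEntry(-1) + '\n' +
                    inc.close_notes).toString();
    var estTokens = estimateTokens(ctxChars) + 400; // system prompt budget
}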

For records that exceed the cap, the right move is not to skip the skill — it is to run a cheap pre-summarizer over the long fields and feed the summary into the expensive skill. Two cheap calls beat one truncated expensive call.
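
The routing decision described above can be sketched as a small planner that says whether to call the skill directly or summarize first, and how many tokens the summary may occupy. The planInvocation name, its parameters, and the 'blocked' mode are hypothetical; 'blocked' covers the edge case where the fixed context alone exceeds the cap, which no amount of summarizing fixes.

```javascript
// Rough estimator from the article: about 4 characters per token for English.
function estimateTokens(text) {
    if (!text) return 0;
    return Math.ceil(text.length / 4);
}

// Decide how to invoke the expensive skill for one record.
// shortText: fields always sent as-is (e.g. short description)
// longText:  fields that are candidates for pre-summarization (e.g. work notes)
function planInvocation(shortText, longText, maxInputTokens, systemPromptTokens) {
    var fixed = estimateTokens(shortText) + systemPromptTokens;
    var total = fixed + estimateTokens(longText);

    if (total <= maxInputTokens) {
        return { mode: 'direct', summaryTokenBudget: null };
    }
    // Tokens left for a summary once the fixed context is accounted for.
    var remaining = maxInputTokens - fixed;
    if (remaining <= 0) {
        return { mode: 'blocked', summaryTokenBudget: null };
    }
    return { mode: 'summarize_first', summaryTokenBudget: remaining };
}
```

The summaryTokenBudget value is what you would hand to the cheap summarizer as its output limit, so the second call is guaranteed to fit under the cap.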

Cost-per-resolution: the only metric that matters

Tokens consumed is a vanity metric. Tokens consumed per closed-resolved incident with an accepted Now Assist suggestion is a useful metric. Build the indicator in Performance Analytics so leadership sees the same number every Monday:

Numerator:   SUM(u_now_assist_usage.u_total_tokens)
             WHERE skill = 'resolution_notes'
             AND created in [period]

Denominator: COUNT(incident)
             WHERE state = 6 (Resolved)
             AND u_assist_suggestion_accepted = true
             AND resolved in [period]

Plot it weekly. When the line trends up, something has degraded — usually the suggestion quality fell and agents stopped accepting, while the skill kept firing on every record. That is the early signal to retune the prompt or shrink the scope.
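
For sanity-checking the indicator outside Performance Analytics, the same ratio can be computed from exported rows. This is a sketch under assumptions: the row shape and the costPerAcceptedResolution name are made up for illustration, and the accepted-resolution count is expected to come from a separate incident query.

```javascript
// usageRows: [{ skill: String, totalTokens: Number }] for the period (hypothetical shape).
// acceptedResolutions: count of resolved incidents with an accepted suggestion.
function costPerAcceptedResolution(usageRows, skillName, acceptedResolutions) {
    var tokens = usageRows.reduce(function(sum, row) {
        return row.skill === skillName ? sum + row.totalTokens : sum;
    }, 0);
    // With no accepted outcomes the ratio is undefined; return null rather
    // than Infinity so a dashboard can flag the period explicitly.
    if (acceptedResolutions === 0) return null;
    return tokens / acceptedResolutions;
}
```

A null week is itself a signal: the skill spent tokens and nobody accepted a single suggestion.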

The kill switch nobody builds until they need it

One sys_property, one early-return check, one bad afternoon avoided:

// System Property: now_assist.skill.kill_list (comma-separated skill names)
// Early-return check inside the wrapper Script Include, before any LLM call:
var killList = gs.getProperty('now_assist.skill.kill_list', '')
    .split(',')
    .map(function(name) { return name.trim(); }); // tolerate "a, b" spacing
if (killList.indexOf(skillName) !== -1) {
    gs.warn('NowAssist: ' + skillName + ' is in kill list');
    return null;
}

Whoever is on call should be able to disable a misbehaving skill in under thirty seconds without an update set or a code change. If the kill switch requires a deploy, it is not a kill switch.

Scoping skills to where they earn their cost

Not every record deserves an LLM call. Apply hard scope filters:

Skip records older than 30 days unless explicitly reopened
Skip records below a configurable priority floor
Skip records where the assignment group has opted out
Skip records whose description is shorter than the system prompt — there is nothing to summarize

These filters typically cut invocation volume by 40 to 60 percent with zero impact on perceived value, because the records they exclude are the ones where the skill was producing low-confidence output anyway.
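
The four filters above can be expressed as one predicate that runs before the budget guard. This is a sketch only: the inScope name, the record and option shapes, and the use of character length as a stand-in for "shorter than the system prompt" are assumptions.

```javascript
// record: { ageDays, reopened, priority, assignmentGroup, description }
// opts:   { maxAgeDays, priorityFloor, optedOutGroups, systemPromptChars }
// Priority follows the ServiceNow convention where 1 is the highest, so
// "below the floor" means a numerically larger value.
function inScope(record, opts) {
    if (record.ageDays > opts.maxAgeDays && !record.reopened) return false;
    if (record.priority > opts.priorityFloor) return false;
    if (opts.optedOutGroups.indexOf(record.assignmentGroup) !== -1) return false;
    if ((record.description || '').length < opts.systemPromptChars) return false;
    return true;
}
```

Running the cheap scope check first means out-of-scope records never reach the guard's usage query at all.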

For related guardrails on integration-side runaway calls, see our IntegrationHub rate-limit and circuit-breaker pattern.

UI nudges that reduce wasted calls

The cheapest token is the one never requested. On the agent workspace incident form, the Now Assist suggestion panel should show a small badge — “Estimated cost: low / medium / high” — driven by your token estimator. Agents who can see they are about to spend money on a record that is already 95 percent written will think twice. We measured a 22 percent drop in unnecessary invocations after adding the badge, with no measurable impact on resolution quality.
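
The badge logic itself is a few lines. In this sketch the costBadge name and the default thresholds are hypothetical; pick cut-offs from your own usage distribution.

```javascript
// Map a token estimate to the badge label shown to the agent.
// thresholds: { medium: Number, high: Number }; the defaults are illustrative only.
function costBadge(estimatedTokens, thresholds) {
    var t = thresholds || { medium: 1000, high: 3000 };
    if (estimatedTokens >= t.high) return 'high';
    if (estimatedTokens >= t.medium) return 'medium';
    return 'low';
}
```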

Pair that with a daily skill-usage dashboard pinned to the platform ops landing page. Visibility beats policy almost every time.

Per-tenant vs per-skill budgets

A single tenant-wide cap is too coarse. A single skill-wide cap is too narrow. You need both:

The tenant-wide cap is your safety net. It is set conservatively above expected aggregate spend.
The skill-wide cap shapes which skills get to spend in proportion to their value.
A per-user-group cap can be useful when one business unit is funding its own skill use; otherwise it is overkill.

When the tenant-wide cap trips, every Now Assist call stops. This is a fire alarm, not a routine event. It should page someone. When a skill cap trips, only that skill stops. This is the routine guardrail and should not page anyone — it should write to a log and continue.
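
The difference between the two trips can be made explicit in the wrapper. A minimal sketch, assuming token totals are already aggregated elsewhere; the evaluateCaps name and the return shape are made up for illustration.

```javascript
// Decide what the wrapper should do given current usage against both caps.
function evaluateCaps(tenantTokens, tenantCap, skillTokens, skillCap) {
    if (tenantTokens >= tenantCap) {
        // Fire alarm: every Now Assist call stops and someone gets paged.
        return { allow: false, scope: 'tenant', page: true };
    }
    if (skillTokens >= skillCap) {
        // Routine guardrail: only this skill stops; log it, do not page.
        return { allow: false, scope: 'skill', page: false };
    }
    return { allow: true, scope: null, page: false };
}
```

Checking the tenant cap first matters: if both caps have tripped, the page still goes out.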

Logging that supports root cause

When a cap trips and someone asks “why did it trip,” the answer must be specific. The usage log needs more than a count:

Table: u_now_assist_usage
  u_skill_name        (String)
  u_record_table      (Table Name)
  u_record_sys_id     (Document ID, dependent on u_record_table)
  u_input_tokens      (Integer)
  u_output_tokens     (Integer)
  u_total_tokens      (Integer)
  u_outcome           (Choice: accepted, rejected, error, timeout, blocked_by_cap)
  u_invoked_by        (Reference to user, nullable)
  u_invoked_via       (Choice: business_rule, flow, manual, api)
  u_correlation_id    (String)

The u_invoked_via field is the one you will need most often. When a skill suddenly fires 10x its baseline, the first question is whether the surge came from automation or from human use. The answer determines what to do — silence a runaway Business Rule, or scale up because adoption is real.
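
That first triage question can be answered mechanically from the log rows for the surge window. In this sketch the function names and row shape are hypothetical, and treating every channel other than manual (including api) as automation is an assumption you may want to change.

```javascript
// rows: [{ invokedVia: 'business_rule' | 'flow' | 'manual' | 'api' }]
function countByChannel(rows) {
    var counts = {};
    rows.forEach(function(row) {
        counts[row.invokedVia] = (counts[row.invokedVia] || 0) + 1;
    });
    return counts;
}

// Classify a surge as 'automation' when non-manual channels account for
// more calls than manual ones, otherwise 'human'.
function surgeSource(rows) {
    var manual = countByChannel(rows).manual || 0;
    var automated = rows.length - manual;
    return automated > manual ? 'automation' : 'human';
}
```

An 'automation' result points at a runaway Business Rule or Flow; a 'human' result suggests real adoption and a cap that may need raising.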

Tradeoffs to be honest about

A token budget is not free. You add latency on every call (the guard query), you add complexity to the skill wrapper, and you accept that some legitimate calls will be blocked when the hourly cap trips. The latency cost is real but small — under 30ms with a properly indexed usage table. The blocked-call cost is the one to watch. Tune the per-hour cap so it trips only on genuine anomalies; if agents start seeing “skill unavailable” messages during normal load, you have set the cap too tight and trust will erode.

The alternative is no cap, and the alternative has a price tag.

Bottom line

Every Now Assist skill needs a token cap, a calls-per-hour cap, and a kill switch — built before the first production invocation, not after the first surprise bill.
Measure cost per successful outcome, not raw token consumption. Tokens with no business event behind them are pure overhead.
Estimate tokens before you spend them; pre-summarize long context with a cheap call instead of truncating into an expensive one.
Scope filters (age, priority, length) cut invocation volume far more than prompt optimization does.
A kill switch that needs a deploy is not a kill switch. Wire one sys_property and one early-return check into every skill wrapper.
