The data residency review committee approved Now Assist on the condition that no customer data leaves the regional Azure tenant. The procurement team modeled spend assuming 4 cents per call. Production traffic landed at 14 cents per call against the wrong region’s endpoint with no fallback when the endpoint went down. BYO LLM solves real problems and creates new operational ones in equal measure.
Endpoint Registration
Configure the LLM endpoint in Now Assist admin — URL, authentication, model capabilities (context length, modality, tool support). Common choices: Azure OpenAI on an enterprise commitment with regional deployment for residency, self-hosted Llama or Mistral on your own infrastructure for full control, or specialty providers per use case (Anthropic for long-context summarization, smaller models for simple classification). Capability declaration matters because Now Assist features check the declared capability before attempting to use it.
Endpoint registration fields:
display_name, base_url, auth_type, credential_ref,
model_name, max_context_tokens, supports_tools (bool),
supports_vision (bool), region, fallback_endpoint_ref
Auth and Secrets
Credentials go in the platform’s credential store (sys_auth_credential or scoped equivalents), never hardcoded in connection records or scripts. Rotate on schedule appropriate to the credential type — API keys quarterly, OAuth client secrets annually with a 30-day overlap window. Audit access — BYO LLM credentials unlock expensive inference, and a compromised credential can create runaway bills before anyone notices. Restrict credential read to the integration service account; humans should never need to retrieve the credential after initial registration.
Fallback Strategy
BYO endpoints have availability considerations the SaaS default does not. Configure fallback to ServiceNow’s default model when yours is unavailable, slow, or rate-limited. Fallback degrades the feature (may lose fine-tuning specifics, may shift residency for the duration of the failure) but maintains continuity. Set the fallback policy explicitly per use case — for residency-required workflows, the fallback may need to be “fail closed and escalate” rather than “fail to default model.”
function callWithFallback(endpoint, payload) {
try {
return callEndpoint(endpoint, payload, {timeout: 8000});
} catch (e) {
if (endpoint.fallback_policy === 'fail_closed') throw e;
gs.warn('BYO endpoint failure, falling back: '+e);
return callEndpoint(endpoint.fallback_ref, payload, {timeout: 8000});
}
}
Monitoring
Track latency, error rate, and cost per call. BYO LLM costs often surprise — calls that were cheap in sampled testing end up 10x in production volume because the prompt grew or the context expanded. Set cost alerts before deploying broadly: per-day spend threshold, per-feature spend threshold, anomaly detection on call volume. The AI Control Tower exposes most of these natively in 2026 releases; on older releases, build the dashboards manually.
Cost Considerations
Token cost compounds in three ways: bigger prompts (more context retrieved into the prompt), more retries (network failures, rate limits), and more calls per session (multi-turn conversations and agentic playbooks). Cap per-conversation token budgets, cap retries explicitly, and instrument prompt length per feature. Expect production cost per call to be 2-5x higher than initial benchmark; budget accordingly and revisit monthly.
Common Failure Modes
Cross-region calls because the endpoint URL was pasted from the wrong region’s documentation — verify region match against the residency requirement at registration. Rate limits hit silently and Now Assist features degrade quietly — surface 429 rates as a first-class metric. Endpoint deprecation by the provider with no migration plan — track provider deprecation announcements and have a tested swap procedure ready, not a future intent.
Implementation Sequence
Register one endpoint for one feature on one user group. Validate latency, residency, and cost against the assumed model. Expand to additional features only after the pilot endpoint has been stable for 30 days. Multi-endpoint, multi-feature rollouts on day one produce too many simultaneous variables when something goes wrong.
What to do this week: pull the per-call cost and per-call latency for your existing Now Assist usage; if you have not measured these, you are not ready to BYO an endpoint yet.