
Your AWS spoke is calling a vendor endpoint that returns HTTP 503 for two hours. IntegrationHub retries with backoff, fills the scheduled job queue, and now every flow on the instance is delayed. Welcome to the cascade. The fix is a circuit breaker, and the platform does not give you one out of the box.

The pattern in three states

A breaker has three states: closed (calls go through), open (calls fail fast), and half-open (one trial call to test recovery). Implement it as a scoped Script Include backed by a simple state table.

Table: u_integration_breaker
Columns:
  endpoint (string)
  state (closed | open | half_open)
  failure_count (integer)
  opened_at (datetime)
  cooldown_seconds (integer, default 300)

The breaker Script Include

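var Breaker = Class.create();
Breaker.prototype = {
  // Class.create() calls initialize() on construction, so it must exist.
  initialize: function() {},

  // Returns true when the call to this endpoint may proceed.
  shouldExecute: function(endpoint) {
    var gr = new GlideRecord('u_integration_breaker');
    // No breaker row for this endpoint: nothing to gate on.
    if (!gr.get('endpoint', endpoint)) return true;

    if (gr.getValue('state') == 'open') {
      // Compare epoch milliseconds. gs.nowDateTime() returns a display string,
      // so subtracting a date field from it does not yield a duration.
      var openedAt = new GlideDateTime(gr.getValue('opened_at'));
      var nowMs = new GlideDateTime().getNumericValue();
      var elapsed = (nowMs - openedAt.getNumericValue()) / 1000;
      var cooldown = parseInt(gr.getValue('cooldown_seconds'), 10) || 300;
      if (elapsed >= cooldown) {
        // Cooldown is over: move to half-open and let a trial call through.
        gr.setValue('state', 'half_open');
        gr.update();
        return true;
      }
      return false;
    }
    // closed or half_open: let the call through.
    return true;
  },

  recordResult: function(endpoint, success) {
    // standard breaker state machine
  },

  type: 'Breaker'
};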
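
The recordResult stub leaves the state machine to the reader. As a rough sketch of what "standard" could mean here, the plain JavaScript below applies one call outcome to an in-memory object that mirrors a u_integration_breaker row. The helper name newBreakerRow, the failureThreshold and nowMs parameters, and the choice to count consecutive failures are assumptions made for illustration; inside a Script Include the same transitions would read and write the GlideRecord instead.

```javascript
// In-memory stand-in for one u_integration_breaker row (hypothetical shape).
function newBreakerRow(endpoint) {
  return { endpoint: endpoint, state: 'closed', failure_count: 0, opened_at: null };
}

// Apply the outcome of one call to the row and return the resulting state.
// failureThreshold and nowMs are passed in so the function is easy to test.
function recordResult(row, success, failureThreshold, nowMs) {
  if (success) {
    // Any success, including a half-open trial, closes the breaker.
    row.state = 'closed';
    row.failure_count = 0;
    row.opened_at = null;
    return row.state;
  }
  if (row.state === 'open') {
    // Calls should be skipped while open; ignore stray failures
    // so they do not restart the cooldown.
    return row.state;
  }
  if (row.state === 'half_open') {
    // The trial call failed: reopen and restart the cooldown.
    row.state = 'open';
    row.opened_at = nowMs;
    return row.state;
  }
  // Closed: count consecutive failures and trip at the threshold.
  row.failure_count += 1;
  if (row.failure_count >= failureThreshold) {
    row.state = 'open';
    row.opened_at = nowMs;
  }
  return row.state;
}
```

With a threshold of 3, the third consecutive failure opens the breaker, and a single success in half-open closes it again and resets the count.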

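Make new Breaker().shouldExecute(endpoint) the first step of every IntegrationHub Action that calls an external endpoint, and skip the call when it returns false. After each call that does run, report the outcome through recordResult() so the breaker can change state.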
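
One way to keep the check and the result reporting together is a small wrapper. The sketch below is plain JavaScript, not an IntegrationHub API: callThroughBreaker and the shape of its return value are hypothetical, and breaker stands for any object that exposes shouldExecute(endpoint) and recordResult(endpoint, success).

```javascript
// Hypothetical wrapper: gate one outbound call behind the breaker.
// doCall is a function that performs the call and throws on failure.
function callThroughBreaker(breaker, endpoint, doCall) {
  if (!breaker.shouldExecute(endpoint)) {
    // Breaker is open: fail fast without touching the vendor.
    return { skipped: true, result: null };
  }
  try {
    var result = doCall();
    breaker.recordResult(endpoint, true);
    return { skipped: false, result: result };
  } catch (e) {
    // Report the failure, then let the caller see the original error.
    breaker.recordResult(endpoint, false);
    throw e;
  }
}
```

The caller decides what a skipped call means for the flow, for example queuing the work for later or returning a cached value.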

Tune the thresholds per endpoint

A payments API and a sales-tax lookup have different acceptable failure rates. Per-endpoint config lives in u_integration_breaker_config:

endpoint: vendor.payments.v2
failure_threshold: 5
cooldown_seconds: 600
half_open_trial_calls: 1

Document the rationale for each threshold in a short ADR.
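
Endpoints without a config row still need sensible values. A minimal sketch of that lookup, assuming config rows have already been read into plain objects: resolveConfig and DEFAULTS are hypothetical names, the default failure_threshold of 5 is an assumption, and the 300-second cooldown mirrors the column default on the state table.

```javascript
// Fallback values for endpoints with no row in u_integration_breaker_config.
// failure_threshold: 5 is an assumed default, not a platform value.
var DEFAULTS = { failure_threshold: 5, cooldown_seconds: 300, half_open_trial_calls: 1 };

// Merge the matching config row (if any) over the defaults.
function resolveConfig(configRows, endpoint) {
  var merged = { endpoint: endpoint };
  Object.keys(DEFAULTS).forEach(function(key) {
    merged[key] = DEFAULTS[key];
  });
  configRows.forEach(function(row) {
    if (row.endpoint !== endpoint) return;
    Object.keys(row).forEach(function(key) {
      merged[key] = row[key];
    });
  });
  return merged;
}
```

Calling resolveConfig(rows, 'vendor.payments.v2') with the row above yields a 600-second cooldown, while an endpoint with no row falls back to 300.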

Observe the breaker, not the endpoint

The metric your operations team should watch is a Performance Analytics indicator for the percentage of time each breaker spent in the open state over the last 7 days. A breaker that is open more than 2% of the time means the vendor SLA needs renegotiation.
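
To make the arithmetic behind that indicator concrete, here is a sketch that computes the percentage from a list of state transitions. It assumes a hypothetical transition log (the state table above only stores the current state), counts only the open state and not half-open, and uses made-up names throughout; it is not how Performance Analytics computes indicators.

```javascript
// transitions: [{ at: epochMs, state: 'open' | 'closed' | 'half_open' }], sorted by at.
// initialState is the breaker state at windowStartMs if no earlier transition is listed.
function percentTimeOpen(transitions, windowStartMs, windowEndMs, initialState) {
  var openMs = 0;
  var state = initialState;
  var cursor = windowStartMs;
  for (var i = 0; i < transitions.length; i++) {
    var t = transitions[i];
    if (t.at <= windowStartMs) {
      // Transition before the window only sets the starting state.
      state = t.state;
      continue;
    }
    if (t.at >= windowEndMs) break;
    if (state === 'open') openMs += t.at - cursor;
    state = t.state;
    cursor = t.at;
  }
  // Account for the tail of the window.
  if (state === 'open') openMs += windowEndMs - cursor;
  return (100 * openMs) / (windowEndMs - windowStartMs);
}
```

Over a 7-day window (168 hours), the 2% line sits at roughly 3.4 hours of cumulative open time, so a single two-hour outage stays under it and a four-hour one does not.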

What you must not do

Do not put the breaker logic inline in every flow. Centralize it. The breaker only works if every call path goes through the same gate.

What to do this week

Pick your three most-called external endpoints. Wire the breaker Script Include in front of each. Watch the next outage stay contained instead of cascading.
