
A fault path that emails the admin is not a recovery pattern. It is a confession. If your Flow can fail mid-transaction and you can’t replay it safely, you have a data integrity bug waiting to fire.

Here are the four patterns we use in production. Pick one based on the cost of duplicate writes versus the cost of missed writes.

The default fault path is broken by design

The auto-generated fault path in Flow Builder routes to a “Send Email Alert” element with the system error variable. The admin gets an email with the raw fault message, and the record sits in a half-applied state.

Three problems:

  • No retry. The record is poisoned until someone manually replays it.
  • No idempotency. If you do replay, you re-fire any side effects that already succeeded.
  • No quarantine. The failed record stays in the main object, polluting reports.

Pattern 1: Inline retry with exponential backoff (cheap operations only)

Use when the downstream is flaky but cheap to call (custom metadata read, internal Apex action).

[Subflow: RetryWithBackoff]
Inputs:
  - actionToCall (String, e.g. "InvokeWebhook")
  - payload (Apex-defined type)
  - maxAttempts (Number, default 3)
Variables:
  - attempt (Number, default 0)

Loop:
  Decision: hasAttemptsLeft? (attempt < maxAttempts)
    Yes -> Action: actionToCall
            Decision: success?
              Yes -> End
              No  -> Assignment: attempt = attempt + 1
                     Assignment: waitMs = (2 ^ attempt) * 1000
                     Wait Element: waitMs
                     (back to loop)
    No -> Quarantine path

Wait elements in screen flows are pause elements; in record-triggered flows you cannot wait inline, so wrap the retry in an Invocable Apex action. Apex has no sleep call, so the delay comes from System.enqueueJob: a Queueable makes the attempt, and its Finalizer re-enqueues the job with a delay when the attempt fails. More on that in the Queueable finalizer pattern.
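A minimal sketch of that Apex route, assuming a named credential called Vendor. WebhookRetryJob, VendorException and MAX_ATTEMPTS are illustrative names, not platform APIs. System.enqueueJob takes its delay in whole minutes from 0 to 10, so the backoff is far coarser than the millisecond arithmetic in the subflow outline.

```apex
// Sketch only: WebhookRetryJob, VendorException and MAX_ATTEMPTS are hypothetical names.
public class WebhookRetryJob implements Queueable, Finalizer, Database.AllowsCallouts {
  public class VendorException extends Exception {}

  private static final Integer MAX_ATTEMPTS = 3;
  private final String payloadJson;
  private final Integer attempt; // zero-based count of attempts already made

  public WebhookRetryJob(String payloadJson, Integer attempt) {
    this.payloadJson = payloadJson;
    this.attempt = attempt;
  }

  // Delay before a retry, in minutes: 1, 2, 4, 8, then capped at 10,
  // the largest delay System.enqueueJob accepts.
  public static Integer delayMinutes(Integer attempt) {
    return Math.min(10, 1 << Math.min(attempt, 4));
  }

  public void execute(QueueableContext ctx) {
    // Attach first so the finalizer still runs if the callout throws.
    System.attachFinalizer(this);
    HttpRequest req = new HttpRequest();
    req.setEndpoint('callout:Vendor/notify');
    req.setMethod('POST');
    req.setBody(payloadJson);
    HttpResponse res = new Http().send(req);
    if (res.getStatusCode() >= 300) {
      // An unhandled exception marks the job as failed for the finalizer.
      throw new VendorException('Vendor returned ' + res.getStatusCode());
    }
  }

  public void execute(FinalizerContext ctx) {
    if (ctx.getResult() == ParentJobResult.SUCCESS) {
      return;
    }
    Integer next = attempt + 1;
    if (next < MAX_ATTEMPTS) {
      System.enqueueJob(new WebhookRetryJob(payloadJson, next), delayMinutes(next));
    }
    // Otherwise the attempts are exhausted: hand the record to the quarantine path.
  }
}
```

The Invocable action only needs to call System.enqueueJob(new WebhookRetryJob(payload, 0)) and return. The Flow transaction commits without waiting on the vendor.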

Pattern 2: Idempotency key on the side-effect call

Any time a Flow calls an external system, generate an idempotency key and store it on the record before the call.

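public class WebhookCall {
  // Request and Result shapes are inferred from the fields used below.
  public class Request {
    @InvocableVariable(required=true) public Id recordId;
    @InvocableVariable(required=true) public Integer changeVersion;
  }
  public class Result {
    @InvocableVariable public Boolean success;
    @InvocableVariable public String body;
  }

  @InvocableMethod(label='Call Webhook with Idempotency' callout=true)
  public static List<Result> call(List<Request> reqs) {
    List<Result> results = new List<Result>();
    for (Request r : reqs) {
      // Same record and same version always yield the same key.
      String key = r.recordId + ':' + r.changeVersion;
      // Downstream MUST check this key and short-circuit if seen
      HttpRequest req = new HttpRequest();
      req.setEndpoint('callout:Vendor/notify');
      req.setMethod('POST');
      req.setHeader('Idempotency-Key', key);
      req.setHeader('Content-Type', 'application/json');
      req.setBody(JSON.serialize(r));
      HttpResponse res = new Http().send(req);
      Result out = new Result();
      // Any 2xx counts as delivered, not only 200.
      out.success = res.getStatusCode() >= 200 && res.getStatusCode() < 300;
      out.body = res.getBody();
      results.add(out);
    }
    return results;
  }
}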

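The Flow stores changeVersion on the record (an integer that increments on every save). If the call is replayed, the downstream sees the same key and no-ops. Capture the key before the first attempt and reuse it on every retry; a key recomputed after an intervening save carries a new changeVersion and slips past the check. You can now retry with abandon.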
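The contract the downstream has to honor is small enough to model in a few lines. This is a sketch in Apex for consistency with the rest of the article, not a vendor implementation: IdempotencyLedger is a hypothetical name, and a real receiver would keep seen keys in durable storage rather than in memory.

```apex
// Sketch of the receiver's side of the contract. IdempotencyLedger is hypothetical.
public class IdempotencyLedger {
  private final Set<String> seenKeys = new Set<String>();

  // Builds the key in the same recordId:changeVersion shape the caller sends.
  public static String keyFor(Id recordId, Integer changeVersion) {
    return recordId + ':' + changeVersion;
  }

  // Returns true only the first time a key is offered. A replay returns
  // false, and the receiver must skip the side effect.
  public Boolean claim(String key) {
    if (seenKeys.contains(key)) {
      return false;
    }
    seenKeys.add(key);
    return true;
  }
}
```

The first delivery of a given record and version claims the key and does the work. Every replay of that same version is refused, while a later version of the same record gets a fresh key and goes through.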

Pattern 3: Dead-letter quarantine object

The pattern that scales. Create a FlowQuarantine__c object with these fields:

  • SourceRecordId__c (Text 18)
  • FlowApiName__c (Text 80)
  • FaultMessage__c (Long Text 32K)
  • PayloadJson__c (Long Text 32K)
  • AttemptCount__c (Number)
  • Status__c (Picklist: New, Retrying, Resolved, Abandoned)
  • NextAttemptAt__c (Datetime)

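The fault path of every record-triggered Flow does exactly one thing: create a FlowQuarantine__c record with the serialized payload. A scheduled job picks them up every fifteen minutes, attempts a retry, and updates Status__c. Schedule-triggered Flows run at most once a day, so the fifteen-minute cadence needs scheduled Apex to launch the retry.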

The win: your main object is never in a half-applied state. The poison is isolated, queryable, and visible on a dashboard. Ops triages the quarantine queue like a Jira backlog.
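The retry sweep could look something like the sketch below. QuarantineSweeper, the batch size of 20, the cap of five attempts and the Flow input names recordId and payloadJson are all assumptions; the fields are the ones listed for FlowQuarantine__c. It assumes each quarantined Flow is autolaunched and can be replayed from those two inputs.

```apex
// Sketch only: class name, batch size, attempt cap and Flow input names are assumptions.
public class QuarantineSweeper implements Schedulable {
  private static final Integer MAX_ATTEMPTS = 5;

  // 15 minutes, doubled per attempt, capped at 240 minutes.
  public static Integer backoffMinutes(Integer attemptCount) {
    return 15 * (1 << Math.min(attemptCount, 4));
  }

  public void execute(SchedulableContext ctx) {
    Datetime cutoff = System.now();
    List<FlowQuarantine__c> due = [
      SELECT Id, SourceRecordId__c, FlowApiName__c, PayloadJson__c, AttemptCount__c
      FROM FlowQuarantine__c
      WHERE Status__c IN ('New', 'Retrying')
        AND (NextAttemptAt__c = null OR NextAttemptAt__c <= :cutoff)
      ORDER BY CreatedDate
      LIMIT 20
    ];
    for (FlowQuarantine__c q : due) {
      Integer attempts = (q.AttemptCount__c == null ? 0 : q.AttemptCount__c.intValue()) + 1;
      q.AttemptCount__c = attempts;
      // Savepoint so a replay that fails halfway leaves no partial writes behind.
      Savepoint sp = Database.setSavepoint();
      try {
        Map<String, Object> inputs = new Map<String, Object>{
          'recordId' => q.SourceRecordId__c,
          'payloadJson' => q.PayloadJson__c
        };
        Flow.Interview.createInterview(q.FlowApiName__c, inputs).start();
        q.Status__c = 'Resolved';
      } catch (Exception e) {
        Database.rollback(sp);
        q.Status__c = attempts >= MAX_ATTEMPTS ? 'Abandoned' : 'Retrying';
        q.FaultMessage__c = e.getMessage();
        q.NextAttemptAt__c = cutoff.addMinutes(backoffMinutes(attempts));
      }
    }
    // One DML statement for the whole batch, outside the loop.
    update due;
  }
}
```

Four System.schedule calls with the cron expressions '0 0 * * * ?', '0 15 * * * ?', '0 30 * * * ?' and '0 45 * * * ?' give the fifteen-minute cadence. A replayed Flow that makes callouts cannot run from this class as written; it would need to move into a Queueable that allows callouts.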

Pattern 4: Compensating action (when partial success is unavoidable)

Sometimes a Flow does two writes — one to Salesforce, one to an external system — and the second fails. You cannot roll back the external write. You must compensate.

Decision: ExternalCallSucceeded?
  No -> Update Record: rollback the Salesforce-side change
        Create Record: FlowQuarantine with status=Abandoned
        Notify oncall via Platform Event

The rollback path explicitly reverses what the success path did. Write the compensating logic at the same time as the forward logic. Never bolt it on later.
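When the compensation is more than one Update Records element, it can live in an Invocable action that the "No" branch calls. A minimal sketch under stated assumptions: CompensatingUpdate and the platform event FlowFailure__e (with a SourceRecordId__c text field) are hypothetical, the reverted field is assumed to hold text, and FlowQuarantine__c is the object from Pattern 3.

```apex
// Sketch only: CompensatingUpdate and FlowFailure__e are hypothetical names.
public class CompensatingUpdate {
  public class Request {
    @InvocableVariable(required=true) public Id recordId;
    @InvocableVariable(required=true) public String fieldName;
    @InvocableVariable public String priorValue; // value before the forward write
    @InvocableVariable public String flowApiName;
    @InvocableVariable public String faultMessage;
  }

  @InvocableMethod(label='Compensate Failed External Call')
  public static void compensate(List<Request> reqs) {
    List<SObject> restores = new List<SObject>();
    List<FlowQuarantine__c> quarantined = new List<FlowQuarantine__c>();
    List<FlowFailure__e> alerts = new List<FlowFailure__e>();
    for (Request r : reqs) {
      // Reverse exactly what the forward path wrote: put the old value back.
      SObject rec = r.recordId.getSObjectType().newSObject(r.recordId);
      rec.put(r.fieldName, r.priorValue);
      restores.add(rec);
      // Leave an audit trail so the failure shows up in the quarantine report.
      quarantined.add(new FlowQuarantine__c(
        SourceRecordId__c = String.valueOf(r.recordId),
        FlowApiName__c = r.flowApiName,
        FaultMessage__c = r.faultMessage,
        Status__c = 'Abandoned'
      ));
      alerts.add(new FlowFailure__e(SourceRecordId__c = String.valueOf(r.recordId)));
    }
    update restores;
    insert quarantined;
    EventBus.publish(alerts); // notifies on-call subscribers
  }
}
```

In a record-triggered Flow the prior value is available from $Record__Prior, so the forward path and its reversal read from the same source.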

What about Pause and Custom Errors?

Pause elements with platform events can do retry for free, but they have governor surprises:

  • Each paused interview counts against the org’s limit on paused and waiting interviews.
  • A flood of paused interviews can lock out new ones until the old ones resume or expire.
  • Auto-launched flows in transactions cannot Pause anyway.

Custom Errors (added in Winter ’24, expanded in Spring ’26) show a message to the user who is saving the record. That helps interactive saves but is useless for unattended record-triggered flows, where there is no user to read them.

UX note for screen flows

If your screen flow has a fault path, do NOT route to a generic “Something went wrong” screen. Show the user:

  1. What action failed (in plain English, not the system fault).
  2. Whether their data was saved (the answer is usually “yes, partially”).
  3. A “Retry” button that reruns only the failed step, using the idempotency key.

A trust-preserving error screen takes thirty extra minutes to build and saves a hundred support tickets.

Monitoring

Build a Quarantine Aging report — count of FlowQuarantine__c records by Status__c and age bucket. If New is growing faster than Resolved, your retry strategy is broken. If Abandoned is non-zero you have a data loss event to investigate.

SELECT FlowApiName__c, Status__c, COUNT(Id)
FROM FlowQuarantine__c
WHERE CreatedDate = LAST_N_DAYS:7
GROUP BY FlowApiName__c, Status__c
ORDER BY COUNT(Id) DESC
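The query above groups by status but not by age. One way to get the age buckets is a small Apex helper. QuarantineAging and the bucket boundaries (under a day, one to three days, older) are assumptions, not a standard.

```apex
// Sketch only: QuarantineAging and its bucket boundaries are assumptions.
public class QuarantineAging {
  // Age is measured in whole hours between creation and the as-of time.
  public static String bucketFor(Datetime created, Datetime asOf) {
    Long ageHours = (asOf.getTime() - created.getTime()) / 3600000L;
    if (ageHours < 24) {
      return '<1d';
    }
    if (ageHours < 72) {
      return '1-3d';
    }
    return '>3d';
  }

  // Counts keyed by "Status|bucket", for example "New|1-3d".
  public static Map<String, Integer> snapshot() {
    Map<String, Integer> counts = new Map<String, Integer>();
    Datetime asOf = System.now();
    for (FlowQuarantine__c q : [
      SELECT Status__c, CreatedDate
      FROM FlowQuarantine__c
      WHERE CreatedDate = LAST_N_DAYS:7
    ]) {
      String key = q.Status__c + '|' + bucketFor(q.CreatedDate, asOf);
      Integer current = counts.get(key);
      counts.put(key, current == null ? 1 : current + 1);
    }
    return counts;
  }
}
```

A non-zero count under any key starting with "Abandoned" is the data loss signal. "New" counts piling up in the older buckets mean the retries are not keeping up.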

Bottom line

  • Default Flow fault paths are confessions, not patterns.
  • Always pass an idempotency key when a Flow calls external systems.
  • A dead-letter quarantine object is the single highest-leverage investment for Flow reliability.
  • For unavoidable partial failures, write the compensating action at the same time as the forward path.
  • Measure quarantine aging weekly — it is the cleanest signal of integration health.