
A flow with 12,000 runs a day will hide its real failures inside thousands of “Succeeded with retry” rows. The default 28-day run history viewer is not built for triage at that volume. You need a method.

Stop scrolling, start filtering

The Power Automate portal lets you filter by status, trigger time, and identifier, but most teams ignore the URL parameters. Append ?statuses=Failed&triggerTimeFrom=2026-04-21T00:00:00Z to jump straight to a window. Save these URLs as bookmarks named after your top 10 production flows. Triage time drops from 15 minutes to 60 seconds.
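If you maintain more than a handful of bookmarks, generating them beats hand-editing timestamps. A minimal sketch, assuming the statuses and triggerTimeFrom parameters behave as described above; the base URL is a placeholder, so paste in the run-history address from your own browser.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def failed_runs_url(runs_page_url: str, since: datetime) -> str:
    """Append the failure filter described above to a run-history URL.

    `runs_page_url` is whatever address your browser shows for the flow's
    run history (a placeholder is used in the example below).
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    # Normalise to UTC and format like the example in the text.
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Keep ':' readable so the bookmark matches the hand-written form.
    query = urlencode({"statuses": "Failed", "triggerTimeFrom": stamp}, safe=":")
    return f"{runs_page_url}?{query}"


if __name__ == "__main__":
    # Placeholder address, not a real portal URL.
    base = "https://example.invalid/flows/my-flow/runs"
    print(failed_runs_url(base, datetime(2026, 4, 21, tzinfo=timezone.utc)))
```

Regenerate the bookmarks at the start of each on-call rotation so the window always starts at a recent date.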

For volumes above 1,000 runs/day, abandon the portal entirely. Pull workflowruns and flowsessions via the Dataverse API and ship them to Application Insights or a Log Analytics workspace. Kusto queries cluster failures by error.code in seconds.

Cluster errors before reading them

Eighty percent of flow failures trace back to five root causes. Group failures by error code and message substring before opening a single run. Common clusters:

  • 429 throttling on Dataverse or SharePoint connectors
  • ExpiredAuthenticationToken from connection references that lost consent
  • ScopeFailed inside a Try/Catch when an inner action returned 4xx
  • ContentTransferEncodingNotSupported from email attachments
  • RequestEntityTooLarge on file actions exceeding 100 MB

Tag each cluster with an owner and a known fix. New failures get triaged against the cluster table first.
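The cluster table can start as a few lines of code run against exported failure records. This is a sketch under assumptions: each record is a dict with code and message keys (hypothetical field names, adjust to your export), and the cluster labels are illustrative. The match strings come from the list above.

```python
from collections import Counter

# (cluster label, substrings that identify it). Labels are illustrative;
# the substrings are the error names from the list above.
KNOWN_CLUSTERS = [
    ("throttling", ("429",)),
    ("expired-auth", ("ExpiredAuthenticationToken",)),
    ("scope-failed", ("ScopeFailed",)),
    ("attachment-encoding", ("ContentTransferEncodingNotSupported",)),
    ("payload-too-large", ("RequestEntityTooLarge",)),
]


def cluster_key(code: str, message: str) -> str:
    """Return the first known cluster whose substring appears in code or message."""
    haystack = f"{code} {message}"
    for label, needles in KNOWN_CLUSTERS:
        if any(needle in haystack for needle in needles):
            return label
    # Unknown failures stay visible, grouped by their raw code.
    return f"unclassified:{code or 'unknown'}"


def cluster_failures(runs):
    """Count failed runs per cluster, largest cluster first."""
    counts = Counter(
        cluster_key(run.get("code", ""), run.get("message", "")) for run in runs
    )
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"code": "429", "message": "Rate limit exceeded"},
        {"code": "429", "message": "Too many requests"},
        {"code": "ExpiredAuthenticationToken", "message": "Consent revoked"},
    ]
    for label, count in cluster_failures(sample):
        print(label, count)
```

Substring matching is deliberately crude. A bare "429" can match unrelated numbers in a message, so tighten the patterns once you see real data. Anything that lands in an unclassified bucket is a candidate for a new row in the table.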

Isolate poison messages

When a flow loops on the same record forever, you have a poison message. The signature is identical inputs across consecutive failed runs, often spaced exactly at the retry interval (default 4 retries, exponential backoff). Add a Compose action at the top that hashes the trigger payload and writes the hash to a Dataverse flow_poisonqueue table. After three failures on the same hash, terminate with Failed and skip retries by setting retryPolicy.type to none in the inputs of each action on that branch. The retry policy is defined per action under inputs, not under runtimeConfiguration.
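The bookkeeping is easier to reason about outside the designer. The sketch below uses an in-memory dict as a stand-in for the flow_poisonqueue table and SHA-256 over canonical JSON as the hash. Both are assumptions for illustration, not something the platform prescribes.

```python
import hashlib
import json
from collections import defaultdict

POISON_THRESHOLD = 3  # "three failures on the same hash", per the text


def payload_hash(payload: dict) -> str:
    """Hash a trigger payload so that key order does not change the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PoisonTracker:
    """In-memory stand-in for the poison-queue table: hash -> failure count."""

    def __init__(self, threshold: int = POISON_THRESHOLD):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record_failure(self, payload: dict) -> bool:
        """Record one failed run; return True once the payload counts as poison."""
        key = payload_hash(payload)
        self._failures[key] += 1
        return self._failures[key] >= self.threshold

    def is_poison(self, payload: dict) -> bool:
        """Check a payload without recording anything."""
        return self._failures.get(payload_hash(payload), 0) >= self.threshold


if __name__ == "__main__":
    tracker = PoisonTracker()
    record = {"id": 7, "name": "Contoso"}
    for attempt in range(1, 4):
        print(attempt, tracker.record_failure(record))
```

Canonical JSON matters here. Two runs with the same fields in a different order should land on the same hash, or the counter never reaches the threshold.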

Kill silent retries

A retry that eventually succeeds masks a real problem. Retry policy is set per action, not on the enclosing scope, so configure retryPolicy explicitly in each action's settings rather than relying on defaults. For idempotent calls use count: 2, interval: PT10S. For non-idempotent ones (creating a record without a dedupe key) use type: none and handle the failure path yourself. Log every retry attempt to App Insights with the flow run ID so you can graph the silent failure rate.
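A small helper keeps the two policies consistent across flows and gives you the number to graph. This is a sketch: the text does not name the policy type for the idempotent case, so "fixed" is an assumption, and the status and retries fields on each run record are hypothetical names for whatever your logging emits.

```python
import json


def retry_policy(idempotent: bool) -> dict:
    """Build the retryPolicy object for an action.

    Idempotent calls get two short retries; everything else gets none,
    so the failure path has to be handled in the flow itself.
    """
    if idempotent:
        # "fixed" is assumed; the text only gives count and interval.
        return {"type": "fixed", "count": 2, "interval": "PT10S"}
    return {"type": "none"}


def silent_retry_rate(runs) -> float:
    """Share of runs that succeeded only after at least one retry."""
    if not runs:
        return 0.0
    silent = sum(
        1
        for run in runs
        if run["status"] == "Succeeded" and run.get("retries", 0) > 0
    )
    return silent / len(runs)


if __name__ == "__main__":
    print(json.dumps(retry_policy(idempotent=True)))
    print(json.dumps(retry_policy(idempotent=False)))
```

A rising silent retry rate on a flow that still reports green is usually the earliest warning you get.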

Build a weekly health view

Schedule a flow that runs Sundays at 23:00, queries the past week’s run history for your top 25 flows, and writes a summary to a Teams channel: total runs, failure rate, top three error clusters, longest-running action. This is your Monday morning standup input. It also catches drift before users notice.
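The summary itself is simple aggregation. A sketch of the four numbers, assuming each run record carries status, an optional cluster label, and the name and duration of its slowest action (all hypothetical field names). Posting the text to Teams is left out.

```python
from collections import Counter


def weekly_summary(runs):
    """Aggregate one flow's runs for the week into the four headline numbers."""
    total = len(runs)
    failed = [run for run in runs if run["status"] == "Failed"]
    clusters = Counter(run.get("cluster", "unclassified") for run in failed)
    slowest = max(runs, key=lambda run: run["slowest_seconds"], default=None)
    return {
        "total_runs": total,
        "failure_rate": len(failed) / total if total else 0.0,
        "top_clusters": clusters.most_common(3),
        "longest_action": (
            (slowest["slowest_action"], slowest["slowest_seconds"])
            if slowest
            else None
        ),
    }


def format_summary(flow_name: str, summary: dict) -> str:
    """Render the summary as one line of plain text for a channel post."""
    clusters = ", ".join(
        f"{name} ({count})" for name, count in summary["top_clusters"]
    ) or "none"
    longest = summary["longest_action"]
    longest_text = f"{longest[0]} ({longest[1]:.0f}s)" if longest else "n/a"
    return (
        f"{flow_name}: {summary['total_runs']} runs, "
        f"{summary['failure_rate']:.1%} failed, "
        f"top clusters: {clusters}, "
        f"longest action: {longest_text}"
    )
```

Keeping the output to one line per flow means 25 flows still fit in a single post that people will actually read.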

Document the runbook

For each flow, maintain a one-page runbook in the same solution: trigger, downstream systems, known failure modes, escalation contact, rollback steps. Link it from the flow description so the next on-call engineer is not learning at 2 AM.

What to do this week: pick your three highest-volume flows, build the bookmarked filter URLs, and add explicit retry policies to every HTTP and Dataverse action inside them.
