Dual-write Replication Lag: A Monitoring Playbook That Wakes You Up

Dual-write is the synchronous bridge between Finance and Operations and Dataverse. The word “synchronous” is misleading — under sustained write load, the queue accumulates and you discover three hours later that customer records updated in F&O are not yet in Dataverse, and sales reps are quoting against stale credit limits. The platform does not page you. You build that yourself.

What dual-write actually is

Each dual-write map is a paired set of mappings between an F&O table and a Dataverse table. Writes on either side hit a sync engine that propagates the change. The engine has a queue per direction. Latency under nominal load is a couple of seconds. Under burst load — month-end invoice posting, a bulk import — the queue grows.

The four lag signals

Initial sync running: a map is still in its first hydration. New writes do not propagate cleanly until it finishes.
Queue depth: pending changes per direction per map. Normal is < 100. Concerning is > 5000.
Error count: row-level failures sitting in the error queue. Each one blocks downstream rows on the same key.
End-to-end latency: a sample row written on one side, measured arriving on the other.

The Power Platform admin center surfaces the first three at a glance and lies about latency. To know real end-to-end latency, you write your own probe.
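The four signals can travel together as one snapshot per map and direction. What follows is a minimal sketch, not a platform API: the names MapHealth and classifyQueueDepth are hypothetical, and only the 100 and 5000 thresholds come from the list above. Calling the band between them "elevated" is an assumption.

```typescript
// Hypothetical shape for one health snapshot of a single map in one direction.
type Direction = 'FO_TO_DATAVERSE' | 'DATAVERSE_TO_FO';

interface MapHealth {
  mapName: string;
  direction: Direction;
  initialSyncRunning: boolean;
  queueDepth: number;                // pending changes in this direction
  errorCount: number;                // rows sitting in the error queue
  endToEndLatencyMs: number | null;  // null until a probe row has arrived
}

type QueueState = 'normal' | 'elevated' | 'concerning';

// Thresholds from the list above: under 100 is normal, over 5000 is concerning.
// Everything in between is treated as "elevated" (an assumption of this sketch).
function classifyQueueDepth(depth: number): QueueState {
  if (depth < 100) return 'normal';
  if (depth > 5000) return 'concerning';
  return 'elevated';
}
```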

The probe pattern

Create a paired probe table on both sides, exposed as the CdmDualWriteProbes data entity in F&O and as the cdm_dwprobe table in Dataverse, with three columns: probe_id (string), written_at_source (datetime), written_at_sink (datetime). A scheduled Azure Function writes a row to the F&O side every 60 seconds with written_at_source set to the current time. A Dataverse plugin on Create stamps written_at_sink with the current time when the row arrives. A second scheduled job reads the latest probe and computes lag = written_at_sink - written_at_source.

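// Azure Function: write a probe row to F&O once a minute
import { app } from '@azure/functions';
import { randomUUID } from 'node:crypto';

// FO_BASE (the F&O environment URL) and getToken() (returns an access token
// for that environment) are assumed to be defined elsewhere in the app.

app.timer('dualwriteProbe', {
  schedule: '0 */1 * * * *', // NCRONTAB with seconds: second 0 of every minute
  handler: async () => {
    const probe = {
      probe_id: randomUUID(),
      written_at_source: new Date().toISOString()
    };
    const res = await fetch(`${FO_BASE}/data/CdmDualWriteProbes`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${await getToken()}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(probe)
    });
    // fetch does not reject on HTTP errors, so fail the run explicitly.
    if (!res.ok) {
      throw new Error(`Probe write failed: ${res.status} ${res.statusText}`);
    }
  }
});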

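// Dataverse plugin: stamp arrival (plugins are .NET classes, so this one is C#)
using System;
using Microsoft.Xrm.Sdk;

public class StampProbeArrival : IPlugin
{
    public void Execute(IServiceProvider serviceProvider)
    {
        var ctx = (IPluginExecutionContext)serviceProvider.GetService(typeof(IPluginExecutionContext));
        if (!ctx.InputParameters.Contains("Target")) return;
        if (!(ctx.InputParameters["Target"] is Entity target)) return;
        if (target.LogicalName != "cdm_dwprobe") return;
        target["written_at_sink"] = DateTime.UtcNow; // UTC, to compare with the ISO source stamp
    }
}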

The plugin runs in the PreOperation stage of Create so the stamp lands in the same transaction.
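The second scheduled job, the one that computes lag, is not shown above. A minimal sketch of its arithmetic follows, assuming the probe row comes back with ISO 8601 timestamps. The names ArrivedProbe, arrivalLagSeconds and effectiveLagSeconds are hypothetical. The "silence" term is an addition of this sketch rather than part of the pattern as described: if the sync is wedged, the newest arrived probe still shows a small lag, so the sketch also counts how long it has been since a newer probe should have arrived.

```typescript
// Hypothetical row shape, using the probe table's column names.
interface ArrivedProbe {
  probe_id: string;
  written_at_source: string; // ISO 8601, set by the probe writer
  written_at_sink: string;   // ISO 8601, set by the plugin on arrival
}

function parseMs(value: string, label: string): number {
  const ms = Date.parse(value);
  if (Number.isNaN(ms)) throw new Error(`Unparseable ${label}: ${value}`);
  return ms;
}

// lag = written_at_sink - written_at_source, in seconds.
// Clamped at zero because the two stamps come from different clocks.
function arrivalLagSeconds(probe: ArrivedProbe): number {
  const source = parseMs(probe.written_at_source, 'written_at_source');
  const sink = parseMs(probe.written_at_sink, 'written_at_sink');
  return Math.max(0, (sink - source) / 1000);
}

// Also accounts for probes that should have arrived by now but have not.
function effectiveLagSeconds(
  latestArrived: ArrivedProbe,
  now: Date,
  probeIntervalSeconds = 60
): number {
  const source = parseMs(latestArrived.written_at_source, 'written_at_source');
  const silence = (now.getTime() - source) / 1000 - probeIntervalSeconds;
  return Math.max(arrivalLagSeconds(latestArrived), silence);
}
```

With a 60-second probe interval, a newest-arrived probe that was written ten minutes ago yields an effective lag of 540 seconds even if that probe itself arrived in two.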

Alert thresholds that map to user experience

Warning at 30 seconds: sales reps notice this when they refresh a record after editing F&O.
Page at 5 minutes: bulk sync is happening or a map is wedged. Either way, on-call eyeballs.
Page hard at 30 minutes: data is materially drifting. Stop dependent automation if you have a kill switch.
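The three thresholds can be encoded as one small function, so the alert rule and the runbook agree on what each level means. This is a sketch with hypothetical names (Severity, classifyLag). Treating each boundary as inclusive is an assumption.

```typescript
type Severity = 'ok' | 'warning' | 'page' | 'page_hard';

// Thresholds from the list above, in seconds.
const THRESHOLDS_SECONDS = {
  warning: 30,
  page: 5 * 60,
  pageHard: 30 * 60
} as const;

// Check the highest threshold first so a 40-minute lag is not reported as a warning.
function classifyLag(lagSeconds: number): Severity {
  if (lagSeconds >= THRESHOLDS_SECONDS.pageHard) return 'page_hard';
  if (lagSeconds >= THRESHOLDS_SECONDS.page) return 'page';
  if (lagSeconds >= THRESHOLDS_SECONDS.warning) return 'warning';
  return 'ok';
}
```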

Push the lag metric to Application Insights as a custom metric, then alert from there. Do not use Power Platform monitoring alerts for this — the granularity is too coarse.
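One way to keep the metric push testable is to write it against a narrow interface rather than the SDK directly. In the sketch below, MetricSink, reportLag and the metric name dualwrite_lag_seconds are hypothetical. The Node.js applicationinsights package exposes a trackMetric method on its telemetry client that takes an object with name and value, so that client should satisfy this interface. Verify this against the SDK version you run.

```typescript
// The slice of a telemetry client this sketch depends on.
interface MetricSink {
  trackMetric(telemetry: {
    name: string;
    value: number;
    properties?: Record<string, string>;
  }): void;
}

// Emits one lag sample, tagged with the map it was measured on.
function reportLag(sink: MetricSink, mapName: string, lagSeconds: number): void {
  sink.trackMetric({
    name: 'dualwrite_lag_seconds', // hypothetical metric name
    value: lagSeconds,
    properties: { map: mapName }
  });
}
```

Tagging each sample with the map name lets one alert rule cover every map while still showing which map is lagging.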

When you see lag, what is wrong

Three buckets of causes:

Sustained write storm: bulk import or background job is producing more writes than the engine can drain. Throttle the source.
A poisoned row: one row in the error queue is blocking propagation on its key. Find it in the error log, fix or skip, drain proceeds.
Map stopped: someone disabled or reconfigured a map. Initial sync runs again, and writes do not propagate cleanly until it finishes.

The error queue is what you most often hit. The errors are clear once you find them. The problem is finding them — the admin UI lists errors per map but does not group by error message. We export to a SQL table and group there. The same five SKUs cause 90% of errors.
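The grouping itself is simple once the errors are exported. We do it in SQL. The sketch below shows the same idea in TypeScript for an in-memory export. SyncError and its fields are a hypothetical shape, not the platform's export format.

```typescript
// Hypothetical shape of one exported error row.
interface SyncError {
  mapName: string;
  recordKey: string;
  message: string;
}

interface ErrorGroup {
  message: string;
  count: number;
  sampleKeys: string[]; // a few record keys to start the investigation
}

// Groups errors by message and returns the groups, most frequent first.
function groupByMessage(errors: SyncError[], maxSamples = 3): ErrorGroup[] {
  const groups = new Map<string, ErrorGroup>();
  for (const e of errors) {
    const message = e.message.trim();
    let group = groups.get(message);
    if (!group) {
      group = { message, count: 0, sampleKeys: [] };
      groups.set(message, group);
    }
    group.count += 1;
    if (group.sampleKeys.length < maxSamples) group.sampleKeys.push(e.recordKey);
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}
```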

The error pattern we see most

A required field on the Dataverse side that is nullable on the F&O side. F&O writes null, Dataverse rejects. The fix is one of:

Make the Dataverse column nullable.
Add a default in the dual-write map.
Filter the row at the map level.

Pick the third option only when the row is genuinely irrelevant. Otherwise you create silent drift.
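This class of mismatch can be caught before it reaches the error queue by comparing column metadata for each mapped pair. The sketch below assumes you have already assembled that metadata yourself. ColumnPair and findNullabilityMismatches are hypothetical names, and nothing here calls a real metadata API.

```typescript
// Hypothetical description of one mapped column pair.
interface ColumnPair {
  foField: string;
  dataverseColumn: string;
  foNullable: boolean;        // F&O can write null for this field
  dataverseRequired: boolean; // Dataverse rejects null for this column
  mapHasDefault: boolean;     // the dual-write map supplies a default value
}

// Returns the pairs where F&O may send null, Dataverse will reject it,
// and the map does nothing to bridge the gap.
function findNullabilityMismatches(pairs: ColumnPair[]): ColumnPair[] {
  return pairs.filter(
    (p) => p.foNullable && p.dataverseRequired && !p.mapHasDefault
  );
}
```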

Network considerations

Dual-write traffic goes through the Power Platform network plane, not your VNet. Outbound proxies and firewalls do not see it. But the F&O side does run an outbound webhook for change events, and that path can be throttled by your egress controls if you have customized them. Check EgressFirewallLog in F&O if you see one-sided lag (writes propagate F&O → Dataverse fine, Dataverse → F&O fails).

The kill switch

When lag exceeds the page-hard threshold, automation that depends on cross-side consistency must stop. Build a flag in Dataverse (cdm_systemstatus.dualwrite_active) and have every dependent Power Automate flow check it as its first step. When the probe-based alert fires, your runbook sets the flag to false. Downstream automation pauses, and no orchestrator dies on a stale read.
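Automation written in code rather than Power Automate can apply the same check with a small guard. In the sketch below, runIfDualWriteActive and GuardResult are hypothetical names, and readFlag stands in for however you read the flag from Dataverse. Failing closed when the flag cannot be read is a design choice of this sketch, made on the assumption that an unreadable flag is itself a sign of trouble.

```typescript
type GuardResult<T> =
  | { ran: true; value: T }
  | { ran: false; reason: string };

// Runs `action` only when the kill-switch flag says dual-write is healthy.
async function runIfDualWriteActive<T>(
  readFlag: () => Promise<boolean>,
  action: () => Promise<T>
): Promise<GuardResult<T>> {
  let active: boolean;
  try {
    active = await readFlag();
  } catch {
    // Fail closed: an unreadable flag is treated the same as a false one.
    return { ran: false, reason: 'flag unreadable' };
  }
  if (!active) {
    return { ran: false, reason: 'dualwrite_active is false' };
  }
  return { ran: true, value: await action() };
}
```

Returning a result instead of throwing lets the caller log the skip and exit cleanly, which keeps a paused run from showing up as a failure.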

Pixel notes

Build a tiny model-driven dashboard that shows current lag, queue depth, and error count per map. Three widgets, refreshes every 60 seconds. The admins love it because the platform’s built-in view requires four clicks to surface the same data. Visibility is a forcing function for ownership.

Read also

For solution boundaries that constrain dual-write maps, see Dependency hell in solutions. Maps are themselves solution components and inherit the same hazards.

Key takeaways

The admin center does not show end-to-end latency. Build a probe.
Push lag as a custom metric, alert at 30s / 5m / 30m.
Most errors are field-level mismatches; export the error log and group.
Build a kill switch flag for dependent automation.
One-sided lag often means egress firewall, not the sync engine.
