CRM Failover Runbook 2026: When the Vendor Goes Down

Your CRM vendor’s SLA is 99.9%. That allows 8 hours and 45 minutes of downtime per year, and 2025 proved they’re willing to use every minute. Salesforce had a four-hour Hyperforce regional event in July. Dynamics 365 had two multi-hour Power Platform incidents. HubSpot had a backend degradation that stretched into a full day. The runbook below is what separates revenue teams that kept selling from teams that watched.
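The 8h45m figure is simply 0.1% of a year. A minimal sketch of that arithmetic, where `downtime_budget` is an illustrative helper name and a 365-day year is assumed:

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta = timedelta(days=365)) -> timedelta:
    """Downtime an SLA permits over a period (defaults to a 365-day year)."""
    return period * (1 - sla_percent / 100)


budget = downtime_budget(99.9)
hours, remainder = divmod(round(budget.total_seconds()), 3600)
print(f"{hours}h {remainder // 60}m")
```

Passing `timedelta(days=30)` as the period gives the monthly budget, which is the window most SLA credit clauses measure.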

What “down” actually means

There are four failure modes, each with a different runbook:

Total outage — vendor status page red, all APIs return 503.
Regional outage — one geography down, others fine.
Partial degradation — APIs slow, UI flaky, writes intermittently failing.
Silent corruption — writes accepted but not persisted, or persisted to the wrong region.

Most actual incidents are the third and fourth modes: partial degradation and silent corruption. Your runbook should cover all four.
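One way to make the four modes actionable is to classify every probe round into one of them. The sketch below is an assumption about how that could look: `ProbeResult`, its fields, and the 2-second slowness threshold are all hypothetical, and `readback_ok` presumes your canary write is read back after it is accepted.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    HEALTHY = "healthy"
    TOTAL_OUTAGE = "total outage"
    REGIONAL_OUTAGE = "regional outage"
    PARTIAL_DEGRADATION = "partial degradation"
    SILENT_CORRUPTION = "silent corruption"


@dataclass
class ProbeResult:
    region: str
    read_ok: bool
    write_ok: bool
    readback_ok: bool  # canary write was visible when read back
    latency_s: float


def classify(results: list[ProbeResult], slow_s: float = 2.0) -> FailureMode:
    """Map one round of per-region probe results to a failure mode."""
    down = [r for r in results if not r.read_ok and not r.write_ok]
    # With a single probed region, "all regions down" and "one region down"
    # are indistinguishable; this sketch reports it as a total outage.
    if results and len(down) == len(results):
        return FailureMode.TOTAL_OUTAGE
    if down:
        return FailureMode.REGIONAL_OUTAGE
    # A write that reports success but cannot be read back is the worst case.
    if any(r.write_ok and not r.readback_ok for r in results):
        return FailureMode.SILENT_CORRUPTION
    if any(not r.read_ok or not r.write_ok or r.latency_s > slow_s for r in results):
        return FailureMode.PARTIAL_DEGRADATION
    return FailureMode.HEALTHY
```

The silent-corruption check runs before the degradation check on purpose: a region that is both slow and losing writes should be reported as the more dangerous of the two.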

The pre-incident posture

Failover starts before the incident. You need, in place and tested:

Multi-region active CRM tenant (where the vendor supports it) or a documented read-only shadow.
Local write buffer that queues mutations during outage.
Caches of hot read data (account, contact, open opportunity) that survive vendor downtime.
Communication tree: who calls whom, in what order, in what channel.
Manual workflow fallbacks — pen-and-paper sales flows, email-only support intake, status page templates ready.

Vendors won’t sell you “active-active CRM across regions” because they can’t deliver it reliably. What they sell is “tenant-pinned-to-region with documented DR.” That’s not failover. That’s hope.

The architecture that survives

# resilience_architecture.yaml
client_apps:
  - sales_app:
      primary: salesforce.us-east
      cache: redis_local         # 5-min TTL on hot reads
      write_buffer: kafka_local  # durable, replayable
      fallback_ui: minimal_offline_mode
  - service_console:
      primary: salesforce.us-east
      readonly_replica: warehouse_postgres  # CDC'd, 90s lag
      ticket_intake: kafka_local

data_replication:
  cdc_source: salesforce.us-east
  cdc_target: warehouse_postgres  # operational + analytic
  lag_alert_threshold_s: 120

write_path:
  online:
    queue: kafka_local
    consumer: cdc_writer
    target: salesforce.us-east
  offline_degraded:
    queue: kafka_local
    consumer: paused
    backlog_retention_h: 72

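The non-obvious move: even in steady state, route writes through a local durable buffer. The hot path is then the same on a normal day as during an outage, so failover needs no traffic switch, and resilience is bounded only by buffer retention (72h in the config above).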
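To make the always-buffer idea concrete, here is a minimal sketch in which SQLite stands in for `kafka_local`. The class and method names are hypothetical, and the default `:memory:` path is only for demonstration; a real buffer would point at a file or a broker so it survives restarts.

```python
import json
import sqlite3
from typing import Callable


class LocalWriteBuffer:
    """Ordered queue of pending CRM writes (SQLite stands in for Kafka)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def submit(self, write: dict) -> None:
        # Every write lands here first, whether or not the CRM is up.
        self.db.execute("INSERT INTO pending (body) VALUES (?)", (json.dumps(write),))
        self.db.commit()

    def drain(self, send: Callable[[dict], None], crm_healthy: bool) -> int:
        """Forward queued writes in submission order; return how many were sent."""
        if not crm_healthy:
            return 0  # consumer paused; the backlog keeps accumulating
        rows = self.db.execute("SELECT seq, body FROM pending ORDER BY seq").fetchall()
        sent = 0
        for seq, body in rows:
            send(json.loads(body))  # if this raises, the row stays queued
            self.db.execute("DELETE FROM pending WHERE seq = ?", (seq,))
            self.db.commit()
            sent += 1
        return sent
```

Deleting a row only after `send` returns gives at-least-once delivery, which is why the replay side has to be idempotent.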

Detection: don’t trust the status page

Vendor status pages often lag the actual incident by 15–45 minutes. Build your own detection:

#!/usr/bin/env bash
# health_check.sh
# Intended to run every 30s; the caller alerts on 3 consecutive failures.
# Expects TOKEN, INSTANCE, and PROBE_ID in the environment.
sf_probe() {
  # read path: describe call on Account
  curl -fsS --max-time 5 \
    -H "Authorization: Bearer ${TOKEN}" \
    "https://${INSTANCE}.salesforce.com/services/data/v60.0/sobjects/Account/describe" \
    > /dev/null
}
mutate_probe() {
  # write path: canary write to a known sandbox account
  curl -fsS --max-time 10 \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -X PATCH -d '{"Description":"probe-'"$(date +%s)"'"}' \
    "https://${INSTANCE}.salesforce.com/services/data/v60.0/sobjects/Account/${PROBE_ID}" \
    > /dev/null
}

# exit status is non-zero if either probe fails
sf_probe && mutate_probe

Probe both read and write paths. Many outages let reads through and silently drop writes, and the status page may not say so for an hour.
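The "3 consecutive failures" rule is a small piece of state that sits around the probes. A minimal sketch, assuming the probes are exposed as callables that return `True` on success; `ProbeMonitor` and its `alert` hook are hypothetical names, not part of any monitoring product:

```python
from typing import Callable


class ProbeMonitor:
    """Fire one alert when probe rounds fail `threshold` times in a row."""

    def __init__(
        self,
        read_probe: Callable[[], bool],
        write_probe: Callable[[], bool],
        alert: Callable[[str], None],
        threshold: int = 3,
    ) -> None:
        self.read_probe = read_probe
        self.write_probe = write_probe
        self.alert = alert
        self.threshold = threshold
        self.failures = 0

    def tick(self) -> bool:
        """Run one probe round; return True if the round was healthy."""
        # A round is healthy only if both paths succeed. The write probe is
        # skipped when the read probe has already failed.
        if self.read_probe() and self.write_probe():
            self.failures = 0
            return True
        self.failures += 1
        # Alert exactly once, on the round that crosses the threshold.
        if self.failures == self.threshold:
            self.alert(f"CRM probes failed {self.failures} rounds in a row")
        return False
```

Calling `tick()` every 30 seconds from a scheduler means an alert lands 60 to 90 seconds after the first failed round. Resetting the counter on any healthy round keeps a single flaky request from paging anyone.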

The incident phases

T+0 (detection). Probes red. Alert fires. On-call confirms — is this us, or them? Check vendor status page, peers in vendor community, and recent deploy log.

T+5 (declaration). If confirmed vendor, declare incident. Page revenue ops, customer support lead, comms. Switch the war-room channel on.

T+10 (degradation mode). Apps switch to fallback UI. Writes buffer locally. Reads serve from cache + warehouse replica. Customer-facing teams pivot to email + phone scripts.

T+30 (extended outage protocol). If still down: enable manual workflows. Pen-and-paper deal capture for AEs. Email-only intake for support. Comms to customers (status page, banner in app).

T+? (recovery). Vendor reports green. Run health probes. If clean, drain write buffer in order. Reconcile against vendor for duplicates. Lift degradation UI. Post-mortem within 5 business days.

The write buffer is the load-bearing piece

Without a durable write buffer, every write during the outage is lost. Salespeople type into a UI that throws errors, and they give up. With a buffer, writes succeed locally and replay on recovery.

The buffer must:

Persist durably (Kafka, SQS, durable Redis stream — not in-memory).
Preserve order per entity (Account A updates apply in submission order).
Be idempotent on replay (use client-side request IDs).
Have a backlog retention of at least 72h (real outages can run long).
Be drainable manually if vendor doesn’t recover cleanly.

Replay order matters: if a Contact and the Account it belongs to are both new, replay Account first or you get foreign-key failures.
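The ordering and idempotency requirements can be sketched together. Everything here is illustrative: `BufferedWrite`, the `OBJECT_RANK` table, and the `apply` callable (a stand-in for the real CRM write) are assumptions, and `applied_ids` would need to live in durable storage rather than in a Python set.

```python
from dataclasses import dataclass
from typing import Callable

# Parents replay before children so a new Contact finds its Account.
# The ranking is an assumption; extend it for your own object model.
OBJECT_RANK = {"Account": 0, "Contact": 1, "Opportunity": 1}


@dataclass
class BufferedWrite:
    request_id: str   # client-generated, unique per write
    object_type: str
    entity_id: str
    seq: int          # submission order in the buffer
    payload: dict


def replay(
    writes: list[BufferedWrite],
    apply: Callable[[BufferedWrite], None],
    applied_ids: set[str],
) -> list[str]:
    """Replay buffered writes; return the request IDs applied in this run."""
    # Sort by object rank, then by submission order. An entity has one
    # object type, so its own writes keep their original relative order.
    ordered = sorted(writes, key=lambda w: (OBJECT_RANK.get(w.object_type, 99), w.seq))
    done = []
    for w in ordered:
        if w.request_id in applied_ids:
            continue  # already applied in an earlier drain
        apply(w)      # an exception here stops the replay with w unmarked
        applied_ids.add(w.request_id)
        done.append(w.request_id)
    return done
```

Running `replay` twice over the same backlog applies nothing the second time, which is the property that makes a manual re-drain safe after a messy recovery.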

Read fallback: what users see

When the CRM is down, sales and service still need to function. Options:

Warehouse replica. Read-only Snowflake / BigQuery / Postgres replica CDC’d from CRM. 60–120s lag is normal. Good enough for “show me my pipeline” but not for live editing.
Local cache. Redis or browser-side IndexedDB with the user’s hot records. Fastest fallback, smallest coverage.
Email-as-intake. For support: customer emails forwarded to a parsed queue, ticketified on recovery. Crude. Works.
Print-and-fax mode. Sounds absurd, used by call centers in 2025. Have the template ready.
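The first two options compose into a fallback chain for reads. A minimal sketch, assuming each source is wrapped in a fetch function that returns `None` on a miss and raises when the source is unreachable; the function name and source labels are hypothetical:

```python
from typing import Callable, Optional

Fetch = Callable[[str], Optional[dict]]


def read_with_fallback(
    record_id: str, sources: list[tuple[str, Fetch]]
) -> tuple[Optional[dict], str]:
    """Try each (name, fetch) source in order; return (record, source_name)."""
    for name, fetch in sources:
        try:
            record = fetch(record_id)
        except Exception:
            # Sketch-level handling: treat any error as "source is down".
            continue
        if record is not None:
            return record, name
    return None, "unavailable"
```

Returning the source name alongside the record lets the fallback UI label data as cached or replicated, so nobody mistakes a 90-second-old pipeline view for live state.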

Multi-region: what’s actually possible

Vendor	Multi-region claim	Reality
Salesforce Hyperforce	“available in 20+ regions”	One tenant pinned to one region. A cross-region read replica via Data Cloud is the best you get.
Dynamics 365	Geo redundancy via Azure	Pair-region failover, customer-triggered, slow (hours).
HubSpot	Single region per portal	No customer-controlled multi-region.
ServiceNow	Paired DR	Standby instance, customer-initiated failover (RTO measured in hours).
Zoho	Region pinned	No customer-controlled DR.
Microsoft Sales Copilot	Tenant region	Same as Dynamics.

None offer active-active. Several offer some form of paired DR; HubSpot and Zoho give you nothing you control. Where it exists, the DR is contractual, not architectural: you ask, the vendor flips you over. Hours of RTO.

This is why local buffering matters. You can’t out-architect the vendor’s regional pinning. You can decouple your applications from it.

Reconciliation: the painful part

When vendor is back, write buffer drains. Some of those writes may conflict with writes the vendor accepted before going dark, or with admin recovery work. Reconciliation rules:

Use server-side timestamps, not client-side.
Last-write-wins is dangerous on amounts; prefer field-level merge for monetary fields.
Flag anything where the vendor’s current state contradicts the buffered intent — route to ops for review.
Generate a reconciliation report: writes succeeded, writes deduped, writes rejected, writes flagged.

Plan for 0.5–2% of buffered writes to need human review.
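The rules above can be sketched as a per-write decision made before each buffered write is applied. This is one possible reading, not a prescribed algorithm: `PendingWrite`, the outcome labels, and the idea of storing the server's last-modified stamp with each buffered write are assumptions. Stamps are only compared for equality, so no client clock is involved.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PendingWrite:
    request_id: str
    changes: dict          # field -> value the user intended
    base_modified_at: str  # server-side last-modified stamp the client last saw


def classify_write(write: PendingWrite, current: dict, current_modified_at: str) -> str:
    """Return 'apply', 'deduped', or 'flagged' for one buffered write."""
    if all(current.get(f) == v for f, v in write.changes.items()):
        return "deduped"  # the vendor already holds the intended values
    if current_modified_at == write.base_modified_at:
        return "apply"    # nothing changed server-side since the user read it
    return "flagged"      # the record moved underneath the write; route to ops


def reconciliation_report(outcomes: list[str]) -> Counter:
    """Tally outcomes for the post-drain report."""
    return Counter(outcomes)
```

Writes classified as `apply` can still be rejected by the vendor when they are sent; those rejections would be counted separately to complete the four-way report.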

The communication playbook

Internal. Pinned message in war-room channel updated every 15 minutes. Status update template ready to copy.
Sales / service teams. Cadence email + Slack at T+15, T+45, then hourly. Include manual workflow instructions.
Customers. Status page banner at T+30 if customer-facing impact. Proactive email at T+60 if revenue impact.
Executives. Snapshot at T+30, T+120, T+resolution. Don’t make them ask.

Pre-write all of these. Drafting comms during an incident wastes the time of the people you most need elsewhere.

Post-incident

Drain buffer. Reconcile. Audit.
Customer comms: what happened, what we did, what we’ll do.
Vendor: file the SLA credit request. Most never do. The credit isn’t the point — the documentation trail for future negotiation is.
Post-mortem: timeline, contributing factors, what worked, what didn’t, action items with owners and dates.

Then roll the runbook forward: fold what you learned into the next version.

Drills: the only thing that proves the runbook works

Quarterly. Block 90 minutes. Simulate vendor outage. Probe says red. Run the steps. Time how long until writes queue, reads serve from cache, users keep working. Find what broke. Fix it before the real one.

Most teams don’t drill. Most teams’ runbooks are fiction.

A drill checklist:

Block probes from reaching vendor (firewall rule, not actual outage).
Confirm health-check pipeline detects within 90 seconds.
Confirm alerting fires to on-call.
Switch apps into degraded mode; observe user-facing behavior.
Submit test writes; confirm they queue locally and survive a buffer restart.
Submit test reads; confirm fallback sources serve them.
Restore connectivity; drain buffer; verify reconciliation.
Debrief: where did things stick? Document and fix before next quarter.

AI agents during outage

Your customer-facing AI agents call CRM APIs. When the CRM is down, agents return errors or, worse, fabricate responses based on stale cache. Decide upfront:

Agents enter “read-only mode” — answer from cache, refuse to commit any write, tell the user explicitly.
Agents escalate to human — every request routes to a person, no autonomous action.
Agents halt — refuse to take the call until CRM is healthy.

Halt is safest. Read-only with explicit disclosure is the usual compromise. Fabrication is malpractice. Wire the agent’s tool-call layer to the same health probes so it knows the CRM is down.
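Wiring the three policies into the tool-call layer could look like the sketch below. The names (`OutagePolicy`, `guard_tool_call`, `CrmUnavailable`) and the returned route labels are hypothetical; `crm_healthy` is assumed to come from the same probes that drive alerting.

```python
from enum import Enum


class OutagePolicy(Enum):
    READ_ONLY = "read_only"
    ESCALATE = "escalate"
    HALT = "halt"


class CrmUnavailable(Exception):
    """Raised when the agent must refuse instead of guessing."""


def guard_tool_call(
    tool_name: str, is_write: bool, crm_healthy: bool, policy: OutagePolicy
) -> str:
    """Return the route for one tool call: 'live', 'cached', or 'human'."""
    if crm_healthy:
        return "live"
    if policy is OutagePolicy.HALT:
        raise CrmUnavailable(f"{tool_name}: CRM is down, agent is halted")
    if policy is OutagePolicy.ESCALATE:
        return "human"
    # READ_ONLY: reads may be served from cache, writes are refused outright.
    if is_write:
        raise CrmUnavailable(f"{tool_name}: CRM is down, writes are disabled")
    return "cached"
```

Raising an exception instead of returning an empty result is deliberate: it forces the agent to tell the user what is happening and leaves it no gap to fill with invented data.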

SLA credits and what they’re actually worth

Most major CRM vendors’ SLAs include credits when uptime falls below the commitment. Typical tiers (check your own contract):

Below 99.9%: 10% credit on the affected month.
Below 99.0%: 25% credit.
Below 95.0%: 50% credit.

You have to file. The window is usually 30 days. The credit is on next month’s bill, not a cash refund. The actual revenue impact of an outage routinely exceeds the SLA credit by 10x.
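To see which tier an incident lands in, convert downtime to monthly uptime and look it up. A minimal sketch using the typical tiers listed above; your contract's thresholds and measurement window may differ, and both function names are illustrative.

```python
def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime for the month as a percentage, given total downtime in minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


def sla_credit_percent(uptime_percent: float) -> int:
    """Credit tier for a month's uptime (typical tiers, not a real contract)."""
    if uptime_percent < 95.0:
        return 50
    if uptime_percent < 99.0:
        return 25
    if uptime_percent < 99.9:
        return 10
    return 0


# A four-hour outage in a 30-day month is roughly 99.44% uptime.
credit = sla_credit_percent(monthly_uptime_percent(4 * 60))
```

Under these tiers a four-hour outage earns the 10% credit, and it takes more than about 7.2 hours of downtime in a 30-day month to reach the 25% tier.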

File anyway. The paper trail matters for renewal negotiation.

What ties to broader resilience

The same instincts that build SRE practices for AI agents apply here. Probes, queues, runbooks, drills. The new wrinkle in 2026 is that AI agents are now critical-path users of the CRM — when the CRM is down, your agents are down, and they fail loudly. Plan for the cascade.

Bottom line

99.9% SLA = 8h45m / year of permitted downtime. Plan for it.
Buffer writes locally always, not just during incidents. Same hot path, with resilience bounded only by buffer retention.
Read fallback via warehouse replica + local cache. Don’t depend on the CRM for read uptime.
Vendor “multi-region” is contractual DR with hours of RTO. Don’t confuse it with active-active.
Drill quarterly. The runbook you haven’t run is a document, not a runbook.
