Ensure data quality when migrating a large dataset into Salesforce | CRM Curator

Large data migrations fail in production not because the loader broke but because the data was wrong before it ever hit Salesforce, and nobody noticed until reports started returning garbage. The interview answer interviewers want is a pre-load → load → post-load discipline, not just “use Data Loader.”

The 60-second answer

(1) Profile the source data — row counts, null %, value distributions, uniqueness on the natural key. (2) Normalize picklist values, dates, phone formats before load — fix it in staging, not in Salesforce. (3) Dedupe against itself and against existing Salesforce data using an External ID. (4) Upsert with External ID so reruns are idempotent. (5) Load in batch sizes of 5,000–10,000 with the Bulk API, parent objects first. (6) Reconcile post-load — row counts, sum-of-amounts, sample-record verification. (7) Never load straight to production; always full → partial sandbox → production with a written runbook.

Phase 1: Profile the source data

Before you write a single field mapping, run profiling on the source. The fastest way is SQL or a Pandas notebook.

For each column, report:
  - Row count
  - Distinct count
  - Null count + null %
  - Min / Max / Mean (numeric)
  - Min / Max length (text)
  - Top 10 most frequent values
  - Distinct unique count of the natural-key field

What you’re looking for:

High null % in fields that are required in Salesforce → either source-data fix or a default value strategy.
Distinct counts on the natural key less than total row count → duplicates in the source itself.
Unexpected value variants in what should be a picklist (U.S., USA, United States, united states) → normalize before load.
Date format inconsistencies — Salesforce wants ISO 8601 (2026-05-19), most source systems do not.

Phase 2: Normalize and clean in staging

Do this work in the source / ETL layer, not via post-load Salesforce updates.

Cleanup	Fix it in staging because
Picklist value mapping	Salesforce only accepts exact picklist values; anything else either errors or gets stored as the API name silently
Phone format	Salesforce phone field has 40 chars; format consistency makes Lightning click-to-call work
Required-field defaults	Loading nulls into required fields is the #1 cause of mid-load failure
Owner mapping	If the source `created_by` is “John Smith” you need to map to the Salesforce `User.Id` — do this in a lookup, not at runtime
Currency / Money rounding	Salesforce stores up to 18 decimal places — match the source’s precision exactly

Phase 3: Dedupe against itself and against Salesforce

Two dedup passes:

Source-side: GROUP BY natural_key HAVING COUNT(*) > 1 — fix in source.
Cross-Salesforce: query Salesforce for any matching natural keys (SELECT Id, Email__c FROM Contact WHERE Email__c IN (:sourceEmails)), then upsert instead of insert.

The cleanest path is a Unique External ID field on every migrated object. The pattern:

Contact:
  External_ID__c   Text(40)   Unique: ✓   External ID: ✓

Account:
  Legacy_System_ID__c  Text(40)  Unique: ✓  External ID: ✓

With those in place, Data Loader can match on the external key and Upsert instead of Insert — duplicates become updates, and reruns are safe.

Phase 4: Load order and batch sizing

Salesforce data has a dependency graph. Load parents before children:

1. Users + Roles + Profiles  ← must exist for owner assignment
2. Accounts                   ← parent of most things
3. Contacts                   ← FK to Account
4. Opportunities              ← FK to Account
5. OpportunityLineItems       ← FK to Opportunity + Pricebook2
6. Cases                      ← FK to Account/Contact
7. Tasks/Events               ← FK to many
8. Custom objects             ← in their own dependency order

Use Bulk API 2.0 with batch sizes of 5,000–10,000. Smaller batches mean more API calls and slower throughput; larger batches risk per-batch timeout on heavy triggers. Disable triggers/workflows for the load if business logic isn’t needed during migration (then re-run them via batch Apex post-load — this is the cleanest pattern for very large loads).

Phase 5: Reconcile post-load

A migration is “done” only when reconciliation passes:

Check	Pass criteria
Row count	Source rows = Salesforce inserted rows + Salesforce skipped rows
Sum reconciliation	`SUM(Amount)` in source = `SUM(Amount)` in Salesforce
Sample inspection	Hand-pick 20 records and walk every field
Parent-child integrity	`COUNT(Contact)` per Account matches source per `Legacy_Account_ID`
Picklist values	No `NULL` in required picklists
Owner mapping	No records owned by the integration user when they should be owned by a real rep

Failures on any reconciliation check = roll back and reload, not “fix in production.”

Anti-patterns that fail migrations

Insert instead of Upsert on rerun — creates duplicates the second time you run.
No External ID — leaves you reconciling on Name (ambiguous) instead of a true key.
Triggers and validation rules on during the load — every rule fires per record; a 1M-row load might take a week and break on a Validation Rule that nobody remembered.
Loading to production first — always do full sandbox → reconcile → production. Treat the sandbox load as a dress rehearsal.
No rollback plan — if reconciliation fails, you need a documented DELETE WHERE Migration_Batch__c = 'X' to undo the load cleanly.
Ignoring picklist value mismatches — Salesforce will accept an unrecognized value via the API silently and reports will start returning blank cells. Validate the picklist values in a pre-load script.

How to answer in 30 seconds

“Profile the source, normalize and dedupe in staging, use External ID + upsert for idempotency, load parents-before-children in Bulk API batches of 5–10k, then reconcile row counts and sums post-load. Never load straight to production — full sandbox dress rehearsal first.”

How to answer in 2 minutes

Walk the five phases — profile, clean, dedupe, load order, reconcile. Mention the External ID + upsert pattern is what makes reruns safe. Mention turning off triggers/workflows during very large loads and re-running them via batch Apex. End with reconciliation as the definition of done.

Likely follow-up questions

What’s the difference between Bulk API 1.0 and 2.0?
When would you use Data Loader vs Workbench vs the Bulk API CLI?
How do you handle owner mapping from a legacy system?
What’s the trade-off between turning off triggers during a load vs leaving them on?
How would you migrate Attachments / Files / ContentDocument?

Verified against: Bulk API 2.0 Developer Guide, External ID best practices, Data Loader Guide — Upsert. Last reviewed 2026-05-19.