Why Synthetic
Real customer data carries privacy risk in test environments: every Salesforce sandbox, every staging cluster, and every developer laptop with a copy of production data is a potential incident. Yet teams need realistic volumes and distributions to test AI features properly, because the agent’s behavior depends on the data it sees. Synthetic data bridges the gap: realistic patterns and distributions without identifiable records, far fewer GDPR Article 4(5) headaches, less HIPAA Safe Harbor calculus, and an easy answer to the vendor questionnaire that asks “do customer records leave production?”
Generation Approaches
Three families dominate. Rule-based generation (libraries such as Faker, Mimesis, and Bogus) is fast and cheap but, used naively, produces unrealistic relational distributions: every account has the same number of contacts, every customer has the same lifecycle. Statistical sampling fits distributions to real data and regenerates from them; when each column is fitted independently, the result preserves marginals but loses joint patterns. Model-based generation (GANs, VAEs, diffusion models, LLM-driven synthesis) preserves complex relationships but requires careful privacy-leak testing, because poorly trained generators can memorize and emit training examples. The Synthetic Data Vault (SDV) ships both statistical and deep-learning synthesizers and is the most accessible open-source option.
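# SDV example: generate synthetic Accounts, Contacts, and Opportunities while preserving relationships.
# Assumes SDV 1.x, where HMASynthesizer replaced the pre-1.0 sdv.relational.HMA1 class.
import pandas as pd
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

# Load the real tables (file names are placeholders).
real_tables = {
    'accounts': pd.read_csv('accounts.csv'),
    'contacts': pd.read_csv('contacts.csv'),
    'opportunities': pd.read_csv('opportunities.csv'),
}
# Auto-detect column types and keys; review the detected relationships before fitting.
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(real_tables)
model = HMASynthesizer(metadata)
model.fit(real_tables)
synthetic = model.sample(scale=1.5)  # roughly 50% more rows than real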
Tool Landscape
Tonic.ai is the leader for relational tabular data, with native connectors for Postgres, MySQL, Snowflake, Databricks, and MongoDB. Gretel emphasizes generative quality and ships differential-privacy guarantees. MOSTLY AI focuses on banking and insurance. YData covers time-series patterns well. Salesforce Data Mask provides on-platform generation for sandboxes, which is useful for UAT but limited for AI training. Open-source SDV suits DIY teams with data-engineering capacity. Pick based on use case, compliance posture, and budget; the wrong tool is one that does not preserve the relational distributions your AI agent depends on.
AI Testing Specifics
For agent eval, synthetic conversations matter more than synthetic records. The eval set needs realistic customer queries (the top 20 intents covering 80% of volume, plus long-tail intents with at least 30 examples each), realistic message patterns (typos, frustrated tone, multi-turn clarifications, off-topic excursions, mixed languages), and realistic support cases with the metadata an agent would see. Volume matters: a few hundred examples do not test rare-intent handling. Plan for 2,000 to 10,000 conversations per agent, version-controlled in git and refreshed quarterly. LLM-driven synthesis (Claude, GPT-5, Llama 4) can generate conversation transcripts at scale; combine it with human review of a sample to catch unrealistic patterns.
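The coverage targets above (head intents carrying most of the volume, at least 30 examples per long-tail intent) are easy to check mechanically. The following is a minimal sketch that assumes each conversation in the eval set carries a single intent label; the function name, parameters, and sample labels are hypothetical.

```python
from collections import Counter

def intent_coverage(intents, head_size=20, min_tail_examples=30):
    """Summarize head and long-tail coverage for a list of intent labels.

    intents: one label per conversation (hypothetical field).
    Returns the total count, the share of volume in the top `head_size`
    intents, and the tail intents with fewer than `min_tail_examples`.
    """
    counts = Counter(intents)
    total = sum(counts.values())
    ranked = counts.most_common()
    head, tail = ranked[:head_size], ranked[head_size:]
    head_share = sum(n for _, n in head) / total if total else 0.0
    thin_tail = sorted(name for name, n in tail if n < min_tail_examples)
    return {"total": total, "head_share": head_share, "thin_tail": thin_tail}

# Toy eval set with two head intents and one under-represented tail intent.
labels = ["billing"] * 80 + ["refund"] * 15 + ["gdpr_export"] * 5
report = intent_coverage(labels, head_size=2, min_tail_examples=30)
print(report)
```

Running a check like this in CI against the version-controlled eval set would flag thin tail intents before each quarterly refresh.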
Privacy Posture
Synthetic data is not automatically private. Verify through differential-privacy guarantees (epsilon-delta budgets), membership inference attack testing, and explicit reconstruction-attack red teaming. The 2023 NeurIPS work “Are Synthetic Data Private?” showed reconstruction is feasible against generators trained without DP. For high-sensitivity domains (health, finance, government), require DP guarantees in the procurement contract.
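As a coarse first screen before the heavier tests above, one can measure how many synthetic rows reproduce a real row on a set of quasi-identifying fields. The sketch below is an assumption-laden illustration, not a membership inference attack and not a differential-privacy guarantee: the field names and rows are invented, and a low rate does not prove the data is private.

```python
def _normalize(value):
    """Compare strings case-insensitively and without surrounding whitespace."""
    return value.strip().casefold() if isinstance(value, str) else value

def exact_copy_rate(real_rows, synthetic_rows, key_fields):
    """Fraction of synthetic rows whose key_fields all match some real row."""
    real_keys = {tuple(_normalize(row[f]) for f in key_fields) for row in real_rows}
    if not synthetic_rows:
        return 0.0
    hits = sum(
        1 for row in synthetic_rows
        if tuple(_normalize(row[f]) for f in key_fields) in real_keys
    )
    return hits / len(synthetic_rows)

# Hypothetical rows: the first synthetic record copies a real one.
real = [
    {"email": "ana@example.com", "zip": "94105"},
    {"email": "raj@example.com", "zip": "10001"},
]
synthetic = [
    {"email": "Ana@Example.com ", "zip": "94105"},
    {"email": "kim@example.org", "zip": "10001"},
    {"email": "lee@example.org", "zip": "60601"},
    {"email": "sam@example.org", "zip": "73301"},
]
rate = exact_copy_rate(real, synthetic, key_fields=["email", "zip"])
print(f"exact-copy rate: {rate:.2%}")
```

Any non-zero rate on fields that should be unique per person is worth investigating; a zero rate only means this particular screen found nothing.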
Common Failure Modes
The recurring failures: synthetic data with uniform random distributions when reality is power-law, leading to eval pass rates that do not predict production behavior; reusing the same synthetic eval set for so long that the agent overfits to its quirks; failing to refresh when the customer base shifts; trusting “this is synthetic, so it cannot leak” without measurement; and conflating data masking with synthetic generation — they have different privacy properties.
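The first failure mode, uniform distributions where reality is power-law, can be made concrete with a short sketch. It compares contacts-per-account counts drawn uniformly against counts drawn from a Pareto distribution using the standard library; the shape parameter, cap, and ranges are illustrative assumptions and should be fitted to production data in practice.

```python
import random

def power_law_counts(n_accounts, alpha=1.5, cap=500, seed=0):
    """Contacts per account from a Pareto draw (always at least 1), capped at `cap`."""
    rng = random.Random(seed)
    return [min(cap, int(rng.paretovariate(alpha))) for _ in range(n_accounts)]

def uniform_counts(n_accounts, low=1, high=5, seed=0):
    """Contacts per account drawn uniformly, as a naive rule-based generator might."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n_accounts)]

def top_decile_share(counts):
    """Share of all contacts held by the largest 10% of accounts."""
    ranked = sorted(counts, reverse=True)
    top_n = max(1, len(ranked) // 10)
    return sum(ranked[:top_n]) / sum(ranked)

skewed = power_law_counts(10_000)
flat = uniform_counts(10_000)
print(f"power-law top-decile share: {top_decile_share(skewed):.2f}")
print(f"uniform top-decile share:   {top_decile_share(flat):.2f}")
```

In the power-law set a small number of accounts hold a large share of contacts, which is the regime where agents hit pagination, context-length, and summarization limits that a uniform set never exercises.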
What Changed in 2026
Three shifts: regulators began accepting synthetic data as a recognized technique under EU AI Act Article 10 data-governance obligations; LLM-based conversation synthesis matured to the point that it works well for agent eval generation; and Salesforce expanded Data Mask in Spring ’26 with stronger relational preservation.
What to do this week
Audit one agent eval set. Confirm whether it uses real or synthetic data; if real, the privacy review is overdue. If synthetic, verify someone has measured distributional fidelity against production.
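One simple way to put a number on distributional fidelity for a categorical field is the total variation distance between the production and synthetic frequency distributions. The sketch below is one possible measure among many, and the segment labels and counts are made up for illustration.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    real_counts, syn_counts = Counter(real_values), Counter(synthetic_values)
    n_real, n_syn = sum(real_counts.values()), sum(syn_counts.values())
    if n_real == 0 or n_syn == 0:
        raise ValueError("both inputs must be non-empty")
    categories = set(real_counts) | set(syn_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - syn_counts[c] / n_syn) for c in categories
    )

# Hypothetical customer-segment mix: production is SMB-heavy, synthetic is near-uniform.
production = ["smb"] * 70 + ["mid"] * 25 + ["ent"] * 5
synthetic = ["smb"] * 34 + ["mid"] * 33 + ["ent"] * 33
tvd = total_variation_distance(production, synthetic)
print(f"segment TVD: {tvd:.2f}")
```

Tracking a per-field score like this over time gives the audit a concrete threshold to review, rather than relying on a visual impression that the data looks realistic.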