
The Problem

Prompt changes, model upgrades, retrieval tweaks, and even Salesforce-managed Atlas planner updates all affect agent behavior. Without regression tests, drift surfaces in production, usually through an unhappy customer. A working suite covers 50–200 test cases across four categories: happy path, edge cases, adversarial (prompt injection, jailbreak), and out-of-scope (refusal expected). Use the Agentforce Testing Center as the canonical home: it integrates with the Atlas Reasoning trace, so a failure points to the exact span that diverged.

Test Case Structure

Each case has an input (user message plus optional structured context such as the active record), an expected outcome (topic selected, actions called with argument shape, semantic content of the response, refusal flag), and a scoring rubric. Score pass/fail by semantic match or LLM-as-judge rather than strict string match, which is too fragile for LLM outputs and fails cases on purely cosmetic phrasing changes.

Test case (YAML, Agentforce Testing Center):
- id: case-resetpw-001
  input:
    user_message: "I forgot my VPN password"
    context:
      user_id: 005xx000001abcd
  expected:
    topic: password_reset
    actions:
      - name: VerifyMfa
      - name: ResetVpnPassword
        args_shape:
          user_id: string
    response_must_contain: ["temporary password", "expires in 24 hours"]
    refusal: false
  scoring:
    method: llm_judge
    judge_model: claude-haiku-4
    threshold: 0.85
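
The deterministic parts of an expected outcome (topic, action sequence, refusal flag, required phrases) can be checked in plain code before any LLM judge is called. The following is a minimal Python sketch under that assumption. The `Expected` and `Observed` classes and the `structural_failures` helper are hypothetical names for illustration, not part of the Agentforce Testing Center, and the phrase check is a plain case-insensitive substring test that leaves semantic scoring to the judge.

```python
from dataclasses import dataclass, field


@dataclass
class Expected:
    """Hypothetical mirror of the `expected` block in the YAML case."""
    topic: str
    actions: list[str]
    response_must_contain: list[str] = field(default_factory=list)
    refusal: bool = False


@dataclass
class Observed:
    """Hypothetical summary of what the agent actually did for one input."""
    topic: str
    actions: list[str]
    response: str
    refused: bool


def structural_failures(expected: Expected, observed: Observed) -> list[str]:
    """Return human-readable reasons the observed run deviates from the case."""
    failures = []
    if observed.topic != expected.topic:
        failures.append(f"topic: expected {expected.topic!r}, got {observed.topic!r}")
    # Action order matters here: VerifyMfa must come before ResetVpnPassword.
    if observed.actions != expected.actions:
        failures.append(f"actions: expected {expected.actions}, got {observed.actions}")
    if observed.refused != expected.refusal:
        failures.append(f"refusal: expected {expected.refusal}, got {observed.refused}")
    lowered = observed.response.lower()
    for phrase in expected.response_must_contain:
        if phrase.lower() not in lowered:
            failures.append(f"response is missing phrase {phrase!r}")
    return failures


expected = Expected(
    topic="password_reset",
    actions=["VerifyMfa", "ResetVpnPassword"],
    response_must_contain=["temporary password", "expires in 24 hours"],
)
observed = Observed(
    topic="password_reset",
    actions=["ResetVpnPassword"],  # MFA step was skipped
    response="Here is your temporary password. It expires in 24 hours.",
    refused=False,
)
for reason in structural_failures(expected, observed):
    print(reason)
```

A case that returns an empty list here would still go to the judge for the semantic score; a non-empty list can fail the case without spending a judge call.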

CI Integration

Run the suite on every metadata change to the agent. Block deploys when regressions exceed a defined threshold (e.g., >5% case failure delta or any failure on a Severity-1 case). Auto-run on model version upgrades — these are the changes most likely to cause invisible drift, since you don’t control the upgrade cadence. Wire the runner into your existing GitHub Actions or Salesforce DX pipeline:

sf agent test run \
  --agent-name SupportAgent \
  --suite ./tests/agent/regression.yaml \
  --result-format json \
  --fail-on-severity 1
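
The blocking rule described above can be sketched in Python as a small gate that compares a candidate run against a baseline run. This is an illustration, not the behavior of any Salesforce tool: `CaseResult`, `failure_rate`, and `should_block_deploy` are hypothetical names, and "failure delta" is assumed to mean the increase in failure rate, in percentage points, relative to the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical per-case outcome parsed from a test run."""
    case_id: str
    severity: int  # 1 is the most critical
    passed: bool


def failure_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that failed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(not r.passed for r in results) / len(results)


def should_block_deploy(
    baseline: list[CaseResult],
    candidate: list[CaseResult],
    max_delta: float = 0.05,
) -> tuple[bool, list[str]]:
    """Block on any Severity-1 failure or a failure-rate increase above max_delta."""
    reasons = []
    sev1_failures = [r.case_id for r in candidate if r.severity == 1 and not r.passed]
    if sev1_failures:
        reasons.append(f"Severity-1 failures: {', '.join(sev1_failures)}")
    delta = failure_rate(candidate) - failure_rate(baseline)
    if delta > max_delta:
        reasons.append(f"failure rate rose by {delta:.1%} (limit {max_delta:.0%})")
    return bool(reasons), reasons
```

Returning the reasons alongside the decision lets the pipeline print why a deploy was blocked instead of only exiting non-zero.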

Adversarial Coverage

Allocate 20–30% of cases to adversarial prompts: instruction overrides, role-play jailbreaks, indirect injection via document content, structured-output abuse, and PII probing. Refresh the adversarial set monthly with examples from public LLM red-team corpora. Without this slice, the suite tests competence but not safety.
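
One way to keep this allocation from eroding as cases are added is a check on the suite's category mix. The Python sketch below assumes each case carries a category label such as `adversarial`; the label values and the function names are hypothetical, and the 20–30% band comes from the guidance above.

```python
from collections import Counter


def category_shares(categories: list[str]) -> dict[str, float]:
    """Map each category label to its fraction of the suite."""
    total = len(categories)
    if total == 0:
        return {}
    return {name: count / total for name, count in Counter(categories).items()}


def adversarial_share_ok(
    categories: list[str], low: float = 0.20, high: float = 0.30
) -> bool:
    """True when the adversarial slice falls inside the target band."""
    share = category_shares(categories).get("adversarial", 0.0)
    return low <= share <= high


# Illustrative 200-case suite with a 25% adversarial slice.
suite = (
    ["happy_path"] * 80
    + ["edge_case"] * 40
    + ["adversarial"] * 50
    + ["out_of_scope"] * 30
)
print(category_shares(suite))
print(adversarial_share_ok(suite))
```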

Maintenance

Test sets rot. Review quarterly. Add a case for every production bug as you fix it (regression-driven development for agents). Remove tests that no longer represent real usage, because expectations of agent behavior drift as product scope evolves. An unmaintained regression suite becomes theater rather than protection and erodes developer trust faster than having no suite at all.

Cost Considerations

A 200-case suite running on every change can cost $5–$30 per run depending on model choice. Use a tiered strategy: the full suite on main-branch merges, a smoke subset (20 cases) on PRs. Running LLM-as-judge on a smaller model (Haiku, Gemini Flash) cuts judge cost roughly 10x, typically with little accuracy loss for binary pass/fail.
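
The effect of tiering can be worked through with simple arithmetic. The Python sketch below assumes cost scales linearly with the number of cases, which is a simplification, and uses the $5–$30 range for 200 cases quoted above. The run counts (100 PR runs and 20 merges per month) are hypothetical figures chosen only for illustration.

```python
FULL_SUITE_CASES = 200
FULL_SUITE_COST = (5.0, 30.0)  # USD per full run (low, high), from the text


def scaled_cost(cases: int) -> tuple[float, float]:
    """Cost range for a run of `cases` cases, assuming linear scaling."""
    low, high = FULL_SUITE_COST
    return (low * cases / FULL_SUITE_CASES, high * cases / FULL_SUITE_CASES)


def monthly_cost(
    pr_runs: int, merge_runs: int, smoke_cases: int = 20
) -> tuple[float, float]:
    """Monthly range with the smoke subset on PRs and the full suite on merges."""
    smoke_low, smoke_high = scaled_cost(smoke_cases)
    full_low, full_high = scaled_cost(FULL_SUITE_CASES)
    return (
        pr_runs * smoke_low + merge_runs * full_low,
        pr_runs * smoke_high + merge_runs * full_high,
    )


tiered = monthly_cost(pr_runs=100, merge_runs=20)
untiered = monthly_cost(pr_runs=0, merge_runs=120)  # full suite on every run
print(f"tiered:   ${tiered[0]:.0f} to ${tiered[1]:.0f} per month")
print(f"untiered: ${untiered[0]:.0f} to ${untiered[1]:.0f} per month")
```

Under these assumed run counts, the tiered setup comes to $150–$900 per month against $600–$3,600 for running the full suite on every change.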

What to Do This Week

Build the first 30 regression cases for your highest-traffic agent: 15 happy-path, 10 edge cases, 5 adversarial. Wire the runner into your deploy pipeline.
