After the GPT-5 migration, a team’s Customer Agent stops rejecting refund requests it should reject. Same prompt, different model, different behavior on edge cases. Model swaps are never silent in agent systems, even when the platform calls them seamless. The 2026 Breeze move to GPT-5 brought genuine reasoning improvements and a handful of regressions that show up only when you re-run real traffic.
The migration in brief
HubSpot migrated Breeze Studio agents to GPT-5 in 2026. Same agent configuration, new underlying model. The headline gains: reasoning depth, tool-call planning, and multi-turn coherence. The headline costs: per-token pricing increased and edge-case behavior shifted in ways prompts written for the previous model did not anticipate.
What got measurably better
- Multi-step intent: agent stays on task across 5+ turns
- Tool selection: picks the right action when 8+ are available
- Ambiguous input: clarifies before acting more often
- Long context: handles 30+ message threads without forgetting state
- Code generation: in-agent code helpers are more accurate
Pull your agent transcripts from before and after the migration and compare handle time and handoff rate. The improvements should show up in both metrics within a week of switchover.
What broke or shifted
- Refusal patterns: more conservative on edge requests
- Format adherence: occasional deviation from JSON schema
- Tone calibration: skews more formal than prior model
- Latency: slightly higher per response
- Hallucinations: lower frequency but harder to catch when present
Hallucinations did not disappear. They became less common and more confident-sounding. A wrong answer in a confident voice is harder to flag than a wrong answer hedged with uncertainty.
Re-test protocol post-migration
Run a regression suite of real prior interactions through the new model:
1. Pull 200 representative interactions from the last 90 days
   - Spread them across intents, channels, and customer segments
   - Include known edge cases that previously failed
2. Replay them through the new model with the same prompt
3. Diff the outputs and classify each as:
   - Equivalent
   - Better (prior was wrong, new is right)
   - Different but acceptable
   - Regressed (prior was right, new is wrong)
4. Tune prompts for regressions before broad rollout
Without this protocol, you discover regressions through customer complaints.
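The classification step can be sketched in a few lines of Python. This is a minimal illustration, not a Breeze or HubSpot API: the `ReplayCase` record, its field names, and the reviewer-supplied `prior_correct` / `new_correct` labels are all assumptions. It also adds a fifth bucket, `STILL_WRONG`, for known-failing edge cases that the new model does not fix, which the four classes above do not cover.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    EQUIVALENT = "equivalent"
    BETTER = "better"
    DIFFERENT_OK = "different but acceptable"
    REGRESSED = "regressed"
    # Assumption: extra bucket, not one of the four classes in the protocol.
    STILL_WRONG = "still wrong"


@dataclass
class ReplayCase:
    """One replayed interaction (hypothetical record shape)."""
    case_id: str
    prior_output: str
    new_output: str
    prior_correct: bool  # label from human review of the prior model's reply
    new_correct: bool    # label from human review of the new model's reply


def classify(case: ReplayCase) -> Verdict:
    """Map one replayed case to a verdict using the reviewer labels."""
    if case.prior_correct and not case.new_correct:
        return Verdict.REGRESSED
    if not case.prior_correct and case.new_correct:
        return Verdict.BETTER
    if not case.prior_correct and not case.new_correct:
        return Verdict.STILL_WRONG
    # Both replies are correct: identical text is equivalent, anything else
    # is different but acceptable.
    if case.prior_output.strip() == case.new_output.strip():
        return Verdict.EQUIVALENT
    return Verdict.DIFFERENT_OK


def summarize(cases: list[ReplayCase]) -> Counter:
    """Count how many cases land in each verdict."""
    return Counter(classify(c) for c in cases)


def regressions(cases: list[ReplayCase]) -> list[str]:
    """Return the ids of regressed cases, the ones to tune prompts against."""
    return [c.case_id for c in cases if classify(c) is Verdict.REGRESSED]
```

The list returned by `regressions` is the work queue for step 4. The counts from `summarize` give a rough go/no-go signal before rollout.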
Prompt updates that paid off
Old prompt:
"Help the customer with their question."
Better for GPT-5:
"Help the customer with their question. If you are
uncertain, ask one clarifying question rather than
guessing. Never invent product features that do not
exist in the knowledge base. Format your response as
JSON with fields: answer, confidence (0-1),
needs_human (boolean)."
Explicit format requirements and uncertainty handling tighten the new model’s outputs.
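Because format adherence is one of the things that can drift, it helps to validate each reply against the shape the prompt asks for. The sketch below checks only the three fields named in the prompt (`answer`, `confidence`, `needs_human`). The function name `parse_agent_reply` and the decision to return `None` on any mismatch are illustrative assumptions, not part of any agent platform.

```python
import json
from typing import Any, Dict, Optional


def parse_agent_reply(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed reply if it matches the requested shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    answer = data.get("answer")
    confidence = data.get("confidence")
    needs_human = data.get("needs_human")

    if not isinstance(answer, str):
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0 <= confidence <= 1:
        return None
    if not isinstance(needs_human, bool):
        return None
    return data
```

A reply that fails validation could be retried or routed to a human. Tracking the failure rate per prompt version is one way to see whether schema deviations are getting rarer as the prompt is tuned.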
Cost watch
- Per-token cost up vs prior model
- Outcome-based pricing partly hides this
- Watch: average tokens per resolution
- Watch: spend per channel (chat vs email differs)
- Set alert: 30% spend increase week over week
The first month after migration is the right window to recalibrate spend forecasts.
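The two watch metrics and the alert rule reduce to simple arithmetic. The sketch below assumes you can export weekly spend per channel as plain dictionaries. The function names and data shapes are hypothetical, and the 30% threshold is the one from the list above.

```python
def avg_tokens_per_resolution(total_tokens: int, resolutions: int) -> float:
    """Average tokens spent per resolved conversation."""
    if resolutions == 0:
        return 0.0
    return total_tokens / resolutions


def week_over_week_change(prev_spend: float, curr_spend: float) -> float:
    """Fractional change in spend, e.g. 0.4 for a 40% increase."""
    if prev_spend <= 0:
        raise ValueError("prev_spend must be positive to compute a change")
    return (curr_spend - prev_spend) / prev_spend


def channels_over_threshold(
    prev: dict[str, float],
    curr: dict[str, float],
    threshold: float = 0.30,
) -> list[str]:
    """Return channels whose spend rose by at least `threshold` week over week."""
    flagged = []
    for channel, curr_spend in curr.items():
        prev_spend = prev.get(channel, 0.0)
        if prev_spend <= 0:
            # No baseline for this channel: flag any new spend for review.
            if curr_spend > 0:
                flagged.append(channel)
            continue
        if week_over_week_change(prev_spend, curr_spend) >= threshold:
            flagged.append(channel)
    return sorted(flagged)
```

For example, chat spend going from 100 to 140 is a 40% increase and trips the alert, while email going from 200 to 210 (5%) does not.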
Production rollout sequence
Week 1: replay regression suite, fix prompts
Week 2: 10% traffic shadow mode (run both models, compare)
Week 3: 25% traffic live on new model
Week 4: 50% traffic, expand if metrics hold
Week 5: 100%, prior model decommissioned
Shadow mode catches regressions without exposing customers to them: both models answer, but only the reply from the prior model is sent.
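One way to implement the percentage steps is deterministic bucketing: hash a stable conversation id into a bucket from 0 to 99 and compare it with the current rollout percentage, so a conversation that enters the rollout at 25% stays in it at 50%. This is a sketch under assumptions. `ROLLOUT_PLAN`, `serve_mode`, and the idea that you control routing yourself are illustrative, and a managed platform may not expose this control.

```python
import hashlib

# Week -> (mode, percent of traffic), mirroring the sequence above.
# Week 1 is replay-only, so it sends no traffic to the new model.
ROLLOUT_PLAN = {
    2: ("shadow", 10),
    3: ("live", 25),
    4: ("live", 50),
    5: ("live", 100),
}


def bucket(conversation_id: str) -> int:
    """Map an id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(conversation_id: str, percent: int) -> bool:
    """True if this conversation falls inside the given traffic percentage."""
    return bucket(conversation_id) < percent


def serve_mode(conversation_id: str, week: int) -> str:
    """Return 'prior', 'shadow' (run both, send prior), or 'new'."""
    if week < 2:
        return "prior"
    mode, percent = ROLLOUT_PLAN[min(week, 5)]
    if not in_rollout(conversation_id, percent):
        return "prior"
    return "shadow" if mode == "shadow" else "new"
```

Hashing the id rather than sampling randomly keeps a customer on one model for the whole conversation, which makes before/after comparisons cleaner.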
Governance updates
Update your agent governance doc:
- Model: GPT-5 (was GPT-4-class)
- Effective date: post-migration
- Regression suite: linked, last run date
- Prompt version: v23 (post-migration tuning)
- Rollback path: prompt v22 known-good
What to do this week
Pull your post-migration agent transcripts, run the regression diff against pre-migration baseline, prioritize prompt updates for the top three regression patterns, and configure spend alerts before next month’s bill.