[object Object]

Versioning

Prompts are code. Version them in Git. Model pins explicit (not “latest”). Fine-tunes versioned. Deployment rollbacks possible. This discipline prevents “why did the agent start failing yesterday” post-mortems.

Pin model versions explicitly: claude-sonnet-4-7-20260315, not claude-sonnet-latest. Anthropic, OpenAI, and Google all silently update the unversioned aliases, and the behavior change shows up in production. Store prompts as .md or .yaml files in the same repo as the code that calls them, with semver tags. CI enforces that no prompt change deploys without an evaluation run. Rollback is a Git revert plus a canary deploy — minutes, not hours.

Prompt Management

Prompt management platforms (Langfuse, Portkey, custom) catalog prompts, version them, enable A/B testing, track performance. Beats copy-pasting prompts in code repos.

Choose the layer that matches your scale. Under 20 prompts and one team: Git is enough. Above that, a dedicated platform earns its keep — Langfuse for self-host, LangSmith for LangChain shops, Portkey for multi-provider gateway. Key features to demand: variable substitution, environment-specific overrides (dev/staging/prod), A/B labels, traffic splitting, audit log of who changed what when. Couple with a CI gate that blocks deploys when the new prompt version’s eval scores regress beyond a threshold.

Evaluation Pipelines

Nightly eval runs against regression set. Alert on regression. Model upgrade evaluation before production. Evaluation infrastructure is ongoing investment, not a one-time build.

A working eval pipeline has four pieces. A frozen golden dataset of 200-1,000 representative inputs with reference outputs or rubrics. A runner (ragas, promptfoo, LangSmith eval, custom) that executes prompts against the dataset on schedule. A scorer that combines deterministic checks (exact match, regex, JSON schema) with LLM-as-judge for nuanced criteria. A diff view that surfaces regressions per prompt version. Run nightly against the live production prompt and on every PR for changed prompts.

# .promptfoo/eval.yaml
prompts: [prompts/case_summary_v3.md]
providers: [anthropic:claude-sonnet-4-7-20260315]
tests: tests/case_summary_golden.yaml
assert:
  - type: llm-rubric
    value: "Captures customer issue, action taken, next step"
  - type: cost
    threshold: 0.02

Incident Response

On-call rotation for agent issues. Playbook for common failures — cost spike, hallucination cluster, retrieval failure. Post-incident review with root cause and action items. Professionalize or suffer.

Six common AI incidents with first-five-minute responses. Cost spike: rate-limit at gateway, identify offending prompt, throttle or rollback. Hallucination cluster: check knowledge-base freshness, verify embeddings haven’t drifted, inspect retrieval scores. Tool-call failure: confirm upstream tool schema unchanged, check auth tokens. Model regression: pin to previous version, file vendor ticket. Latency spike: check provider status, route fallback. PII leak: kill switch, preserve evidence, page legal. Run a quarterly tabletop exercise on each scenario.

Common Failure Modes

Five recurring patterns. No prompt registry, prompts copied across services that drift apart. No eval coverage, regressions caught by customers. Model alias instead of pinned version, behavior changes silently. No cost ceilings, runaway loops surface on the invoice. No on-call rotation, the original developer is the de facto pager — until they leave.

What to Do This Week

Pin every model reference in your codebase to an explicit dated version and add a CI check that blocks PRs introducing the unversioned alias.

[object Object]
Share