Agent Evaluation with OpenTelemetry

Why OpenTelemetry

OTel is the vendor-neutral observability standard, governed by the CNCF. The OpenTelemetry GenAI semantic conventions (gen_ai.* attributes) reached stable status in late 2025 and are now the canonical schema for AI workload traces. Salesforce Agentforce 3 adopted them for session tracing; Anthropic, OpenAI, and Vertex SDKs ship native OTel exporters; LangSmith, Langfuse, Arize, and Weights & Biases all consume the same schema.

Using OTel means your agent telemetry survives vendor migrations. Switch from Agentforce to a custom LangGraph build and your dashboards still resolve — same span names, same attributes, same percentile math.

What to Trace

Emit one root span per conversation or session. Add child spans for topic or intent classification, retrieval operations (one span per query, with attributes for store, top_k, and scores), each tool call (tool name, args hash, result status, latency), each model invocation (model name, prompt token count, completion token count, cost), guardrail evaluations, and final response generation. The parent-child relationships let any OTel-compatible UI render the reasoning tree.

from opentelemetry import trace

# Without a configured SDK TracerProvider, this tracer is a no-op.
tracer = trace.get_tracer("crm.agent")

# Root span: one per conversation/session.
with tracer.start_as_current_span("agent.session") as session:
    session.set_attribute("gen_ai.system", "anthropic")
    session.set_attribute("gen_ai.request.model", "claude-sonnet-4.7")
    # Child span: one per retrieval query.
    with tracer.start_as_current_span("agent.retrieve") as r:
        r.set_attribute("retrieval.store", "salesforce.kb")
        r.set_attribute("retrieval.top_k", 5)
        # ... do retrieval
    # Child span: model invocation, with token usage taken from the response.
    with tracer.start_as_current_span("agent.generate") as g:
        g.set_attribute("gen_ai.usage.input_tokens", 12450)
        g.set_attribute("gen_ai.usage.output_tokens", 312)
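
The tool-call attributes listed earlier (tool name, args hash, result status, latency) can be computed with a small wrapper. What follows is a minimal sketch rather than a prescribed implementation: `run_tool`, `args_hash`, and the `agent.tool.*` attribute names are hypothetical, and hashing the canonical JSON of the arguments is just one assumed way to keep raw argument values off the span.

```python
import hashlib
import json
import time


def args_hash(args: dict) -> str:
    """Fingerprint tool arguments so the span never carries raw values."""
    # Sorted keys and fixed separators make the hash independent of dict order.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def run_tool(name: str, args: dict, fn):
    """Call fn(**args) and return (result, attributes) for a tool-call span.

    Attribute names under agent.tool.* are hypothetical, chosen for this sketch.
    """
    start = time.perf_counter()
    try:
        result = fn(**args)
        status = "ok"
    except Exception:
        # The sketch swallows the error and reports it via the status attribute.
        result = None
        status = "error"
    latency_ms = (time.perf_counter() - start) * 1000.0
    attributes = {
        "agent.tool.name": name,
        "agent.tool.args_hash": args_hash(args),
        "agent.tool.status": status,
        "agent.tool.latency_ms": latency_ms,
    }
    return result, attributes
```

Inside a `with tracer.start_as_current_span("agent.tool") as span:` block, the returned dictionary can be attached in one call with `span.set_attributes(attributes)`.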

Metrics to Emit

gen_ai.client.token.usage (histogram, by model and token type: input or output).
gen_ai.client.operation.duration (histogram, p50/p95/p99 per operation type).
agent.tool.call.count (counter, by tool and status).
agent.session.outcome (counter, by outcome class — resolved, escalated, abandoned).
agent.cost.usd (counter, by agent and tenant).

These aggregate into dashboards (Grafana, Datadog, Honeycomb, New Relic) for operational insight. Alert on p95 latency regression, cost anomaly, error rate spike, and outcome-mix drift.
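
As a rough illustration of emitting these metrics for a single model invocation, the sketch below records onto injected instruments. `record_model_call` and `RecordingInstrument` are hypothetical names, the `agent` and `tenant` attribute keys are assumptions, and the stand-in class exists only so the example runs without an OTel SDK installed.

```python
class RecordingInstrument:
    """Stand-in with the same record()/add() shape as OTel instruments."""

    def __init__(self):
        self.points = []

    def record(self, amount, attributes=None):
        self.points.append((amount, dict(attributes or {})))

    # Counters use add(); the stand-in stores both kinds of point the same way.
    add = record


def record_model_call(token_hist, duration_hist, cost_counter, *,
                      model, input_tokens, output_tokens, seconds, cost_usd,
                      agent, tenant):
    """Emit one model invocation's metrics onto the injected instruments."""
    base = {"gen_ai.operation.name": "chat", "gen_ai.request.model": model}
    # One histogram point per direction, distinguished by gen_ai.token.type.
    token_hist.record(input_tokens, {**base, "gen_ai.token.type": "input"})
    token_hist.record(output_tokens, {**base, "gen_ai.token.type": "output"})
    duration_hist.record(seconds, base)
    # Keep counter attributes low-cardinality: agent and tenant, never user IDs.
    cost_counter.add(cost_usd, {"agent": agent, "tenant": tenant})
```

With the OpenTelemetry Python API, the real instruments would come from `metrics.get_meter("crm.agent")` via `create_histogram("gen_ai.client.token.usage", unit="{token}")`, `create_histogram("gen_ai.client.operation.duration", unit="s")`, and `create_counter("agent.cost.usd")`; their `record` and `add` methods take the same amount and attributes arguments.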

Evaluation Hooks

Span attributes feed offline evaluation. Tag spans with the conversation’s eventual outcome (resolved at 7 days, escalated at hour 2, customer satisfied) so you can correlate intermediate signals (retrieval scores, tool failures, guardrail trips) with downstream business results. The evaluation pipeline pulls traces, joins to outcomes, computes per-cohort accuracy, and writes results back as another span — closing the loop.
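
The join step might look like the sketch below, under some loud assumptions: traces have already been flattened to one dictionary per session, outcomes arrive as a separate mapping keyed by session ID, and resolution rate stands in for "accuracy". The function name, field names, and sample records are all invented for illustration.

```python
from collections import defaultdict


def resolution_rate_by_cohort(traces, outcomes, cohort_key):
    """Join traces to outcomes by session ID; return resolved share per cohort.

    traces:   list of dicts with "session_id" plus span-derived signals
    outcomes: dict mapping session_id -> outcome class ("resolved", ...)
    """
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for trace in traces:
        outcome = outcomes.get(trace["session_id"])
        if outcome is None:
            continue  # outcome not known yet; skip rather than guess
        cohort = trace[cohort_key]
        totals[cohort] += 1
        if outcome == "resolved":
            resolved[cohort] += 1
    return {cohort: resolved[cohort] / totals[cohort] for cohort in totals}


# Made-up sample data: did a tool failure during the session change the outcome?
traces = [
    {"session_id": "s1", "tool_failed": False},
    {"session_id": "s2", "tool_failed": True},
    {"session_id": "s3", "tool_failed": True},
    {"session_id": "s4", "tool_failed": False},  # no outcome recorded yet
]
outcomes = {"s1": "resolved", "s2": "escalated", "s3": "resolved"}
rates = resolution_rate_by_cohort(traces, outcomes, "tool_failed")
```

Sessions without a recorded outcome are left out of both numerator and denominator, so late-arriving outcomes (such as "resolved at 7 days") do not drag a cohort's rate down prematurely.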

Sampling

Full tracing on every conversation is expensive at scale (storage, indexing, network). Sampling strategy:

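100% of traces containing errors or exceptions. Errors are only known after the fact, so this needs tail-based sampling in the Collector; a head-based ParentBased sampler decides at the root span, before any error has occurred.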
100% during deploys and the first 24 hours after a model version change.
10% of happy-path traffic, which at high volume is usually enough for trend monitoring.
100% of customer-flagged conversations and a 1% audit sample retained indefinitely.

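Use the OTel Collector's tail sampling processor for outcome-aware retention: keep traces that ended in escalation or low CSAT. Tail sampling only sees spans that reach the Collector, so the SDK must export those traces in full; a trace already dropped by head sampling cannot be recovered.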
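
The retention rules above can be read as a single decision made once a trace is complete. In practice that decision would live in the Collector's tail sampling configuration; the Python sketch below only illustrates the logic. The `retention_reason` function, the summary-dictionary fields, and the reason strings are hypothetical, and the trace ID comparison is an assumed way to get a deterministic 10% sample, similar in spirit to ratio-based samplers.

```python
KEEP_RATIO = 0.10  # happy-path share from the sampling list
_BOUND = round(KEEP_RATIO * (1 << 64))


def retention_reason(trace):
    """Return why a finished trace is kept, or None if it can be dropped.

    `trace` is a hypothetical summary dict built after the conversation ends.
    """
    if trace.get("error"):
        return "error"
    if trace.get("customer_flagged"):
        return "customer_flagged"
    if trace.get("outcome") == "escalated":
        return "escalated"
    if trace.get("in_deploy_window"):
        return "deploy_window"
    # Deterministic ratio sample: compare the low 64 bits of the trace ID
    # against a bound, so every evaluator reaches the same decision for a trace.
    if (trace["trace_id"] & 0xFFFFFFFFFFFFFFFF) < _BOUND:
        return "happy_path_sample"
    return None
```

A low-CSAT rule would slot in next to the escalation check once a score threshold has been agreed on; none is assumed here.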

Common Failure Modes

Logging prompts in span attributes (PII leak to your APM vendor) — use attribute filters or hash content.
Cardinality explosion from per-user attributes: move those attributes to a backend built for high cardinality, or aggregate them before export.
Sampling that drops the rare failures you actually need.
No correlation between spans and audit logs — instrumentation must share session IDs end to end.
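
For the first and last items, one possible shape is to describe a prompt without storing it and to stamp the same session ID on every span and audit record. This is a sketch under assumptions: `redacted_prompt_attributes` and the `agent.prompt.*` attribute names are hypothetical, and a keyed hash (HMAC) is swapped in for a plain hash because short or templated prompts can be guessed from an unkeyed digest.

```python
import hashlib
import hmac


def redacted_prompt_attributes(prompt: str, session_id: str, key: bytes) -> dict:
    """Build span attributes that describe a prompt without containing it."""
    digest = hmac.new(key, prompt.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        # The same session ID should also appear on the matching audit-log entry.
        "session.id": session_id,
        "agent.prompt.hmac_sha256": digest,
        "agent.prompt.length": len(prompt),
    }
```

Identical prompts still produce identical digests under the same key, so duplicate detection and grouping keep working in the APM tool while the text itself never leaves your systems.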

What to Do This Week

If your agents emit traces but no one looks at them, schedule a weekly trace-review on the worst 1% of conversations. The instrumentation pays back the moment it changes a decision.
