Mean time to resolve is the metric every executive asks for and the metric most likely to mislead them. Optimize it without a counterweight and you will get a beautiful trendline, a frustrated agent population, and a backlog that nobody is looking at because it does not appear in the dashboard the executive is reading.

What MTTR actually measures

MTTR is an average over a population of resolved incidents in a window. That single sentence contains all of MTTR’s problems:

  • Average — sensitive to outliers, hides distribution shape.
  • Resolved — counts only what reached resolution. Says nothing about what is still open.
  • Population in a window — a slow week of throughput with a normal mix of resolution times will look identical to a fast week of throughput with a backlog growing in the corner.

A team can drive MTTR down by closing easy tickets fast and leaving hard tickets open. Both halves of that sentence are bad. The metric, in isolation, rewards the first half and ignores the second.
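
A small worked example makes the incentive concrete. The sketch below uses made-up resolution times in hours (illustrative data only) and a hypothetical helper named mean. It compares a team that resolves everything with a team that resolves only the easy tickets.

```javascript
// Hypothetical data: resolution times in hours for six incidents.
var easy = [1, 2, 2, 3];
var hard = [40, 72];

// Arithmetic mean; returns NaN for an empty list.
function mean(values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) { sum += values[i]; }
    return values.length ? sum / values.length : NaN;
}

// Team A resolves everything; team B resolves only the easy tickets.
var mttrAll = mean(easy.concat(hard));   // 120 hours over 6 incidents
var mttrEasyOnly = mean(easy);           // 8 hours over 4 incidents
var leftOpen = hard.length;              // incidents the metric never sees

console.log('MTTR, everything resolved: ' + mttrAll + 'h');
console.log('MTTR, hard tickets left open: ' + mttrEasyOnly + 'h (' + leftOpen + ' still open)');
```

The second team reports a tenfold better MTTR while carrying two unresolved incidents that appear nowhere in the average.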

The counterweights you actually need

To make MTTR honest, pair it with three companions:

  • Incidents resolved per agent per day — throughput. If MTTR is dropping and throughput is dropping faster, agents are gaming the metric or the easy incidents are being deflected elsewhere.
  • Aging open population — incidents older than 7, 14, 30 days. If MTTR is improving while aging open grows, you are deferring hard work.
  • Reopen rate — incidents that closed and then reopened within 14 days. Fast closure that does not stick is worse than slow closure that does.

These are not vanity metrics. They are the antibodies that prevent MTTR from getting gamed.

Build the indicator set in Performance Analytics

Build the four indicators as a coherent set, not as separate dashboards. The set tells a story; the individuals do not.

Indicator: MTTR (Incident)
  Definition: AVG(resolved_at - opened_at) WHERE state = 6 in [period]
  Aggregation: Daily
  Breakdown by: priority

Indicator: Incidents resolved per agent per day
  Definition: COUNT(incident) WHERE state = 6 in [period] / COUNT_DISTINCT(assigned_to)
  Aggregation: Daily
  Breakdown by: assignment_group

Indicator: Aging open incidents (>7d)
  Definition: COUNT(incident) WHERE state IN (1..5) AND opened_at < NOW() - 7 days
  Aggregation: Daily snapshot
  Breakdown by: assignment_group

Indicator: Reopen rate (14-day window)
  Definition: COUNT(incident) WHERE reopened_at - last_resolved_at < 14 days
             / COUNT(incident) WHERE resolved_at in [period]
  Aggregation: Daily
  Breakdown by: assignment_group

Plot them on the same dashboard. If only one is allowed on the executive deck, make it aging open — it is the metric most resistant to gaming and most predictive of customer pain.
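
To check the four definitions against each other before configuring anything, it can help to compute them over a handful of records by hand. The following is a minimal sketch in plain JavaScript, not Performance Analytics configuration. The record shape (openedAt, resolvedAt, reopenedAt, assignedTo), the function name indicatorSet, and the sample data are all assumptions made for illustration.

```javascript
// Hypothetical in-memory records; times are hours since an arbitrary start.
// resolvedAt is null while the incident is open. reopenedAt is set when a
// resolved incident was later reopened; for simplicity the sketch still
// counts such an incident as resolved in the period.
var incidents = [
    { openedAt: 0,   resolvedAt: 4,    reopenedAt: null, assignedTo: 'ana' },
    { openedAt: 10,  resolvedAt: 12,   reopenedAt: 30,   assignedTo: 'ana' },
    { openedAt: 20,  resolvedAt: 50,   reopenedAt: null, assignedTo: 'raj' },
    { openedAt: 0,   resolvedAt: null, reopenedAt: null, assignedTo: 'raj' },
    { openedAt: 100, resolvedAt: null, reopenedAt: null, assignedTo: 'ana' }
];

function indicatorSet(records, nowHours, periodDays) {
    var resolved = records.filter(function(r) { return r.resolvedAt !== null; });
    var open = records.filter(function(r) { return r.resolvedAt === null; });
    var agents = {};
    var totalHours = 0;
    var reopened = 0;
    resolved.forEach(function(r) {
        agents[r.assignedTo] = true;
        totalHours += r.resolvedAt - r.openedAt;
        // Reopened within 14 days of resolution.
        if (r.reopenedAt !== null && r.reopenedAt - r.resolvedAt < 14 * 24) { reopened++; }
    });
    var agentCount = Object.keys(agents).length;
    return {
        mttrHours: resolved.length ? totalHours / resolved.length : null,
        resolvedPerAgentPerDay: agentCount ? resolved.length / agentCount / periodDays : null,
        agingOpenOver7d: open.filter(function(r) { return nowHours - r.openedAt > 7 * 24; }).length,
        reopenRate: resolved.length ? reopened / resolved.length : null
    };
}

var result = indicatorSet(incidents, 200, 7);
console.log(JSON.stringify(result));
```

Returning the four values as one object mirrors the point above: the numbers are meant to be read together.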

The p50, p90, p99 honest replacement for MTTR

Average is the wrong central tendency for incident resolution time because the distribution is right-skewed. A handful of very long incidents pull the mean upward, and a small drop in those long-tail incidents shows up as a “great month” for MTTR while the typical agent experience did not change.

Replace MTTR with three percentile metrics:

// Background script — compute resolution percentiles
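function percentiles(values, ps) {
    // Sort a copy so the caller's array is left untouched.
    var sorted = values.slice().sort(function(a, b) { return a - b; });
    var out = {};
    ps.forEach(function(p) {
        // Nearest-rank method: the smallest value with at least p percent
        // of the data at or below it. Empty input yields null.
        var rank = Math.ceil(sorted.length * p / 100);
        out['p' + p] = sorted.length ? sorted[Math.max(rank, 1) - 1] : null;
    });
    return out;
}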

var gr = new GlideRecord('incident');
gr.addQuery('state', 6);
gr.addQuery('resolved_at', '>=', gs.daysAgoStart(30));
gr.query();

var durations = [];
while (gr.next()) {
    var open = new GlideDateTime(gr.opened_at.toString()).getNumericValue();
    var done = new GlideDateTime(gr.resolved_at.toString()).getNumericValue();
    durations.push((done - open) / 1000 / 60); // minutes
}
gs.info('p50/p90/p99 (minutes): ' + JSON.stringify(percentiles(durations, [50, 90, 99])));

p50 tells you the typical agent experience. p90 tells you about the slowest tenth of incidents, the ones an agent will actually remember. p99 tells you whether your tail is getting better or worse. Put all three on one chart; all three are required to claim “MTTR improved.”
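
The “great month” effect is easy to reproduce with invented numbers. In the sketch below, two hypothetical months differ only in their single longest incident. The helpers mean, percentile (nearest-rank method), and summarize are illustrative names, and the durations are made up.

```javascript
// Hypothetical resolution times in minutes; only the longest incident differs.
var lastMonth = [30, 35, 40, 45, 50, 55, 60, 70, 90, 3000];
var thisMonth = [30, 35, 40, 45, 50, 55, 60, 70, 90, 600];

function mean(values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) { sum += values[i]; }
    return values.length ? sum / values.length : NaN;
}

// Nearest-rank percentile over a sorted copy of the input.
function percentile(values, p) {
    var sorted = values.slice().sort(function(a, b) { return a - b; });
    var rank = Math.ceil(sorted.length * p / 100);
    return sorted.length ? sorted[Math.max(rank, 1) - 1] : null;
}

function summarize(values) {
    return {
        mean: mean(values),
        p50: percentile(values, 50),
        p90: percentile(values, 90),
        p99: percentile(values, 99)
    };
}

var before = summarize(lastMonth);
var after = summarize(thisMonth);
console.log('last month: ' + JSON.stringify(before));
console.log('this month: ' + JSON.stringify(after));
```

The mean falls by more than two thirds, while p50 and p90 do not move at all. Only p99 registers the change, which is where a tail improvement belongs.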

Where MTTR gaming actually happens

The patterns are predictable enough to detect:

  1. Status hopping — agents move incidents to “resolved” prematurely to stop the clock; the user reopens within hours. Caught by reopen rate.
  2. Quick close on duplicates — every easy incident gets closed as “duplicate” of a parent that itself never resolves cleanly. Caught by examining duplicate-closure rate and the resolution state of the supposed parent.
  3. Reassignment ping-pong — agents bounce a hard incident between groups so each individual group’s MTTR stays clean. Caught by tracking number of reassignments per incident.
  4. Selective deferral — hard incidents get moved to “Awaiting User” or “On Hold” indefinitely. Caught by aging-open metrics that include awaiting-user time, not just active time.

Number 4 is the most common. The defense is to define MTTR to include awaiting-user time, not exclude it. Yes, this is unfair to agents waiting on slow users. Yes, it is the only way to keep the metric honest.
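
The four patterns above can be sketched as simple per-incident flags. This is an illustration in plain JavaScript, not a ServiceNow script: the record fields (reopenedAfterHours, closedAsDuplicate, parentResolved, reassignmentCount, onHoldDays), the function gamingFlags, and every threshold in LIMITS are assumptions to be replaced with values from your own data.

```javascript
// Hypothetical thresholds; tune them to your own data.
var LIMITS = { quickReopenHours: 24, reassignments: 3, onHoldDays: 14 };

// Returns the list of gaming patterns an incident record matches.
function gamingFlags(inc) {
    var flags = [];
    if (inc.reopenedAfterHours !== null && inc.reopenedAfterHours <= LIMITS.quickReopenHours) {
        flags.push('status hopping');
    }
    if (inc.closedAsDuplicate && !inc.parentResolved) {
        flags.push('duplicate of unresolved parent');
    }
    if (inc.reassignmentCount >= LIMITS.reassignments) {
        flags.push('reassignment ping-pong');
    }
    if (inc.onHoldDays >= LIMITS.onHoldDays) {
        flags.push('selective deferral');
    }
    return flags;
}

// Made-up sample records.
var sample = [
    { number: 'INC-A', reopenedAfterHours: 3,    closedAsDuplicate: false, parentResolved: false, reassignmentCount: 0, onHoldDays: 0 },
    { number: 'INC-B', reopenedAfterHours: null, closedAsDuplicate: true,  parentResolved: false, reassignmentCount: 1, onHoldDays: 0 },
    { number: 'INC-C', reopenedAfterHours: null, closedAsDuplicate: false, parentResolved: false, reassignmentCount: 5, onHoldDays: 21 },
    { number: 'INC-D', reopenedAfterHours: null, closedAsDuplicate: false, parentResolved: false, reassignmentCount: 1, onHoldDays: 2 }
];

sample.forEach(function(inc) {
    console.log(inc.number + ': ' + (gamingFlags(inc).join(', ') || 'no flags'));
});
```

A flag is a prompt for a conversation, not proof of gaming; a hard incident can legitimately cross several groups.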

Assignment group tradeoffs

When you publish these metrics broken out by assignment group, two things happen. The good thing: groups whose performance was hiding inside an organization-wide average become visible, and you can target support to the groups that need it. The bad thing: groups compare themselves with each other, and those comparisons are not fair because incident mix differs by group.

The protocol: never compare raw MTTR across groups. Compare each group’s metrics against its own trailing baseline. A group that improved 15 percent against its own history is doing well; a group with a 2-hour MTTR that just got worse by 30 percent needs attention regardless of where it sits in the cross-group ranking.
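
The baseline comparison is simple arithmetic. A minimal sketch follows, assuming a four-week trailing window; the window length, the group names, the MTTR values, and the function changeVsBaseline are all hypothetical.

```javascript
// Percent change of the current value against the mean of a trailing window.
function changeVsBaseline(trailing, current) {
    var sum = 0;
    for (var i = 0; i < trailing.length; i++) { sum += trailing[i]; }
    var baseline = sum / trailing.length;
    return { baseline: baseline, pctChange: (current - baseline) / baseline * 100 };
}

// Hypothetical weekly MTTR in hours per group: four trailing weeks, then this week.
var groups = {
    'Network': { trailing: [20, 21, 19, 20], current: 17 },
    'Service Desk': { trailing: [2, 2, 2, 2], current: 2.6 }
};

Object.keys(groups).forEach(function(name) {
    var g = groups[name];
    var change = changeVsBaseline(g.trailing, g.current);
    console.log(name + ': baseline ' + change.baseline + 'h, change ' + change.pctChange.toFixed(1) + '%');
});
```

In a raw ranking the Service Desk group looks far better than Network. Against their own histories the picture reverses: Network improved by about 15 percent and Service Desk worsened by about 30 percent.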

The category-mix problem

A subtle but real problem: if the category mix in your incident population changes (more password resets, fewer hardware failures), MTTR will shift even with no change in actual team performance. Detect this by tracking MTTR within a stable category basket (a fixed set of categories whose volume is roughly constant) and comparing changes there.

Basket categories:
  - password reset
  - laptop performance
  - software install request
  - access request (standard role)
  - email distribution list change

MTTR over basket = AVG(resolution_time) WHERE category IN (basket)

If basket MTTR is improving, you are genuinely getting faster. If total MTTR is improving and basket MTTR is flat, you are mostly shifting mix.
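
The mix effect can be seen in a few lines. The sketch below reuses the basket categories listed above with invented volumes and resolution times; the helpers repeat, mttr, and basketOnly are hypothetical names.

```javascript
var BASKET = [
    'password reset',
    'laptop performance',
    'software install request',
    'access request (standard role)',
    'email distribution list change'
];

// Build n identical hypothetical incidents.
function repeat(category, hours, n) {
    var out = [];
    for (var i = 0; i < n; i++) { out.push({ category: category, hours: hours }); }
    return out;
}

// Mean resolution time in hours; null for an empty list.
function mttr(records) {
    var sum = 0;
    records.forEach(function(r) { sum += r.hours; });
    return records.length ? sum / records.length : null;
}

function basketOnly(records) {
    return records.filter(function(r) { return BASKET.indexOf(r.category) !== -1; });
}

// Same per-category speed in both months; only the volume of resets changes.
var lastMonth = repeat('password reset', 0.5, 2).concat(repeat('hardware failure', 20, 2));
var thisMonth = repeat('password reset', 0.5, 6).concat(repeat('hardware failure', 20, 2));

console.log('total MTTR:  ' + mttr(lastMonth) + 'h -> ' + mttr(thisMonth) + 'h');
console.log('basket MTTR: ' + mttr(basketOnly(lastMonth)) + 'h -> ' + mttr(basketOnly(thisMonth)) + 'h');
```

Total MTTR nearly halves even though no category got faster, and basket MTTR stays flat. That combination is the signature of a mix shift.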

What to put on the executive dashboard

One chart, four lines:

  1. p50 resolution time
  2. Throughput per agent per day
  3. Aging open count (>7d)
  4. Reopen rate

Add a narrative bullet beside each: what the line shows, what would be concerning, what the team is doing about it. The narrative is the dashboard. The chart is the evidence.

For the related metric-architecture question of breakdown cardinality, see our piece on Performance Analytics breakdowns and cardinality.

UI for the agent’s view

Agents need their own metrics, separately. Show each agent: today’s resolved count, this week’s reopen rate, this week’s aging open assigned to them. Make them visible from the agent workspace landing page. Visibility shifts behavior; concealment lets bad habits compound.

Do not show agent-level metrics on the executive dashboard. The aggregation level matters; mixing agent-level granularity with executive-level rollup creates pressure to game the agent-level number.

Survey-anchored quality metric

The metrics so far measure what happened. None measures whether the user was actually helped. Layer in a short survey, fired on close, with a single question — “Was this resolved to your satisfaction? Yes / No / Partially” — and a free-text follow-up.

The metric to track is not response volume (everyone undersurveys). The metric is the No-and-Partially rate among closed incidents whose survey response arrived within 7 days of closure. Plot it weekly. A rising No-and-Partially rate while MTTR is falling is the loudest signal that fast closure is not the same as good closure.
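
As a rough sketch of the calculation, the code below assumes the rate is taken over closed incidents whose survey response arrived within 7 days of closure, so late and missing responses are left out of the denominator. The record shape (closedDay, respondedDay, answer), the function unhappyRate, and the sample data are hypothetical.

```javascript
// Hypothetical closed incidents; days are counted from an arbitrary start.
// answer is null when no survey response was received.
var closed = [
    { closedDay: 1, respondedDay: 2,    answer: 'Yes' },
    { closedDay: 1, respondedDay: 3,    answer: 'No' },
    { closedDay: 2, respondedDay: 4,    answer: 'Partially' },
    { closedDay: 2, respondedDay: 20,   answer: 'No' },   // arrived too late, excluded
    { closedDay: 3, respondedDay: null, answer: null },   // no response, excluded
    { closedDay: 3, respondedDay: 5,    answer: 'Yes' }
];

// Share of in-window responses that were "No" or "Partially".
function unhappyRate(records, windowDays) {
    var counted = 0;
    var unhappy = 0;
    records.forEach(function(r) {
        if (r.answer === null || r.respondedDay - r.closedDay > windowDays) { return; }
        counted++;
        if (r.answer === 'No' || r.answer === 'Partially') { unhappy++; }
    });
    return counted ? unhappy / counted : null;
}

var rate = unhappyRate(closed, 7);
console.log('No-and-Partially rate: ' + rate);
```

Fixing the response window keeps one week comparable with the next, because late responses cannot drift into an earlier week's number after it has been published.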

Do not over-survey. One survey per closed incident is enough; surveying every interaction creates fatigue and lowers response rate to the point of statistical noise.

Cohort views over time-window views

Most MTTR dashboards show metrics over a calendar window — last 7 days, last 30 days. A subtler view that adds insight: cohort metrics by incident creation week. For each cohort, plot how many were resolved in the first 24 hours, first 7 days, first 30 days.

The shape of the cohort tells you about the resolution funnel in a way time-window metrics never can. A cohort that shows 80 percent resolved in 24 hours and then a long flat tail is a different problem than a cohort that shows 40 percent in 24 hours and gradual draining. The two cohorts could have identical MTTR by coincidence; the operational reality is very different.
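
A cohort table of this kind can be sketched as follows. Days are plain integers from an arbitrary start, a cohort is the week in which an incident was opened, and the record fields (openedDay, resolvedDay), the function cohortFunnel, and the data are assumptions for illustration.

```javascript
// Hypothetical incidents; resolvedDay is null while the incident is open.
var records = [
    // Opened in week 0: fast start, then a stuck tail.
    { openedDay: 0, resolvedDay: 0 }, { openedDay: 1, resolvedDay: 1 },
    { openedDay: 2, resolvedDay: 3 }, { openedDay: 3, resolvedDay: 3 },
    { openedDay: 4, resolvedDay: null },
    // Opened in week 1: slower start, gradual draining.
    { openedDay: 7, resolvedDay: 7 }, { openedDay: 8, resolvedDay: 9 },
    { openedDay: 9, resolvedDay: 14 }, { openedDay: 10, resolvedDay: 16 },
    { openedDay: 11, resolvedDay: 35 }
];

// Group by creation week and count cumulative resolutions at 1, 7, and 30 days.
function cohortFunnel(items) {
    var cohorts = {};
    items.forEach(function(r) {
        var week = Math.floor(r.openedDay / 7);
        if (!cohorts[week]) {
            cohorts[week] = { total: 0, within1d: 0, within7d: 0, within30d: 0 };
        }
        var c = cohorts[week];
        c.total++;
        if (r.resolvedDay === null) { return; }
        var age = r.resolvedDay - r.openedDay;
        if (age <= 1) { c.within1d++; }
        if (age <= 7) { c.within7d++; }
        if (age <= 30) { c.within30d++; }
    });
    return cohorts;
}

var funnel = cohortFunnel(records);
Object.keys(funnel).forEach(function(week) {
    console.log('week ' + week + ': ' + JSON.stringify(funnel[week]));
});
```

The counts are cumulative, so each cohort reads left to right as a funnel. A flat tail, as in week 0 here, shows up as identical 24-hour, 7-day, and 30-day counts that never reach the total.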

Tradeoffs to be honest about

Tracking four metrics instead of one is more work and more meetings. Some leadership audiences will resist. The blunt response: the alternative is being misled by a single average that hides every dynamic that matters. There is no cheaper honest answer.

Honest metrics also surface ugly truths. Be prepared for the first month after publishing aging-open metrics to show numbers nobody wants to see. That is the point; that is what you needed to see.

Bottom line

  • MTTR alone is gameable. Pair it with throughput, aging open, and reopen rate. Refuse to publish it alone.
  • Use p50, p90, p99 instead of mean. The distribution is right-skewed and the mean lies.
  • Include awaiting-user time in MTTR calculations. Excluding it lets selective deferral hide.
  • Compare groups against their own baseline, not against each other. Cross-group comparisons mislead because incident mix differs.
  • Aging open is the single most resistant-to-gaming metric on the list. If forced to publish one number, publish that one.