The RCA document is six pages of timeline narrative, ends with “ongoing investigation,” and is referenced exactly once during the post-mortem meeting. Three months later the same incident recurs and nobody connects them. Problem Management’s value depends on the RCA template producing artifacts that downstream processes can consume — not just well-written prose.

Templates per problem type

A single RCA template forces network outages, code defects, and human errors into the same shape. Define three to five templates keyed off problem category. Each template asks the questions that matter for that class. A code defect template asks for the offending commit; a network outage template asks for affected segments and BGP state. Generic templates produce generic RCAs.

Required fields, not optional

The default Problem record has many fields, but few are required. At minimum, make these mandatory: trigger event, contributing factors (multi-row), detection method, time-to-detect, time-to-resolve, recurrence likelihood, and recommended preventive changes (linked Change records). A Problem closed with these fields blank is not actually closed; it is filed.

# Illustrative policy file, not a built-in platform configuration
required_fields_for_close:
  - trigger_event
  - contributing_factors   # at least one row
  - detection_method
  - preventive_changes     # at least one, OR explicit "no change required" with reason
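
The close-time check in that list can be sketched as a plain function. This is a minimal illustration over an ordinary object, not platform code: the property names mirror the list above, and `no_change_reason` is a hypothetical name for the "no change required" justification.

```javascript
// Returns the names of required fields that are still missing.
// All property names are assumptions; a real instance would map them
// to whatever columns exist on its Problem table.
function missingCloseFields(problem) {
  const isBlank = (v) => v === undefined || v === null || String(v).trim() === '';
  const hasRows = (v) => Array.isArray(v) && v.length >= 1;
  const missing = [];

  if (isBlank(problem.trigger_event)) missing.push('trigger_event');
  if (!hasRows(problem.contributing_factors)) missing.push('contributing_factors');
  if (isBlank(problem.detection_method)) missing.push('detection_method');

  // Either at least one linked change, or an explicit reason for having none.
  const waived = !isBlank(problem.no_change_reason);
  if (!hasRows(problem.preventive_changes) && !waived) missing.push('preventive_changes');

  return missing;
}
```

An empty result means the record may close; anything else is the list to show the author.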

Five Whys is a starting point, not the answer

Five Whys works for linear causal chains and fails for systemic issues. For complex incidents, supplement with a fishbone diagram captured as structured data (categories: People, Process, Technology, External). Store the fishbone categories on the Problem record so they roll up across problems and reveal patterns — “60% of database problems have a Process root cause” is a leadership conversation.
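
Once the fishbone category is a field rather than a drawing, the roll-up is a simple count. A rough sketch, assuming each Problem has been exported as an object with hypothetical `category` and `root_cause_category` properties:

```javascript
// The four fishbone categories named above.
const FISHBONE = ['People', 'Process', 'Technology', 'External'];

// Percentage of problems in one problem category attributed to each
// fishbone category, rounded to whole percent.
function rootCauseShare(problems, problemCategory) {
  const inScope = problems.filter((p) => p.category === problemCategory);
  const share = {};
  for (const bone of FISHBONE) {
    const count = inScope.filter((p) => p.root_cause_category === bone).length;
    share[bone] = inScope.length === 0 ? 0 : Math.round((count * 100) / inScope.length);
  }
  return share;
}
```

Five database problems with three Process root causes would report `Process: 60`, which is the kind of figure the leadership conversation starts from.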

Links over narrative

The temptation is to describe the timeline in prose. Better: link the related incidents (problem_task_rel_incident), the related changes that introduced the issue, the affected CIs (with impact severity), and the runbook entries that did or did not help. The narrative is supporting context; the links are the action surface.

Preventive change is the deliverable

Every Problem closure should produce one of three outcomes: a change record to prevent recurrence, an explicit risk-acceptance with sign-off, or a documented detection improvement. Anything else means the Problem was a documentation exercise. The Performance Analytics indicator that matters is “preventive change closed within 30 days of Problem closure,” not Problem volume.

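// Business rule on the problem table, running before update.
// Assumes 4 is the Closed value of the state field; confirm it against your
// instance's state choice list, since it is not 4 in every state model.
// preventive_changes and no_action_justification are custom fields.
if (current.state.changesTo(4) && // fire only on the transition to closed
    current.preventive_changes.nil() &&
    current.no_action_justification.nil()) {
  gs.addErrorMessage('Closure requires preventive change or risk acceptance.');
  current.setAbortAction(true);
}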
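
The 30-day indicator itself is easy to prototype outside Performance Analytics before building it there. A minimal sketch over exported data, assuming hypothetical `closed_at` dates on both the Problem and its linked changes; Problems closed by risk acceptance should be filtered out by the caller first, since they have no change to measure.

```javascript
// True when every linked preventive change was closed within windowDays
// of the Problem's own closure. A Problem with no linked changes, or with
// a change still open (closed_at === null), counts as not on time.
function preventiveChangeOnTime(problem, windowDays = 30) {
  const changes = problem.preventive_changes || [];
  if (changes.length === 0) return false;
  const deadline = problem.closed_at.getTime() + windowDays * 24 * 60 * 60 * 1000;
  return changes.every((chg) => chg.closed_at !== null && chg.closed_at.getTime() <= deadline);
}

// Fraction of the given Problems whose preventive changes landed on time.
function onTimeRate(problems, windowDays = 30) {
  if (problems.length === 0) return 0;
  const onTime = problems.filter((p) => preventiveChangeOnTime(p, windowDays)).length;
  return onTime / problems.length;
}
```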

Recurrence detection

A Problem closed without recurrence detection is re-opened only when someone happens to notice. Add a scheduled job: for each closed Problem, query incidents in the last 90 days matching the original CI and category. If the count exceeds a threshold, re-open the Problem and notify the original assignee. Recurrence is the test of whether the RCA worked.
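
The matching logic of that job can be sketched independently of the platform. The version below works on plain arrays rather than GlideRecord queries; every property name and the default threshold of 2 are assumptions for illustration. It also adds one condition the description leaves implicit: only incidents opened after the Problem closed are counted, since earlier ones are the original occurrences.

```javascript
// problem:  { number, ci, category, assignee, closed_at: Date }
// incident: { ci, category, opened_at: Date }
// Returns one entry per Problem that should be re-opened.
function findRecurrences(closedProblems, incidents, now, options = {}) {
  const { windowDays = 90, threshold = 2 } = options; // hypothetical defaults
  const windowStart = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const toReopen = [];

  for (const problem of closedProblems) {
    const matches = incidents.filter(
      (inc) =>
        inc.ci === problem.ci &&
        inc.category === problem.category &&
        inc.opened_at.getTime() >= windowStart &&
        inc.opened_at.getTime() > problem.closed_at.getTime()
    );
    // "Exceeds" is read strictly: the count must be greater than the threshold.
    if (matches.length > threshold) {
      toReopen.push({ problem: problem.number, notify: problem.assignee, incidentCount: matches.length });
    }
  }
  return toReopen;
}
```

In a real scheduled job the returned entries would drive the state change and the notification; keeping the matching in a pure function makes the threshold easy to tune against historical data first.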

Common failure modes

RCA written by the on-call who fought the fire: they have the freshest memory but the least objectivity, so have a different person co-author.

RCA blocked on missing data from third parties: capture what you have and mark unknowns explicitly; a Problem that hangs in “investigating” for 90 days is a process failure.

RCA used to assign blame: this destroys the next post-mortem; the template should disallow individual names in causal text.

What changed in 2026

Now Assist’s RCA assist (where licensed) drafts the timeline section from related incident worklogs and Change records. The output is editable and rarely correct on the first pass; treat it as a head-start, not a finished section. Verification of the assist’s claims is on the human author.

What to do this week: pull every Problem closed in the last quarter without a linked preventive change — that is your retroactive cleanup queue.
