The problem queue has 280 open records, the average age is 14 months, and the team’s quarterly review skips it because nobody can summarize what is actually happening. Problem Management does not fail because people refuse to do it; it fails because the process never had explicit creation criteria, named owners, or a way to measure whether the work prevents future incidents. The patterns below address each in turn.
Creating a Problem
Not every incident becomes a problem. Define explicit criteria: a repeated pattern (three or more matching incidents in 30 days), a high-impact one-off (severity 1 or 2 with business service impact), or an unresolved incident with no known cause. Without criteria, problem count explodes, RCA loses focus, and the queue becomes a graveyard for tickets the team did not want to close.
Auto-create problem when:
- 3+ incidents on same ci_class+category+subcategory in 30 days
- severity == 1 with business_critical_service flag
- incident closed without root cause and impact_assessed > threshold
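The three rules above can be sketched as a single decision function. This is a minimal plain-JavaScript illustration over in-memory objects, not a platform business rule: every field name (ciClass, openedAt, businessCriticalService, impactAssessed, rootCause) and the impactThreshold parameter are assumptions made for the example, and recentIncidents is assumed to include the incident being evaluated.

```javascript
// Sketch of the auto-creation rules. All field names are hypothetical.
var DAY_MS = 24 * 60 * 60 * 1000;

// Incidents "match" when CI class, category, and subcategory are all equal.
function patternKey(inc) {
    return [inc.ciClass, inc.category, inc.subcategory].join('|');
}

// Returns the name of the first rule that fires, or null if none does.
// Timestamps (now, openedAt) are milliseconds since the epoch.
function problemCreationReason(incident, recentIncidents, now, impactThreshold) {
    var key = patternKey(incident);
    var windowStart = now - 30 * DAY_MS;
    var matches = recentIncidents.filter(function (other) {
        return patternKey(other) === key &&
            other.openedAt >= windowStart && other.openedAt <= now;
    });
    if (matches.length >= 3) {
        return 'repeated_pattern';
    }
    if (incident.severity === 1 && incident.businessCriticalService) {
        return 'high_impact';
    }
    if (incident.state === 'closed' && !incident.rootCause &&
            incident.impactAssessed > impactThreshold) {
        return 'unexplained';
    }
    return null;
}
```

Returning a reason rather than a boolean lets the created problem record why it exists, which makes the criteria auditable later.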
RCA Frameworks
Pick one framework per problem type and stick with it through the investigation. Five Whys for linear causal chains. Ishikawa (fishbone) for systemic issues with multiple contributing factors. Kepner-Tregoe for complex multi-cause incidents requiring structured comparison. Switching frameworks mid-investigation produces a record nobody can review. Record the RCA on the problem with the framework name and the reasoning at each step, not just the final conclusion.
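One way to keep the framework name and step-by-step reasoning on the record is to validate the RCA before the problem can move forward. The sketch below is a hypothetical shape for such a record in plain JavaScript; the framework identifiers and the fields framework, steps, finding, reasoning, and conclusion are assumptions for illustration, not platform fields.

```javascript
// Hypothetical identifiers for the three frameworks named above.
var RCA_FRAMEWORKS = ['five_whys', 'ishikawa', 'kepner_tregoe'];

// Returns a list of validation errors; an empty list means the RCA is reviewable.
function validateRca(rca) {
    var errors = [];
    var steps = rca.steps || [];
    if (RCA_FRAMEWORKS.indexOf(rca.framework) === -1) {
        errors.push('unknown framework: ' + rca.framework);
    }
    if (steps.length === 0) {
        errors.push('no reasoning steps recorded');
    }
    steps.forEach(function (step, i) {
        // Each step needs both what was found and why it led to the next step.
        if (!step.finding || !step.reasoning) {
            errors.push('step ' + (i + 1) + ' is missing a finding or its reasoning');
        }
    });
    if (!rca.conclusion) {
        errors.push('no conclusion recorded');
    }
    return errors;
}
```

Because the record carries exactly one framework value, a mid-investigation switch shows up as an explicit change rather than a silent shift in the narrative.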
Known Errors
A known error is a problem with a documented workaround and a pending permanent fix. Publish known errors to the knowledge base so service desk can resolve repeated tickets without re-investigating. The known error record links to the problem (cause), the workaround (KB article), and the change record that will deliver the fix. Service desk’s first action on a matching incident is to apply the workaround and link the incident to the known error, not to redo the investigation.
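// Suggest matching known errors on the incident form.
// Returns a queried GlideRecord; the caller iterates it with next().
function suggestKnownErrors(shortDesc, ciSysId) {
    var ke = new GlideRecord('problem');
    ke.addQuery('known_error', true);
    ke.addQuery('cmdb_ci', ciSysId);
    // Use addQuery with CONTAINS instead of concatenating user text into an
    // encoded query, so a '^' in the description cannot alter the query.
    ke.addQuery('short_description', 'CONTAINS', String(shortDesc || '').substring(0, 40));
    ke.setLimit(5);
    ke.query();
    return ke;
}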
Linking Incidents
Every new incident should search for a matching open problem at creation and again at resolution. Configure the incident form to suggest matches based on CI, category, and short-description similarity. An incident “resolved via problem” means the fix is in the pipeline through the linked change, not that the incident is unresolved. The link is what produces the prevention metric later.
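The matching described above can be approximated with a simple weighted score. The following is a sketch in plain JavaScript over in-memory objects, assuming token-overlap (Jaccard) similarity for short descriptions; the weights (0.5 for CI, 0.2 for category, 0.3 for text), the 0.3 threshold, and all field names are illustrative assumptions that would need tuning against real data.

```javascript
// Lowercased word tokens longer than two characters.
function tokens(text) {
    return String(text || '').toLowerCase().split(/[^a-z0-9]+/)
        .filter(function (t) { return t.length > 2; });
}

// Jaccard similarity: shared distinct tokens / all distinct tokens.
function jaccard(a, b) {
    var seen = Object.create(null);
    var shared = 0;
    var union = 0;
    a.forEach(function (t) {
        if (!seen[t]) { seen[t] = 1; union++; }
    });
    b.forEach(function (t) {
        if (seen[t] === 1) { seen[t] = 2; shared++; }
        else if (!seen[t]) { seen[t] = 3; union++; }
    });
    return union === 0 ? 0 : shared / union;
}

// Hypothetical weights: same CI counts most, then text, then category.
function scoreMatch(incident, problem) {
    var score = 0;
    if (incident.ci && incident.ci === problem.ci) { score += 0.5; }
    if (incident.category && incident.category === problem.category) { score += 0.2; }
    score += 0.3 * jaccard(tokens(incident.shortDescription), tokens(problem.shortDescription));
    return score;
}

// Returns up to `limit` open problems scoring at or above `threshold`, best first.
function suggestProblems(incident, problems, threshold, limit) {
    return problems
        .filter(function (p) { return p.state === 'open'; })
        .map(function (p) { return { number: p.number, score: scoreMatch(incident, p) }; })
        .filter(function (s) { return s.score >= threshold; })
        .sort(function (x, y) { return y.score - x.score; })
        .slice(0, limit);
}
```

Running the same function at creation and again at resolution fits the two search points: the resolution notes often add vocabulary that the original short description lacked.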
Measuring Value
Problem Management’s value is prevention, which is hard to measure directly. Track four proxies: incidents avoided (the drop in incidents matching a closed problem’s pattern, comparing equal periods before and after the fix), mean time to known error (from problem creation to documented workaround), open problem age distribution (the long tail is where management attention belongs), and preventive change closure rate (percentage of closed problems whose linked change was actually delivered). Publish the results quarterly to operations leadership.
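The four measures can be computed from exported problem and incident data. The sketch below is plain JavaScript with assumed field names (createdAt, knownErrorAt, closedAt, changeDelivered, all timestamps in milliseconds) and assumed age buckets of 90 and 180 days. It estimates incidents avoided as the difference between matching incidents in equal windows before and after the fix, which is one reasonable interpretation rather than a standard definition.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Estimated incidents avoided: matching incidents in the window before the
// fix minus matching incidents in the equal window after it (never negative).
function incidentsAvoided(matchingTimestamps, fixedAt, windowDays) {
    var w = windowDays * DAY_MS;
    var before = matchingTimestamps.filter(function (t) {
        return t >= fixedAt - w && t < fixedAt;
    }).length;
    var after = matchingTimestamps.filter(function (t) {
        return t >= fixedAt && t < fixedAt + w;
    }).length;
    return Math.max(0, before - after);
}

// Mean days from problem creation to documented workaround.
function meanDaysToKnownError(problems) {
    var withKe = problems.filter(function (p) { return p.knownErrorAt != null; });
    if (withKe.length === 0) { return null; }
    var totalMs = withKe.reduce(function (sum, p) {
        return sum + (p.knownErrorAt - p.createdAt);
    }, 0);
    return totalMs / withKe.length / DAY_MS;
}

// Counts of open problems by age bucket (bucket boundaries are assumptions).
function openAgeDistribution(problems, now) {
    var buckets = { under90: 0, from90to180: 0, over180: 0 };
    problems.forEach(function (p) {
        if (p.closedAt != null) { return; }
        var ageDays = (now - p.createdAt) / DAY_MS;
        if (ageDays < 90) { buckets.under90++; }
        else if (ageDays <= 180) { buckets.from90to180++; }
        else { buckets.over180++; }
    });
    return buckets;
}

// Share of closed problems whose linked change was actually delivered.
function preventiveChangeClosureRate(problems) {
    var closed = problems.filter(function (p) { return p.closedAt != null; });
    if (closed.length === 0) { return null; }
    var delivered = closed.filter(function (p) { return p.changeDelivered === true; });
    return delivered.length / closed.length;
}
```

The functions return null instead of zero when there is nothing to measure, so a quarterly report can distinguish "no data yet" from "no prevention".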
Common Failure Modes
Problems opened with no assigned owner age forever. Make owner required at creation and route problems still unassigned after 7 days to a triage role.
RCA recorded as a free-text narrative rather than structured data means the patterns are not queryable across problems. Use structured contributing-factor fields and let the narrative be supporting context.
Known errors published to the KB without expiration outlive their usefulness. When the permanent fix ships, the workaround should retire automatically; otherwise users keep applying outdated workarounds.
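Two of these safeguards reduce to simple filters that a scheduled job could run. The sketch below uses plain JavaScript over in-memory objects; the field names (owner, state, createdAt, changeState, workaroundArticle, workaroundRetired) and the 'delivered' state value are hypothetical stand-ins for whatever the instance actually records.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Open problems that have had no owner for more than maxDays: route to triage.
function problemsNeedingTriage(problems, now, maxDays) {
    return problems.filter(function (p) {
        return p.state === 'open' && !p.owner &&
            (now - p.createdAt) > maxDays * DAY_MS;
    });
}

// Workaround articles whose permanent fix has shipped and which are
// still published: these are due for retirement.
function workaroundsToRetire(knownErrors) {
    return knownErrors
        .filter(function (ke) {
            return ke.changeState === 'delivered' &&
                ke.workaroundArticle && !ke.workaroundRetired;
        })
        .map(function (ke) { return ke.workaroundArticle; });
}
```

Tying retirement to the change state rather than to a fixed expiry date means the workaround stays available exactly as long as the fix is still pending.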
What Changed in 2026
Now Assist’s RCA assist (where licensed) drafts the contributing-factor analysis from related incidents and worklogs. The output is editable and rarely correct on the first pass; treat it as a productivity aid for the human investigator, not a replacement. The framework selection and final conclusion remain human decisions.
Implementation Sequence
Establish creation criteria and the auto-creation rule first. Add RCA framework templates next. Then wire up the known-error-to-KB integration. Add the prevention metric last, once enough closed problems exist to measure against. Trying to bootstrap all of these together in week one produces a process the team treats as overhead.
What to do this week: pull every open problem older than six months and either assign an owner with a 30-day SLA or close it as unactionable with a documented reason. That triage clears the queue for actual work.