
Freshservice IT Operations Management (ITOM) ingests events from monitoring tools (Datadog, Nagios, SCOM, custom webhooks) and creates incidents. Without correlation tuning, a single outage can spawn 200 tickets. With it, the right person gets one ticket and acts on it.

Sources and ingestion

ITOM accepts alerts via:

  • Native integrations (Datadog, New Relic, SolarWinds, Nagios).
  • Email (parse subject and body for severity).
  • Webhook (REST POST with JSON payload).
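For the webhook route, a monitoring script sends an HTTP POST with a JSON body. The sketch below is a minimal Python example using only the standard library. The URL is a placeholder, and the payload field names (title, severity, hostname, source) are hypothetical, so match them to whatever your source mapping actually parses. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical placeholder: use the webhook URL your ITOM instance generates for this source.
WEBHOOK_URL = "https://example.invalid/itom/webhook"


def build_alert_request(title: str, severity: str, hostname: str, source: str) -> urllib.request.Request:
    """Build, but do not send, a JSON POST carrying one alert.

    Field names are illustrative; align them with what your source mapping parses.
    """
    payload = {
        "title": title,
        "severity": severity,  # the source's own severity label, mapped on ingestion
        "hostname": hostname,  # used for CMDB asset matching
        "source": source,
    }
    return urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_alert_request("Disk usage above 90%", "warning", "db-01", "custom-monitor")
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```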

For each source, configure:

  • Severity mapping (for example, the source’s “warning” maps to your “P3”).
  • Asset matching (the source’s hostname maps to a CMDB asset by name or IP).
  • Default group routing (which queue receives the alert).

Without source mapping, every alert lands in a generic queue and has to be routed by hand, which defeats the purpose of automating ingestion.
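Here is a minimal sketch of the three per-source settings applied to one incoming alert, assuming a simple in-memory CMDB. SEVERITY_MAP, SOURCE_DEFAULT_GROUP, and the Asset fields are hypothetical; only the “warning” to “P3” mapping comes from the text. Asset matching tries the hostname against asset names first, then against IPs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Asset:
    name: str
    ip: str
    owner_group: Optional[str]


# Hypothetical mapping for one source; only "warning" -> "P3" comes from the text.
SEVERITY_MAP = {"critical": "P1", "error": "P2", "warning": "P3", "info": "P4"}
SOURCE_DEFAULT_GROUP = "Monitoring Triage"  # hypothetical per-source default queue


def match_asset(host: str, cmdb: List[Asset]) -> Optional[Asset]:
    """Match the source's hostname to a CMDB asset by name first, then by IP."""
    host = host.strip().lower()
    for asset in cmdb:
        if asset.name.lower() == host:
            return asset
    for asset in cmdb:
        if asset.ip == host:
            return asset
    return None


def map_alert(raw: dict, cmdb: List[Asset]) -> dict:
    """Apply severity mapping, asset matching, and default group routing."""
    asset = match_asset(raw["host"], cmdb)
    group = asset.owner_group if asset and asset.owner_group else SOURCE_DEFAULT_GROUP
    return {
        "title": raw["title"],
        "priority": SEVERITY_MAP.get(raw["severity"].lower(), "P4"),
        "asset": asset.name if asset else None,
        "group": group,
    }
```

If the matched asset has an owner group, that group wins; otherwise the alert falls back to the source’s default queue rather than a catch-all.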

Deduplication

Default behavior: identical alerts that arrive within a time window collapse into one. The window is configurable per source and defaults to 5 minutes. For flappy alerts (such as network blips), widen the window to 30 minutes.
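One way the window could work is shown in this Python sketch, which illustrates the idea rather than Freshservice’s actual implementation. It keys alerts on title, source, and asset, and assumes the window is anchored at the first alert of each group; a product may instead extend the window on every repeat, which collapses flapping alerts more aggressively.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (title, source, asset): the dedup identity


def dedupe(alerts: List[dict], window: timedelta = timedelta(minutes=5)) -> List[dict]:
    """Collapse identical alerts that arrive within `window` of a group's first alert.

    Assumes `alerts` is sorted by "time". Anchoring the window at the first
    alert is an assumption, not documented product behavior.
    """
    open_groups: Dict[Key, dict] = {}
    collapsed: List[dict] = []
    for alert in alerts:
        key = (alert["title"], alert["source"], alert["asset"])
        group = open_groups.get(key)
        if group is not None and alert["time"] - group["first_seen"] <= window:
            group["count"] += 1  # duplicate: fold into the existing record
        else:
            group = {**alert, "first_seen": alert["time"], "count": 1}
            open_groups[key] = group
            collapsed.append(group)
    return collapsed


# A flapping link that fires at minutes 0, 2, and 7.
t0 = datetime(2024, 5, 1, 9, 0)
flaps = [{"title": "Packet loss", "source": "nagios", "asset": "sw-core-1",
          "time": t0 + timedelta(minutes=m)} for m in (0, 2, 7)]
print([g["count"] for g in dedupe(flaps)])                         # [2, 1]
print([g["count"] for g in dedupe(flaps, timedelta(minutes=30))])  # [3]
```

Widening the window from 5 to 30 minutes folds the third flap into the first record, which is the effect the flappy-alert setting is after.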

The dedup identity is the combination of alert title, source, and asset. If your monitoring tool sends slightly different titles for the same condition (for example, a timestamp embedded in the title), dedup misses the repeats. Strip these variable elements in the source mapping.
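Stripping variable elements usually means applying a few regular expressions to the title before it becomes part of the dedup key. The patterns in this sketch are hypothetical examples (an ISO-style timestamp and id= or seq= tokens); each source will need its own.

```python
import re
from typing import Tuple

# Hypothetical patterns for variable fragments; tailor them to what each source emits.
VARIABLE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?Z?"),  # e.g. 2024-05-01 10:32:07
    re.compile(r"\b(?:id|seq)=\w+", re.IGNORECASE),               # e.g. seq=8812
]


def normalize_title(title: str) -> str:
    """Remove variable fragments so repeats of one condition share a title."""
    for pattern in VARIABLE_PATTERNS:
        title = pattern.sub("", title)
    return " ".join(title.split())  # collapse whitespace left behind by the removals


def dedup_key(title: str, source: str, asset: str) -> Tuple[str, str, str]:
    """Dedup identity: normalized title plus source plus asset."""
    return (normalize_title(title), source, asset)
```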

Correlation rules

Beyond dedup, correlation groups related alerts into one incident. Example: a database alert, an app server alert, and a load balancer alert that fire at the same time on related assets correlate into one “service degradation” incident.

Build correlation rules on the asset graph: alerts that fire within 10 minutes of each other on assets linked by a “Hosted on” or “Depends on” relationship correlate.
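The rule amounts to grouping alerts over the asset graph. This Python sketch assumes each alert carries an asset name and a time, and that CMDB relationships are available as (asset, relationship, asset) tuples, a hypothetical shape. Alerts link when they are on the same asset or on directly related assets, and union-find makes the grouping transitive so a dependency chain lands in one incident. The pairwise comparison is quadratic, which is fine for illustration but not for high alert volumes.

```python
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, List, Set, Tuple

CORRELATING_RELATIONSHIPS = {"Hosted on", "Depends on"}


def related_pairs(edges: List[Tuple[str, str, str]]) -> Set[FrozenSet[str]]:
    """Unordered asset pairs joined by a correlating relationship."""
    return {frozenset((a, b)) for a, rel, b in edges if rel in CORRELATING_RELATIONSHIPS}


def correlate(alerts: List[dict], edges: List[Tuple[str, str, str]],
              window: timedelta = timedelta(minutes=10)) -> List[List[dict]]:
    """Group alerts on the same or directly related assets that fire within `window`."""
    pairs = related_pairs(edges)
    parent = list(range(len(alerts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(alerts)):
        for j in range(i + 1, len(alerts)):
            a, b = alerts[i], alerts[j]
            linked = a["asset"] == b["asset"] or frozenset((a["asset"], b["asset"])) in pairs
            if linked and abs(a["time"] - b["time"]) <= window:
                parent[find(i)] = find(j)  # merge the two groups

    groups: Dict[int, List[dict]] = {}
    for i, alert in enumerate(alerts):
        groups.setdefault(find(i), []).append(alert)
    return list(groups.values())


# Database, app server, and load balancer alerts a few minutes apart, plus an unrelated printer.
t0 = datetime(2024, 5, 1, 10, 0)
edges = [("app-01", "Depends on", "db-01"), ("lb-01", "Depends on", "app-01")]
alerts = [
    {"asset": "db-01", "time": t0},
    {"asset": "app-01", "time": t0 + timedelta(minutes=3)},
    {"asset": "lb-01", "time": t0 + timedelta(minutes=5)},
    {"asset": "printer-3", "time": t0 + timedelta(minutes=4)},
]
print([sorted(a["asset"] for a in g) for g in correlate(alerts, edges)])
# [['app-01', 'db-01', 'lb-01'], ['printer-3']]
```

The load balancer and database share no direct edge, yet they end up in one group through the app server.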

Without correlation, each layer of the stack files its own ticket and three teams scramble in parallel.

Auto-incident creation

Configure thresholds for auto-incident:

  • P1: critical asset, severity high, auto-incident with paging.
  • P2: critical asset, severity medium, auto-incident without paging.
  • P3: non-critical asset, severity high, queue for review.
  • P4: anything else, log to event stream, no incident.

These thresholds keep alert storms from becoming ticket storms. Reserve paging for P1 only.
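The tier table translates directly into a small decision function. This sketch assumes severity has already been normalized to “high”, “medium”, or something else, and that criticality is a boolean on the asset; the action strings are illustrative labels, not Freshservice setting names.

```python
from typing import Tuple


def classify(asset_critical: bool, severity: str) -> Tuple[str, str]:
    """Map asset criticality plus normalized severity to (priority, action).

    Assumes severity is already "high", "medium", or something else; action
    strings are illustrative labels only.
    """
    severity = severity.lower()
    if asset_critical and severity == "high":
        return ("P1", "auto-incident, page on-call")
    if asset_critical and severity == "medium":
        return ("P2", "auto-incident, no page")
    if not asset_critical and severity == "high":
        return ("P3", "queue for review")
    return ("P4", "log to event stream")
```

Only the first branch pages, so loosening what counts as P2 or P3 never adds pages.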

Routing

Alerts route to groups based on the asset’s owner team. Set the asset’s group field through discovery or manual assignment. Alerts on assets without an owner fall to a default group, and that default group is where alerts go to die.

Audit monthly: the number of assets with no owner should be near zero.
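Routing and the monthly audit can be sketched together. This assumes a hypothetical owners map from asset name to owner group (None or empty when unset) and a fallback queue named Unassigned; both names are illustrative.

```python
from typing import Dict, Optional

DEFAULT_GROUP = "Unassigned"  # hypothetical name for the fallback queue


def route(asset_name: str, owners: Dict[str, Optional[str]]) -> str:
    """Owner group for the asset, or the default group when no owner is recorded."""
    return owners.get(asset_name) or DEFAULT_GROUP


def ownerless_share(owners: Dict[str, Optional[str]]) -> float:
    """Monthly audit metric: fraction of assets with no owner group (target: near zero)."""
    if not owners:
        return 0.0
    return sum(1 for group in owners.values() if not group) / len(owners)
```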

Maintenance windows

When a planned change is in progress, suppress alerts for the affected assets. To configure this, link the affected assets to the change window; alerts on those assets during the window are suppressed automatically (logged, but no incident is created).

Without maintenance window suppression, every planned reboot generates incidents and on-call fatigue rises.
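The suppression check is a time-range test plus a membership test. A minimal sketch, assuming each change window carries a start, an end, and the set of linked asset names; the ChangeWindow class is hypothetical, not a Freshservice object. Suppressed alerts still go to an event log so there is a record.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set


@dataclass
class ChangeWindow:
    start: datetime
    end: datetime
    linked_assets: Set[str]


def is_suppressed(asset: str, at: datetime, windows: List[ChangeWindow]) -> bool:
    """True if the alert time falls inside a change window linked to its asset."""
    return any(w.start <= at <= w.end and asset in w.linked_assets for w in windows)


def handle(alert: dict, windows: List[ChangeWindow], event_log: list) -> str:
    """Suppress (log only) during a linked change window; otherwise keep processing."""
    if is_suppressed(alert["asset"], alert["time"], windows):
        event_log.append(alert)  # logged for the record; no incident
        return "suppressed"
    return "evaluate"  # continue to dedup, correlation, and thresholds
```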

Self-healing automations

For known fixable conditions, trigger an automation instead of an incident. Example: a disk space alert on a log volume triggers a log-rotation script via webhook, and an incident is created only if the script fails.

This requires a runbook automation tool (Rundeck, Ansible Tower) called from the ITOM workflow. The integration has three parts: an outbound webhook, a status check, and conditional incident creation.
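The webhook-out, status-check, conditional-incident flow can be sketched as a polling loop. The three callables are hypothetical stand-ins for your runbook tool’s API and your incident-creation call, and the status strings running, succeeded, and failed are assumptions. A timeout is treated like a failure so a hung job still produces an incident.

```python
import time
from typing import Callable


def self_heal(
    alert: dict,
    trigger_job: Callable[[dict], str],            # hypothetical: fires the outbound webhook, returns a job ID
    job_status: Callable[[str], str],              # hypothetical: "running", "succeeded", or "failed"
    create_incident: Callable[[dict, str], None],  # hypothetical: opens an ITOM incident
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Run the remediation job and open an incident only if it does not succeed."""
    job_id = trigger_job(alert)
    deadline = time.monotonic() + timeout_seconds
    status = job_status(job_id)
    while status == "running" and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        status = job_status(job_id)
    if status != "succeeded":
        # Failure or timeout: fall back to a normal incident with context attached.
        create_incident(alert, f"Automation job {job_id} ended with status {status!r}")
    return status
```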

Reporting

Track:

  • Alert volume per source per day (baseline for tuning).
  • Alerts not creating incidents (usually good: dedup and correlation are working).
  • Mean time to acknowledge (alert to first agent action).
  • Mean time to resolve (alert to incident close).

A rise in alerts not creating incidents is healthy when correlation is working, but concerning if it means something is broken, such as dedup swallowing valid alerts.
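The four metrics are simple aggregations over alert records. This sketch assumes each alert is a dict with source, time, and an incident_id that is None when no incident was created; those field names are hypothetical. Mean time to acknowledge and mean time to resolve both reduce to a mean gap between two timestamps, so one helper serves both: pass (alert time, first agent action) pairs for the first and (alert time, incident close) pairs for the second.

```python
from collections import Counter
from datetime import datetime
from statistics import mean
from typing import List, Optional, Tuple


def volume_per_source_per_day(alerts: List[dict]) -> Counter:
    """Alert counts keyed by (source, calendar day): the tuning baseline."""
    return Counter((a["source"], a["time"].date()) for a in alerts)


def no_incident_ratio(alerts: List[dict]) -> float:
    """Share of alerts that produced no incident (deduped, correlated, suppressed, or below threshold)."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a.get("incident_id") is None) / len(alerts)


def mean_minutes(pairs: List[Tuple[datetime, Optional[datetime]]]) -> Optional[float]:
    """Mean gap in minutes across (start, end) pairs, skipping pairs with no end yet."""
    gaps = [(end - start).total_seconds() / 60 for start, end in pairs if end is not None]
    return mean(gaps) if gaps else None
```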

What to do this week

Pull the last 7 days of alert ingestion and count the duplicates that slipped past deduplication. If they make up more than 20 percent of alerts, tune the dedup window and asset matching for the worst-offending source.
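A rough way to count missed duplicates from an export, sketched in Python. It assumes the export is a time-sorted list of records with title, source, asset, and time fields (hypothetical names), with titles already stripped of variable elements as described under Deduplication. A record counts as a missed duplicate if the same key appeared within the window before it.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (title, source, asset)


def missed_duplicates(records: List[dict], window: timedelta = timedelta(minutes=5)) -> Tuple[float, Counter]:
    """Return (missed-duplicate rate, per-source counts of missed duplicates).

    Records must be sorted by "time". A record is a missed duplicate if a record
    with the same key appeared no more than `window` before it.
    """
    last_seen: Dict[Key, datetime] = {}
    missed: Counter = Counter()
    for record in records:
        key = (record["title"], record["source"], record["asset"])
        previous = last_seen.get(key)
        if previous is not None and record["time"] - previous <= window:
            missed[record["source"]] += 1
        last_seen[key] = record["time"]
    rate = sum(missed.values()) / len(records) if records else 0.0
    return rate, missed
```

To act on the result, call rate, per_source = missed_duplicates(records); if rate exceeds 0.20, per_source.most_common(1) names the source to tune first.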
