
Stop Writing Alert Rules by Hand

You can't predict every failure mode. Static thresholds miss novel incidents and drown on-call in false alarms. Anomaly detection is the way out.

alerting · anomaly-detection · on-call

At some point every engineering team has the same meeting. "We need better alerting." Someone opens a spreadsheet. You list every service. You decide on thresholds. Error rate above X. Latency above Y. CPU above Z. You spend a day writing rules in Prometheus, CloudWatch, or Datadog.

Two weeks later, three of the rules are too noisy and get silenced. Five more never fire because the thresholds are too conservative. And the next production incident is something nobody predicted, so none of the rules cover it.

This cycle repeats every six months. Sometimes every quarter. The alert rules pile up but coverage never feels complete.

The fundamental problem with static thresholds

A static threshold assumes the system behaves the same way all the time. "Alert when error rate exceeds 5%" treats Monday at 9am the same as Sunday at 3am. But your traffic patterns are different. Your error baseline is different. The same error rate might be normal during a traffic spike and catastrophic during off-hours.

Some teams respond by creating time-based rules. "Alert when error rate exceeds 5% during business hours and 2% outside business hours." Now you have two rules per metric, and you still haven't accounted for holidays, deploy windows, or gradual traffic changes as your product grows.
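To see how fast this compounds, here's a minimal sketch of what hand-coded time-window rules look like, in plain Python rather than any vendor's alerting DSL. The thresholds and windows are invented for illustration:

```python
from datetime import datetime

# Invented thresholds for illustration. Each new condition
# (holidays, deploy windows, traffic growth) means another row,
# and this table only covers one metric on one service.
ERROR_RATE_RULES = [
    # (description, time-window predicate, threshold in %)
    ("business hours", lambda t: t.weekday() < 5 and 9 <= t.hour < 18, 5.0),
    ("off hours",      lambda t: not (t.weekday() < 5 and 9 <= t.hour < 18), 2.0),
]

def should_alert(error_rate_pct: float, now: datetime) -> bool:
    """Fire if any rule whose time window matches `now` is breached."""
    return any(
        in_window(now) and error_rate_pct > threshold
        for _, in_window, threshold in ERROR_RATE_RULES
    )
```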

The deeper problem is that static rules require you to predict failure modes in advance. You write a rule after an incident and hope it catches the same kind of failure next time. But production systems find new ways to break. The failures that hurt the most are the ones that don't match any existing rule.

What anomaly detection does differently

Instead of asking "is this above a threshold?", anomaly detection asks "is this different from what's normal for right now?"

It builds a baseline from your actual data. Monday 9am has its own expected range. Sunday 3am has its own. When the observed value deviates significantly from the expected range for that specific time window, that's an anomaly.
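Here's a toy sketch of that idea in Python. To be clear, this isn't Epok's actual model, just the shape of the technique: bucket observations by hour-of-week, then flag values that land more than a few standard deviations from that bucket's mean.

```python
import statistics
from collections import defaultdict
from datetime import datetime

class HourOfWeekBaseline:
    """Toy baseline: one expected range per (weekday, hour) bucket.

    A real detector needs robust statistics, trend handling, and
    cold-start logic; this only shows the shape of the idea.
    """

    def __init__(self, n_sigma: float = 3.0):
        self.n_sigma = n_sigma
        self.samples = defaultdict(list)  # (weekday, hour) -> observed values

    def observe(self, ts: datetime, value: float) -> None:
        self.samples[(ts.weekday(), ts.hour)].append(value)

    def is_anomalous(self, ts: datetime, value: float) -> bool:
        history = self.samples[(ts.weekday(), ts.hour)]
        if len(history) < 2:
            return False  # too little history to call anything abnormal
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9  # guard against flat history
        return abs(value - mean) / stdev > self.n_sigma
```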

This catches two things that static thresholds miss:

  • Novel failure modes. You don't need to predict them. Anything that deviates significantly from normal gets flagged.
  • Context-dependent anomalies. 50 errors per minute at 3am is a disaster. 50 errors per minute at peak traffic is normal background noise. Anomaly detection knows the difference because it learned what normal looks like for each time window.
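Using the toy baseline above, with invented numbers, the same reading of 50 errors per minute gets flagged at 3am and waved through at peak:

```python
from datetime import datetime, timedelta

baseline = HourOfWeekBaseline(n_sigma=3.0)
monday = datetime(2024, 1, 1)  # 2024-01-01 was a Monday

# Eight weeks of simulated history: nights are quiet, peak is busy.
for week in range(8):
    day = monday + timedelta(weeks=week)
    baseline.observe(day.replace(hour=3), 2.0 + week % 3)    # ~2-4 errors/min at 3am
    baseline.observe(day.replace(hour=13), 45.0 + week % 5)  # ~45-49 errors/min at peak

probe = monday + timedelta(weeks=9)  # another Monday
print(baseline.is_anomalous(probe.replace(hour=3), 50.0))   # True: far above the 3am range
print(baseline.is_anomalous(probe.replace(hour=13), 50.0))  # False: within the peak range
```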

Why most teams haven't adopted it

Anomaly detection isn't new as a concept. The reason most teams still write static alert rules is that building good anomaly detection is hard. You need baseline computation, seasonal pattern recognition, a reasonable statistical model, and enough operational experience to set the sensitivity right.

Datadog and Grafana both offer anomaly detection features. But they're add-on features that you have to configure per metric. You're still deciding which metrics to monitor and what sensitivity to use. It's better than raw thresholds, but it's still manual work per signal.

The approach we took with Epok is different. Anomaly detection runs automatically on every log stream. You don't configure it. You don't select metrics. You don't choose sensitivity levels. Epok watches every service's log volume, error patterns, and log cadence. When something deviates from the learned baseline, it alerts.

Where static rules still make sense

There are cases where a hard threshold is the right tool. Disk space above 90% should always alert, regardless of what's "normal." Payment processing success rate below 99.9% should always alert. These are business SLOs, not anomaly detection problems.

Epok supports threshold rules too, for exactly these cases. But they should be the exception, not the primary detection mechanism. Use thresholds for hard business constraints. Use anomaly detection for everything else.
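If you were wiring this split yourself, it might look like a handful of hard rules checked ahead of the detector. A hypothetical sketch reusing the toy baseline from earlier; the metric names and limits are illustrative, not Epok's configuration:

```python
from datetime import datetime

# Hypothetical hard-constraint rules: business SLOs that fire
# regardless of what the learned baseline considers normal.
HARD_RULES = {
    "disk_used_pct": lambda v: v > 90.0,
    "payment_success_pct": lambda v: v < 99.9,
}

def evaluate(metric: str, value: float, ts: datetime, baseline: "HourOfWeekBaseline"):
    """Check hard SLO rules first, then fall back to anomaly detection."""
    rule = HARD_RULES.get(metric)
    if rule is not None and rule(value):
        return f"SLO breach: {metric}={value}"
    if baseline.is_anomalous(ts, value):
        return f"anomaly: {metric}={value} is unusual for this time window"
    return None  # nothing to alert on
```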

Try it on your logs

Epok's free tier includes volume anomaly detection, new error detection, and silence detection. Point your log shipper at Epok and see what it catches in the first week. Most teams find something within the first 24 hours that their existing monitoring missed.

The best alerting system is the one that catches things you didn't think to look for.

Try Epok free. 150 GB/month, no credit card.

All core detection features included. See what your logs are trying to tell you.

Start Free