How to Catch New Errors in Production Before Users Report Them
Most error monitoring tools count known errors. The dangerous ones are the errors you've never seen before. Here's how automatic error fingerprinting works.
You deploy on Friday afternoon. Tests pass. Staging looks good. You close your laptop. Forty minutes later, a customer emails support: "I can't log in." Your auth service started throwing a new error that didn't exist in any of your monitoring because it never existed before.
This happens because most monitoring tools are built around known failure modes. You write an alert rule for database connection errors. You create a dashboard for HTTP 500 rates. You set up a Slack notification for high latency. All of these catch problems you've already thought about.
The dangerous errors are the ones you never anticipated.
What makes an error "new"
Every error message has a pattern. "Connection timeout after 30s on port 5432" and "Connection timeout after 45s on port 5432" are the same error with different numbers. The pattern is "Connection timeout after <N>s on port <N>."
New error detection works by normalizing these patterns. Strip out numbers, IP addresses, UUIDs, hex strings, and timestamps. What's left is the error's fingerprint. If that fingerprint has never appeared in your logs before, you've got a new error.
This sounds simple, and the core idea is. But doing it well requires some care:
- "error from 10.0.1.55:8080" and "error from 10.0.2.30:9090" need to collapse to the same fingerprint
- "request abc-123-def failed" and "request xyz-789-ghi failed" are the same error with different request IDs
- "disk 87% full" and "disk 92% full" are the same warning with different percentages
- But "connection refused" and "connection reset" are different errors, even though they're similar
Why grep and regex won't cut it
The first instinct is to grep for "ERROR" or "FATAL" in your logs and pipe the results somewhere. This works until you're processing a million log lines per hour across 20 services. You'll get thousands of matches, most of which are the same three errors with different parameters.
You need fingerprinting to group them, and you need a baseline to know which ones are new. Without a baseline, everything looks new on Monday morning because you haven't seen weekend traffic before.
Building this yourself means maintaining a fingerprint database, handling normalization edge cases, managing a rolling baseline window, and setting up the alerting pipeline. It's about two weeks of focused work for one engineer. Then you have to maintain it forever.
How Epok handles this
Epok runs new error detection automatically on every log stream. It normalizes error messages using pattern-aware tokenization (numbers, IPs, hex strings, UUIDs all get replaced with placeholders), computes a SHA-256 fingerprint of the normalized message, and checks it against a 7-day rolling baseline.
When a fingerprint appears that isn't in the baseline, Epok flags it immediately. The alert includes the normalized pattern, how many times it's fired, which services are affected, and when it first appeared relative to your last deploy.
Severity is automatic too. A new error that fires 100+ times in 5 minutes is critical. One that fires 10 times is a warning. A known error that resurfaces after being gone for 24+ hours gets flagged at lower severity, because it might be a regression.
What this looks like in practice
You deploy. Two minutes later, Epok sends a Slack message: "New error detected in api-server: 'FATAL: connection pool exhausted' (47 occurrences in 5 min, severity CRITICAL). First seen 2 min after deploy v2.4.1."
You didn't write a rule for this. You didn't create a dashboard. You didn't know this error was possible. But Epok caught it because it was new, and new errors in production deserve attention.
That's the gap between error counting and error intelligence. Counting tells you how many errors you have. Intelligence tells you which ones you've never seen before.
Try it
New error detection is included in Epok's free tier. Send your logs and it starts working immediately. No configuration, no baseline warmup period for error detection (volume baselines take 7 days, but error fingerprinting works from the first log). 150 GB/month free, no credit card.
Try Epok free. 150 GB/month, no credit card.
All core detection features included. See what your logs are trying to tell you.
Start Free