Silent Failures: The Bug That Won't Page You
The scariest production failures aren't the ones that throw errors. They're the ones where a service dies and the logs just stop. Here's why silence detection matters more than error alerting.
Your worker process crashes at 2am. No error log. No exception. The process just dies. Maybe it was an OOM kill. Maybe a segfault in a native library. Maybe the container runtime pulled the rug out.
Whatever the cause, the result is the same: the logs stop. And because there's no error to trigger an alert, nobody gets paged. The job queue backs up. Emails stop sending. Payments stop processing. Six hours later, someone notices.
This is the most dangerous class of production failure, and almost nobody monitors for it.
Why error-based alerting misses this
Every alerting system you've used probably works the same way: watch for a condition, fire when the condition is true. CPU above 90%. Error rate above 5%. Latency above 500ms. Response code is 500.
All of these require something to happen. They need data to evaluate against. When a service dies silently, there is no data. There's nothing to evaluate. The alert rule sits there, perfectly happy, because zero errors is technically below the threshold.
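To make the failure mode concrete, here is a minimal sketch of a typical condition-based alert rule (a hypothetical `should_alert` function, not any particular vendor's API). A dead service produces no requests and no errors, so the rule never fires:

```python
# Hypothetical condition-based alert rule: fire when the error
# rate exceeds a threshold. This is the shape most alerting
# systems share -- and it's blind to total silence.
def should_alert(error_count: int, total_requests: int,
                 threshold: float = 0.05) -> bool:
    """Fire when the error rate exceeds the threshold."""
    if total_requests == 0:
        # A silently dead service lands here: no data to
        # evaluate, so the rule stays quiet too.
        return False
    return error_count / total_requests > threshold

# A struggling-but-alive service with a 10% error rate pages someone:
assert should_alert(error_count=10, total_requests=100) is True
# A dead service emits zero errors and zero requests -- no page:
assert should_alert(error_count=0, total_requests=0) is False
```

The second assertion is the whole problem: zero activity evaluates to "healthy."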
Some teams work around this with heartbeat checks or synthetic monitors. Ping the service every 30 seconds, alert if it doesn't respond. This catches some cases, but only for services that expose a health endpoint. Background workers, cron jobs, queue consumers, and batch processors often don't have an HTTP endpoint to ping.
Watching for absence
The fix is simple in concept: if a service that normally logs every few seconds stops logging for several minutes, something is wrong.
Your API server processes 200 requests per minute and logs each one. If that drops to zero for 3 minutes straight, either the service is down or something fundamental has changed. Either way, you want to know about it.
The implementation is harder than it sounds. You need to know what "normal" looks like for each service. A batch job that runs once an hour will naturally have 59 minutes of silence between runs. Your API server at 3am on a Sunday will log much less than Monday at noon. You can't just set a static threshold for "too quiet."
How silence detection should work
Good silence detection learns each service's log cadence over time. It builds a baseline per service, per hour of day, per day of week. Monday 9am for your API server has a different expected volume than Sunday 3am.
Then it watches the live log stream. If a service's log volume drops to zero and the baseline says it should be producing logs, that's a silence alert. If the baseline for this time slot is already near zero (like a batch job between runs), no alert.
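The idea can be sketched in a few dozen lines. This is a simplified illustration of the general technique, not Epok's implementation: baselines are keyed per service by hour-of-week (0 through 167), so Monday 9am and Sunday 3am each get their own expected volume, and silence only alerts when the baseline for that slot expects logs:

```python
from collections import defaultdict

class SilenceDetector:
    """Toy baseline-driven silence detector (illustrative only)."""

    def __init__(self, min_expected: float = 1.0):
        # service -> hour_of_week (0..167) -> observed per-minute log counts
        self.history = defaultdict(lambda: defaultdict(list))
        # Below this expected volume, silence is considered normal.
        self.min_expected = min_expected

    def record(self, service: str, hour_of_week: int, count: int) -> None:
        """Feed an observed log count into the baseline."""
        self.history[service][hour_of_week].append(count)

    def expected(self, service: str, hour_of_week: int) -> float:
        """Mean observed volume for this service at this time slot."""
        counts = self.history[service][hour_of_week]
        return sum(counts) / len(counts) if counts else 0.0

    def is_silent(self, service: str, hour_of_week: int, current: int) -> bool:
        """Alert only when logs stopped AND the baseline expects them."""
        return current == 0 and self.expected(service, hour_of_week) >= self.min_expected
```

With a few observations recorded, the detector alerts on a quiet API server but stays calm about a batch job that is always quiet in that slot:

```python
d = SilenceDetector()
for _ in range(4):
    d.record("api", 9, 200)   # API normally logs ~200/min at this slot
    d.record("batch", 9, 0)   # batch job is normally idle here

assert d.is_silent("api", 9, current=0) is True     # dead silence: alert
assert d.is_silent("batch", 9, current=0) is False  # expected silence: no alert
assert d.is_silent("api", 9, current=180) is False  # still logging: no alert
```

A production system would use a robust statistic rather than a plain mean and would blend short-term cadence with the weekly baseline, but the gating logic is the same: silence is only an anomaly relative to what that service normally does at that time.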
This is what Epok's silence detector does. It activates within about an hour of seeing a new log stream, using short-term cadence analysis. Full weekly baselines, covering hourly and daily patterns, are built over the first 7 days.
Real examples
A background worker that processes webhook events from Stripe crashes after an OOM kill. No error log because the kernel killed it. Epok notices the log stream went silent and alerts within 5 minutes.
A cron job that runs every 15 minutes stops running because someone accidentally deleted the cron entry during a deploy. No errors anywhere. Epok flags it when the expected log output doesn't appear at the next scheduled time.
A database replica falls behind and stops accepting queries. The app fails over to the primary, which works fine, so there are no application errors. But the replica's log stream goes quiet. Silence detection catches it before the primary gets overloaded.
Start monitoring for silence
Silence detection is included in Epok's free tier. Point your log shipper at Epok, and within an hour it starts learning your service cadences. When something goes quiet that shouldn't be quiet, you'll get a Slack message or a PagerDuty page.
Because the scariest production bug isn't the one that fills your logs with errors. It's the one that leaves them empty.
Try Epok free. 150 GB/month, no credit card.
All core detection features included. See what your logs are trying to tell you.
Related
How to Catch New Errors in Production Before Users Report Them
Most error monitoring tools count known errors. The dangerous ones are the errors you've never seen before. Here's how automatic error fingerprinting works.
You Don't Need Dashboards to Monitor Your Logs
The logging industry sold us on dashboards. Build panels. Write queries. Tune thresholds. But what if the tool just told you when something was wrong?