
20 Kubernetes Failures You Should Be Alerting On

CrashLoopBackOff is just the beginning. Here are 20 Kubernetes failure modes that show up in logs and deserve automated alerts.

kubernetes · k8s · alerting · devops

If you're running Kubernetes, you probably have alerts for CrashLoopBackOff and maybe OOMKilled. That covers about 30% of the failure modes that regularly break production workloads.

Here are the 20 failure patterns that show up in Kubernetes logs and events. Each one has bitten a production cluster somewhere. Most of them don't get monitored until after the first incident.

Pod lifecycle failures

1. CrashLoopBackOff

The container starts, crashes, restarts, crashes again. Kubernetes backs off exponentially between restarts. The pod is technically "running" but never healthy. This is the most common K8s failure and the one most teams already monitor.
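
If you want to roll your own check, a minimal sketch with the official Kubernetes Python client looks like the following (the reason set is my own assumption; extend it to taste). The same container-status field also catches the waiting-state failures in items 3 through 5:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

# Waiting-state reasons worth alerting on. CrashLoopBackOff is the most
# common; items 3-5 below surface through this same field.
BAD_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull",
               "CreateContainerConfigError", "RunContainerError"}

for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        if waiting and waiting.reason in BAD_REASONS:
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"{waiting.reason} (restarts so far: {cs.restart_count})")
```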

2. OOMKilled

The container exceeded its memory limit and the kernel killed it. The tricky part is that OOMKilled doesn't always produce an application error log. The process just vanishes mid-execution. You need to watch for this in pod events, not application logs.
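
Because the evidence lives in pod status rather than application logs, check the container's last terminated state (same Python client, same caveats as the sketch above):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# OOM kills land in the container's last terminated state, not in app logs.
for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        last = cs.last_state.terminated if cs.last_state else None
        if last and last.reason == "OOMKilled":
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container {cs.name} was OOMKilled at {last.finished_at}")
```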

3. ImagePullBackOff

The node can't pull the container image. Usually a registry auth issue, a deleted image tag, or a typo in the image name. Shows up during deploys and can silently block rollouts if your deployment strategy allows partial failures.

4. CreateContainerConfigError

A referenced ConfigMap or Secret doesn't exist. The pod can't start because its configuration is missing. This often happens when someone deploys a new version that references a config that hasn't been created yet.

5. RunContainerError

The container runtime failed to start the container. Often caused by invalid commands, missing entrypoints, or filesystem permission issues. The pod stays in a non-ready state indefinitely.

Resource and scheduling failures

6. Insufficient CPU / Insufficient Memory

The scheduler can't find a node with enough resources. Pods stay in Pending state. This sneaks up on you as your cluster grows and resource requests accumulate.
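
A periodic poll for long-Pending pods catches this before users do. A sketch, with an arbitrary five-minute threshold:

```python
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

THRESHOLD = timedelta(minutes=5)  # tolerate brief scheduling delays
now = datetime.now(timezone.utc)

pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    age = now - pod.metadata.creation_timestamp
    if age > THRESHOLD:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} pending for {age}")
```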

7. FailedScheduling

Broader than just resource constraints. Includes node affinity mismatches, taints without tolerations, and topology spread violations. The error message usually tells you exactly what's wrong, but nobody sees it unless they're watching events.
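
A small watcher fixes that. This sketch streams Warning events; the same stream carries most of the event-based failures later in this post (evictions, mount failures, probe failures):

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()
w = watch.Watch()

# FailedScheduling messages name the exact constraint that blocked the
# pod: resources, taints, affinity, or topology spread.
for event in w.stream(v1.list_event_for_all_namespaces,
                      field_selector="type=Warning"):
    ev = event["object"]
    if ev.reason == "FailedScheduling":
        print(f"{ev.involved_object.namespace}/{ev.involved_object.name}: "
              f"{ev.message}")
```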

8. Evicted pods

Kubernetes evicts pods when a node runs low on disk, memory, or PID space. Evicted pods aren't rescheduled on their own: a bare pod stays dead, while a pod managed by a ReplicaSet (and therefore a Deployment) gets a replacement created, though not always immediately. A burst of evictions usually means a node is in trouble.
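
Evicted pods linger in the Failed phase, which makes them easy to list. A sketch:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A burst of evictions on a single node points at node-level pressure.
failed = v1.list_pod_for_all_namespaces(field_selector="status.phase=Failed")
for pod in failed.items:
    if pod.status.reason == "Evicted":
        print(f"{pod.metadata.namespace}/{pod.metadata.name} "
              f"on {pod.spec.node_name}: {pod.status.message}")
```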

9. Node NotReady

A node stops responding to the control plane. All pods on that node become suspect. If the node doesn't recover within the configured timeout (about five minutes by default), pods get rescheduled elsewhere. But during the grace period, workloads on that node may be silently failing.
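
Node health is one list call away. A sketch that flags any node whose Ready condition isn't True:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        # Ready should be "True"; "False" or "Unknown" both mean trouble.
        if cond.type == "Ready" and cond.status != "True":
            print(f"{node.metadata.name} not ready since "
                  f"{cond.last_transition_time}: {cond.message}")
```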

Storage failures

10. FailedMount / FailedAttachVolume

Persistent volumes can't be mounted. Common with cloud provider volumes (EBS, GCE PD) when a volume is still attached to a terminated node. The pod hangs in ContainerCreating state.

11. ProvisioningFailed

Dynamic volume provisioning failed. Usually a quota issue, permission issue, or storage class misconfiguration. The PVC stays in Pending state and any pod depending on it can't start.
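
Both storage failures are easy to poll for. The mount and provisioning details arrive as Warning events with exactly these reasons (the watcher from item 7 catches them); the stuck PVCs themselves look like this sketch:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A PVC stuck in Pending usually means provisioning failed; the detail
# is in FailedMount / FailedAttachVolume / ProvisioningFailed events.
for pvc in v1.list_persistent_volume_claim_for_all_namespaces().items:
    if pvc.status.phase == "Pending":
        print(f"{pvc.metadata.namespace}/{pvc.metadata.name} pending "
              f"(storage class: {pvc.spec.storage_class_name})")
```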

Networking failures

12. DNS resolution failures

CoreDNS is overloaded, misconfigured, or down. Pods can't resolve service names. This causes cascading failures across the cluster because every service-to-service call fails. Watch for "NXDOMAIN" responses and "connection refused" errors to port 53 in your logs.
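
A cheap canary is to resolve a name that exists in every cluster. This sketch assumes the default cluster.local domain and has to run from inside a pod:

```python
import socket

# kubernetes.default.svc resolves in every cluster, so a failure here
# means cluster DNS itself is broken, not one particular service.
try:
    socket.getaddrinfo("kubernetes.default.svc.cluster.local", 443)
    print("cluster DNS ok")
except socket.gaierror as exc:
    print(f"cluster DNS failure: {exc}")  # this is the thing to alert on
```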

13. Connection refused to ClusterIP services

The service exists but no endpoints are ready. Usually means all pods behind a service have failed their readiness probes. The service looks healthy from the outside but rejects every connection.
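
You can catch this state directly by looking for endpoints with zero ready addresses. A sketch:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A service whose endpoints have addresses, but none of them ready,
# resolves fine in DNS and then refuses every connection.
for ep in v1.list_endpoints_for_all_namespaces().items:
    subsets = ep.subsets or []
    ready = sum(len(s.addresses or []) for s in subsets)
    not_ready = sum(len(s.not_ready_addresses or []) for s in subsets)
    if ready == 0 and not_ready > 0:
        print(f"{ep.metadata.namespace}/{ep.metadata.name}: "
              f"0 ready, {not_ready} not ready")
```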

14. NetworkPolicy blocking traffic

Someone applied a restrictive NetworkPolicy that blocks legitimate traffic. Hard to debug because the connection just times out with no error message. Look for sudden timeouts after a policy change.
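
A quick way to tell a policy drop from a dead backend is how the connection fails. A sketch (the service name is hypothetical):

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Timeouts suggest a NetworkPolicy or firewall drop; an immediate
    refusal means the host is reachable but nothing is listening."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return "ok"
    except socket.timeout:
        return "timeout: suspect a NetworkPolicy or firewall drop"
    except ConnectionRefusedError:
        return "refused: host reachable, nothing listening"

print(probe("my-service.default.svc.cluster.local", 8080))
```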

Deployment and rollout failures

15. ProgressDeadlineExceeded

A deployment has been rolling out for longer than its progress deadline. The new pods aren't becoming ready. This means your deploy is stuck, and depending on your strategy, you might be running with reduced capacity.
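
The deployment reports this itself through its Progressing condition. A sketch:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    for cond in dep.status.conditions or []:
        # Progressing flips to False with this reason once the
        # progressDeadlineSeconds window expires.
        if (cond.type == "Progressing" and cond.status == "False"
                and cond.reason == "ProgressDeadlineExceeded"):
            print(f"{dep.metadata.namespace}/{dep.metadata.name} "
                  f"stuck: {cond.message}")
```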

16. FailedRollingUpdate

The rolling update can't proceed because new pods keep failing. If maxUnavailable is 0, the old pods stay running but you're stuck on the old version. If maxUnavailable is higher, you could be losing capacity.

17. Readiness probe failures

Pods are running but not passing readiness probes. They're removed from service endpoints so they don't receive traffic. A few failed probes are normal during startup; sustained failures mean the pod is unhealthy.
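
The sustained case is visible as a pod that's Running but whose Ready condition stays False. A sketch:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Running but not Ready for a sustained period means failing readiness
# probes: the pod is silently out of the service rotation.
running = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running")
for pod in running.items:
    for cond in pod.status.conditions or []:
        if cond.type == "Ready" and cond.status == "False":
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"not ready since {cond.last_transition_time}")
```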

18. Liveness probe failures

The container is running but not responding to liveness checks. Kubernetes will restart it. Frequent liveness restarts indicate a resource leak, deadlock, or configuration problem.
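
Restart counts are the cheapest proxy for liveness churn. A sketch with an arbitrary threshold:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

RESTART_THRESHOLD = 5  # arbitrary; tune to your tolerance for churn

for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count >= RESTART_THRESHOLD:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"container {cs.name}: {cs.restart_count} restarts")
```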

Other critical patterns

19. RBAC permission denied

Service accounts don't have the right permissions. Shows up as "forbidden" errors in pod logs when they try to talk to the Kubernetes API. Often caused by deploying to a new namespace without copying RBAC rules.
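
You can also test permissions proactively instead of waiting for the forbidden errors. This sketch is the programmatic version of `kubectl auth can-i` (the namespace is hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()
auth = client.AuthorizationV1Api()

# Ask the API server whether the current identity may list pods.
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            verb="list", resource="pods", namespace="production")))
result = auth.create_self_subject_access_review(review)
print("allowed" if result.status.allowed
      else f"forbidden: {result.status.reason}")
```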

20. HPA unable to scale

The Horizontal Pod Autoscaler wants to add pods but can't. Either the metrics server isn't providing data, the cluster is out of resources, or the HPA has hit its maxReplicas limit during a traffic spike.
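
The HPA publishes why it can't scale in its status conditions. A sketch (assumes a kubernetes client recent enough to ship AutoscalingV2Api):

```python
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV2Api()

hpas = autoscaling.list_horizontal_pod_autoscaler_for_all_namespaces()
for hpa in hpas.items:
    for cond in hpa.status.conditions or []:
        # AbleToScale=False means scaling is blocked; ScalingLimited=True
        # usually means the HPA is pinned at maxReplicas.
        if ((cond.type == "AbleToScale" and cond.status == "False")
                or (cond.type == "ScalingLimited" and cond.status == "True")):
            print(f"{hpa.metadata.namespace}/{hpa.metadata.name}: "
                  f"{cond.reason}: {cond.message}")
```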

How to monitor all 20

You could write 20 alert rules in Prometheus or your monitoring tool of choice. Each one needs a query, a threshold, and a notification target. Then you maintain them as your cluster evolves.
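
If you go the DIY route, the glue ends up looking something like this sketch. The webhook URL is a placeholder and the check function is a stub standing in for the snippets above:

```python
import json
import time
import urllib.request

WEBHOOK_URL = "https://alerts.example.com/hook"  # placeholder notification target

def notify(message: str) -> None:
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def crashloop_pods() -> list[str]:
    return []  # stub: wire in the CrashLoopBackOff sketch from item 1

CHECKS = [crashloop_pods]  # ...and 19 more, each with its own threshold

while True:
    for check in CHECKS:
        for finding in check():
            notify(finding)
    time.sleep(60)  # then add dedup, flap suppression, and on-call routing
```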

Or you can send your Kubernetes logs and events to Epok. It has built-in detection rules for all 20 of these failure patterns. No queries to write, no thresholds to set. When a pod starts CrashLooping or a node goes NotReady, you get a notification with the relevant details: which pod, which node, what the error message says, and how long it's been happening.

Kubernetes intelligence is included in Epok's Pro tier ($49/month). Point your FluentBit or Vector at Epok, make sure you're shipping kubelet logs and Kubernetes events, and the detection starts automatically.

Try Epok free. 150 GB/month, no credit card.

All core detection features included. See what your logs are trying to tell you.

Start Free