Troubleshooting Kubernetes
This guide will help you debug common Kubernetes issues.
Here's a list of commands to detect these issues:
kubectl get events --all-namespaces
kubectl get pods --all-namespaces
kubectl get deploy --all-namespaces
kubectl get svc --all-namespaces
However, we recommend using a Kubernetes UI like Headlamp or Freelens to detect these issues. We also recommend using Prometheus to monitor your cluster, and Alertmanager to alert you when something goes wrong.
CrashLoopBackOff
A container is crashing repeatedly.
The pod is scheduled and the container has started, but it keeps crashing and is stuck in a crash loop.
The issue is either a configuration issue or an issue in the container.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Configuration issue causing the container to crash
Issue in the application causing the container to crash
Bad entrypoint causing the container to exit
Investigation
kubectl describe pod <pod-name> -n <namespace>
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
Fetching the exit code of the container (see the example command after the list below) can help you find the reason why the container is crashing. The exit code often corresponds to a Linux signal (128 + the signal number). Common exit codes are:
0: Success
1: Failure
137: SIGKILL, perhaps killed by the OOM killer or manually
143: SIGTERM, perhaps killed by the Kubernetes pod manager (failing healthchecks for example)
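For example, one way to read the last exit code of each container directly (a sketch; the jsonpath follows the standard pod status fields):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.exitCode}{"\n"}{end}'
# The same information appears under "Last State: Terminated" in kubectl describe pod.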
Fetching the logs can help you determine if the container is crashing because of a configuration issue, an issue in the application, or a bad entrypoint.
At this point, we recommend searching the web for the cause of the issue, based on the logs and the exit code.
Mitigations
Check the configuration of the container and the application. Report crashes to Toucan Toco with logs.
ImagePullBackOff
The container image cannot be pulled, and the pod is stuck in ImagePullBackOff.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Bad image name or tag
Missing image pull secret
Registry failure
Investigation
kubectl describe pod <pod-name> -n <namespace>
Check if the pod has the correct image name and tag.
Check if the registry requires authentication, and if so, check if the image pull secret is properly configured (see the example below).
Check if the registry is reachable.
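To check the image pull secret, a sketch (assuming the secret is of type kubernetes.io/dockerconfigjson):
kubectl get secret <image-pull-secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# You are looking for an "auths" entry whose key matches your registry hostname.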
Mitigations
Correct the image name and tag, or add the missing image pull secret in the Helm Charts:
global:
  imagePullSecrets:
    - <name of the image pull secret>
Pod is unschedulable
A pod cannot be scheduled by the Kubernetes scheduler.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Causes are linked to constraints: node selectors, affinities, anti-affinities, resource requests, etc.
Investigation
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name-hosting-the-pod>
Common events and what to check:
0/X nodes are available: X insufficient memory/cpu: check the resources requests.
pod has unbound immediate PersistentVolumeClaims: check the PVCs in the Helm charts.
0/X nodes are available: X node(s) had taint {key=value: NoSchedule}, that the pod didn't tolerate: if you didn't set any tolerations, check the node taints (maybe it is caused by a node failure). If there is a taint, consider adding a toleration or removing the taint. Taints are used to allow only certain pods to be scheduled on certain nodes (see the example after this list).
0/X nodes are available: X node(s) didn't match node selector/affinity: check the node selectors and affinities in the Helm charts.
0/X nodes are available: X node(s) didn't satisfy topological constraints: check the topology spread constraints in the Helm charts.
Error: configmap/secret "<name>" not found: check the references of the configmap/secret in the Helm charts.
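If you need to inspect or tolerate taints, here is a sketch: the first command lists the taints of every node, and the toleration below is a hypothetical example (key, value, and effect are placeholders, not values from your cluster):
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Hypothetical toleration for a taint key=value:NoSchedule, to add in the Helm charts values:
tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"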
For PVC issues:
kubectl describe pvc <pvc-name> -n <namespace>
If the PVC has an issue, check the logs of the storage provisioner.
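For example, to see the events attached to the PVC and the provisioner logs (the provisioner namespace and label below are placeholders; they depend on your storage solution):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
kubectl logs -n <storage-provisioner-namespace> -l app=<storage-provisioner-label> --tail=100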
Mitigations
Check and fix the resources requests, node selectors, tolerations, topology spread constraints, PVCs, and node taints in the Helm charts.
Check the references of the configmap/secret in the Helm charts.
Check the logs of the storage provisioner. Contact the cloud provider if using a cloud storage. If you are self-hosting your own storage, check your storage configuration.
Network connectivity issues
A service is not reachable.
We assume the service is meant to be reachable through an Ingress Controller.
Overview
container <--1-- pod <--2-- service <--3-- ingress controller <--4-- service lb <--5-- router/lb <--6-- external network
The numbers label each link and are referenced in the investigation commands below.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
kubectl get svc --all-namespaces
kubectl get endpoints <service-name> -n <service-namespace>
kubectl logs -f <ingress-controller-pod> -c <ingress-controller-container> -n <ingress-controller-namespace>
curl -fsSL https://<your-service-url>
You would see either:
Failing health check probes.
Alerts on invalid routes in the logs of the ingress controller.
Services without endpoints.
A missing external IP on the ingress controller Service object.
curl failing, with the possible errors: cannot resolve the domain name, connection refused, hanging/waiting for a response.
Possible causes
Kubernetes resource configuration issues, causing a "missing" or "failing" link between the components.
Kubernetes CNI issues causing Pod-to-Pod and Service connectivity issues.
Firewall issues, causing a closed port or unable to send a response.
Routing issues, causing external network access issues from the containers.
Application configuration issues.
Investigations & mitigations
Container -> External network (rare)
Investigation
First, detect if the container can reach the external network to eliminate the hypothesis of a firewall/routing issue.
# From an existing container
kubectl exec -it <container-name> -c <container-name> -n <namespace> -- curl -fsSL https://google.com
# If the tool is not available
kubectl run -it --rm --image alpine/curl --restart=Never curl -- curl -fsSL https://google.com
# You are looking to see if you receive a response. It should be fast.
If this fails, check further by running:
kubectl debug -it --image=alpine node/<node-name> -- ip a # 4
# You are looking for interfaces with "UP" "state UP" and with an "inet" address.
# 3: kapsule0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP qlen 1000
# link/ether 02:00:00:1a:d7:bc brd ff:ff:ff:ff:ff:ff
# inet 172.16.20.43/22 brd 172.16.23.255 scope global dynamic kapsule0
# valid_lft 85077sec preferred_lft 85077sec
# inet6 fd6e:74b1:2e54:cb50:5f4e:9705:dbd3:ff1f/128 scope global dynamic noprefixroute
# valid_lft 45423sec preferred_lft 31023sec
# inet6 fe80::ff:fe1a:d7bc/64 scope link
# valid_lft forever preferred_lft forever
kubectl debug -it --image=alpine node/<node-name> -- ip route # 5
# You are looking for "default via ..."
# default via 172.16.20.1 dev kapsule0 src 172.16.20.43 metric 50
kubectl debug -it --image=alpine node/<node-name> -- ip neigh # 6
# You are looking for "<gateway ip> dev ..."
# 172.16.20.1 dev kapsule0 lladdr 02:00:00:1c:3a:11 ref 1 used 0/0/0 probes 4 REACHABLE
Check if the IP is allowed to send outbound traffic by the firewall.
Check if the route has the correct gateway.
Check if the node detects the gateway as neighbor.
Mitigation
Check the outbound rules of the firewall at every "routing" step (the machine, the router, the modem, the proxies...).
If you are using iptables or nftables, check the PREROUTING and OUTPUT chains.
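For example, on a node you control (assuming root access; table and chain contents depend on your setup):
sudo iptables -t nat -L PREROUTING -n -v
sudo iptables -L OUTPUT -n -v
sudo nft list ruleset | less
# You are looking for DROP or REJECT rules matching the pod or node CIDR.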
Check if the routing is correctly configured by checking the DHCP configuration on the router (on the cloud, you are looking for IPAM configuration).
Container <-> Pod (rare)
Investigation
Normally, the Helm Charts should be configured with the correct network configuration. Report this issue to Toucan Toco if you encounter it.
Check the application configuration and the container configuration.
kubectl describe pod <pod-name> -n <namespace>
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
Mitigation
Check the application configuration and the container configuration.
Pod <-> Service (rare)
Investigation
kubectl describe pod <pod-name> -n <namespace> # 1
kubectl describe svc <service-name> -n <namespace> # 2
kubectl get endpoints <service-name> -n <namespace> # 3
kubectl port-forward pods/<pod-name> <host-port>:<pod-port> -n <namespace> # 4, then access http://localhost:<host-port>
Check:
If the pod has an IP.
If the service has endpoints.
If the ports are configured correctly:
# Service configuration
spec:
  ports:
    - name: http
      protocol: TCP # Should match the container's protocol
      port: 80 # Internal service port
      targetPort: 8080 # Should match a container port or container port name
If the port-forward is working: if you receive a response (even if the response is an error), there is probably no issue.
No pod IP means a critical failure in the Container Network Interface (CNI). Check if your Kubernetes installation has a working CNI.
No endpoints means either:
There is no pod running.
The label selector of the service does not match any pod (see the check after this list).
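As a sketch, compare the service selector with the pod labels:
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# Then list the pods matching that selector, e.g. if the selector printed above is {"app":"my-app"} (hypothetical label):
kubectl get pods -n <namespace> -l app=my-app --show-labels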
Mitigations
If there is no working CNI, fix the CNI or install a working one.
If there are no endpoints, fix the service label selector or make sure there is a pod running with the proper labels.
Service <-> Ingress Controller
Investigation
kubectl describe svc <service-name> -n <namespace> # 1
kubectl describe ingress <ingress-name> -n <namespace> # 2
kubectl logs -f <ingress-controller-pod> -c <ingress-controller-container> -n <ingress-controller-namespace> # 2
kubectl port-forward svc/<service-name> <host-port>:<service-port> -n <namespace> # 3, then access http://localhost:<host-port>
Check:
If the service has an IP.
If the ingress targets the service and if the ingress controller logs have errors.
If the port-forward is working: if you receive a response (even if the response is an error), there is probably no issue.
No service IP means a critical failure in the Container Network Interface (CNI). Check if your Kubernetes installation has a working CNI.
A wrong service in the ingress rule will cause errors in the ingress controller logs.
Mitigations
Make sure the service has an IP and is reachable through port-forwarding.
Check and fix the Ingress object.
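As a minimal sketch of what a correct Ingress object could look like (host, names, class, and ports are placeholders, not values from the Helm charts):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: <ingress-name>
  namespace: <namespace>
spec:
  ingressClassName: <ingress-class> # Must match the class served by your ingress controller
  rules:
    - host: <your-service-url>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: <service-name> # Must match the Service name
                port:
                  number: 80 # Must match a Service port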
Service LB <--> Load Balancer
Investigations
kubectl describe svc <service-name> -n <namespace>
Check if the service has an external IP. If it doesn't, your cloud controller manager is probably not configured correctly (see the command below).
If you are using MetalLB/ServiceLB, check the MetalLB/ServiceLB configuration.
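For example, to print only the external address assigned by the load balancer (the field can be ip or hostname depending on the provider):
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.status.loadBalancer.ingress}'
# An empty result means no external address has been assigned yet.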
Mitigations
On cloud, check if the cloud controller manager is configured correctly.
On MetalLB/ServiceLB, check the MetalLB/ServiceLB configuration.
Ingress <-- External Network
Investigations
curl -fsSL https://<your-service-url>
Check if:
You have a response (even if it's an error). No response means a firewall issue.
If the response is 404 Not Found, the Ingress object is not configured correctly.
If there is a TLS issue, the Ingress object is not configured correctly (see the verbose curl example after this list).
If the response is 502 Bad Gateway, the Ingress object is not configured correctly, or there is no pod running, or the service is not reachable.
If the response is 504 Gateway Timeout, the pod is able to take the request but takes too much time to respond.
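For example, a verbose curl shows the DNS resolution, the TLS handshake, the certificate, and the HTTP status in one shot:
curl -vkI https://<your-service-url>
# -v: verbose (handshake and headers), -k: ignore certificate errors, -I: HEAD request only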
Mitigations
Look for inbound rules and outbound rules in your firewall.
Check and fix the Ingress object.
Check if the pods and services are reachable.
Node failure
A node is not reachable.
Detection
kubectl get nodes
kubectl get events --all-namespaces
Investigation
Sadly, we cannot give you a proper guide for node failures, as there could be many reasons, and this goes beyond the scope of this guide.
Here's a non-exhaustive list of potential causes:
The worker node is not able to reach the control plane nodes.
Kubelet is failing. Check with journalctl -u kubelet.service. (Note: different Kubernetes distributions use different service names.)
The kernel is panicking. Check with dmesg.
Not enough memory on the node. Check with free -m.
Not enough disk space on the node. Check with df -h.
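Before logging on to the node itself, a sketch to read the conditions reported by the kubelet:
kubectl describe node <node-name>
# Or only the conditions:
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
# You are looking for Ready=True; MemoryPressure, DiskPressure, and PIDPressure should be False.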
OOMKilled
Check the tuning guide:
Tuning resources