Troubleshooting Kubernetes
This guide will help you debug common Kubernetes issues.
Here's a list of commands to detect these issues:
kubectl get events --all-namespaces
kubectl get pods --all-namespaces
kubectl get deploy --all-namespaces
kubectl get svc --all-namespaces
However, we recommend using a Kubernetes UI like Headlamp or Freelens to detect these issues. We also recommend using Prometheus to monitor your cluster, and Alertmanager to alert you when something goes wrong.
CrashLoopBackOff
A container is crashing repeatedly.
The pod is scheduled and the container has started, but it keeps crashing and is stuck in a crash loop.
The issue is either a configuration issue or an issue in the container.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Configuration issue causing the container to crash
Issue in the application causing the container to crash
Bad entrypoint causing the container to exit
Investigation
kubectl describe pod <pod-name> -n <namespace>
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
Fetching the exit code of the container (see the example command after the list below) can help you find the reason why the container is crashing. The exit code often corresponds to a Linux signal (128 + the signal number). Common exit codes are:
0: Success
1: Failure
137: SIGKILL, perhaps killed by the OOM killer or manually
143: SIGTERM, perhaps killed by the Kubernetes pod manager (failing healthchecks for example)
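For example, one way to read the last exit code of each container directly (a sketch; the jsonpath follows the standard pod status fields):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.exitCode}{"\n"}{end}'
# The same information appears under "Last State: Terminated" in kubectl describe pod.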
Fetching the logs can help you determine if the container is crashing because of a configuration issue, an issue in the application, or a bad entrypoint.
At this point, we recommend searching the web for the cause of the issue, based on the logs and the exit code.
Mitigations
Check the configuration of the container and the application. Report crashes to Toucan Toco with logs.
ImagePullBackOff
The container image cannot be pulled, and the pod is stuck in ImagePullBackOff.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Bad image name or tag
Missing image pull secret
Registry failure
Investigation
kubectl describe pod <pod-name> -n <namespace>
Check if the pod has the correct image name and tag.
Check if the registry requires authentication, and if so, check if the image pull secret is properly configured (see the example below).
Check if the registry is reachable.
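To check the image pull secret, a sketch (assuming the secret is of type kubernetes.io/dockerconfigjson):
kubectl get secret <image-pull-secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# You are looking for an "auths" entry whose key matches your registry hostname.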
Mitigations
Correct the image name and tag, or add the missing image pull secret in the Helm Charts:
global:
  imagePullSecrets:
    - <name of the image pull secret>
Pod is unschedulable
A pod cannot be scheduled by the Kubernetes scheduler.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
Possible causes
Causes are linked to constraints: node selectors, affinities, anti-affinities, resource requests, etc.
Investigation
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name-hosting-the-pod>
Common events and what to check:
0/X nodes are available: X insufficient memory/cpu: check the resources requests.
pod has unbound immediate PersistentVolumeClaims: check the PVCs in the Helm charts.
0/X nodes are available: X node(s) had taint {key=value: NoSchedule}, that the pod didn't tolerate: if you didn't set any tolerations, check the node taints (maybe it is caused by a node failure). If there is a taint, consider adding a toleration or removing the taint. Taints are used to allow only certain pods to be scheduled on certain nodes (see the example after this list).
0/X nodes are available: X node(s) didn't match node selector/affinity: check the node selectors and affinities in the Helm charts.
0/X nodes are available: X node(s) didn't satisfy topological constraints: check the topology spread constraints in the Helm charts.
Error: configmap/secret "<name>" not found: check the references of the configmap/secret in the Helm charts.
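If you need to inspect or tolerate taints, here is a sketch: the first command lists the taints of every node, and the toleration below is a hypothetical example (key, value, and effect are placeholders, not values from your cluster):
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Hypothetical toleration for a taint key=value:NoSchedule, to add in the Helm charts values:
tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"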
For PVC issues:
kubectl describe pvc <pvc-name> -n <namespace>
If the PVC has an issue, check the logs of the storage provisioner.
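For example, to see the events attached to the PVC and the provisioner logs (the provisioner namespace and label below are placeholders; they depend on your storage solution):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pvc-name>
kubectl logs -n <storage-provisioner-namespace> -l app=<storage-provisioner-label> --tail=100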
Mitigations
Check and fix the resources requests, node selectors, tolerations, topology spread constraints, PVCs, and node taints in the Helm charts.
Check the references of the configmap/secret in the Helm charts.
Check the logs of the storage provisioner. Contact the cloud provider if using a cloud storage. If you are self-hosting your own storage, check your storage configuration.
Network connectivity issues
A service is not reachable.
We assume the service is meant to be reachable through an Ingress Controller.
Overview
container <--1-- pod <--2-- service <--3-- ingress controller <--4-- service lb <--5-- router/lb <--6-- external network
The numbers label each link and are referenced in the investigation commands below.
Detection
kubectl get pods --all-namespaces
kubectl get events --all-namespaces
kubectl get svc --all-namespaces
kubectl get endpoints <service-name> -n <service-namespace>
kubectl logs -f <ingress-controller-pod> -c <ingress-controller-container> -n <ingress-controller-namespace>
curl -fsSL https://<your-service-url>
You would see either:
Failing health check probes.
Alerts on invalid routes in the logs of the ingress controller.
Services without endpoints.
A missing external IP on the ingress controller Service object.
curl failing, with the possible errors: cannot resolve the domain name, connection refused, hanging/waiting for a response.
Possible causes
Kubernetes resource configuration issues, causing a "missing" or "failing" link between the components.
Kubernetes CNI issues causing Pod-to-Pod and Service connectivity issues.
Firewall issues, causing a closed port or unable to send a response.
Routing issues, causing external network access issues from the containers.
Application configuration issues.
Investigations & mitigations
Container -> External network (rare)
Investigation
First, detect if the container can reach the external network to eliminate the hypothesis of a firewall/routing issue.
# From an existing container
kubectl exec -it <container-name> -c <container-name> -n <namespace> -- curl -fsSL https://google.com
# If the tool is not available
kubectl run -it --rm --image alpine/curl --restart=Never curl -- curl -fsSL https://google.com
# You are looking to see if you receive a response. It should be fast.
If this fails, check further by running:
kubectl debug -it --image=alpine node/<node-name> -- ip a # 4
# You are looking for interfaces with "UP" "state UP" and with an "inet" address.
# 3: kapsule0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP qlen 1000
# link/ether 02:00:00:1a:d7:bc brd ff:ff:ff:ff:ff:ff
# inet 172.16.20.43/22 brd 172.16.23.255 scope global dynamic kapsule0
# valid_lft 85077sec preferred_lft 85077sec
# inet6 fd6e:74b1:2e54:cb50:5f4e:9705:dbd3:ff1f/128 scope global dynamic noprefixroute
# valid_lft 45423sec preferred_lft 31023sec
# inet6 fe80::ff:fe1a:d7bc/64 scope link
# valid_lft forever preferred_lft forever
kubectl debug -it --image=alpine node/<node-name> -- ip route # 5
# You are looking for "default via ..."
# default via 172.16.20.1 dev kapsule0 src 172.16.20.43 metric 50
kubectl debug -it --image=alpine node/<node-name> -- ip neigh # 6
# You are looking for "<gateway ip> dev ..."
# 172.16.20.1 dev kapsule0 lladdr 02:00:00:1c:3a:11 ref 1 used 0/0/0 probes 4 REACHABLE
Check if the IP is allowed to send outbound traffic by the firewall.
Check if the route has the correct gateway.
Check if the node detects the gateway as neighbor.
Mitigation
Check the outbound rules of the firewall at every "routing" step (the machine, the router, the modem, the proxies...).
If you are using iptables or nftables, check the PREROUTING and OUTPUT chains.
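For example, on a node you control (assuming root access; table and chain contents depend on your setup):
sudo iptables -t nat -L PREROUTING -n -v
sudo iptables -L OUTPUT -n -v
sudo nft list ruleset | less
# You are looking for DROP or REJECT rules matching the pod or node CIDR.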
Check if the routing is correctly configured by checking the DHCP configuration on the router (on the cloud, you are looking for IPAM configuration).
Container <-> Pod (rare)
Investigation
Normally, the Helm Charts should be configured with the correct network configuration. Report this issue to Toucan Toco if you encounter it.
Check the application configuration and the container configuration.
kubectl describe pod <pod-name> -n <namespace>
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
Mitigation
Check the application configuration and the container configuration.
Pod <-> Service (rare)
Investigation
kubectl describe pod <pod-name> -n <namespace> # 1
kubectl describe svc <service-name> -n <namespace> # 2
kubectl get endpoints <service-name> -n <namespace> # 3
kubectl port-forward pods/<pod-name> <host-port>:<pod-port> -n <namespace> # 4, then access http://localhost:<host-port>
Check:
If the pod has an IP.
If the service has endpoints.
If the ports are configured correctly:
# Service configuration
spec:
  ports:
    - name: http
      protocol: TCP # Should match the container's protocol
      port: 80 # Internal service port
      targetPort: 8080 # Should match a container port or container port name
If the port-forward is working: if you receive a response (even if the response is an error), there is probably no issue.
No pod IP means a critical failure in the Container Network Interface (CNI). Check if your Kubernetes installation has a working CNI.
No endpoints means either:
There is no pod running.
The label selector of the service does not match any pod (see the check after this list).
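As a sketch, compare the service selector with the pod labels:
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# Then list the pods matching that selector, e.g. if the selector printed above is {"app":"my-app"} (hypothetical label):
kubectl get pods -n <namespace> -l app=my-app --show-labels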
Mitigations
If there is no working CNI, fix the CNI or install a working one.
If there are no endpoints, fix the service label selector or make sure there is a pod running with the proper labels.
Service <-> Ingress Controller
Investigation
kubectl describe svc <service-name> -n <namespace> # 1
kubectl describe ingress <ingress-name> -n <namespace> # 2
kubectl logs -f <ingress-controller-pod> -c <ingress-controller-container> -n <ingress-controller-namespace> # 2
kubectl port-forward svc/<service-name> <host-port>:<service-port> -n <namespace> # 3, then access http://localhost:<host-port>
Check:
If the service has an IP.
If the ingress targets the service and if the ingress controller logs have errors.
If the port-forward is working: if you receive a response (even if the response is an error), there is probably no issue.
No service IP means a critical failure in the Container Network Interface (CNI). Check if your Kubernetes installation has a working CNI.
A wrong service in the ingress rule will cause errors in the ingress controller logs.
Mitigations
Make sure the service has an IP and is reachable through port-forwarding.
Check and fix the Ingress object.
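As a minimal sketch of what a correct Ingress object could look like (host, names, class, and ports are placeholders, not values from the Helm charts):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: <ingress-name>
  namespace: <namespace>
spec:
  ingressClassName: <ingress-class> # Must match the class served by your ingress controller
  rules:
    - host: <your-service-url>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: <service-name> # Must match the Service name
                port:
                  number: 80 # Must match a Service port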
Service LB <--> Load Balancer
Investigations
kubectl describe svc <service-name> -n <namespace>
Check if the service has an external IP. If it doesn't, your cloud controller manager is probably not configured correctly (see the command below).
If you are using MetalLB/ServiceLB, check the MetalLB/ServiceLB configuration.
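For example, to print only the external address assigned by the load balancer (the field can be ip or hostname depending on the provider):
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.status.loadBalancer.ingress}'
# An empty result means no external address has been assigned yet.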
Mitigations
On cloud, check if the cloud controller manager is configured correctly.
On MetalLB/ServiceLB, check the MetalLB/ServiceLB configuration.
Ingress <-- External Network
Investigations
curl -fsSL https://<your-service-url>
Check if:
You have a response (even if it's an error). No response means a firewall issue.
If the response is 404 Not Found, the Ingress object is not configured correctly.
If there is a TLS issue, the Ingress object is not configured correctly (see the verbose curl example after this list).
If the response is 502 Bad Gateway, the Ingress object is not configured correctly, or there is no pod running, or the service is not reachable.
If the response is 504 Gateway Timeout, the pod is able to take the request but takes too much time to respond.
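For example, a verbose curl shows the DNS resolution, the TLS handshake, the certificate, and the HTTP status in one shot:
curl -vkI https://<your-service-url>
# -v: verbose (handshake and headers), -k: ignore certificate errors, -I: HEAD request only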
Mitigations
Look for inbound rules and outbound rules in your firewall.
Check and fix the Ingress object.
Check if the pods and services are reachable.
Node failure
A node is not reachable.
Detection
kubectl get nodes
kubectl get events --all-namespaces
Investigation
Sadly, we cannot give you a proper guide for node failures, as there could be many reasons, and this goes beyond the scope of this guide.
Here's a non-exhaustive list of potential causes:
The worker node is not able to reach the control plane nodes.
Kubelet is failing. Check with journalctl -u kubelet.service. (Note: different Kubernetes distributions use different service names.)
The kernel is panicking. Check with dmesg.
Not enough memory on the node. Check with free -m.
Not enough disk space on the node. Check with df -h.
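Before logging on to the node itself, a sketch to read the conditions reported by the kubelet:
kubectl describe node <node-name>
# Or only the conditions:
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
# You are looking for Ready=True; MemoryPressure, DiskPressure, and PIDPressure should be False.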
OOMKilled
Check the tuning guide:
Tuning resources