Increasing resilience in Kubernetes - Kudos Engineering - Medium

By Tim Little


High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Our Kubernetes node problem

This is a problem we recently had with our Google Kubernetes Engine (GKE) cluster: the Kubernetes nodes in the cluster were crashing regularly.

The impact of this was that containers scheduled to the problem node became unresponsive and our monitoring systems started spamming us with SLO and Error Budget consumption alerts. To read more about what SLOs and Error Budgets are and how we monitor them, check out one of my previous blog posts.

Running the gcloud container operations list command showed us that the node was auto-repairing every few hours.

However, when we started to look at Stackdriver Logging, we were seeing tons of logs, which made looking for the problem like drinking from a firehose.

So we opened a support case with Google to assist in our hunt for the root cause.

Unfortunately, due to the timezone difference and our account being on a legacy support package, the response times we were seeing from Google were around 24 hours per message.

This problem resulted in our Error Budgets being depleted for a bunch of our services in Kubernetes. So, in true SRE style, we put dedicated people onto increasing the resilience of the Kubernetes cluster and Istio service mesh.

Adding resilience to Kubernetes services

We started by taking a deeper look into the setup of our Kubernetes cluster and the services we were running on it, then identifying areas where we could add resilience. The goal of this was to reduce the impact of a single node becoming unresponsive.

Number of replicas and distribution of pods on nodes

The first area of improvement was to increase the number of replicas our Kubernetes Deployments had. At Kudos we build services using the microservice architecture. These services are small, hold no state and are perfect for horizontal scaling.

However, some of the deployments only had one pod running. If that pod just so happened to be on a node that had an issue, then we saw an outage for that service. So we modified all our deployments to run at least three replicas for resilience.
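As a rough illustration of that change, a Deployment manifest with three replicas might look like the sketch below. The name example-service, the image path, and the port are hypothetical placeholders; only replicas: 3 reflects the change described above.

```yaml
# Minimal sketch of a Deployment running three replicas.
# "example-service", the image, and the port are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3            # three pods, so losing one still leaves two serving
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service   # must match the selector above
    spec:
      containers:
      - name: example-service
        image: example.invalid/example-service:1.0.0   # placeholder image
        ports:
        - containerPort: 8080
```

Raising the replica count only helps if the pods end up on different nodes, which is not something replicas on its own guarantees.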

This led to another problem: at the time we only had three nodes (we have fourteen now), and sometimes Kubernetes would schedule all the pods to the same node.

So we looked at a Kubernetes feature called Pod AntiAffinity. This instructs the scheduler to steer a pod away from any node that already runs a pod with the same label, e.g. app=<service-name>.

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - <service-name>
        topologyKey: ""

With that added and all the services redeployed, we verified that the pods were scheduled to different nodes by running kubectl get pods -o custom-columns='NAME:.metadata.name,NODENAME:.spec.nodeName'

Now if one of the nodes becomes unresponsive, we only lose one of the three pods in a deployment, so the service still works.

Readiness probes and pre-stop commands

Another issue we noticed was that when a node was being repaired, Kubernetes would start sending traffic to it before it was ready. We had seen similar behaviour when we were upgrading our Kubernetes node pool but never got to the bottom of why.

After a bit of Googling we found a blog post about Zero-Downtime Rolling Updates. One of the main takeaways from that post was the asynchronous nature of the pod being marked as terminated and it being removed from the load balancer:

This re-configuration happens asynchronously, thus makes no guarantees of correct ordering, and can and will result in few unlucky requests being routed to the terminating pod.

Following on from that, we added readiness probes to all of our services and added a preStop command that waits for 10 seconds before terminating, to allow those few connections to still be served.

readinessProbe:
  httpGet:
    path: /readiness
    port: 8080
  initialDelaySeconds: 15
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh","-c","sleep 10"]
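For context, here is a sketch of where those two blocks sit in a pod template. Both belong to an individual container entry. The container name and image are hypothetical, and terminationGracePeriodSeconds is shown at its Kubernetes default of 30 seconds only to make the point that the preStop sleep has to fit inside that window.

```yaml
# Sketch of the pod template portion of a Deployment (spec.template).
# Container name and image are hypothetical placeholders.
spec:
  terminationGracePeriodSeconds: 30   # Kubernetes default; must be longer than the preStop sleep
  containers:
  - name: example-service
    image: example.invalid/example-service:1.0.0
    readinessProbe:                   # the pod only receives Service traffic once this succeeds
      httpGet:
        path: /readiness
        port: 8080
      initialDelaySeconds: 15
    lifecycle:
      preStop:                        # runs before the container is sent SIGTERM
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]
```

Note that an exec hook like this needs /bin/sh and sleep to exist in the container image, so it would have to be adapted for minimal images that ship without a shell.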

Istio Horizontal Pod Autoscaler

Now with Kubernetes more stable, we noticed another single point of failure in our Kubernetes cluster: Istio.

We are using the Istio for GKE add-on for our Kubernetes cluster. This allowed us to start using Istio without the extra operational overhead of managing the Istio control plane.

Istio enables us to route traffic to any of the services within the mesh without every service requiring a public IP address.

It does this by having an ingressgateway, which is an Envoy proxy that sits at the edge of the mesh and is attached to a Google load balancer for external access from the internet.

The problem we were seeing was that, by default, the Istio for GKE add-on only deploys one pod for the ingressgateway, which means that if the node hosting the ingressgateway pod suddenly went away, we would lose access to all our services in the mesh.

That was less than ideal, so we looked into the Istio setup a little more and found that Google have already thought about this and allow you to modify the control plane by using the deployed HorizontalPodAutoscaler. To do this, we needed to add resource requests to the Istio control plane components with the command kubectl edit -n istio-system Deployments/istio-telemetry and then add the following to the containers:

resources:
  requests:
    cpu: 100m

This allowed the HorizontalPodAutoscaler to detect the CPU utilisation of the Istio pods and scale the pods accordingly. We also edited the HorizontalPodAutoscaler to have a minimum of three replicas running, similar to what we do for our own services.
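A sketch of what an edited autoscaler could look like follows. It assumes the gateway Deployment and its HorizontalPodAutoscaler are both named istio-ingressgateway in the istio-system namespace, which is the usual Istio naming but worth checking against your own cluster; the maxReplicas and CPU target values are illustrative only.

```yaml
# Sketch of a HorizontalPodAutoscaler for the ingress gateway.
# Resource names are assumed; maxReplicas and the CPU target are illustrative.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
  minReplicas: 3                        # never drop below three gateway pods
  maxReplicas: 5                        # illustrative upper bound
  targetCPUUtilizationPercentage: 80    # measured relative to the CPU request
```

The CPU target is expressed as a percentage of each container's CPU request, which is why the autoscaler has nothing to work with until a request such as cpu: 100m is set on the pods.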

The root cause

While the resilience work was underway, we were still in contact with the Google support team.

The support team advised us to take a deeper look at the Kubernetes nodes and suspected that this was a resource exhaustion problem.

After a bit of troubleshooting on the Kubernetes node, we noticed a leak in process IDs on a few of the nodes. There were thousands of <defunct> processes in the process tree that were not being terminated correctly.

After checking the parent process ID we found our problem.

The process 3491355 was one of our containers that converts HTML pages to PDFs. It does this using an instance of headless Google Chrome.

This service had a /readiness handler that was checking google-chrome --version. It looks like this check just runs cat on a version file, and the cat process was not being terminated correctly.

After modifying the readiness handler and redeploying our pdfgenerator service, we saw a massive reduction in the number of PIDs on the nodes hosting that service.

This problem had already been anticipated by the Kubernetes development team, who introduced a PID limit in Kubernetes 1.14. However, as we were running 1.13 in production, we unfortunately missed out on this change.
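For readers on 1.14 or later who manage their own kubelet configuration, the limit can be sketched as a KubeletConfiguration fragment. Treat this as an example under assumptions rather than something we ran: the value 1024 is arbitrary, the SupportPodPidsLimit feature gate is listed explicitly only for clarity, and on a managed service such as GKE the kubelet configuration may not be directly editable.

```yaml
# Sketch of a per-pod PID limit in the kubelet configuration file.
# The limit value is an arbitrary example, not a recommendation.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  SupportPodPidsLimit: true   # feature gate covering per-pod PID limits
podPidsLimit: 1024            # maximum number of PIDs any single pod may use
```

With a limit like this in place, a pod that leaks processes hits its own ceiling instead of exhausting the PIDs available to the whole node.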


We learned a tremendous amount about Kubernetes and Istio off the back of this issue. We changed our microservice template to make the default microservice deployed into our Kubernetes cluster more resilient, and we scaled our cluster to help support new services.