Rolling updates in Kubernetes

Posted by Martijn van Lambalgen on January 8, 2018

So, you want to do rolling updates of your services in Kubernetes to get zero downtime? That’s what we were after too, and what we’re doing now (mostly). Our journey involved quite a bit of research, filling gaps in our knowledge, learning from a multitude of mistakes, and a fair bit of trial and error. To make your journey more efficient, here is what we learned.

RollingUpdate strategy in Kubernetes

Naturally, we started with the configuration of the deployment process in Kubernetes itself. That’s the ‘easy’ part, and there was a time when the naïve part of us thought this was going to solve all our problems. Now we know better! There are several parameters that need to be configured correctly, and for that you really need to understand the type of service you’re trying to deploy.
As part of the Kubernetes deployment specification you can set a strategy.

- kind: Deployment
  ...
  spec:
    replicas: 3
    ...
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 2
    ...

This tells Kubernetes that this service should use a rolling update when a new version is being deployed. The goal is that a new pod is created and ready before an old one is terminated. Getting this to work properly takes quite a bit of work, however. Let’s first go through the configuration parameters:

maxUnavailable:

  • The number of pods that may be unavailable during the deployment. This is relative to the desired number, specified in replicas.
  • It can be either an absolute number or a percentage.
  • It should obviously be lower than the number of replicas, or no pods may remain available during the update.

maxSurge:

  • The (additional) number of pods that can be created, on top of the desired number of replicas.
  • This too can be either an absolute number or a percentage.
  • A higher number may speed up the deployment, but will also require more system resources.

With these two parameters you can tell Kubernetes how much freedom it has in creating a few extra pods during the deployment process, and whether it is also allowed to have fewer. There’s no telling whether Kubernetes will actually first add new pods, or start with removing old ones. Needless to say, it is not possible to set both maxUnavailable and maxSurge to 0.
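
To make the two parameters concrete, here is a rough sketch of the arithmetic in bash. The function names are made up for this illustration and are not part of kubectl. The rounding rule for percentages (down for maxUnavailable, up for maxSurge) is the one described in the Kubernetes Deployment documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mirror how a rolling update turns
# maxUnavailable / maxSurge into pod counts.

# resolve VALUE REPLICAS ROUNDING: print VALUE as an absolute pod count.
# VALUE is an absolute number ("1") or a percentage ("25%").
resolve() {
  local value=$1 replicas=$2 rounding=$3
  if [[ $value == *% ]]; then
    local pct=${value%"%"}
    if [[ $rounding == up ]]; then
      echo $(( (replicas * pct + 99) / 100 ))   # integer ceiling
    else
      echo $(( replicas * pct / 100 ))          # integer floor
    fi
  else
    echo "$value"
  fi
}

# min_available REPLICAS MAX_UNAVAILABLE: fewest pods that stay available
min_available() { echo $(( $1 - $(resolve "$2" "$1" down) )); }

# max_total REPLICAS MAX_SURGE: most pods (old + new) that may exist at once
max_total() { echo $(( $1 + $(resolve "$2" "$1" up) )); }
```

For the example deployment (3 replicas, maxUnavailable 1, maxSurge 2), `min_available 3 1` prints 2 and `max_total 3 2` prints 5. With 10 replicas and 25% for both, `min_available 10 25%` prints 8 and `max_total 10 25%` prints 13.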

You can find more information on rolling updates here: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment

Readiness probe

How does Kubernetes decide whether a new pod is ready to receive incoming requests? For this purpose a readiness probe can be configured. A readiness probe tells Kubernetes whether the container is ‘ready’ to receive (and process) traffic. Only the container itself knows when it is ready. At the first startup there may be all kinds of initialization that needs to be done. The readiness probe may also be used to indicate temporary downtime for maintenance in the container, but in our case we’re only interested in readiness after startup.

It depends completely on your service what this readiness probe looks like. Readiness probes can be specified as HTTP requests, command line scripts or through TCP sockets.

HTTP request

The type we used most often is a simple HTTP request to a health endpoint. For example, if your service is just a simple webserver, it is ready as soon as it can process a request. A /healthz endpoint then only needs to return a 200 status code to indicate that the service is ready.

     readinessProbe:
        httpGet:
          path: /healthz
          port: 80
          scheme: HTTP
        initialDelaySeconds: 20
        periodSeconds: 5
        timeoutSeconds: 1

Kubernetes will keep sending GET requests to the specified endpoint, and considers the container ‘ready’ when it receives a successful response. Any HTTP status code from 200 up to (but not including) 400 counts as success, so a plain 200 OK is enough. There are several parameters to configure this. Some of the most important ones are:

initialDelaySeconds:

  • After starting the container, Kubernetes waits this number of seconds before trying the probe for the first time. This gives the container time to initialize and start up.

periodSeconds:

  • Kubernetes repeats the readiness probe with this many seconds in between.
  • Choose the value wisely. Too low and it may become a performance hit. Also, every minor hiccup may then result in no traffic being handled, which may result in the whole service becoming unavailable. Too high and Kubernetes will unnecessarily keep sending requests to a pod that cannot handle them.

timeoutSeconds:

  • If the probe does not return a result within this time limit, it is considered failed.
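
To get a feel for the periodSeconds trade-off, it helps to do the arithmetic. A pod is only marked as not ready after a number of consecutive failed probes. That number is set by failureThreshold, a probe parameter that is not shown in the examples here and that, to our knowledge, defaults to 3. The sketch below is a rough estimate only, and `unready_after` is a name made up for this illustration.

```shell
#!/usr/bin/env bash
# Rough estimate of how long a broken pod keeps receiving traffic:
# it takes failureThreshold consecutive failures, periodSeconds apart,
# before Kubernetes marks the pod as not ready.
# unready_after is a hypothetical helper, not a Kubernetes tool.
unready_after() {   # unready_after PERIOD_SECONDS [FAILURE_THRESHOLD]
  local period=$1 threshold=${2:-3}   # 3 is assumed to be the default failureThreshold
  echo $(( period * threshold ))
}
```

With the periodSeconds of 5 from the example, `unready_after 5` prints 15. With `unready_after 1` the delay drops to about 3 seconds, at the cost of five times as many probe requests.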

Command line scripts

Sometimes you have containers that are not webservers and that cannot handle HTTP requests. In those cases it may be easier to just run a shell script. Kubernetes will run the script inside the container and hope for an exit code 0. If the exit code doesn’t indicate an error, the container is considered ‘ready’.

     readinessProbe:
        exec:
          command:
            - "/opt/readiness.sh"
        initialDelaySeconds: 60
        periodSeconds: 5

To configure the scripted readiness probe, you can use the same parameters as for the HTTP request probe.
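
What such a script checks depends entirely on the container. As a minimal sketch, assume the application creates a marker file once its initialization is done. The file path, the READY_FILE variable and the `check_ready` function are all invented for this example.

```shell
#!/usr/bin/env bash
# Sketch of a readiness script such as /opt/readiness.sh.
# Assumption: the application creates READY_FILE when it has finished
# initializing (the default path below is made up for this example).
READY_FILE="${READY_FILE:-/tmp/app-ready}"

check_ready() {
  if [[ -f "$READY_FILE" ]]; then
    return 0   # exit code 0: Kubernetes considers the container 'ready'
  fi
  echo "not ready: $READY_FILE does not exist yet" >&2
  return 1     # non-zero exit code: the probe fails
}
```

A real script would call `check_ready` as its last line, so that the function's return status becomes the exit code that the probe sees.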

TCP socket

Kubernetes will try to open a socket on a specified port to your container. If this succeeds, the container is considered ‘ready’.

     readinessProbe:
       tcpSocket:
         port: 8080
       initialDelaySeconds: 5
       periodSeconds: 10

To configure the TCP socket readiness probe, you can use the same parameters as with the other two types.

More details about all the possibilities with readiness probes and all the (additional) configuration options can be found here: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes

SIGTERM

So, when we finished this, Kubernetes could start up new pods and hold off sending traffic to the containers until they were ready, but the old pods were killed too quickly and too harshly. That is because Kubernetes will initially kindly ‘ask’ the container to terminate by sending a SIGTERM (termination signal) to the main process (PID 1). However, if the container has not stopped within 30 seconds (the default grace period, which can be changed with terminationGracePeriodSeconds in the pod spec), Kubernetes just kills it without mercy. That’s not what we had in mind, so… all we needed to do was let the main process listen for the TERM signal, and let it shut down gracefully. That’s it!

Sounds easy, but this was the hardest part for us. We had quite a few different containers, with different types of main processes. Furthermore, if you want the main process to do a graceful shutdown, that main process should propagate the TERM signals to all child processes, so they can in turn do their own graceful shutdown. If you miss one process and that process is doing actual work when you kill it, the original request will fail. Most of the time this resulted in 502 Bad Gateway errors which were also hard to debug.

Here are some of the types of processes we had to go through when getting our containers ready for graceful shutdowns.

Java process

For Java processes it was rather easy to react to the SIGTERM. This code in the main method shows what you need to do.

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    // Runs when the JVM starts shutting down, e.g. after receiving SIGTERM
    log.info("Shutdown hook triggered by SIGTERM. Performing graceful shutdown.");
    // Finish own work
    // Propagate SIGTERM to child processes, wait for them to finish
}));

That’s all! Fortunately not all things are difficult.

Jetty

Depending on which version of Jetty you have, you need to configure the graceful shutdown differently, but let’s say you have Jetty 9.x. Then this is what you need to do:

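import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.StatisticsHandler;

YourServletHandler servletHandler = new YourServletHandler();
// The StatisticsHandler keeps track of in-flight requests
StatisticsHandler statsHandler = new StatisticsHandler();
statsHandler.setHandler(servletHandler);

Server server = new Server(80);
server.setHandler(statsHandler);
server.setStopAtShutdown(true);  // stop Jetty when the JVM shuts down
server.setStopTimeout(3000);     // milliseconds available for the shutdown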

Basically you need to do three things:

  1. Enable stopping Jetty when the Java process is being terminated, by setting StopAtShutdown to true
  2. Specify the StopTimeout to tell Jetty how much time can be used for the shutdown
  3. Set a StatisticsHandler, because stats about the connection status are needed for a graceful shutdown

Shell scripts

Terminating shell scripts proved challenging. After hearing some promising rumors, we hoped that a shell script would automatically propagate the SIGTERM to its child processes, but that turned out to be a vain hope.

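#!/usr/bin/env bash

# Called on SIGTERM: ask nginx to quit gracefully and wait until it has exited
_term() {
  echo "We caught a SIGTERM signal!"
  nginx -s quit
  wait "$nginxPID"
}

# Install the handler before starting nginx
trap _term SIGTERM

# Run nginx in the background so this script stays free to handle signals
nginx -g "daemon off; error_log /dev/stderr debug;" &
nginxPID=$!
wait "$nginxPID"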

Signals have to be trapped explicitly, after which we can explicitly propagate the TERM signal to the child processes. To terminate child processes you need to know their PID. In our example, we started nginx and captured its PID, so we can later ask nginx to shut down gracefully and wait for it. It is important to ‘wait’ until the child process has finished shutting down, because if the shell script itself finishes too soon, the whole container may be stopped, and these children may get killed as well (Please note how cruel that sounds, so be careful).
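
The script above handles a single child. When a container starts several background processes, the same idea can be generalized. The following is a minimal sketch under that assumption. `CHILD_PIDS`, `start_child` and `forward_term` are names invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: forward SIGTERM to several background children.
# CHILD_PIDS, start_child and forward_term are hypothetical names.
CHILD_PIDS=()

# start_child CMD [ARGS...]: run a command in the background and remember its PID
start_child() {
  "$@" &
  CHILD_PIDS+=("$!")
}

# Signal handler: pass the TERM signal on to every child, then wait for all of them
forward_term() {
  echo "Caught SIGTERM, forwarding to: ${CHILD_PIDS[*]}"
  kill -TERM "${CHILD_PIDS[@]}" 2>/dev/null || true
  wait "${CHILD_PIDS[@]}" || true
}

# Install the handler before any child is started
trap forward_term TERM
```

A script using this would call `start_child` once per process and end with `wait "${CHILD_PIDS[@]}"`, just like the single-process version. Note that plain TERM is forwarded here; a process such as nginx that prefers a different signal for a graceful stop still needs its own command in the handler.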

For more information on termination of pods and graceful shutdowns, see: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

Ingress-controller

One final problem we had to tackle was caused by the ingress-controller. This is the piece that monitors changes in the ingress resources via the Kubernetes API and updates the configuration of a load balancer whenever something changes. Since it only synchronizes once every 10 seconds or so, the following can happen during a rolling update: first only the old pods can be resolved, then a second later the new pod is ready, after which the old pods get SIGTERMs one by one. If a request comes in at that point, the ingress-controller has not yet synchronized with the new situation, and will pass the request on to an old, terminating pod. Since that pod is no longer available, the user gets a 502 Bad Gateway error.

After trying multiple solutions we ended up putting the ingress-controller in its own pod to reduce the number of times it is restarted, and setting a pre-stop hook in each of the other pods that makes them wait for 15 seconds after being marked for termination. In those 15 seconds, requests will still be processed, but the ingress-controller will remove that pod from its list when it synchronizes again.

lifecycle:
   preStop:
     exec:
       command: ["sleep", "15"]

The result is a small overlap during which both the old and the new pods are available for handling requests, which gives the ingress-controller time to update.
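
One thing to keep in mind with this approach: as far as we understand the Kubernetes pod termination documentation, the time spent in the preStop hook counts against the same grace period (30 seconds by default) that the container gets for its shutdown. The sleep and the graceful shutdown together should therefore fit within that period. A small sanity check could look like this. `fits_grace_period` is a made-up helper and the numbers are only examples.

```shell
#!/usr/bin/env bash
# Sketch: check that the preStop sleep plus the application's own shutdown
# time fits within the termination grace period.
# fits_grace_period is a hypothetical helper, not a Kubernetes tool.
fits_grace_period() {   # fits_grace_period PRESTOP_SECONDS SHUTDOWN_SECONDS [GRACE_SECONDS]
  local prestop=$1 shutdown=$2 grace=${3:-30}   # 30 seconds is the default grace period
  (( prestop + shutdown <= grace ))             # succeeds (status 0) when everything fits
}
```

For example, `fits_grace_period 15 3` succeeds (15 seconds of sleep plus the 3 second Jetty stop timeout used earlier), while `fits_grace_period 15 20` fails, which would be a reason to raise terminationGracePeriodSeconds.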

Conclusion

After doing all this, we stopped seeing weird errors during our deployments, which is a pretty good indication that we’re at least on the right track. In summary, there are four things you need to do to make your container deployments in Kubernetes smooth, without any downtime.

  1. Configure your containers in Kubernetes to use RollingUpdate
  2. Make sure Kubernetes can figure out when the new container is ready
  3. Make sure old containers (and all the processes in them) shut down gracefully
  4. Let old pods be available a bit longer when the new pod is already available

With that you should be able to do lots and lots of deployments during the day, during the night, and during the weekend without any of your users noticing any downtime. Good luck!