Kubernetes has become the de facto standard for deploying cloud-native applications, and its popularity has grown alongside the shift toward microservices architectures. Systems built from lightweight, loosely coupled, autonomous web services are well suited to being deployed as distributed applications.

However, managing distributed applications at scale is challenging, especially when multiple components are involved, and it becomes harder still when we want to guarantee the system's overall correctness in the face of partial failures.

In this tutorial, we will explore the three Kubernetes probes that enable the construction of highly available, robust, and self-healing distributed applications, using NGINX as our demo application. We will begin by examining the limitations of a basic deployment and then improve it incrementally using Kubernetes probes.

By the conclusion of this guide, we will have established a reliable methodology that can be employed for deploying any other web application in a production environment. So, let’s dive into the tutorial!

Prerequisites

To get the most out of this guide, you will need access to the following resources and tools:

Setting up core components

Before we begin, you will need to ensure that you have a new namespace and a basic NGINX deployment created. Let’s run over how you can do that:

Creating a new namespace

Kubernetes namespaces provide a way to isolate resources within a cluster. This step is completely optional, but for this tutorial we are creating a dedicated namespace so that the demo doesn't conflict with other resources you may already have running on the cluster.

So, let's create a new namespace with the name probe-demo:

$ kubectl create ns probe-demo

Next, let's set the probe-demo namespace as the current context:

$ kubectl config set-context --current --namespace=probe-demo
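
Optionally, we can double-check that kubectl is now pointing at the new namespace by inspecting the active context (this assumes a standard kubeconfig layout):

$ kubectl config view --minify | grep namespace: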

Creating a basic NGINX deployment

Now, let's create a basic NGINX deployment without any health checks. Save the following as basic-deployment.yml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Next, let's deploy this configuration using kubectl apply -f basic-deployment.yml and verify that all pods are in a healthy state:

$ kubectl apply -f basic-deployment.yml 
deployment.apps/nginx created

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
nginx-ff6774dc6-4lxcz   1/1     Running   0          23s
nginx-ff6774dc6-wrsjj   1/1     Running   0          23s

As we can see, all the pods are now in the Running state.

Understanding the need for Kubernetes probes

In the previous stage, we created a basic NGINX deployment without any health checks. Kubernetes does, however, provide a default process health check, which verifies that the container's main process is running. If the process dies, Kubernetes restarts the container.

In addition, we are using a Kubernetes Deployment object to run multiple replicas of the pod (two, in this case). So can we say that this deployment is robust and resilient to failures? Not really. Let's see why.

First, let's verify that the NGINX web server is healthy and able to render the welcome page. Make sure to swap in one of the pod names from your cluster, as they will be different:

$ kubectl exec -it nginx-ff6774dc6-4lxcz -- curl http://localhost

If everything is fine, then the above command will display the default NGINX HTML welcome page on the terminal.

The NGINX web server renders its welcome page from the /usr/share/nginx/html/index.html file. Now, to simulate an error scenario, let's delete this file on the container and execute the same HTTP request using the curl command:

$ kubectl exec -it nginx-ff6774dc6-4lxcz -- rm /usr/share/nginx/html/index.html

$ kubectl exec -it nginx-ff6774dc6-4lxcz -- curl http://localhost

In the output of the above command, we now get a 403 Forbidden error instead of the welcome page.

Notice that even though the NGINX daemon is still running, it is no longer serving any functional purpose: because it cannot render the required page, it returns HTTP status code 403.

Any other web application can end up in a similar situation. For example, the Java Virtual Machine (JVM) might throw an OutOfMemoryError while the JVM process itself keeps running. This is problematic because the application cannot serve any requests, yet the process health check still considers it healthy.

In such scenarios, the quick and short-term fix is to restart the Pod. Wouldn't it be great if this could happen automatically? In fact, we can achieve this using Kubernetes probes. So let's learn more about them.

Types of Kubernetes probes

Monitoring the health of an application is an essential task. However, monitoring alone is not sufficient; we must also take corrective action on failures to maintain the overall availability of the system. Kubernetes provides a reliable way to achieve this using probes. It offers the following three types:

  1. Liveness probe: This probe constantly checks whether the container is healthy and functional. If it detects an issue, then by default, it restarts the container.
  2. Readiness probe: This probe checks whether the container is ready to accept incoming requests. If it is, requests are routed to the container for processing.
  3. Startup probe: This probe determines whether a container has finished starting up.

Each probe provides three different methods for checking the application's health; a short sketch of all three follows this list:

  • Command: This method executes the provided command inside the container. An exit code of 0 indicates success.
  • TCP: This method attempts to establish a TCP connection with the container on a specified port. A successful connection indicates success.
  • HTTP request: This method sends an HTTP request to the container. A response status code between 200 and 399 (both inclusive) indicates success.
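
To make the three methods concrete, here is a rough sketch of how each one is declared on a container. The port, path, and file name below are placeholders rather than values taken from our NGINX deployment:

# Command (exec): succeeds when the command exits with code 0
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy

# TCP: succeeds when a TCP connection can be opened on the given port
readinessProbe:
  tcpSocket:
    port: 8080

# HTTP request: succeeds on a response status code between 200 and 399
startupProbe:
  httpGet:
    path: /healthz
    port: 8080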

So which probe type and method should you use? There is no single fixed answer, because it depends on your application; that is exactly why the different options exist. Choose the method that is most natural for the application you are deploying.

Liveness probe

In the previous section, we saw that the process health check cannot tell whether an application is actually functional; it only confirms that the process appears alive. Sometimes, restarting the application is enough to resolve such intermittent issues, and in those cases we can use the Kubernetes liveness probe.

The liveness probe allows us to define an application-specific health check. In other words, this mechanism provides a reliable way to monitor the health of any given application. Let's understand its usage with an example.

Defining a liveness command

Let's define the liveness probe to check the existence of the /usr/share/nginx/html/index.html file. We can use the ls command to achieve this. After adding the liveness probe, our deployment definition from earlier looks like this:

command-liveness.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/index.html

Now, let's deploy this updated configuration and verify that the pods are in a healthy state:

$ kubectl apply -f command-liveness.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-847c64cc7c-664lf   1/1     Running   0          42s
nginx-847c64cc7c-jb2bq   1/1     Running   0          46s

Next, let's delete the /usr/share/nginx/html/index.html file from the pod and observe its events. Once again, make sure to substitute the pod name from kubectl get pods on your cluster:

$ kubectl exec -it nginx-847c64cc7c-664lf -- rm /usr/share/nginx/html/index.html

$ kubectl get event --namespace probe-demo --field-selector involvedObject.name=nginx-847c64cc7c-664lf
LAST SEEN   TYPE      REASON      OBJECT                       MESSAGE
3m21s       Normal    Scheduled   pod/nginx-847c64cc7c-664lf   Successfully assigned probe-demo/nginx-847c64cc7c-664lf to proble-demo-control-plane
0s          Normal    Pulling     pod/nginx-847c64cc7c-664lf   Pulling image "nginx"
3m18s       Normal    Pulled      pod/nginx-847c64cc7c-664lf   Successfully pulled image "nginx" in 2.19317793s
3m18s       Normal    Created     pod/nginx-847c64cc7c-664lf   Created container nginx
3m18s       Normal    Started     pod/nginx-847c64cc7c-664lf   Started container nginx
1s          Warning   Unhealthy   pod/nginx-847c64cc7c-664lf   Liveness probe failed: ls: cannot access '/usr/share/nginx/html/index.html': No such file or directory
1s          Normal    Killing     pod/nginx-847c64cc7c-664lf   Container nginx failed liveness probe, will be restarted

In the above output, we can see that Kubernetes marked the pod as unhealthy and restarted it; the details appear in the REASON and MESSAGE columns, respectively.

Finally, let's verify that the pod has been restarted:

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS      AGE
nginx-847c64cc7c-664lf   1/1     Running   1 (42s ago)   4m2s
nginx-847c64cc7c-jb2bq   1/1     Running   0             4m6s

In the above output, the RESTARTS column indicates that the pod has been restarted once, 42 seconds ago.
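
If we want more detail than the event list provides, kubectl describe pod also shows the configured liveness probe together with the recent probe-related events. Substitute your own pod name:

$ kubectl describe pod nginx-847c64cc7c-664lf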

Defining a TCP liveness probe

Similar to the command probes, we can use the TCP socket probe to check the health of the application. As the name suggests, this probe attempts to establish a TCP connection with the container at a specified port. The probe is considered successful if the connection gets established successfully.

Currently, the NGINX server is running on port 80. To simulate an error, let's try to connect to port number 8080 using the following TCP probe:

livenessProbe:
  tcpSocket:
    port: 8080

After adding this probe configuration, the deployment descriptor looks like this:

tcp-liveness.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          tcpSocket:
            port: 8080

Now, let's deploy this updated configuration and check the events of the pod:

$ kubectl apply -f tcp-liveness.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5bb9d87b58-cwpzf   1/1     Running   0          28s
nginx-5bb9d87b58-rfnlh   1/1     Running   0          21s

$ kubectl get event --namespace probe-demo --field-selector involvedObject.name=nginx-5bb9d87b58-cwpzf
LAST SEEN   TYPE      REASON      OBJECT                       MESSAGE
33s         Normal    Scheduled   pod/nginx-5bb9d87b58-cwpzf   Successfully assigned probe-demo/nginx-5bb9d87b58-cwpzf to proble-demo-control-plane
3s          Normal    Pulling     pod/nginx-5bb9d87b58-cwpzf   Pulling image "nginx"
27s         Normal    Pulled      pod/nginx-5bb9d87b58-cwpzf   Successfully pulled image "nginx" in 5.719997947s
1s          Normal    Created     pod/nginx-5bb9d87b58-cwpzf   Created container nginx
0s          Normal    Started     pod/nginx-5bb9d87b58-cwpzf   Started container nginx
3s          Warning   Unhealthy   pod/nginx-5bb9d87b58-cwpzf   Liveness probe failed: dial tcp 10.244.0.7:8080: connect: connection refused
3s          Normal    Killing     pod/nginx-5bb9d87b58-cwpzf   Container nginx failed liveness probe, will be restarted
1s          Normal    Pulled      pod/nginx-5bb9d87b58-cwpzf   Successfully pulled image "nginx" in 1.928558648s

In the above output, we can see that the liveness probe failed because the connection was refused on port 8080. To fix this issue, we can correct the liveness probe to use port 80, where the server is listening.
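
For reference, the corrected probe simply points at the port NGINX actually listens on:

livenessProbe:
  tcpSocket:
    port: 80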

Defining a liveness HTTP request

Many web applications expose an HTTP endpoint to report the health of the application. For example, Spring Boot's Actuator module exposes the /actuator/health endpoint to report the application's status. So let's see how to configure an HTTP endpoint in a liveness probe next.
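
As an illustration only, assuming a Spring Boot application that serves its Actuator endpoints on port 8080, such a health endpoint could be wired into a liveness probe like this:

livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080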

By default, the NGINX server renders the welcome page at a base URL. To simulate an error, let's try to hit a non-existing HTTP endpoint using the following probe:

livenessProbe:
  httpGet:
    path: /non-existing-endpoint
    port: 80

After adding the probe configuration, the complete deployment descriptor looks like this: http-liveness.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /non-existing-endpoint
            port: 80

Now, let's deploy this configuration and check the events of the pod. Once again, make sure to use the pod name from your cluster rather than the example name below:

$ kubectl apply -f http-liveness.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7c459d7c6c-tlqp5   1/1     Running   0          16s
nginx-7c459d7c6c-vmlq5   1/1     Running   0          11s

$ kubectl get event --namespace probe-demo --field-selector involvedObject.name=nginx-7c459d7c6c-tlqp5
LAST SEEN   TYPE      REASON      OBJECT                       MESSAGE
30s         Normal    Scheduled   pod/nginx-7c459d7c6c-tlqp5   Successfully assigned probe-demo/nginx-7c459d7c6c-tlqp5 to proble-demo-control-plane
30s         Normal    Pulling     pod/nginx-7c459d7c6c-tlqp5   Pulling image "nginx"
27s         Normal    Pulled      pod/nginx-7c459d7c6c-tlqp5   Successfully pulled image "nginx" in 3.58879558s
27s         Normal    Created     pod/nginx-7c459d7c6c-tlqp5   Created container nginx
27s         Normal    Started     pod/nginx-7c459d7c6c-tlqp5   Started container nginx
1s          Warning   Unhealthy   pod/nginx-7c459d7c6c-tlqp5   Liveness probe failed: HTTP probe failed with statuscode: 404
1s          Normal    Killing     pod/nginx-7c459d7c6c-tlqp5   Container nginx failed liveness probe, will be restarted

Here, we can see that the liveness probe failed as expected with HTTP status code 404. To fix this, we can point the liveness probe at a valid HTTP endpoint (such as /).

It is worth noting that the liveness probe is not a solution to all problems. It plays a valuable role only if your application can afford the restart of the affected pod(s), and the restart can solve the application's intermittent issues. It will not fix configuration errors or bugs in your application code.

Readiness probe

In the previous section, we saw how the liveness probe allows us to implement a self-healing system in certain situations. However, from practical experience, we know that in most cases having only a liveness probe is not sufficient.

The liveness probe is able to restart unhealthy containers. However, in some rare cases, the container may not be in a healthy state in the first place, and restarting it will not help. One example of such a scenario is when we try to deploy a new version of the application that is not healthy. Let's understand this with an example.

Rectifying the setup

In the previous section, we deployed an unhealthy pod to illustrate an HTTP liveness probe failure. Let's modify it to use a valid HTTP endpoint. The modified deployment descriptor looks like this:

http-liveness.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80

Now, let's deploy this configuration and verify that the pods are in a healthy state:

$ kubectl apply -f http-liveness.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5bb954fdcb-k42tg   1/1     Running   0          104m
nginx-5bb954fdcb-wsfjm   1/1     Running   0          104m

Breaking the liveness probe

Previously, we saw that the liveness probe plays an important role when the deployed application starts out healthy but becomes unhealthy later. However, the liveness probe can't do much if the application is unhealthy from the start.

To simulate the unhealthy application scenario, let's configure the postStart hook that deletes the /usr/share/nginx/html/index.html file:

breaking-liveness.yml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
        lifecycle:
          postStart:
            exec:
              command: ["/bin/bash", "-c", "rm -f /usr/share/nginx/html/index.html"]

For more information on Kubernetes lifecycle hooks, please see the official documentation. In short, as soon as the container defined in the deployment starts, the postStart hook executes the specified command as part of bringing up the pod.

Now, let's deploy this configuration and observe the behavior of the newly deployed pods:

$ kubectl apply -f breaking-liveness.yml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS     AGE
nginx-76fb56d59f-knsbz   1/1     Running   4 (3s ago)   2m4s
nginx-76fb56d59f-kx24d   1/1     Running   4 (6s ago)   2m6s

As we can see, the pods are now being restarted continuously. Such a scenario can cause production downtime. In the next section, we will discuss how to avoid this undesirable behavior.

Before moving to the next section, let's revert the setup by deploying the configuration from the http-liveness.yaml file from earlier:

$ kubectl apply -f http-liveness.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5bb954fdcb-2fgm2   1/1     Running   0          4m23s
nginx-5bb954fdcb-xmhlr   1/1     Running   0          4m26s

Defining the HTTP readiness probe

In the previous example, we saw how an unhealthy application can cause production downtime. We can mitigate such failures by configuring a readiness probe. The syntax of the readiness probe is similar to that of the liveness probe:

readinessProbe:
  httpGet:
    path: /
    port: 80

Now, let's understand the behavior of the readiness probe with an example.

First, add the readiness probe configuration to handle the unhealthy deployment scenarios:

http-readiness.yml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
        lifecycle:
          postStart:
            exec:
              command: ["/bin/bash", "-c", "rm -f /usr/share/nginx/html/index.html"]
        readinessProbe:
          httpGet:
            path: /
            port: 80

Next, let's deploy this configuration and observe the status of the newly created pod:

$ kubectl apply -f http-readiness.yml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS      AGE
nginx-5bb954fdcb-2fgm2   1/1     Running   0             11m
nginx-5bb954fdcb-xmhlr   1/1     Running   0             11m
nginx-dbbd95c97-mgklw    0/1     Running   6 (86s ago)   5m6s

In the above output, we can see that there are now three pods. The important detail is the READY column.

For the new pod, READY shows 0/1, meaning that 0 of the pod's 1 containers is ready to receive incoming traffic. Because the new pod never becomes ready, the rollout does not proceed, and Kubernetes keeps the older pods serving traffic. In this way, a combination of liveness and readiness probes ensures that only healthy containers receive incoming requests.
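
Another way to see that the rollout is stuck, rather than inferring it from the pod list, is to ask Kubernetes for the rollout status. The command below keeps waiting because the new pod never reports ready (press Ctrl+C to stop it):

$ kubectl rollout status deployment/nginx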

Lastly, we can remove the erroneous postStart section in the deployment to make the deployment healthy.

In this section, we illustrated the readiness probe using the HTTP method alone. However, we can also use the command and TCP methods to configure a readiness probe; their syntax mirrors that of the corresponding liveness probes, as the sketch below shows.
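
For example, a TCP-based readiness probe that simply checks whether NGINX accepts connections on port 80 would look like this:

readinessProbe:
  tcpSocket:
    port: 80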

Startup probe

Kubernetes also provides a startup probe, which is less well known than the other two. It is mainly used with applications that take a while to start up. When a startup probe is configured, the liveness and readiness checks are disabled until the startup probe succeeds, which prevents them from triggering restarts or alerts before the application has had a chance to start.

The syntax of the startup probe is similar to the other probes:

startupProbe:
  httpGet:
    path: /
    port: 80

To understand its usage, let's create an unhealthy deployment with a startup probe. The following deployment removes NGINX's default index.html page via the postStart hook and defines a startup probe that will fail because the page it checks is no longer available:

http-startup.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: probe-demo
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
        lifecycle:
          postStart:
            exec:
              command: ["/bin/bash", "-c", "rm -f /usr/share/nginx/html/index.html"]
        readinessProbe:
          httpGet:
            path: /
            port: 80
        startupProbe:
          httpGet:
            path: /
            port: 80

Now, let's deploy this configuration and verify that the startup probe disables the other two probes:

$ kubectl apply -f http-startup.yaml 
deployment.apps/nginx configured

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5bb954fdcb-2fgm2   1/1     Running   0          35m
nginx-5bb954fdcb-xmhlr   1/1     Running   0          35m
nginx-5f4576574c-dqhfp   0/1     Running   0          24s

$ kubectl get event --namespace probe-demo --field-selector involvedObject.name=nginx-5f4576574c-dqhfp
LAST SEEN   TYPE      REASON      OBJECT                       MESSAGE
95s         Normal    Scheduled   pod/nginx-5f4576574c-dqhfp   Successfully assigned probe-demo/nginx-5f4576574c-dqhfp to proble-demo-control-plane
5s          Normal    Pulling     pod/nginx-5f4576574c-dqhfp   Pulling image "nginx"
93s         Normal    Pulled      pod/nginx-5f4576574c-dqhfp   Successfully pulled image "nginx" in 2.046922256s
33s         Normal    Created     pod/nginx-5f4576574c-dqhfp   Created container nginx
33s         Normal    Started     pod/nginx-5f4576574c-dqhfp   Started container nginx
5s          Warning   Unhealthy   pod/nginx-5f4576574c-dqhfp   Startup probe failed: HTTP probe failed with statuscode: 403
5s          Normal    Killing     pod/nginx-5f4576574c-dqhfp   Container nginx failed startup probe, will be restarted

In the above output, we can see that the pod was marked as unhealthy since the startup probe failed. To make the setup functional again, we can remove the postStart section.

Just like liveness probes, startup probes can also be configured with the command and TCP methods, as in the sketch below.
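
For instance, a command-based startup probe that waits for the welcome page file to exist could look like the following sketch, reusing the file path from earlier in this tutorial:

startupProbe:
  exec:
    command:
    - ls
    - /usr/share/nginx/html/index.html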

Advanced probe configuration

So far, we have used the probes with their default settings: a failing liveness or startup probe results in a container restart after a few consecutive failures, and a failing readiness probe marks the pod as not ready. We can tune how quickly and how often these checks run to match the needs of the application. Each configuration parameter is described in the table below:

Parameter             Description                                                                Default value   Minimum value
initialDelaySeconds   Time to wait after the container has started before any probes are run    0               0
periodSeconds         How often the probe is performed, in seconds                              10              1
timeoutSeconds        Number of seconds after which the probe times out                         1               1
successThreshold      Minimum consecutive successes for the probe to be considered successful   1               1
failureThreshold      Minimum consecutive failures for the probe to be considered failed        3               1
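
As a rough illustration of how these parameters combine (the numbers below are illustrative, not recommendations), a slow-starting application could be given up to 300 seconds to come up before the regular liveness checks take over:

startupProbe:
  httpGet:
    path: /
    port: 80
  failureThreshold: 30   # up to 30 failed attempts are tolerated...
  periodSeconds: 10      # ...10 seconds apart, i.e. a 300-second startup budget
livenessProbe:
  httpGet:
    path: /
    port: 80
  periodSeconds: 10      # once started, check liveness every 10 seconds
  timeoutSeconds: 1      # each check must respond within 1 second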

Summary

Throughout this tutorial, we went over how to configure probes in Kubernetes. First, we discussed the limitations of the default process health check, and then we looked at the different types of probes. After working through practical examples of liveness, readiness, and startup probes, we finished with the advanced probe configuration options.

For more information on Kubernetes probes, check out these resources: