Chaos engineering experiments on Kubernetes using Litmus

Learn how to use the open-source Litmus tool to conduct chaos experiments on your Kubernetes cluster, find bugs and vulnerabilities before they reach production, and make your cluster more resilient.

Written by

Saiyam Pathak

Head of Developer Relations @ vCluster

Litmus is an open-source tool to do chaos experiments on Kubernetes clusters, introducing unexpected failures to a system to test its resiliency. With Litmus, you can actually create chaos experiments, find bugs fast, and fix them before they ever reach the production phase. It turns out to be a great way to make a Kubernetes cluster more resilient. In this tutorial, I will walk you through the Litmus installation process on a Kubernetes cluster, and create/run the below experiments on it:

  • Pod Deletion
  • Pod Autoscaler

Before we get into setup and what Litmus can do, be sure to check out our video guide, Chaos Engineering for Kubernetes with Litmus.

Prerequisites

To follow along, you will need a running Kubernetes cluster and kubectl configured to talk to it.

Getting up and running with Litmus

Once you have the Kubernetes cluster ready, install the LitmusChaos Operator:

$ kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.9.0.yaml
namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created

This installs all the required Custom Resource Definitions and the Operator. You should be able to see Litmus running in its own namespace:

$ kubectl get pods -n litmus
NAME READY STATUS RESTARTS AGE
chaos-operator-ce-56449c7d75-lt8jc 1/1 Running 0 90s
$ kubectl get crds | grep chaos
chaosengines.litmuschaos.io 2020-11-06T14:23:59Z
chaosexperiments.litmuschaos.io 2020-11-06T14:24:00Z
chaosresults.litmuschaos.io 2020-11-06T14:24:00Z
$ kubectl api-resources | grep chaos
chaosengines litmuschaos.io true ChaosEngine
chaosexperiments litmuschaos.io true ChaosExperiment
chaosresults litmuschaos.io true ChaosResult

Below are the 3 CRDs (Definitions taken from the official repository):

  • ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator, which then invokes Chaos-Experiments.
  • ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
  • ChaosResult: A resource to hold the results of a chaos-experiment. The Chaos-exporter reads the results and exports the metrics into a configured Prometheus server.
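Once the CRDs are installed, you can explore their schemas straight from the cluster with `kubectl explain` (this assumes kubectl is pointed at the cluster where you just installed the operator):

```shell
# Inspect the top-level spec of the ChaosEngine CRD
kubectl explain chaosengine.spec

# Drill into a nested field, e.g. the list of experiments an engine runs
kubectl explain chaosengine.spec.experiments
```

This is a handy way to discover which fields are available before writing the ChaosEngine manifests later in this tutorial.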

Now it's time to create some chaos experiments!

Step 1 - Prepare your cluster

Create a new namespace called demo, plus a Service Account (sa.yaml) that can be used by the chaos engine, with the below contents:

apiVersion: v1
kind: Namespace
metadata:
  name: demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: chaos-sa
rules:
- apiGroups: ["","litmuschaos.io","batch","apps"]
  resources: ["pods","deployments","pods/log","events","jobs","chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: pod-delete-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-sa
subjects:
- kind: ServiceAccount
  name: chaos-sa
  namespace: demo

Apply this to your cluster:

kubectl apply -f sa.yaml
namespace/demo created
serviceaccount/chaos-sa created
role.rbac.authorization.k8s.io/chaos-sa created
rolebinding.rbac.authorization.k8s.io/chaos-sa created

As you can see, this created a Role and a RoleBinding that allow the service account to modify the Litmus CRDs and Kubernetes deployments within the demo namespace.
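Before running any experiments, you can sanity-check the RBAC setup with `kubectl auth can-i`, impersonating the service account created above:

```shell
# Can the chaos service account delete pods in the demo namespace?
kubectl auth can-i delete pods \
  --as=system:serviceaccount:demo:chaos-sa -n demo

# Can it create ChaosEngine resources?
kubectl auth can-i create chaosengines.litmuschaos.io \
  --as=system:serviceaccount:demo:chaos-sa -n demo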

Step 2 - Install experiments

Chaos experiments contain the actual details of the chaos events to be triggered. The Chaos Hub lists ready-made experiments that can be installed directly onto a cluster. For now, we will install the generic experiments bundle, which includes the pod-delete and pod-autoscaler experiments we need:

kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.9.0?file=charts/generic/experiments.yaml -n demo

Both of the experiments mentioned above, along with several others, are created:

kubectl get chaosexperiments -n demo
NAME AGE
pod-delete 10m
pod-network-duplication 19s
node-drain 19s
node-io-stress 18s
disk-fill 17s
node-taint 17s
pod-autoscaler 16s
pod-cpu-hog 16s
pod-memory-hog 15s
pod-network-corruption 14s
pod-network-loss 13s
disk-loss 13s
pod-io-stress 12s
k8-service-kill 11s
pod-network-latency 11s
node-cpu-hog 10s
docker-service-kill 10s
kubelet-service-kill 9s
node-memory-hog 8s
k8-pod-delete 8s
container-kill 7s

Step 3 - Create deployment and Chaos Engine for pod-delete

Let's start a simple 2-replica nginx deployment in our demo namespace that we can run our experiments on.

$ kubectl create deployment nginx --image=nginx --replicas=2 --namespace=demo
deployment.apps/nginx created

Then, let's create a pod_delete.yaml that we can apply as a ChaosEngine to our cluster:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  appinfo:
    appns: 'demo'
    applabel: 'app=nginx'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: chaos-sa
  monitoring: false
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'

Apply this to our cluster:

$ kubectl apply -f pod_delete.yaml
chaosengine.litmuschaos.io/nginx-chaos created

Note that if annotationCheck were set to 'true' above, you would need to annotate your deployment with kubectl annotate deploy/nginx litmuschaos.io/chaos="true" -n demo for the experiment to run.

After the ChaosEngine is created, it spins up two new pods (the chaos runner and the experiment pod), which in turn start terminating pods from our nginx deployment, which is the whole aim of this experiment.

$ kubectl get pods -n demo
NAME READY STATUS RESTARTS AGE
nginx-f89759699-kxrbc 1/1 Running 0 83s
nginx-chaos-runner 1/1 Running 0 25s
pod-delete-up8kop-zmgjx 1/1 Running 0 11s
nginx-f89759699-p7cwq 0/1 Terminating 0 83s
nginx-f89759699-j2swb 1/1 Running 0 5s

To check the status/result of the experiment, describe the chaosresult: kubectl describe chaosresult nginx-chaos-pod-delete -n demo
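If you only want the final verdict rather than the full description, you can pull it out with a jsonpath query. Note the field path below follows the Litmus 1.x ChaosResult status and is an assumption; in other versions the key may be camel-cased as experimentStatus:

```shell
# Print only the experiment verdict (e.g. Pass / Fail / Awaited)
kubectl get chaosresult nginx-chaos-pod-delete -n demo \
  -o jsonpath='{.status.experimentstatus.verdict}'
```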

You should see something like this:

[Screenshot: ChaosResult output for the pod-delete experiment]

Step 4 - Create deployment and Chaos Engine for pod-autoscale

As we already have our nginx deployment created, we just need to create the ChaosEngine:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  # It can be true/false
  annotationCheck: 'false'
  # It can be active/stop
  engineState: 'active'
  # ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'demo'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: chaos-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-autoscaler
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # number of replicas to scale to
            - name: REPLICA_COUNT
              value: '10'

Apply this yaml file to your cluster, and you should see it report back with chaosengine.litmuschaos.io/nginx-chaos configured.

We have set the replica count to 10, so the deployment should automatically scale to 10 replicas. This is a very interesting experiment, as it can be used to check node autoscaling behaviour. It also shows that the people behind Litmus are very responsive to the community, as this experiment came about because of a suggestion of mine!

Once we apply the file, we should see the pod replicas going to 10:

$ kubectl get pods -n demo
NAME READY STATUS RESTARTS AGE
nginx-f89759699-j2swb 1/1 Running 0 16m
nginx-f89759699-klwn6 1/1 Running 0 16m
nginx-autoscale-runner 1/1 Running 0 17s
pod-autoscaler-fa841p-mtqzn 1/1 Running 0 10s
nginx-f89759699-cz9n5 0/1 ContainerCreating 0 4s
nginx-f89759699-lp25g 1/1 Running 0 4s
nginx-f89759699-brtxn 1/1 Running 0 4s
nginx-f89759699-wwzjd 1/1 Running 0 4s
nginx-f89759699-8jqp9 1/1 Running 0 4s
nginx-f89759699-tp7wp 1/1 Running 0 4s
nginx-f89759699-wcqbc 1/1 Running 0 4s
nginx-f89759699-f2pph 1/1 Running 0 4s

As with the pod deletion experiment, we can use kubectl describe to get more detail about the results. The command will be kubectl describe chaosresult nginx-chaos-pod-autoscaler -n demo.
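Once you are done experimenting, you can stop a running engine by flipping its engineState to stop, or tear everything down. Both commands below assume the resource names used in this tutorial:

```shell
# Gracefully stop a running chaos engine
kubectl patch chaosengine nginx-chaos -n demo \
  --type merge -p '{"spec":{"engineState":"stop"}}'

# Or remove the engine and the demo namespace entirely
kubectl delete chaosengine nginx-chaos -n demo
kubectl delete namespace demo
```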

Wrapping up

Litmus is a really good tool with great community backing and a growing number of experiments. In very little time you can deploy it to the cluster and start creating chaos to make your Kubernetes applications ready for any kind of failure.

All the experiments are listed at https://hub.litmuschaos.io/, where you can raise issues for new experiments and contribute them as well.

Let us know on Twitter @Civocloud and @SaiyamPathak if you try Litmus on Civo!

Saiyam Pathak

Head of Developer Relations @ vCluster

Saiyam Pathak is Head of Developer Relations at vCluster and a prominent advocate in the cloud-native and Kubernetes community. He is also the founder of Kubesimplify, a platform dedicated to simplifying Kubernetes and cloud-native technologies through educational content.

Saiyam has previously worked at organizations including Civo, Walmart Labs, Oracle, and HP, gaining experience across machine learning platforms, multi-cloud infrastructure, and managed Kubernetes services. He actively contributes to the community through technical content, meetups, and open-source initiatives.
