Litmus is an open-source tool for running chaos experiments on Kubernetes clusters: it deliberately introduces failures into a system to test its resiliency. With Litmus you can create chaos experiments, find bugs quickly, and fix them before they reach production, which makes it a practical way to harden a Kubernetes cluster. In this tutorial I will walk you through installing Litmus on a Kubernetes cluster, then creating and running the following experiments on it:

  • Pod Deletion
  • Pod Autoscaler

Before we get into set up and what Litmus can do, be sure to check out our video guide...

Prerequisites

  • A Kubernetes cluster you control. We'll take advantage of Civo's super-fast managed k3s service to experiment with this quickly. If you don't yet have an account, sign up to the beta now to take advantage of quick deploy times and $70 free credit per month! Alternatively, you could also use any other Kubernetes cluster.
  • kubectl installed, and the kubeconfig file for your cluster downloaded.

Civo Kubernetes configuration

Getting up and running with Litmus

Once you have the Kubernetes cluster ready, install the LitmusChaos Operator:

$ kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.9.0.yaml
namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created

This installs all the required Custom Resource Definitions and Operator. You should be able to see Litmus running in its own namespace:

$ kubectl get pods -n litmus
NAME                                 READY   STATUS    RESTARTS   AGE
chaos-operator-ce-56449c7d75-lt8jc   1/1     Running   0          90s

$ kubectl get crds | grep chaos
chaosengines.litmuschaos.io       2020-11-06T14:23:59Z
chaosexperiments.litmuschaos.io   2020-11-06T14:24:00Z
chaosresults.litmuschaos.io       2020-11-06T14:24:00Z

$ kubectl api-resources | grep chaos
chaosengines                                   litmuschaos.io                 true         ChaosEngine
chaosexperiments                               litmuschaos.io                 true         ChaosExperiment
chaosresults                                   litmuschaos.io                 true         ChaosResult
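
If you are scripting the installation, it can help to block until the operator is actually available rather than checking the pod list by eye. The following is a minimal sketch, assuming the Deployment name (chaos-operator-ce) and namespace (litmus) shown in the install output above; the function name wait_for_operator and the 120s default timeout are illustrative choices, not part of Litmus.

```shell
# Block until the chaos operator Deployment has finished rolling out.
# Deployment name and namespace come from the install output above.
# $1 (optional): timeout passed to kubectl, defaults to 120s.
wait_for_operator() {
  kubectl rollout status deployment/chaos-operator-ce -n litmus \
    --timeout="${1:-120s}"
}

# Usage:
#   wait_for_operator          # wait up to 120s
#   wait_for_operator 300s     # wait up to 5 minutes
```

kubectl rollout status exits with a non-zero status if the timeout is reached, so the function can be chained with && in a larger script.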

Here are the three CRDs (definitions taken from the official repository):

ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by Litmus' Chaos-Operator, which then invokes Chaos-Experiments.

ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.

ChaosResult: A resource to hold the results of a chaos-experiment. The Chaos-exporter reads the results and exports the metrics into a configured Prometheus server.

Now it's time to create some chaos experiments!

Step 1 - Prepare your cluster

Create a file named sa.yaml with the contents below. It defines a new namespace, demo, along with a Service Account, Role and RoleBinding that the chaos engine can use:

---
apiVersion: v1
kind: Namespace
metadata:
  name: demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: chaos-sa
rules:
- apiGroups: ["","litmuschaos.io","batch","apps"]
  resources: ["pods","deployments","pods/log","events","jobs","chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-sa
  namespace: demo
  labels:
    name: pod-delete-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-sa
subjects:
- kind: ServiceAccount
  name: chaos-sa
  namespace: demo

Apply this to your cluster:

$ kubectl apply -f sa.yaml
namespace/demo created
serviceaccount/chaos-sa created
role.rbac.authorization.k8s.io/chaos-sa created
rolebinding.rbac.authorization.k8s.io/chaos-sa created

As the output shows, this creates a Role and a RoleBinding that allow the chaos-sa service account to manage pods, deployments, jobs and the Litmus custom resources, all scoped to the demo namespace.
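
To confirm that the Role is bound as intended, you can ask the API server what the service account is allowed to do. This is a minimal sketch using kubectl auth can-i with impersonation; it assumes your own user is permitted to impersonate service accounts (a cluster admin normally is). The function name can_chaos_sa is hypothetical.

```shell
# Ask the API server whether the chaos-sa service account may perform
# a given verb on a given resource in the demo namespace.
# $1: verb (for example delete), $2: resource (for example pods).
# kubectl prints "yes" or "no" and exits non-zero when the answer is "no".
can_chaos_sa() {
  kubectl auth can-i "$1" "$2" \
    --as=system:serviceaccount:demo:chaos-sa -n demo
}

# Usage:
#   can_chaos_sa delete pods
#   can_chaos_sa create chaosengines.litmuschaos.io
#   can_chaos_sa get secrets     # secrets are not listed in the Role
```

Checking a resource that the Role does not list, such as secrets, is a quick way to see that the permissions are no broader than expected.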

Step 2 - Install experiments

Chaos experiments contain the actual details of the chaos events to be triggered. ChaosHub lists ready-made experiments that can be installed onto a cluster directly. For now we will install the generic experiment set, which includes the pod-delete and pod-autoscaler experiments we need. Note that the URL is quoted, because some shells (zsh, for example) would otherwise try to interpret the ? as a wildcard:

$ kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.9.0?file=charts/generic/experiments.yaml" -n demo

Both the experiments mentioned above, and some others too, get created:

$ kubectl get chaosexperiments -n demo
NAME                      AGE
pod-delete                10m
pod-network-duplication   19s
node-drain                19s
node-io-stress            18s
disk-fill                 17s
node-taint                17s
pod-autoscaler            16s
pod-cpu-hog               16s
pod-memory-hog            15s
pod-network-corruption    14s
pod-network-loss          13s
disk-loss                 13s
pod-io-stress             12s
k8-service-kill           11s
pod-network-latency       11s
node-cpu-hog              10s
docker-service-kill       10s
kubelet-service-kill      9s
node-memory-hog           8s
k8-pod-delete             8s
container-kill            7s
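
Rather than scanning the list by eye, a script can check for just the experiments it needs. The sketch below is one way to do that; the function name have_experiments is hypothetical, and it relies only on kubectl get returning a non-zero exit status when a named resource does not exist.

```shell
# Succeed only if every named ChaosExperiment exists in the namespace.
# $1: namespace; remaining arguments: experiment names.
have_experiments() {
  ns="$1"
  shift
  for exp in "$@"; do
    # kubectl get exits non-zero when the named resource is not found.
    if ! kubectl get chaosexperiment "$exp" -n "$ns" >/dev/null 2>&1; then
      echo "missing: $exp" >&2
      return 1
    fi
  done
  echo "all present"
}

# Usage:
#   have_experiments demo pod-delete pod-autoscaler
```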

Step 3 - Create Deployment and Chaos Engine for pod-delete

Let's start a simple 2-replica nginx deployment in our demo namespace that we can run our experiments on.

$ kubectl create deployment nginx --image=nginx --replicas=2 --namespace=demo
deployment.apps/nginx created
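
Before pointing a ChaosEngine at the deployment, it is worth checking that both replicas are up and that the pods carry the label the engine will select on. kubectl create deployment labels the Deployment and its pods with app=<name>, so here the label is app=nginx. A minimal sketch (the function name nginx_targets and the 120s timeout are illustrative):

```shell
# Wait for the nginx Deployment to finish rolling out, then print the
# names of the pods that carry the app=nginx label.
nginx_targets() {
  kubectl rollout status deployment/nginx -n demo --timeout=120s >/dev/null &&
    kubectl get pods -n demo -l app=nginx -o name
}

# Usage:
#   nginx_targets
```

With two replicas running, this should print two pod names; an empty result would mean a ChaosEngine selecting on app=nginx has nothing to target.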

Then, let's create a pod_delete.yaml that we can apply as a ChaosEngine to our cluster:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  appinfo:
    appns: 'demo'
    applabel: 'app=nginx'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: chaos-sa
  monitoring: false
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'

            - name: CHAOS_INTERVAL
              value: '10'

            - name: FORCE
              value: 'false'

Apply this to our cluster:

$ kubectl apply -f pod_delete.yaml
chaosengine.litmuschaos.io/nginx-chaos created

If annotationCheck is set to 'true' in the manifest above, you also need to annotate your deployment for the experiment to run: kubectl annotate deploy/nginx litmuschaos.io/chaos="true" -n demo

After the ChaosEngine is created, Litmus starts two new pods: the nginx-chaos-runner pod and a pod-delete experiment pod. These in turn begin terminating pods from our nginx deployment, which is the purpose of this experiment.

$ kubectl get pods -n demo
NAME                      READY   STATUS        RESTARTS   AGE
nginx-f89759699-kxrbc     1/1     Running       0          83s
nginx-chaos-runner        1/1     Running       0          25s
pod-delete-up8kop-zmgjx   1/1     Running       0          11s
nginx-f89759699-p7cwq     0/1     Terminating   0          83s
nginx-f89759699-j2swb     1/1     Running       0          5s

To check the status and result of the experiment, describe the ChaosResult: kubectl describe chaosresult nginx-chaos-pod-delete -n demo

You should see something like this:

Pod-delete demo report
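
For scripting, it may be easier to pull out just the verdict than to read the full describe output. The sketch below makes one assumption that is not confirmed by this article: that in Litmus 1.x the verdict is stored at status.experimentstatus.verdict in the ChaosResult. Other releases may spell the key differently (for example experimentStatus), so check the real layout first with kubectl get chaosresult nginx-chaos-pod-delete -n demo -o yaml. The function name chaos_verdict is hypothetical.

```shell
# Print only the verdict field of a ChaosResult.
# ASSUMPTION: the field path below matches the Litmus 1.x ChaosResult
# layout; adjust it if your release uses a different key.
# $1: ChaosResult name, $2 (optional): namespace, defaults to demo.
chaos_verdict() {
  kubectl get chaosresult "$1" -n "${2:-demo}" \
    -o jsonpath='{.status.experimentstatus.verdict}'
}

# Usage:
#   chaos_verdict nginx-chaos-pod-delete
```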

Step 4 - Create Chaos Engine for pod-autoscaler

As we already have our nginx deployment created, we just need to create the ChaosEngine:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: demo
spec:
  # It can be true/false
  annotationCheck: 'false'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'demo'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: chaos-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-autoscaler
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # number of replicas to scale
            - name: REPLICA_COUNT
              value: '10'

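Apply this YAML file to your cluster with kubectl apply -f, and you should see it report back with chaosengine.litmuschaos.io/nginx-chaos configured (configured rather than created, because the engine reuses the nginx-chaos name from Step 3).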

We have set the replica count to 10, so the nginx deployment should automatically scale to 10 replicas. This is a very interesting experiment, as it can be used to check node autoscaling behaviour. It also shows that the people behind Litmus are very responsive to the community: this experiment came about because of a suggestion from me!

Once we apply the file, we should see the pod replicas going to 10:

$ kubectl get pods -n demo
NAME                          READY   STATUS              RESTARTS   AGE
nginx-f89759699-j2swb         1/1     Running             0          16m
nginx-f89759699-klwn6         1/1     Running             0          16m
nginx-autoscale-runner        1/1     Running             0          17s
pod-autoscaler-fa841p-mtqzn   1/1     Running             0          10s
nginx-f89759699-cz9n5         0/1     ContainerCreating   0          4s
nginx-f89759699-lp25g         1/1     Running             0          4s
nginx-f89759699-brtxn         1/1     Running             0          4s
nginx-f89759699-wwzjd         1/1     Running             0          4s
nginx-f89759699-8jqp9         1/1     Running             0          4s
nginx-f89759699-tp7wp         1/1     Running             0          4s
nginx-f89759699-wcqbc         1/1     Running             0          4s
nginx-f89759699-f2pph         1/1     Running             0          4s
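
To follow the scale-up from a script instead of re-running kubectl get pods, you can poll the Deployment's ready replica count. This is a minimal sketch, not part of Litmus: the function names, the 2-second polling interval and the default of 30 attempts are all illustrative choices.

```shell
# Print the number of ready replicas of the nginx Deployment.
# The output is empty while no replica is ready.
nginx_ready_replicas() {
  kubectl get deployment nginx -n demo -o jsonpath='{.status.readyReplicas}'
}

# Poll until the ready replica count equals the target, or give up.
# $1: target replica count, $2 (optional): number of attempts (default 30).
wait_for_replicas() {
  want="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    if [ "$(nginx_ready_replicas)" = "$want" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}

# Usage:
#   wait_for_replicas 10 && echo "scaled to 10"
```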

As with the pod deletion experiment, we can use kubectl describe to get more detail about the results. The command will be kubectl describe chaosresult nginx-chaos-pod-autoscaler -n demo.
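
The comments in the manifest note that engineState can be active or stop. If you want to halt an experiment that is still running, one option is to patch that field on the live ChaosEngine rather than editing and re-applying the file. A minimal sketch, assuming the engine name and namespace used in this tutorial; the function name stop_chaos is hypothetical.

```shell
# Set engineState to "stop" on a ChaosEngine using a JSON merge patch.
# $1 (optional): engine name, defaults to nginx-chaos.
# $2 (optional): namespace, defaults to demo.
stop_chaos() {
  kubectl patch chaosengine "${1:-nginx-chaos}" -n "${2:-demo}" \
    --type merge -p '{"spec":{"engineState":"stop"}}'
}

# Usage:
#   stop_chaos
#   stop_chaos my-engine my-namespace
```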

Wrapping up

Litmus is a really good tool with great community backing and a growing number of experiments. In very little time you can deploy it to the cluster and start creating chaos to make your Kubernetes applications ready for any kind of failure.

All experiments are listed at https://hub.litmuschaos.io/, where you can also raise issues for new experiments and contribute your own.

Let us know on Twitter @Civocloud and @SaiyamPathak if you try Litmus on Civo!