The ability to not only monitor but also swiftly respond to potential issues is crucial for maintaining a robust and reliable system. Building on the foundation laid in the first part of this series, "Application Performance Monitoring with Prometheus and Grafana on Kubernetes," part 2 delves deeper into alert management with Prometheus Alertmanager and log analysis with Grafana Loki.

In this tutorial, I will guide you through enhancing your Kubernetes monitoring strategy so that you are not just collecting data, but also leveraging it to maintain the health and efficiency of your environment.

Setting up Alerts with Prometheus Alertmanager

Monitoring your Kubernetes nodes is essential for ensuring the health and performance of your cluster and the applications running on it. Monitoring can help you:

  • Detect and troubleshoot issues before they affect your users or customers.
  • Optimize the resource utilization and scalability of your cluster and applications.
  • Gain insights into the behavior and trends of your cluster and applications over time.

Monitoring your Kubernetes nodes is not enough. You also need to set up alerts for potential issues and incidents that may affect your cluster performance and availability.

Prometheus allows you to create alert rules based on metrics and send them to Alertmanager for processing. Alertmanager is a component that handles alerts from Prometheus and routes them to different notification channels such as email, Slack, and PagerDuty. It also supports features such as grouping, deduplication, silencing, and inhibition of alerts.
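
For example, once Alertmanager is reachable (e.g. through the port-forward shown later in this guide), silencing also works from the command line. This is a small sketch using amtool, the CLI that ships with Alertmanager; the matcher, duration, and comment below are illustrative:

amtool silence add alertname=HighCPUUsage --alertmanager.url=http://localhost:9093 --duration=2h --comment="planned maintenance"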

To set up alerts with Prometheus Alertmanager, we need to do three things:

  • Install Alertmanager on Kubernetes
  • Create alert rules based on metrics
  • Configure notification channels

Installing Alertmanager on Kubernetes

Before configuring Alertmanager, you need to ensure that Prometheus and other monitoring components are installed in your Kubernetes cluster. We’ll use Helm to install the kube-prometheus-stack chart.

Step 1: Add Helm Repository

If you haven’t already, you’ll need to add the Helm repository for the kube-prometheus-stack chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Step 2: Create a Namespace

Create a dedicated namespace for your monitoring components. In this example, we’ll use the monitoring namespace:

kubectl create namespace monitoring 

Step 3: Install kube-prometheus-stack

Install the kube-prometheus-stack chart into the monitoring namespace:

helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring 

This command will deploy Prometheus, Alertmanager, Grafana, and other monitoring components to your cluster.
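
To confirm that the components came up, you can list the pods in the monitoring namespace:

kubectl get pods -n monitoring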

Alertmanager is now installed on our cluster by the kube-prometheus-stack chart; we just need to check its status and configuration.

Checking Alertmanager status

To check the status of Alertmanager, follow these steps:

Step 1: Run the following command:

kubectl get sts -n monitoring -l app.kubernetes.io/name=alertmanager

The output should show that Alertmanager is running as a StatefulSet (sts) with one replica.

Step 2: Check the configuration of Alertmanager by running the following command:

kubectl get secret -n monitoring prometheus-kube-prometheus-alertmanager -o jsonpath="{.data['alertmanager\.yaml']}" | base64 --decode

The output shows Alertmanager's default configuration, which does not send any notifications; it defines a single receiver called null that does nothing.

Step 3: Modify this configuration to add our own receivers and routing rules. To do that, we need to create a ConfigMap with our custom configuration and apply it to our cluster.

For example, create a file called alertmanager-config.yaml with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-kube-prometheus-alertmanager
  namespace: monitoring
data:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'email'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
    - name: 'null'
    - name: 'email'
      email_configs:
      - to: 'your_email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your_username@gmail.com'
        auth_identity: 'your_identity@gmail.com'
        auth_password: 'your_password'

This ConfigMap replaces the default configuration with a custom one that has two receivers:

  • A null receiver that does nothing (used for the Watchdog alert)
  • An email receiver that sends notifications to an email address (used for all other alerts)

Step 4: The email receiver uses Gmail as an example. You need to replace the values with your own credentials and settings. You may also need to create an app password for the account, since Gmail has been phasing out "less secure apps" access.

You can also add other receivers, such as Slack, PagerDuty, and Webhook, by following the Alertmanager documentation.
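
For instance, a Slack receiver is just another entry in the receivers list of alertmanager.yaml. The sketch below assumes you have created an incoming-webhook URL for your workspace; the URL and channel are placeholders:

- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T000/B000/XXXXXXXX'
    channel: '#alerts'
    send_resolved: true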

Step 5: Apply this ConfigMap to your cluster by running the following command:

kubectl apply -f alertmanager-config.yaml

This will update the existing ConfigMap with your custom configuration.
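
If you have the amtool binary available locally (it ships with Alertmanager), you can also lint the configuration by saving the alertmanager.yaml section to a standalone file and running:

amtool check-config alertmanager.yaml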

Step 6: Reload Alertmanager with the new configuration by running the following command:

kubectl delete pod -n monitoring -l app.kubernetes.io/name=alertmanager

This will delete the existing pod and create a new one with the updated configuration.

Step 7: Check the logs of Alertmanager by running:

kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager -f

The logs should show that Alertmanager is running and has loaded your custom configuration.

You can also access Alertmanager’s web UI. By default, Alertmanager is exposed through a Kubernetes Service of type ClusterIP, which makes it accessible only from within the cluster.

To reach it externally, you have a couple of options:

  • Port-forward the service to your local machine:
    kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
    Then open http://localhost:9093 in your browser.
  • Create an Ingress resource to expose it via a load balancer or NodePort. This requires an Ingress controller to be installed in the cluster; a sample manifest is sketched after this list.
  • Alternatively, access it from within the cluster by looking up the ClusterIP of the service:
    kubectl get svc -n monitoring prometheus-kube-prometheus-alertmanager
    Then open http://<CLUSTER-IP>:9093.

In summary, use port-forwarding or an Ingress to reach the Alertmanager UI from outside the cluster, or use the ClusterIP to reach it from within.
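
Here is a minimal sketch of such an Ingress, assuming an NGINX Ingress controller and a hostname you control (alertmanager.example.com is a placeholder):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: alertmanager.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-kube-prometheus-alertmanager
            port:
              number: 9093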

Having followed these steps, you will see there are no active alerts at the moment. You can also check the status of Alertmanager and its configuration by clicking on “Status” and then “Configuration”.

This will show the custom configuration that you applied to Alertmanager.

Creating alert rules based on metrics

Now that we have installed and configured Alertmanager, we need to create some alert rules based on metrics.

Alert rules are defined in Prometheus using a YAML format. They specify the conditions that trigger an alert and the labels and annotations that are attached to it.

For example, here is an alert rule that fires when the CPU usage of any node exceeds 80% for more than 5 minutes:

groups:
- name: node-alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High CPU usage on {{ $labels.instance }}
      description: The CPU usage of {{ $labels.instance }} is above 80% for more than 5 minutes.

This alert rule belongs to a group called node-alerts:

  • The expr field defines the PromQL expression that evaluates the condition.
  • The for field defines the duration that the condition must hold true before firing the alert.
  • The labels field defines the key-value pairs that are attached to the alert.
  • The annotations field defines the key-value pairs that provide additional information about the alert.

You can create more alert rules based on different metrics and conditions. You can also use templates to generate labels and annotations based on the alert context dynamically.
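
For instance, the $value variable holds the result of the alert expression at evaluation time, so an annotation can report the measured value directly. A small variation on the rule above:

annotations:
  summary: High CPU usage on {{ $labels.instance }}
  description: CPU usage on {{ $labels.instance }} is {{ $value | printf "%.1f" }}%, above the 80% threshold.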

For more information on how to write alert rules, you can check out the Prometheus documentation.
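
Before loading a rule file into Prometheus, you can catch syntax mistakes locally with promtool, which ships with Prometheus:

promtool check rules node-alerts.yaml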

To apply alert rules to Prometheus, we need to create a ConfigMap with our alert rules and apply them to our cluster.

For example, create a file called prometheus-rules.yaml with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-kube-prometheus-rulefiles-0
  namespace: monitoring
  labels:
    app.kubernetes.io/name: prometheus-rulefiles
    app.kubernetes.io/instance: prometheus
data:
  node-alerts.yaml: |-
    groups:
    - name: node-alerts
      rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage on {{ $labels.instance }}
          description: The CPU usage of {{ $labels.instance }} is above 80% for more than 5 minutes.

This ConfigMap replaces the default ConfigMap with our custom one that has one alert rule file called node-alerts.yaml. You can add more files with different alert rules as you wish. To apply this ConfigMap to your cluster, run the following command:

kubectl apply -f prometheus-rules.yaml

This will update the existing ConfigMap with your custom alert rules.

To reload Prometheus with the new alert rules, run the following command:

kubectl delete pod -n monitoring -l app.kubernetes.io/name=prometheus

This will delete the existing pod and create a new one with the updated alert rules.

You can check the logs of Prometheus by running:

kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -f

The logs should show that Prometheus is running and has loaded your custom alert rules.

You can also access Prometheus's web UI using the same approach described earlier for Alertmanager (port-forwarding or an Ingress), with Prometheus listening on port 9090.

If you want to test your alert rule, you can artificially increase the CPU usage of one of your nodes by creating a Kubernetes Job that runs the stress command in a Pod.

For example, you can create a YAML file named stress-job.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: stress-job
spec:
  template:
    spec:
      containers:
      - name: stress
        image: progrium/stress
        args:
        - --cpu
        - "4"
        - --timeout
        - "300"
      restartPolicy: Never

This will create a Job that runs a Pod with the progrium/stress image, which is a Docker image that contains the stress command.

The Pod will run the stress command with 4 CPU-intensive processes for 300 seconds, and then terminate.
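
Apply the Job and watch it run (it lands in the default namespace, since the manifest does not set one):

kubectl apply -f stress-job.yaml
kubectl get job stress-job --watch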

You can then go back to Prometheus’s web UI and check the status of the alerts again.
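
In the Prometheus UI, the built-in ALERTS metric is a quick way to inspect alert state; once the CPU condition has held for five minutes, the HighCPUUsage alert should show up as firing:

ALERTS{alertname="HighCPUUsage"}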

Configuring notification channels

Now that we have created and tested our alert rule, we need to configure our notification channels to receive notifications when an alert fires.

As we mentioned before, Alertmanager handles alerts from Prometheus and routes them to different notification channels such as email, Slack, PagerDuty, etc.

We have already configured an email receiver in Alertmanager’s configuration file. However, we need to test it and make sure it works as expected.

To test our email receiver, we can use the Alertmanager’s web UI to send a test notification.

To access Alertmanager’s web UI, we need to use port-forwarding to expose the service to our local machine.

To use port-forwarding, we need to run the following command in a terminal:

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 

This forwards local port 9093 to port 9093 on the Alertmanager service, which is running in the monitoring namespace.

After running the command, we can open a browser and go to http://localhost:9093. This will show us the Alertmanager’s web UI.

To send a test notification, click on ‘Status’ and then ‘Send a test notification’.

You will see that you can select a receiver from the list and enter some labels and annotations for the test notification.

For example, you can select the email receiver and enter some labels and annotations as follows:

labels:
  severity: warning
  instance: node1.civo.com
annotations:
  summary: Test notification
  description: This is a test notification from Alertmanager

Then click on ‘Send notification’.
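
Alternatively, if your Alertmanager build does not offer a test-notification form, you can push a test alert straight to its v2 API while the port-forward is running; the labels and annotations below mirror the example above:

curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "TestAlert", "severity": "warning", "instance": "node1.civo.com"},
        "annotations": {"summary": "Test notification", "description": "This is a test notification from Alertmanager"}
      }]'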

You can check your email inbox and see if you received it.

You can also click on ‘View’ in Alertmanager to go back to Alertmanager’s web UI and see more details about the notification.

If you did not receive the email notification, you may need to check your spam folder or email settings. You may also need to check the logs of Alertmanager for any errors or warnings by running:

kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager -f

The logs should show that Alertmanager sent the test notification to your email address without any errors.

You can also test other notification channels, such as Slack, PagerDuty, Webhook, etc., by following the same steps and adding the corresponding receivers and configurations to Alertmanager's configuration file.

For more information on how to configure notification channels, you can check out the Alertmanager documentation.

Querying Logs with Grafana Loki

Monitoring your Kubernetes nodes with metrics and alerts is useful, but sometimes you need more information to troubleshoot issues and incidents. Logs are a valuable source of information that provides details about the events and activities of your applications and services.

However, logs can be hard to manage and analyze in a distributed and dynamic environment such as Kubernetes. That's where Grafana Loki comes in handy.

Loki is a log aggregation system that integrates with Grafana and allows you to query logs with a Prometheus-like syntax.

Loki is designed to be lightweight and cost-effective, by indexing only metadata and labels of logs and storing the log contents in compressed chunks. It also supports features such as live tailing, log streaming, and alerting on logs.

To query logs with Grafana Loki, we need to do three things:

  • Install Grafana Loki on Kubernetes
  • Create Loki data sources in Grafana
  • Use the Grafana Explore interface to query logs with LogQL

Installing Grafana Loki on Kubernetes

To install Grafana Loki on Civo, we will use the loki-stack Helm chart from the Grafana repository. This chart bundles together Loki, Promtail, Fluent Bit, and Grafana:

  • Loki is the main component that stores and indexes logs.
  • Promtail is an agent that collects logs from files and pods and sends them to Loki.
  • Fluent Bit is an alternative agent that collects logs from various sources and formats and sends them to Loki.
  • Grafana is a visualization tool that allows you to query and display logs from Loki.

To add the Grafana repository to Helm, run the following command:

helm repo add grafana https://grafana.github.io/helm-charts

To install the Loki-stack chart on your cluster, run the following command:

helm install loki grafana/loki-stack --namespace logging --create-namespace --set grafana.enabled=false,prometheus.enabled=false,prometheus.alertmanager.persistentVolume.enabled=false,prometheus.server.persistentVolume.enabled=false

This will create a new namespace called logging and deploy all the resources in it.

We have disabled Grafana and Prometheus in this chart because we already have them installed by the kube-prometheus-stack chart in the monitoring namespace.

We have also disabled persistent volumes for Prometheus Alertmanager and Server because we don’t need them for this guide.
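
If you prefer to keep these overrides in a file rather than on the command line, the same settings can live in a values file (loki-values.yaml is just a suggested name):

grafana:
  enabled: false
prometheus:
  enabled: false
  alertmanager:
    persistentVolume:
      enabled: false
  server:
    persistentVolume:
      enabled: false

Then pass it to Helm with:

helm install loki grafana/loki-stack --namespace logging --create-namespace -f loki-values.yaml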

You can check the status of the installation by running:

helm status loki -n logging

You can also check the pods that are running by running:

kubectl get pods -n logging -l release=loki

The output should show the Loki pod and a Promtail pod on each node of the cluster (Promtail runs as a DaemonSet).

Creating Loki data sources in Grafana

Now that we have installed Grafana Loki on our cluster, we need to create data sources in Grafana to connect to Loki. A data source is a configuration that tells Grafana how to access a specific service or database that provides data for dashboards and charts.

To create Loki data sources in Grafana, follow these steps:

Step 1: Open the data sources page in Grafana (Configuration > Data sources, i.e. <your Grafana address>/datasources) and click on 'Add data source'.

You will see there is already a data source for Prometheus that was created by the kube-prometheus-stack chart.

Step 2: Create a data source for Loki by clicking on Loki from the list or searching for it in the filter box.

Step 3: Enter the URL of Loki’s API endpoint in the HTTP section.

From inside the cluster, the URL of Loki’s API endpoint is http://loki.logging.svc.cluster.local:3100 (or http://localhost:3100 if you port-forward port 3100, as shown below).
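
To sanity-check that Loki is reachable before saving the data source, you can port-forward its Service and hit the readiness endpoint (the Service name loki is what the loki-stack chart creates for a release named loki; adjust if yours differs):

kubectl port-forward -n logging svc/loki 3100:3100
curl http://localhost:3100/ready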

You can also change other settings, such as authentication, labels, derived fields, etc., according to your needs.

For more information on how to configure Loki data sources, you can check out the Grafana documentation.

Step 4: Save your data source by clicking on ‘Save & Test’ at the bottom of the page.

Step 5: Check the logs of Loki by running:

kubectl logs -n logging -l app.kubernetes.io/name=loki -f

The logs should show that Loki is running and ready to receive logs.

Using the Grafana Explore interface to query logs with LogQL

Now that we have created a data source for Loki in Grafana, we can use the Grafana Explore interface to query logs with LogQL.

LogQL is a query language for Loki that allows you to filter and aggregate logs based on labels and values. It is similar to PromQL but has some differences and limitations.

To use the Grafana Explore interface, go to http://91.211.152.130/explore (or your external IP address) and select Loki from the data source drop-down menu.

This will show a query editor where you can enter your LogQL queries and a panel where you can see the results.

You can also switch between Logs and Table views to see the logs in different formats, toggle between metric and log queries to see results as time series or as raw log lines, and adjust the time range and refresh interval of your queries.

By default, the query editor shows all the logs from all the labels in Loki. You can filter the logs by using label selectors and operators.

For example, if you want to see only the logs from Promtail pods, you can enter this query:

{app_kubernetes_io_name="promtail"}

This shows only the logs from Promtail pods with their labels and values.

You can also filter the logs by using regular expressions and keywords.

For example, if you want to see only the logs from Promtail pods that contain the word “error”, you can enter this query:

{app_kubernetes_io_name="promtail"} |= "error"

This shows only the logs from Promtail pods that contain the word “error” with their labels and values.

You can also aggregate the logs by using functions and operators.

For example, if you want to see the count of logs from Promtail pods by level label, you can enter this query:

sum by (level) (count_over_time({app_kubernetes_io_name="promtail"}[5m]))

This will show the count of logs from Promtail pods by level label as a time series chart.

You can also use other functions and operators to perform different aggregations and transformations on your logs.

For example, you can use rate(), avg_over_time(), quantile_over_time(), histogram_quantile(), and similar functions to calculate various statistics on your logs. You can also use the logfmt, json, and unwrap expressions in a query pipeline to parse and extract values from your log lines.
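
For instance, Promtail writes its own logs in logfmt, so you can parse them and filter on the extracted level label, or turn them into a per-level rate (reusing the app_kubernetes_io_name label from the queries above):

{app_kubernetes_io_name="promtail"} | logfmt | level="warn"

sum by (level) (rate({app_kubernetes_io_name="promtail"} | logfmt [5m]))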

For more information on how to use functions and operators in LogQL, you can check out the Loki documentation.

Summary

Throughout this tutorial, we've extended the monitoring stack from the first part of this series with alerting and log analysis. We installed Alertmanager through the kube-prometheus-stack chart, defined Prometheus alert rules based on node metrics, and routed notifications to channels such as email, Slack, and PagerDuty, testing the pipeline end to end.

We then deployed Grafana Loki with Promtail to aggregate logs from across the cluster, connected Loki to Grafana as a data source, and used the Explore interface with LogQL to filter, search, and aggregate those logs.

Together with the metrics, dashboards, and PromQL queries covered in part one, these tools turn raw telemetry into actionable insight, ensuring that our applications not only run but thrive in their respective environments.

Further Resources

If you want to learn more about this topic, take a look at some of these additional resources: