Advanced analysis of Kubernetes distributed tracing
A guide on integrating traces, logs, and metrics for Django apps using Grafana Tempo, OpenTelemetry, Loki, and Prometheus in Kubernetes for full observability.
Written by
Technical writer
⚠️ To follow along with this tutorial, you must have read part 1, which sets up end-to-end distributed tracing using Grafana Tempo and OpenTelemetry in a Kubernetes environment.
End-to-end distributed tracing involves more than traces alone; it also encompasses other important signals, namely metrics and logs. A conversation about tracing is therefore incomplete without addressing both.
While traces provide information about the request flow and performance of individual services in your application, logs and metrics offer additional layers of observability. Logs give detailed, text-based records of events within your application, and metrics provide quantitative data on the performance and health of your system. Together, they offer a more complete picture of your application’s state.
In the first part of this series, we successfully set up end-to-end distributed tracing using Grafana Tempo and OpenTelemetry in a Kubernetes environment. We used a pre-instrumented Django application to send traces to Grafana Tempo through an OpenTelemetry collector. This setup used a Civo Object Store, and the trace data was visualized in Grafana.
Now, in the second part of this series, we will learn how to analyze these traces. Through this tutorial, we will examine spans to understand request flows and latency, how to identify issues or bottlenecks using metadata, and how to integrate Grafana Loki and Prometheus as additional data sources in Grafana for a complete analysis of logs related to the traces and metrics for performance.
Analyzing trace data
At the end of the previous tutorial, we could view traces in Grafana, as shown in the image below. The image shows that the trace GET /create took 5.98 milliseconds. Additionally, the details indicate that the request completed successfully with a 200 HTTP status code, signaling that the operation executed without errors.

In distributed tracing, a trace is a collection of spans, where each span represents a specific operation or segment of work done in the service. Spans within a trace can have parent-child relationships that show the flow and hierarchy of operations.
In this particular trace, we observe a breakdown of individual spans, including the note_create span within the django-notes-app service. This note_create span, which took 2.57 milliseconds, is a child span of the GET /create span.
As a child span, it represents a discrete operation, or a part of the processing that contributes to the overall response of the GET /create request. This hierarchical relationship between spans is crucial for understanding the flow of requests and identifying areas within a service that contribute to the total execution time.

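As a quick sanity check, the durations above let you compute how much of the overall request the child span accounts for:

```python
# Span durations from the trace discussed above (milliseconds).
total_ms = 5.98   # GET /create (parent span)
child_ms = 2.57   # note_create (child span)

share = child_ms / total_ms
print(f"note_create accounts for {share:.0%} of the total request time")
```

Any remaining time belongs to the parent span's own work (middleware, response rendering, and so on), which is exactly the kind of breakdown that helps pinpoint where latency comes from.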
For a more comprehensive analysis of the trace, you have the option to export the trace data. This can be done by clicking on the export icon highlighted in the image below 👇

The exported data provides detailed information about the trace, such as:
- Services involved, such as `django-notes-app`
- Span details, including trace ID, span ID, parent span ID, timestamps, and more
- Specific attributes of each span, like HTTP methods, URLs, status codes, and server names
This level of detail is beneficial for in-depth analysis, allowing you to thoroughly examine each aspect of the trace, from the high-level view of the request to the granular details of individual operations.
The exported trace is downloaded in JSON format; once opened, it looks something like this:
```json
{
  "batches": [
    // Batch for the trace 'GET /create'
    {
      "resource": {
        "attributes": [
          { "key": "service.name", "value": { "stringValue": "django-notes-app" } }
        ],
        "droppedAttributesCount": 0
      },
      "instrumentationLibrarySpans": [
        {
          "spans": [
            {
              "traceId": "a4fcabb761c0bcb79f49462d317cb769",
              "spanId": "d28cb2de926c9ee4",
              "parentSpanId": "0000000000000000" // Root span with no parent
              // ... additional span details ...
            }
          ],
          "instrumentationLibrary": {
            "name": "opentelemetry.instrumentation.wsgi", // Instrumentation library
            "version": "0.41b0"
          }
        }
      ]
    },
    // Batch for the span 'note_create'
    {
      "resource": {
        "attributes": [
          { "key": "service.name", "value": { "stringValue": "django-notes-app" } }
        ],
        "droppedAttributesCount": 0
      },
      "instrumentationLibrarySpans": [
        {
          "spans": [
            {
              "traceId": "a4fcabb761c0bcb79f49462d317cb769",
              "spanId": "29a715d4dba3c442",
              "parentSpanId": "d28cb2de926c9ee4" // Child of the 'GET /create' span
              // ... additional span details ...
            }
          ],
          "instrumentationLibrary": {
            "name": "notes_app.views", // Instrumentation library for the view
            "version": ""
          }
        }
      ]
    }
  ]
}
```
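Because the export is plain JSON, you can also analyze it programmatically. The sketch below rebuilds the parent-child hierarchy from a trimmed version of the export using only the standard library (the `name` values here are illustrative; the real export contains many more fields per span):

```python
import json

# A trimmed, illustrative version of the exported trace JSON.
exported = json.loads("""
{"batches": [
  {"instrumentationLibrarySpans": [{"spans": [
    {"traceId": "a4fcabb761c0bcb79f49462d317cb769",
     "spanId": "d28cb2de926c9ee4",
     "parentSpanId": "0000000000000000",
     "name": "GET /create"}]}]},
  {"instrumentationLibrarySpans": [{"spans": [
    {"traceId": "a4fcabb761c0bcb79f49462d317cb769",
     "spanId": "29a715d4dba3c442",
     "parentSpanId": "d28cb2de926c9ee4",
     "name": "note_create"}]}]}
]}
""")

# Index every span by its ID, then resolve parent IDs to rebuild the hierarchy.
spans = {}
for batch in exported["batches"]:
    for scope in batch["instrumentationLibrarySpans"]:
        for span in scope["spans"]:
            spans[span["spanId"]] = span

for span in spans.values():
    parent = spans.get(span["parentSpanId"])
    relation = f"child of {parent['name']}" if parent else "root span"
    print(f"{span['name']}: {relation}")
```

This kind of script scales to much larger traces, where clicking through spans one by one in the UI becomes impractical.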
Reconfiguring the Django application
So far, we can view requests as they flow through our application (traces), including timing data and interactions between the different components and services in our Django application. We can now integrate logs and metrics into this setup to enhance our observability capabilities. This addition will enable us to:
- Send logs to our OpenTelemetry collector so we can analyze log data alongside trace data.
- Send metrics to our OpenTelemetry collector so we can monitor key performance indicators for a more comprehensive understanding of our application’s behavior.
Step 1: Cloning the Django application
First, we need to configure our Django project to send logs and metrics to our OpenTelemetry collector in our Civo Kubernetes cluster.
Clone the following GitHub repository; the Django project has been configured to generate detailed logs using the OpenTelemetry Logging Instrumentation and a custom format that integrates trace and span IDs.
For metrics, it employs the OpenTelemetry Metrics API to track the number of requests it receives using a counter metric. This counter, named request_count, increments with each incoming request to the Django notes-app application, providing a straightforward yet effective way to monitor traffic load. The count data is then exported through an OpenTelemetry exporter to establish a robust framework for logging and performance monitoring of the Django application.
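Conceptually, the request counter works like the sketch below. This is a stdlib-only stand-in: the real project uses the OpenTelemetry Metrics API (where the increment would be a call like `counter.add(1)` on a `request_count` instrument), and the middleware class name here is hypothetical:

```python
class RequestCounterMiddleware:
    """Counts every incoming request, mirroring the Django middleware shape."""

    def __init__(self, get_response):
        self.get_response = get_response
        self.request_count = 0  # stands in for an OpenTelemetry Counter instrument

    def __call__(self, request):
        self.request_count += 1  # the real instrumentation increments the OTel counter here
        return self.get_response(request)

# Simulate three requests passing through the middleware.
middleware = RequestCounterMiddleware(lambda request: "200 OK")
for _ in range(3):
    middleware("/create")
print(middleware.request_count)  # → 3
```

The key property is that the counter only ever goes up while the process is alive, which is what makes it cheap to record and easy to aggregate later.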
Step 2: Dockerizing and deploying the Django application to DockerHub
Once cloned, create a DockerHub repository, then build the Docker image and push it to the new repository using the following commands:
```shell
docker build -t <your-dockerhub-username>/<repository-name>:latest .
docker push <your-dockerhub-username>/<repository-name>:latest
```
Step 3: Updating the Django application deployment
Now that we have dockerized the Django project and pushed it to DockerHub, let's update our deployment.
To begin, update the previous deployment’s image to point to the new Docker image using the following command:

```shell
kubectl set image deployment/django-deployment django-app=<your-dockerhub-username>/<name-of-your-image>:latest
```

This will update the existing Kubernetes deployment with our new image. You should have the following output once the deployment has been updated:
deployment.apps/django-deployment image updated
Confirm that the Django application is running using the following command:
kubectl get pods
Once it is running, you should have the following output:
```
NAME                                 READY   STATUS    RESTARTS   AGE
django-deployment-6c4c7d4bcf-lwx8v   1/1     Running   0          65s
```
Installing Grafana Loki
Having successfully configured our Django project to generate logs and metrics in addition to traces, our next step is to set up the infrastructure required for visualizing and analyzing this data.
We've already established a pipeline for forwarding traces from our OpenTelemetry collector to Grafana Tempo, which are then visualized in Grafana. Now, we'll extend this capability to include logs and metrics.
To achieve this, we'll first install Loki for log aggregation and Prometheus for metrics collection. These tools will serve as the foundational elements for our observability stack, allowing us to gain deeper insights into our application's performance and behavior.
Step 1: Configuring Loki Stack
When installed via Helm, the Loki Stack chart ships a comprehensive stack that includes not only Loki but also Prometheus and Grafana, providing an integrated solution for log aggregation, metrics collection, and data visualization.
However, for more granular control over these components, we will install them separately. Since we already have Grafana installed, we won't need to install it again.
Begin by creating a file named loki-values.yaml. This file will host our custom configurations for the Loki stack installation.
Use a text editor to create this file and insert the following settings:
```yaml
loki:
  enabled: true
prometheus:
  enabled: false
grafana:
  enabled: false
```
These settings ensure that only Loki is enabled during the installation, while Prometheus and Grafana are not installed as part of this stack. This approach lets us maintain the existing Grafana setup and manage Prometheus separately.
Step 2: Installing Loki Stack
With the custom settings created in the previous step, we can now install the Loki Stack using Helm.
Execute the following command to install the Loki Stack with the custom values file:

```shell
helm install loki grafana/loki-stack -f loki-values.yaml
```
You should have the following output:
```
NAME: loki
LAST DEPLOYED: Thu Nov 30 05:49:24 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
The Loki stack has been deployed to your cluster. Loki can now be added as a data source in Grafana.
See http://docs.grafana.org/features/datasources/loki/ for more detail.
```
After running the Helm command, check your Kubernetes cluster to confirm that Loki is up and running:
```shell
kubectl get pods
kubectl get svc
```
You should have the following output:
```
# kubectl get pods
NAME                  READY   STATUS    RESTARTS   AGE
...
loki-0                0/1     Running   0          20s
loki-promtail-tqghk   1/1     Running   0          20s
loki-promtail-5nsfv   1/1     Running   0          20s

# kubectl get svc
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
loki-headless     ClusterIP   None           <none>        3100/TCP   25s
loki-memberlist   ClusterIP   None           <none>        7946/TCP   25s
loki              ClusterIP   10.43.241.73   <none>        3100/TCP   25s
```
Installing Prometheus
With Loki configured and installed in our cluster, next we’ll configure and install Prometheus. To achieve this, we will use the kube-prometheus-stack Helm chart.
Step 1: Configuring Prometheus
Before installing Prometheus, we need to create a job configuration that will allow Prometheus to scrape metrics from specific targets.
Create a file named prometheus-values.yaml and paste in the following configuration:
```yaml
global:
  scrape_interval: '5s'
  scrape_timeout: '10s'
prometheus:
  prometheusSpec:
    additionalScrapeConfigs: |
      - job_name: otel-collector
        static_configs:
          - targets:
              - opentelemetry-collector:8889
grafana:
  enabled: false
```
This configuration does the following:
- Sets the global scrape interval to every `5` seconds and the scrape timeout to `10` seconds. This defines how frequently Prometheus will collect metrics and the maximum time allowed for a scrape request.
- Adds a new scrape job named `otel-collector`. This job is configured to scrape metrics from the `opentelemetry-collector` service at port `8889`. We will configure our OpenTelemetry Collector to expose this port later.
- Sets `grafana.enabled` to `false`, indicating that we are not installing Grafana as part of this Prometheus setup, as it ships with the kube-prometheus-stack chart.
Step 2: Installing Prometheus
After configuring the scrape settings in prometheus-values.yaml, the next step is to install Prometheus in our Kubernetes cluster.
Begin by adding the Prometheus chart repository to your Helm setup. This ensures you have access to the latest Prometheus charts:
helm repo add prometheus-community https://prometheus-community.github.io/helm-chartshelm repo update
Now, install Prometheus with Helm using the custom configurations you've defined above:
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
You should have the following output:

```
NAME: prometheus
LAST DEPLOYED: Thu Nov 30 06:42:52 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
...
```
After the installation process completes, you can verify if Prometheus is running correctly using the following commands:
```shell
kubectl get pods
kubectl get svc
```
You should see something similar to this:
```
# kubectl get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
...
prometheus-prometheus-node-exporter-rblhc                0/1     Pending   0          2m10s
prometheus-prometheus-node-exporter-n7z8n                0/1     Pending   0          2m10s
prometheus-kube-prometheus-operator-7d89b9dd4d-h24fx     1/1     Running   0          2m10s
prometheus-kube-state-metrics-69bbfd8c89-xlnlk           1/1     Running   0          2m10s
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          2m7s
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          2m6s

# kubectl get svc
NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
...
prometheus-prometheus-node-exporter       ClusterIP   10.43.81.61     <none>        9100/TCP                     6m43s
prometheus-kube-prometheus-operator       ClusterIP   10.43.252.136   <none>        443/TCP                      6m43s
prometheus-kube-prometheus-prometheus     ClusterIP   10.43.64.194    <none>        9090/TCP,8080/TCP            6m43s
prometheus-kube-state-metrics             ClusterIP   10.43.60.6      <none>        8080/TCP                     6m43s
prometheus-kube-prometheus-alertmanager   ClusterIP   10.43.144.21    <none>        9093/TCP,8080/TCP            6m43s
alertmanager-operated                     ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   6m39s
prometheus-operated                       ClusterIP   None            <none>        9090/TCP                     6m38s
```
Step 3: Creating a Service Monitor
The objective is to enable Prometheus to scrape metrics from our OpenTelemetry collector instance, allowing us to view these metrics in Grafana. To achieve this, we need to create a Service Monitor, a Kubernetes resource used by Prometheus to specify how to discover and scrape metrics from a set of services.
Create a file called service-monitor.yaml and paste in the following configuration settings:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: opentelemetry-collector # Ensure this matches the labels of your OpenTelemetry Collector service
  endpoints:
    - port: metrics # The name of the port exposed by your OpenTelemetry Collector service
      interval: 5s
```
This configuration sets up a service monitor called otel-collector. Its release: prometheus label matches the name of our Prometheus Helm release, which is how the Prometheus Operator discovers it.
The service monitor selects the OpenTelemetry Collector service, which we have named opentelemetry-collector, and scrapes its metrics port every 5 seconds. This port is where our application's metrics will be available; we will set it up later.
Now run the following command to create the service monitor:
```shell
kubectl apply -f service-monitor.yaml
kubectl get servicemonitor
```
You should see the following outputs:
```
# kubectl apply -f service-monitor.yaml
servicemonitor.monitoring.coreos.com/otel-collector created

# kubectl get servicemonitor
NAME                                  AGE
prometheus-prometheus-node-exporter   12m
prometheus-kube-prometheus-operator   12m
...
otel-collector                        61s
```
Next, access the Prometheus UI on your local machine. This will allow us to confirm that it has picked up the otel-collector service monitor we just created. On your machine, run:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090
Head over to your browser and visit localhost:9090:

Click on the Status dropdown, and select Service discovery.
You should see the otel-collector listed as shown below:

Updating the OpenTelemetry Collector
Now that Loki and Prometheus are configured and installed, we need to update our OpenTelemetry Collector configuration to forward logs to Loki and metrics to Prometheus.
Navigate to your OpenTelemetry Collector configuration file and add the necessary exporters for Loki and Prometheus:
```yaml
# collector.yaml
...
exporters:
  debug: {}
  otlp:
    endpoint: grafana-tempo:4317
    tls:
      insecure: true
  loki:
    # Loki exporter configuration
    endpoint: http://loki:3100/loki/api/v1/push
  prometheus:
    # Prometheus exporter configuration
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    ...
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, prometheus]
    ...
```
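Note that the collector configuration above elides the logs pipeline behind `...`. If your collector.yaml from part 1 does not already define one, the logs pipeline that wires up the `loki` exporter would look roughly like this (a sketch; the receiver and processor names assume the `otlp` and `batch` defaults used throughout this series):

```yaml
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, loki]
```

Without a logs pipeline referencing the `loki` exporter, the collector will receive logs but never forward them to Loki.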
Now upgrade the OpenTelemetry collector chart using the following command:
helm upgrade opentelemetry-collector open-telemetry/opentelemetry-collector -f collector.yaml
Next, edit the OpenTelemetry Collector service. This step is necessary to add port 8889 to the list of ports exposed by the service so that Prometheus can access and scrape metrics from it.
kubectl edit service opentelemetry-collector
This will open up the service manifest in a Vim editor. Scroll down to the last option in the ports section of the service specification. Press i to enter insert mode and type in the following:
```yaml
- name: metrics
  port: 8889
  protocol: TCP
  targetPort: 8889
```

Once added, exit the insert mode by pressing the Esc key. Then, type :wq and press Enter to save the changes and exit the OpenTelemetry collector service manifest file.
You should have the following output:
service/opentelemetry-collector edited
Confirm that port 8889 is actually exposed by running kubectl get service. You should see 8889 listed among the exposed ports, like so:
```
opentelemetry-collector   ClusterIP   10.43.73.178   <none>   6831/UDP,14250/TCP,14268/TCP,4317/TCP,4318/TCP,9411/TCP,8889/TCP   108m
```
Head back to your Prometheus server UI and navigate to Status → Targets; you should see that the otel-collector service monitor is active and up as a target:

This confirms that Prometheus has been configured correctly to scrape metrics from our OpenTelemetry collector.
Viewing logs and metrics with Grafana
Up until now, we have successfully set up an infrastructure that sends logs and metrics to Loki and Prometheus. At this point, we are ready to view these components through Grafana.
Step 1: Adding Loki as a data source
To begin viewing logs in Grafana, you first need to add Loki as a data source.
Navigate to the settings icon on the left panel and select Home.

Click on Add your first data source, then search for and choose Loki from the list of available data sources.
In the Loki data source settings, enter the URL of your Loki service: http://loki:3100. This is usually of the form http://<loki-service-name>:3100.
Save and test the data source to ensure Grafana can connect to Loki.
Be sure to interact with your application so logs can be generated. If there are no logs available for Loki to pick up, the connection will not be successful.
Once connected, head over to Explore and select Loki as shown below 👇

Add a label filter with container set to django-app, and click on the Run query button:

You should see the following output:

This confirms that Loki is receiving logs. Based on how the Django application's logging instrumentation is configured, each log line related to the views shows the date and time it was generated, along with its trace ID and span ID.
By clicking on a log line, you can see its labels: the app label of the Django application (django-app), the container (django-app, just as specified in the deployment manifest for the Django application), the job, and the namespace in which the application is running.
Additionally, you will see the name of the node, indicating the specific server in the Kubernetes cluster where the pod is hosted, and the name of the pod, the smallest deployable unit in Kubernetes, which contains the Django application.

From here, you can download the logs in either .txt or .json format to get a complete view of what they contain:

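The trace-aware log format seen above can be reproduced with a small stdlib sketch. The filter below injects hard-coded placeholder IDs; in the real instrumentation these come from the active OpenTelemetry span context, and the exact format string in the repo may differ:

```python
import io
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace context to every log record (placeholder IDs for illustration)."""
    def filter(self, record):
        record.trace_id = "a4fcabb761c0bcb79f49462d317cb769"
        record.span_id = "29a715d4dba3c442"
        return True

# Log to an in-memory stream so the formatted output is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s"))

logger = logging.getLogger("notes_app.views")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("note created")
print(stream.getvalue())
```

Embedding the trace ID in every log line is what lets you jump from a log entry in Loki straight to the matching trace in Tempo.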
Step 2: Adding Prometheus as a data source
Just as we did for Loki, we need to add Prometheus as a data source so we can view metrics generated by the Django application.
Follow the same steps used in the previous step, entering the endpoint http://prometheus-kube-prometheus-prometheus:9090 in the Prometheus data source settings.
Once you have successfully added Prometheus (the Prometheus server) as a data source, head over to Explore and select Prometheus.
Before we begin to view metrics, there are some things you should take note of:
- The Django application was instrumented using a counter metric. A counter is a simple metric type in Prometheus that only increases and resets to zero on restart. In our case, we've used it to count the number of requests the Django application receives. This gives us a straightforward yet powerful insight into the application's traffic.
- Each request to the application increments the counter by one, regardless of the request type (GET, POST, etc.) or the endpoint accessed. This approach provides a high-level overview of the application's usage and can help identify trends in traffic, peak usage times, and potential bottlenecks.
- When viewing this metric in Prometheus or Grafana, you'll see a continuously increasing graph over time, representing the cumulative count of requests.
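The "only increases, resets on restart" behavior is what PromQL functions such as increase() and rate() are built around. The sketch below is a rough, simplified illustration of how per-interval increases are recovered from cumulative samples (real Prometheus also extrapolates over the query window):

```python
def per_interval_increases(samples):
    """Recover per-interval increases from cumulative counter samples.

    samples: list of (timestamp_seconds, cumulative_count) pairs.
    A drop in the cumulative value is treated as a counter reset (restart),
    so the new value itself is the increase since the reset.
    """
    increases = []
    for (_, prev), (ts, curr) in zip(samples, samples[1:]):
        increases.append((ts, curr - prev if curr >= prev else curr))
    return increases

# Cumulative request_count samples, with a pod restart between t=10 and t=15.
samples = [(0, 0), (5, 3), (10, 7), (15, 2)]
print(per_interval_increases(samples))  # → [(5, 3), (10, 4), (15, 2)]
```

This is why a restart of the Django pod does not corrupt the traffic picture: the cumulative graph drops to zero, but rate-style queries still report the true per-interval load.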
Add a label filter with exported_job set to django-notes-app, click on the metric dropdown, and select request_count_total as shown below:

Once you click on Run query you should see the following:

When you run the query, you'll see a graph showing how many requests have been made over time. You can also select individual requests for a detailed view. Each request on the list is color-coded, making it easy to match with its corresponding graph.
Select the first request from the graph section; the graph will focus on that specific request and stop at its total count, as shown below:

From the image above, the first request was selected, and the graph stopped at that request's total count, which is 4.
We have successfully generated metrics in our Django application, routed them to our OpenTelemetry collector, and configured Prometheus to scrape them. Additionally, we can now view these metrics in Grafana.
Troubleshooting
In any complex setup like this, you might encounter issues. Here are some common troubleshooting steps:
- Incorrect configurations are a common source of problems. Double-check your `collector.yaml`, service manifests, and any Helm values files you've used.
- Ensure Prometheus is correctly discovering and scraping targets. Access the Prometheus UI and check under Status → Service discovery or Status → Targets.
- Verify that the data sources in Grafana are correctly set up and can connect to Loki and Prometheus.
- If Prometheus isn't scraping metrics as expected, verify the configuration of your service monitor. Ensure the labels and selectors correctly match your OpenTelemetry Collector service. You can also use `kubectl describe servicemonitor otel-collector` to view detailed information about the service monitor.
Summary
Through this guide, we've taken a deep dive into setting up a comprehensive observability stack for a Django application pre-instrumented with OpenTelemetry running in Kubernetes. By integrating Grafana Tempo for distributed tracing, Loki for logs aggregation, and Prometheus for metrics collection, we have created a robust environment that tracks and visualizes aspects of our application's performance and health.
By completing this tutorial, you're well on your way to mastering Kubernetes-based application monitoring and troubleshooting. Keep experimenting and learning to harness the full potential of these powerful tools.
Further resources
If you want to learn more about this topic, here are some of my favorite resources:
- The OpenTelemetry Docs
- Prometheus Configuration Docs
- Loki-Stack Helm Chart Repository
- Henrik Rexed Navigate Europe 2023 talk on The Sound of Code: Instrument with OpenTelemetry

Technical writer
Mercy Bassey is a Cloud, Systems, and IT Support Specialist and technical writer with a focus on cloud infrastructure, DevOps practices, IT operations, and security. She specialises in translating complex technical concepts into clear, accessible documentation, with experience across tools and technologies including Linux, Kubernetes, Terraform, and scripting. She has contributed to Civo through the Write for Us programme and publishes additional technical content on Medium.