How to deploy ClearML on Kubernetes for ML experiment tracking

Deploy ClearML on Civo Kubernetes to implement production-grade MLOps experiment tracking that captures and compares all factors affecting machine learning model behavior.

10 minutes reading time

Written by

Mercy Bassey

Technical writer

A machine learning model is only useful when it's in production, serving actual users. Getting it to that level is a pipeline, from development to deployment. Each step includes checks, which is where Machine Learning Operations (MLOps) comes in.

MLOps is a culture, and experiment tracking is one of its most foundational stages. In regular software development, behaviors are deterministic: if you deploy the same code twice, you get the same behavior. Machine learning doesn't work that way. Results depend on data, library versions, hyperparameters, processing power, and a dozen other things. Without proper tracking, it's difficult to figure out why a model's outcome changes, or why it fails or breaks.

This tutorial focuses on experiment tracking with ClearML, an open-source MLOps platform designed for managing the lifecycle of machine learning applications. By the end, you’ll know what ClearML is and how you can deploy and use it for experiment tracking in Kubernetes.

Understanding ClearML’s architecture

ClearML, by design, includes modules for different machine learning tasks. The experiment tracking manager module is what facilitates the end-to-end control and visibility of machine learning experiments. This includes automated capturing of experimentation logs, metrics, and outputs. Under the experiment tracking manager module, the following runtime components work together:

  • ClearML server: This component controls the storage and management of experiments and the tracking workflow as a whole. From the ClearML server, you can access the ClearML dashboard. It uses MongoDB to store experiment metadata, Elasticsearch for searching and indexing, and Redis for caching. It comes in two variants: one that is fully managed by ClearML and the open-source variant, which you can deploy with Docker Compose or Helm.
  • ClearML SDK: This is what connects your code to your ClearML server. It’s a Python package, and is used to capture your machine learning code, outputs, hyperparameters, metrics, or anything else for experiment tracking.
  • ClearML agent: This is the component that acts as a job scheduler, executing experiment tasks from queues. It orchestrates the experiment tracking workflow and can run on cloud instances and GPUs.
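
To make the division of labor between these components more concrete, here is a toy, in-memory model of the queue pattern described above. It is only a sketch: the names ToyServer, enqueue, and run_agent_once are hypothetical and are not part of ClearML's API, and a real agent pulls tasks from the ClearML server over the network rather than from a local object.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for the server: holds queued tasks and recorded results."""
    queue: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def enqueue(self, task_name, func, **params):
        # The SDK side: register a task and its parameters for later execution.
        self.queue.append((task_name, func, params))


def run_agent_once(server):
    """Hypothetical agent step: pop the oldest task, run it, and record the result."""
    if not server.queue:
        return None
    task_name, func, params = server.queue.popleft()
    server.results[task_name] = func(**params)
    return task_name


server = ToyServer()
server.enqueue("double", lambda x: x * 2, x=21)
server.enqueue("add", lambda a, b: a + b, a=1, b=2)

# Keep executing until the queue is empty, as a polling agent would.
while run_agent_once(server) is not None:
    pass

print(server.results)
```

The point of the sketch is the separation of concerns: the code that defines an experiment does not need to be the process that executes it, which is what lets an agent run queued work on other machines.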

Prerequisites

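To follow along, you should have the following:

  • A running Civo Kubernetes cluster (this tutorial uses two nodes).
  • kubectl installed and configured to connect to your cluster.
  • Helm installed on your local machine.
  • Python 3 with pip on your local machine.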

Deploy ClearML server

The ClearML server includes three components that work together and communicate internally within Kubernetes. These components are required to be exposed as separate subdomains over HTTP or HTTPS for external access via an ingress controller. They are the API, web, and file servers, and they serve the following purposes:

  • API server: To expose endpoints for writing and querying experiment data.
  • Web server: To provide a single-page UI to manage experiments, consuming the API server’s endpoint to render experiment data visually.
  • File server: To handle the storage of experiment data, from model files and images to every asset your model uses and produces.

Install a Traefik ingress controller

Before you deploy the ClearML server, you must install an ingress controller in your Kubernetes cluster and configure it to expose your ClearML server externally. You’ll use the Traefik ingress controller in this tutorial and install it using Helm.

Add the Traefik Helm repository:

$ helm repo add traefik https://traefik.github.io/charts
"traefik" has been added to your repositories

Update the repository:

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "traefik" chart repository
Update Complete. ⎈Happy Helming!⎈

Install Traefik: 

$ helm install traefik traefik/traefik
NAME: traefik
LAST DEPLOYED: Sat May 30 17:53:27 2026
NAMESPACE: default
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
traefik with docker.io/traefik:v3.7.1 has been deployed successfully on default namespace!

Confirm that your Traefik pod is running:

$ kubectl get pods
NAME                                 READY   STATUS      RESTARTS   AGE
install-traefik2-nodeport-hi-pxkkg   0/1     Completed   0          33m
traefik-5f694f77f5-sq2j5             1/1     Running     0          2m28s

Check that it has an External IP:

$ kubectl get svc
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
traefik   LoadBalancer   10.43.165.247   74.220.23.118   80:31702/TCP,443:32464/TCP   5m22s

⚠️ Take note of the value under EXTERNAL-IP; for example, 74.220.23.118 (yours will be different). This IP address will be used to set hostnames for your ClearML server installation later.
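
Since the values file later in this tutorial uses nip.io hostnames built from this IP address, a small helper can derive them and catch typos early. This is a minimal sketch: the function name clearml_hostnames is hypothetical, and the example IP is the one from the output above, so substitute your own.

```python
import ipaddress


def clearml_hostnames(external_ip):
    """Return nip.io hostnames for the API, file, and web servers."""
    # Validate early so a mistyped address does not end up in values.yaml.
    ip = ipaddress.IPv4Address(external_ip)
    return {
        "apiserver": f"api.{ip}.nip.io",
        "fileserver": f"files.{ip}.nip.io",
        "webserver": f"app.{ip}.nip.io",
    }


hosts = clearml_hostnames("74.220.23.118")  # example IP; yours will be different
for component, host in hosts.items():
    print(f"{component}: {host}")
```

nip.io resolves any hostname that embeds an IP address to that address, which is why no DNS records need to be created for these three subdomains.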

ClearML has a demo server you can use temporarily. Using the demo server is not suitable for production workloads, and you will deploy your own ClearML server instead.

Add the ClearML Helm chart repository:

$ helm repo add clearml https://clearml.github.io/clearml-helm-charts
"clearml" has been added to your repositories

Update the repository:

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "clearml" chart repository
Update Complete. ⎈Happy Helming!⎈

Create a values.yaml file via Nano and paste in the following configuration settings, replacing <external IP> with the external IP of your ingress controller:

$ nano values.yaml
apiserver:
  ingress:
    enabled: true
    hostName: "api.<external IP>.nip.io"
  additionalConfigs:
    apiserver.conf: |
      auth {
        fixed_users {
          enabled: true
          pass_hashed: false
          users: [
            {
              username: "john"
              password: "iAmm&^TheOnlyJohnDoe(^n&3$*(++("
              name: "John Doe"
            }
          ]
        }
      }
fileserver:
  ingress:
    enabled: true
    hostName: "files.<external IP>.nip.io"
  persistence:
    size: 5Gi
webserver:
  ingress:
    enabled: true
    hostName: "app.<external IP>.nip.io"
elasticsearch:
  enabled: true
  image: "docker.elastic.co/elasticsearch/elasticsearch"
  imageTag: "7.17.20"
  esJavaOpts: "-Xmx1g -Xms1g"
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  volumeClaimTemplate:
    resources:
      requests:
        storage: 10Gi
mongodb:
  persistence:
    size: 10Gi

⚠️ For this tutorial, I have used a placeholder password. When completing this step, change this password immediately before deploying to production. Use a strong, unique password.
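
One way to produce a strong replacement password is Python's standard secrets module. The sketch below is only a suggestion: the function name generate_password and the 32-character default are assumptions, not something the ClearML chart requires.

```python
import secrets
import string


def generate_password(length=32):
    """Generate a random password from ASCII letters and digits."""
    if length < 16:
        raise ValueError("use at least 16 characters")
    alphabet = string.ascii_letters + string.digits
    # secrets.choice draws from a cryptographically secure random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


password = generate_password()
print(password)
```

Restricting the alphabet to letters and digits is a deliberate choice here: it avoids characters such as quotes and backslashes that would need escaping inside the quoted password string in the configuration.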

Here’s an explanation of what’s contained in this file:

  • Configures an ingress with a unique nip.io hostname for each of the three components: the API server, the file server, and the web server.
  • Sets user credentials for the ClearML server. If you don’t set user credentials, you’ll be required to do so via the ClearML server login page.
  • Pins Elasticsearch to version 7.17.20 to resolve an incompatibility with the cgroup v2 interface used by K3s version 1.34.

As of this writing, the ClearML Helm chart defaults to Elasticsearch 7.16.x, which fails on newer Kubernetes distributions using cgroup v2 (such as recent K3s releases). Upgrading Elasticsearch to version 7.17.20 resolves the issue in this deployment.

  • Allocates 10Gi of storage each for Elasticsearch and MongoDB, and 5Gi for the file server.
  • Sets 1Gi of Java heap memory for Elasticsearch’s internal process, with overall resource limits set to 1 CPU core and 2Gi RAM.

Save and close the file.
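
If you later change the Elasticsearch settings, it helps to check how the Java heap relates to the container memory limit. Elastic's general guidance is to keep the heap at no more than half of the available memory. The sketch below checks that ratio for the values used in this tutorial; the helper names jvm_heap_bytes and k8s_memory_bytes are hypothetical, and the parsing only covers the simple formats shown here.

```python
import re

UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def jvm_heap_bytes(java_opts):
    """Extract the maximum heap size (-Xmx) from a JVM options string, in bytes."""
    match = re.search(r"-Xmx(\d+)([kmg])", java_opts, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no -Xmx option found")
    return int(match.group(1)) * UNITS[match.group(2).lower()]


def k8s_memory_bytes(quantity):
    """Convert a Kubernetes binary quantity such as '2Gi' to bytes (Ki, Mi, Gi only)."""
    match = re.fullmatch(r"(\d+)([KMG])i", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(match.group(1)) * UNITS[match.group(2).lower()]


heap = jvm_heap_bytes("-Xmx1g -Xms1g")   # esJavaOpts from values.yaml
limit = k8s_memory_bytes("2Gi")          # resources.limits.memory from values.yaml
print(f"heap uses {heap / limit:.0%} of the container memory limit")
```

With the values from this tutorial, the heap is exactly half of the memory limit, which leaves the other half for the operating system and file system cache inside the container.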

Install ClearML with the following command:

$ helm install clearml clearml/clearml -f values.yaml
NAME: clearml
LAST DEPLOYED: Sat May 30 17:53:27 2026
NAMESPACE: default
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
NOTES:
1. Get the application URL:
http://app.74.220.23.118.nip.io

Before you access the web server, check that all pods related to ClearML are running:

$ kubectl get pods -A | grep clearml
default   clearml-apiserver-99455c685-dpggk                1/1   Running     0   4m34s
default   clearml-apiserver-asyncdelete-565c875667-kfjmw   1/1   Running     0   4m34s
default   clearml-elastic-master-0                         1/1   Running     0   4m34s
default   clearml-fileserver-85cd59dbdb-psxcx              1/1   Running     0   4m34s
default   clearml-mongodb-57f56d8c8b-qnnkk                 1/1   Running     0   4m34s
default   clearml-redis-master-0                           1/1   Running     0   4m34s
default   clearml-webserver-cf8c779bd-qxtpf                1/1   Running     0   4m34s
default   install-helm-clearml-7531-17-vw99k               0/1   Completed   0   21m
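
If you would rather check this output programmatically than by eye, the sketch below parses text in the same layout and lists any pods that are not ready. The function name pods_not_ready is hypothetical, and the sample text is adapted from the output above, with one pod changed to Pending purely for illustration.

```python
def pods_not_ready(kubectl_output):
    """Return names of pods that are neither fully ready nor Completed.

    Expects lines in the `kubectl get pods -A` layout:
    NAMESPACE NAME READY STATUS RESTARTS AGE
    """
    problems = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 4 or "/" not in fields[2]:
            continue  # skip blank lines and a header row, if present
        name, ready, status = fields[1], fields[2], fields[3]
        if status == "Completed":
            continue  # finished one-off jobs are expected to show 0/1
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Illustrative sample: the Elasticsearch pod is shown as Pending on purpose.
sample = """\
default clearml-apiserver-99455c685-dpggk 1/1 Running 0 4m34s
default clearml-elastic-master-0 0/1 Pending 0 4m34s
default install-helm-clearml-7531-17-vw99k 0/1 Completed 0 21m
"""
print(pods_not_ready(sample))
```

An empty list means every ClearML pod is either running with all containers ready or is a completed installer job.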

Visit the web server at http://app.<external IP>.nip.io in your preferred web browser. You’ll be prompted to enter a username and password once it loads. Enter the username and password you set in your values.yaml file:

Viewing the ClearML server login page

Upon successful login, you’ll be redirected to your dashboard, where you can manage your projects, experiments, pipelines, and so on.

Run the ClearML SDK locally

The ClearML server is just a management console; the component of the experiment tracking management module that actually lives in your code is the ClearML SDK. With the ClearML SDK, you can create experimentation workflows locally, connect them to your ClearML server, and view or manage your workflows from there.

You’ll develop a machine learning project locally (on your computer), install the ClearML SDK, and connect to your ClearML server running on your Civo Kubernetes cluster.

To fully set up your project with the ClearML SDK, you need your workspace credentials. Once you install the SDK, you’ll be required to initialize it with these credentials; therefore, you must obtain them first.

Visit http://app.<external IP>.nip.io/settings/workspace-configuration and click Create Credentials to generate and display your credentials. You can either leave the page open or copy the credentials to a text editor:

Obtaining workspace credentials on the ClearML server dashboard

Create a working directory called experiment and open it in your code editor (this example uses the Visual Studio Code command-line launcher):

$ mkdir experiment && cd experiment && code .

Most modern Python projects use package managers like Poetry for dependency management and virtual environment creation. However, for simplicity, this article uses the traditional approach: a virtual environment paired with a requirements.txt file. If you're already familiar with Poetry or any other dependency management tool for Python projects, feel free to use it instead.

From your working directory, create a virtual environment and activate it:

$ python3 -m venv venv
$ source venv/bin/activate

Create a requirements.txt file and add the following packages:

$ nano requirements.txt
clearml
scikit-learn

Install the ClearML SDK as well as scikit-learn, which you will use to build and train a model:

$ pip install -r requirements.txt

Initialize the SDK and paste in your credentials when prompted. Allow it some time to verify your credentials:

$ clearml-init
# Once verified
Verifying credentials ...
Credentials verified!
New configuration stored in /home/mercy/clearml.conf
ClearML setup completed successfully.
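
In non-interactive settings such as CI jobs, where running clearml-init is impractical, the SDK can also read its connection settings from environment variables. The sketch below builds them from the external IP and the credentials you created. The helper name clearml_env is hypothetical, the key values are placeholders, and the hostnames assume the nip.io scheme used in this tutorial.

```python
import os


def clearml_env(external_ip, access_key, secret_key):
    """Build the environment variables the ClearML SDK reads for its connection settings."""
    base = f"{external_ip}.nip.io"
    return {
        "CLEARML_WEB_HOST": f"http://app.{base}",
        "CLEARML_API_HOST": f"http://api.{base}",
        "CLEARML_FILES_HOST": f"http://files.{base}",
        "CLEARML_API_ACCESS_KEY": access_key,
        "CLEARML_API_SECRET_KEY": secret_key,
    }


# Placeholder values: substitute your own IP and the credentials you created.
env = clearml_env("74.220.23.118", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
os.environ.update(env)  # set these before the ClearML SDK is imported
print(env["CLEARML_API_HOST"])
```

Keep real keys out of source control; in practice you would inject them through your CI system's secret store rather than hard-coding them in a script.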

The idea here is to run three experiments using the Random Forest Classifier model with the Iris flower dataset, which contains 150 samples of Iris flowers across three species.

For the sake of this tutorial, you’ll vary the n_estimators hyperparameter, which sets the number of trees in the forest, across three experiments: 10 in the first, 50 in the second, and 100 in the third. This lets you see how ClearML tracks each run as a separate experiment, and then compare all three via the ClearML server to see the accuracy levels of your model.

Next, create three scripts, ex1.py, ex2.py, and ex3.py, and paste the following code into ex1.py:

# Experiment 1
from clearml import Task
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Initialize ClearML Task
task = Task.init(project_name="Iris Classification", task_name="10 estimators")

# Define hyperparameters
params = {
    "n_estimators": 10,
    "max_depth": 3,
    "random_state": 42
}
task.connect(params)

# Load and split the Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=params["random_state"]
)

# Train the model
clf = RandomForestClassifier(
    n_estimators=params["n_estimators"],
    max_depth=params["max_depth"],
    random_state=params["random_state"]
)
clf.fit(X_train, y_train)

# Log metrics
logger = task.get_logger()
train_accuracy = accuracy_score(y_train, clf.predict(X_train))
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
logger.report_scalar("accuracy", "train", value=train_accuracy, iteration=1)
logger.report_scalar("accuracy", "test", value=test_accuracy, iteration=1)
print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
task.close()

Use the same code for ex2.py and ex3.py, but change the task name and hyperparameters as follows, respectively:

# experiment 2
task = Task.init(project_name="Iris Classification", task_name="50 estimators")
params = {
    "n_estimators": 50,
    "max_depth": 3,
    "random_state": 42
}

# experiment 3
task = Task.init(project_name="Iris Classification", task_name="100 estimators")
params = {
    "n_estimators": 100,
    "max_depth": 3,
    "random_state": 42
}

These will do the following:

  • ex1.py trains the Random Forest Classifier model with 10 estimators, logs the hyperparameters and accuracy metrics, and saves the experiment as 10 estimators.
  • ex2.py does the same with 50 estimators and saves the experiment as 50 estimators.
  • ex3.py does the same with 100 estimators and saves the experiment as 100 estimators.
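
Because the three scripts differ only in the task name and one hyperparameter, the variation itself can be expressed as data. The sketch below generates the same three configurations; the function name experiment_configs is hypothetical and only illustrates the idea, while the tutorial itself keeps three separate files for clarity.

```python
def experiment_configs(estimator_counts=(10, 50, 100), max_depth=3, random_state=42):
    """Yield (task_name, params) pairs matching the three experiments."""
    for n in estimator_counts:
        params = {
            "n_estimators": n,
            "max_depth": max_depth,
            "random_state": random_state,
        }
        yield f"{n} estimators", params


configs = list(experiment_configs())
for task_name, params in configs:
    print(task_name, params)
```

In a single-script variant, each pair could be passed to Task.init(project_name="Iris Classification", task_name=task_name) and task.connect(params) as in ex1.py, closing each task with task.close() before starting the next one.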

Run them one after the other:

$ python ex1.py
$ python ex2.py
$ python ex3.py

From here, the ClearML server will:

  • Create a project on your dashboard called Iris Classification.
  • Register three separate experiments on the project.
  • Capture the hyperparameters, accuracy metrics, execution environment, and the Python package versions for each run.
  • Log the train and test accuracy for each experiment.
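
To get a feel for what "execution environment" and "package versions" mean in practice, the sketch below collects similar details using only the Python standard library. It illustrates the kind of information involved and is not how ClearML implements its capture; the function name capture_environment is hypothetical.

```python
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Collect interpreter, OS, and package version details for a run record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


record = capture_environment(["clearml", "scikit-learn"])
print(record)
```

Two runs with identical code and hyperparameters but different entries in a record like this are exactly the cases that are hard to explain without experiment tracking.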

Note that the ClearML SDK will attempt to monitor your hardware resources, for example, your GPU statistics. If your local machine has no supported GPU and runs on the CPU only, the SDK will turn off GPU monitoring.

For the most part, GPU usage is relevant data in machine learning experimentation. Utilization percentage and memory used both affect performance, which is why the ClearML SDK tries to capture them.

Therefore, if you see the following in the output after running each of your scripts, do not fret. GPU monitoring is optional:

2026-05-30 12:46:56,775 - clearml.Task - INFO - Finished repository detection and package analysis
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring

View experiments on ClearML server

Since you are connected to your ClearML server, your project and experiments will be visible on your dashboard.

Head to your server and click on “View All” to view your Iris Classification project:

Viewing projects on the ClearML server projects dashboard

Select your project:

Viewing the project - Iris Classification

You should see all experiments listed out for you:

Viewing experiments under the Iris Classification project

You can monitor trends for your train and test performance by clicking on the graph icon:

Viewing performance metrics for test and training data

Additionally, it’s possible to compare further across different values like hyperparameters, scalars, plots, and debug samples across experiments:

Clicking on the compare option to compare across different values for experiments.

Comparing across different values for experiments

From here, you can view artifacts, executions, and other relevant information for your experiments, which helps to find inconsistencies that affect experiment output and performance, if any.

Now you can draw the following conclusions:

  • All three experiments achieved the same test accuracy of 1.0.
  • On training accuracy, 10 estimators slightly outperformed 50 and 100, scoring 0.9666 against 0.9583.
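
It is worth putting the size of that gap in perspective. Assuming the two figures are accuracies on the training split (150 samples with test_size=0.2 leaves 120 for training), the short calculation below shows that the difference corresponds to a single training sample. The variable names are illustrative only.

```python
n_samples = 150                 # size of the Iris dataset
n_test = n_samples // 5         # test_size=0.2 is one fifth of the data
n_train = n_samples - n_test

lower = 115 / n_train           # 115 of 120 training samples correct
higher = 116 / n_train          # 116 of 120 training samples correct

print(n_train, n_test)
print(f"{lower:.4f}")
print(f"{higher:.4f}")
print(round((higher - lower) * n_train))  # difference, in samples
```

Under that assumption, 0.9583 is 115 correct predictions out of 120 and 0.9666 (truncated) is 116 out of 120, so the "winner" classified one more training sample correctly. On a dataset this small, that is not evidence that fewer trees are better.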

This is the visibility the ClearML server gives you, all in one place, without you having to write down or manually record these values and outputs for each run. Comparison is one click away.

Troubleshooting

Below are some common issues along with their potential solutions to help you navigate and resolve any hurdles that you may face in this tutorial:

  1. Web server showing 404 in the browser: After you install the ClearML server via Helm, check that all its pods are in a (1/1) Running state before you attempt to visit the web server from a browser. The following command outputs the states of the pods: kubectl get pods -n default | grep clearml.
  2. ClearML pods in Error, OOMKilled, Pending, or CrashLoopBackOff state: View the logs of each pod with kubectl logs <pod name> and describe it with kubectl describe pod <pod name>. These commands will help you identify the root cause of the failures. Also, ensure that your Kubernetes cluster has sufficient CPU, memory, and storage resources allocated for your ClearML server components, such as the file server, MongoDB, and Elasticsearch.
  3. Test the API server connectivity: You can test whether the API server is reachable with the following command: curl -u "ACCESS_KEY:SECRET_ACCESS_KEY" -X GET "http://api.<external IP>.nip.io/auth.login". A successful request returns a 200 response code with a JSON payload containing a bearer token.
  4. Cannot locate your access key and secret access key: You can obtain your ACCESS_KEY and SECRET_ACCESS_KEY from http://app.<external IP>.nip.io/settings/workspace-configuration.
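
If curl is not available, the connectivity test in step 3 can be sketched with Python's standard library instead. The function names build_login_request and fetch_token are hypothetical, the IP and keys are placeholders, and the location of the token in the response (under data, then token) is an assumption you should confirm against the actual payload.

```python
import base64
import json
import urllib.request


def build_login_request(external_ip, access_key, secret_key):
    """Build a GET request to auth.login with HTTP basic authentication."""
    url = f"http://api.{external_ip}.nip.io/auth.login"
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_token(request, timeout=10):
    """Send the request and return the bearer token from the JSON payload.

    Assumes the token is found under data -> token in the response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["data"]["token"]


# Placeholder values: substitute your own IP and credentials.
request = build_login_request("74.220.23.118", "ACCESS_KEY", "SECRET_ACCESS_KEY")
print(request.get_method(), request.full_url)
```

Calling fetch_token(request) performs the actual network call; an HTTPError or URLError at that point indicates the same kind of connectivity or credential problem that the curl command would reveal.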

Next steps

A good next step is to make your experiments reproducible with the ClearML agent. The ClearML agent emphasizes reproducibility: while the SDK captures everything about an experiment when it runs, the agent lets you rerun it in the same environment at any time to see whether you get the same result.

You can deploy it on Kubernetes as a Helm chart or via the clearml-k8s-glue script, or even locally on your machine from your project via pip.

If you choose to use a Helm chart, make sure to view the values available to you via helm show values clearml/clearml-agent.

Summary

ClearML offers many features. In this tutorial, you have learned how to use it for experiment tracking, from deploying the ClearML server component on a Civo Kubernetes cluster via Helm, to initializing the ClearML SDK in your project, and connecting to your ClearML server. You have created a project and run three experiments under it to see how ClearML captures and organizes experiment data, and then compared their outputs.

This article uses two nodes to demonstrate how ClearML can be used for experiment tracking and adjusts the storage requirements for Elasticsearch, MongoDB, and the file server. If you’ll be working in a production environment, make sure to adjust the storage requirements to suit your use case and use at least three nodes when provisioning your Kubernetes cluster.

Mercy Bassey is a Cloud, Systems, and IT Support Specialist and technical writer with a focus on cloud infrastructure, DevOps practices, IT operations, and security. She specialises in translating complex technical concepts into clear, accessible documentation, with experience across tools and technologies including Linux, Kubernetes, Terraform, and scripting. She has contributed to Civo through the Write for Us programme and publishes additional technical content on Medium.
