Optimize GPU costs for machine learning with just-in-time Kubernetes scaling

Learn how to implement just-in-time GPU provisioning on Civo Kubernetes to dynamically scale resources, enforce budget and utilization guardrails, and receive alerts for efficient ML training cost optimization.

6 minutes reading time

Written by

Mostafa Ibrahim

Software Engineer @ GoCardless

Machine learning projects face an escalating cost challenge from GPU infrastructure. Training modern deep learning models requires substantial compute, yet many teams find expensive GPU nodes sitting idle between training runs. This inefficiency turns model iteration into an ongoing budget problem rather than a purely research or product problem.

Just-in-time GPU provisioning solves this by creating capacity only when workloads demand it. Instead of maintaining always-on GPU clusters, teams dynamically scale GPU resources, ensuring accelerators are available for training and inference while avoiding idle-hour charges.

This tutorial walks through building a lightweight, on-demand GPU scheduler that also posts notifications about cost and training events. Specifically, the system will:

  • Monitor Kubernetes pods that request GPUs.
  • Scale GPU node pools using Civo CLI (or REST API in production).
  • Enforce simple utilization and budget guardrails.
  • Post Slack/Telegram alerts for scaling and budget events (simulated in this tutorial).

Architecture overview

The solution consists of several components working together:

  • Civo Kubernetes cluster: Base CPU-only cluster with dynamic GPU node pools
  • Watchdog controller: Monitors workload patterns and triggers scaling decisions
  • Cost monitor: Tracks spending against budgets using Civo's billing API
  • Alert System: Provides real-time notifications about scaling events and cost thresholds
  • Policy engine: Enforces business rules around utilization and spending limits

The resulting architecture combines node labels/taints, a small watchdog script, and lightweight alerts to make GPU spending observable and predictable.

Prerequisites

| Requirement | Purpose | Notes |
| --- | --- | --- |
| Civo account and API key | Authenticate CLI/API calls | Store keys in Kubernetes Secrets |
| Kubernetes cluster (CPU-only) | Base cluster to which GPU node pools will be added | |
| kubectl | Inspect nodes, apply manifests, test pods | Ensure kubeconfig targets the correct cluster |

Keep credentials out of source code by using Kubernetes Secrets. Assign minimal RBAC permissions to any in-cluster automation to follow least-privilege principles.
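As a concrete example, the API key can be stored in a Secret with a command like the following (the secret and key names here are this tutorial's convention, not fixed requirements):

```shell
# Store the Civo API key in a Secret in the namespace where the
# watchdog will run (kube-system, matching the RBAC manifests later on)
kubectl create secret generic civo-api-key \
  --from-literal=api-key="$CIVO_API_KEY" \
  -n kube-system
```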

Step 1: Create a GPU cluster on Civo

Instead of spinning up a separate GPU-only cluster, add a GPU node pool to the existing CPU cluster. This approach keeps workloads organized: CPU nodes handle system tasks and light jobs, while GPU nodes are reserved for ML training and inference. It also simplifies networking, namespaces, and monitoring, since everything stays under one Kubernetes control plane.

There are two approaches to doing this:

  • Civo Dashboard:
    • Go to Civo Dashboard → Kubernetes → Create Cluster
    • Choose a standard CPU size (e.g., g4s.kube.small).
    • Set initial node count = 1 to avoid accidental charges during testing.
    • Go to cluster → Create New Pool.
    • Select a GPU-optimized size (e.g., g4g.kube.small).
    • Set initial node count = 1.
  • Civo CLI:
    • Create or scale CPU and GPU pools programmatically.
    • Example CLI commands:
# Step 1: Create a CPU-only cluster
civo kubernetes create CLUSTER_NAME --size=g4s.kube.small --nodes=1
# Step 2: Add a GPU pool (keep the initial count minimal to limit idle costs)
civo kubernetes node-pool create CLUSTER_NAME --size=g4g.kube.small --count=1
# Step 3: Scale GPU pool when training starts
civo kubernetes node-pool scale CLUSTER_NAME NODEPOOL_ID -n NUMBER

The advantage of Civo’s GPU nodes

Unlike other providers, Civo pre-installs NVIDIA drivers, CUDA toolkit, and container runtime on GPU nodes. This eliminates the common "bootstrap time" problem where teams wait 10-15 minutes for driver installation after node creation.

After creating the pool, confirm that nodes join the cluster and that the NVIDIA device plugin or vendor operator is running. GPU nodes require drivers and device plugins to expose nvidia.com/gpu resources in Kubernetes. Here’s how to do it:

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

Step 2: Tag GPU workloads

Labels and taints prevent CPU workloads from being accidentally scheduled on GPU nodes and let the watchdog reliably detect GPU work. Inspect node labels and taints:

kubectl describe node <gpu-node-name>
  • Look for labels like node.kubernetes.io/instance-type, nodepool=gpu, or custom tags.
  • Note taints like gpu=true:NoSchedule, which prevent generic pods from landing there.

Apply Labels, Taints, and Affinity:

If missing, add labels via:

kubectl label nodes <gpu-node-name> gpu=on-demand
kubectl taint nodes <gpu-node-name> gpu=true:NoSchedule

If nodes cannot be labeled, skip this step and rely on nodeAffinity in the job spec (a working example follows).

Example Job Spec (gpu-test.yaml):

# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  labels:
    gpu: on-demand
spec:
  restartPolicy: OnFailure
  containers:
    - name: fake-gpu
      image: busybox
      command: ["sh", "-c", "echo 'Simulating GPU work'; sleep 60"]
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: In
                values:
                  - on-demand

Pro tip: Consistent labeling (gpu=on-demand) ensures the watchdog can detect GPU jobs reliably across different teams and projects. Consider implementing a labeling policy that includes cost centers and project codes for better financial tracking.
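As an illustration of such a policy, extra labels can ride alongside the scheduling label (the cost-center and project label names below are hypothetical):

```shell
# Hypothetical labeling policy: the gpu label drives scheduling,
# while cost-center/project labels support financial reporting
kubectl label nodes <gpu-node-name> gpu=on-demand cost-center=ml-research project=image-classifier
```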

Step 3: Write watchdog script

The watchdog is a lightweight Python service that continuously monitors GPU job activity in the cluster, triggers node-pool scaling when needed, and posts notifications for team visibility. Think of it as the “eyes and hands” of the cost-aware GPU scheduler.

Design Principles:

  • Monitor pods, not jobs: Pod-level inspection (status: Pending/Running) is more reliable.
  • Secure authentication: Read the Civo API key from a Kubernetes Secret.
  • Least-privilege RBAC: ServiceAccount with read-only access to pods/jobs.
  • Backoff & cooldown: Avoid thrashing with exponential backoff or cooldown between scale actions.
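A fixed cooldown (used in the script that follows) already prevents most thrashing; for the exponential-backoff variant mentioned above, a minimal sketch looks like this (the base and cap values are illustrative):

```python
# Exponential backoff with an upper bound, for retrying failed
# scale actions without hammering the Civo API
def next_delay(attempt: int, base: float = 5, cap: float = 300) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))
```

With these defaults, the first retry waits 5 seconds, the fourth waits 40, and the delay is capped at 300 seconds from attempt 6 onward.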

Folder structure:

civo-gpu-scheduler/
├── gpu-test.yaml # GPU test pod manifest
├── watchdog_local.py # Watchdog script for local simulation
├── watchdog.py # Production-ready script (with CLI/API calls)
└── README.md

Example watchdog (Python, CLI-based for simplicity):

Import dependencies and configure logging:

import time, logging
logging.basicConfig(level=logging.INFO)

Define state management variables:

scaled_jobs = set()
COOLDOWN = 10
LAST_ACTION = 0

Function to detect pending GPU jobs:

def pending_gpu_jobs():
    return ["gpu-test"]  # Replace with a real pod query
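In production, the placeholder above can be replaced with a real query via the official kubernetes Python client. A sketch, assuming pods carry the gpu=on-demand label from Step 2 (the injectable v1 parameter exists mainly to make the function testable):

```python
def pending_gpu_jobs(v1=None):
    """Return the names of Pending pods labeled gpu=on-demand."""
    if v1 is None:
        # Lazy import so the module still loads where the client
        # library is not installed (e.g., during local simulation)
        from kubernetes import client, config
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        label_selector="gpu=on-demand",
        field_selector="status.phase=Pending",
    )
    return [pod.metadata.name for pod in pods.items]
```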

Continuous monitoring with scale-up and scale-down: the loop checks GPU job status, triggering a scale-up when pending jobs exist and the cooldown has passed, and a scale-down when no jobs remain and the cooldown has passed.

while True:
    pending = pending_gpu_jobs()
    now = time.time()

    # -----------------------------------------------
    # Scale-Up Logic
    # -----------------------------------------------
    # If GPU jobs are waiting and the cooldown has passed,
    # simulate a scale-up and send alerts.
    if pending and (now - LAST_ACTION) > COOLDOWN:
        for job in pending:
            logging.info(f"Pending GPU pod detected: {job}, simulating scale-up")
            print("Scale-up simulated (Civo API/CLI call would go here)")
            print(f"Slack alert: GPU scale-up triggered for job {job}")
            print(f"Telegram alert: GPU scale-up triggered for job {job}")
            scaled_jobs.add(job)
        LAST_ACTION = now

    # -----------------------------------------------
    # Scale-Down Logic
    # -----------------------------------------------
    # If no jobs remain and the cooldown has passed,
    # simulate a scale-down and send alerts.
    if not pending and scaled_jobs and (now - LAST_ACTION) > COOLDOWN:
        logging.info("No pending GPU pods: simulating scale-down")
        print("Scale-down simulated (Civo API/CLI call would go here)")
        print("Slack alert: GPU scale-down triggered")
        print("Telegram alert: GPU scale-down triggered")
        scaled_jobs.clear()
        LAST_ACTION = now

    # -----------------------------------------------
    # Sleep Interval
    # -----------------------------------------------
    # Avoids busy looping. Runs the check every 5 seconds.
    time.sleep(5)

Alerts are simulated. In production, replace print() with real Slack/Telegram webhooks.
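As a sketch of that replacement, the helper below posts to a Slack incoming webhook using only the standard library. The SLACK_WEBHOOK_URL environment variable is an assumed setup step (create an incoming webhook in your Slack workspace first):

```python
import json
import os
import urllib.request

def send_slack_alert(message: str) -> bool:
    """Post a message to a Slack incoming webhook; True on success."""
    url = os.getenv("SLACK_WEBHOOK_URL")
    if not url:
        return False  # alerting not configured; fail quietly
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A Telegram equivalent would call the Bot API's sendMessage endpoint in the same fashion.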

RBAC example

The watchdog requires specific Kubernetes permissions to monitor pods and jobs. This RBAC configuration follows least-privilege principles:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-watchdog
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-watchdog-role
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-watchdog-binding
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: gpu-watchdog
    namespace: kube-system
roleRef:
  kind: Role
  name: gpu-watchdog-role
  apiGroup: rbac.authorization.k8s.io

Step 4: Set cost guardrails

Effective cost management requires proactive guardrails that prevent budget overruns while maintaining development velocity. Civo's transparent billing model makes implementing these controls straightforward.

Suggested guardrails

| Rule | Description |
| --- | --- |
| High utilization rule | Maintain GPUs when average utilization > 70% for 5+ minutes. This preference for throughput ensures active experiments aren't interrupted by premature scale-down decisions. |
| Low utilization rule | Scale down when utilization < 50% for 10+ minutes and no pending GPU pods exist. This prevents immediate thrashing while ensuring cost efficiency. |
| Budget cap rule | Pause new GPU job scheduling when the monthly spend reaches 80% of the allocated budget. This hard stop prevents surprise bills while allowing teams to plan resource usage. |
| Time-based rules | Automatically scale down GPU resources during known low-usage periods (nights, weekends) unless explicitly overridden. |

Metrics and signals

Prefer telemetry over heuristics. Deploy the NVIDIA DCGM exporter and Prometheus to collect GPU utilization, and use PromQL rolling averages to implement the thresholds. If metrics are not available, fall back to pod-count heuristics.
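For example, the 70%/50% thresholds above can be evaluated with rolling averages over the DCGM exporter's utilization gauge (the metric name below follows the NVIDIA DCGM exporter's defaults; adjust it if your deployment relabels metrics):

```promql
# Per-GPU average utilization over the last 5 minutes
avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

# Cluster-wide rolling average, for the scale-up/scale-down rules
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```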

Budget checking via Civo API

Civo exposes billing endpoints to list account charges. Use the charges API to compute cumulative usage and enforce budget caps. Example API resource: GET https://api.civo.com/v2/charges. Parse responses safely and handle missing fields.

Example budget check (safe parsing and error handling recommended):

import requests, os

CIVO_API_KEY = os.getenv("CIVO_API_KEY")
HEADERS = {"Authorization": f"Bearer {CIVO_API_KEY}"}

def get_charges():
    resp = requests.get("https://api.civo.com/v2/charges", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

Caution: The exact billing object shape can change. Validate response fields before use.
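Building on get_charges() above, a budget-cap check might look like the following sketch. The per-charge "total" field is an assumption about the response shape, which is why the parsing is deliberately defensive:

```python
def over_budget(charges, monthly_budget, threshold=0.8):
    """True once cumulative spend reaches threshold * monthly_budget."""
    total = 0.0
    for charge in charges:
        # Defensive parsing: skip entries without a numeric total
        try:
            total += float(charge.get("total", 0))
        except (TypeError, ValueError):
            continue
    return total >= threshold * monthly_budget
```

For a 100-unit monthly budget, cumulative spend of 85 trips the default 80% cap, matching the budget cap rule above.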

Step 5: Test the workflow

Comprehensive testing ensures the system behaves correctly under various scenarios, from normal operations to edge cases and failure conditions.

Testing steps

  1. Deploy the watchdog as a Deployment in the cluster.
  2. Submit a short GPU job labeled gpu=on-demand.
  3. Verify the job becomes Pending → watchdog detects it → simulated scale-up occurs.
  4. Confirm simulated Slack/Telegram alerts appear with relevant context.
  5. Observe job runs on GPU node → completion triggers scale-down and alert.
  6. Test error conditions: API failures, network issues, budget limits.
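Steps 2–5 above reduce to a few commands (the gpu-watchdog Deployment name is this tutorial's convention from the RBAC section):

```shell
# Submit the labeled test pod and watch it go Pending -> Running
kubectl apply -f gpu-test.yaml
kubectl get pods -w

# Follow the watchdog's scaling decisions and simulated alerts
kubectl logs deploy/gpu-watchdog -f
```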

Here’s an example of the output you should see after running the watchdog script.


Monitor the dashboard and billing for expected activity. Pre-pull images or maintain a warm node for faster startup if needed.

Troubleshooting

Common issues and their solutions when deploying the GPU scheduler:

  • Pod pending: Check that tolerations/affinity labels match GPU node configuration exactly. Mismatched labels are the most common cause of scheduling failures.
  • Cold start issues: Civo's fast provisioning helps, but image pulls can still add 2–5 minutes. Pre-pull common images or maintain standby nodes for latency-sensitive workloads.
  • Webhook failures: Check logs and network egress rules. Ensure the cluster can reach external notification services.
  • Quota/API errors: Inspect CLI/API output carefully. GPU quotas may limit scaling, especially for new Civo accounts.

Summary

Civo's flexible node-pool management, strategic workload tagging, and telemetry-driven scaling together let you optimize GPU usage and costs. Best practices such as storing API keys in Secrets, granting minimal RBAC, and enforcing cooldown windows further improve the deployment's efficiency and reliability. With these strategies in place, GPU spending becomes predictable and policy-driven, and your GPU-accelerated workloads on Civo can run at full potential.


Mostafa Ibrahim

Software Engineer @ GoCardless

Mostafa Ibrahim is a software engineer and technical writer specializing in developer-focused content for SaaS and AI platforms. He currently works as a Software Engineer at GoCardless, contributing to production systems and scalable payment infrastructure.

Alongside his engineering work, Mostafa has written more than 200 technical articles reaching over 500,000 readers. His content covers topics including Kubernetes deployments, AI infrastructure, authentication systems, and retrieval-augmented generation (RAG) architectures.
