Optimize GPU costs for machine learning with just-in-time Kubernetes scaling

Learn how to implement just-in-time GPU provisioning on Civo Kubernetes to dynamically scale resources, enforce budget and utilization guardrails, and receive alerts for efficient ML training cost optimization.

6 minutes reading time

Written by

Mostafa Ibrahim

Software Engineer @ GoCardless

Machine learning projects face an escalating cost challenge from GPU infrastructure. Training modern deep learning models requires substantial compute, yet many teams find expensive GPU nodes sitting idle between training runs. This inefficiency turns model iteration into an ongoing budget problem rather than a purely research or product problem.

Just-in-time GPU provisioning solves this by creating capacity only when workloads demand it. Instead of maintaining always-on GPU clusters, teams dynamically scale GPU resources, ensuring accelerators are available for training and inference while avoiding idle-hour charges.

This tutorial walks through building a lightweight, on-demand GPU scheduler that also posts notifications about cost and training events. Specifically, the system will:

  • Monitor Kubernetes pods that request GPUs.
  • Scale GPU node pools using Civo CLI (or REST API in production).
  • Enforce simple utilization and budget guardrails.
  • Post Slack/Telegram alerts for scaling and budget events (simulated in this tutorial).

Architecture overview

The solution consists of several components working together:

  • Civo Kubernetes cluster: Base CPU-only cluster with dynamic GPU node pools
  • Watchdog controller: Monitors workload patterns and triggers scaling decisions
  • Cost monitor: Tracks spending against budgets using Civo's billing API
  • Alert System: Provides real-time notifications about scaling events and cost thresholds
  • Policy engine: Enforces business rules around utilization and spending limits

The resulting architecture combines node labels/taints, a small watchdog script, and lightweight alerts to make GPU spending observable and predictable.

Prerequisites

| Requirement | Purpose | Notes |
| --- | --- | --- |
| Civo account and API key | Authenticate CLI/API calls | Store keys in Kubernetes Secrets |
| Kubernetes cluster (CPU-only) | Base cluster to which GPU node pools will be added | |
| kubectl | Inspect nodes, apply manifests, test pods | Ensure kubeconfig targets the correct cluster |

Keep credentials out of source code by using Kubernetes Secrets. Assign minimal RBAC permissions to any in-cluster automation to follow least-privilege principles.
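As a concrete example, the API key can be stored in a Secret with a command like the following (the secret and key names here are this tutorial's convention, not fixed requirements):

```shell
# Store the Civo API key in a Secret in the namespace where the
# watchdog will run (kube-system, matching the RBAC manifests later on)
kubectl create secret generic civo-api-key \
  --from-literal=api-key="$CIVO_API_KEY" \
  -n kube-system
```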

Step 1: Create a GPU cluster on Civo

Instead of spinning up a separate GPU-only cluster, add a GPU node pool to the existing CPU cluster. This approach keeps workloads organized: CPU nodes handle system tasks and light jobs, while GPU nodes are reserved for ML training and inference. It also simplifies networking, namespaces, and monitoring, since everything stays under one Kubernetes control plane.

There are two approaches to doing this:

  • Civo Dashboard:
    • Go to Civo Dashboard → Kubernetes → Create Cluster
    • Choose a standard CPU size (e.g., g4s.kube.small).
    • Set initial node count = 1 to avoid accidental charges during testing.
    • Go to cluster → Create New Pool.
    • Select a GPU-optimized size (e.g., g4g.kube.small).
    • Set initial node count = 1.
  • Civo CLI:
    • Create or scale CPU and GPU pools programmatically.
    • Example CLI commands:
# Step 1: Create a CPU-only cluster
civo kubernetes create CLUSTER_NAME --size=g4s.kube.small --nodes=1
# Step 2: Add a GPU pool (keep the initial count minimal to limit idle costs)
civo kubernetes node-pool create CLUSTER_NAME --size=g4g.kube.small --count=1
# Step 3: Scale GPU pool when training starts
civo kubernetes node-pool scale CLUSTER_NAME NODEPOOL_ID -n NUMBER

The advantage of Civo’s GPU nodes

Unlike other providers, Civo pre-installs NVIDIA drivers, CUDA toolkit, and container runtime on GPU nodes. This eliminates the common "bootstrap time" problem where teams wait 10-15 minutes for driver installation after node creation.

After creating the pool, confirm that nodes join the cluster and that the NVIDIA device plugin or vendor operator is running. GPU nodes require drivers and device plugins to expose nvidia.com/gpu resources in Kubernetes. Here’s how to do it:

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

Step 2: Tag GPU workloads

Labels and taints prevent CPU workloads from being accidentally scheduled on GPU nodes and let the watchdog reliably detect GPU work. Inspect node labels and taints:

kubectl describe node <gpu-node-name>
  • Look for labels like node.kubernetes.io/instance-type, nodepool=gpu, or custom tags.
  • Note taints like gpu=true:NoSchedule, which prevent generic pods from landing there.

Apply Labels, Taints, and Affinity:

If missing, add labels via:

kubectl label nodes <gpu-node-name> gpu=on-demand
kubectl taint nodes <gpu-node-name> gpu=true:NoSchedule

If nodes cannot be labeled, skip this step and rely on nodeAffinity in the job spec (a working example follows).

Example Job Spec (gpu-test.yaml):

# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  labels:
    gpu: on-demand
spec:
  restartPolicy: OnFailure
  containers:
    - name: fake-gpu
      image: busybox
      command: ["sh", "-c", "echo 'Simulating GPU work'; sleep 60"]
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: In
                values:
                  - on-demand

Pro tip: Consistent labeling (gpu=on-demand) ensures the watchdog can detect GPU jobs reliably across different teams and projects. Consider implementing a labeling policy that includes cost centers and project codes for better financial tracking.
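As an illustration of such a policy, extra labels can ride alongside the scheduling label (the cost-center and project label names below are hypothetical):

```shell
# Hypothetical labeling policy: the gpu label drives scheduling,
# while cost-center/project labels support financial reporting
kubectl label nodes <gpu-node-name> gpu=on-demand cost-center=ml-research project=image-classifier
```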

Step 3: Write watchdog script

The watchdog is a lightweight Python service that continuously monitors GPU job activity in the cluster, triggers node-pool scaling when needed, and posts notifications for team visibility. Think of it as the “eyes and hands” of the cost-aware GPU scheduler.

Design Principles:

  • Monitor pods, not jobs: Pod-level inspection (status: Pending/Running) is more reliable.
  • Secure authentication: Read the Civo API key from a Kubernetes Secret.
  • Least-privilege RBAC: ServiceAccount with read-only access to pods/jobs.
  • Backoff & cooldown: Avoid thrashing with exponential backoff or cooldown between scale actions.
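A fixed cooldown (used in the script that follows) already prevents most thrashing; for the exponential-backoff variant mentioned above, a minimal sketch looks like this (the base and cap values are illustrative):

```python
# Exponential backoff with an upper bound, for retrying failed
# scale actions without hammering the Civo API
def next_delay(attempt: int, base: float = 5, cap: float = 300) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))
```

With these defaults, the first retry waits 5 seconds, the fourth waits 40, and the delay is capped at 300 seconds from attempt 6 onward.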

Folder structure:

civo-gpu-scheduler/
├── gpu-test.yaml # GPU test pod manifest
├── watchdog_local.py # Watchdog script for local simulation
├── watchdog.py # Production-ready script (with CLI/API calls)
└── README.md

Example watchdog (Python, CLI-based for simplicity):

Import dependencies and configure logging:

import time, logging
logging.basicConfig(level=logging.INFO)

Define state management variables:

scaled_jobs = set()
COOLDOWN = 10
LAST_ACTION = 0

Function to detect pending GPU jobs:

def pending_gpu_jobs():
    return ["gpu-test"]  # Replace with a real pod query
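In production, the placeholder above can be replaced with a real query via the official kubernetes Python client. A sketch, assuming pods carry the gpu=on-demand label from Step 2 (the injectable v1 parameter exists mainly to make the function testable):

```python
def pending_gpu_jobs(v1=None):
    """Return the names of Pending pods labeled gpu=on-demand."""
    if v1 is None:
        # Lazy import so the module still loads where the client
        # library is not installed (e.g., during local simulation)
        from kubernetes import client, config
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        label_selector="gpu=on-demand",
        field_selector="status.phase=Pending",
    )
    return [pod.metadata.name for pod in pods.items]
```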

Continuous monitoring with scale-up and scale-down: the loop checks GPU job status, triggering a scale-up when pending jobs exist and the cooldown has passed, and a scale-down when no jobs remain and the cooldown has passed.

while True:
    pending = pending_gpu_jobs()
    now = time.time()

    # -----------------------------------------------
    # Scale-Up Logic
    # -----------------------------------------------
    # If GPU jobs are waiting and the cooldown has passed,
    # simulate a scale-up and send alerts.
    if pending and (now - LAST_ACTION) > COOLDOWN:
        for job in pending:
            logging.info(f"Pending GPU pod detected: {job}, simulating scale-up")
            print("Scale-up simulated (Civo API/CLI call would go here)")
            print(f"Slack alert: GPU scale-up triggered for job {job}")
            print(f"Telegram alert: GPU scale-up triggered for job {job}")
            scaled_jobs.add(job)
        LAST_ACTION = now

    # -----------------------------------------------
    # Scale-Down Logic
    # -----------------------------------------------
    # If no jobs remain and the cooldown has passed,
    # simulate a scale-down and send alerts.
    if not pending and scaled_jobs and (now - LAST_ACTION) > COOLDOWN:
        logging.info("No pending GPU pods: simulating scale-down")
        print("Scale-down simulated (Civo API/CLI call would go here)")
        print("Slack alert: GPU scale-down triggered")
        print("Telegram alert: GPU scale-down triggered")
        scaled_jobs.clear()
        LAST_ACTION = now

    # -----------------------------------------------
    # Sleep Interval
    # -----------------------------------------------
    # Avoids busy looping. Runs the check every 5 seconds.
    time.sleep(5)

Alerts are simulated. In production, replace print() with real Slack/Telegram webhooks.
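As a sketch of that replacement, the helper below posts to a Slack incoming webhook using only the standard library. The SLACK_WEBHOOK_URL environment variable is an assumed setup step (create an incoming webhook in your Slack workspace first):

```python
import json
import os
import urllib.request

def send_slack_alert(message: str) -> bool:
    """Post a message to a Slack incoming webhook; True on success."""
    url = os.getenv("SLACK_WEBHOOK_URL")
    if not url:
        return False  # alerting not configured; fail quietly
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A Telegram equivalent would call the Bot API's sendMessage endpoint in the same fashion.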

RBAC example

The watchdog requires specific Kubernetes permissions to monitor pods and jobs. This RBAC configuration follows least-privilege principles:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-watchdog
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-watchdog-role
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-watchdog-binding
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: gpu-watchdog
    namespace: kube-system
roleRef:
  kind: Role
  name: gpu-watchdog-role
  apiGroup: rbac.authorization.k8s.io

Step 4: Set cost guardrails

Effective cost management requires proactive guardrails that prevent budget overruns while maintaining development velocity. Civo's transparent billing model makes implementing these controls straightforward.

Suggested guardrails

| Rule | Description |
| --- | --- |
| High utilization rule | Maintain GPUs when average utilization > 70% for 5+ minutes. This preference for throughput ensures active experiments aren't interrupted by premature scale-down decisions. |
| Low utilization rule | Scale down when utilization < 50% for 10+ minutes and no pending GPU pods exist. This prevents immediate thrashing while ensuring cost efficiency. |
| Budget cap rule | Pause new GPU job scheduling when the monthly spend reaches 80% of the allocated budget. This hard stop prevents surprise bills while allowing teams to plan resource usage. |
| Time-based rules | Automatically scale down GPU resources during known low-usage periods (nights, weekends) unless explicitly overridden. |

Metrics and signals

Prefer telemetry over heuristics. Deploy the NVIDIA DCGM exporter and Prometheus to collect GPU utilization, and use PromQL rolling averages to implement the thresholds. If metrics are not available, fall back to pod-count heuristics.
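For example, the 70%/50% thresholds above can be evaluated with rolling averages over the DCGM exporter's utilization gauge (the metric name below follows the NVIDIA DCGM exporter's defaults; adjust it if your deployment relabels metrics):

```promql
# Per-GPU average utilization over the last 5 minutes
avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

# Cluster-wide rolling average, for the scale-up/scale-down rules
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```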

Budget checking via Civo API

Civo exposes billing endpoints to list account charges. Use the charges API to compute cumulative usage and enforce budget caps. Example API resource: GET https://api.civo.com/v2/charges. Parse responses safely and handle missing fields.

Example budget check (safe parsing and error handling recommended):

import requests, os

CIVO_API_KEY = os.getenv("CIVO_API_KEY")
HEADERS = {"Authorization": f"Bearer {CIVO_API_KEY}"}

def get_charges():
    resp = requests.get("https://api.civo.com/v2/charges", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

Caution: The exact billing object shape can change. Validate response fields before use.
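Building on get_charges() above, a budget-cap check might look like the following sketch. The per-charge "total" field is an assumption about the response shape, which is why the parsing is deliberately defensive:

```python
def over_budget(charges, monthly_budget, threshold=0.8):
    """True once cumulative spend reaches threshold * monthly_budget."""
    total = 0.0
    for charge in charges:
        # Defensive parsing: skip entries without a numeric total
        try:
            total += float(charge.get("total", 0))
        except (TypeError, ValueError):
            continue
    return total >= threshold * monthly_budget
```

For a 100-unit monthly budget, cumulative spend of 85 trips the default 80% cap, matching the budget cap rule above.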

Step 5: Test the workflow

Comprehensive testing ensures the system behaves correctly under various scenarios, from normal operations to edge cases and failure conditions.

Testing steps

  1. Deploy the watchdog as a Deployment in the cluster.
  2. Submit a short GPU job labeled gpu=on-demand.
  3. Verify the job becomes Pending → watchdog detects it → simulated scale-up occurs.
  4. Confirm simulated Slack/Telegram alerts appear with relevant context.
  5. Observe job runs on GPU node → completion triggers scale-down and alert.
  6. Test error conditions: API failures, network issues, budget limits.
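Steps 2–5 above reduce to a few commands (the gpu-watchdog Deployment name is this tutorial's convention from the RBAC section):

```shell
# Submit the labeled test pod and watch it go Pending -> Running
kubectl apply -f gpu-test.yaml
kubectl get pods -w

# Follow the watchdog's scaling decisions and simulated alerts
kubectl logs deploy/gpu-watchdog -f
```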

Here’s an example of the output you should see after running the watchdog script.


Monitor the dashboard and billing for expected activity. Pre-pull images or maintain a warm node for faster startup if needed.

Troubleshooting

Common issues and their solutions when deploying the GPU scheduler:

  • Pod pending: Check that tolerations/affinity labels match GPU node configuration exactly. Mismatched labels are the most common cause of scheduling failures.
  • Cold start issues: Civo's fast provisioning helps, but image pulls can still add 2–5 minutes. Pre-pull common images or maintain standby nodes for latency-sensitive workloads.
  • Webhook failures: Check logs and network egress rules. Ensure the cluster can reach external notification services.
  • Quota/API errors: Inspect CLI/API output carefully. GPU quotas may limit scaling, especially for new Civo accounts.

Summary

Civo's flexible node-pool management, strategic workload tagging, and telemetry-driven scaling together let you optimize GPU usage and costs. Best practices such as storing API keys in Secrets, granting minimal RBAC, and enforcing cooldown windows further improve the deployment's efficiency and reliability. With these strategies in place, GPU spending becomes predictable and policy-driven, and your GPU-accelerated workloads on Civo can run at full potential.


Mostafa Ibrahim

Software Engineer @ GoCardless

Mostafa Ibrahim is a software engineer and technical writer specializing in developer-focused content for SaaS and AI platforms. He currently works as a Software Engineer at GoCardless, contributing to production systems and scalable payment infrastructure.

Alongside his engineering work, Mostafa has written more than 200 technical articles reaching over 500,000 readers. His content covers topics including Kubernetes deployments, AI infrastructure, authentication systems, and retrieval-augmented generation (RAG) architectures.
