How to monitor and optimize GPU utilization in the cloud
Written by
Marketing Team at Civo
GPU utilization is one of the most expensive metrics in cloud infrastructure to get wrong. A GPU running at 30% utilization costs the same as one running at 90%, but it's doing a third of the useful work. For workloads measured in tens of thousands of GPU-hours, the difference between average utilization in the 30s and average utilization in the 70s can run to tens or even hundreds of thousands of dollars across the life of the workload.
The problem is that utilization is hard to monitor well and harder to optimize. The standard tools give a single number that hides what's actually happening underneath. The number can look fine while the workload is wasting half the GPU's capability. Engineering teams who care about cost-efficiency need a more honest picture of what's going on inside the GPU and what the levers are for improving it.
This is a working guide to monitoring and optimizing GPU utilization in the cloud, with the goal of treating GPU spend as a variable to manage rather than a fixed overhead.
What "GPU utilization" actually means
The most common GPU utilization metric (the one shown by nvidia-smi and most cloud dashboards) is the percentage of time the GPU has at least one kernel running. This is a useful baseline but a poor measure of actual usage. A GPU showing 90% utilization may be running kernels that use 10% of its compute capacity. The GPU is busy, but it's not productive.
The honest picture requires several metrics together:
- GPU utilization (the standard metric): Percentage of time the GPU is active. Useful as a floor; if this is low, optimization opportunities are large.
- SM (Streaming Multiprocessor) occupancy: How fully the GPU's compute units are loaded with work, measured as the share of each SM's thread capacity that is actually occupied. A more granular view of "is the GPU busy?"
- Memory utilization: Percentage of GPU memory in use. Important for understanding whether the workload is constrained by memory capacity.
- Memory bandwidth utilization: How much of the GPU's memory bandwidth is being used. For memory-bound workloads, this is often the binding constraint rather than compute.
- Tensor Core utilization: For AI workloads, how much of the specialized Tensor Core capacity is being used. Low Tensor Core utilization on an H100 or B200 is often the most expensive form of underutilization.
- Model FLOPS Utilization (MFU): Ratio of actual FLOPS achieved to peak FLOPS. The single best measure of how efficiently the workload is using the hardware.
A workload showing 95% GPU utilization, 35% SM occupancy, and 40% MFU is busy but inefficient. The optimization opportunity is large.
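As a rough illustration of how MFU can be estimated from measured throughput, the sketch below assumes dense transformer training, where a common approximation is about 6 FLOPs per parameter per token (attention FLOPs are ignored). The peak figure is the A100's published dense BF16 Tensor Core rate; the model size and tokens-per-second values are hypothetical.

```python
def estimate_mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    """Estimate Model FLOPS Utilization for dense transformer training.

    Uses the ~6 * params FLOPs-per-token approximation (forward + backward),
    which is an assumption and ignores attention cost.
    """
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / peak_flops


# Published dense BF16 Tensor Core peak for an A100 (FLOPs per second).
A100_BF16_PEAK = 312e12

# Hypothetical workload: a 7B-parameter model at 3,000 tokens/s per GPU.
mfu = estimate_mfu(params=7e9, tokens_per_second=3000, peak_flops=A100_BF16_PEAK)
print(f"Estimated MFU: {mfu:.1%}")
```

The useful property of this estimate is that it depends only on end-to-end throughput, so it cannot look healthy while the hardware is being wasted.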
How to monitor properly
The instrumentation for honest GPU monitoring has matured significantly, and the metrics above can now be collected with standard tooling.
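A minimal starting point, sketched here under the assumption that nvidia-smi is installed on the node, is to poll its CSV query output and parse it. Note that nvidia-smi's utilization.memory field reports the share of time memory was being read or written, not how much memory is allocated, so the sketch derives a separate occupancy percentage. Finer-grained signals such as SM occupancy and Tensor Core activity are not available this way and typically come from NVIDIA DCGM or a profiler.

```python
import subprocess

QUERY_FIELDS = ["index", "utilization.gpu", "utilization.memory",
                "memory.used", "memory.total"]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes every field is numeric; some GPUs report "[N/A]" for certain
    fields, which this sketch does not handle.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        row = {"index": int(values[0])}
        for name, value in zip(QUERY_FIELDS[1:], values[1:]):
            row[name] = float(value)
        # Memory occupancy, as distinct from memory-controller activity.
        row["memory.percent"] = 100.0 * row["memory.used"] / row["memory.total"]
        stats.append(row)
    return stats


def sample_gpus() -> list[dict]:
    """Run nvidia-smi once and return per-GPU stats (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


# Captured output from a hypothetical two-GPU node, so the parser can be
# exercised without a GPU present.
sample_output = "0, 92, 41, 61440, 81920\n1, 12, 3, 2048, 81920\n"
for gpu in parse_gpu_stats(sample_output):
    print(gpu)
```

Calling sample_gpus() on a schedule and shipping the rows to a metrics store gives the baseline view; the point of separating parsing from collection is that the parsing logic can be checked anywhere.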
Optimization 1: Right-size the GPU to the workload
The first optimization is often the largest. Many teams default to the most powerful GPU they can get, regardless of whether the workload needs it. An H100 or B200 running at 25% utilization is much more expensive per useful output than an A100 running at 75%.
The questions to ask of each workload:
- Does the workload's memory requirement fit a smaller GPU? An 80GB A100 covers many workloads that don't need an H100.
- Is the workload memory-bandwidth-bound or compute-bound? Memory-bandwidth-bound workloads benefit less from the higher TFLOPS of newer cards.
- Is the workload at a precision that benefits from the newer cards' Tensor Core capabilities? FP4 and FP8 support on B200 only helps if the workload uses them.
Civo's GPU range (A100 40GB at $1.09/hour on-demand, A100 80GB at $1.79/hour, H100 PCIe at $2.49/hour, H100 SXM at $2.99/hour, H200 SXM at $3.49/hour, L40S at $1.29/hour) gives teams the flexibility to match hardware to workload. For a workload running at 30% utilization on an H100, moving to an A100 80GB cuts the hourly rate by roughly 28% (from H100 PCIe) to 40% (from H100 SXM) while potentially improving utilization.
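One simple way to screen for right-sizing candidates is to divide the hourly rate by measured utilization. The sketch below uses the on-demand rates quoted above; the 30% and 75% utilization figures are illustrative. It is a first-order screen only: an hour of full utilization on an H100 does more work than the same hour on an A100, so a real comparison should use measured throughput per dollar.

```python
# On-demand hourly rates quoted in the article (USD).
RATES = {"A100 80GB": 1.79, "H100 PCIe": 2.49, "H100 SXM": 2.99}


def cost_per_utilized_hour(hourly_rate: float, utilization: float) -> float:
    """Effective cost of one fully utilized GPU-hour at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


# Illustrative utilization levels, not measurements.
h100 = cost_per_utilized_hour(RATES["H100 SXM"], 0.30)
a100 = cost_per_utilized_hour(RATES["A100 80GB"], 0.75)
print(f"H100 SXM at 30%: ${h100:.2f} per utilized hour")
print(f"A100 80GB at 75%: ${a100:.2f} per utilized hour")
```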
Optimization 2: Increase batch size
The second optimization addresses the most common cause of low utilization: the workload isn't asking the GPU to do enough work at once. The fix is increasing the batch size until the GPU is well-loaded.
The trade-offs:
- Larger batch size means higher GPU utilization but more memory consumption
- For training, very large batch sizes can affect convergence
- For inference, larger batch sizes mean higher throughput but worse latency
For training workloads, the practical method is to grow the batch size as large as it fits in memory, then use gradient accumulation if a still-larger effective batch size is needed for convergence. For inference, dynamic batching with a bounded wait time gives most of the throughput benefit without significant latency cost.
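Dynamic batching with a bounded wait can be sketched with nothing more than a queue and a deadline. The function name, batch limit, and wait time below are illustrative assumptions, not the API of any particular serving framework.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Gather up to max_batch_size items, waiting at most max_wait_s after the first."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget spent: send what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the wait budget
    return batch


pending = queue.Queue()
for i in range(5):
    pending.put(f"request-{i}")

first = collect_batch(pending, max_batch_size=3, max_wait_s=0.05)
second = collect_batch(pending, max_batch_size=3, max_wait_s=0.05)
print(len(first), len(second))
```

The first call fills immediately because enough requests are queued; the second returns a partial batch once the wait expires, which is what caps the added latency at max_wait_s.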
The instrumentation to know this is working is straightforward: GPU utilization, SM occupancy, and throughput should all increase as batch size grows. When they stop increasing, the workload has hit a different bottleneck.
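A batch size sweep along these lines can be automated. In the sketch below, measure_throughput is a hypothetical callable that the team would supply (for example, a short benchmark returning samples per second), and the synthetic function stands in for it so the logic can be run anywhere. The 10% gain threshold is an arbitrary choice.

```python
from typing import Callable


def find_batch_size_knee(measure_throughput: Callable[[int], float],
                         start: int = 8,
                         max_batch: int = 1024,
                         min_gain: float = 0.10) -> int:
    """Double the batch size until throughput improves by less than min_gain."""
    best = start
    best_throughput = measure_throughput(start)
    batch = start * 2
    while batch <= max_batch:
        throughput = measure_throughput(batch)
        if throughput < best_throughput * (1 + min_gain):
            break  # diminishing returns: a different bottleneck has taken over
        best, best_throughput = batch, throughput
        batch *= 2
    return best


def synthetic_throughput(batch: int) -> float:
    """Stand-in for a real benchmark: saturates near 1000 samples/s."""
    return 1000.0 * batch / (batch + 64)


print(find_batch_size_knee(synthetic_throughput))
```

A real benchmark would also need to catch out-of-memory errors and treat them as the upper bound of the sweep.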
Optimization 3: Fix the data loading pipeline
The third common cause of low utilization is data starvation. If the GPU is waiting for the next batch of data to arrive, no amount of batch size tuning will help.
The symptoms:
- GPU utilization sits low while CPU utilization is high
- Profiler shows significant time in data loading operations
- Increasing batch size produces less throughput improvement than expected
The fixes:
- Use multiple data loader workers to parallelize data preprocessing
- Pre-fetch the next batch while the current batch is being processed
- Pin memory to speed up host-to-device transfers
- Cache preprocessed data if the preprocessing is expensive
- Move preprocessing to the GPU if it's CPU-bound and parallelizable
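The pre-fetching idea can be illustrated with a background thread and a bounded queue. This is a framework-agnostic sketch with hypothetical names; in practice, data loaders such as PyTorch's DataLoader expose the same behavior through worker-count, prefetch, and pinned-memory settings.

```python
import queue
import threading
import time
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches: Iterable, buffer_size: int = 2) -> Iterator:
    """Yield items from `batches`, loading up to buffer_size ahead in a thread."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for item in batches:
                buffer.put(item)  # blocks when the buffer is full
        finally:
            # Always signal completion; this sketch does not propagate
            # loader errors to the consumer.
            buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


def slow_loader(n: int) -> Iterator[int]:
    """Stand-in for disk reads plus preprocessing."""
    for i in range(n):
        time.sleep(0.01)
        yield i


loaded = []
for batch in prefetch(slow_loader(5)):
    time.sleep(0.01)  # stand-in for the GPU step; loading overlaps with it
    loaded.append(batch)
print(loaded)
```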
For workloads that read large datasets from object storage, the storage layer can itself be the bottleneck. The architectural fix is to keep data close to the compute, which is straightforward on platforms where storage and GPU compute live on the same physical infrastructure.
Optimization 4: Use mixed precision
The fourth optimization is precision. Modern GPUs include specialized hardware for lower-precision operations - Tensor Cores for FP16/BF16, and FP8 support on H100 and B200. Using these instead of FP32 can produce substantial throughput improvements with minimal accuracy loss for most ML workloads.
The practical steps:
- For training, enable mixed-precision training in the framework (PyTorch's torch.amp autocast, exposed as torch.cuda.amp in older releases; TensorFlow's mixed_float16 policy)
- For inference, quantize the model to FP16 or INT8 where the application can tolerate it
- For the latest hardware, evaluate FP8 inference for transformer workloads
The combination of higher throughput and lower memory consumption from mixed precision often unlocks larger batch sizes too, compounding the benefit.
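The memory side of this can be made concrete with simple arithmetic on weight storage. The sketch below covers model weights only, which is the relevant figure for inference; training also holds gradients, optimizer state, and activations, and mixed-precision training commonly keeps an FP32 copy of the weights. The 7B parameter count is a hypothetical example.

```python
# Storage size per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to hold model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical 7B-parameter model.
for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.0f} GB")
```

Memory freed by the smaller weight format is what becomes available for larger batches or a longer context.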
Optimization 5: Share GPUs across workloads
The fifth optimization addresses workloads that don't need a full GPU. For inference workloads with low throughput requirements, or development workloads that don't run continuously, sharing a GPU across multiple workloads improves utilization without requiring rewrites.
The mechanisms:
- Multi-Instance GPU (MIG) on A100 and H100 partitions a single physical GPU into multiple isolated instances. Each instance has dedicated memory and compute, but the physical card is shared.
- MPS (Multi-Process Service) allows multiple processes to share a GPU more efficiently than the default time-slicing model.
- Kubernetes GPU sharing through device plugins enables multiple pods to share GPU resources where the workloads tolerate it.
For development and small-scale inference workloads, GPU sharing can dramatically improve utilization with little effect on individual workload performance, provided the combined demand fits the card. For Civo's Kubernetes GPU clusters, the standard Kubernetes GPU resource model supports these patterns directly.
Optimization 6: Scale down when you're not using it
The sixth optimization is the simplest and most often missed. GPUs running at 0% utilization still cost the full hourly rate. The fix is to shut them down when they're not being used.
The patterns:
- Development workflows: shut down dev instances at the end of the day, restart in the morning
- Training workflows: provision capacity at the start of a run, release it at the end, not "just in case"
- Inference workflows: use autoscaling to scale down during off-peak hours
This sounds obvious, but is one of the largest sources of waste in practice. The team's GPU bill often includes significant spend on instances nobody is actively using.
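Finding those instances can start with a simple rule over utilization history. The sketch below flags a GPU as idle when its most recent samples all fall below a threshold; the 5% threshold, the window of six samples, and the sampling interval are assumptions to tune per team.

```python
def is_idle(utilization_samples: list[float],
            threshold_pct: float = 5.0,
            min_consecutive: int = 6) -> bool:
    """True if the most recent min_consecutive samples are all below threshold_pct."""
    if len(utilization_samples) < min_consecutive:
        return False  # not enough history to decide
    recent = utilization_samples[-min_consecutive:]
    return all(sample < threshold_pct for sample in recent)


# Hypothetical samples taken every five minutes.
finished_job = [80, 75, 2, 0, 0, 1, 0, 0]
still_active = [80, 75, 2, 0, 0, 1, 0, 40]
print(is_idle(finished_job), is_idle(still_active))
```

Requiring several consecutive low samples avoids tearing down an instance during a brief pause between training steps or requests.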
For Civo's GPU compute and Kubernetes GPU clusters, fast provisioning (under 90 seconds for new GPU nodes) makes this pattern practical - instances can be torn down and recreated quickly enough that "leave it running just in case" stops being the easier option.
Optimization 7: Move expensive workloads to commitments
The seventh optimization is structural. Workloads with stable, high-utilization patterns benefit from committed pricing, which exchanges flexibility for cost. Civo's pricing includes options for 6, 12, 24, and 36-month commitments at progressively discounted rates compared to on-demand.
For an A100 80GB, the on-demand rate of $1.79/hour drops to $1.39/hour with a 36-month commitment, a reduction of about 22%. For sustained workloads, this is a substantial saving. The question to ask before committing is whether the workload's profile is genuinely stable - committed capacity that sits idle is more expensive than on-demand capacity that's right-sized.
The honest analysis combines utilization data with workload forecasting. Teams that know their workload patterns can model the cost of committed versus on-demand and pick the right mix.
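A first pass at that model is a break-even calculation. The sketch below uses the A100 80GB rates quoted above and assumes that committed capacity is billed for every hour of the month while on-demand capacity is billed only for the hours it is actually running; the 730-hour month and the example hour counts are illustrative.

```python
ON_DEMAND = 1.79  # USD/hour, A100 80GB on-demand (from the article)
COMMITTED = 1.39  # USD/hour, A100 80GB with a 36-month commitment (from the article)


def break_even_fraction(on_demand: float, committed: float) -> float:
    """Fraction of hours a GPU must be needed for the commitment to be cheaper."""
    return committed / on_demand


def monthly_cost(hours_needed: float, hours_in_month: float = 730.0) -> dict:
    """Compare paying on-demand for needed hours against committing to every hour."""
    return {
        "on_demand": ON_DEMAND * hours_needed,
        "committed": COMMITTED * hours_in_month,
    }


print(f"Break-even: {break_even_fraction(ON_DEMAND, COMMITTED):.0%} of hours")
print(monthly_cost(hours_needed=500))
print(monthly_cost(hours_needed=700))
```

Under these assumptions the commitment only pays off when the GPU is needed for more than roughly three quarters of all hours, which is why utilization data has to come before the pricing decision.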
Putting it together
Pulling the optimizations into a working approach:
- Instrument honestly to measure GPU utilization, SM occupancy, memory bandwidth, Tensor Core usage, and MFU
- Right-size the GPU to each workload, not the most powerful card available
- Tune batch size and data loading to keep the GPU well-fed
- Use mixed precision to extract more throughput from the hardware
- Share GPUs across workloads where the pattern allows
- Scale down GPUs that aren't actively in use
- Commit on stable workloads to capture the discount
The compound effect of doing all of these is significant. A team that systematically applies the framework can see a 2-3x improvement in cost per useful output, which translates to substantial budget savings on any workload large enough to matter.
For workloads on Civo's Cloud GPU range, transparent pricing, full NVIDIA hardware coverage, fast provisioning, and standard observability tooling make systematic optimization practical - talk to the Civo team to get started.
