Understanding GPU cloud instance types: How to read a spec sheet for real-world ML performance
Written by
Marketing Team at Civo
A GPU spec sheet is a confidence trick. It looks like an objective document - numbers, units, comparable rows - but most of the numbers on it don't map cleanly to the performance a real workload will see. Teams that pick GPUs by reading the headline figures usually find out the gap between spec and reality somewhere around the first production run.
This is a working guide to reading GPU cloud instance specifications against actual ML workloads. The goal isn't to recommend a card. It's to give teams a framework for translating the marketing numbers into something they can use to make a decision, and to flag the specs that matter most for different kinds of ML work.
The numbers on the spec sheet, and what they actually mean
Every GPU spec sheet covers the same handful of categories. Each one tells you something useful, and each one has a way of being misleading if read in isolation.
Compute throughput in TFLOPS
The headline number on any GPU spec sheet is the peak compute throughput, usually expressed in teraflops or petaflops at a specific precision. A B200 spec sheet will list a number for FP4, another for FP8, another for FP16, and another for FP32. The numbers are real, but they're peak values measured under specific conditions, on specific operations, with the GPU running at thermal limits and the kernel hand-tuned for the benchmark.
What this means in practice: the TFLOPS number tells you the ceiling, not the floor. A workload that's well-suited to the architecture might hit 60-70% of peak in production. A workload that doesn't match the architecture's strengths might hit 15%. The spec sheet doesn't tell you which is which. The workload's structure does.
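The gap between the ceiling and what a workload achieves can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1,000 TFLOPS peak is a hypothetical figure and not any real card's spec, the function names are invented for this example, and the utilization fractions are simply the rough 65% and 15% cases described above.

```python
def sustained_tflops(peak_tflops: float, utilization: float) -> float:
    """Throughput actually delivered, given a peak figure and a utilization fraction."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_tflops * utilization


def measured_utilization(flops_per_step: float, step_seconds: float, peak_tflops: float) -> float:
    """Fraction of peak achieved by a timed step that performs a known number of FLOPs."""
    achieved_tflops = flops_per_step / step_seconds / 1e12
    return achieved_tflops / peak_tflops


PEAK_TFLOPS = 1000.0  # hypothetical headline figure, not a real spec

for label, utilization in [("well-matched workload", 0.65), ("poorly matched workload", 0.15)]:
    print(f"{label}: ~{sustained_tflops(PEAK_TFLOPS, utilization):.0f} TFLOPS sustained")

# A step that performs 3e14 FLOPs in 0.5 s achieves 600 TFLOPS, or 60% of this peak.
print(f"measured utilization: {measured_utilization(3e14, 0.5, PEAK_TFLOPS):.0%}")
```

Timing a real training or inference step and running it through something like `measured_utilization` is usually more informative than any published benchmark, because it reflects the workload's own structure.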
The other thing to watch for is precision. A card that delivers extraordinary FP4 or FP8 throughput might not deliver proportionally more FP32 performance. For workloads that can use low precision - most inference and a growing share of mixed-precision training - the lower-precision number is the relevant one. For workloads that need higher precision - scientific computing, certain training scenarios - the higher-precision number is what matters, and it often isn't the headline.
VRAM capacity
VRAM is the GPU's local memory, and it's usually the binding constraint on which models a GPU can run. A 40GB A100 can fit a different set of models, batch sizes, and context lengths than an 80GB version of the same card, and the difference matters more than the TFLOPS gap between them for many workloads.
The questions to ask about VRAM are concrete. Will the model and its activations fit? For training, will the model, the optimizer state, the gradients, and a useful batch size all fit at once? For inference, will the model and its KV cache fit at the context length the application requires?
Running out of VRAM doesn't just degrade performance. It usually means the workload doesn't run at all, or runs in a configuration so reduced that the math no longer works. Teams that underspec VRAM tend to discover this when they try to scale up.
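These questions can be answered roughly before renting anything. The following is a back-of-the-envelope sketch, not a substitute for a real run: the 16 bytes per parameter figure assumes mixed-precision training with Adam, the model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical 7B-class configuration, activations are left out entirely, and card sizes are treated as GiB for simplicity. All function names are invented for the example.

```python
GIB = 1024 ** 3  # bytes per GiB


def training_state_gib(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state, excluding activations.

    The 16 bytes/param default assumes mixed-precision Adam: 2 (16-bit weights)
    + 2 (16-bit gradients) + 12 (32-bit master weights and two Adam moments).
    """
    return n_params * bytes_per_param / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, batch_size, bytes_per_value=2):
    """Key and value tensors cached for every layer, token, and sequence in the batch."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return values * bytes_per_value / GIB


def fits(required_gib: float, vram_gib: float, usable_fraction: float = 0.9) -> bool:
    """True if the requirement fits within an assumed usable share of VRAM."""
    return required_gib <= vram_gib * usable_fraction


# Hypothetical 7B-parameter model; the shape values are illustrative assumptions.
train_gib = training_state_gib(7e9)
cache_gib = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096, batch_size=8)
weights_gib = 7e9 * 2 / GIB  # 16-bit weights for inference

print(f"training state: {train_gib:.1f} GiB, fits in 80 GiB: {fits(train_gib, 80)}")
print(f"inference weights + KV cache: {weights_gib + cache_gib:.1f} GiB, "
      f"fits in 40 GiB: {fits(weights_gib + cache_gib, 40)}")
```

Under these assumptions the training state alone exceeds a single 80 GiB card before any activations are counted, while the same model served at 16-bit precision fits on a 40 GiB card with its KV cache. The KV cache term scales linearly with both context length and batch size, which is why concurrency targets belong in the VRAM calculation.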
Memory bandwidth
Memory bandwidth is how fast data moves between the GPU's local memory and its compute cores. It's expressed in GB/s or TB/s, and for many real workloads, it matters more than peak TFLOPS. The reason is that modern deep learning workloads are often memory-bound - the compute units are sitting idle waiting for the next batch of data to arrive from VRAM.
A useful rule of thumb: divide the GPU's peak FLOPS by its memory bandwidth to get the number of operations the card can perform for each byte it moves. If the workload performs fewer operations per byte than that - its arithmetic intensity is lower - it will be memory-bound and won't see the full benefit of the compute. The exact intensity depends on the operation, but the pattern is consistent: high-bandwidth GPUs deliver better real-world performance than the TFLOPS number alone suggests, especially for transformer inference.
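The rule of thumb above is essentially the roofline model, and it fits in a few lines. This is a minimal sketch under stated assumptions: the card (1,000 TFLOPS peak, 4 TB/s bandwidth) is hypothetical, the function names are invented for the example, and the two arithmetic intensities are round illustrative values, not measurements of any particular kernel.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """FLOPs per byte needed before the card stops being limited by memory bandwidth."""
    return peak_tflops / bandwidth_tb_per_s


def attainable_tflops(peak_tflops: float, bandwidth_tb_per_s: float, flops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


# Hypothetical card, not a real spec sheet.
PEAK_TFLOPS = 1000.0
BANDWIDTH_TB_PER_S = 4.0

print(f"ridge point: {ridge_point(PEAK_TFLOPS, BANDWIDTH_TB_PER_S):.0f} FLOPs per byte")

for label, intensity in [("low-intensity op", 1.0), ("high-intensity op", 500.0)]:
    bound = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TB_PER_S, intensity)
    limiter = "memory" if bound < PEAK_TFLOPS else "compute"
    print(f"{label} ({intensity:.0f} FLOPs/byte): at most {bound:.0f} TFLOPS, {limiter}-bound")
```

On this hypothetical card, an operation doing one FLOP per byte can never exceed 4 TFLOPS no matter how large the headline number is, which is the sense in which bandwidth matters more than peak compute for memory-bound work.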
Interconnect speed
For distributed training, interconnect speed between GPUs is often the actual bottleneck. The TFLOPS number for a single card tells you nothing about how well eight of them will work together on a 70-billion-parameter model. The number that matters is the interconnect bandwidth - NVLink for intra-node, InfiniBand or high-speed Ethernet for inter-node.
Gradient synchronization across many GPUs has to push large quantities of data between cards on every training step. If the interconnect can't keep up, the GPUs spend most of their time waiting. The spec sheet usually tells you the per-GPU interconnect speed; what it doesn't always make clear is whether the cloud provider has built the cluster with that interconnect actually available, or whether the cards are connected through a slower fabric that bottlenecks the whole arrangement.
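A rough estimate shows how much the fabric matters. The sketch below assumes a ring all-reduce, in which each GPU sends and receives about 2(N-1)/N times the gradient size per synchronization. It ignores latency, overlap with computation, and sharded or hierarchical schemes, so treat it as an order-of-magnitude guide. The two bandwidth figures are hypothetical placeholders for a fast and a slow fabric, not quotes from any spec sheet, and the function name is invented for the example.

```python
def ring_allreduce_seconds(gradient_bytes: float, n_gpus: int, link_bytes_per_second: float) -> float:
    """Bandwidth-only time for one ring all-reduce of the full gradient."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if n_gpus == 1:
        return 0.0  # nothing to synchronize
    # Each GPU moves 2 * (N - 1) / N times the gradient size over its link.
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * gradient_bytes
    return traffic_bytes / link_bytes_per_second


# 70 billion parameters with 16-bit gradients, synchronized across 8 GPUs.
GRADIENT_BYTES = 70e9 * 2

for label, bandwidth in [("fast fabric (hypothetical 900 GB/s)", 900e9),
                         ("slow fabric (hypothetical 50 GB/s)", 50e9)]:
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, 8, bandwidth)
    print(f"{label}: ~{seconds:.2f} s per synchronization")
```

If a training step's compute takes well under a second, the slow-fabric case leaves the GPUs idle for most of each step. That is why the effective bandwidth of the cluster as built is worth asking the provider about directly.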
Power and thermal envelope
The spec sheet will quote a TDP - thermal design power, usually in watts. This number isn't directly relevant to the workload's performance, but it's a rough proxy for what the card is designed to sustain. Within a GPU family, a higher TDP usually means higher sustained performance under load; a lower-TDP variant will typically hit its power limit and throttle sooner if pushed hard.
In a cloud context, TDP also matters because it affects how densely the provider has packed cards into a node, and whether the cooling architecture can sustain peak performance across all the GPUs in a cluster simultaneously. A spec sheet that promises 700W per GPU is meaningful only if the cluster is cooled to deliver it.
What the spec sheet doesn't tell you
Several things matter for real-world ML performance that don't show up on the manufacturer's spec sheet at all. These are the variables that separate a workload that performs as expected from one that disappoints.
Software stack maturity
A GPU's real-world performance is heavily dependent on the software running on it: CUDA version, cuDNN, kernel libraries, framework versions, driver maturity. New GPUs often launch with software that lags the hardware by months or quarters, because the kernels haven't been optimized for the new architecture yet.
For ML teams, this means a newer card with theoretically higher TFLOPS may deliver lower real-world performance than an older card whose software stack is mature. The spec sheet doesn't capture this. Practical guidance: check the framework's support matrix, look for kernel optimizations in the framework's release notes, and test on the actual workload before committing.
Provisioning latency
The spec sheet tells you the peak performance once the GPU is running. It doesn't tell you how long it takes to get one running. In production ML, the difference between a GPU that's available in seconds and one that's available after a quota approval and a 45-minute provisioning step is significant.
Civo's GPU-enabled Kubernetes clusters are provisioned in under 90 seconds, which is a useful baseline for thinking about what "fast provisioning" means in this context. It also means experimentation, iteration, and incident response don't have to be planned around long provisioning waits.
Availability and quota
This is the one that catches many teams out. The spec sheet treats the GPU as available; in procurement reality, it often isn't. Hyperscaler quotas for the latest cards are tight, with approval cycles that can stretch out for weeks, and on-demand availability of H100 or B200 capacity is frequently constrained during peak demand. A spec sheet that quotes B200 performance isn't useful if the actual cards aren't available when the workload needs them.
The advantage of working with a provider focused on AI/ML workloads is that GPU capacity is the product, not a feature added to a general cloud catalog. Civo's GPU compute range covers A100, H100, H200, L40S, and B200 Blackwell - with Vera Rubin NVL72 available to reserve for Q1 2027 delivery.
Pricing structure
The spec sheet is silent on cost. The cost-per-FLOP or cost-per-VRAM-GB across providers can vary by a factor of three or more for the same hardware, and the structure of the pricing - what's included, what's charged separately, what's metered - matters as much as the headline rate.
Civo publishes an illustrative comparison for 8× A100 on-demand: $1.09 per hour against $3.43 for AWS, $3.67 for Google Cloud Platform, and $3.40 for Microsoft Azure. The headline difference is large, but the structural difference is more important: Civo doesn't charge for data ingress or egress, which means the total cost of running a real ML workload can be substantially different from what the per-hour rates alone suggest.
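To see how pricing structure changes the picture, it helps to model the whole bill and not just the hourly rate. The sketch below is hedged accordingly: the two hourly rates are the $1.09 and $3.43 figures quoted above, but the 100 hours of runtime, the 5,000 GB of egress, and the $0.09 per GB egress charge are hypothetical placeholders chosen for illustration and are not any provider's published price. The function name is invented for the example.

```python
def total_cost(hourly_rate: float, hours: float, egress_gb: float = 0.0,
               egress_rate_per_gb: float = 0.0) -> float:
    """Compute cost plus metered egress, in the same currency as the rates."""
    return hourly_rate * hours + egress_gb * egress_rate_per_gb


HOURS = 100.0        # hypothetical run length
EGRESS_GB = 5000.0   # hypothetical volume of checkpoints and results moved out
PLACEHOLDER_EGRESS_RATE = 0.09  # hypothetical per-GB charge, not a quoted price

free_egress = total_cost(1.09, HOURS, EGRESS_GB, egress_rate_per_gb=0.0)
metered_egress = total_cost(3.43, HOURS, EGRESS_GB, egress_rate_per_gb=PLACEHOLDER_EGRESS_RATE)

print(f"hourly rate only:   {1.09 * HOURS:.2f} vs {3.43 * HOURS:.2f}")
print(f"with egress added:  {free_egress:.2f} vs {metered_egress:.2f}")
```

With these placeholder inputs, the metered egress charge is larger than the compute charge it sits beside. Substituting a team's real data volumes and each provider's published rates is the only way to know whether that holds for a given workload.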
How to read a spec sheet against your actual workload
The practical method is to translate the spec sheet into the answers to a small number of workload-specific questions.
For training large models
- Does the VRAM fit the model, optimizer state, gradients, and a useful batch size at the target sequence length?
- Is the memory bandwidth high enough that the workload won't be memory-bound at the chosen batch size?
- Is the inter-GPU interconnect fast enough to support gradient synchronization across the number of cards the training run needs?
- Does the cloud provider have the specific multi-GPU configuration the workload requires?
The right card here is usually whatever maximizes VRAM and interconnect bandwidth within the budget. Peak TFLOPS matters less than people expect.
For fine-tuning and parameter-efficient methods
- Can the base model fit at inference precision, with enough headroom for the adapter weights and activations?
- Is the workload short enough that single-GPU performance dominates, and interconnect doesn't matter?
- Is the per-hour cost low enough that iteration is cheap?
Single-card workloads benefit from cards in the A100 or L40S range more often than from top-of-the-line training cards. The headline number on a B200 spec sheet may not be worth the price premium for a workload that doesn't saturate it.
For inference
- Does the VRAM fit the model and the KV cache at the target context length and concurrency?
- Is the per-request latency acceptable at the precision the application can tolerate?
- Is the per-request cost competitive at the workload's volume?
Inference workloads often benefit from cards optimized for lower precision and lower power, not the highest training-class cards. The spec sheet's FP4 or FP8 numbers, plus the memory bandwidth, are usually more relevant than peak FP16 TFLOPS.
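One way to see why precision and bandwidth dominate for inference: when generating text one token at a time at batch size 1, each new token requires reading roughly every model weight from VRAM once. That gives a simple upper bound on decoding speed. The sketch below assumes exactly that, ignoring KV cache reads, batching, and kernel overheads, so real throughput will be lower. The 70-billion-parameter model and the 4 TB/s bandwidth are hypothetical illustration values, and the function name is invented for the example.

```python
def max_decode_tokens_per_second(n_params: float, bytes_per_param: float,
                                 bandwidth_bytes_per_second: float) -> float:
    """Upper bound on batch-1 decoding speed if every weight is read once per token."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_second / model_bytes


N_PARAMS = 70e9     # hypothetical model size
BANDWIDTH = 4e12    # hypothetical 4 TB/s card, not a real spec

for label, bytes_per_param in [("16-bit weights", 2.0), ("8-bit weights", 1.0), ("4-bit weights", 0.5)]:
    bound = max_decode_tokens_per_second(N_PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{label}: at most ~{bound:.0f} tokens/s per sequence")
```

Halving the bytes per weight doubles the bound, and so does doubling the bandwidth; peak TFLOPS does not appear in the formula at all. Larger batches raise arithmetic intensity and bring compute back into play, which is one reason to test at the concurrency the application will see.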
For research and experimentation
- Is the card available on demand, without quota approval?
- Is it cheap enough that the team can run many experiments without budget anxiety?
- Does the platform support fast provisioning, so iteration isn't blocked by waiting for capacity?
For this profile, mid-range cards are often the right answer. The flexibility to provision quickly, run cheaply, and tear down is worth more than peak performance for any single experiment.
The pragmatic checklist
Pulling it together, a working checklist for reading a GPU cloud spec sheet:
- Translate the peak TFLOPS into a realistic sustained figure based on the workload's structure, not the marketing benchmark
- Confirm VRAM fits the model and the operational headroom - including optimizer state for training and KV cache for inference
- Check memory bandwidth against the workload's memory pattern - high-bandwidth cards punch above their TFLOPS for memory-bound workloads
- For distributed training, check the interconnect specification of the actual cluster - not just the per-GPU number
- Verify the software stack is mature for your framework version - and benchmark on a real workload if it isn't
- Confirm availability and provisioning speed match the team's operating model
- Model the total cost including egress and any metered services - the per-hour rate is rarely the full picture
Spec sheets exist to sell GPUs. Workloads exist to make money or answer research questions. The two only line up when the team reading the spec sheet knows which numbers are doing real work for their specific case.
Civo offers transparent pricing, fast provisioning, and the latest NVIDIA hardware across GPU compute and Kubernetes GPU configurations, with no hidden charges that distort the cost comparison.
