Matching GPU choice to your ML training method: Fine-tuning, LoRA, RAG, and inference


Written by

Civo Team

Marketing Team at Civo

The standard advice for choosing an ML GPU is to look at the workload type and pick accordingly. That advice is right but underspecified. "Training" and "inference" are broad categories, and the infrastructure each one needs varies dramatically with the specific method a team uses. A full fine-tune of a 70-billion-parameter model and a LoRA fine-tune of the same model want very different hardware. A RAG pipeline and a batch inference job built on the same model use the GPU in nearly opposite ways.

This post goes method by method through matching GPU choice to ML work. Each method has a characteristic profile - what it stresses, what it doesn't, what hardware delivers the best performance per dollar - and choosing well at this level matters more than chasing the latest card. Provider-level evaluation criteria - performance, cost, SLAs, ecosystem fit - sit above this decision; once the provider shortlist is set, the work is matching the specific hardware to the specific method.

Full fine-tuning: The heaviest GPU workload most teams will run

Full fine-tuning means updating every parameter in a pretrained model with new data. The model, its optimizer state, its gradients, and the activations all have to fit in GPU memory at once, and for any model above a few billion parameters, this is the workload most likely to push a single GPU past its limits.

A 7-billion-parameter model trained in mixed precision with an Adam-style optimizer needs roughly 56GB of VRAM for the optimizer's two FP32 moment estimates alone (8 bytes per parameter), before accounting for model weights, gradients, and activations. That pushes the total well beyond what a single 40GB GPU can handle. Memory requirements scale roughly linearly with parameter count, so a 70B model is comfortably out of single-GPU range and has to be split across multiple cards with carefully designed parallelism.
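To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. It assumes an Adam-style optimizer with 16-bit weights and gradients, an FP32 master copy of the weights, and two FP32 moment estimates per parameter. The function name and the byte counts are illustrative assumptions rather than anything specific to a framework, and activations are deliberately left out because they depend on batch size and sequence length.

```python
def full_finetune_memory_gb(params_billion: float,
                            weight_bytes: int = 2,         # assumed 16-bit weights
                            grad_bytes: int = 2,           # assumed 16-bit gradients
                            master_weight_bytes: int = 4,  # assumed FP32 master copy
                            adam_moment_bytes: int = 8) -> dict:
    """Rough per-component VRAM estimate in GB, excluding activations.

    One billion parameters at one byte each is one (decimal) gigabyte,
    so bytes-per-parameter times billions-of-parameters gives GB directly.
    """
    parts = {
        "weights": weight_bytes * params_billion,
        "gradients": grad_bytes * params_billion,
        "master_weights": master_weight_bytes * params_billion,
        "adam_moments": adam_moment_bytes * params_billion,
    }
    parts["total_excl_activations"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for size in (7, 70):
        estimate = full_finetune_memory_gb(size)
        print(f"{size}B parameters: {estimate}")
```

Under these assumptions the optimizer moments for a 7B model come to 56GB and the total before activations is about 112GB, which is why even an 80GB card needs sharding or offloading for a full fine-tune of that size.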

What this means for GPU choice:

  • VRAM is the binding constraint: Higher-VRAM cards win even if their peak TFLOPS are lower than alternatives. The 80GB A100 is more useful than the 40GB for any non-trivial fine-tune; the H100 and B200 extend the ceiling further.
  • Interconnect bandwidth matters as much as raw compute: Multi-GPU fine-tuning is gradient-synchronization-heavy, and a card with a fast interconnect (NVLink, InfiniBand) will outperform a faster card connected through a slower fabric.
  • Sustained throughput beats peak performance: Full fine-tunes can run for days; a card that holds 70% of peak under sustained load is more valuable than one that hits 95% in a benchmark and throttles in production.
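The interconnect point can be sketched with the standard bandwidth-only estimate for a ring all-reduce, in which each GPU sends and receives roughly 2(N-1)/N times the gradient buffer per synchronization. This is a simplified lower bound that ignores latency and any overlap with compute. The function name is hypothetical, and the link speeds in the example are placeholder values chosen only to contrast a fast and a slow fabric; substitute measured figures for your own hardware.

```python
def ring_allreduce_seconds(grad_gb: float, num_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only lower bound on one gradient all-reduce, in seconds."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    # Each GPU moves 2 * (N - 1) / N of the buffer in a ring all-reduce.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * grad_gb
    return traffic_gb / link_gb_per_s


if __name__ == "__main__":
    grad_gb = 14.0  # e.g. 16-bit gradients for a 7B-parameter model
    for label, bandwidth in (("fast fabric (placeholder)", 450.0),
                             ("slow fabric (placeholder)", 50.0)):
        seconds = ring_allreduce_seconds(grad_gb, num_gpus=8, link_gb_per_s=bandwidth)
        print(f"{label}: {seconds:.3f} s per synchronization")
```

If the synchronization time is comparable to the compute time of a training step, the faster card spends much of that step waiting, which is the situation the interconnect bullet describes.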

For most full fine-tunes, the H100 or H200 is the right answer where budget allows, with the A100 as the reliable workhorse where it doesn't. The B200 is increasingly the right call for very large models or where the FP8 throughput translates well to the workload. Civo's GPU range covers all of these, with the A100, H100, H200, L40s, and B200 Blackwell available on demand, and Vera Rubin NVL72 available for early access reservation ahead of Q1 2027 delivery.

LoRA and parameter-efficient fine-tuning: A different game entirely

LoRA - low-rank adaptation - and the broader family of parameter-efficient fine-tuning methods change the GPU requirements profoundly. Instead of updating every parameter, LoRA trains a small set of adapter matrices alongside the frozen base model. The optimizer state shrinks by orders of magnitude. The gradients shrink with it. The memory ceiling that constrains full fine-tuning relaxes substantially.
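A quick count shows how small the trainable set becomes. For a weight matrix of shape d_out x d_in, a rank-r adapter adds two matrices with r x d_in and d_out x r entries, so r * (d_in + d_out) trainable parameters. The sketch below assumes square 4096-wide projections and two adapted matrices per layer across 32 layers; those shapes are illustrative assumptions, not a description of any particular model.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, num_adapted_matrices: int) -> int:
    """Trainable parameters added by rank-`rank` adapters on matrices of shape d_out x d_in."""
    per_matrix = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return per_matrix * num_adapted_matrices


if __name__ == "__main__":
    # Assumed shapes: 32 layers, two adapted 4096 x 4096 projections per layer.
    trainable = lora_trainable_params(d_in=4096, d_out=4096, rank=16,
                                      num_adapted_matrices=32 * 2)
    base_params = 7_000_000_000
    print(f"trainable adapter parameters: {trainable:,}")
    print(f"share of a 7B base model: {trainable / base_params:.4%}")
```

Gradients and optimizer state are only kept for these adapter parameters, which is where the memory saving comes from.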

This has real implications for hardware choice. A model that needs four H100s for a full fine-tune can often be LoRA-tuned on a single A100, sometimes even on a single L40s if the base model can be loaded in 8-bit or 4-bit precision. The economics change dramatically: a single-GPU LoRA workflow is an order of magnitude cheaper than the equivalent full fine-tune, and the iteration loop is faster because the team isn't waiting for cluster scheduling.

The practical guidance:

  • Single-card workloads are usually optimal: LoRA workflows benefit from a single GPU with enough VRAM for the base model at inference precision, plus the adapter weights and activations. Multi-GPU LoRA setups exist but are usually overengineered for the workload.
  • Mid-range cards punch above their weight: The L40s and A100 are often the sweet spot for LoRA work - they offer enough VRAM to hold the base model and enough compute to train the adapters efficiently, at a substantially lower hourly rate than top-of-line training cards.
  • Iteration speed matters more than peak performance: LoRA workflows are typically run repeatedly during model development - adjusting hyperparameters, swapping base models, testing different adapter ranks. Provisioning latency and per-experiment cost dominate the total experience.
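The single-card sizing rule above can be roughed out as follows. This sketch assumes the frozen base model is held at a chosen bit width and that each trainable adapter parameter costs about 16 bytes (16-bit weight and gradient, FP32 master copy, two FP32 Adam moments). Both the function name and the 16-byte figure are assumptions for illustration, and activations and quantization overhead are not included.

```python
def lora_memory_gb(base_params_billion: float,
                   base_bits: int,
                   adapter_params_million: float,
                   bytes_per_trainable_param: int = 16) -> dict:
    """Rough VRAM estimate in GB for LoRA training, excluding activations."""
    base_gb = base_params_billion * base_bits / 8  # frozen weights at the chosen precision
    adapter_gb = adapter_params_million * 1e6 * bytes_per_trainable_param / 1e9
    return {
        "base_gb": base_gb,
        "adapter_gb": adapter_gb,
        "total_excl_activations_gb": base_gb + adapter_gb,
    }


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B base at {bits}-bit:", lora_memory_gb(70, bits, adapter_params_million=50))
```

The base model dominates the total in every case, so the precision at which it is loaded largely decides which card the job fits on.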

For teams doing significant volumes of LoRA work, the cost difference between hyperscaler pricing and a more focused provider compounds quickly. Civo's on-demand rate of $1.09/GPU/hr for the A100 40GB compares favourably to Google Cloud's equivalent at $3.67/hr per GPU - a saving of over 70% on the same hardware.
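The saving quoted above is straightforward to check, and the same arithmetic extends to a batch of experiments. The two hourly rates below are the ones quoted in this post; the number of runs and hours per run are hypothetical values chosen only to show how the gap compounds.

```python
def percent_saving(rate: float, reference_rate: float) -> float:
    """Percentage saved by paying `rate` instead of `reference_rate`."""
    return (1 - rate / reference_rate) * 100


def experiment_cost(hourly_rate: float, runs: int, hours_per_run: float) -> float:
    """Total cost of `runs` single-GPU experiments of `hours_per_run` each."""
    return hourly_rate * runs * hours_per_run


if __name__ == "__main__":
    civo_rate, gcp_rate = 1.09, 3.67  # per-GPU hourly rates quoted in the text
    print(f"saving: {percent_saving(civo_rate, gcp_rate):.1f}%")
    runs, hours = 200, 3.0            # hypothetical experiment volume
    print(f"cost at $1.09/hr: ${experiment_cost(civo_rate, runs, hours):,.2f}")
    print(f"cost at $3.67/hr: ${experiment_cost(gcp_rate, runs, hours):,.2f}")
```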

RAG: A workload that barely touches the GPU during retrieval

Retrieval-augmented generation has become one of the dominant patterns for production LLM applications, but it has an unusual infrastructure profile because most of the work happens off the GPU. The pipeline typically involves an embedding step (which uses a GPU briefly), a retrieval step against a vector database (which doesn't use a GPU at all), and a generation step that combines the retrieved context with a user query and runs it through an LLM.

The GPU requirements for RAG break down into three distinct workloads:

Embedding generation

Embedding models are small compared to generation models - usually under 1B parameters - and the GPU work per document is modest. The workload is throughput-oriented and parallelizable: embedding a corpus of millions of documents is a batch job that benefits from any GPU with enough VRAM to hold the model and a useful batch size.

Mid-range cards like the L40s are often the most cost-effective choice for this work. The peak performance of an H100 isn't useful if the workload doesn't saturate it.
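The batch-job shape of corpus embedding can be sketched in a few lines. The `embed_batch` callable stands in for whatever embedding model a team actually uses; `fake_embed_batch` is a hypothetical placeholder that returns one-dimensional vectors so the example runs without a GPU or model download.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to `batch_size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def embed_corpus(docs: Iterable[str],
                 embed_batch: Callable[[List[str]], List[List[float]]],
                 batch_size: int = 256) -> List[List[float]]:
    """Embed documents batch by batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(docs, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    """Placeholder embedder: one number per text (its length)."""
    return [[float(len(t))] for t in texts]


if __name__ == "__main__":
    print(embed_corpus(["alpha", "beta", "gamma"], fake_embed_batch, batch_size=2))
```

In practice the batch size is the main tuning knob: it should be large enough to keep the card busy and small enough to fit alongside the model in VRAM.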

Vector retrieval

This isn't a GPU workload at all in most production setups. Vector databases run on CPU-based instances optimized for memory and storage rather than compute. The GPU choice for the rest of the pipeline doesn't affect the retrieval step.

Generation

This is where the GPU does the heavy lifting in production RAG, and it's a pure inference workload. The model has to hold its weights and KV cache for the active conversation, the per-request latency has to be low enough for the application, and the throughput has to scale with the request volume.

GPU choice for the generation step is the same as for any LLM inference workload: VRAM that fits the model and the context, memory bandwidth that supports the desired throughput, and a precision the application can tolerate. The A100 80GB is a reliable choice for most production RAG generation; the H100 makes sense where latency requirements are tight or throughput requirements are high.
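The KV cache part of that sizing is easy to estimate: for each token the cache stores one key and one value vector per layer, so its size grows with layer count, KV head count, head dimension, context length, and the number of concurrent sequences. The sketch below assumes a 16-bit cache and a model shape (32 layers, 32 KV heads, head dimension 128) chosen purely for illustration; models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_gib(num_layers: int,
                 num_kv_heads: int,
                 head_dim: int,
                 seq_len: int,
                 concurrent_sequences: int = 1,
                 bytes_per_element: int = 2) -> float:
    """KV cache size in GiB: keys and values for every layer, head, and token."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * concurrent_sequences
    return elements * bytes_per_element / 2**30


if __name__ == "__main__":
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)  # illustrative shape
    for concurrency in (1, 16):
        size = kv_cache_gib(seq_len=4096, concurrent_sequences=concurrency, **shape)
        print(f"{concurrency} concurrent 4096-token sequences: {size:.1f} GiB")
```

With these assumed dimensions, a single 4096-token sequence needs 2 GiB of cache and sixteen concurrent ones need 32 GiB, which can exceed the size of the weights themselves.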

The practical takeaway for RAG: think about the pipeline as three workloads, not one. The cheapest GPU that handles the generation step well is usually the right choice, with embedding handled separately and retrieval running on CPU instances. Civo's mixed offering - different GPU types across traditional GPU compute and Kubernetes GPU - supports this pattern naturally, allowing different parts of the pipeline to run on the hardware that best fits each.

Inference: The workload where most teams overspend

Inference is the workload that runs constantly once a model is deployed, and it's where most teams systematically overspend on GPUs. The default of using the same card the model was trained on is rarely the right answer in production, but it's the path of least resistance, and many teams take it.

The structural differences from training are significant. Inference doesn't need the optimizer state, gradients, or stored activations that dominate training memory. It does need the model weights, the KV cache for ongoing conversations, and enough memory bandwidth to serve requests at the target latency. The compute pattern is also different: many small requests rather than a few large batches, with latency mattering more than throughput.

What this means for GPU choice:

  • Inference-optimized cards are often the right call: The L40s, for example, is designed for inference and graphics workloads, and delivers strong per-dollar performance on inference-only deployments.
  • Lower precision is usually fine and sometimes preferable: Cards with strong FP8 or INT8 support, like the H100 and B200, deliver better cost per inference at lower precision than the same workload at FP16 - assuming the application can tolerate the quality trade-off, which most can.
  • VRAM headroom for the KV cache matters: A model whose weights fit comfortably on a 40GB card might need 80GB in production if the application uses long context windows or high concurrency, because the KV cache grows with both.
  • The right card depends on the latency target: For batch inference, where latency doesn't matter, a cheaper card is usually correct. For real-time inference where p99 latency is a hard requirement, faster cards with higher memory bandwidth justify the premium.
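One way to reason about the latency bullet is a rough memory-bandwidth bound: when generating one token at a time for a single sequence, the GPU has to read approximately all of the model weights per token, so tokens per second cannot exceed memory bandwidth divided by model size. This is a simplified upper bound that ignores KV cache reads, compute, and batching. The function name is hypothetical and the bandwidth figures in the example are placeholders; check the vendor datasheet for the card you are evaluating.

```python
def max_decode_tokens_per_second(model_gb: float, memory_bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when weight reads dominate."""
    return memory_bandwidth_gb_per_s / model_gb


if __name__ == "__main__":
    model_gb = 14.0  # e.g. a 7B-parameter model at 16-bit precision
    for label, bandwidth in (("lower-bandwidth card (placeholder)", 900.0),
                             ("higher-bandwidth card (placeholder)", 2000.0)):
        bound = max_decode_tokens_per_second(model_gb, bandwidth)
        print(f"{label}: at most ~{bound:.0f} tokens/s per sequence")
```

Halving the bytes per weight through quantization roughly doubles this bound, which is one reason lower precision and higher memory bandwidth both show up in the guidance above.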

At Civo, we charge zero egress fees, which removes an entire category of cost. Every response an inference service returns leaves the cloud, so providers that charge for egress can add a meaningful percentage to the total cost of ownership at scale.

Putting it together: The decision framework

The pattern across all four methods is that GPU choice should be driven by the specific shape of the workload, not by a general "use the best card you can afford" heuristic. The practical decision steps:

  1. Identify the method clearly: Full fine-tuning, LoRA, RAG, and inference each have different profiles, and conflating them leads to the wrong hardware choice.
  2. Calculate the memory requirement honestly: Model weights plus the method-specific overhead - optimizer state for training, KV cache for inference, adapter weights for LoRA. Underestimating here is the most common failure mode.
  3. Match the precision the method can tolerate: Lower precision opens up cheaper or smaller GPUs without significant quality loss in most cases.
  4. Consider the iteration pattern, not just the peak performance: Methods that involve many short runs benefit from fast provisioning and low per-hour costs. Methods that involve long-running jobs benefit from sustained throughput and reliable interconnects.
  5. Model total cost of ownership, not just per-hour rates: Egress, storage, and any metered service add up differently across providers, and the headline rate is only part of the picture.
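Step 5 can be sketched as a simple cost model that adds egress and storage to the compute bill. The function and parameter names are hypothetical. The two GPU hourly rates are the ones quoted earlier in this post, while the egress volume and the per-GB egress rate for the second provider are placeholder assumptions, not published prices.

```python
def total_cost(wall_clock_hours: float,
               hourly_rate_per_gpu: float,
               num_gpus: int = 1,
               egress_gb: float = 0.0,
               egress_rate_per_gb: float = 0.0,
               storage_gb_months: float = 0.0,
               storage_rate_per_gb_month: float = 0.0) -> float:
    """Compute plus egress plus storage for one billing period."""
    compute = wall_clock_hours * hourly_rate_per_gpu * num_gpus
    egress = egress_gb * egress_rate_per_gb
    storage = storage_gb_months * storage_rate_per_gb_month
    return compute + egress + storage


if __name__ == "__main__":
    hours, egress_gb = 720, 5000  # one month of one GPU; egress volume is a placeholder
    no_egress_fee = total_cost(hours, 1.09, egress_gb=egress_gb, egress_rate_per_gb=0.0)
    with_egress_fee = total_cost(hours, 3.67, egress_gb=egress_gb,
                                 egress_rate_per_gb=0.10)  # placeholder egress rate
    print(f"provider without egress fees: ${no_egress_fee:,.2f}")
    print(f"provider with egress fees:    ${with_egress_fee:,.2f}")
```

Running the same model with your own measured hours, data volumes, and quoted rates makes it easier to see whether the headline hourly price or the metered extras dominate the bill.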

A team that runs all four methods at different points in the lifecycle - full fine-tunes for major model updates, LoRA for routine adaptation, RAG for production deployment, inference at scale - benefits most from a provider whose catalog supports all of them at competitive economics. Civo's range across GPU compute and Kubernetes GPU, combined with transparent pricing and the absence of egress fees, is designed for exactly this kind of mixed workload.

Civo is the Sovereign Cloud and AI platform designed to help developers and enterprises build without limits. We bridge the gap between the openness of the public cloud and the rigorous security of private environments, delivering full cloud parity across every deployment. As a team, we are dedicated to providing scalable compute, lightning-fast Kubernetes, and managed services that are ready in minutes. Through CivoStack Enterprise and our FlexCore appliance, we empower organizations to maintain total data sovereignty on their own hardware.

Our mission is to make the cloud faster, simpler, and fairer. By providing enterprise-grade NVIDIA GPUs and streamlined model management, we ensure that high-performance AI and machine learning are accessible to everyone. Built for transparency and performance, the Civo Team is here to give you total control over your infrastructure, your data, and your spend.
