GPUs are no longer a niche component. Gamers know them for immersive graphics, workstation users rely on them for demanding rendering and simulation workloads, and in the age of AI, GPUs have become one of the most in-demand resources in modern infrastructure.

They are also expensive. That reality creates two immediate constraints, for individuals and enterprises alike: GPU-backed instances should be provisioned deliberately, and once provisioned, they should be used efficiently.

This post focuses on the second point: how to make better use of GPU resources using time-slicing on Kubernetes.

An introduction to multitasking

Before locking in on GPU time-slicing, it is worth revisiting why it exists. In the early days of computing, operating systems could only run one process at a time. Commercial successes like the IBM PC shipped with PC DOS 1.0, a single-user, single-tasking operating system. To move past this limitation, operating systems had to devise a way to run multiple processes concurrently.

Early attempts at this introduced the concept of a process scheduler, which controls which process runs on the CPU, how long it runs for, and which process is loaded next. An early example of an operating system with a process scheduler is CTSS (the Compatible Time-Sharing System), which notably ran on MIT's modified IBM 709.

Time-sharing became the standard approach because context switches were fast enough that, from an end-user's perspective, the delay was negligible. Fast forward to the modern day, and Linux-based operating systems have access to schedulers such as the Completely Fair Scheduler (CFS) and, more recently, the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, both of which build on these concepts.

This background matters because GPUs adopt many of the same scheduling principles, albeit for more specialised workloads.

What is time-slicing?

Time-slicing is a scheduling technique where multiple processes share a single resource by taking turns. Each process receives a fixed time interval (sometimes called a quantum); when that interval expires, the scheduler preempts the running process and switches to another. From the user's perspective, all processes appear to run simultaneously, even though only one executes at any given moment.

What is GPU time-slicing?

GPU time-slicing applies this principle to graphics processors. On modern NVIDIA GPUs, the driver can interleave execution contexts from multiple processes, allowing several workloads to share a single physical GPU. Each workload receives a slice of GPU time, with the driver handling context switches.

In Kubernetes land, this means you can schedule multiple pods onto a single GPU, with each pod believing it has access to a dedicated device. The NVIDIA device plugin (we will get to that in a minute) and GPU Operator handle the underlying multiplexing.
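To make that concrete, here is a minimal sketch of a Deployment that oversubscribes a time-sliced GPU. The names, image, and replica counts are illustrative, and it assumes a time-slicing configuration (covered later in this post) that advertises at least four replicas per physical GPU.

```yaml
# Illustrative only: four pods each request one nvidia.com/gpu. With time-slicing
# advertising four or more replicas per card, all four can share a single physical GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-gpu-demo            # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shared-gpu-demo
  template:
    metadata:
      labels:
        app: shared-gpu-demo
    spec:
      containers:
        - name: cuda
          image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1    # one logical (time-sliced) GPU per pod
```

Note that each pod still requests a whole nvidia.com/gpu; fractional requests are not supported. The "sharing" comes from the device plugin advertising more GPU replicas than physically exist.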

How does GPU sharing work on Kubernetes?

GPUs are a little different in that they don’t “just work” out of the box; even regular PC users have to install drivers and ensure they are compatible with the rest of their hardware. Kubernetes adds one more layer on top of this, as it does not know what a GPU is by default, unlike first-class resources such as CPU and memory.

One of the earliest community discussions around sharing GPUs between pods appears in Kubernetes issue #52757, opened in September 2017. The request was to allow multiple containers to share a single GPU device.

Technical constraints

While the general consensus was that this was needed, the discussion also highlighted some technical constraints:

  • No native concurrency: By default, CUDA kernels from different processes cannot run on a GPU simultaneously. They are time-sliced, not parallelized. The Pascal architecture introduced instruction-level preemption (an improvement over block-level preemption), but context switches remain non-trivial.
  • No resource partitioning: At the time, there was no mechanism to partition GPU resources (streaming multiprocessors, memory) or assign priorities when multiple processes shared a card.
  • MPS complexity: NVIDIA's Multi-Process Service existed as an alternative, but it introduced its own operational considerations.

GPU enablement through the Device Plugin interface was the focus for Kubernetes 1.8, with sharing explicitly deferred. As one contributor noted, "sharing GPUs is out of scope for the foreseeable future (at least until v1.11). Our current focus is to get GPUs per container working in production."

That "foreseeable future" stretched considerably as it was not until September 2022 that NVIDIA released the GPU Operator with time-slicing support, providing a standardized mechanism for GPU sharing on Kubernetes. The operator has since become the de facto approach for managing NVIDIA GPUs in Kubernetes clusters.

NVIDIA GPU Operator

The NVIDIA GPU Operator manages the full lifecycle of GPU resources in Kubernetes. Instead of manually installing drivers, runtimes, and plugins on each node, these components are deployed and managed as containerised workloads.

For GPU time-slicing, the operator is responsible for distributing configuration, advertising resources to the kubelet, and keeping the system in sync as settings change.
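As an illustration of how that wiring is typically set up at install time, the sketch below shows Helm values for the operator chart. The key names follow NVIDIA's gpu-operator chart documentation at the time of writing and may differ between chart versions; the time-slicing ConfigMap they reference is shown in the next section.

```yaml
# Illustrative values.yaml for the NVIDIA gpu-operator Helm chart (keys may vary by version).
driver:
  enabled: true                   # set to false if the NVIDIA driver is preinstalled on nodes
toolkit:
  enabled: true                   # NVIDIA container toolkit, managed as a containerised component
devicePlugin:
  config:
    name: time-slicing-config     # ConfigMap holding the sharing configuration (next section)
    default: any                  # which key within that ConfigMap to apply by default
```

These values end up on the operator's ClusterPolicy resource, which is what the configuration flow below reconciles against.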

Configuration flow

Enabling time-slicing involves a chain of resources that the operator reconciles into a running system. Here’s a high-level overview of how it works:

NVIDIA GPU Operator Configuration Flow

  • ConfigMap: Defines the GPU time-slicing configuration, including how many replicas each physical GPU is divided into and which GPU models the configuration applies to. This represents the desired state only and does not enact changes by itself (a minimal example is sketched below).
  • ClusterPolicy: Acts as the central custom resource monitored by the GPU Operator. It references the ConfigMap and specifies which configuration profile to apply. Updating the ClusterPolicy to point to the time-slicing ConfigMap triggers reconciliation.
  • GPU Operator Controller: Observes the ClusterPolicy and reconciles the cluster’s actual state with the desired state. When time-slicing settings change, it updates the relevant DaemonSets to propagate the new configuration across GPU nodes.
  • Device Plugin Pod: Runs on each GPU node and contains three containers:
    • init (config-manager-init): An init container that copies the time-slicing configuration from the /available-configs volume (sourced from the ConfigMap) into the /config volume (an emptyDir shared with other containers). Runs once at pod startup.
    • main (nvidia-device-plugin): Registers GPU resources with the kubelet. It reads the configuration from /config and advertises GPU replicas accordingly.
    • sidecar (config-manager): Watches for configuration changes. When the ConfigMap is updated, it copies the new configuration into /config and sends a SIGHUP signal to the main container, triggering a hot reload without restarting the pod.
  • GPU Replication: Based on the time-slicing configuration, the device plugin advertises multiple logical GPUs for each physical GPU. For example, configuring replicas: 10 on a node with one physical GPU results in ten allocatable nvidia.com/gpu resources reported to the kubelet. From the scheduler’s perspective, the node can host ten GPU-consuming pods, all sharing the same physical GPU via time-slicing.
Note: GPU time-slicing does not provide memory isolation. All workloads share the same GPU memory pool. If one workload exhausts available memory, others may fail.
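To make the ConfigMap step concrete, here is a sketch based on the configuration format NVIDIA documents for the device plugin. The ConfigMap name, namespace, profile key, and replica count are illustrative and should match whatever the ClusterPolicy (or Helm values) reference.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config       # must match the name referenced by the ClusterPolicy
  namespace: gpu-operator         # namespace where the GPU Operator is installed
data:
  any: |-                         # profile key; the ClusterPolicy selects one key as the default
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10          # each physical GPU is advertised as ten schedulable replicas
```

Once the operator reconciles this, a node with one physical GPU reports ten allocatable nvidia.com/gpu resources to the kubelet, matching the GPU Replication behaviour described above.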

Time-slicing in action

Once configured, the GPU rapidly switches between workloads at fixed time intervals. The animation below demonstrates how different time-slice durations affect switching patterns:

Time-slicing in action

Why this is important

There are a couple of reasons why GPU optimization techniques such as time-slicing exist, the biggest being cost. As this was being written, AWS raised prices on GPU instances: the p5e.48xlarge, which packs eight NVIDIA H200 accelerators, went from $34.61 to $39.80 per hour across most regions, while the p5en.48xlarge climbed from $36.18 to $41.61.

Benefits of GPU time-slicing

Beyond cost, GPU time-slicing offers several practical benefits:

  • Improved resource utilization: A GPU running a single inference workload at approximately 15% utilization leaves the remaining capacity unused. Time-slicing enables multiple workloads to share the same device, reducing idle capacity and improving overall utilization.
  • Lower barrier to GPU access: In shared clusters, dedicating entire GPUs to individual teams or workloads creates artificial scarcity. Time-slicing allows more users and applications to access GPU resources without requiring additional hardware.
  • Bursty workload consolidation: Workloads with short-lived utilization spikes followed by idle periods, such as batch inference or preprocessing pipelines, can be efficiently consolidated by granting shared GPU access through time-slicing.

NVIDIA B200 Blackwell from $2.69

We are pleased to offer early access to NVIDIA’s latest Blackwell architecture with the NVIDIA B200 GPU, now available from $2.69 per GPU/hour (preemptible) for a limited time.

Designed for AI at scale, the NVIDIA B200 delivers exceptional performance for training and inference workloads, enabling teams to push the boundaries of modern AI development.

👉 Be the first to access NVIDIA GPUs built for AI at scale

Summary

GPU time-slicing is a practical optimisation technique for workloads that do not require exclusive access to a GPU or that exhibit bursty usage patterns. In Kubernetes environments, it provides a structured way to increase utilisation and reduce cost without additional hardware.

Time-slicing is not the only GPU-sharing strategy, and this post does not attempt to compare alternatives. For readers interested in deeper comparisons, the following resources are recommended:

Accelerate your performance with Civo GPUs

Enterprise clouds weren’t made for the AI era. So we built one that is. Civo AI puts the power of the latest NVIDIA GPUs and multi-cloud control in your hands without cost, complexity or lock-in. Work at the speed of your ideas, without draining your budget – and keep your data close, compliant and completely under your control.

👉 Learn more