An introduction to GPU time-slicing
Written by
Technical Writer @ Civo
GPUs are no longer a niche component. Gamers know them for immersive graphics, workstation users rely on them for balanced performance, and in the age of AI, GPUs have become one of the most in-demand resources in modern infrastructure.
They are also expensive. That reality creates two immediate constraints for individuals and enterprises alike: GPU-backed instances should be provisioned deliberately, and once provisioned, they should be used efficiently.
This post focuses on the second point: how to make better use of GPU resources using time-slicing on Kubernetes.
An introduction to multitasking
Before locking in on GPU time-slicing, it is worth revisiting why it exists. In the early days of computing, operating systems could only run one process at a time. Commercial successes like the IBM PC ran PC DOS 1.0, which was described as a single-user, single-tasking operating system. To move past this limitation, modern operating systems had to devise a way to run multiple processes concurrently.
Early attempts introduced the concept of a process scheduler, which controlled which processes were scheduled on the CPU, how long they would run for, and which process would be loaded next. One early operating system with a process scheduler was CTSS (Compatible Time-Sharing System), which notably ran on MIT's modified IBM 709.
Time-sharing became the standard approach because it was fast: from an end-user perspective, the delay was negligible. Fast forward to the modern day, and Linux-based operating systems have access to schedulers such as the Completely Fair Scheduler (CFS) and, more recently, the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, which builds on many of these same concepts.
This background matters because GPUs adopt many of the same scheduling principles, albeit for more specialised workloads.
What is time-slicing?
Time-slicing is a scheduling technique where multiple processes share a single resource by taking turns. Each process receives a fixed time interval (sometimes called a quantum); when that interval expires, the scheduler preempts the running process and switches to another. From the user's perspective, all processes appear to run simultaneously, even though only one executes at any given moment.
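The mechanics can be sketched in a few lines of Python. This is a toy round-robin scheduler, not any real kernel's implementation: each task needs some amount of work, runs for at most one quantum, and is then preempted to the back of the queue.

```python
from collections import deque

def round_robin(tasks, quantum):
    """Simulate time-slicing: each entry in `tasks` maps a task name to
    the units of work it needs. The scheduler runs each task for at most
    `quantum` units, then preempts it. Returns the execution trace."""
    queue = deque(tasks.items())          # (name, remaining work)
    trace = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(quantum, remaining)     # run until the quantum expires or the task finishes
        trace.append((name, ran))
        if remaining > ran:
            queue.append((name, remaining - ran))  # preempted: back of the queue
    return trace

print(round_robin({"A": 5, "B": 3}, quantum=2))
# → [('A', 2), ('B', 2), ('A', 2), ('B', 1), ('A', 1)]
```

Neither task runs "in parallel", yet both make steady progress — exactly the illusion a time-sliced GPU presents to its workloads.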
What is GPU time-slicing?
GPU time-slicing applies this principle to graphics processors. On NVIDIA GPUs with compute preemption support (introduced with the Pascal architecture), the GPU driver can interleave execution contexts from multiple processes, allowing several workloads to share a single physical GPU. Each workload receives a slice of GPU time, with the driver handling context switches.
In Kubernetes land, this means you can schedule multiple pods onto a single GPU node, with each pod believing it has access to a dedicated GPU. The NVIDIA device plugin (we will get to that in a minute) and GPU operator handle the underlying multiplexing.
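Concretely, a pod sharing a time-sliced GPU requests it like any other extended resource. The manifest below is illustrative (the pod name and image are placeholders); with time-slicing enabled, the `nvidia.com/gpu: 1` limit maps to one time-slice replica rather than a whole physical card:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-worker          # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-worker
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1    # one time-slice replica, not a dedicated GPU
```

From the container's point of view nothing changes — it sees a GPU and uses it normally.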
How does GPU sharing work on Kubernetes?
GPUs are a little different in that they don't "just work" out of the box; even regular PC users have to install drivers and ensure they are compatible with the rest of their hardware. Kubernetes adds one more layer on top of this, as it does not know what a GPU is by default, unlike first-class resources such as nodes.
One of the earliest community discussions around sharing GPUs between pods appears in Kubernetes issue #52757, opened in September 2017. The ask was to allow multiple containers to share a single GPU device.
Technical constraints
While the general consensus was that this was needed, the discussion also highlighted some technical constraints:
- No native concurrency: By default, CUDA kernels from different processes cannot run on a GPU simultaneously. They are time-sliced, not parallelized. The Pascal architecture introduced instruction-level preemption (an improvement over block-level preemption), but context switches remain non-trivial.
- No resource partitioning: At the time, there was no mechanism to partition GPU resources (streaming multiprocessors, memory) or assign priorities when multiple processes shared a card.
- MPS complexity: NVIDIA's Multi-Process Service existed as an alternative, but it introduced its own operational considerations.
GPU enablement through the Device Plugin interface was the focus for Kubernetes 1.8, with sharing explicitly deferred. As one contributor noted, "sharing GPUs is out of scope for the foreseeable future (at least until v1.11). Our current focus is to get GPUs per container working in production."
That "foreseeable future" stretched considerably: it was not until September 2022 that NVIDIA released the GPU Operator with time-slicing support, providing a standardized mechanism for GPU sharing on Kubernetes. The operator has since become the de facto approach for managing NVIDIA GPUs in Kubernetes clusters.
NVIDIA GPU Operator
The NVIDIA GPU Operator manages the full lifecycle of GPU resources in Kubernetes. Instead of manually installing drivers, runtimes, and plugins on each node, these components are deployed and managed as containerised workloads.
For GPU time-slicing, the operator is responsible for distributing configuration, advertising resources to the kubelet, and keeping the system in sync as settings change.
Configuration flow
Enabling time-slicing involves a chain of resources that the operator reconciles into a running system: a ConfigMap holds the sharing configuration, the ClusterPolicy points the device plugin at it, and the kubelet then advertises the multiplied GPU resources to the scheduler.
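As a sketch of that chain (names like `time-slicing-config` and the config key `any` follow the conventions in NVIDIA's documentation; the replica count is an example), a ConfigMap describes how the device plugin should advertise the shared GPU:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # referenced later by the ClusterPolicy
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4         # advertise 4 schedulable replicas per physical GPU
```

The ClusterPolicy's `devicePlugin.config` field is then pointed at this ConfigMap (`name: time-slicing-config`, `default: any`), after which the kubelet advertises four `nvidia.com/gpu` resources for every physical GPU on the node.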

Note: GPU time-slicing does not provide memory isolation. All workloads share the same GPU memory pool. If one workload exhausts available memory, others may fail.
Time-slicing in action
Once configured, the GPU rapidly switches between workloads at fixed time intervals. Shorter slices mean finer-grained sharing, at the cost of more frequent context switches.
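To get a feel for how the quantum length affects the switching pattern, here is a small simulation (a toy model, not a measurement of real driver behaviour) that counts context switches for three equal workloads under different slice durations:

```python
from collections import deque

def context_switches(tasks, quantum):
    """Count preemptions when tasks of the given lengths (in arbitrary
    time units) share one GPU under round-robin time-slicing."""
    queue = deque(tasks)       # remaining work per task
    switches = 0
    while queue:
        remaining = queue.popleft()
        remaining -= min(quantum, remaining)
        if remaining > 0:
            queue.append(remaining)   # preempted, rejoin the queue
        if queue:                     # another task takes over: one switch
            switches += 1
    return switches

# Three workloads needing 30 units of GPU time each:
for q in (1, 5, 15):
    print(q, context_switches([30, 30, 30], q))
# quantum 1  → 89 switches
# quantum 5  → 17 switches
# quantum 15 → 5 switches
```

The trade-off mirrors CPU scheduling: small quanta interleave workloads smoothly but pay more context-switch overhead, which matters on GPUs where switches are non-trivial.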
Why this is important
There are a couple of reasons why GPU optimisation techniques such as time-slicing exist, the biggest being cost. As this post was being written, AWS raised prices on GPU instances: the p5e.48xlarge, which packs eight NVIDIA H200 accelerators, went from $34.61 to $39.80 per hour in most regions, while the p5en.48xlarge climbed from $36.18 to $41.61.
Benefits of GPU time-slicing
Beyond cost, GPU time-slicing offers several practical benefits:
- Higher utilisation: a single physical GPU serves several pods instead of sitting idle between bursts.
- Denser scheduling: Kubernetes can place more workloads per GPU node, which suits bursty or intermittent workloads that do not need exclusive access.
- No extra hardware: utilisation improves through configuration alone, without provisioning additional GPUs.
Summary
GPU time-slicing is a practical optimisation technique for workloads that do not require exclusive access to a GPU or that exhibit bursty usage patterns. In Kubernetes environments, it provides a structured way to increase utilisation and reduce cost without additional hardware.
Time-slicing is not the only GPU-sharing strategy, and this post does not attempt to compare alternatives. For readers interested in deeper comparisons, the following resources are recommended:
- Which GPU Sharing Strategy Is Right for You? A Comprehensive Benchmark Study
- GPU Sharing at CERN: Cutting the Cake Without Losing a Slice
- How companies are using Civo GPUs to accelerate AI innovation without runaway costs
- Improving GPU Utilization in Kubernetes
Jubril Oyetunji is a DevOps engineer and technical writer with a strong focus on cloud-native technologies and open-source tools. His work centers on creating practical tutorials that help developers better understand platforms such as Kubernetes, NGINX, Rust, and Go.
As a contract technical writer, Jubril authored an extensive library of technical guides covering cloud-native infrastructure and modern development workflows. Many of his tutorials achieved strong search rankings, helping developers around the world learn and adopt emerging technologies.