Since 1999, when NVIDIA invented the GPU (graphics processing unit), demand for its chips has skyrocketed. At CES 2026, CEO Jensen Huang announced the company's latest GPU, named after the astronomer Vera Rubin. This follows the announcement of the Blackwell lineup only two years ago.
In this blog, we'll explore what the industry knows about Vera Rubin so far, and compare its specs against the NVIDIA B200 from the Blackwell lineup.
What is the NVIDIA Vera Rubin?
Vera Rubin is NVIDIA's next-generation GPU architecture, the successor to the Blackwell family.
Throughout this blog, we'll refer to the architecture as Rubin. In practice, NVIDIA uses “Vera Rubin” to describe both the GPU architecture and a full data-center platform that includes CPUs, networking, and interconnects alongside the R100 GPU.
Within this architecture, the R100 has been announced as the first GPU product. When you see "Vera Rubin NVL72" or "DGX Rubin," those are systems that use R100 GPUs based on the Rubin architecture.
Why is the NVIDIA Rubin important?
Each release from NVIDIA sets out to improve a certain aspect of computation. At CES 2026, Jensen Huang highlighted that AI inference is no longer a simple one-shot request-response. With the rise of reasoning models and test-time scaling, inference has become a “thinking process”, whereby the model generates long chains of thought, tries different approaches, and iterates before producing a final answer. As Huang put it, "the longer it thinks, oftentimes it produces a better answer."
According to Huang, test-time scaling is causing the number of tokens generated per inference request to grow by roughly 5x every single year. At the same time, the race to the next frontier of AI means the cost of last-generation tokens drops by about 10x per year as newer, more efficient models and hardware replace them.
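Those two trends pull in opposite directions, and it helps to see how they net out. The 5x and 10x figures are quoted from the keynote; the compounding below is our own back-of-the-envelope arithmetic, not an NVIDIA projection:

```python
# Back-of-the-envelope: tokens per request grow ~5x/year while the
# cost per token falls ~10x/year (both figures from Huang's keynote).
tokens_growth = 5.0   # yearly multiplier on tokens generated per request
cost_decline = 10.0   # yearly divisor on cost per token

# Net change in the cost of serving one request after a year:
net = tokens_growth / cost_decline
print(net)  # 0.5 -> a request costs about half as much, despite 5x more tokens
```

In other words, per-request costs can still fall even as each request consumes far more tokens, but only if hardware and model efficiency keep delivering that 10x, which is the bet Rubin is making.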
Making inference cheaper
To meet the combined pressure of growing inference demand and long-running reasoning models, Rubin is designed to attack the cost problem from three angles:
| Angle | What changes | Why it matters |
|---|---|---|
| Doing more with less precision | Rubin’s 3rd-generation Transformer Engine introduces hardware-level support for NVFP4 (4-bit floating point), enabling inference at much lower numerical precision without meaningful quality loss. | Lower precision dramatically increases tokens per watt and per GPU, reducing inference cost while maintaining model accuracy. |
| Removing the memory bottleneck | Long-reasoning models generate massive token sequences that must be stored and repeatedly accessed as a KV cache. Rubin increases memory bandwidth and capacity to keep these models fed. | Higher bandwidth and larger memory pools prevent the KV cache from becoming the dominant limiter in long-context and chain-of-thought inference. |
| Splitting the workload | NVIDIA introduces CPX, a processor dedicated to prompt processing (prefill), while the R100 GPU focuses on token generation (decode). This is known as disaggregated inference. | Separating prefill and decode allows higher utilization and enables operators to serve more concurrent requests with fewer GPUs. |
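To see why the KV cache so quickly becomes the bottleneck in the second row above, here is a minimal sketch of the standard cache-size formula. The model dimensions (80 layers, 8 KV heads, head dim 128, FP16 cache) are hypothetical Llama-70B-class values chosen for illustration; none of them come from NVIDIA's announcement:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV-cache size: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical 70B-class model with grouped-query attention:
per_token = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1)
print(per_token)          # 327,680 bytes (~0.31 MiB) cached per generated token

# One 128k-token reasoning trace:
total = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024)
print(total / 2**30)      # 40.0 GiB for a single request
```

A single long chain-of-thought request can therefore consume tens of gigabytes of cache on top of the model weights, which is why Rubin's 288 GB of HBM4 and higher bandwidth matter as much as raw FLOPS.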
An introduction to the NVIDIA Rubin (R100)
The R100 is the first GPU built on NVIDIA’s Rubin architecture and is designed specifically for large-scale AI inference and training workloads in data centers.
At a high level, the R100 represents a shift away from simply maximizing raw compute and toward optimizing the full inference pipeline, including memory access, interconnect bandwidth, and long-context reasoning efficiency. Key characteristics of the R100 include:
- Next-generation process node: R100 is manufactured on TSMC’s N3 process, enabling higher transistor density and improved performance-per-watt compared to Blackwell’s 4NP node.
- HBM4 memory subsystem: With 288 GB of HBM4 and up to 22 TB/s of memory bandwidth, the R100 is designed to keep long-context and reasoning-heavy models fed without stalling on memory access.
- Optimized for low-precision inference: R100 is tightly coupled with NVIDIA’s 3rd-generation Transformer Engine, providing native hardware support for NVFP4 to maximize throughput and efficiency during inference.
- High-bandwidth scale-out interconnect: Support for next-generation NVLink enables up to 3.6 TB/s of bidirectional bandwidth per GPU, allowing R100s to operate as part of tightly coupled, rack-scale systems.
- Designed for disaggregated systems: Rather than operating in isolation, the R100 is intended to work alongside specialized processors like CPX, with different parts of the inference pipeline mapped to the hardware best suited to each stage.
Unlike previous generations, the R100 is not positioned as a general-purpose accelerator for every workload. Instead, it is purpose-built for the realities of modern AI systems: long-running inference, large KV caches, and reasoning models that trade time and tokens for higher-quality outputs.
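Memory bandwidth dominates decode because each generated token streams the model weights through the GPU once, so a simple roofline gives an upper bound on single-sequence token rate. The sketch below uses the 22 TB/s figure from R100's published specs; the 70B-parameter NVFP4 model is our own assumption, and real throughput would be lower once KV-cache traffic and batching effects are included:

```python
def max_decode_tokens_per_s(params, bytes_per_param, mem_bw_bytes_per_s):
    """Bandwidth-bound roofline: every decoded token reads all weights once."""
    weight_bytes = params * bytes_per_param
    return mem_bw_bytes_per_s / weight_bytes

# Hypothetical 70B-parameter model quantized to 4-bit (~0.5 bytes/param)
# on an R100 with 22 TB/s of HBM4 bandwidth:
rate = max_decode_tokens_per_s(params=70e9, bytes_per_param=0.5,
                               mem_bw_bytes_per_s=22e12)
print(round(rate))  # ~629 tokens/s per sequence, ignoring KV-cache reads
```

The same formula also shows why NVFP4 matters beyond FLOPS: halving bytes-per-parameter doubles the bandwidth-bound token rate for free.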
NVIDIA Rubin (R100) vs. NVIDIA Blackwell (B200) GPU
| Spec | R100 (Rubin) | B200 (Blackwell) |
|---|---|---|
| FP4 Inference | 50 PFLOPS | ~9 PFLOPS |
| FP4 Training | 35 PFLOPS | ~10 PFLOPS |
| Memory Type | HBM4 | HBM3e |
| Memory Capacity | 288 GB | 192 GB |
| Memory Bandwidth | 22 TB/s | 8 TB/s |
| NVLink Bandwidth (per GPU) | 3.6 TB/s | 1.8 TB/s |
| Transistors | 336 billion | 208 billion |
| Process Node | TSMC N3 | TSMC 4NP |
It is important to note that MLPerf results for the R100 are not yet available. MLPerf is an industry-standard benchmark suite maintained by MLCommons that provides standardized, reproducible performance measurements for ML training and inference across hardware platforms.
It is widely regarded as the closest thing to an apples-to-apples comparison in this space. Until R100 submissions appear, the numbers above are NVIDIA's own published specs rather than independently verified benchmarks.
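Taking the published specs at face value, the generation-over-generation ratios are easy to compute. This is simple arithmetic on NVIDIA's own table figures, not a benchmark result:

```python
# Generation-over-generation ratios from the published spec table
# (NVIDIA's figures, not independently verified benchmarks).
specs = {
    "FP4 inference (PFLOPS)":  (50, 9),
    "Memory capacity (GB)":    (288, 192),
    "Memory bandwidth (TB/s)": (22, 8),
    "NVLink bandwidth (TB/s)": (3.6, 1.8),
}

for name, (r100, b200) in specs.items():
    print(f"{name}: {r100 / b200:.2f}x")
# FP4 inference (PFLOPS): 5.56x
# Memory capacity (GB): 1.50x
# Memory bandwidth (TB/s): 2.75x
# NVLink bandwidth (TB/s): 2.00x
```

Note how the on-paper gains are uneven: compute improves far more than memory capacity, which is consistent with NVIDIA leaning on NVFP4 and disaggregation, rather than raw memory growth, to make long-context inference cheaper.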
Summary
NVIDIA's latest GPU lineup arrives with a promising premise in an age where demand for compute and inference is growing exponentially. Delivering inference more cheaply and quickly lowers the barrier for organizations building on AI, whether that means training the next frontier model or serving millions of inference requests at a fraction of the current cost.
With Rubin in production and MLPerf results still to come, the real test will be how these specs translate to real-world workloads. If you’re looking to learn more about previous generations of NVIDIA GPUs, here are some resources:
- A100 vs. L40s vs. H100 vs. H200 GH Superchips
- NVIDIA’s B200 vs. H100
- Inside Civo’s launch of NVIDIA Blackwell B200 cloud compute