NVIDIA Rubin (R100) vs. NVIDIA Blackwell (B200) GPU

Since 1999, when NVIDIA invented the GPU (graphics processing unit), demand for the technology has skyrocketed. At CES 2026, CEO Jensen Huang announced the company's latest GPU, named after Vera Rubin, only two years after the announcement of the Blackwell lineup.

In this blog, we’ll explore what the industry knows about Vera Rubin so far and compare some of its specs against the NVIDIA B200 from the Blackwell lineup.

What is the NVIDIA Vera Rubin?

Vera Rubin is NVIDIA's next-generation GPU architecture, the successor to the Blackwell family.

💡 The name Vera Rubin follows NVIDIA's tradition of naming GPU architectures after pioneering scientists. Here the company chose Vera Rubin, the American astronomer whose observations of galaxy rotation curves provided some of the first compelling evidence for dark matter. Her work showed that galaxies contain five to ten times more mass than what's visible, fundamentally reshaping our understanding of the universe.

Throughout this blog, we'll refer to the architecture as Rubin. In practice, NVIDIA uses “Vera Rubin” to describe both the GPU architecture and a full data-center platform that includes CPUs, networking, and interconnects alongside the R100 GPU.

Within this architecture, the R100 has been announced as the first GPU product. When you see "Vera Rubin NVL72" or "DGX Rubin," those are systems that use R100 GPUs based on the Rubin architecture.

Why is the NVIDIA Rubin important?

Each release from NVIDIA sets out to improve a certain aspect of computation. At CES 2026, Jensen Huang highlighted that AI inference is no longer a simple one-shot request-response. With the rise of reasoning models and test-time scaling, inference has become a “thinking process”, whereby the model generates long chains of thought, tries different approaches, and iterates before producing a final answer. As Huang put it, "the longer it thinks, oftentimes it produces a better answer."

According to Huang, test-time scaling is causing the number of tokens generated per inference request to grow by roughly 5x every single year. At the same time, the race to the next frontier of AI means the cost of last-generation tokens drops by about 10x per year as newer, more efficient models and hardware replace them.
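To get a feel for how those two rates interact, here is a minimal Python sketch. It assumes, purely for illustration, that both trends apply to the same workload and compound annually; Huang's figures do not claim this (the 10x drop refers to last-generation tokens). The function name `project` and its parameters are hypothetical.

```python
def project(years, token_growth=5.0, cost_drop=10.0):
    """Return multipliers relative to year 0.

    token_growth: assumed yearly growth in tokens per request (5x in the talk)
    cost_drop:    assumed yearly drop in cost per token (10x in the talk)
    """
    tokens = token_growth ** years
    cost_per_token = (1.0 / cost_drop) ** years
    # Cost per request = tokens per request * cost per token
    return tokens, cost_per_token, tokens * cost_per_token


for year in range(4):
    tokens, per_token, per_request = project(year)
    print(
        f"year {year}: tokens x{tokens:g}, "
        f"cost/token x{per_token:g}, cost/request x{per_request:g}"
    )
```

Under these assumptions the cost of a single request only halves each year, even though the per-token price falls tenfold, which is why token volume matters as much as token price.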

Making inference cheaper

With inference demand rising and reasoning models generating ever-longer outputs, Rubin is designed to attack the cost problem from three angles:

Angle	What changes	Why it matters
Doing more with less precision	Rubin’s 3rd-generation Transformer Engine introduces hardware-level support for NVFP4 (4-bit floating point), enabling inference at much lower numerical precision without meaningful quality loss.	Lower precision dramatically increases tokens per watt and per GPU, reducing inference cost while maintaining model accuracy.
Removing the memory bottleneck	Long-reasoning models generate massive token sequences that must be stored and repeatedly accessed as a KV cache. Rubin increases memory bandwidth and capacity to keep these models fed.	Higher bandwidth and larger memory pools prevent the KV cache from becoming the dominant limiter in long-context and chain-of-thought inference.
Splitting the workload	NVIDIA introduces CPX, a processor dedicated to prompt processing (prefill), while the R100 GPU focuses on token generation (decode). This is known as disaggregated inference.	Separating prefill and decode allows higher utilization and enables operators to serve more concurrent requests with fewer GPUs.
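The first two rows come down to bytes. The sketch below estimates how much memory model weights take at different bit widths and how large a KV cache grows with context length. The model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a Rubin benchmark, and the weight estimate ignores the scale factors that block-scaled formats such as NVFP4 store alongside the values.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bits_per_param):
    """Approximate weight storage, ignoring quantization scale factors."""
    return num_params * bits_per_param / 8


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one key and one value tensor per layer (hence the 2)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical 70B-parameter model (assumption for illustration only)
for bits in (16, 8, 4):
    gb = weight_bytes(70e9, bits) / 1e9
    print(f"{bits}-bit weights: {gb:.0f} GB")

# One 128K-token sequence with 16-bit keys and values
kv = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                    seq_len=128 * 1024)
print(f"KV cache for one 128K-token request: {kv / GIB:.0f} GiB")
```

With these assumed dimensions, a single long-context request holds a KV cache larger than the 4-bit weights themselves, which is the situation the second row of the table describes.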

💡 The CPX is a new processor class from NVIDIA, designed specifically for the prefill stage of LLM inference. Traditional GPUs handle both the prefill and decode phases, but research (Splitwise and DistServe) showed that separating these workloads onto specialized hardware, an approach called disaggregated inference, yields up to 1.4x higher throughput at 20% lower cost.
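The prefill/decode split can be sketched as a toy pipeline. This is a simplified illustration of the idea, not NVIDIA's scheduler or any real serving framework: the class names `PrefillWorker` and `DecodeWorker`, the `serve` function, and the token counters are all hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    prompt_tokens: int
    max_new_tokens: int
    kv_cache_tokens: int = 0  # tokens whose keys/values are cached
    generated: int = 0


class PrefillWorker:
    """Stand-in for prefill hardware: processes the whole prompt in one pass."""

    def run(self, req):
        req.kv_cache_tokens = req.prompt_tokens
        return req


class DecodeWorker:
    """Stand-in for decode hardware: one new token per active request per step."""

    def __init__(self):
        self.active = []

    def admit(self, req):
        self.active.append(req)

    def step(self):
        for req in self.active:
            req.generated += 1
            req.kv_cache_tokens += 1  # each new token extends the KV cache
        finished = [r for r in self.active if r.generated >= r.max_new_tokens]
        self.active = [r for r in self.active if r.generated < r.max_new_tokens]
        return finished


def serve(requests):
    prefill, decode = PrefillWorker(), DecodeWorker()
    handoff = deque()
    for req in requests:
        # In a real system the KV cache would move between devices here.
        handoff.append(prefill.run(req))
    while handoff:
        decode.admit(handoff.popleft())
    done = []
    while decode.active:
        done.extend(decode.step())
    return done


finished = serve([Request(1, prompt_tokens=100, max_new_tokens=3),
                  Request(2, prompt_tokens=10, max_new_tokens=1)])
print([(r.request_id, r.kv_cache_tokens) for r in finished])
```

The point of the split is that each stage can be sized and scheduled independently: prefill is a single compute-heavy pass over the prompt, while decode is many small steps that mostly read the growing KV cache.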

An introduction to the NVIDIA Rubin (R100)

The R100 is the first GPU built on NVIDIA’s Rubin architecture and is designed specifically for large-scale AI inference and training workloads in data centers.

At a high level, the R100 represents a shift away from simply maximizing raw compute and toward optimizing the full inference pipeline, including memory access, interconnect bandwidth, and long-context reasoning efficiency. Key characteristics of the R100 include:

Next-generation process node: R100 is manufactured on TSMC’s N3 process, enabling higher transistor density and improved performance-per-watt compared to Blackwell’s 4NP node.
HBM4 memory subsystem: With 288 GB of HBM4 and up to 22 TB/s of memory bandwidth, the R100 is designed to keep long-context and reasoning-heavy models fed without stalling on memory access.
Optimized for low-precision inference: R100 is tightly coupled with NVIDIA’s 3rd-generation Transformer Engine, providing native hardware support for NVFP4 to maximize throughput and efficiency during inference.
High-bandwidth scale-up interconnect: Support for next-generation NVLink enables up to 3.6 TB/s of bidirectional bandwidth per GPU, allowing R100s to operate as part of tightly coupled, rack-scale systems.
Designed for disaggregated systems: Rather than operating in isolation, the R100 is intended to work alongside specialized processors like CPX, with different parts of the inference pipeline mapped to the hardware best suited to each stage.
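To see why memory bandwidth features so prominently in this list, consider a common back-of-the-envelope bound for decode: at batch size 1, every generated token requires reading all model weights from memory once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that rule to the R100 figures quoted above; the 70B-parameter, 4-bit model is a hypothetical example, and the bound ignores KV cache reads, compute time, and batching.

```python
def decode_tokens_per_second_bound(bandwidth_tb_s, num_params, bits_per_param):
    """Upper bound on batch-1 decode speed when weight reads dominate."""
    bytes_per_token = num_params * bits_per_param / 8
    return bandwidth_tb_s * 1e12 / bytes_per_token


def fits_in_memory(num_params, bits_per_param, capacity_gb):
    """Check whether the weights alone fit in the given memory capacity."""
    return num_params * bits_per_param / 8 <= capacity_gb * 1e9


# R100 figures from the list above; the model size is an assumption
bound = decode_tokens_per_second_bound(22, num_params=70e9, bits_per_param=4)
print(f"Bandwidth-bound decode ceiling: {bound:.0f} tokens/s")
print("Weights fit in 288 GB:", fits_in_memory(70e9, 4, capacity_gb=288))
```

Real throughput will be lower for a single request and higher in aggregate once requests are batched, but the ceiling scales directly with bandwidth, which is why HBM4 matters for long reasoning workloads.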

Unlike previous generations, the R100 is not positioned as a general-purpose accelerator for every workload. Instead, it is purpose-built for the realities of modern AI systems: long-running inference, large KV caches, and reasoning models that trade time and tokens for higher-quality outputs.

NVIDIA Rubin (R100) vs. NVIDIA Blackwell (B200) GPU

Spec	R100 (Rubin)	B200 (Blackwell)
FP4 Inference	50 PFLOPS	~9 PFLOPS
FP4 Training	35 PFLOPS	~10 PFLOPS
Memory Type	HBM4	HBM3e
Memory Capacity	288 GB	192 GB
Memory Bandwidth	22 TB/s	8 TB/s
NVLink Bandwidth (per GPU)	3.6 TB/s	1.8 TB/s
Transistors	336 billion	208 billion
Process Node	TSMC N3	TSMC 4NP
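For a quick read of the table, the generation-over-generation ratios can be computed directly. The values below are copied from the table as published; the B200 FP4 figures are approximate in the source, so the resulting ratios are approximate too.

```python
# (R100, B200) pairs taken from the comparison table above
SPECS = {
    "FP4 inference (PFLOPS)": (50, 9),     # B200 value is approximate
    "FP4 training (PFLOPS)": (35, 10),     # B200 value is approximate
    "Memory capacity (GB)": (288, 192),
    "Memory bandwidth (TB/s)": (22, 8),
    "NVLink bandwidth per GPU (TB/s)": (3.6, 1.8),
    "Transistors (billions)": (336, 208),
}


def ratios(specs):
    """Return the R100-to-B200 ratio for each spec."""
    return {name: r100 / b200 for name, (r100, b200) in specs.items()}


for name, ratio in ratios(SPECS).items():
    print(f"{name}: {ratio:.2f}x")
```

Compute grows faster than memory capacity in this comparison, while memory bandwidth lands in between, consistent with the focus on keeping long-context inference fed.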

Note that MLPerf results for the R100 are not yet available. MLPerf is an industry-standard benchmark suite maintained by MLCommons that provides standardized, reproducible performance measurements for ML training and inference across hardware platforms.

It is widely regarded as the closest thing to an apples-to-apples comparison in this space. Until R100 submissions appear, the numbers above are NVIDIA's own published specs rather than independently verified benchmarks.

Summary

NVIDIA's latest GPU lineup looks promising at a time when demand for compute and inference is growing exponentially. Cheaper, faster inference and training lower the barrier for organizations building on AI, whether that means training the next frontier model or serving millions of inference requests at a fraction of the current cost.

With Rubin in production and MLPerf results still to come, the real test will be how these specs translate to real-world workloads. If you’re looking to learn more about previous generations of NVIDIA GPUs, here are some resources:

Looking for something a bit more technical? I recently covered GPU time-slicing here >


Jubril Oyetunji