Cerebras CS-3: Bring on the NVIDIA Blackwell Competition

The News: Cerebras Systems introduced the CS-3, powered by the Wafer Scale Engine 3 (WSE-3), at Cerebras AI Day. The system is designed to deliver twice the performance of the Cerebras WSE-2 at the same power draw and for the same price. Subsequently, at GTC24, NVIDIA unveiled its DGX B200 platform. How can the offerings be compared? Read the full blog on the Cerebras Systems website.

Analyst Take: Cerebras Systems launched its WSE-3 AI chip, purpose-built for training the AI ecosystem’s largest AI models. The 5nm-based, 4 trillion transistor WSE-3 powers the Cerebras CS-3 AI supercomputer, delivering 125 petaflops of peak AI performance through 900,000 AI-optimized compute cores. Notably, the WSE-3 is the largest square chip that semiconductor foundry partner TSMC can produce today.

Key WSE-3 specs include:

  • 4 trillion transistors
  • 900,000 AI cores
  • 125 petaflops of peak AI performance
  • 44 GB on-chip SRAM
  • 5nm TSMC process
  • External memory: 1.5 TB, 12 TB, or 1.2 petabytes (PB)
  • Trains AI models up to 24 trillion parameters
  • Cluster size of up to 2048 CS-3 systems

With a huge memory system of up to 1.2 PB, the CS-3 is designed to train next-generation frontier models ten times larger than GPT-4 and Gemini. Models with 24 trillion parameters can be stored in a single logical memory space without partitioning or refactoring. This approach substantially streamlines the training workflow and accelerates developer productivity, as Cerebras distributes memory alongside cores with the objective of keeping compute and data as close together as possible. As such, training a one-trillion parameter model on the CS-3 can be as straightforward as training a one-billion parameter model on GPUs.

Of key importance, the CS-3 is built for both enterprise and hyperscale needs. Compact four-system configurations can fine-tune 70 billion (70B) parameter models in a day, while at full scale, using 2048 systems, Llama 70B can be trained from scratch in a single day, an unprecedented feat for generative AI (GenAI).

The latest Cerebras Software Framework provides native support for PyTorch 2.0 and the latest AI models and techniques such as multi-modal models, vision transformers, mixture of experts, and diffusion. From my view, Cerebras remains the only platform that provides native hardware acceleration for dynamic and unstructured sparsity, speeding up training by up to eightfold.

Cerebras Body Slams NVIDIA Across Multiple Key AI Chip Categories

Image Source: Cerebras

From my perspective, Cerebras has already demonstrated that its WSE-3 product offers sharp competitive advantages over NVIDIA’s H100 across key metrics such as core density, on-chip memory, memory bandwidth, fabric bandwidth, and chip size. For example, the WSE-3’s 900,000 cores represent a 52x differential over the NVIDIA H100’s combined 16,896 FP32 and 528 Tensor cores, and the WSE-3’s 44 GB of on-chip memory provides an 880x edge over the H100’s 0.05 GB.
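The two ratios quoted above can be reproduced with simple division. The sketch below uses only the figures cited in this article; the variable names are illustrative, and summing FP32 and Tensor cores into one H100 core count follows the article's own comparison rather than any NVIDIA convention.

```python
# Figures as quoted in the article (not independently verified here).
WSE3_CORES = 900_000
H100_FP32_CORES = 16_896
H100_TENSOR_CORES = 528
WSE3_ONCHIP_GB = 44
H100_ONCHIP_GB = 0.05

# Core-count differential: WSE-3 cores versus the H100's combined core count.
core_ratio = WSE3_CORES / (H100_FP32_CORES + H100_TENSOR_CORES)

# On-chip memory differential.
memory_ratio = WSE3_ONCHIP_GB / H100_ONCHIP_GB

print(f"Core ratio:   {core_ratio:.1f}x")
print(f"Memory ratio: {memory_ratio:.0f}x")
```

The core ratio works out to a little under 52, which the article rounds to 52x.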

By keeping the TSMC-manufactured WSE-3 wafer whole, Cerebras produces remarkable competitive differentiation in relation to NVIDIA, which follows the established GPU/CPU vendor practice of slicing the TSMC wafer into a multitude of smaller units to make its H100 GPU chip. In addition to sharp core, memory-related, and fabric bandwidth differentiators, Cerebras WSE-3 provides a compelling alternative by avoiding the complexities of using NVIDIA GPU clusters.

Currently, GPU clusters can require tens of thousands of GPUs or AI accelerators for high-end AI computing processes such as training GenAI workloads. As a result, NVIDIA GPU clusters using InfiniBand, Ethernet, NVLink, and PCIe switches demand massive power consumption and related costs to re-link chips. In contrast, Cerebras reduces the number of chips needed by a factor of 50+, which can decrease interconnect and networking costs.

The Cerebras CS-3 interconnect system is built using a completely different technological approach. Instead of using external interconnects between cores, Cerebras uses on-wafer wiring to connect hundreds of thousands of cores together, which can provide the highest performance fabric at the lowest power. As such, the CS-3 on-wafer fabric provides 27 PB per second of aggregate bandwidth across 900,000 cores. For comparison, that is more bandwidth than 1,800 DGX B200 servers combined. Even when compared with the full-rack, 72-GPU NVL72, a single CS-3 provides more than 200x the interconnect bandwidth. By keeping the bulk of compute and IO on-wafer, Cerebras obviates the need for convoluted interconnects, power-intensive wiring, and complex programming models.
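As a sanity check, the bandwidth claims above can be inverted to see what per-system interconnect bandwidth they imply for the NVIDIA platforms. This is a minimal sketch that uses only the numbers in this article; the resulting values are implied upper bounds, not NVIDIA specifications, and the variable names are illustrative.

```python
# CS-3 on-wafer fabric bandwidth as quoted: 27 PB/s (decimal units assumed).
CS3_FABRIC_TB_PER_S = 27_000

# Multipliers claimed in the article.
DGX_B200_SERVERS_MATCHED = 1_800   # "more bandwidth than 1,800 DGX B200 servers"
NVL72_MULTIPLIER = 200             # "more than 200x" a full-rack NVL72

# Implied ceilings on the interconnect bandwidth of each NVIDIA system.
implied_dgx_tb_per_s = CS3_FABRIC_TB_PER_S / DGX_B200_SERVERS_MATCHED
implied_nvl72_tb_per_s = CS3_FABRIC_TB_PER_S / NVL72_MULTIPLIER

print(f"Implied DGX B200 bandwidth: under {implied_dgx_tb_per_s:.0f} TB/s")
print(f"Implied NVL72 bandwidth:    under {implied_nvl72_tb_per_s:.0f} TB/s")
```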

Wait There’s More: Cerebras CS-3 Delivers Sharp Competitive Edge over New NVIDIA DGX B200

At GTC24, NVIDIA announced its Blackwell platform, touting the ability to enable organizations to build and run real-time GenAI on trillion-parameter large language models (LLMs) at up to 25x less cost and energy consumption than its predecessor (H100). However, how does the offering stack up against Cerebras CS-3 hardware across key large model training considerations such as training compute performance, memory capacity, interconnect bandwidth, and power efficiency?

Cerebras CS-3 and B200 Hardware Comparison

Image Source: Cerebras

Let’s take a closer look at the training performance, memory, and power metric comparisons. The B200 is composed of two GPU dies coupled together through NVLink with a total of 208B transistors. The B200 provides 4.4 petaflops of FP16 AI compute and comes with 192 GB of memory. The B200 is available in two server formats. The DGX B200 is a 10U server with 8x B200 GPUs. It offers 36 petaflops of AI compute, 1.5 TB of memory, and consumes 14.3 kW. The DGX NVL72 is a full rack solution with 72 B200 GPUs connected via NVLink. It provides 360 petaflops of AI compute and consumes 120 kW of power. B200 products are slated to ship in Q4 2024. Notably, the Cerebras CS-3 solution is available today.

For training performance, with 900,000 dedicated AI cores, the Cerebras CS-3 attains 125 petaflops of AI compute using industry standard FP16 precision. A single NVIDIA B200 GPU outputs 4.4 petaflops of AI compute while a DGX B200 with 8 GPUs totals 36 petaflops. In raw performance, a single CS-3 equates to about 3.5 DGX B200 servers. From my view, Cerebras reinforces its sharp competitive advantage by delivering such performance differentiators in a more compact footprint that requires only half the power consumption and a significantly streamlined programming model.
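The compute equivalence above can be worked through directly. The sketch below uses only the FP16 figures quoted in this article; the names are illustrative. It also shows that eight B200s at 4.4 petaflops each sum to 35.2 petaflops, slightly under the 36 petaflops quoted for the DGX B200, so the per-GPU figure is presumably rounded.

```python
# Peak FP16 AI compute, in petaflops, as quoted in the article.
CS3_PFLOPS = 125.0
B200_PFLOPS = 4.4
DGX_B200_PFLOPS = 36.0
GPUS_PER_DGX = 8

# How many NVIDIA units match one CS-3 on peak compute.
dgx_equivalents = CS3_PFLOPS / DGX_B200_PFLOPS
b200_equivalents = CS3_PFLOPS / B200_PFLOPS

# Cross-check: per-GPU figure summed across one DGX B200.
summed_dgx_pflops = B200_PFLOPS * GPUS_PER_DGX

print(f"One CS-3 ~ {dgx_equivalents:.1f} DGX B200 servers")
print(f"One CS-3 ~ {b200_equivalents:.1f} B200 GPUs")
print(f"8 x B200 = {summed_dgx_pflops:.1f} PF (quoted DGX figure: {DGX_B200_PFLOPS:.0f} PF)")
```

These are peak figures, so the comparison says nothing about sustained throughput on a given model.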

For memory, I find that Cerebras registers a key differentiator by building its hardware on a disaggregated memory architecture. In contrast to the architectural approach of using small amounts of high-bandwidth memory (HBM) close to the GPU, Cerebras developed its dedicated, external memory device branded MemoryX to store weights. MemoryX uses flash and DRAM alongside a custom software stack to pipeline load/store requests with minimal latency. Designed for GPT-5 and beyond, the company’s 1.2 PB Hyperscale SKU trains models with 24 trillion parameters. As a result, it has 6,000x more memory capacity than a B200 GPU, over 700x more memory than a DGX B200, and over 80x the memory capacity of a full rack NVL72.
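The capacity multiples above follow from the quoted memory sizes. In the sketch below, the NVL72 capacity is an assumption: the article does not state it, so it is estimated as 72 GPUs at 192 GB each. Decimal units (1 PB = 1,000 TB) are also assumed.

```python
# Memory capacities in GB (decimal units assumed).
CS3_MEMORYX_GB = 1_200_000        # 1.2 PB Hyperscale SKU, as quoted
B200_GB = 192                     # per-GPU memory, as quoted
DGX_B200_GB = 1_500               # 1.5 TB per DGX B200, as quoted
NVL72_GB = 72 * B200_GB           # ASSUMPTION: 72 GPUs x 192 GB, not stated in the article

vs_b200 = CS3_MEMORYX_GB / B200_GB
vs_dgx = CS3_MEMORYX_GB / DGX_B200_GB
vs_nvl72 = CS3_MEMORYX_GB / NVL72_GB

print(f"vs. one B200:     {vs_b200:,.0f}x")
print(f"vs. one DGX B200: {vs_dgx:,.0f}x")
print(f"vs. one NVL72:    {vs_nvl72:,.0f}x")
```

Under these assumptions the ratios come out at or above the article's "6,000x", "over 700x", and "over 80x" figures.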

The memory capacity implications are profound. For instance, a 100B parameter model requires over 2 TB of memory. On GPU infrastructure, that would require 12 B200s. The same model can be stored and trained on a single CS-3 with 2.4 TB MemoryX. A 1 trillion parameter model requires over a hundred B200s; only a single CS-3 1.2 PB rack is needed to train the same model. A 10 trillion parameter model requires over 200 TB of memory, or over a thousand B200s, a costly proposition for most organizations. Using Cerebras hardware, a single CS-3 with 1.2 PB of MemoryX attached can load the model as easily as a GPU loads a 1B parameter model.
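The GPU counts in this paragraph can be approximated with a short calculation. The article's figures (over 2 TB for 100B parameters, over 200 TB for 10 trillion) imply roughly 20 bytes of training state per parameter. That ratio is an assumption derived from those numbers, not a figure the article states, and the function names below are hypothetical.

```python
import math

BYTES_PER_PARAM = 20   # ASSUMPTION: implied by "100B parameters -> over 2 TB"
B200_MEMORY_GB = 192   # per-GPU memory, as quoted in the article


def training_memory_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate training memory footprint in GB (decimal units)."""
    return num_params * bytes_per_param / 1e9


def b200s_needed(num_params: int) -> int:
    """Minimum number of B200 GPUs whose combined memory holds the model."""
    return math.ceil(training_memory_gb(num_params) / B200_MEMORY_GB)


for label, params in [("100B", 100 * 10**9), ("1T", 10**12), ("10T", 10 * 10**12)]:
    print(f"{label:>4}: {training_memory_gb(params) / 1000:,.0f} TB, "
          f"at least {b200s_needed(params)} B200s")
```

At exactly 2 TB the division yields 11 GPUs for the 100B case; the article's figure of 12 reflects a footprint somewhat over 2 TB. The 1T and 10T cases come out just above one hundred and one thousand GPUs, matching the text. The estimate covers memory capacity only and ignores activation memory and parallelism overheads.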

On power efficiency, the CS-3 consumes 23 kW peak while the DGX B200 consumes 14.3 kW. At first glance, this would appear to be one major category in which NVIDIA has an edge. However, the CS-3 is significantly faster, providing 125 petaflops in relation to the 36 petaflops of the DGX B200. This translates to a 2.2x improvement in performance per watt, resulting in more than halved power expenses over the system’s operational lifespan. As a result, I see that model trainers and IT buyers stand to gain a net positive total cost of ownership benefit.
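The performance-per-watt figure can be checked from the four numbers quoted above. This is a minimal sketch using peak compute and peak power as stated in the article; the variable names are illustrative.

```python
# Peak compute (petaflops) and peak power (kW), as quoted in the article.
CS3_PFLOPS, CS3_KW = 125.0, 23.0
DGX_PFLOPS, DGX_KW = 36.0, 14.3

cs3_pf_per_kw = CS3_PFLOPS / CS3_KW
dgx_pf_per_kw = DGX_PFLOPS / DGX_KW

# Relative efficiency, and the energy needed for the same amount of work.
efficiency_ratio = cs3_pf_per_kw / dgx_pf_per_kw
relative_energy = 1 / efficiency_ratio

print(f"CS-3:     {cs3_pf_per_kw:.2f} PF/kW")
print(f"DGX B200: {dgx_pf_per_kw:.2f} PF/kW")
print(f"Efficiency ratio: {efficiency_ratio:.2f}x")
print(f"Energy for equal work: {relative_energy:.0%} of the DGX B200")
```

The ratio is about 2.16, which rounds to the 2.2x cited, and the energy for equal work falls below half, consistent with "more than halved" power expenses at peak utilization.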

The Cerebras Touch: Cerebras Custom AI Model Services

In a briefing conversation, Andy Hock, SVP, Product & Strategy, spotlighted how Cerebras Custom AI Model Services are integral to the overall differentiation of the company’s proposition. Through its services, Cerebras pairs expert AI strategy with swift model development to deliver the GenAI solutions best suited for its customers including the vetting of AI models and data sets to determine which can deliver the best outcomes.

Machine learning researchers work side by side with domain experts to ensure that the customer’s proprietary data aligns optimally with the AI models and data sets selected. The services include the delivery of custom chatbots, code completion, summarization, and classification, with visual question answering coming soon.

Key Takeaways: Cerebras WSE-3 Laps NVIDIA DGX B200

I believe that Cerebras has demonstrated why the CS-3, powered by the WSE-3, can be viewed as the fastest AI chip across the entire AI ecosystem, including the NVIDIA Blackwell platform, despite NVIDIA’s positioning of that offering as a new class of AI superchip. The CS-3 now represents a captivating choice for ML engineers and organizations that prioritize training large AI models with care and speed to deliver improved business outcomes.

Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.

Image Credit: Cerebras

Author Information

Ron is an experienced, customer-focused research expert and analyst, with over 20 years of experience in the digital and IT transformation markets, working with businesses to drive consistent revenue and sales growth.

He is a recognized authority at tracking the evolution of and identifying the key disruptive trends within the service enablement ecosystem, including a wide range of topics across software and services, infrastructure, 5G communications, Internet of Things (IoT), Artificial Intelligence (AI), analytics, security, cloud computing, revenue management, and regulatory issues.

Prior to his work with The Futurum Group, Ron worked with GlobalData Technology creating syndicated and custom research across a wide variety of technical fields. His work with Current Analysis focused on the broadband and service provider infrastructure markets.

Ron holds a Master of Arts in Public Policy from University of Nevada — Las Vegas and a Bachelor of Arts in political science/government from William and Mary.

