Cerebras Achieves 100x Defect Tolerance in Wafer-Scale Processors

The Yield Challenge in Semiconductor Manufacturing

Conventional wisdom in semiconductor manufacturing has long held that larger chips yield worse. However, Cerebras has successfully built and commercialized a chip roughly 50 times larger than the largest conventional processors, achieving yields comparable to those of far smaller chips. This breakthrough raises the question: How is usable yield achieved with a wafer-scale processor?

Rethinking Fault Tolerance and Chip Size

The key to Cerebras’ success lies in reimagining the relationship between chip size and fault tolerance. By comparing the manufacturing yields of the Cerebras Wafer Scale Engine and a traditional H100-sized chip—both fabricated at a 5nm process node—Cerebras demonstrates the critical role of defect rates, core size, and fault tolerance in achieving wafer-scale integration with equal or superior yields compared to reticle-limited GPUs.

Understanding Yield and Defect Tolerance

Historically, larger chips were more susceptible to defects, causing exponential declines in yields with increasing die area. However, as transistor budgets expanded, chip designers began to incorporate core-level fault tolerance, allowing processors with multiple cores to function even if some cores were defective. This design paradigm shift, embraced by companies like Intel, Nvidia, and AMD, has become standard practice in the industry.
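The exponential yield decline can be sketched with a simple Poisson yield model, Y = exp(-A·D0). This is a textbook approximation, not a foundry-published model, and the defect density used below is an assumed illustrative figure:

```python
import math

def die_yield(area_cm2: float, defect_density: float) -> float:
    """Poisson yield model: probability that a die of the given area
    contains zero fatal defects, for defect_density defects per cm^2."""
    return math.exp(-area_cm2 * defect_density)

# Illustrative defect density (assumed, not a published foundry figure).
D0 = 0.1  # defects per cm^2

for area_mm2 in (100, 400, 814):  # small die, mid-size die, H100-sized die
    y = die_yield(area_mm2 / 100, D0)
    print(f"{area_mm2:4d} mm^2 -> {y:.1%} expected yield")
```

Without fault tolerance, an H100-sized die would yield under half as often as a 100mm² die under this model, which is exactly why core-level redundancy became standard.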

The Role of Defect Tolerance in Modern CPUs and GPUs

In today’s high-performance processors, fault tolerance is crucial. It is common for CPUs and GPUs to ship with some cores disabled due to defects. For instance, Nvidia’s H100 GPU, which measures 814mm², physically includes 144 streaming multiprocessors (SMs), but commercial products ship with only 132 active. The 12 spare SMs let the chip tolerate up to a dozen defective cores without losing functionality.
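The effect of those 12 spares can be illustrated with a toy binomial model. The per-SM defect probability below is an assumed figure for illustration, not Nvidia data:

```python
import math

def p_ship(n_cores: int = 144, n_spare: int = 12, p_defect: float = 0.05) -> float:
    """Probability a die is sellable: at most n_spare of its n_cores are
    defective, with each core independently defective with probability
    p_defect (an illustrative assumption, not a measured rate)."""
    return sum(
        math.comb(n_cores, k) * p_defect**k * (1 - p_defect) ** (n_cores - k)
        for k in range(n_spare + 1)
    )

# With no spares, almost every die has at least one bad SM and is scrap;
# with 12 spares, the vast majority of dies are sellable.
print(f"0 spares:  {p_ship(n_spare=0):.1%} sellable")
print(f"12 spares: {p_ship(n_spare=12):.1%} sellable")
```

This is the same trade-off the article describes: a modest amount of redundancy converts near-certain die loss into near-certain die survival.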

Advancing Wafer-Scale Technology at Cerebras

Designing Fault-Tolerant, Small Cores

To build a wafer-scale chip, Cerebras designed exceptionally small fault-tolerant cores. Each AI core in the Wafer Scale Engine 3 is merely 0.05mm², or about 1% the size of an H100 SM core. This innovative approach enhances fault tolerance by minimizing the silicon area affected by each defect.
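The "about 1%" figure follows directly from the article's own numbers, as a quick back-of-envelope check shows:

```python
# Sanity-check of the "about 1%" claim, using only figures from the article.
h100_die_mm2 = 814    # H100 die area
h100_sms = 144        # physical SMs on the die
wse_core_mm2 = 0.05   # WSE-3 AI core area

sm_area = h100_die_mm2 / h100_sms   # ~5.65 mm^2 per SM (rough, ignores uncore area)
ratio = wse_core_mm2 / sm_area      # ~0.009, i.e. about 1%
print(f"SM area ~{sm_area:.2f} mm^2; a WSE core is {ratio:.1%} of an SM")
```

Note this treats the whole die as SM area; since some of the 814mm² is memory controllers and I/O, the true per-SM area is somewhat smaller, but the "about 1%" conclusion holds either way.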

Dynamic Routing Architecture

Beyond small cores, Cerebras developed an advanced routing architecture capable of dynamically reconfiguring connections between cores to bypass defects. This routing system, coupled with a reserve of spare cores, allows the WSE to maintain high yields with minimal redundancy.
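Cerebras has not published the details of this routing fabric, but the idea of steering traffic around a dead core on a 2D mesh can be sketched with a toy breadth-first-search router. Everything here (grid size, coordinates, the BFS itself) is an illustrative stand-in, not the WSE's actual mechanism:

```python
from collections import deque

def route(grid_w, grid_h, defective, src, dst):
    """Find a shortest path from src to dst on a 2D mesh of cores,
    stepping only through non-defective neighbours (BFS)."""
    frontier, seen = deque([(src, [src])]), {src}
    while frontier:
        (x, y), path = frontier.popleft()
        if (x, y) == dst:
            return path
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nxt = (nx, ny)
            if (0 <= nx < grid_w and 0 <= ny < grid_h
                    and nxt not in defective and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # destination unreachable

# A defect at (1, 0) forces traffic around the dead core.
path = route(3, 3, defective={(1, 0)}, src=(0, 0), dst=(2, 0))
print(path)
```

The key property, which the real fabric shares, is that a defective core costs only its own tile plus a slightly longer route, not the whole neighbourhood around it.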

Comparative Analysis: Traditional GPU vs. Wafer-Scale Chip

Yield Calculations Using TSMC 5nm Process

Analyzing yield dynamics: roughly 72 H100-sized dies fit on a 300mm wafer, and at a typical 5nm defect density those dies collectively accumulate around 59 defects. Because each defect disables a large SM, and a die with too many defects is discarded entirely, significant silicon area is lost. In contrast, the Cerebras Wafer Scale Engine 3, whose 46,225mm² square covers somewhat less wafer area than the 72 dies combined, encounters slightly fewer defects, and each one disables only a tiny 0.05mm² core. The result is dramatically less silicon lost per defect, which Cerebras cites as a 164x improvement in fault tolerance compared to traditional GPUs.
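A rough silicon-loss comparison can be reconstructed from the article's figures alone. This naive per-defect calculation gives about a 113x ratio; the article's 164x figure presumably also accounts for effects this sketch ignores, such as whole dies being scrapped once their spare SMs are exhausted:

```python
# Back-of-envelope silicon loss per defect, using only the article's figures.
defects = 59                  # defects across the wafer's worth of H100 dies
sm_mm2 = 814 / 144            # H100: one defect can disable a ~5.65 mm^2 SM
wse_core_mm2 = 0.05           # WSE-3: one defect disables a 0.05 mm^2 core

gpu_loss = defects * sm_mm2       # silicon lost if every defect costs one SM
wse_loss = defects * wse_core_mm2 # silicon lost if every defect costs one core
print(f"GPU-style loss: {gpu_loss:.0f} mm^2; WSE-style loss: {wse_loss:.1f} mm^2")
print(f"Per-defect loss ratio: ~{sm_mm2 / wse_core_mm2:.0f}x")
```

Even this conservative version of the calculation shows why tiny cores dominate: the same number of defects costs two orders of magnitude less silicon.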

Maximizing Silicon Utilization

By employing a combination of small, fault-tolerant cores and a dynamic routing architecture, Cerebras achieves 93% silicon utilization in its third-generation Wafer Scale Engine—outperforming leading AI accelerators and showcasing the commercial viability of wafer-scale computing.

Cerebras’ innovative design strategies not only resolve the challenges of wafer-scale manufacturing but also establish a new benchmark for commercial scalability and efficiency in the semiconductor industry.