
A Look at Cerebras Wafer-Scale Engine: Half Square Foot Silicon Chip - rbanffy
https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer-scale-engine-half-square-foot-silicon-chip/
======
ZhuanXia
Interesting things I’ve learned about this chip after a little sleuthing.

CEO implies in the FYI podcast that it can handle models of up to ~4
billion parameters per wafer. Respectable, but not as large as I assumed would
be possible when I first read of the scale of the chip.

CEO claims model parallelism will actually work well with these devices. Would
be intriguing to know the limits of this.

Based on the cooling requirements, at least a GHz clock speed appears likely.
If we take the heretical position that 1 parameter is >= 1 synapse, 4 billion
parameters is about the size of a bee brain. 1GHz is about 5 million times
faster than a bee brain.

It would take about 20000 such chips to simulate a human-sized network. Not
economical and presumably the model parallelism would break long before this.
But it is interesting to note that under the not implausible 1 parameter is >=
1 synapse assumption, we are only a few orders of magnitude away from human-
sized networks training 5 million times real-time.
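As a sanity check on those figures, the arithmetic can be sketched out. The synapse counts and the ~200 Hz biological firing rate below are rough literature values I'm assuming, not numbers from the article:

```python
# Back-of-envelope check of the scaling claims above.
# Assumptions (rough, not from the article): a bee brain has ~1e9
# synapses, a human brain ~8e13, and biological neurons fire at ~200 Hz.

params_per_wafer = 4e9        # CEO's claimed max model size
bee_synapses = 1e9            # rough literature figure
human_synapses = 8e13         # rough literature figure (order 1e14)
bio_rate_hz = 200             # typical upper biological firing rate
chip_clock_hz = 1e9           # 1 GHz, inferred from cooling requirements

speedup = chip_clock_hz / bio_rate_hz
print(f"clock speedup vs biology: {speedup:.0e}x")        # 5e+06x

wafers_for_human = human_synapses / params_per_wafer
print(f"wafers for human-scale: {wafers_for_human:.0f}")  # 20000
```

So the "5 million times faster" and "~20000 chips" figures are mutually consistent under those assumptions.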

~~~
K0balt
OK, a stupid question. Since this ANN is ~5M× as fast as animal neurons,
couldn't it be "multiplexed" to simulate a much larger network in biological
time (with massive state-storage memory, I presume)? I realize there will be
propagation dependencies, but each dependency layer could be precomputed before
the next, I would guess?

Or is there a reason that this won't work (or at least won't be worth it) for
ANN structures?

~~~
0-_-0
> Or is there a reason that this won't work (or at least won't be worth it)
> for ANN structures?

If all the weights and activations fit in on-chip memory, then you can do
calculations at close to 100% efficiency. If you want to simulate a 20k-times-
larger network, you also need to transfer the 4 billion parameters per
iteration, which would take significantly longer. In other words, you
would be seriously bottlenecked.
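A rough sketch of that bottleneck. Both figures here are my own assumptions, not from the thread: fp16 weights (2 bytes each) and a hypothetical 100 GB/s off-chip link:

```python
# Rough cost of swapping a full set of weights every iteration,
# as described above.
# Assumptions (mine): fp16 weights (2 bytes each), 100 GB/s off-chip link.

params = 4e9            # parameters per wafer, per the CEO's claim
bytes_per_param = 2     # fp16, assumed
link_bw = 100e9         # bytes/s, hypothetical off-chip bandwidth

transfer_s = params * bytes_per_param / link_bw
print(f"per-iteration weight transfer: {transfer_s * 1000:.0f} ms")  # 80 ms
```

At on-chip speeds an iteration can be far shorter than that, so the swap time dominates and the multiplexing idea loses most of its speed advantage.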

------
fernly
Unfortunately that archived link doesn't extend to "page 2".

Some more info easily found at cerebras's site,

[https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-...](https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/)

> a very big chip, with memory close to cores, all connected by a high-
> bandwidth, low-latency fabric.

and at this site:

[https://www.servethehome.com/cerebras-wafer-scale-engine-ai-...](https://www.servethehome.com/cerebras-wafer-scale-engine-ai-chip-is-largest-ever/)

~~~
2sk21
Looks exactly like a mesh of Inmos Transputers except on a single chip! Back
in the late 80s, I spent months trying to get backpropagation to work on a
Meiko computing surface - but no real speedups were possible.

------
jsjohnst
Site down, but here’s another link:

[https://web.archive.org/web/20191117035933/https://fuse.wiki...](https://web.archive.org/web/20191117035933/https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer-scale-engine-half-square-foot-silicon-chip/)

~~~
anonytrary
Seems a bit odd that it's down. This post has only a few upvotes right now, so
the hug of death must have been pretty small.

~~~
Tade0
I wonder how small a small HoD really is. 10k concurrent requests?

~~~
jsjohnst
That's way high for almost anything but a company website with a dedicated
engineering team. Yes, there are exceptions (my blog could handle orders of
magnitude more traffic than that, but I had a multi-tier CDN in front of it
with 100% full-page caching; did it need it? No, but it was fun to set up).

Typically, most personal / non-professional sites that go down from an
HN HoD probably got under 500 concurrent requests.

------
orbifold
There is an academic effort that has been working on a similar concept for 8+
years. A recent paper that also discusses some of the challenges (routing
around defects, ...) is
[https://www.frontiersin.org/articles/10.3389/fnins.2019.0120...](https://www.frontiersin.org/articles/10.3389/fnins.2019.01201/full)
. If you dig into the publications and PhD theses conducted on this system,
you will also find partial answers to some of the issues raised here in the
comments (how to interconnect reticles; power supply, where the first prototype
had really scary centimeter-thick copper bars that fed the wafer). There is a
second-generation system in development.

------
peterhj
> Due to the complexity involved, Cerebras is not only designing the chip, but
> they also must design the full system. ... The WSE will come in 15U units
> with a chassis for the WSE and another one for the power and miscellaneous
> components. The final product is intended to act like any other network-
> attached accelerator over a 100 GbE.

So it's NIC-bandwidth bottlenecked. 100 GbE is in the same ballpark as
PCIe 3, but last I heard 40-100 GbE was pretty CPU-intensive compared to
alternatives.
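For reference, the ballpark comparison works out roughly like this (using the standard ~985 MB/s of usable throughput per PCIe 3 lane after 128b/130b encoding):

```python
# Comparing the WSE's 100 GbE uplink to PCIe 3 x16 usable bandwidth.

gbe_100 = 100e9 / 8       # 100 Gb/s -> bytes/s
pcie3_x16 = 16 * 985e6    # ~985 MB/s per lane after 128b/130b encoding

print(f"100 GbE  : {gbe_100 / 1e9:.1f} GB/s")    # 12.5 GB/s
print(f"PCIe3 x16: {pcie3_x16 / 1e9:.1f} GB/s")  # 15.8 GB/s
```

Same ballpark indeed, though 100 GbE comes in somewhat below a full x16 slot.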

------
kingludite
The performance jump blows the mind. If AI is going to evolve like that, our
expectations will look rather silly compared to reality. It's kinda like
[collectively] we haven't a clue what we are doing.

------
evancox100
How are they connecting the dies that span reticles? A WL-CSP flow? Or just the
regular process, but with the exposure for some of the layers offset so that
some fields overlap?

------
gwern
Still no benchmarks, so this doesn't add all _that_ much to the earlier
coverage, unfortunately.

------
osamagirl69
>They re-purposed the scribe lines – the mechanical barrier between two
adjacent dies that are typically used for test structures and eventually for
strangulation...

Glad I don't work at that fab! Would much rather my dies be singulated than
strangulated

------
amelius
Can't deep learning computations be structured such that a large number of
smaller interconnected dies (but with the same total area) give the same
performance as this wafer?

------
ianai
Is the reason this hasn’t been considered for CPUs before now that Intel and
AMD don't want to produce wafer-scale chips?

~~~
IshKebab
This is a 5 kW chip. It's like trying to cool 2 kettles running continuously.
Very difficult!

To even get power into the chip they have to have wires running perpendicular
to the board - ordinary traces don't cut it.

This is a really far-out design that isn't appropriate for the mass market at
all.

~~~
jcims
Two kettles in the UK, closer to three in the US, where circuits typically top
out at 1800 W. I actually looked into the feasibility of installing some 240 V
circuits in my kitchen and ordering kettles/blenders/etc. from Europe... too
much work lol.
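The kettle math, spelled out (assuming a typical ~3 kW UK kettle and the 1800 W US circuit limit of 15 A × 120 V):

```python
# How many continuously-running kettles equal one 5 kW chip?

chip_w = 5000        # claimed WSE power draw
uk_kettle_w = 3000   # typical UK kettle (13 A @ ~230 V), assumed
us_kettle_w = 1800   # US circuit limit: 15 A * 120 V

print(round(chip_w / uk_kettle_w, 1))  # 1.7 kettles in the UK
print(round(chip_w / us_kettle_w, 1))  # 2.8 kettles in the US
```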

------
JohnJamesRambo
Can someone provide info on how you cool this beast?

~~~
Merrill
A "cold plate" of chilled water cooled metal in direct contact with the wafer.

~~~
rbanffy
If I were to build a box out of these, I'd do immersion cooling and place the
chip(s) across the front of the machine behind laminated glass.

Keeping the fluid cool would be left to the ugly parts nobody would ever see.

~~~
Merrill
Immersion cooling might not work since it depends on convection flow to move
the heated liquid away from the plate. With a cold plate, pumps and pressure
can be used to move the chilled water so that water in contact with the plate
is always cool and heat is extracted and moved to the chiller more rapidly.

~~~
rbanffy
You could circulate the fluid. Since it'd be boiling, temperature would remain
constant near the surfaces and bubbles would immediately displace hot fluid. A
condenser and heat exchanger would recycle the vapor back into the tank.

