
Cerebras Wafer Scale Engine Gen2 7nm 2.6T Transistors - ramshanker
https://www.servethehome.com/cerebras-wafer-scale-engine-gen2-7nm-2-6t-transistors/
======
segfaultbuserr
How do you connect such an enormous chip to a circuit board? What packaging
and wire bonding technologies are available? I don't think it uses a
conventional package like a (Flip-Chip) BGA - the process of soldering a
65535-pin BGA, and instantly losing millions of dollars if there are defective
joints, is simply unimaginable. The huge chip also has a heat dissipation
problem that conventional packaging is unable to solve.

~~~
aveni
A lot of system-related questions were answered at HotChips last year:
[https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02...](https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf)

And SC19:
[https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...](https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpupload.com/wp-content/uploads/2019/11/Cerebras_for_SC19.pdf)

~~~
npunt
It's mind-boggling to consider what it takes to power and cool a 15 kilowatt
'chip'. That is a huge amount of electricity.

~~~
bleepblorp
For comparison's sake, 15 kW is enough power to heat a house in a moderate
winter climate. Getting a furnace's worth of power into and out of a chip
without melting it is, indeed, very impressive.

~~~
segfaultbuserr
15 kW is more than six times what can be (legally) drawn from a standard
European 230 V, 10 A outlet. With a 2.3 kW wall outlet, you can already power
a Bitcoin miner or a small IBM mainframe computer. It's indeed impressive.

Which brings us to the next topic: What does the power supply look like, and
how does power delivery work? Given the huge amount of power, the PSU for the
chip alone is an interesting piece of electronics. Assuming its power supply
is similar to a PC's or a server's: first, the mains power is converted to a
system DC voltage such as 12 V. For a typical modern, cheap, boring
switched-mode power supply, the standard efficiency requirement is 80%. In
this case, that means the PSU itself would waste 3 kW, which is unacceptable
from both an environmental and a thermal perspective. A really good supply can
achieve 95%-98% efficiency, and this is where things get interesting: you'll
see expensive controller chips and high-power, high-frequency transistors in
such a PSU - terms like SiC, GaN, or IGBT come to mind - switching 1,250 amps
on and off. And it doesn't end there: the next step is converting the 12 V
system voltage down to the chip's core voltage, say 1.0 V, with a Voltage
Regulator Module near or on the motherboard. Never mind the technical
challenge of designing such a VRM - I can't even imagine how they route that
much current across the motherboard; it's 15,000 amps!

Unfortunately, there's no public information about the power supply or power
delivery.
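The arithmetic above can be checked in a few lines of Python. The 15 kW load and the 12 V / 1.0 V rails are the figures from the comment; the efficiency values are illustrative, not specifications of any real Cerebras supply:

```python
# Back-of-envelope power-delivery numbers for a hypothetical 15 kW chip.
# Treats 15 kW as the power drawn from the mains; all figures illustrative.

P_CHIP = 15_000.0   # W, assumed total draw
V_SYS = 12.0        # V, system DC rail
V_CORE = 1.0        # V, assumed core voltage

def psu_loss(p_in: float, efficiency: float) -> float:
    """Heat dissipated inside the PSU for a given input power."""
    return p_in * (1.0 - efficiency)

i_sys = P_CHIP / V_SYS    # current on the 12 V rail
i_core = P_CHIP / V_CORE  # current the VRMs must deliver at 1.0 V

print(f"80% PSU wastes {psu_loss(P_CHIP, 0.80):.0f} W as heat")
print(f"96% PSU wastes {psu_loss(P_CHIP, 0.96):.0f} W as heat")
print(f"12 V rail: {i_sys:.0f} A, core rail: {i_core:.0f} A")
```

The jump from 3 kW to 600 W of waste heat is exactly why high-efficiency topologies become mandatory at this scale.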

~~~
avianes
There are probably several power domains in the chip and therefore several
separate power supplies, which avoids the need to manage 15kW in a single
spot.

Edit: In the slides shared above [1] (page 16), it says "12x 4kW hot-swappable
universal PSUs".

[1] :
[https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...](https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpupload.com/wp-content/uploads/2019/11/Cerebras_for_SC19.pdf#page=16)
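A quick capacity check on that figure. Note the ~20 kW total system draw is an assumption (15 kW for the wafer plus cooling and conversion losses), not a published Cerebras number:

```python
# Redundancy headroom of "12x 4 kW hot-swappable PSUs".
# The 20 kW total system load is an assumption, not a Cerebras spec.

N_PSU, P_PSU = 12, 4_000.0   # units, watts each
P_SYSTEM = 20_000.0          # W, assumed total system draw

total = N_PSU * P_PSU                           # installed capacity
spare_units = int((total - P_SYSTEM) // P_PSU)  # PSUs that may fail

print(f"Installed: {total / 1000:.0f} kW; "
      f"load still covered with {spare_units} PSUs failed")
```

So beyond splitting the load across power domains, the sizing also buys substantial N+M redundancy for hot-swapping.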

------
Rafuino
Did anyone put their first gen to use? Saw that huge chip last year (or maybe
the year before at this point... time doesn't exist anymore...), but I never
heard of actual users or buyers.

~~~
Cthulhu_
I'm sure companies that do a lot with AI - the FAANGs, universities - would
have their hands on these to see if they're viable for their businesses,
either for their own use or to rent out to customers through their cloud
offerings. But they don't really advertise with it, probably corporate
secrecy.

I think it has the same customer base as the quantum processors of today.

~~~
petra
Is there some recent cloud offering that might be using Cerebras systems?

------
etaioinshrdlu
The most interesting part of this to me is the staggering amount of SRAM on
the processor.

I wouldn't have thought this much SRAM would be practical from a size and
cost standpoint compared to DRAM.

~~~
modeless
Yeah, but it's not enough. The whole chip doesn't have enough SRAM to hold a
tenth of GPT-3, let alone GPT-4. These are the kinds of models you would hope
a chip like this would help with, but as soon as the model gets too big to fit
in the SRAM then the advantage goes way down. And I think the models of the
future will continue to get much, much bigger.
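The arithmetic behind that claim, using the published 40 GB of WSE-2 on-chip SRAM and GPT-3's 175B parameters (the bytes-per-parameter values are assumptions about numeric precision, not anything Cerebras states):

```python
# Can GPT-3's weights fit in the WSE-2's on-chip SRAM?
# 40 GB SRAM and 175B parameters are published figures;
# bytes-per-parameter is an assumption about precision.

SRAM_BYTES = 40e9
GPT3_PARAMS = 175e9

fractions = {}
for name, bytes_per_param in [("fp32", 4), ("fp16", 2)]:
    model_bytes = GPT3_PARAMS * bytes_per_param
    fractions[name] = SRAM_BYTES / model_bytes
    print(f"{name}: weights = {model_bytes / 1e9:.0f} GB, "
          f"SRAM holds {fractions[name]:.1%}")
```

Even at fp16, and before counting activations and optimizer state, the weights alone are nearly an order of magnitude larger than the on-chip memory.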

~~~
petra
There are a few companies doing analog compute-in-memory for ML. Usually on
flash processes.

Maybe cerebras could fit well for that.

As for the models of the future getting much bigger - is the GPT family
representative? Are we seeing the same growth in other domains? And isn't
there a big niche for smaller-than-GPT models?

------
ramshanker
I can't help imagining INTEL / AMD / NVIDIA doing the same with their own
chips.

Top500 Supercomputer in a single Rack. Just imagine.

And if we take liberty to dream extreme, "Available on Amazon, On Demand". :D

~~~
hinkley
Someone is going to figure out that if they cut the wafers on a diagonal they
can make even bigger wafers.

And someone will be spending a lot of time trying to figure out how to support
a wafer cut lengthwise without shattering it during processing.

~~~
Tuna-Fish
That wouldn't work. Modern chips need to be built on epitaxial surfaces, and
the only plane they can reliably produce an epitaxial surface on is the
horizontal one.

------
kregasaurusrex
Watched the talk at Hot Chips today[0]- 400,000(!) cores on a single wafer is
absolutely remarkable! There was a slide not posted in which there were
virtual segments of the wafer designated for each step, and the example given
by the presenter showed a perfectly in-order workflow execution.

[0] [https://www.anandtech.com/show/14758/hot-chips-31-live-blogs...](https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning)

~~~
lsb
Their 2020 announcement is 850K cores, which is even more impressive!

------
jszymborski
So, just wondering from a practical perspective, how do you train/serve models
from these? I doubt they play nice with e.g. PyTorch or TF.

Maybe they're meant to be used with a DL compiler like TVM [0]?

[0] [https://tvm.apache.org/](https://tvm.apache.org/)

~~~
aveni
You can use standard ML frameworks :) The code then goes through the Cerebras
Graph Compiler to produce an executable that runs on the wafer.

More of the software stack was described at HotChips today, covered by
AnandTech: [https://www.anandtech.com/show/16006/hot-chips-2020-live-
blo...](https://www.anandtech.com/show/16006/hot-chips-2020-live-blog-
cerebras-wse-programming-300pm-pt)

------
evancox100
Still wondering how they’ve handled interconnecting beyond each reticle
limit/exposure. Are they staggering the exposures for each layer? And how do
they handle packaging/IO/power delivery?

I guess these are the secret sauces and we are unlikely to get in depth
information.

~~~
gary_0
In the PDF[0] that aveni linked above, they mention they "add wires across
scribe line in partnership with TSMC, Extend 2D mesh across die". So the wafer
isn't exactly one big chip.

[0]
[https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02...](https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf)

------
genr8
Ironic that this article was posted the same day as "Why Don't They Make
BIGGER CPUs?" (Techquickie):
[https://www.youtube.com/watch?v=8JAWz9Da5og](https://www.youtube.com/watch?v=8JAWz9Da5og)

Now I know it's just a LinusTechTips-level video, but I guess they haven't
heard of the "Wafer Scale Engine" - I hadn't either - and now this proves the
video is already obsolete…

~~~
segfaultbuserr
Not obsolete yet: the technology is not mature enough, and the previous
arguments against WSI are still valid.

People have been trying Wafer-Scale Integration [0] since the 1970s - there
was quite a bit of hype about building a "super chip" back then [1] - but all
efforts failed miserably. Cerebras' success is only the beginning. Even if
this approach is workable (which remains a question), there's still at least a
decade to go from an HPC-specific chip to a general-purpose chip. Another
possibility is that WSI will forever remain a technology used only in
massively-parallel computers.

[0] [https://en.wikipedia.org/wiki/Wafer-scale_integration](https://en.wikipedia.org/wiki/Wafer-scale_integration)

[1] Giant microcircuits for superfast computers, _Popular Science_ , 1984.
[https://books.google.com/books?id=eAAAAAAAMBAJ&pg=PA66](https://books.google.com/books?id=eAAAAAAAMBAJ&pg=PA66)

~~~
gridlockd
Another possibility is that the company will simply fail, because it cannot
offer an advantage over commodity chips.

~~~
segfaultbuserr
I already mentioned it,

> _even if this approach is workable (which remains a question)_

No need to repeat.

------
blueblisters
Might it not be more optimal to attach the individual GPU dielets onto a
silicon interposer to avoid the yield issues that are inherent when making a
large monolithic chip, and save some of the overheads of dealing with dead
dielets? I think I read about some researchers/companies going that route but
haven't kept up.

~~~
lsb
Part of the genius of the Cerebras on-chip routing is that it can route around
malfunctioning dielets.

~~~
gridlockd
Even then, wouldn't it be better to just use commodity chips on an interposer?

I don't actually know, but that seems to be the route the industry is going
and economics of scale will likely make it the winning approach.

------
11thEarlOfMar
It must have some redundancy in order to survive process errors. Does it have
to be defect-free to yield?

~~~
segfaultbuserr
Answered in the PDF [0] that HN user "aveni" linked above,

[https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02...](https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf)

> Redundancy is Your Friend

> Uniform small core architecture enables redundancy to address yield at very
> low cost

> Design includes redundant cores and redundant fabric links

> Redundant cores replace defective cores

> Extra links reconnect fabric to restore logical 2D mesh
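The principle can be sketched with a toy model. This is a deliberate simplification - it retires whole rows, whereas the slides describe per-core replacement with extra fabric links, and the real remapping scheme isn't public:

```python
# Toy model of yield redundancy: map a logical 2D mesh onto a physical
# grid containing defective cores plus spare rows. Deliberately
# simplified; the real Cerebras remapping is per-core and proprietary.

def build_logical_mesh(phys_rows, logical_rows, defective):
    """Return a logical->physical row map that skips rows with defects."""
    bad_rows = {r for (r, _c) in defective}
    good_rows = [r for r in range(phys_rows) if r not in bad_rows]
    if len(good_rows) < logical_rows:
        raise RuntimeError("not enough spare rows: wafer does not yield")
    return {lr: good_rows[lr] for lr in range(logical_rows)}

# 10 physical rows, 8 logical rows -> 2 spare rows of cores
mapping = build_logical_mesh(
    phys_rows=10,
    logical_rows=8,
    defective={(2, 1), (7, 3)},   # two defective cores
)
print(mapping)   # physical rows 2 and 7 are skipped
```

The point is that software always sees a contiguous, defect-free logical mesh regardless of where the physical defects landed.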

------
debbiedowner
So does this beat 8 V100s in a DGX? And if yes what TPU v3-n is it comparable
to?

~~~
trhway
By transistor count, by energy consumption, and I'd expect by price, it's the
equivalent of 3 x 16-V100 DGX/Lambda systems. Can it beat such a mini-cluster?
Vertical vs. horizontal scaling still seems to be an open question for DL. It
looks like the horizontal approach - i.e., the V100s in this case - would
naturally have the advantage when it comes to sheer scale, i.e., the maximum
possible model size.
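The transistor-count arithmetic behind that comparison (21.1B for the V100 and 1.2T/2.6T for the two WSE generations are published figures; transistor count is only a crude proxy for capability):

```python
# Crude scale comparison by transistor count alone.
# All counts are published figures; this ignores utilization,
# memory capacity, interconnect, and everything else that matters.

V100 = 21.1e9    # NVIDIA V100
WSE1 = 1.2e12    # Cerebras WSE gen 1
WSE2 = 2.6e12    # Cerebras WSE gen 2

ratios = {name: t / V100 for name, t in [("WSE-1", WSE1), ("WSE-2", WSE2)]}
for name, r in ratios.items():
    print(f"{name} ~ {r:.0f} V100s ~ {r / 16:.1f} sixteen-GPU systems")
```

The "3 x 16" figure lines up with the gen-1 part; the gen-2 chip in the article is roughly twice that again.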

------
spicyramen
Interested in how Google TPU new generation compares to this one. Is this
comparison even reasonable?

------
mvn9
How much does it cost or rather how much does that entire rack with all the
supporting systems cost?

------
epicureanideal
What does this mean for Moore's law being dead or not? Anything?

~~~
throwaway189262
Nah, we're near the scaling limit of silicon. And huge chips are super
expensive and impractical.

One day we'll find an easier way to print chips. That won't end Moore's law,
but it may shift algorithm implementation back to hardware after decades of
software eating everything.

Medium term, they'll get a new lease on Moore's law by switching off silicon
to smaller atoms, and/or by switching to something that handles heat better so
we can go 3D. This is already happening in power electronics with GaN. My bet
is on this tech slowly spreading into digital logic.

