The Cerebras CS-2 wafer-scale engine (850k cores, 40 GB SRAM) (cerebras.net)
127 points by unwind 13 days ago | 52 comments





I remember when I was a college student at UIC and took a class that taught me how to use Maple (now they use Sage). The professor also taught supercomputing, and he bought some specialized NUMA computers and a few dense multiprocessor systems. If you don't have a large company supporting what you buy, there is a huge risk: if the vendor folds, your hardware becomes useless, to the point that you will let an undergraduate try to figure out how to fix it (that undergraduate was me). Unfortunately, I was unsuccessful.

The CS-1 had a TensorFlow / PyTorch API; the CS-2 probably has the same capability. There's still a lot of risk for sure, but at least it's not sawzall.

Was the hardware broken or was it a matter of figuring out how to run something on it?

IIRC the hardware was broken and unserviceable (at least the units I was given).

Wouldn’t these things, like quantum computers, be leased? The complexity is getting so high that you would need full time staff dedicated just to that hardware. At least if it goes belly up, you are not stuck with unsupported hardware.

I would be curious to understand what the equivalent cluster solution would look like. A big selling point here is apparently that your researchers "don't have to understand the do's and don'ts of cluster scale computing", but if your budget is already in the millions, how big of an issue is that really? Is the barrier to entry for large scale NN training really that there aren't enough engineers with experience navigating the issues on commodity/cloud hardware?

Overall I guess I'm just confused at what scale of problems this is meant for. GPT-3 cost $5M to train, GPT-2 cost ~$50k. GPT-3 is way too big to fit into 40GB of SRAM, GPT-2 fits but isn't exactly bleeding edge. If this product is $2-5M, you could train GPT-2 40-100 times to make it cost effective, but now you're locked into the scale provided by this platform, and not all problems are GPT-2 sized. I'm not an expert so happy for someone else to chime in and correct me if I've gone wrong somewhere.


GPT-2/3 is at the high end in terms of size. There's a whole universe of problems, and most of the market, below it. A great many business/research needs for at least the near future would fit into a CS-2 quite nicely.

Yes, I suppose you're right; GPT-2/3 aren't particularly representative of the "average" industry problem. Honestly, what I'm jonesing for is an insider's take on the cost/benefit of a solution like this vs. cloud, in terms of problem size, future flexibility, raw price, talent required, etc. Even just a ballpark "this is probably an order of magnitude more efficient vs. cloud GPU compute, game changer" vs. "we'd have to carefully consider and test performance" would help.

But hey, that's the hard part, right? Probably aren't many people total with experience with the CS-1, let alone browsing HN :D


So is this an available product? Everyone hates 'contact sales' 'get a demo' websites - what's the price, what's the spec, where can I order?

Since this chip is a whole wafer, we can make some back-of-the-envelope calculations.

Let's pretend this was an Intel product and that Intel's typical yield was something like 150 CPUs per wafer. Using high-end Intel CPUs as a price point, we would already be in the $300,000 price range for this chip.

But this isn't an Intel product, and it's not made at scale; it's more likely built to order. Further, they don't just sell you the wafer/chip, but a complete solution. They have to recoup development costs with far fewer units sold.

I wouldn't be surprised if this goes for $2m - $5m.

Edit: The previous generation of this product was sold "for a couple million" according to some voices on the internet.
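
For what it's worth, that estimate is easy to reproduce; the sketch below uses the assumed per-wafer CPU count from above and an assumed ~$2,000 high-end CPU price (neither is a published figure):

    # Back-of-the-envelope wafer value, using assumed numbers only:
    # ~150 high-end CPUs per 300 mm wafer at roughly $2,000 each.
    cpus_per_wafer = 150
    price_per_cpu = 2_000        # assumed high-end Intel list price, USD

    wafer_value = cpus_per_wafer * price_per_cpu
    print(f"Implied silicon value per wafer: ${wafer_value:,}")  # $300,000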


It is essentially a B2B product for a deep-pocketed kind of customer. Think energy, defense, or a big financial company (or fund).

If you can quickly calculate and iterate on something you want to build, identify where it is feasible to drill, etc., then the price of this is peanuts compared with doing the wrong thing (tm).


It's an eight-figure product, so regardless of how you or anyone else on a forum feels about it (I mean, everyone likes to window shop, including me), I suspect their sales pipelines aren't going to be hamstrung by it.

Even at eight figures it still makes sense for accurate value-at-risk calculation.

A friend who works in risk management told me that at a financial institution you must have enough cash 'at rest' to cover your part of your whole VaR.

The problem here is that VaR is hard to calculate.

For a small bank it might be simple, but for a big one exposed to thousands of financial instruments it is not simple...

... and every million that you can have at work counts.
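
To make the compute argument concrete, here is a minimal Monte Carlo VaR sketch; the portfolio, volatilities, and scenario count are all invented for illustration, and a real bank-wide VaR run repeats something like this across thousands of instruments with far richer risk models:

    import numpy as np

    # Minimal Monte Carlo value-at-risk sketch (illustrative assumptions only):
    # a tiny portfolio with normally distributed, independent daily returns.
    rng = np.random.default_rng(42)

    positions = np.array([5e6, 3e6, 2e6])    # USD exposure per instrument
    vols = np.array([0.02, 0.015, 0.03])     # assumed daily volatilities

    returns = rng.normal(0.0, vols, size=(100_000, 3))  # 100k simulated days
    pnl = returns @ positions                            # portfolio P&L per scenario

    var_99 = -np.percentile(pnl, 1)          # 99% one-day VaR
    print(f"99% 1-day VaR: ${var_99:,.0f}")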


Just like IBM mainframes, you won't be able to just "click here to buy" these.

They do seem to sell actual systems, though. For several million apiece, so not exactly a product for the masses.


These types of monster chips are probably sold as part of a solution for a specific problem, so they probably don't have a fixed price in the sense that no one buys just a chip.

It's one of those products where if you have to ask, you can't afford it.

A friend visited the Pittsburgh Supercomputing Center, where he saw one of these. Said that they throw off a LOT of heat. Stainless steel braided piping is used to move the liquid cooling around.

Their home page says 23 kW in 15 rack units.

Compare that to somewhere in the range of 5 to 25 kW for 42 rack units for a more traditional mix of equipment in a rack.

Sounds like it consumes somewhere in the range of 2.5 to 10 times more than a normal server that would fit in the same space: 1.5 kW per RU, as compared to 0.12-0.6 kW per RU for "normal" equipment.


At 1:51 of this video, you can see them fitting it to what looks like a giant heatsink, to be plugged into some very large pipes: https://youtu.be/qSqAxEXtZY0?t=111

The 1st generation units draw 23kW peak, so yeah, basically a computational space heater.

https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...


Out of curiosity, does anyone know of an actual company that's bought and uses one of these chips (say, the CS-1)? What's the use case?

They mention "low latency datacenter inference." Surely the Facebooks and Amazons of the world could do better by using localized, smaller-scale machines for inference, since their use cases are distributed geographically. The one use case that I can think of is high-frequency trading. But I can tell you that 20 PB/s of memory bandwidth is overkill, and also, you're going to have a bad time trying to cool this thing down in a colocation center that doesn't belong to you.


Under the "Industries" part of their webpage they have testimonials from GlaxoSmithKline, Lawrence Livermore National Lab, and Argonne National Lab.

This slide (https://images.anandtech.com/doci/16626/Cerebras%20WSE2%20La...) lists announced Deployments at:

Argonne National Laboratory

Lawrence Livermore National Laboratory

Pittsburgh Supercomputer Center

Edinburgh Parallel Computing Centre

GlaxoSmithKline

"Other wins in heavy manufacturing, Pharma, Biotech, military and intelligence"


There are AI models that are too big to effectively run on smaller hardware. The performance benefit of a monolithic system like this vs distributed GPU based systems can be orders of magnitude.

There is something on YouTube. They are very profitable, machine learning, filtering and stuff. No trading.


No actual specifications about precision of computations, operations supported, etc., just impressive I/O numbers.

There's better info here: https://blog.inten.to/hardware-for-deep-learning-part-4-asic...

"The instruction set supports operations on INT16, FP16, and FP32 types. Floating-point adds, multiplies, and fused multiply-accumulate (or FMAC, with no rounding of the product prior to the add) can occur in a 4-way SIMD manner for 16-bit operands. The instruction set supports SIMD operations across subtensors of four-dimensional tensors, making use of tensor address generation hardware to efficiently access tensor data in memory."

The source appears to be this PDF: https://arxiv.org/pdf/2010.03660.pdf
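
A rough sketch of what a 4-way FP16 fused multiply-accumulate amounts to; this is illustrative only, not the actual Cerebras instruction behaviour, and widening to FP32 before the add stands in for "no rounding of the product prior to the add":

    import numpy as np

    # Illustrative 4-wide multiply-accumulate over FP16 operands (not the real
    # ISA). Widening to FP32 before the add mimics keeping the full product.
    a = np.array([1.5, 2.0, -0.25, 3.0], dtype=np.float16)
    b = np.array([0.5, 1.25, 4.0, -2.0], dtype=np.float16)
    acc = np.zeros(4, dtype=np.float32)

    acc += a.astype(np.float32) * b.astype(np.float32)  # one FMAC step, 4 lanes
    print(acc)  # -> [ 0.75  2.5  -1.   -6. ]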


I just recently read the interview transcript [0] with the CEO and CTO (Jim Keller) of Tenstorrent, which I believe is in the same field. It's an interestingly deep conversation that opened my eyes a bit about AI processors and how it's not only about a chip that accelerates "AI tasks".

[0] https://www.anandtech.com/show/16709/an-interview-with-tenst...


These "AI processors" are just matrix-multiplication engines, often 8-bit or 16-bit. Maybe 4-bit in some cases.

16-bit (and smaller) just isn't enough for most compute problems. But emphasis on "most". Neural nets are clearly fine with 16-bit, and some video game lighting effects can be calculated in 16-bit and look "good enough".

Maybe a certified jewelry appraiser can tell the difference between a 1.3 refractive index and 1.35, but the typical video game player (and, dare I say, human) won't be able to tell the difference. If all the light bounces are just slightly off, as long as it's "close enough", you probably have good-enough-looking ray tracing or whatever.
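
As a quick illustration of how "slightly off" 16-bit actually is (my own numbers, using numpy's float16, not anything from the thread):

    import numpy as np

    # float16 values in [1, 2) are spaced about 0.001 apart, so a refractive
    # index of 1.35 gets stored as the nearest representable neighbour.
    print(float(np.float16(1.35)))           # 1.349609375
    print(float(np.finfo(np.float16).eps))   # 0.0009765625, the spacing in [1, 2)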

--------

I've also heard of "iterative solvers" being accelerated by 16-bit and 32-bit passes before doing a 64-bit pass. That's not applicable to all math problems, of course, but it's still a methodology where your 16-bit performance accelerates your 64-bit ultimate answer.
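
That technique is usually called mixed-precision iterative refinement. A minimal numpy sketch of the idea, with float32 standing in for the cheap low-precision pass (a real solver would reuse a low-precision factorization rather than re-solving each time):

    import numpy as np

    # Mixed-precision iterative refinement for A x = b (illustrative sketch):
    # the expensive solve runs in float32, while residuals and corrections are
    # handled in float64, pulling the answer toward float64 accuracy.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 200))
    b = rng.standard_normal(200)

    A32, b32 = A.astype(np.float32), b.astype(np.float32)
    x = np.linalg.solve(A32, b32).astype(np.float64)  # cheap low-precision solve

    for _ in range(5):
        r = b - A @ x                                  # residual in float64
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx                                        # apply the correction

    print(np.linalg.norm(b - A @ x))                   # residual shrinks each pass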


For casual readers, the trend in neural nets has changed a bit, and for training TensorFloat-32 is gaining popularity: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-prec...

Pure f16 overflows/underflows in many scenarios
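
A quick demonstration of the overflow point (TF32 keeps FP32's 8-bit exponent, so it has the same range even though its mantissa is short):

    import numpy as np

    # float16's largest finite value is 65504, so modestly large activations or
    # gradients blow up to inf; float32 (and TF32, which shares its exponent
    # range) handles them fine.
    print(float(np.finfo(np.float16).max))        # 65504.0
    print(np.float16(60000.0) * np.float16(2))    # inf (emits an overflow warning)
    print(np.float32(60000.0) * np.float32(2))    # 120000.0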


What a charmingly impractical product! I wonder what their yields are like.

I am reminded of a recording of a Seymour Cray talk where he discussed some of the challenges his company faced in actually manufacturing their computers. The challenges associated with physical packaging, cooling, and power were as interesting to me as the computing capabilities of the resulting product, if not more so. I bet there are similar interesting rabbit holes with this product.

I have some nostalgia for the past world of "supercomputers" that were actually purpose-built devices, rather than clusters of off-the-shelf hardware. Economics don't favor the purpose-built, for sure, but things like the old Connection Machines and Cray machines seem very romantic to me. This product seems a bit like an artifact from that time.

Edit: I think the Cray talk I'm thinking about is https://www.youtube.com/watch?v=8Z9VStbhplQ


I think yields are ~100%, actually; I've read that the chip is architected in such a way that if any of the cores doesn't work, the rest of the chip still works.

I figured there would be spare silicon on the die. I guess I was thinking more about the raw yields and how much over-capacity was engineered into the product to handle real-world physics. There has to be some pretty impressive engineering behind this thing.

1-2% over-capacity.

""Both Cerebras’ first and second generation chips are created by removing the largest possible square from a 300 mm wafer to create 46,000 square millimeter chips roughly the size of a dinner plate. An array of repeated identical tiles (84 of them) is built into the wafer, enabling redundancy.""


They're fabbing it at a 7nm process, too! This is really exciting stuff.

This rabbit hole got me to this Anandtech article: https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...


> I have some nostalgia for the past world of "supercomputers" that were actually purpose-built devices, rather than clusters of off-the-shelf hardware.

From mainframes to supercomputers, these parts are still far from off-the-shelf hardware. They might use components that share a common heritage with their COTS cousins (in that they're not being developed specifically for a single supercomputer), but from CPUs to interconnects you won't be able to source the components as-is, new, on the free market. Everything is still customised to varying degrees to meet customer specs and pricing.

Packaging and cooling have been solved, btw, and the company indeed sells a "plug-and-play" system complete with software support for relevant software packages (e.g. TensorFlow and friends).


> They might use components that share a common heritage with their COTS cousins (in that they're not being developed specifically for a single supercomputer)

That's what I'm talking about. This system isn't just a new arrangement of off-the-shelf x86/x64 CPUs or commodity GPU cores with exotic interconnects.

> Packaging and cooling has been solved, btw. and the company indeed sells a "plug-and-play" system

I'm aware that they have a product. My interest would be in hearing about how they solved their packaging, power, and cooling challenges. That's actually more interesting to me than the capabilities of the system to deliver neural network processing. There had to be a ton of fun engineering in figuring out how to make a system out of the raw technology.



For someone who started coding when 128 kilobytes were considered a luxury, that is just, well, incommensurate :-)

In a way, it's a throwback to that. Each "core" has access to 48 KB of SRAM, and there is no shared memory across cores (850,000 cores × 48 KB is roughly 40 GB, which is where the headline figure comes from).

48 KB, my comfort zone!

Skynet is prettier than I thought it would be.

Don't be silly, Skynet will be an IBM z

Ever since the Cerebras CS-1 launched, I have been dreaming of either Intel or AMD going all-in on wafer-scale research. Every individual university department would love its own million-dollar personal supercomputer. Even businesses that can't afford today's supercomputers would want one if wafer scale makes it affordable.

I guess only inertia is stopping them from going all-in on wafer scale.


Dumb question: how are they able to sell each of these for ~ $2-3M? I doubt they have sold more than 20-30 of these systems. Given the upfront cost of design, verification, wafer construction, etc, shouldn't the unit price be much higher?

Those processors apparently have a ~100% yield rate in production because a small percentage of cores exist only for redundancy and can compensate for small defects. The actual distribution of the software and the flow of data is handled later by the compiler.

Don't know much about the development costs though.

Edit: I got that info from a YT video [0], which talks a bit about it and about the second generation of the chip.

[0] https://www.youtube.com/watch?v=FNd94_XaVlY


If I remember correctly, Intel did something similar with the i486SX - most were just i486DX chips with a faulty FPU disabled.

Not sure how they did it. Probably burning the SX microcode after testing?


Unless they are able to introduce a smaller module, like a "GPU"-sized version, it will only be a nice demo product of what we can do with state-of-the-art semiconductor manufacturing.

I'm sure people had the same things to say about computers that required the space of entire rooms in the 40s.

"That's nice but we'll never be able to scale this down any more."


Isn’t scaling it down just the normal CPUs we have now?

Where do you think all those next-gen deep learning models for autonomous weapons are going to be trained? For more civil applications, imagine GPT3.


