Overall I guess I'm just confused about what scale of problems this is meant for. GPT-3 cost ~$5M to train, GPT-2 ~$50k. GPT-3 is way too big to fit into 40GB of SRAM; GPT-2 fits but isn't exactly bleeding edge. If this product is $2-5M, you'd have to train GPT-2 40-100 times for it to be cost effective, but then you're locked into the scale this platform provides, and not all problems are GPT-2 sized. I'm not an expert, so happy for someone else to chime in and correct me if I've gone wrong somewhere.
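For a rough sense of the "fits in 40GB" question, here's a back-of-envelope sketch (my assumptions: FP16 weights at 2 bytes/param, ignoring activations, gradients, and optimizer state, which make training far more demanding):

    # Approximate parameter counts; weight footprint only, in FP16.
    models = {"GPT-2": 1.5e9, "GPT-3": 175e9}
    sram_gb = 40
    for name, params in models.items():
        weights_gb = params * 2 / 1e9
        print(f"{name}: ~{weights_gb:.0f} GB of weights vs {sram_gb} GB of on-wafer SRAM")
    # GPT-2: ~3 GB (fits easily), GPT-3: ~350 GB (nowhere close)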
But hey, that's the hard part, right? There probably aren't many people with CS-1 experience at all, let alone ones browsing HN :D
Let's pretend this were an Intel product and Intel's typical yield were around 150 CPUs per wafer. Using high-end Intel CPUs as a price point, we would already be in the $300,000 range for this chip.
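Back-of-envelope on that, with the per-CPU price being my own assumption (roughly a high-end Xeon list price):

    cpus_per_wafer = 150        # hypothetical Intel-like yield from above
    price_per_cpu = 2_000       # assumed high-end Intel CPU list price, USD
    print(cpus_per_wafer * price_per_cpu)   # 300000 -> the ~$300k ballpark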
But this isn't an Intel product and isn't made at scale; more likely it's built to order. Further, they don't just sell you the wafer/chip, but a complete solution. They have to recoup development costs with far fewer units sold.
I wouldn't be surprised if this goes for $2m - $5m.
Edit: The previous generation of this product was sold "for a couple million" according to some voices on the internet.
If you can quickly calculate and iterate on something you want to build / identify where it is feasible to drill / etc., then the price of this is peanuts compared with doing the wrong thing (tm).
A friend who works in risk management told me that at a financial institution you must have enough cash 'at rest' to cover your part of your whole VaR (value at risk).
The problem is that VaR is hard to calculate.
For a small bank it might be simple, but for a big one exposed to thousands of financial instruments it is not simple...
... and every million that you can have at work counts.
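To make the compute cost concrete, here's a minimal Monte Carlo VaR sketch; the portfolio, volatilities, and scenario count are illustrative assumptions, and a real book has thousands of correlated instruments plus much heavier repricing per scenario:

    import numpy as np

    # Toy Monte Carlo VaR: simulate one-day portfolio P&L, read off a loss quantile.
    rng = np.random.default_rng(42)
    n_instruments, n_scenarios = 1_000, 50_000

    positions = rng.uniform(-1e6, 1e6, n_instruments)     # USD exposure per instrument
    daily_vol = rng.uniform(0.005, 0.03, n_instruments)   # assumed daily return volatilities

    returns = rng.standard_normal((n_scenarios, n_instruments)) * daily_vol
    pnl = returns @ positions                              # P&L per scenario
    var_99 = -np.quantile(pnl, 0.01)                       # loss exceeded in 1% of scenarios
    print(f"99% one-day VaR: ${var_99:,.0f}")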
They do seem to sell actual systems, though. For several million apiece, so not exactly a product for the masses.
Compare that to somewhere in the range of 5 to 25 kW across 42 rack units for a more traditional mix of equipment in a rack.
Sounds like it consumes somewhere in the range of 2.5 to 10 times more than normal servers that would fit in the same space: 1.5 kW per RU, compared with 0.12-0.6 kW per RU for "normal equipment".
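Just redoing the division on those figures:

    # Per-rack-unit power density implied by the numbers above.
    rack_units = 42
    for total_kw in (5.0, 25.0):
        print(f"{total_kw} kW / {rack_units}U = {total_kw / rack_units:.2f} kW per RU")
    # ~0.12 to ~0.60 kW/RU for a traditional rack, versus ~1.5 kW/RU for this box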
They mention "low latency datacenter inference." Surely the Facebooks and Amazons of the world could do better by using localized, smaller-scale machines for inference, since their use cases are distributed geographically. The one use case I can think of is high-frequency trading. But I can tell you that 20PB/s of memory bandwidth is overkill, and also, you're going to have a bad time trying to cool this thing down in a colocation center that doesn't belong to you.
Argonne National Laboratory
Lawrence Livermore National Laboratory
Pittsburgh Supercomputer Center
Edinburgh Parallel Computing Centre
"Other wins in heavy manufacturing, Pharma, Biotech, military and intelligence"
"The instruction set supports operations on INT16, FP16, and FP32 types. Floating-point adds, multiplies, and fused multiply-accumulate (or FMAC, with no rounding of the product prior to the add) can occur in a 4-way SIMD manner for 16-bit operands. The instruction set supports SIMD operations across subtensors of four-dimensional tensors, making use of tensor address generation hardware to efficiently access tensor data in memory."
The source appears to be this PDF: https://arxiv.org/pdf/2010.03660.pdf
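For anyone wondering about the "no rounding of the product prior to the add" bit: a fused multiply-accumulate rounds once instead of twice, which actually matters at 16-bit. A tiny generic illustration (not Cerebras-specific; the fused case is emulated in FP64 with numpy):

    import numpy as np

    a = np.float16(1.0009766)    # 1 + 2**-10, exactly representable in FP16
    b = np.float16(1.0009766)
    c = np.float16(-1.0019531)   # -(1 + 2**-9)

    unfused = np.float16(a * b) + c                                     # product rounded to FP16 first
    fused = np.float16(np.float64(a) * np.float64(b) + np.float64(c))   # single rounding at the end

    print(unfused)  # 0.0      -- the 2**-20 tail of the product was rounded away
    print(fused)    # ~9.5e-07 -- the fused form keeps it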
16-bit (and smaller) just isn't enough for most compute problems. But emphasis on "most". Neural nets are clearly fine with 16-bit, and some video game lighting effects can be calculated in 16-bit and look "good enough".
Maybe a certified jewelry appraiser can tell the difference between a 1.3 refractive index and 1.35, but the typical video game player (and dare I say, human) won't be able to tell the difference. If all the light bounces are just slightly off, as long as it's "close enough", you probably have good-enough-looking ray tracing or whatever.
I've also heard of "iterative solvers" being accelerated by 16-bit and 32-bit passes before doing a 64-bit pass. That's not applicable to all math problems of course, but that's still a methodology where your 16-bit performance accelerates your 64-bit ultimate answer.
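A minimal sketch of that idea (classic iterative refinement; here with FP32 solves and FP64 residuals, and a made-up well-conditioned test matrix):

    import numpy as np

    # Mixed-precision iterative refinement: cheap low-precision solves,
    # polished with residuals computed in high precision.
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant -> well conditioned
    b = rng.standard_normal(n)
    A32 = A.astype(np.float32)

    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)   # FP32 solve

    for _ in range(3):                                 # refinement passes
        r = b - A @ x                                  # FP64 residual
        x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

    print(np.linalg.norm(b - A @ x))                   # shrinks toward FP64-level accuracy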
Pure f16 overflows/underflows in many scenarios
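For example (numpy FP16, just to show the range limits):

    import numpy as np

    x = np.float16(300.0)
    print(x * x)             # inf: 90000 overflows FP16's max finite value of 65504
    print(np.float16(1e-8))  # 0.0: below the smallest FP16 subnormal (~6e-8), so it underflows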
I am reminded of a recording of a Seymour Cray talk where he discussed some of the challenges his company faced in actually manufacturing their computers. The challenges around physical packaging, cooling, and power were as interesting to me as the computing capabilities of the resulting product, if not more so. I bet there are similar interesting rabbit holes with this product.
I have some nostalgia for the past world of "supercomputers" that were actually purpose-built devices, rather than clusters of off-the-shelf hardware. Economics don't favor the purpose-built, for sure, but things like the old Connection Machines and Cray machines seem very romantic to me. This product seems a bit like an artifact from that time.
Edit: I think the Cray talk I'm thinking about is https://www.youtube.com/watch?v=8Z9VStbhplQ
""Both Cerebras’ first and second generation chips are created by removing the largest possible square from a 300 mm wafer to create 46,000 square millimeter chips roughly the size of a dinner plate. An array of repeated identical tiles (84 of them) is built into the wafer, enabling redundancy.""
This rabbit hole got me to this Anandtech article: https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
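Quick back-of-envelope check on the "largest possible square" figure (my own arithmetic, not from the article):

    import math

    # Largest square inside a 300 mm circle: side = diameter / sqrt(2).
    diameter_mm = 300.0
    side_mm = diameter_mm / math.sqrt(2)
    print(f"side ~{side_mm:.0f} mm, area ~{side_mm**2:,.0f} mm^2")
    # ~212 mm on a side, ~45,000 mm^2 -- right in the ballpark of the quoted 46,000 mm^2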
From mainframes to supercomputers, these parts are still far from off-the-shelf hardware. They might use components that share a common heritage with their COTS cousins (in that they're not being developed specifically for a single supercomputer), but from CPUs to interconnects you won't be able to source the components as-is, new, on the free market. Everything is still customised to varying degrees to meet customer specs and pricing.
Packaging and cooling have been solved, btw., and the company indeed sells a "plug-and-play" system complete with software support for relevant software packages (e.g. TensorFlow and friends).
That's what I'm talking about. This system isn't just a new arrangement of off-the-shelf x86/x64-based CPUs or commodity GPU cores with exotic interconnects.
> Packaging and cooling have been solved, btw., and the company indeed sells a "plug-and-play" system
I'm aware that they have a product. My interest would be in hearing about how they solved their packaging, power, and cooling challenges. That's actually more interesting to me than the capabilities of the system to deliver neural network processing. There had to be a ton of fun engineering in figuring out how to make a system out of the raw technology.
I guess only inertia is stopping them from going full scale on wafer scale.
Don't know much about the development costs though.
Edit: I got that info from a YT video, which talks a bit about it and about the second generation of the chip
Not sure how they did it. Probably burning the SX microcode after testing?
"That's nice but we'll never be able to scale this down any more."