Memory seems very low for inference on bigger models? Memory bandwidth is also massively lower than something like a 4090 or 7900 XTX? Am I missing some fairy-dust magic here?
This is purely a developer kit, letting people get used to the hardware configuration and the software stacks before the big hardware comes later in future generations.
I'd like to know what different precisions/quantizations, if any, are supported. For LLMs, 8GB is fine for playing around if the weights are quantized. And as the article mentions, it's more than enough for lots of computer vision models.
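For a rough sense of what fits in 8 GB (weights only, hypothetical 7B-parameter model, ignoring KV cache and activations):

    # Back-of-the-envelope weight memory for a hypothetical 7B-parameter model.
    PARAMS = 7e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {PARAMS * bits / 8 / 2**30:.1f} GiB")
    # fp16: 13.0 GiB (doesn't fit), int8: 6.5 GiB (tight), int4: 3.3 GiB (comfortable)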
Grayskull was made before LLMs were a thing. Their plan is like Groq's: distribute the compute graph across multiple processors to get higher effective memory and throughput by pipelining, but better-ish by having DRAM on the card so you can fit models on far fewer cards. Grayskull doesn't have this ability; the next-generation Wormhole does, by having 100GbE interfaces on the cards.
Also, the CPUs on Grayskull are 32-bit. Memory is addressed through the bank address, so it works for now, but they'll have to upgrade to 64-bit eventually.
I mean, it says LPDDR4 and nothing about how many channels, but I would be surprised if it's anything more than dual-channel. Big cache, but nothing not already seen in AMD's X3D chips.
I wonder, is the dual-use of GPUs a significant engineering bottleneck for Nvidia et al? My understanding is that for libraries like Tensorflow and PyTorch, much of the complexity is supporting the many-to-many mapping of model structures to GPU silicon in a way that reduces “compile time”, with all the tradeoff landmines you might expect there. I would imagine that abstracting this at the hardware layer is really valuable.
Are these non-GPU architectures the future of machine learning, and can we expect the big vendors to start competing here or is this a business risk for them? (robbing Peter to pay Paul, like Google investing in AI that cannibalizes their search revenue)
Tbh it’s mildly surprising they even removed NVENC, considering the overall size of the chip (in absolute terms we're only talking about low-single-digit mm² savings), and the H100 is still advertised with features targeting VM graphics/visualization... remember they also still put a full graphics pipeline with ROPs/TMUs on the chip, just no actual display hardware.
It's a tradeoff with verification and validation, not so much die space. If you include the IP, you need to run expensive (time-consuming) simulation and emulation for it. Then you need to validate it on first silicon and pray that it doesn't impact yield in HVM (high-volume manufacturing).
These days, simulation runs basically nonstop on all the FPGAs the company is able to throw at it until tape out. Then, in post-silicon validation, you have engineering teams working 24/7 to validate the device on the bench. Finally, test engineering tries to achieve yield in HVM.
Anything that isn't needed for the spec sheet, or that a designer doesn't insist on keeping, is cut to save time to market.
> I wonder, is the dual-use of GPUs a significant engineering bottleneck for Nvidia et al?
Not much. GPUs became array processing engines in the early 2000s and the amount of residual fixed function graphics hardware is tiny now.
The biggest thing is that a lot of network evaluation is done at lower precision (BF16/FP8/ternary) now, but my understanding is that training the networks still requires FP32. Since these companies are really targeting the training market, the high-precision modes needed for graphics can't be thrown away. If a company were to target end users, you could optimise for lower precision, but end users find dedicated neural-net acceleration a luxury they can get for free by buying a GPU.
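To make the split concrete, the inference half is basically a one-liner in PyTorch (stand-in model, nothing vendor-specific):

    import torch
    import torch.nn as nn

    # Stand-in network; a real model would be loaded here instead.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    # Inference: cast the whole model down to bf16 and skip gradient tracking.
    model = model.to(torch.bfloat16).eval()
    with torch.no_grad():
        y = model(torch.randn(8, 1024, dtype=torch.bfloat16))

Training, by contrast, usually keeps an FP32 copy of the weights around even when the matmuls themselves run in lower precision.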
The dual-use of GPUs is what's driving Nvidia's demand, for the most part. They bet big on more outlandish designs (eg. CUDA/Tensor cores) and mixed-precision computation, and did a lot of the hardware/software work to get it shipped out years ago. There wasn't much interest in CUDA outside niche research/industry applications for a while, it's a small miracle they kept it around long enough to see the crypto and AI booms. Now their bet is paying big dividends, and other companies are trying to figure out fast ways to leverage raster compute for general-purpose acceleration.
Apple initially developed OpenCL with Khronos as a similar-ish analog for GPU acceleration a while ago. There were a few partners that invested in it, but it suffered the same lack of demand as CUDA and languished for a while. Now Apple doesn't support OpenCL, so the purpose of a multi-platform acceleration library has kinda been scuttled. Nvidia played their cards well, and their competitors are going to feel the pain for a while unless they work together again.
I know it isn’t what you mean, but as an aside (rather than a contradiction), the gaming market is still much bigger than the ML model training one, right? So presumably people care about the graphics part in the sense that development there funds the development of the cards, haha.
I always assumed that was what killed Xeon Phi. Hard to justify a GEMM card in and of itself. However clever you get, NVIDIA will be back next year with twice as much bandwidth.
According to my googling, data center has been bigger than gaming for nvidia since 2022, and almost x2 in Q2 2023 following a crazy nosedive in gaming sales.
Nvidia burnt quite a lot of goodwill with the 40xx series cards and pricing.
Melting connectors, obscene pricing, marginal performance increase, zero availability; then, when hardly any of that was addressed, the Ti versions released, equally expensive and equally lacklustre. Combine that with patchy support for frame generation and 30xx series cards still going really strong, and you’ve got not a lot of incentive to upgrade.
I think dollars was the right figure to defeat my point. My thought was that they would be impossible for their more datacenter type competition to beat, because their R&D budget came from a much larger gaming segment. But, if they are around the same size, or the data center type users actually pay more, it is a level playing field.
I’m sure they don’t mind having both revenue streams of course! But on the other hand, they have to serve two sets of interests.
I'll bet against that. NVIDIA's margin is so large that Google/Microsoft and others can fund / are funding large teams to make TPUs and other dedicated hardware competitive. I also believe LLMs are part of the hype cycle just like crypto. In ten years the gaming market will return to being NVIDIA's primary market.
1. The hardware isn't nVidia's primary moat. CUDA is. Currently all machine learning tooling assumes you have a CUDA device.
2. The hype cycle may be real, but nobody really knows how much better AI will become. The uncertainty is driving the hype (which is obviously on the optimistic side), but it could well be that models become so good that everyone "needs" an AI chip like how we "need" a smartphone today.
3. The gaming/graphics market is probably going to slowly shrink. There's only so many pixels you can push to the eye. There's very real diminishing returns there. At some point people literally won't be able to tell the difference between the newer/better/faster graphics and the older generation, at which point the GPU vendor's margins drop significantly.
If by "LLMs are part of the hype cycle" you mean "people are hyping them more than their current capabilities entirely warrant" maybe. If you mean "LLMs are a hype technology that is going to peter out without leaving a major impact on society" as kindly as I can phrase it, what on earth are you smoking, or are you Amish?
To be fair, GP compared the LLM hype with crypto... and however niche crypto may end up being in the future, I don't think it will "peter out without leaving a major impact on society" either...
Crypto will not be niche in the future. Bitcoin has a market cap of $1,349,125,982,176 (https://coinmarketcap.com/currencies/bitcoin/), recently surpassed the market cap of gold itself, has large investors from BlackRock to Apple, and recently got approved for ETFs, which means the general population's 401(k)s will have bitcoin in them starting this year. Other cryptocurrencies may die out, but bitcoin is here permanently.
If you haven't yet, buy some bitcoin and stop letting the government steal your hard earned money through inflation.
Grayskull is "inference only", but Wormhole will also "support" training.
I put those in quotes because theoretically both can do training; it's just that Grayskull isn't well suited for it, because of the internal FP19 format and the lack of good support for scaling to multiple cards. Wormhole will have internal FP32 and is designed to scale out across multiple cards and servers.
FP19 is perfectly fine for training - Llama is trained in FP16, many train in BF16, and with microscaling exponents you can do 6-bit training: https://arxiv.org/abs/2310.10537
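For reference, a rough sketch of what a bf16 mixed-precision training step looks like in PyTorch (toy model and data; autocast runs the matmuls in bf16 while the parameters themselves stay fp32):

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 512).cuda()            # fp32 "master" weights
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(32, 512, device="cuda")
    target = torch.randn(32, 512, device="cuda")

    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), target)   # the matmul runs in bf16

    loss.backward()    # gradients land back on the fp32 parameters
    opt.step()
    opt.zero_grad()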
I think the broader context here is that it's "fine for training" in the sense that you can successfully train a model, but it's not "fine for training" in the sense that it can only train small models due to the lack of scaling across cards, which directly cuts against where ML has been trending over the last several years.
In LLM-land we've rapidly gone from training bespoke models to doing fine tuning to RLHF to zero-shot prompting. The better the underlying model the more you can do without additional training, so hardware that fails to scale up to the largest training runs will have limited practical utility despite technically supporting training.
Wormhole does support scale-out though, via its built-in networking? And there seem to be links for something on top of the board, NVLink-style.
And yes, I know - I've been working on LLM pretraining for about 4 years now, since 2020. The number formats themselves have so far been mostly scale-invariant, or improve with larger scale - you can quantize a larger model and see less of a performance drop than with a smaller model.
GPGPU means using a GPU for general-purpose computation. The aforementioned IC is not a GPU being used for general-purpose computation, but a specialized chip for AI computations.
Been following this for a while b/c Jim Keller, but every time I look at the arch [1; as linked by another commenter], as someone who doesn't know the first thing about CPU/ASIC design, it just looks sort of... "wacky"? Does anyone who understands this have a good explainer for the rationale behind having a grid of cores with memory and interfaces (IFs) interspersed between them, and then something akin to a network interconnecting them in that topology? What is it about the target workloads here that makes this a good approach?
Sounds like a manycore architecture. If you have played TIS-100, it's that exact same idea. If you haven't, but have played Factorio: instead of having one central area where all the work happens, you build a series of interconnected stations, each doing its own part of the calculation before passing it on to the next.
The upside is that each core has its own code, is fully Turing-complete, and is independent of the others. You can handle conditionals much better, and you avoid the latency of network hops between workers.
The downside is that you need to break your process down to map onto specific nodes and flows.
(Assuming it is in fact manycore - which is not the same as multicore)
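If it is, here's a toy illustration of the "stations passing work along" idea, with ordinary Python threads and queues standing in for cores and NoC links (nothing Tenstorrent-specific):

    import threading, queue

    def stage(fn, inbox, outbox):
        # Each "core" runs its own little loop on its own step of the pipeline.
        while (item := inbox.get()) is not None:
            outbox.put(fn(item))
        outbox.put(None)  # pass the shutdown signal downstream

    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    for fn, a, b in [(lambda x: x * 2, q0, q1), (lambda x: x + 1, q1, q2)]:
        threading.Thread(target=stage, args=(fn, a, b)).start()

    for x in range(5):
        q0.put(x)
    q0.put(None)
    print([q2.get() for _ in range(5)])  # [1, 3, 5, 7, 9]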
Sounds kind of like the IBM/Sony/Toshiba Cell? It made an appearance in the PlayStation 3, but was supposed to be a more general high-performance architecture. At some point IBM sold blade servers with Cell processors.
Because you can have a core set up for each branch and just pass to that vs. “context-switching” your core to execute the branch that ends up being taken?
> Because you can have a core set up for each branch
and
> because each core has its own instruction counter
You can have each core be a tighter loop because you can limit the types of cases it handles.
But also, since each core has its own program counter, you can take all sorts of branches in a way that you can't with SIMD - because in SIMD, as the name implies, you get only a Single Instruction to apply to Multiple Data. So if you need to change the behaviour depending on the exact value of the data, you are better off using separate cores than wider vectors.
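Roughly the difference, with NumPy standing in for the SIMD case: a single instruction stream has to evaluate both sides of a branch and select with a mask, while an independent core just branches.

    import numpy as np

    data = np.array([3.0, -1.0, 4.0, -2.0])

    # SIMD-style: both branches are computed for every lane, then masked.
    simd_result = np.where(data > 0, np.sqrt(np.abs(data)), data * 10.0)

    # Independent-core style: each "core" has its own program counter and
    # only executes the branch its own data actually takes.
    percore_result = [x**0.5 if x > 0 else x * 10.0 for x in data]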
This reminds me very much of transputers. The idea here is that each CPU can context-switch extremely quickly, with minimal latency to any resource, and as a result you have a topology that is great for matrix maths.
What makes the resulting topology great for matrix math, vs. non-matrix-math workloads? Naively, if you know you're "only" going to multiply matrices, what do you need the flexibility and fast context-switching for? Is the end game here that you can lay out the workload s.t. you have a series of closely colocated cores carrying out the operations of some linalg expression one after the other, with the memory for intermediate results right in between, or something like that?
Your cores need to be fully programmable so you can do things like kernel fusion. The simplest form is to load quantized weights and dequantize them to bfloat16 as you go. llama.cpp and its gguf files support various types of quantization, and most of them require programmability to support efficiently.
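Conceptually something like this (a NumPy stand-in, not the actual llama.cpp kernels): the weights sit in memory as int8 blocks plus a per-block scale, and the kernel dequantizes each block right before using it rather than materializing a full float copy of the weight matrix.

    import numpy as np

    def quantize_q8(w, block=32):
        # Per-block absmax int8 quantization, similar in spirit to gguf Q8_0.
        w = w.reshape(-1, block)
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        return np.round(w / scale).astype(np.int8), scale.astype(np.float32)

    def fused_dequant_dot(q, scale, x, block=32):
        # Dequantize one block at a time and accumulate - the full-precision
        # weight vector never exists as a whole array.
        x = x.reshape(-1, block)
        return sum(np.dot(q[i].astype(np.float32) * scale[i], x[i])
                   for i in range(len(q)))

    w = np.random.randn(4096).astype(np.float32)
    x = np.random.randn(4096).astype(np.float32)
    q, s = quantize_q8(w)
    print(fused_dequant_dot(q, s, x), np.dot(w, x))  # should be close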
I suspect that's basically it (operations one by one and then pipelining to saturate). That's basically what Groq does too, AFAIK. From their website it seems the chips are designed to be connected together into one big topology, the "Galaxy" system. Also similar to TPUs, although those use HBM with only a few very powerful "cores", vs. DRAM with low-powered cores.
It's another iteration of the unit record machine [0] - batch processes done with a physical arrangement of the steps in the process.
CPU design moved away from this analogy a long while ago because the tasks being done with CPUs involved more dynamic control flow structures and arbitrary workloads. But workloads that are linear batches of brute force math don't need that kind of dynamism, so gridded designs become fashionable as a way of expressing a configurable pipeline with clear semantics - everything on the grid is at a linear, integer-scalable distance, buffers will be of the same size, etc.
Definitely looks wacky!
Has a nice concept though; I like the "Network On Chip" with its two counter-rotating tori.
Hopefully some of y'all tinkerers with time and dough can bring some of these ideas to fruition, keep Nvidia on their toes ;)
64GB is a good RAM amount IMO - cheap, yet still vastly underutilized since we play to the lowest common denominator of users... guessing Linux will be able to make that pivot much faster, given how little baggage it has.
But isn't that aspect common among this, CPUs, GPUs, ...? And it feels like the whole NoC thing would add quite some overhead to moving things around.
Are you saying proximity here more than offsets this vs. e.g. each core having its own cache as I think they do in a "normal" CPU? And if so, is this more true of ML inference workloads than other workloads, for some reason?
I think the distinction with ML inference workloads is that you often have very little control flow, so this type of architecture lets you map layers to adjacent cores so that each operation gets its data directly from the step before rather than from RAM.
They still use this today, and the ring interconnect is also the topology inside zen3/zen4 CCXs (with 8 cores). Ring is one of the simplest and best systems for >4 cores, until you get to about 8-10 cores (at which point you generally split it into multiple “tiers” like multiple CCXs etc).
I honestly don't understand this latency obsession for LLMs. You are loading millions of parameters sequentially for each matrix; the access pattern is perfectly predictable. I just ran llama.cpp with profiling, and 99.9% of the time is spent in matrix multiplication. This shocked me, honestly, because I genuinely thought there was going to be much more variety.
The real question is how they plan to compete with, say, a Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU. 2x DDR5-6600 gives you roughly as much memory bandwidth as Grayskull. Their primary advantage appears to be the large SRAM and not much else.
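The bandwidth comparison really is close (Grayskull figures from the spec list elsewhere in the thread; the DDR5 figure assumes two 64-bit channels):

    # Dual-channel DDR5-6600: 2 channels x 8 bytes x 6600 MT/s
    ddr5 = 2 * 8 * 6600 / 1000                 # ~105.6 GB/s
    grayskull = {"e75": 102.4, "e150": 118.4}  # GB/s LPDDR4, per the spec sheet
    # 105.6 GB/s lands between the two Grayskull cards.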
In general putting memory physically close to compute is good. If two cores need to share that memory doesn't it make sense to place the memory at the interface?
Since people seem to be wondering how the architecture works, and the software stack is open (although I'm not sure to what extent), I'll share my understanding of it.
The basic system consists of cards with a bunch of Tensix cores and shared memory:
> Each Tensix core contains a high-density tensor math unit (FPU) which performs most of the heavy lifting, a SIMD engine (SFPU), five Risc-V CPU cores, and a large local memory storage. [0]
> The cores are connected with two torus-shaped networks, going in opposite directions. [0]
The RISC-V cores in the Tensix cores are tiny rv32i cores that can control the FPU and SFPU, and are also used to prepare/move data.
The FPU does "dense tensor math", so I think it's probably a matmul engine, but I don't know any more specifics. [1]
The SFPU is a more general purpose SIMT engine, that can be driven from the RISC-V cores.
There is an SFPU simulator you can play around with on their GitHub. [2] See the low-level kernels example for how the programming model works. [3]
The grayskull SFPU has 4 general purpose LRegs, which each hold 64 19-bit values. Wormhole has 8 general purpose LRegs, which each hold 32 32-bit values.
I've been told that wormhole SFPU has a ~3x IPC increase over grayskull, and a few new SFPU instructions.
You can probably find out more by browsing the docs and rummaging through the github repos. [4]
Yes, the RISC-V Vector extension would be absolutely overkill for an ML accelerator, and waste a lot of die space.
Currently the cards need an x86_64 host, but they plan to replace that with their Ascalon RISC-V CPU, which supports the RISC-V vector extension (RVV).
So most compute is done by the accelerator cards, but for some things the host system steps in. That's where RVV can come in handy, because it's a lot more flexible.
It sounds very cool. I've been following them since watching a couple interviews with Jim Keller, and then seeing they're (in part) a Toronto area local company. I don't work in machine-learning but find the HW direction really interesting. I could think of a few uses for that kind of high bandwidth vector/matrix crunching that aren't restricted to ML purposes.
> "...including BERT for natural language processing tasks, ResNet for image recognition, Whisper for speech recognition and translation, YOLOv5 for real-time object detection, and U-Net for image segmentation."
I wonder why they are starting with these models. One could speculate that they are going for power efficiency, but that doesn't quite add up entirely.
From what I've gathered from the Tenstorrent discord:
The Grayskull cards only have 8 GB of RAM and don't have enough memory-to-NoC bandwidth to make working with multiple cards practical. The next generations, i.e. Wormhole and newer, don't have these limitations and are specifically designed to work in server racks - see the Galaxy system. [0]
Grayskull really is only a devboard. It has some other quirks that will be improved in Wormhole, like the native 19-bit floats in its SIMT engines, which become 32-bit in Wormhole.
Highly speculative, and I suspect that I don't do these models justice:
The purpose of these boards seems to be to get people acquainted with their programming model. Not as in using board and model as a turnkey solution (if that happens, fine, but this is not the goal), but as in getting potential customers for future boards to learn how to make their own models, or third-party models, run on the board. The more models they supported out of the box, the less well the goal of building buyer-side expertise in the programming model would be served.
Those models do seem to be the best currently available (IMO), so yeah, power efficiency for guaranteed common use cases should be a day-0 integration.
As I see it: TT-Metalium [1] is the low-level API.
"TT-Metalium is a platform for programming heterogeneous collection of CPUs (host processors) and Tenstorrent acceleration devices, composed of many RISC-V processors. Target users of TT-Metal are expert parallel programmers wanting to write parallel and efficient code with full access to the Tensix hardware via low-level kernel APIs."
"The software stacks come in two varieties – a high level and a low level. The high-level is called TT-Buda, using higher-level APIs to get things up and running, along with interfaces into modern machine learning frameworks. The lower level is TT-Metalium, which provides fine-grained control over the hardware for custom operators, custom control, and even non-machine learning code. Tenstorrent states that there are no black boxes, no encrypted APIs, and no hidden functions."https://morethanmoore.substack.com/p/unboxing-the-tenstorren...
The System Requirements on this page say 64GB RAM is a prerequisite on the host system. Why? Wouldn’t an inference server be barebones aside from the Inference hardware?
> It has more to do with the memory resources required during model compilation. The requirement will vary from model to model, so 64GB is a safe limit for all.
I was unable to find any reference to performance or architectural details. Memory bandwidth is very low for an ML-focused device. Price is extremely high.
I assume they are sold to hardware manufacturers and maybe software developers so that they can check out and test their systems against a real processor. The processors are probably manufactured in some relatively old process technology.
The real thing will be manufactured using a different (2nm) process according to the article, and performance will be different.
Did Larrabee contain anything that wasn't an x86 core? According to other comments, this seems to consist of massive, highly specialized processing units with just a few tiny CPUs sprinkled in to keep the former fed. Quite the opposite of how I remember Larrabee.
I wonder, with the recent research coming to light about the effectiveness of using three-state (ternary) weights, do they plan to release something along those lines?
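If that's the BitNet-b1.58 line of work, the quantizer itself is tiny - a rough NumPy sketch of the absmean ternary scheme as I understand it from the paper (nothing Tenstorrent-specific):

    import numpy as np

    def ternarize(w, eps=1e-8):
        # Scale by the mean absolute value, then round and clip to {-1, 0, +1}.
        gamma = np.abs(w).mean() + eps
        return np.clip(np.round(w / gamma), -1, 1).astype(np.int8), gamma

    w = np.random.randn(4, 4).astype(np.float32)
    q, gamma = ternarize(w)
    w_hat = q * gamma  # dequantized approximation used in the matmul

The hard part is hardware that can actually exploit the ~1.58-bit storage and the multiplication-free matmul, which is where a question like this gets interesting.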
Can they really compete in FLOPS/$ with the likes of Nvidia, even if it is a more bespoke architecture than a modified GPU?
Nvidia has insane margins so it isn't hard* for a well-funded player to beat them in FLOPS/$. The hard part is the compiler toolchains, the libraries, the documentation, getting models to run fast.
* Everything is relative. It's hard to make even a simple microcontroller - but also it isn't hard, you know?
Nobody cares about FLOPS anymore when it comes to transformer-based architectures, and considering the transformer architecture is applicable to almost any use case beyond LLMs, memory bandwidth and memory capacity are the most important metrics. Grayskull has enough 16-bit float performance to outrun its memory bandwidth, even with ternary quantization.
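A quick sanity check on why bandwidth is the ceiling for single-stream generation (every weight gets streamed from DRAM roughly once per generated token):

    # Bandwidth-bound upper limit on single-batch tokens/sec.
    bandwidth_gbs = 118.4      # Grayskull e150 spec
    model_bytes = 7e9 * 0.5    # hypothetical 7B model at ~4-bit weights
    print(bandwidth_gbs * 1e9 / model_bytes)  # ~34 tokens/s ceiling

Batching amortizes the weight reads, which is when compute starts to matter again - but for a single user on one card, the DRAM number is most of the story.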
This looks very similar to the Picochip designs used in a lot of small cellular base stations for SDR. I hope it is similar, because those were fantastic chips to program for. That influence could have come via Intel's acquisition of Picochip.
GDDR6, 14 GT/s if we want to be accurate. So about double the DRAM bandwidth, but less than 1/20th the SRAM capacity, and fairly different architectures on-chip. I'm not sure what ML workloads would benefit from the extra SRAM enough to overcome the DRAM bandwidth deficit, but they probably exist or Tenstorrent wouldn't be making such a SRAM-heavy chip.
I'm getting sick of this: Groq, Tenstorrent - all the promising startups are offering inference-only solutions. I have it from Groq's official channels that they do not plan to invest development time into enabling training, because inference is where the big money is. Which I understand insofar as inference demand probably outstrips training demand by ~millions, but I still can't help finding all of this so egregiously disappointing!
- Grayskull e75 | drawing 75W | 96 Tensix cores | 1 GHz clock | 96 MB SRAM | 8GB LPDDR4 @ 102.4 GB/s | $599
- Grayskull e150 | drawing 200W | 120 Tensix cores | 1.2 GHz clock | 120 MB SRAM | 8GB LPDDR4 @ 118.4 GB/s | $799
It will be interesting to see their inference performance, compared to graphics cards. Will they be interesting for home labs?
I found one interview with an unboxing of a preview version (if I understood correctly), with some background info but no performance numbers: https://morethanmoore.substack.com/p/unboxing-the-tenstorren...