Tenstorrent unveils Grayskull, its RISC-V answer to GPUs (techradar.com)
278 points by Brajeshwar 11 months ago | 126 comments



Dev kits:

- Grayskull e75 | drawing 75W | 96 Tensix cores | 1 GHz clock | 96 MB SRAM | 8GB LPDDR4 @ 102.4 GB/s | $599

- Grayskull e150 | drawing 200W | 120 Tensix cores | 1.2 GHz clock | 120 MB SRAM | 8GB LPDDR4 @ 118.4 GB/s | $799

It will be interesting to see their inference performance, compared to graphics cards. Will they be interesting for home labs?

I found one interview with an unboxing of a preview version (if I understood correctly), with some background info, but no performance numbers: https://morethanmoore.substack.com/p/unboxing-the-tenstorren...


Memory seems very low for inference on bigger models? Memory bandwidth is also massively lower than something like a 4090 or 7900 XTX. Am I missing some fairy-dust magic here?


  This is purely a developer kit, letting people get used to the hardware configuration and the software stacks before the big hardware comes later in future generations.
I'd like to know what different precisions/quantizations, if any, are supported. For LLMs, 8GB is fine for playing around if the weights are quantized. And as the article mentions, it's more than enough for lots of computer vision models.
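
To put rough numbers on the 8GB point, a back-of-the-envelope estimate of weight memory at different quantization levels (my own arithmetic, ignoring KV cache and activations):

  # Weight memory only; KV cache, activations and runtime overhead come on top.
  def weight_gb(params_billion, bits_per_weight):
      return params_billion * 1e9 * bits_per_weight / 8 / 1e9

  for bits in (16, 8, 4):
      print(f"7B model @ {bits}-bit: {weight_gb(7, bits):.1f} GB")
  # 16-bit: 14.0 GB (won't fit), 8-bit: 7.0 GB (tight), 4-bit: 3.5 GB (comfortable)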


Grayskull supports FP8, FP16, BFP16, BFP2, BFP4, BFP8 [0], and some sort of 19-bit floating point in the SFPU. [1]

I don't know how and with what performance those are supported.

FP4 has 221 TFLOPS on the Grayskull e75, and 332 TFLOPS on the Grayskull e150. [0]

While wormhole has 82 INT8 TOPS per card. [2]

I haven't found any other numbers.

Edit: Some more info about data formats: https://docs.tenstorrent.com/tenstorrent/v/tt-buda/dataforma...

[0] https://docs.tenstorrent.com/tenstorrent/add-in-boards-and-c...

[1] https://tenstorrent-metal.github.io/tt-metal/latest/tt_metal...

[2] https://tenstorrent.com/systems/galaxy/
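
For what it's worth, my understanding is that the BFP formats are block floating point: a group of mantissas shares a single exponent, which is how you get down to 4 or even 2 bits per value. A rough sketch of the idea (illustrative only, not Tenstorrent's actual bit packing):

  import numpy as np

  # Toy block-float quantizer: one shared exponent per block of 16 values,
  # each value stored as a small signed mantissa.
  def bfp_roundtrip(x, mantissa_bits=4, block=16):
      x = x.reshape(-1, block)
      shared_exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
      scale = 2.0 ** shared_exp
      qmax = 2 ** (mantissa_bits - 1) - 1            # e.g. 7 for 4-bit mantissas
      mant = np.clip(np.round(x / scale * qmax), -qmax, qmax)
      return (mant * scale / qmax).reshape(-1)       # dequantized approximation

  x = np.random.randn(64).astype(np.float32)
  print("mean abs error at 4-bit mantissas:", np.abs(bfp_roundtrip(x) - x).mean())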


That's a lot of formats for FP8, FP4, and even FP2! (never heard of). I wonder how they did it. Nice.


Cool, thank you!


The website indicates each Tensix core is ~1 TOPS theoretical. So not amazing.

It will have to compete on price/memory in the crowded inference space.


This card isn't competing on price - it's a dev kit.


Maybe power usage?


Why are they so skimpy on memory capacity? I guess this is a dev kit for anything but LLMs?


Grayskull was made before LLMs were a thing. Their plan is like Groq's: distribute the compute graph across multiple processors to get higher effective memory and throughput by pipelining, but better-ish by having RAM so you can fit models on far fewer cards. Grayskull doesn't have this ability; the next generation, Wormhole, does, via 100GbE interfaces on the cards.

Also, the CPUs on Grayskull are 32-bit. Memory is addressed through the bank address so it works for now, but they'll have to upgrade to 64-bit soon.
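
A toy sketch of what "distribute the compute graph across chips by pipelining" means (hypothetical placement, nothing Tenstorrent-specific): each chip keeps its slice of the layers resident and only activations cross chip boundaries, so with several requests in flight chip 0 can start request k+1 while chip 1 is still on request k.

  # Split 24 layers across 4 "chips"; weights never move, activations do.
  layers = [f"layer_{i}" for i in range(24)]
  n_chips = 4
  per_chip = len(layers) // n_chips
  placement = {c: layers[c * per_chip:(c + 1) * per_chip] for c in range(n_chips)}

  def forward(x):
      for chip in range(n_chips):
          for layer in placement[chip]:
              x = f"{layer}({x})"     # stand-in for the real on-chip compute
      return x

  print(forward("tokens"))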


What's the memory interface like? Is it an extreme NUMA thing with several orders of magnitude greater potential for parallelization?

I'm probably dreaming; new CPU core designs are one thing, a completely new memory design is likely high fantasy..


I mean, it says LPDDR4; nothing about how many channels, but I would be surprised if it's anything more than dual-channel. Big cache, but nothing not already seen in AMD's X3D chips.


The quoted ~100-120 GB/s is going to be quad-channel LPDDR4


Will graphics cards eventually get renamed?

Are the optimizations for inference basically identical to graphics?

I imagine the workloads are different, and that inference has been the main market for years now...


The new things are already getting renamed, but the dust hasn't settled yet.

NPU: https://en.wikipedia.org/wiki/AI_accelerator

TPU: https://en.wikipedia.org/wiki/Tensor_Processing_Unit


why bother renaming? just re-acronym to "greatly parallel unit"


These specifications rather look like some AI acceleration ICs, and not like GPUs (recall that "GPU" stands for Graphics Processing Unit).


That’s why the article characterizes them as an “alternative to GPUs […] for AI development“.

That being said, it’s a Grayskull Processing Unit. ;)


I wonder, is the dual-use of GPUs a significant engineering bottleneck for Nvidia et al? My understanding is that for libraries like Tensorflow and PyTorch, much of the complexity is supporting the many-to-many mapping of model structures to GPU silicon in a way that reduces “compile time”, with all the tradeoff landmines you might expect there. I would imagine that abstracting this at the hardware layer is really valuable. Are these non-GPU architectures the future of machine learning, and can we expect the big vendors to start competing here or is this a business risk for them? (robbing Peter to pay Paul, like Google investing in AI that cannibalizes their search revenue)


I doubt it. Toolkits like TensorRT are specifically optimised, and NVIDIA has already segregated their HPC vs gaming dies (and the corresponding die space).

The H100 features a minuscule number of ROPs and TMUs (mainly for graphics) and doesn't come with NVENC/DEC, etc. DirectX and Vulkan are not supported.

You’ll just see more and more tensor cores, optimised CUDA cores for bfloat, etc.


H100 does have NVDEC and jpeg ingest accelerators

https://www.servethehome.com/wp-content/uploads/2023/10/NVID...

Tbh it’s mildly surprising they even removed NVENC considering the overall size of the chip (in absolute terms we are only talking about low-single-digit mm² savings), and the H100 is still advertised with features targeting VM graphics/visualization… remember they also still put a full graphics pipeline with ROPs/TMUs on the chip, just no actual display hardware.


It's a tradeoff with verification and validation, not so much die-space. If you include the IP, you need to run an expensive (time-consuming) simulation and emulation for the IP. Then, you need to validate it on first silicon and pray that it doesn't impact yield in HVM.

These days, simulation runs basically nonstop on all the FPGAs the company is able to throw at it until tape out. Then, in post-silicon validation, you have engineering teams working 24/7 to validate the device on the bench. Finally, test engineering tries to achieve yield in HVM.

Anything that isn't needed on the spec sheet, or that the designer doesn't insist on, is cut to save time to market.


These can actually also be used for machine learning (see DALI for data loading, for instance)


> I wonder, is the dual-use of GPUs a significant engineering bottleneck for Nvidia et al?

Not much. GPUs became array processing engines in the early 2000s and the amount of residual fixed function graphics hardware is tiny now.

The biggest thing is that a lot of network evaluation is being done at lower precision (BF16/FP8/ternary) now, but my understanding is that training the networks still requires FP32. As these companies are really targeting the training market, the high-precision modes needed for graphics can't be thrown away. If a company were to target the end user, then you could optimise for lower precision, but end users find dedicated neural-net acceleration a luxury they can get for free by buying a GPU.
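
In PyTorch terms the split looks roughly like this: inference can run entirely in bfloat16, while mixed-precision training keeps FP32 parameters and only autocasts the forward/backward math, which is why the high-precision units can't simply be dropped (a minimal sketch, not anyone's production recipe):

  import torch

  # Inference: weights and activations can live entirely in bfloat16.
  infer_model = torch.nn.Linear(1024, 1024).to(torch.bfloat16)
  y = infer_model(torch.randn(8, 1024, dtype=torch.bfloat16))

  # Training: parameters and optimizer state stay FP32; only the
  # forward/backward math runs in bfloat16 under autocast.
  model = torch.nn.Linear(1024, 1024)
  opt = torch.optim.SGD(model.parameters(), lr=1e-3)
  with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
      loss = model(torch.randn(8, 1024)).square().mean()
  loss.backward()
  opt.step()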


The dual-use of GPUs is what's driving Nvidia's demand, for the most part. They bet big on more outlandish designs (eg. CUDA/Tensor cores) and mixed-precision computation, and did a lot of the hardware/software work to get it shipped out years ago. There wasn't much interest in CUDA outside niche research/industry applications for a while, it's a small miracle they kept it around long enough to see the crypto and AI booms. Now their bet is paying big dividends, and other companies are trying to figure out fast ways to leverage raster compute for general-purpose acceleration.

Apple initially developed OpenCL with Khronos as a similar-ish analog for GPU acceleration a while ago. There were a few partners that invested in it, but it suffered the same lack of demand as CUDA and languished for a while. Now Apple doesn't support OpenCL, so the purpose of a multi-platform acceleration library has kinda been scuttled. Nvidia played their cards well, and their competitors are going to feel the pain for a while unless they work together again.


Sure, no one actually cares about the graphics part anymore in these contexts. Many Nvidia Datacenter GPUs are also no longer GPUs in that sense.


I know it isn’t what you mean, but as an aside (rather than a contradiction), the gaming market is still much bigger than the ML model training one, right? So presumably people care about the graphics part in the sense that development there funds the development of the cards, haha.

I always assumed that was what killed Xeon Phi. Hard to justify a GEMM card in and of itself. However clever you get, NVIDIA will be back next year with twice as much bandwidth.


According to my googling, data center has been bigger than gaming for Nvidia since 2022, and almost 2x in Q2 2023 following a crazy nosedive in gaming sales.


Woah, that’s really something


Nvidia burnt quite a lot of goodwill with the 40xx series cards and pricing.

Melting connectors, obscene pricing, marginal performance increase, zero availability; then, when hardly any of that was addressed, Ti versions released which are equally expensive and equally lacklustre. Combine that with patchy support for frame generation and 30xx series cards still going really strong, and you’ve got not a lot of incentive to upgrade.


In dollars, for sure.

In units sold?


I think dollars was the right figure to defeat my point. My thought was that they would be impossible for their more datacenter type competition to beat, because their R&D budget came from a much larger gaming segment. But, if they are around the same size, or the data center type users actually pay more, it is a level playing field.

I’m sure they don’t mind having both revenue streams of course! But on the other hand, they have to serve two sets of interests.


No, datacenter GPU revenue has surpassed gaming revenue some time ago. At least for Nvidia and I fully expect this trend to continue.


I'll bet against that. NVIDIA's margin is so large that Google/Microsoft and others can fund / are funding large teams to make TPUs and other dedicated hardware competitive. I also believe LLMs are part of the hype cycle just like crypto. In ten years the gaming market will return to being NVIDIA's primary market.


A couple factors to consider:

1. The hardware isn't nVidia's primary moat. CUDA is. Currently all machine learning tooling assumes you have a CUDA device.

2. The hype cycle may be real, but nobody really knows how much better AI will become. The uncertainty is driving the hype (which is obviously on the optimistic side), but it could well be that models become so good that everyone "needs" an AI chip like how we "need" a smartphone today.

3. The gaming/graphics market is probably going to slowly shrink. There's only so many pixels you can push to the eye. There's very real diminishing returns there. At some point people literally won't be able to tell the difference between the newer/better/faster graphics and the older generation, at which point the GPU vendor's margins drop significantly.


> 1. The hardware isn't nVidia's primary moat. CUDA is.

People really don't appreciate this. So many times I see people comment that AMD should implement CUDA rather than try to get software to move to a standard.

AMD have a massive task ahead because they NEED to get CUDA unseated just to play on a level playing field, let alone win.


If by "LLMs are part of the hype cycle" you mean "people are hyping them more than their current capabilities entirely warrant" maybe. If you mean "LLMs are a hype technology that is going to peter out without leaving a major impact on society" as kindly as I can phrase it, what on earth are you smoking, or are you Amish?


To be fair, GP compared the LLM hype with crypto... And as niche crypto may be in the future, I don't think it will also "peter out without leaving a major impact on society"...


Crypto will not be niche in the future. Bitcoin has a market cap of $1,349,125,982,176 (https://coinmarketcap.com/currencies/bitcoin/), recently surpassed the market cap of gold itself, has large investors from BlackRock to Apple, and recently got approved for ETFs, which means the general population's 401(k)s will have bitcoin in them starting this year. Other cryptocurrencies may die out, but bitcoin is here permanently.

If you haven't yet, buy some bitcoin and stop letting the government steal your hard-earned money through inflation.


Many of them don't even have video out ports, they are purely for the compute power.


The hardware is still on the chip.


Yep just need to load the model. It can and probably should be headless.


Obviously they are graphicless processing units.


They are "inferrence only" according to the article


grayskull is "inference only", but wormhole will also "support" inference.

I put those in quotes, because theoretically both can do training, it's just that grayskull isn't well suited for it, because of the internal fp19 format, and no good support for scaling to multiple cards. wormhole will have internal fp32, and is designed to scale out with across multiple cards, and servers.

At least that's how I understood it.


FP19 is perfectly fine for training: Llama is trained in fp16, many train in bf16, and with microscaling exponents you can do 6-bit training: https://arxiv.org/abs/2310.10537


I think the broader context here is that it's "fine for training" in the sense that you can successfully train a model, but it's not "fine for training" in the sense that it can only train small models due to the lack of scaling across cards, which directly cuts against where ML has been trending over the last several years.

In LLM-land we've rapidly gone from training bespoke models to doing fine tuning to RLHF to zero-shot prompting. The better the underlying model the more you can do without additional training, so hardware that fails to scale up to the largest training runs will have limited practical utility despite technically supporting training.


Wormhole does support scale-out though, via its built-in networking? And there seem to be links for something on top of the board, NVLink-style.

And yes, I know, I've been working on LLM pretraining for about 4 years now, since 2020. The number formats themselves so far are mostly scale-invariant or improve with larger scale: you can quantize a larger model and see less performance drop than with a smaller model.


Matrix-Math Coprocessors.


8087 - vector edition.


GPGPU is a thing.


> GPGPU is a thing.

Indeed.

GPGPU means "using a GPU for things that are General Purpose (GPGPU)". The aforementioned IC is not a GPU that is used for general-purpose computations, but a specialized chip for AI computations.


Isn't a GPGPU just a PPU?


Been following this for a while b/c Jim Keller, but every time I look at the arch [1; as linked by other commenter] as someone who doesn't know the first thing about CPU/ASIC design it just looks sort of... "wacky"? Does anyone who understands this have a good explainer for the rationale behind having a grid of cores with memory and IFs interspersed between and then something akin to a network interconnecting them with that topology? What is it about the target workloads here that makes this a good approach?

1. https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware


Sounds like a manycore architecture. If you have played TIS-100, it is that exact same idea. If you haven't, but have played Factorio: instead of having a central area where all the work happens, you build a series of interconnected stations, each doing its own part of the calculation before passing it on to the next.

Upside is each core has its own code, is fully Turing complete, and is independent of the others. You can handle conditionals much better. And you lose the latency of network hops between workers.

Downside is you need to break down your process to map onto specific nodes and flows.

(Assuming it is in fact manycore - which is not the same as multicore)
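
If it helps, here's a toy picture of the "stations passing work along" model: each core is its own thread running its own little program, and neighbours talk over queues standing in for the on-chip network (generic Python, nothing vendor-specific):

  import queue, threading

  q01, q12, out = queue.Queue(), queue.Queue(), queue.Queue()

  def core(program, inbox, outbox):
      while True:
          x = inbox.get()
          if x is None:              # poison pill shuts the pipeline down
              outbox.put(None)
              return
          outbox.put(program(x))

  threading.Thread(target=core, args=(lambda x: x * 2, q01, q12)).start()
  threading.Thread(target=core, args=(lambda x: x + 1, q12, out)).start()

  for v in [1, 2, 3, None]:
      q01.put(v)
  while (r := out.get()) is not None:
      print(r)                       # 3, 5, 7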


Sounds kind of like the IBM/Sony/Toshiba Cell? It made an appearance in the PlayStation 3, but was supposed to be a more general high-performance architecture. At some point IBM sold blade servers with Cell processors.



No, the Cell is a many-core architecture with a Power 'general' core and 6-8 'special purpose' vector processing units.


Yes. "Coarse-grain dataflow processor". It's like an FPGA. But LUTs are replaced with RISC-V cores.


> You can handle conditionals much better

Because you can have a core set up for each branch and just pass to that vs. “context-switching” your core to execute the branch that ends up being taken?


Both

> Because you can have a core set up for each branch

and

> because each core has its own instruction counter

You can have each core be a tighter loop because you can limit the types of cases it handles.

But since each core also has its own program counter, you can take all sorts of branches in a way that you can't with SIMD; as the name implies, SIMD gives you only a Single Instruction to deal with Multiple Data. So if you need to change behaviour depending on the exact value of the data, you are better off using separate cores than wider vectors.
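
A concrete toy contrast (numpy standing in for a vector unit): the SIMD path has to evaluate both branches for every lane and blend, while independent cores with their own program counters just take the branch they need.

  import numpy as np

  x = np.random.randn(8)

  # SIMD/SIMT style: both sides get computed for every lane, then masked.
  simd = np.where(x > 0, np.exp(x), np.sin(x))

  # Independent-core style: each element's "program" runs only its own branch.
  per_core = np.array([np.exp(v) if v > 0 else np.sin(v) for v in x])

  assert np.allclose(simd, per_core)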


because each core has its own instruction counter


This reminds me very much of transputers. The idea here is that each CPU can context-switch extremely quickly, with minimal latency to any resource, and you have a topology that is great for matrix maths as a result.


What makes the resulting topology great for matrix math, vs. non-matrix-math workloads? Naively, if you know you're "only" going to multiply matrices, what do you need the flexibility and fast context-switching for? Is the end-game here that you can lay out the workload s.t. you have a series of closely colocated cores carrying out the operations of some linalg expression one after the other, with the memory for intermediate results right in between, or something like that?


Is this some kind of trick question?

Your core needs to be fully programmable so you can do things like kernel fusion. The simplest form is to load quantized weights and dequantize them to bfloat16 as you go. Llama.cpp and its GGUF files support various types of quantization, and most of them require programmability to support efficiently.
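
As a rough illustration of what that programmability buys: a Q8_0-style block (32 int8 values plus one scale) can be dequantized inside the compute loop instead of being expanded to a full bfloat16 copy first. Toy numpy version, not llama.cpp's actual kernel:

  import numpy as np

  BLOCK = 32

  def quantize_q8_0(w):                          # per-block int8 + fp16 scale
      w = w.reshape(-1, BLOCK)
      scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
      return np.round(w / scale).astype(np.int8), scale.astype(np.float16)

  def fused_dot(q, scale, x):
      # Dequantize block by block "as you go"; no full-size float copy exists.
      acc = 0.0
      for blk_q, blk_s, blk_x in zip(q, scale.ravel(), x.reshape(-1, BLOCK)):
          acc += float(blk_s) * np.dot(blk_q.astype(np.float32), blk_x)
      return acc

  w = np.random.randn(4 * BLOCK).astype(np.float32)
  x = np.random.randn(4 * BLOCK).astype(np.float32)
  q, s = quantize_q8_0(w)
  print(fused_dot(q, s, x), np.dot(w, x))        # should be close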


not a trick question, I'm just genuinely ignorant about the topic


I suspect that's basically it (operations one by one and then pipelining to saturate). That's basically what Groq does too, AFAIK. From their website it seems the chips are designed to be connected together into one big topology, the "Galaxy" system. Also similar to TPUs, although those use HBM with only a few very powerful "cores" vs DRAM with low-powered cores.


It's another iteration of the unit record machine [0] - batch processes done with a physical arrangement of the steps in the process.

CPU design moved away from this analogy a long while ago because the tasks being done with CPUs involved more dynamic control flow structures and arbitrary workloads. But workloads that are linear batches of brute force math don't need that kind of dynamism, so gridded designs become fashionable as a way of expressing a configurable pipeline with clear semantics - everything on the grid is at a linear, integer-scalable distance, buffers will be of the same size, etc.

[0] https://en.m.wikipedia.org/wiki/Unit_record_equipment


Definitely looks wacky! It has a nice concept though; I like the "Network On Chip" reversed toruses.

Hopefully some of y'all tinkerers with time and dough can bring some of these ideas to fruition, keep Nvidia on their toes ;)

64gb is a good RAM amount IMO, cheap yet still vastly underutilized since we play to the LCD of users... guessing Linux will be able to make that pivot much faster/so little baggage.

Plus..."Grayskull"


> I like the "Network On Chip" reversed toruses

What about them do you like as a design decision? (genuinely curious, as again, I don't understand it)


> 64gb is a good RAM amount IMO

It doesn’t have 64gb of RAM, it has 8. The system requirements otherwise call for 64gb of host RAM for model compilation.


You want memory to be close to where it's used because at the speeds of high-performance ICs, the latency caused by distance is actually significant.


But isn't that aspect common to this, CPUs, GPUs, ...? And it feels like the whole NoC thing would add quite some overhead to moving things around.

Are you saying proximity here more than offsets this vs. e.g. each core having its own cache, as I think they do in a "normal" CPU? And if so, is this more true of ML inference workloads than other workloads, for some reason?


I think the distinction with ML inference workloads is that you often have very little control flow, so this type of architecture lets you match layers to adjacent cores so that each operation gets its data directly from the step before rather than from RAM.


Right, so latency is less of an issue (because control flow is predictable) and performance becomes about bandwidth.


I think the NOC approach is fundamentally similar to Intel's rings that they used for core interconnect back in the mid-2010s. It works.

https://www.realworldtech.com/includes/images/articles/snbep...

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_...


They still use this today, and the ring interconnect is also the topology inside zen3/zen4 CCXs (with 8 cores). Ring is one of the simplest and best systems for >4 cores, until you get to about 8-10 cores (at which point you generally split it into multiple “tiers” like multiple CCXs etc).



not for consumer chips, which still use the same ringbus design.

e-cores do have a CCX/core cluster, but the clusters themselves go on the ringbus lol


I honestly don't understand this latency obsession for LLMs. You are loading millions of parameters sequentially for each matrix. The access pattern is perfectly predictable. I just ran llama.cpp with profiling and 99.9% of the time is spent in matrix multiplication. This shocked me, honestly, because I genuinely thought there was going to be much more variety.


Seeing the topology I had a flashback to college and the MasPar [1] we were using in ‘92!

[1] https://en.wikipedia.org/wiki/MasPar


The real question is how they plan to compete with, say, a Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU. 2x DDR5-6600 gives you more memory bandwidth than Grayskull. Their primary advantage appears to be the large SRAM and not much else.


In general putting memory physically close to compute is good. If two cores need to share that memory doesn't it make sense to place the memory at the interface?


Since people seem to be wondering how the architecture works, and the software stack is open (although I'm not sure to what extent), I'll share my understanding of it.

The basic system consists of a bunch of Tensix cores and shared memory:

> Each Tensix core contains a high-density tensor math unit (FPU) which performs most of the heavy lifting, a SIMD engine (SFPU), five Risc-V CPU cores, and a large local memory storage. [0]

> The cores are connected with two torus-shaped [networks], going in opposite directions. [0]

The RISC-V cores in the Tensix cores are tiny rv32i cores that can control the FPU and SFPU, and are also used to prepare/move data.

The FPU does "dense tensor math", so I think it's probably a matmul engine, but I don't know any more specifics. [1]

The SFPU is a more general purpose SIMT engine, that can be driven from the RISC-V cores.

There is a SFPU simulator you can play around with on their github. [2] See the low level Kernels example for how the programming model works. [3]

The grayskull SFPU has 4 general purpose LRegs, which each hold 64 19-bit values. Wormhole has 8 general purpose LRegs, which each hold 32 32-bit values.

I've been told that wormhole SFPU has a ~3x IPC increase over grayskull, and a few new SFPU instructions.

You can probably find out more by browsing the docs and rummaging through the github repos. [4]

[0] https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware

[1] https://docs.tenstorrent.com/tenstorrent/v/tt-buda/terminolo...

[2] https://github.com/tenstorrent-metal/sfpi/blob/master/tests/...

[3] https://tenstorrent-metal.github.io/tt-metal/latest/tt_metal...

[4] https://github.com/tenstorrent-metal/


So, modern transputer but RISC-V. Seems neat.


Oh, so am I to read this right that the actual vector work being done here is not done via the RISC-V [V]ector extension, but by a purely custom core?


Yes, the RISC-V Vector extension would be absolute overkill for an ML accelerator and would waste a lot of die space. Currently the cards need an x86_64 host, but they plan to replace that with their Ascalon RISC-V CPU, which supports the RISC-V Vector extension (RVV). So most compute is done by the accelerator cards, but for some things the host system steps in. There RVV can come in handy, because it's a lot more flexible.


It sounds very cool. I've been following them since watching a couple of interviews with Jim Keller, and then seeing they're (in part) a Toronto-area company. I don't work in machine learning but find the HW direction really interesting. I could think of a few uses for that kind of high-bandwidth vector/matrix crunching that aren't restricted to ML purposes.


> "...including BERT for natural language processing tasks, ResNet for image recognition, Whisper for speech recognition and translation, YOLOv5 for real-time object detection, and U-Net for image segmentation."

I wonder why they are starting with these models. One could speculate that they are going for power efficiencies but that does not quite add up entirely.


From what I've gathered from the Tenstorrent discord:

The Grayskull cards only have 8 GB of RAM and don't have enough memory-to-NoC bandwidth to make working with multiple cards practical. The next generations, that is Wormhole and newer, don't have these limitations and are specifically designed to work in server racks, see the Galaxy system. [0]

The Grayskull really is only a devboard. It has some other quirks that will be improved by Wormhole, like the native 19-bit floats in its SIMT engines, where Wormhole moves to 32-bit.

[0] https://tenstorrent.com/systems/galaxy/


Highly speculative, and I suspect that I don't do these models justice:

The purpose of these boards seems to be to get people acquainted with their programming model. Not as in using the board and models as a turnkey solution (if that happens, fine, but this is not the goal), but as in getting potential customers for future boards to learn how to make their own models or third-party models run on the board. The more models they supported out of the box, the less well the goal of building buyer-side expertise in the programming model would be served.


> I wonder why they are starting with these models.

They were building Grayskull in 2020/2021 initially. BERT-Large and similar were the SOTA at that time.


YOLOv5 is pretty widely used, and it can be useful to get efficient compute at the edge.

Disclosure: I work for a Tenstorrent customer.


Like Intel SSE for media optimization probably?

Those models do seem to be the best currently available (IMO), so yeah power efficiencies for guaranteed common use cases should be a 0-day integration.


Which reminds me... Hope those libraries are open and vetted...


Next-gen BERT models are still very popular for embeddings.


Grayskull™ e150 = Tensix cores: 120 * 5 RISC-V => 600 RISC-V CPU cores

( 1 Tensix core = 5 RISC-V core : https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware )


Do you think this thing will run Linux (re: now removed comment about Linux default core limit)?


As I see it, this is: TT-Metalium [1] = a low-level API

"TT-Metalium is a platform for programming heterogeneous collection of CPUs (host processors) and Tenstorrent acceleration devices, composed of many RISC-V processors. Target users of TT-Metal are expert parallel programmers wanting to write parallel and efficient code with full access to the Tensix hardware via low-level kernel APIs."

https://tenstorrent-metal.github.io/tt-metal/latest/tt_metal...

[1]

"The software stacks come in two varieties – a high level and a low level. The high-level is called TT-Buda, using higher-level APIs to get things up and running, along with interfaces into modern machine learning frameworks. The lower level is TT-Metalium, which provides fine-grained control over the hardware for custom operators, custom control, and even non-machine learning code. Tenstorrent states that there are no black boxes, no encrypted APIs, and no hidden functions." https://morethanmoore.substack.com/p/unboxing-the-tenstorren...


I don't think that is as clear as you think it is.

If you had to communicate it orally, what words would you add?

ADDED. I think I figured out what it means: a Grayskull e150 contains 120 'Tensix cores', each of which contains 5 RISC-V cores.


The System Requirements on this page say 64GB RAM is a prerequisite on the host system. Why? Wouldn’t an inference server be barebones aside from the Inference hardware?

https://tenstorrent.com/cards/


From the tenstorrent discord:

> It has more to do with the memory resources required during model compilation. The requirement will vary from model to model, so 64GB is a safe limit for all.


And why do they specify PCIe 4.0?

I imagine it's just to leverage bandwidth correctly? Or is there a necessary feature in there?


I was unable to find any reference to performance or architectural details. Memory bandwidth is very low for an ML-focused device. Price is extremely high.

What am I missing?


Those are DevKits.

I assume they are sold to hardware manufacturers and maybe software developers so that they can check out and test their systems against a real processor. The processors are probably manufactured in some relatively old process technology.

The real thing will be manufactured on a different, 2nm process according to the article, and performance will be different.


Outsider to the field and just curious: Does anyone know how these kinds of processors compare to the custom silicon by aws/google/tesla?


Much more efficient per watt for some specialized types of operations.

Obviously immature ecosystem, expect a couple of years at minimum before adoption if it's the real deal.


The stated architecture reminds me of how the Intel Project Larrabee GPU was supposed to work, except with RISC-V instead of stripped-down x86 cores.


Did Larrabee contain anything that wasn't an x86 core? According to other comments, this seems to consist of massive, highly specialized processing units with just a few tiny CPUs sprinkled in to keep the former fed. Quite the opposite of how I remember Larrabee.


Not really, Xeon Phis were basically x86 CPUs abusing AVX extensions to their limit


Do you see any similarities with the Xeon Phi boards?


I wonder, with the recent research coming to light about the effectiveness of using 3-state weights, whether they plan to release something along those lines?

Can they really compete in FLOPS/$ with the likes of Nvidia, even if it is a more bespoke architecture than a modified GPU?


Nvidia has insane margins so it isn't hard* for a well-funded player to beat them in FLOPS/$. The hard part is the compiler toolchains, the libraries, the documentation, getting models to run fast.

* Everything is relative. It's hard to make even a simple microcontroller - but also it isn't hard, you know?


> Nvidia has insane margins so it isn't hard* for a well-funded player to beat them in FLOPS/$

It's hard when you can't buy TSMC capacity


"Life in 1 comment"


I’d wait till the dust settles. Maybe a 2-bit encoding (-1, 0, 0.5, 1) would be easier to design hardware for, as multiplying by 0.5 is just a bitwise shift.
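
A toy fixed-point sketch of why that set of weights is hardware-friendly: every multiply collapses to zero, copy, negate, or a single right shift (my own illustration, not anyone's proposed design):

  # Multiply a fixed-point activation by a weight from {-1, 0, 0.5, 1}
  # using only negate / zero / copy / arithmetic right shift.
  def mul_by_2bit_weight(act: int, code: int) -> int:
      if code == 0:
          return 0            # weight  0
      if code == 1:
          return act >> 1     # weight  0.5 (one right shift)
      if code == 2:
          return act          # weight  1
      return -act             # weight -1

  print([mul_by_2bit_weight(200, c) for c in range(4)])   # [0, 100, 200, -200]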


Nobody cares about FLOPS anymore when it comes to transformer-based architectures, and considering the transformer architecture is applicable to almost any use case beyond LLMs, memory bandwidth and memory capacity are the most important metrics. Grayskull has enough 16-bit float performance to outrun its memory bandwidth even with ternary quantization.
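
Rough arithmetic behind that, for batch-size-1 decoding where every token has to stream essentially all the weights once (my numbers, not a benchmark):

  # tokens/s ~= memory bandwidth / bytes of weights touched per token
  def tokens_per_s(bandwidth_gb_s, params_billion, bits_per_weight):
      weight_bytes = params_billion * 1e9 * bits_per_weight / 8
      return bandwidth_gb_s * 1e9 / weight_bytes

  for bits in (16, 4, 1.58):                     # fp16, 4-bit, ternary
      print(f"7B @ {bits}-bit on ~118 GB/s: {tokens_per_s(118, 7, bits):.0f} tok/s")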


This looks very similar to the Picochip designs used in a lot of small cellular base stations for SDR. I hope it is similar, because those were fantastic chips to program for. That influence could have come via Intel's acquisition of Picochip.


This article is extremely low detail. Does anyone have a source with more information?


docs.tenstorrent.com

All the details you need!


How is the networked multi CPU core Grayskull a different approach compared to, say, Ampere's 100+ ARM core chip?


I bought a 3050 low profile / half length with DDR6 at 14GHz for half this price.

And it can play games too.


GDDR6, 14 GT/s if we want to be accurate. So about double the DRAM bandwidth, but less than 1/20th the SRAM capacity, and fairly different architectures on-chip. I'm not sure what ML workloads would benefit from the extra SRAM enough to overcome the DRAM bandwidth deficit, but they probably exist or Tenstorrent wouldn't be making such a SRAM-heavy chip.
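
For the bandwidth comparison, assuming the usual 128-bit bus on a 3050:

  14 GT/s x 128 bit / 8 = 224 GB/s, vs. 102.4-118.4 GB/s of LPDDR4 on the Grayskull cards.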


how does 8gb cut it?


I'm getting sick of this: Groq, Tenstorrent, all the promising startups are offering inference-only solutions. I have it from official Groq channels that they do not plan to invest development time into enabling training, because inference is where the big money is. Which I understand insofar as inference demand probably outstrips training demand by ~millions, but I still can't help finding all of this so egregiously disappointing!



