GPUs Go Brrr

Animats · 2024-05-12T23:32:01

"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?

From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."

The hardware needs of AI are starting to focus. GPUs, after all, were designed for an entirely different job. They're used for AI because they have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.

Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.

bcatanzaro · 2024-05-13T03:12:26

GPUs have evolved to be AI machines with as little baggage as possible. People have been arguing GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn’t expect is that the GPU would evolve so quickly to be an AI machine.

celrod · 2024-05-13T08:19:51

Bill Dally from Nvidia argues that there is "no gain in building a specialized accelerator", in part because current overhead on top of the arithmetic is in the ballpark of 20% (16% of IMMA and 22% for HMMA units) https://www.youtube.com/watch?v=gofI47kfD28

AnthonyMouse · 2024-05-13T09:18:16

There does seem to be a somewhat obvious advantage: If all it has to do is matrix multiplication and not every other thing a general purpose GPU has to be good at then it costs less to design. So now someone other than Nvidia or AMD can do it, and then very easily distinguish themselves by just sticking a ton of VRAM on it. Which is currently reserved for GPUs that are extraordinarily expensive, even though the extra VRAM doesn't cost a fraction of the price difference between those and an ordinary consumer GPU.

bjornsing · 2024-05-13T09:32:14

Exactly. And that means you not only save the 22% but also a large chunk of the Nvidia margin.

Animats · 2024-05-13T11:00:36

And, sure enough, there's a new AI chip from Intellifusion in China that's supposed to be 90% cheaper. 48 TOPS in int8 training performance for US$140.[1]

[1] https://www.tomshardware.com/tech-industry/artificial-intell...

pfdietz · 2024-05-13T11:08:46

I wonder what the cost of power to run these chips is. If the power cost ends up being large compared to the hardware cost, it could make sense to buy more chips and run them when power is cheap. They could become a large source of dispatchable demand.

pclmulqdq · 2024-05-14T02:03:34

Int8 training has very few applications, and int8 ops generally are very easy to implement. Int8 is a decent inference format, but supposedly doesn't work well for LLMs that need a wide dynamic range.

cma · 2024-05-13T22:37:20

There are other operations for things like normalization in training, which is why most successful custom stuff has focused on inference I think. As architectures changed and needed various different things some custom built training hardware got obsoleted, Keller talked about that affecting Tesla's Dojo and making it less viable (they bought a huge nvidia cluster after it was up). I don't know if TPU ran into this, or they made enough iterations fast enough to keep adding what they needed as they needed it.

WithinReason · 2024-05-13T16:25:37

Designing it is easy and always has been. Programming it is the bottleneck. Otherwise Nvidia wouldn't be in the lead.

markhahn · 2024-05-13T16:45:34

but programming it is "import pytorch" - nothing nvidia-specific there.

the mass press is very impressed by Cuda, but at least if we're talking AI (and this article is, exclusively), it's not the right interface.

and in fact, Nv's lead, if it exists, is because they pushed tensor hardware earlier.

achierius · 2024-05-13T17:14:58

Someone does, in fact, have to implement everything underneath that `import` call, and that work is _very_ hard to do for things that don't closely match Nvidia's SIMT architecture. There's a reason people don't like using dataflow architectures, even though from a pure hardware PoV they're very powerful -- you can't map CUDA's, or Pytorch's, or Tensorflow's model of the world onto them.

WithinReason · 2024-05-13T17:11:53

I'm talking about adding Pytorch support for your special hardware.

Nv's lead is due to them having Pytorch support.

KaoruAoiShiho · 2024-05-13T18:46:53

Eh if you're running in production you'll want something lower level and faster than pytorch.

ericye16 · 2024-05-14T02:39:38

AI models are not all matrix multiplications, and they tend to involve other operations. Also, they change super fast, much faster than hardware cycles, so if your hardware isn't general-purpose enough, the field will move past you and obsolete your hardware before it comes out.

AnthonyMouse · 2024-05-14T03:32:05

AI models are mostly matrix multiplications and have been that way for a few years now, which is longer than a hardware cycle. Moreover, if the structure changes then the hardware changes regardless of whether it's general purpose or not, because then it has to be optimized for the new structure.

Everybody cares about VRAM right now yet you can get a P40 with 24GB for 10% of the price of a 24GB RTX 4090. Why? No tensor cores, the things used for matrix multiplication.

papruapap · 2024-05-13T09:46:39

I really hope we see AI-PU (or with some other name, INT16PU, why not) for the consumer market sometime soon. Or been able to expand GPU memory using a pcie socket (not sure if technically possible).

PeterisP · 2024-05-13T10:16:52

The while point of GPU memory is that it's faster to access than going to memory (like your main RAM) through the PCIe bottleneck.

throwaway4aday · 2024-05-13T13:09:36

My uninformed question about this is why can't we make the VRAM on GPUs expandable? I know that you need to avoid having the data traverse some kind of bus that trades overhead for wide compatibility like PCIe but if you only want to use it for more RAM then can't you just add more sockets whose traces go directly to where they're needed? Even if it's only compatible with a specific type of chip it would seem worthwhile for the customer to buy a base GPU and add on however much VRAM they need. I've heard of people replacing existing RAM chips on their GPUs[0] so why can't this be built in as a socket like motherboards use for RAM and CPUs?

[0] https://www.tomshardware.com/news/16gb-rtx-3070-mod

giobox · 2024-05-13T16:27:55

Expandable VRAM on GPUs has been tried before - the industry just hates it. It's like Apple devices - want more internal storage? Buy a new computer so we can have the fat margins.

The original REV A iMac in late 90s had slotted memory for its ATI card, as one example - shipped with 2mb, could be upgraded to 6mb after the fact with a 4MB SGRAM DIMM. There are also a handful of more recent examples floating around.

While I'm sure there are also packaging advantages to be had by directly soldering memory chips instead of slotting them etc, I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.

Put another way, what's in it for the GPU vendor to offer memory slots? Possibly reduced revenue, if it became industry norm.

Majromax · 2024-05-13T17:37:28

Expansion has to answer one fundamental question: if you're likely to need more X tomorrow, why aren't you just buying it today?

The answer to this question almost has to be "because it will be cheaper to buy it tomorrow." However, GPUs bundle together RAM and compute. If RAM is likely to be cheaper tomorrow, isn't compute also probably going to be cheaper?

If both RAM and compute are likely cheaper tomorrow, then the calculus still probably points towards a wholesale replacement. Why not run/train models twice as quickly alongside the RAM upgrades?

> I strongly suspect the desire to keep buyers upgrading the whole card ($$$) every few years trumps this massively if you are a GPU vendor.

Remember as well that expandable RAM doesn't unlock higher-bandwidth interconnects. If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.

If instead you just need the VRAM and don't care much about bandwidth/latency, then it seems like you'd be better off using unified memory and having system RAM be the ultimate expansion.

AnthonyMouse · 2024-05-13T18:45:32

> The answer to this question almost has to be "because it will be cheaper to buy it tomorrow."

No, it doesn't. It could just as easily be "because I will have more money tomorrow." If faster compute is $300 and more VRAM is $200 and I have $300 today and will have another $200 two years from now, I might very well like to buy the $300 compute unit and enjoy the faster compute for two years before I buy the extra VRAM, instead of waiting until I have $500 to buy both together.

But for something which is already a modular component like a GPU it's mostly irrelevant. If you have $300 now then you buy the $300 GPU, then in two years when you have another $200 you sell the one you have for $200 and buy the one that costs $400, which is the same one that cost $500 two years ago.

This is a much different situation than fully integrated systems because the latter have components that lose value at different rates, or that make sense to upgrade separately. You buy a $1000 tablet and then the battery goes flat and it doesn't have enough RAM, so you want to replace the battery and upgrade the RAM, but you can't. The battery is proprietary and discontinued and the RAM is soldered. So now even though that machine has a satisfactory CPU, storage, chassis, screen and power supply, which is still $700 worth of components, the machine is only worth $150 because nothing is modular and nobody wants it because it doesn't have enough RAM and the battery dies after 10 minutes.

hellofellows · 2024-05-13T18:16:17

hmm seems you're replying as a customer, but not as a GPU vendor...

the thing is, there's not enough competition in the AI-GPU space.

Current only option for no-wasting-time on running some random research project from github? buy some card from nvidia. cuda can run almost anything on github.

AMD gpu cards? that really depends...

and gamers often don't need more than 12?gb of GPU ram for running games on 4k.. so most high-vram customers are on the AI field.

> If you could take the card from five years ago and load it up with 80 GB of VRAM, you'd still not see the memory bandwidth of a newly-bought H100.

this is exactly what nvidia will fight against tooth-and-nail -- if this is possible, its profit margin could be slashed to 1/2 or even 1/8

carbotaniuman · 2024-05-13T14:46:54

Replacing RAM chips on GPUs involves resoldering and similar things - those (for the most part) maintain the signal integrity and performance characteristics of the original RAM. Adding sockets complicates the signal path (iirc), so it's harder for the traces to go where they're needed, and realistically given a trade-off between speed/bandwidth and expandability I think the market goes with the former.

AnthonyMouse · 2024-05-13T18:10:21

The problem with GPUs is they're designed to be saturated.

If you have a CPU and it has however many cores, the amount of memory or memory bandwidth you need to go with that is totally independent, and memory bandwidth is rarely the bottleneck. So you attach a couple memory channels worth of slots on there and people can decide how much memory they want based on whether they intend to have ten thousand browser tabs open or only one thousand. Neither of which will saturate memory bandwidth or depend on how fast the CPU is, so you don't want the amount of memory and the number of CPU cores tied together.

If you have a device for doing matrix multiplications, the amount of RAM you need is going to depend on how big the matrix you want to multiply is, which for AI things is the size of the model. But the bigger the matrix is, the more memory bandwidth and compute units it needs for the same number of tokens/second. So unlike a CPU, there aren't a lot of use cases for matching a small number of compute units with a large amount of memory. It'd be too slow.

Meanwhile the memory isn't all that expensive. For example, right now the spot price for 64GB of GDDR6 is less than $200. Against a $1000 GPU which is fast enough for that much, that's not a big number. Just include it to begin with.

Except that they don't. The high end consumer GPUs are heavy on compute and light on memory. For example, you can get the RTX 4060Ti with 16GB of VRAM. The RTX 4090 has four times as much compute but only 50% more VRAM. There would be plenty of demand for a 4090 that cost $200 more and had four times as much VRAM, only they don't make one because of market segmentation.

Obviously if they don't do that then they're not going to give one you can upgrade. But you don't really want to upgrade just the VRAM anyway, what you want is for the high performance cards to come with that much VRAM to begin with. Which somebody other than Nvidia might soon provide.

PeterisP · 2024-05-13T17:02:48

Technically we definitely can, but are there sufficiently many people willing to pay a sufficiently high premium for that feature? How much more would you be willing to pay for an otherwise identical card that has the option to expand RAM, and do you expect that a significant portion of buyers would want to pay a non-trivial up-front cost for that possibility?

throwaway48476 · 2024-05-13T17:53:53

Its a minor technical challenge with no financial benefit for the GPU makers.

rdsubhas · 2024-05-13T17:51:13

Isn't that what NPUs are technically?

https://en.m.wikipedia.org/wiki/AI_accelerator

hhsectech · 2024-05-13T10:05:54

Isn't this what resizeable BAR and direct storage are for?

dvt · 2024-05-13T00:14:54

> Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.

Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.

[1] https://github.com/hollance/neural-engine/blob/master/docs/a...

talldayo · 2024-05-13T01:29:13

The NPUs on a lot of different systems occupy an awkward spot. For extremely small models, they're the way to go for low-power inference. But once you reach LLM or vision transformer size, it makes a lot more sense to switch to GPU shaders for that extra bit of large-model performance. For stuff like Llama and Stable Diffusion, those Neural Engines are practically wasted silicon. The biggest saving grace is projects like ONNX attempting to sew them into a unified non-15-competing-standards API, but even that won't change how underpowered they are.

Nvidia escapes this by designing their GPU architecture to incorporate NPU concepts at a fundamental level. It's less redundant silicon and enables you to scale a single architecture instead of flip-flopping to whichever one is most convenient.

nxobject · 2024-05-13T06:23:40

It's currently doable for Apple – I think their strategy is to slowly enhance iPhones, bit by bit, with special-purpose models for dealing with media like photo subject identification, OCR (in every language!), voice transcription, etc. Apple's currently learning from Microsoft's attempts to make AI stick everywhere.

everforward · 2024-05-13T15:25:09

I think Apple is more interested in features that work consistently than in giving power users the ability to play with essentially alpha or beta AI features.

I would guess that their strategy is to not include powerful client-side hardware, and supplement that with some kind of "AiCloud" subscription to do the battery-draining, heat-generating stuff on their cloud. They're trading off their branding as a privacy focused company under the (probably correct) belief that people will be more willing to upload their data to iCloud's AI than Microsoft's.

Fwiw, I think they're probably correct. It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.

I don't really want code like that running on my phone; it's a poor platform for it. Thermal dissipation and form factor limit the available processing power, and batteries limit how long you can use the processing power you have. I don't really want to waste either trying to do subject identification locally. I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.

SJC_Hacker · 2024-05-14T18:26:10

>I'm going to upload the photos to iCloud anyways; let me pay an extra $1/month or whatever to have that identification happen in the cloud, on a server built for it that has data center thermal dissipation and is plugged into the wall.

You might not be in area of poor connection and can't connect to the cloud.

One use for AI is speech recognition / transcription for deaf/HoH individuals. Up until now its almost been done exclusively on the cloud and it works fairly well (depending on conditions). Recently there's been an interest in doing it locally without relying on a network connection.

There's also privacy issues with transmitting this data over a network.

unethical_ban · 2024-05-13T18:35:28

> It has always struck me as odd that people want to run AI on their phone. My impression of AI is that it creates very generalized solutions to problems that would be difficult to code, at the cost of being very compute inefficient.

I don't equate AI with coding. I want AI locally for photo sorting and album management, for general questions answering/list making that I use GPT for, and any number of other things.

I try not to upload personal data to sites that aren't E2E encrypted, so iCloud/Google photos is a no-go.

talldayo · 2024-05-13T16:12:39

The pinch (as far as I can see it) is that you're right, and Apple can't sell a freestanding service to save their life. If we do get an AppleGPT pay-as-you-go service, it's certain to be extraordinarily censored and locked-down as the exclusive first-party option on iPhone. It will feature "vertical integration" that no other AI can have, alongside censorship so prudish that it would make Maurey Povich gasp.

So... I think users will be stuck. They'll want to run uncensored models on their phone, but Apple will want to keep them in the walled garden at any cost. It feels like the whole "Fortnite" situation all over again, where users can agree they want something but Apple can't decide.

joquarky · 2024-05-13T06:38:46

Soon our phones will dream beside us every night (integrating new data into our personal model while on the charger)

serialx · 2024-05-13T10:12:31

Well, iPhone already does that with photos. :)

pierrefermat1 · 2024-05-13T14:38:58

Do you have a link to where they breakdown what inference for photos happens in realtime vs overnight/charging?

WhitneyLand · 2024-05-13T15:10:40

Anyone checked out the NPU on the new iPad? It’s supposed to be a bazillion times better according to Apple but I haven’t had a chance to dig into the reality.

I guess we can assume this is going to be what’s used in what’s being called Apple’s first AI phone, iPhone 16.

numpad0 · 2024-05-13T16:05:26

That 38 TOPS figure was a bit weird, it's literally below baseline(45 TOPS) for "AI PC" branding Qualcomm/Intel/Microsoft is launching this June, and also 10x less than typical GPUs. I think it was just a clever marketing exploiting the fact that "AI PC" branding hasn't launched yet.

fassssst · 2024-05-13T15:45:11

It has 38 TOPS of INT8 performance. Not very remarkable compared to consumer Nvidia GPU’s which are like one or two orders of magnitude faster.

talldayo · 2024-05-13T16:07:09

For reference, Nvidia's Jetson Orin NX robotics platform is 35-50 TOPS on average. Apple is catching up, but Nvidia still has by-far the more flexible (and better scaled) platform.

eru · 2024-05-13T02:36:13

And Google has their TPUs.

yosefk · 2024-05-13T04:02:55

For inference, Nvidia has DLA since 2017-ish if I remember correctly, which is completely separate from the GPU.

choppaface · 2024-05-13T01:22:15

“NVidia’s LIES..

On kernels such as flash attention, TMA and the L2 cache are both fast enough so as to hide these problems reasonably well. But to make the full use of the hardware, memory request must be coalesced and bank conflicts avoided ”

The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.

So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.

wiz21c · 2024-05-13T07:03:37

I don't understand. If they document their stuff with errors, it will hurt users, be they chinese or US ? Or is it expected that US users will call Nvidia's to ask for the correct documentation ?

acka · 2024-05-13T08:43:50

It could be a case of classic market segmentation. The lower tier customers get the incomplete or error-ridden documentation, and the upper tier trusted customers^W'partners' get access to the juicy stuff: complete and mostly correct documentation, including stuff intentionally left out of the lower tier package like application notes containing secret hardware handshakes to unlock hidden features, all under strict NDA of course.

SJC_Hacker · 2024-05-14T14:38:19

Kinda like a drug dealer cutting the product to increase profits.

Except for special customers who will pay for the genuine item.

choppaface · 2024-05-13T16:39:14

The vast majority of users use NVidia’s own kernels versus optimize their own. And those who do write custom kernels are typically not trying to compete with NVidia’s own GMM.

FuriouslyAdrift · 2024-05-13T13:28:56

AMD is already in their second generation of of Versal line.

https://www.amd.com/en/products/accelerators/alveo/v80.html

XDNA Architecture

https://www.amd.com/en/technologies/xdna.html

jiveturkey · 2024-05-13T03:05:31

hasn't google been building such devices for a decade now?

yayr · 2024-05-13T07:45:55

yep, and the main engineers have founded groq.com with an architecture that among others precisely solved the memory management issues

mvkel · 2024-05-12T23:59:31

Would you say this is ultimately "ASICs for AI"?

dartos · 2024-05-13T00:14:05

In the same way that CPUs are ASICs for integer operations, that makes sense to me.

saagarjha · 2024-05-13T06:03:38

Most CPUs do just fine on floating point too.

actionfromafar · 2024-05-13T07:28:47

I'm still getting used to that.

dartos · 2024-05-13T11:18:21

Floating point arithmetic _is_ integer arithmetic on the cpu level because of how floating point numbers work.

fwip · 2024-05-13T17:24:18

That's a good point - floating point operations are implemented with integer-math circuits (or at least can be - I'm not privy to how modern chip manufacturers implement them). E.g: your ALU may have an 11-bit adder specifically to add your f64 exponents.

Some slides to get the gist of it: https://users.encs.concordia.ca/~asim/COEN_6501/Lecture_Note...

WanderPanda · 2024-05-13T00:15:38

Wait but nvidia tensor-cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here and I'm sceptical if there is another order of magnitude in performance to be gained from even more specialized designs.

wtallis · 2024-05-13T01:22:01

What's the ratio of tensor cores to regular SIMD compute ("CUDA cores") on NVIDIA's current chips?

creato · 2024-05-13T04:41:27

This is in the article: if you aren't using the tensor cores, you aren't utilizing ~94% of the FLOPs available.

wtallis · 2024-05-13T07:08:13

Knowing what portion of the FLOPs are in the tensor cores isn't quite the right thing to be looking at. The key question is how much more tensor core performance can be gained by reducing or eliminating the dies area devoted to non-tensor compute and higher precision arithmetic. Most of NVIDIA's GPUs are still designed primarily for graphics: they have some fixed function units that can be deleted in an AI-only chip, and a lot of die space devoted to non-tensor compute because the tensor cores don't naturally lend themselves to graphics work (though NVIDIA has spent years coming up with ways to not leave the tensor cores dark during graphics work, most notably DLSS).

So the claims that NVIDIA's GPUs are already thoroughly optimized for AI and that there's no low-hanging fruit for further specialization don't seem too plausible, unless you're only talking about the part of the datacenter lineup that has already had nearly all fixed-function graphics hardware excised. And even for Hopper and Blackwell, there's some fat to be trimmed if you can narrow your requirements.

smallmancontrov · 2024-05-13T14:05:18

Mind the Dark Silicon Fraction.

Some fraction of your transistors MUST go unused on average or you melt the silicon. This was already a thing in the 20nm days and I'm sure it has only gotten worse. 100% TDP utilization might correspond to 60% device utilization.

wtallis · 2024-05-13T14:32:10

That's true for CPUs. Does it really apply to GPUs and other accelerators for embarrassingly parallel problems where going slower but wider is always a valid option?

incrudible · 2024-05-13T10:03:50

There is not a lot of fixed function left in the modern graphics pipeline, economics of scale dictate that there is no net benefit in trimming it.

wtallis · 2024-05-13T14:43:27

And yet, even NVIDIA does trim it from chips like the H100, which has no display outputs, RT cores, or video encoders (though they keep the decoders), and only has ROPs for two of the 72 TPCs.

Sharlin · 2024-05-13T08:21:11

On the H100 specifically. The figure is likely different on consumer cards.

muyuu · 2024-05-13T07:32:29

it's going to be awkward in consumer hardware either way

if you segregate AI units from the GPU, the thing is both AI and GPUs will continue to need massive amounts of matrix multiplication and as little memory latency as possible

the move to have more of it wrapped in the GPU makes sense but at least in the short and medium term, most devices won't be able to justify the gargantuan silicon wafer space/die growth that this would entail - also currently Nvidia's tech is ahead and they don't make state of the art x86 or ARM CPUs

for the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer markets as non-generalist computers - note that more AI-oriented pseudo-GPUs already exist and are successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs"

rfoo · 2024-05-13T08:31:22

> as little memory latency as possible

Should be "as much memory bandwidth as possible". GPUs are designed to be (relatively) more insensitive to memory latency than CPU.

muyuu · 2024-05-13T10:51:53

yep that's true, although AI compute modules do get significant benefit from low latency cache as well

UncleOxidant · 2024-05-13T17:30:47

> Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit?

There was that recent paper titled "The Era of 1-bit LLMs" [0] which was actually suggeting a 1.58 bit LLM (2 bits in practice).

> Someone reading this is probably writing it in VHDL right now, or will be soon.

Yeah, I think I'm in the "will be soon" camp - FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and more details in [1]. There's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open source RISC-V integer implementations seems like something to try here adding in the tiled matrix registers and custom instructions for dealing with them (with the 2 bit data types).

[0] https://arxiv.org/abs/2402.17764 [1] https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...

anticensor · 2024-05-14T18:16:27

> There was that recent paper titled "The Era of 1-bit LLMs" [0] which was actually suggeting a 1.58 bit LLM (2 bits in practice).

1 trit, not 2 bits. 3 trits are 27 states, which can be represented with 5 bits.

renonce · 2024-05-13T00:55:37

> NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.

Wondering if anyone would be surprised that a huge amount of progress in AI is on the engineering side (optimizing matmuls), and that a huge portion of the engineering is about reverse engineering NVIDIA chips

DeathArrow · 2024-05-13T07:18:02

Architecture doesn't make a difference. Big enough models trained with big enough data tend to give the same results regardless of architecture. So yes, most advances in AI are mostly due to the fact we can now multiply matrices very fast.

elcomet · 2024-05-13T07:46:52

That's not completely true. The architecture must behave well for scaling, which is not trivial. Basic multi-layer perceptrons do not scale well for example, the gradient will vanish or explode deeper in the network.

3abiton · 2024-05-13T08:15:26

And data quality. Ensuring the sourcing and quality is very important to get a good model.

fleischhauf · 2024-05-13T11:12:51

this, if you have money to spend in improving your model, more training data is the first thing I'd take a look at

Tarrosion · 2024-05-13T16:26:00

How do modern foundation models avoid multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the transformers?

elcomet · 2024-05-24T18:45:31

They rely heavily on what we call residual or skip connexions. This means each layer does something like x = x + f(x). This helps the training a lot, ensuring the gradient can flow nicely in the whole network.

This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.

heavenlyblue · 2024-05-13T16:36:22

They don't do global optimisation of all layers at the same time, instead training all layers independently of each other.

mk67 · 2024-05-13T18:32:37

I'm in the industry and nobody does that since over ten years. There was just a small phase when Hinton published "Greedy layer-wise training of deep networks" in 2007 and people did it for a few years at most. But already with the rise of LSTMs in the 2010s this wasn't done anymore and now with transformers also not. Would you care to share how you reached your conclusion as it matches 0 of my experience over the last 15 years and we also train large-scale LLMs in our company. There's just not much point to it when gradients don't vanish.

Tarrosion · 2024-05-14T17:49:23

Why don't gradients vanish in large scale LLMs?

mk67 · 2024-05-19T08:46:47

Not easy to give a concise answer here, but let me try:

The problem mainly occurs in networks with recurrent connections or very deep architectures. In recurrent architectures this was solved via LSTMs with the signal gates. In very deep networks, e.g. ResNet, this was solved via residual connections, i.e. skip connections over layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.

Transformers, which are the main architecture of modern LLMs, are highly parallel without any recurrence, i.e. at any layer you still have access to all the input tokens, whereas in an RNN you process one token at a time. To solve the potential problem due to "deepness" they also utilize skip connections.

rfoo · 2024-05-13T09:06:14

idk, they do give the same results, but given the memory bottleneck it feels like we are at a point when architecture innovations matter again, for example check out DeepSeek V2 tech report, they modded model arch specifically for lower cost inference (by making k/v cache smaller)

__loam · 2024-05-13T20:31:31

Different architecture can result in hundreds of millions of dollars more in training costs no?

DeathArrow · 2024-05-14T05:56:52

Sure, but the point wasn't about the costs, it was about the capability of a trained model.

panki27 · 2024-05-13T13:29:41

Warp scheduler, 4 quadrants, tensor memory accelerator, unswizzled wgmma layouts...

The line between GPU lingo and Star Trek technobabble fades away further and further.

araes · 2024-05-13T17:28:50

There was some awareness reading the article, yet "we're warping through the quadrant in our tensor accelerator" is pretty Trek.

Have had that thought occasionally with some of the other articles. What it must read like to somebody who gets a ref link for an article over here. Wandered into some Trek nerd convention discussing warp cores.

winwang · 2024-05-13T19:24:52

I mean, if we're talking about "accelerating by modifying the metric tensor" then yeah, that would be pretty sci-fi :)

https://en.wikipedia.org/wiki/Metric_tensor_(general_relativ...

Agentlien · 2024-05-13T15:44:05

Your comment prompted me to take a step back and look at these terms with new eyes. That made me smile, because you're so right.

winternewt · 2024-05-13T09:17:57

I believe that reducing the power consumption and increasing the speed of AI inference will be best served by switching to analog, approximate circuits. We don't need perfect floating-point multiplication and addition, we just need something that takes an two input voltages and produces an output voltage that is close enough to what multiplying the input voltages would yield.

danielheath · 2024-05-13T15:03:10

I know someone working in this direction; they've described the big challenges as:

  * Finding ways to use extant chip fab technology to produce something that can do analog logic. I've heard CMOS flash presented a plausible option.
  * Designing something that isn't an antenna.
  * You would likely have to finetune your model for each physical chip you're running it on (the manufacturing tolerances aren't going to give exact results)

The big advantage is that instead of using 16 wires to represent a float16, you use the voltage on 1 wire to represent that number (which plausibly has far more precision than a float32). Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU, so the die space & power savings are potentially many, many orders of magnitude.

tasty_freeze · 2024-05-13T16:44:07

> which plausibly has far more precision than a float32

If that was true, then a DRAM cell could represent 32 bits instead of one bit. But the analog world is noisy and lossy, so you couldn't get anywhere near 32 bits of precision/accuracy.

Yes, very carefully designed analog circuits can get over 20 bits of precision, say A/D converters, but they are huge (relative to digital circuits), consume a lot of power, have low bandwidth as compared to GHz digital circuits, and require lots of shielding and power supply filtering.

This is spit-balling, but the types of circuits you can create for a neural network type chip is certainly under 8 bits, maybe 6 bits. But it gets worse. Unlike digital circuits where signal can be copied losslessly, a chain of analog circuits compounds the noise and accuracy losses stage by stage. To make it work you'd need frequent requantization to prevent getting nothing but mud out.

hnaccount_rng · 2024-05-15T06:28:14

You can get 8bit analog signal resolution reasonablyish easyish. The Hagen mode [1] of BrainScaleS [2] is essentially that. But.. yeah. No way in hell you are getting more than 16bit with that kind of technology, let alone more.

And those things are huge which lead to very small network sizes. This is partially due to the fabrication node, but also simply because there is even less well developed tooling for analog circuits compared to digital ones compared to software compilers

[1] https://electronicvisions.github.io/documentation-brainscale... [2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8907969/ [3] https://arxiv.org/pdf/2003.11996

bobmcnamara · 2024-05-13T15:12:37

> which plausibly has far more precision than a float32

+/- 1e-45 to 3.4e38. granted, roughly half of that is between -1 and 1.

When we worked with low power silicon, much of the optimization was running with minimal headroom - no point railing the bits 0/1 when .4/.6 will do just fine.

> Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU

You may want an adder. Wiring two circuit outputs directly together makes them fight, which is usually bad for signals.

cyanide911 · 2024-05-14T18:34:51

Do you have any references of papers/people working on this? I'm very interested in the possibilities that lie here, but have no idea where to start

rcxdude · 2024-05-13T23:14:21

an analog value in such a chip has far, far less resolution than a float32. Maybe you get 16 bits of resolution, more likely 8, and your multiplications are going to be quite imprecise. The whole thing hinges on the models being tolerant of that.

Symmetry · 2024-05-13T17:52:58

I think we're far away from analog circuits being practically useful, but one place that where we might embrace the tolerance for imprecision is in noisy digital circuits. Accepting that one in a million, say, bits in an output will be flipped to achieve a better performance/power ratio. Probably not when working with float32s where a single infinity[1] could totally mess things but for int8s the occasional 128 when you wanted a 0 seems like something that should be tolerable.

[1] Are H100s' maxtrix floating point units actually IEEE 754 compliant? I don't actually know.

SJC_Hacker · 2024-05-14T14:53:01

I'd go a step further, something which resembles how "wet brains" (biological) actually work, but which could be produced easily.

Biological neural networks are nowhere near as connected as ANNs, which are typically fully connected. With biological neurons, the ingress / egress factors are < 10. So they are highly local

It is also an entirely different model, as there is no such thing as backpropagation in biology (that we know of).

What they do have is lieu of backpropagation is feedback (cycles)

And maybe there are support cells/processes which are critical to the function of the CNS that we don't know of yet.

There could also be a fair amount of "hard coded" connectedness, even at the higher levels. We already know of some. For instances, it is known that auditory neurons in the ears are connected and something similar to a "convolution" is done in order to localize sound source. It isn't an a emergent phenomena - you don't have to be "trained" to do it.

This is not surprising give life has had billions of years and a comparable number of generations in order to figure it out.

I guess in theory this could all be done in software. However given the trillion+ neurons in primate/human brains, this would be incredibly challenging on even the thousand-core machines we have nowadays. And before you scream "cloud" it would not have the necessary interconnectedness/latency.

It would be cool if you could successful model say, a worm/insect with this approach.

sdwr · 2024-05-14T20:56:46

> What they do have is lieu of backpropagation is feedback (cycles)

I wonder where the partial data / feedback is stored. Don't want to sound like a creationist, but it seems very improbable that "how good my sound localization is" is inferred exclusively from the # of children I have.

SJC_Hacker · 2024-05-14T22:33:05

It’s evolved in simpler organisms with a much shorter generation cycle and more offspring per generation.

Being able to localize sound source has a lot of benefits including predation avoidance and prey detection.

sdwr · 2024-05-15T02:15:59

Almost convincing, except there's no animal composition (beyond gut microbes)! You can't stick 2 animals that each evolved 1 thing together.

brazzy · 2024-05-13T10:29:42

Sounds pretty impossble to me do that with a sufficient combination of range and precision.

atoav · 2024-05-13T10:54:38

What do you mean with inpossible? You are aware that what radio equipment does is often equivalent of analog operations like multiplication, addition, etc. just at high frequencies?

Sure accuracy is an issue, but this is not as impossible as you may think it would be. The main question will be if the benefits by going analog outweigh the issues arising from it.

Symmetry · 2024-05-13T12:38:04

In general the problem with analog is that every sequential operation introduces noise. If you're just doing a couple of multiplications to frequency shift a signal up and down that's fine. But if you've got hundreds of steps and you're also trying to pack huge numbers of parallel steps into a very small physical area.

rsp1984 · 2024-05-13T14:53:55

TBH that sounds like a nightmare to debug.

dnedic · 2024-05-13T12:15:17

How do you inspect what is happening then without having ADCs sampling every weight, taking up huge die area?

jkaptur · 2024-05-13T14:48:22

Maybe a silly question (I don't know anything about this) - how do you program / reprogram it?

Arch485 · 2024-05-13T19:32:14

Realistically, you'd train your model the same way it's done today and then custom-order analog ones with the weights programmed in. The advantage here would be faster inference (assuming analog circuits actually work out), but custom manufacturing circuits would only really work at scale.

I don't think reprogrammable analog circuits would really be feasible, at least with today's tech. You'd need to modify the resistors etc. to make it work.

cptroot · 2024-05-13T16:01:48

Here's an example of Veritasium talking about this from 2022: https://www.youtube.com/watch?v=GVsUOuSjvcg

hot_gril · 2024-05-14T16:53:32

Even staying within digital, GPUs are not totally designed for AI learning or inference. But the more important problem right now is standardization.

brap · 2024-05-13T09:30:03

I don’t know why you’re being downvoted, that’s an active area of research AFAIK

gitfan86 · 2024-05-13T09:43:26

Maybe because that is a VERY different problem than the one discussed here.

Building a single analog chip with 1 billion neurons would cost billions of dollars in a best case scenario. A Nvidia card with 1 billion digital neurons is in the hundreds of dollars of range.

Those costs could come down eventually, but at that point CUDA may be long gone.

cyanide911 · 2024-05-14T18:33:47

Do you have any references of papers/people working on this? I'm very interested in the possibilities that lie here, but have no idea where to start

perfmode · 2024-05-12T23:11:16

This article rekindles the joy I experienced during CS 149 Parallel Programming.

Aaryan44 · 2024-05-14T22:45:21

Kayvon and Kunle are amazing - I took CS149 Parallel Programming two quarters ago and loved it :)

perfmode · 2024-05-17T20:55:33

lucky! i took it 11 years ago.

would love to revisit the material, especially in this new era of specialized processing units and UMA.

figbert · 2024-05-13T05:38:32

Appreciate the recommendation, will check out the course!

latchkey · 2024-05-13T02:27:54

Really impressed by the writing style of this post and very much looking forward to this on AMD MI300x. Let me know if you want some time on mine.

tracker1 · 2024-05-13T15:48:51

Have you done much AI work against AMD products? I'm not going to plunk down $2500+ for an RTX 4090, but have been considering an RX 7900XTX for playing around with, or at least getting started. Just curious how well it will or won't work in practice, or if saving a bit more and getting a 7900 XT over the XTX might be a better option, and how much less vram might impact usefulness in practice.

latchkey · 2024-05-13T16:00:48

My only work with consumer AMD GPUs was mining ethereum, I had 150,000 of them.

If you want to use enterprise AMD gpus, I'm renting them. That said, I haven't even had a chance to run/play with them myself yet, they have been rented since I got them last month.

Yes, we are getting more.

PeterisP · 2024-05-13T18:10:12

Caveat emptor and your mileage may vary; but unlike nVidia where you could just assume that everything is compatible with everything, for AMD I'd strongly recommend that you try before you buy - consider renting a cloud machine with that GPU to check if the software works for your needs before committing to a large purchase.

latchkey · 2024-05-14T14:46:10

Agreed! The problem is that you cannot rent a MI300x or other high end AMD. They all go into HPC. A problem that I love to work on.

globular-toast · 2024-05-13T07:49:53

Good writing is clear and unambiguous. With speech there is an opportunity to interrupt and ask for clarification. Writing has one chance to get the message across. A reader shouldn't have to consult knowyourmeme.com to figure out what the heck the authors are trying to say. I don't even know what the title means here. That's how far they've missed the mark.

_obviously · 2024-05-13T11:09:48

Wow that really sucks for you. I just read it in 5 minutes and feel much more informed about the subject pf nvidia memory twizzlization. It's kind of funny to me that presumably young college guys are writing in a style that's very readable for my old ass.

unethical_ban · 2024-05-13T18:55:56

>that really sucks for you

How can I put this in your vernacular...

"Most polite genZ meme enjoyer"

aetimmes · 2024-05-13T11:35:58

Even if you're not familiar with the "go brrr" meme (which is the only use of meme-idiom in the article and is used exactly twice), its meaning is easily inferred via context clues from the opening paragraphs.

Good writing is also entertaining and engaging.

globular-toast · 2024-05-13T13:07:03

Keyword being also.

throwaway1492 · 2024-05-13T14:00:56

As someone who witnessed A-10 CAS fuck some stuff up in a combat zone ie the real “brrrrt” I’ve been mystified by the meme and current useage. No one knows where it comes from nor the slaughter it represents.

aoeusnth1 · 2024-05-13T14:20:33

You’re mistaken, the “go brrr” format comes from the money printer meme in 2020.

onemiketwelve · 2024-05-13T18:00:17

as intense as a a10 might be, it's short lived and only affects a few dudes on the receiving end. When the federal reserve goes brrr, it has far reaching impact that affects every single person in the global economy.

https://brrr.money/

jsemrau · 2024-05-13T04:59:43

Really? It gives me PTSD from the Wallstreetbets days.

forrestthewoods · 2024-05-13T05:19:48

I also enjoyed the article's style. I utterly despise "academic paper speak". It is, imho, not the most effective style to communicate complex ideas. I find it so much easier to learn from a more casual "blog post" or in-person presentation over stiff, rigid academic speak.

kaycey2022 · 2024-05-13T05:44:46

I find both to be useful in different stages. The casual style is very helpful when starting out. But once I have put in a few weeks or months of study in, then the rigor and preciseness of academic style is good as well.

I agree with you in the sense that something has "died" in writings the follow academic paper speak these days. Just yesterday I saw an ancient article surfaced by Scientific American and Peter Norvig on System Analysis by Strachey. It uses quite a bit of formal language but is super approachable at the same time. That kind of skill is rarely seen these days.

david927 · 2024-05-13T16:17:35

> the Wallstreetbets days.

https://twitter.com/TheRoaringKitty/status/17900418133798504...

diginova · 2024-05-13T06:57:26

What should I do if I want to understand such articles in complete? where to start on the roadmap?

kolinko · 2024-05-13T07:16:40

This is a good course on gpu programming. Around 4.0 lesson you’ll get the required basics: https://youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6Srgd...

Also, write your own cuda kernel to do vector-matrix multiplication (if you use pycuda, you can focus on the kernel, and write everything else with python). Just tell chatgpt that you want to write your own implementation that multiplies a 4000-element vector by 4000x12000 matrix, and to guide you through the whole process.

For renting gpus, runpods is great - right now they have everything from lower tier gpus to h100s. You can start with a lesser gpu at the beginning.

abstractcontrol · 2024-05-13T11:11:33

For a deep dive, maybe take a look at the Spiral matrix multiplication playlist: https://www.youtube.com/playlist?list=PL04PGV4cTuIWT_NXvvZsn...

I spent 2 months implementing a matmult kernel in Spiral and optimizing it.

selimthegrim · 2024-05-13T15:13:56

Are Winograd’s algorithms useful to implement as a learning exercise?

abstractcontrol · 2024-05-13T16:36:13

Never tried those, so I couldn't say. I guess it would.

Even so, creating all the abstractions needed to implement even regular matrix multiplication in Spiral in a generic fashion took me two months, so I'd consider that good enough exercise.

You could do it a lot faster by specializing for specific matrix sizes, like in the Cuda examples repo by Nvidia, but then you'd miss the opportunity to do the tensor magic that I did in the playlist.

selimthegrim · 2024-05-13T17:28:51

You are the author of the playlist/maker of the videos?

abstractcontrol · 2024-05-14T07:34:14

justplay · 2024-05-13T13:59:04

sorry for noob question, how gpu programming is helpful ?

abstractcontrol · 2024-05-13T16:47:37

NNs for example are (mostly) a sequence of matrix multiplication operations, and GPUs are very good at those. Much better than CPUs. AI is hot at the moment, and Nvidia is producing the kind of hardware that can run large models efficiently which is why it's a 2 trillion-dollar company right now.

However, in the Spiral series, I aim to go beyond just making an ML library for running NN models and break new ground.

Newer GPUs actually support dynamic memory allocation, recursion, and the GPU threads have their own stacks, so you could in fact treat them as sequential devices and write games and simulators directly on them. I think once I finish the NL Holdem game, I'll be able to get over 100x fold improvements by running the whole program on the GPU versus the old approach of writing the sequential part on a CPU and only using the GPU to accelerate a NN model powering the computer agents.

I am not sure if this is a good answer, but this is how GPU programming would be helpful to me. It all comes down to performance.

The problem with programming them is that the program you are trying to speed up needs to be specially structured, so it utilizes the full capacity of the device.

joaquincabezas · 2024-05-13T11:29:45

wow their graphs at the GitHub README (https://github.com/HazyResearch/ThunderKittens/blob/main/att...) make me extremely dizzy. Are these wavy bars even legal? :P

bogtog · 2024-05-13T11:41:28

I second this. It's like they're trying to incorporate some optical illusion. I'd even prefer just seeing numbers without any bars

hoosieree · 2024-05-13T12:42:39

It looks like the xkcd theme for matplotlib[1]. But I agree the waves are too extreme.

[1]: https://matplotlib.org/stable/gallery/showcase/xkcd.html#sph...

lucidrains · 2024-05-13T15:10:02

would be interested to see thunderkittens (great name!) tackle the flash attention backwards pass, which is an order of magnitude harder than the forward

Aaryan44 · 2024-05-13T23:23:17

good news - we've actually included optimized causal and non-causal versions of the flash attention backwards pass with TK - would love for you to check them out!

causal: https://github.com/HazyResearch/ThunderKittens/blob/main/exa...

non-causal: https://github.com/HazyResearch/ThunderKittens/blob/main/exa...

pama · 2024-05-14T04:05:21

Awesome. Do you happen to have a benchmark against the latest (v9.1) cuDNN implementation?

Aaryan44 · 2024-05-14T18:40:46

@pama, if useful - here are utilization numbers for our attention backwards kernels (causal and non-causal, head dim = 64): https://github.com/HazyResearch/ThunderKittens/blob/main/att...

lucidrains · 2024-05-14T00:02:06

amazing work! thank you!

Aaryan44 · 2024-05-14T18:39:51

Thanks @lucidrains :)

imiric · 2024-05-13T07:55:07

Hasn't this research been done by teams building NPUs today? E.g. chips built by Groq use an architecture built specifically for AI, which is why they're able to deliver the performance they do. On the consumer side, Apple silicon is also quite capable.

I'm not in this field at all, but it seems to me that using general purpose processors that communicate over (relatively) slow lanes can only get us so far. Rethinking the design at the hardware level, and eventually bringing the price down for the consumer market seems like a better long-term strategy.

resource_waste · 2024-05-13T12:15:27

>On the consumer side, Apple silicon is also quite capable.

I am not sure that is true. A glance/or long stay at the reddit localllama subreddit basically has a bunch of frustrated CPU users trying their absolute best to get anything to work at useful speeds.

When you can get an Nvidia GPU for a few hundred dollars or a full blown gaming laptop with a 4050 6gb vram for $900, its hard to call a CPU based AI capable.

Heck we don't have GPUs at work, and CPU based is just not really reasonable without using tiny models and waiting. We ended up requesting GPU computers.

I think there is a 'this is technically possible', and there is a 'this is really nice'. Nvidia has been really nice to use. CPU has been miserable and frustrating.

serialx · 2024-05-13T12:56:03

Actually, llama.cpp running on Apple silicon uses GPU(Metal Compute Shader) to inference LLM models. Token generation is also very memory bandwidth bottlenecked. On high end Apple silicon it's about 400MB/s to 800MB/s, comparable to NVIDIA RTX 4090, which has memory bandwidth of 1000MB/s. Not to mention that Apple silicon has unified memory architecture and has high memory models (128GB, up to 192GB), which is necessary to run large LLMs like Llama 3 70B, which roughly takes 40~75GB of RAM to work reasonably.

resource_waste · 2024-05-13T13:20:37

[flagged]

brrrrrm · 2024-05-13T14:19:45

I use it all the time?

imtringued · 2024-05-13T16:56:51

The number of people running llama3 70b on NVidia gaming GPUs is absolutely tiny. You're going to need at least two of the highest end 24 GB VRAM GPUs and even then you are still reliant on 4 bit quantization with almost nothing left for your context window.

resource_waste · 2024-05-14T15:43:27

The cognitive dissonance here.

70B models arent better than 7B models outside roleplay. The logic all sucks the same. No one even cares about 70B models.

imiric · 2024-05-13T12:42:58

I don't think NVIDIA's reign will last long. The recent AI resurgence is not even a decade old. We can't expect the entire industry to shift overnight, but we are seeing rapid improvements in the capability of non-GPU hardware to run AI workloads. The architecture change has been instrumental for this, and Apple is well positioned to move the field forward, even if their current gen hardware is lacking compared to traditional GPUs. Their silicon is not even 5 years old, yet it's unbeatable for traditional workloads and power efficiency, and competitive for AI ones. What do you think it will be capable of in 5 years from now? Same for Groq, and other NPU manufacturers. Betting on NVIDIA doesn't seem like a good long-term strategy, unless they also shift their architecture.

phinnaeus · 2024-05-13T00:25:35

FYI the caption of the "spirit animals" image says "canadian goose" instead of "Canada Goose".

adzm · 2024-05-13T01:22:27

Likely a regional thing; they are consistently called Canadian Geese where I grew up and where I currently live.

silisili · 2024-05-13T07:50:04

I've only heard people in my entire lifetime call them Canadian Geese.

The only time I've ever even seen or heard of Canada Goose/Geese are people on the internet telling others they are wrong.

I think it's time to just accept it as correct.

FearNotDaniel · 2024-05-13T08:51:09

Absolutely, it's like living in London and eventually having to accept that tourists will always say "Big Ben" when they mean the clock tower of the Palace of Westminster, which encloses the bell whose actual name is Big Ben. The name of the tower is, de facto, Big Ben, and life gets so much easier when you drop the urge to tell people they are wrong all the time...

Edit: TIL the tower was properly renamed "Elizabeth Tower" in 2012 [0] but I seriously doubt a single person in the last 12 years has ever used that name...

[0] https://en.wikipedia.org/wiki/Big_Ben

globular-toast · 2024-05-13T10:02:14

I wouldn't put that in the same category. If you say Canada Goose everyone still knows what you mean. If you say Elizabeth Tower, they probably don't.

hatthew · 2024-05-13T22:00:58

In real life, I have only ever heard Canada Goose.

xarope · 2024-05-13T05:24:46

I am missing the reference to the canadian goose and the retriever puppy as spirit animals. Is that to say the H100 is an ornery thing, but the RTX4090 is friendly?

Mtinie · 2024-05-13T10:06:04

I’d assumed (like you) it meant that the H100 is ornery AND pickier about what it consumes, while the RTX4090 is playful and will eat damn near anything within reach of its mouth (with its sharp, velociraptor-like puppy teeth), whether you want it to or not.

But that may be straining the meme somewhat. :)

downrightmike · 2024-05-13T01:14:24

Don't worry, the Geese are en route to location, resolution incoming. Stand by.

hoherd · 2024-05-13T13:21:48

In my experience, Canadian geese are never en route to anywhere. They stay next to the pond year round and crap everywhere you might want to step. EG: https://sanjosespotlight.com/going-to-santa-clara-central-pa...

bombcar · 2024-05-13T04:12:06

It’s a Canada Goose from Canada. A Canadian Canada Goose, or Canadian Goose.

gosub100 · 2024-05-13T05:04:19

https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...

wglb · 2024-05-13T01:10:57

An error too often made.

bn-l · 2024-05-13T07:24:45

Who cares

fastball · 2024-05-13T00:39:54

Canadian goose seems better in [current year], to avoid confusion with the clothing brand.

adrian_b · 2024-05-13T07:16:32

I consider bad the habit of English to use nouns also as adjectives, because it causes many ambiguities, some of which can be very annoying, even if they are a rich source of jokes and word plays.

In most languages the use of a noun as an adjective is marked, by a particle or by an affix or at least by a different stress pattern (like moving the stress to the last syllable), which removes the ambiguities.

So for most non-native speakers "Canadian goose" makes much more sense than "Canada goose" (which may feel like "Canada and a goose" or "a goose that is also Canada" and not like "a goose from Canada").

p0w3n3d · 2024-05-13T07:38:03

always the former noun is describing the latter. Butter fly is not a flying butter (as my children's teacher told them to make a joke about butterfly) but a fly made of butter instead.

kitd · 2024-05-13T07:33:03

"Canada" isn't being used as an adjective though. The name of the species is "Canada Goose", like "Long Island Shellfish" or "Dublin Bay Prawns".

adrian_b · 2024-05-14T03:18:29

You have just proven my point.

In the names "Canada Goose", "Long Island Shellfish" and "Dublin Bay Prawns", "Canada", "Long Island" and "Dublin Bay" are adjectives, because geese are not also "Canada", shellfish are not also "Long Island" and prawns are not also "Dublin Bay".

This kind of names is typical for English, but not for most other languages.

For instance, the scientific name of the Canada goose is "canadensis", which means "Canadian", not "Canada".

An adjective (in the broad sense) is a word that describes a subset of the set named by the noun to which it is attached.

While most languages also include distinct words that are adjectives in the narrow sense, i.e. which have degrees of comparison, adjectives in the broad sense (sometimes called relational adjectives) can be derived from any noun by various means, e.g. genitive case markers, prepositions, postpositions, suffixes, prefixes or accentual patterns, except for ambiguous languages like English, where any noun can also be used as an adjective, and sometimes also as a verb.

actionfromafar · 2024-05-13T07:32:47

Now you made me think of ways to English Adjective my text for word play... make it stop.

WanderPanda · 2024-05-13T00:20:22

Is this "just" CUTLASS in user friendly?

weinzierl · 2024-05-13T14:03:01

"For this post, we’re going to focus on the NVIDIA H100 [... because] we think the trends it implies are going to continue in future generations, and probably from other manufacturers, too."

Is it though? Wouldn't we expect to see more advanced packaging technology eventually?

If that happens the increased memory bandwidth could be an enabler for a unified memory architecture like in the Nvidia Jetson line. In turn that would make a lot of what the article says make GPU go Brr today moot.

bombela · 2024-05-13T10:55:20

I cannot tell for sure if units are really all power of 10.

I found some datasheet that states 80GB of VRAM, and a BAR of 80GiB. All caches are also in power of two. The bandwidth are all power of 10 though.

https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/da...

chefandy · 2024-05-13T15:21:59

One of my biggest struggles in doing AI stuff on consumer hardware is heat. I noticed zero discussion of this so I assume it's an implementation detail on small systems that doesn't really factor into more robust setups. Is that the really case, or is this just diving into the comp sci layer of hardware utilization and ignoring things like heat because it's not salient to this subtopic?

nostrebored · 2024-05-13T15:25:53

It factors into robust setups but is part and parcel of doing any HPC where you're pushing through a ton of TFLOPS. It's a problem that is assumed to have been solved when you're doing this kind of work.

hi-v-rocknroll · 2024-05-15T00:16:57

NVIDIAs stock will plummet in 3-4 years after Microsoft and Meta stop spending tens of billions without having a specific use for H100's and end up with a ridiculous amount of excess capacity. Hopefully, that means some H100-based systems will end up on eBay in ~5-8 years for home lab use.

_spl · 2024-05-13T09:28:39

It reminds me of when I first read about superscalar CPU architecture and was amazed. GPUs are really next level.

nmstoker · 2024-05-12T22:09:11

Some related material here too:

https://twitter.com/bfspector/status/1789749117104894179?t=k...

jauntywundrkind · 2024-05-12T23:02:29

The ThunderKittens mascot has great kitten/Sony-Aibo vibes. Nicely generated, AI (I presume). https://github.com/HazyResearch/ThunderKittens

layer8 · 2024-05-12T23:27:10

It looks off because the head isn’t centered on the neck.

Satam · 2024-05-13T05:39:06

Easy fix: https://imgur.com/a/Ahwt6tr (although not sure which one is actually better)

john_minsk · 2024-05-13T04:21:14

Great attention to detail! I, like the parent, was surprised by the quality as well. However now I can't unsee it:-)

wmab · 2024-05-14T16:52:27

The amount of comma splicing, (parentheses for extra points) -- and em dashes for good measure! that this post has makes it entirely unreadable.

eimrine · 2024-05-14T16:56:14

This is common in Russian texts but I haven't found any other signs of that suppose.

rajnathani · 2024-05-15T07:02:13

Not related to the substance of the post: They should've really avoided the 4chan like drawings in the post.

LordShredda · 2024-05-13T15:21:55

Standford research team just published an article with a wojak in it. That by itself is bigger news than AI

danjl · 2024-05-13T16:44:08

I bet traditional image processing would love to be implemented in ThunderKitten.

apsec112 · 2024-05-13T00:41:39

Interesting! Would this support fp8? Does anyone know how it would compare to Triton?