Grace Hopper, Nvidia's Halfway APU (chipsandcheese.com)
125 points by PaulHoule 39 days ago | 86 comments



If AI remains in the cloud, Nvidia wins. But I can't help but think that if AI becomes "self-hosted", if we return to a world where people own their own machines, AMD's APUs and interconnect technology will be absolutely dominant. Training may still be Nvidia's wheelhouse, but for a single device able to do all the things (inference, rendering, and computing), AMD, at least currently, would seem to be the winner. I'd love someone more knowledgeable in AI scaling to correct me here though.

Maybe that’s all far enough afield to make the current state of things irrelevant?


Nvidia still has 12-16GB VRAM offerings for around $300-400, which are exceptionally well optimized and supported on the software side. Still by far the most cost-effective option if you also value your time, imo. Strix Halo had better have high-tier, Mac-level bandwidth plus ROCm support and be priced below $1k, or it's just not competitive, because it'll still be slower than even partial CUDA offloading.


It has 80 GB/s of bandwidth from its dual-channel DDR5 implementation. I really don't think AMD is dogfooding any of their ML toolkits, which is a shame.
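For reference, a back-of-the-envelope check of where that figure comes from (a sketch assuming DDR5-5000; other speed grades scale linearly):

    # Peak theoretical bandwidth of a dual-channel DDR5 setup (assumed DDR5-5000).
    channels = 2
    bytes_per_transfer = 8         # each channel is 64 bits wide
    transfers_per_second = 5000e6  # 5000 MT/s
    peak_bw = channels * bytes_per_transfer * transfers_per_second / 1e9
    print(peak_bw)                 # -> 80.0 GB/s theoretical peak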


I remember reading geohot advocate for 7900XTX as a cost effective card for deep learning. I read AMD is backing off from the high end GPU market, though. Is there any chance they will at least continue to offer cards with lots of VRAM?


The cloud is more efficient at utilizing hardware. Except for low-latency or limited-connectivity requirements, the move to cloud will continue.


No, there will be plenty of low value inference that won't be economical in the cloud. Apple Intelligence is one example.


You may be seeing something that isn't there. I don't even know if MI300A is available to buy, what it costs, or if you'll be forced to buy four of them which would push prices close to DGX territory anyway.


You need orders of magnitude more compute for training than for inference. Nvidia still wins in your scenario.

Currently, rendering and local GPGPU compute are Nvidia-dominated, and I don't see AMD competently going after those market segments.


But you also run inference orders of magnitude more times, so it should still amount to more compute than training?
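As a very rough sanity check, using the common approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference (the model size and token counts below are illustrative assumptions, not real figures):

    # When does cumulative inference compute overtake training compute?
    N = 70e9   # parameters (illustrative)
    D = 15e12  # training tokens (illustrative)
    training_flops = 6 * N * D   # ~6*N*D rule of thumb
    flops_per_token = 2 * N      # ~2*N per generated token
    breakeven_tokens = training_flops / flops_per_token
    print(f"{breakeven_tokens:.1e} served tokens to match training compute")  # = 3*D

So whether inference dominates comes down to whether the model serves a few multiples of its training-token count over its lifetime.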


That matters more to the electricity company than the silicon company. The profit margins on the datacenter training hardware are stupidly high compared to an AMD APU.


If there are tens of thousands of training GPUs but billions of APUs, then what? BTW, training is such a high cost that it seems like a major motive for the customer to reduce costs there.


> If there are tens of thousands of training GPUs but billions of APUs, then what?

Believe it or not, we've actually been grappling with this scenario for almost a decade at this point. Originally the answer was to unite hardware manufacturers around a common feature set that could compete with (albeit not replace) CUDA. Khronos was prepared to elevate OpenCL to an industry standard, but Apple pulled their support for it and let the industry collapse into proprietary competition again. I bet they're kicking themselves over that one, unless they still hold a stronger grudge against Nvidia than against Khronos, at least.

So, logically, there's actually a one-size-fits-all solution to this problem. It was even going to be managed by the same people handling Vulkan. The problem was corporate greed and shortsighted investment that let OpenCL languish while CUDA was under active, heavy development.

> BTW, training is such a high cost that it seems like a major motive for the customer to reduce costs there.

Eh, that's kinda like saying "app development is so expensive that consumers will eventually care". Consumers just buy the end product; they are never exposed to building the software or concerned with the cost of the development. This is especially true with businesses like OpenAI that just give you free access to a decent LLM (or Apple and their "it's free for now" mentality).


This.

Most will probably use something like Llama as base.


Besides, if you separate them, the people doing the training will put way more effort into optimizing their hardware ROI than the ones doing inference.


I think this is the big point of uncertainty in Nvidia’s future: will we find new training techniques which require significantly less compute, and/or are better suited to some fundamentally different architecture than GPUs? I’m reluctant to bet no on that long term, and “long term” for ML right now is not very long.


If we find a new training technique that is that much more efficient, why do you think we won't just increase the amount of training we do by n times? (Or even more, since it's now accessible for smaller businesses to train custom models.)


We might, but it’s also plausible that it would change the ecosystem so much that centralized models are no longer so prominent. For example, suppose that with much cheaper training, most training is on your specific data and behaviors so that you have a model (or ensemble of models) tailored to your own needs. You still need a foundation model, but those are much smaller so that they can run on device, so even with overparameterization and distillation, the training costs are orders of magnitude smaller.

Or, in the small business case (mind you, "long term" for tech reaching small businesses is looooong), these businesses again need much smaller models because a) they don't need a model well versed in Shakespeare and multivariable calculus, and b) they want inference to be as low cost as possible.

These are just scenarios off the top of my head. The broader point is that a dramatic drop in training cost is a wildcard whose effects are really hard to predict.


I'd bet that any AI that is really useful for the tasks people want to push LLMs into will answer "yes" to both parts of your question.

But I don't know what "long term" is exactly, and have no idea how to time this thing. Besides, I'd bet the sibling invoking the Jevons paradox is correct.


I’m betting the opposite: new model architectures will unlock greater abilities at the cost of massive compute.


Even so, massive compute doesn’t necessarily mean GPU-friendly compute. We could see a breakthrough in analog or neuromorphic hardware, for example, where Nvidia isn’t well positioned. Or we could see a training breakthrough which is far more efficient, but bottlenecked on single core performance, or just branch-heavy performance. You can imagine scenarios like that where GPUs still play a role, but where even today’s top of the line GPUs are way over the top compared to the CPU bottleneck.

If one of those scenarios happens, maybe Nvidia can pivot, or if we see analog take over, we could see something really bizarre like a dark horse like Seagate taking over by pivoting from SSDs, just because their manufacturing pipeline is more compatible.


If compute is gonna play the role of electricity in the coming decades, then having a compute wall similar to the Tesla Powerwall is a necessity.


Powerwall and electric car in the garage, compute wall in the closet, 3d printer and other building tools in the manufacturing room, hydroponics setup in the indoor farm room, and AI assistant to help manage it all. The home becomes a mixed living area and factory.


The vision of this sounds so cool, but man, for a lot of use cases at the moment most 'smart home' stuff is still complicated and temperamental.

How do we get from here to there, cause I want to get there so bad.


Tech and economic development can help a bit; for example, the new Bambu 3D printer makes 3D printing a lot more "idiot proof".

However, I think we need AI beyond current LLMs to really take us there. I'm not saying LLMs can't get us there (we don't know), just something beyond what we have now. We need AI that we can trust with real tasks IRL.


Powerwall makes sense because you can’t generate energy at any time and, therefore, you store it. Computers are not like that - you don’t “store” computations for when you need them - you either use capacity or you don’t. That makes it practical to centralise computing and only pay for what you use.


I was going to make a pedantic argument/joke about memoization.

It is kind of an interesting thought, though. A big wall of SSD is a fabulous amount of storage, and maybe a clever read-only architecture would be cheaper than SSD. And with a clever data structure for shared high-order bits, maybe, maybe there is potential for some device to look up matrix multiply results, or close approximations that could be cheaply refined.

Right now, I doubt it. But a big static cache is a kind of interesting idea to kick around on a Saturday afternoon.
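A minimal sketch of what that lookup idea could look like, purely hypothetical: key on a hash of the quantized operands and hand-wave collisions and approximation error away.

    import numpy as np

    _cache = {}  # the hypothetical "big wall of SSD", here just an in-memory dict

    def _key(a, b, decimals=2):
        # Quantize the operands so "close enough" inputs map to the same entry.
        return (np.round(a, decimals).tobytes(), np.round(b, decimals).tobytes(),
                a.shape, b.shape)

    def cached_matmul(a, b):
        k = _key(a, b)
        if k not in _cache:
            _cache[k] = a @ b   # miss: compute and store
        return _cache[k]        # hit: return the stored (approximate) result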


> I was going to make a pedantic argument/joke about memoization.

You are reading the GP the wrong way around.

You store partial results exactly because you can't store computation. Computation is perishable¹, you either use it or lose it. And one way to use it is to create partial results you can save for later.

1 - Well, partially so. Hardware utilization is perishable, but computation also consumes inputs (mostly energy) that aren't. How perishable it is depends on the ratio of those two costs, and your mobile phone has a completely different outlook from a supercomputer.


> maybe there is potential for some device to look up matrix multiply results, or close approximations that could be cheaply refined.

Shard that across the planet and you'd have a global cache for calculations. Or a lookup for every possible AI prompt and its results.


I didn't mean a compute wall in the sense of storing compute, but in the sense of a client-server model where the client (the compute wall, maybe made of a DGX-2) could continue to function independently in cases of natural calamity or other issues.


Only if improvements in speed and energy savings slow down


And if models don't get any larger, which they will


I am really surprised to see that the performance of the CPU, and especially the latency characteristics, are so poor. The article alludes to the design likely being tuned for specific workloads, which seems like a good explanation. But I can't help but wonder if throughput at the cost of high memory latency is just not a good strategy for CPUs, even with the excellent branch predictors and clever OOO work that modern CPUs bring to the table. Is this a bad take? Are we just not seeing the intended use case where this thing really shines compared to anything else?


What's the point of having the GPU on die for this? Are they expecting people to deploy one of these nodes without dedicated GPUs? It has a ton of NVLink connections which makes me think that these will often be deployed alongside GPUs which feels weird.

The flip side of this is if the GPU can access the main system memory then I could see this being useful for loading big models with much more efficient "offloading" of layers. Even though bandwidth between GPU->LPDDR5 is going to be slow, it's still faster than what traditional PCI-E would allow.
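For comparison, here is roughly what that layer offloading looks like today with PCIe-attached cards, as a sketch using Hugging Face transformers/accelerate (the model id and memory budgets are placeholder assumptions):

    # Sketch: split a model's layers between GPU VRAM and system RAM.
    # Layers placed on "cpu" get shuttled over the CPU<->GPU link at inference time;
    # a faster NVLink-attached memory pool would make that link much less painful.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B",                 # placeholder model id
        device_map="auto",                          # let accelerate place the layers
        max_memory={0: "24GiB", "cpu": "256GiB"},   # illustrative budgets
        torch_dtype="auto",
    )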

The caveat here is that I imagine these machines are $$$ and enterprise only. If something like this was brought to the consumer market though I think it would be very enticing.

(If anybody from AMD is reading this, I feel like an architecture like this would be awesome to have. I would love to run Llama 3.1 405b at home and today I see zero path towards doing that for any "reasonable" amount of money (<$10k?).)

Edit: It's at the bottom of the article. These are designed to be meshed together via NVLink into one big cluster.

Makes sense. I'm really curious how the system RAM would be used in LLM training scenarios, or if these boxes are going to be used for totally different tasks that I have little context into.


We're using the Orin AGX for edge ML. Not the same setup (Ampere), but it's a similar situation. The GPU is excellent for what we need it to do, but the CPU cores are painful. We're lucky… the CPUs aren't great, but there are 12 of them, and we can get away with carefully pipelining our data flows across multiple threads to get the throughput we need, even though some individual stage latencies aren't what we'd like.
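A simplified sketch of that kind of staged, queue-decoupled pipeline (the preprocess/infer stages here are just placeholders):

    import queue, threading

    def run_stage(work, q_in, q_out):
        # Each stage runs in its own thread; bounded queues decouple stage latencies
        # so one slow stage doesn't stall overall throughput.
        while True:
            item = q_in.get()
            if item is None:        # sentinel: propagate shutdown downstream
                q_out.put(None)
                return
            q_out.put(work(item))

    preprocess = lambda x: x        # placeholder CPU-bound stage
    infer = lambda x: x             # placeholder GPU-bound stage

    q0, q1, q2 = queue.Queue(8), queue.Queue(8), queue.Queue(8)
    threading.Thread(target=run_stage, args=(preprocess, q0, q1), daemon=True).start()
    threading.Thread(target=run_stage, args=(infer, q1, q2), daemon=True).start()

    for frame in range(100):
        q0.put(frame)               # producer feeds the first stage
    q0.put(None)
    results = list(iter(q2.get, None))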


I'd really like to get hold of a model for a modern CPU and properly analyse what all the performance features actually get us in terms of performance.

- Branch prediction and speculative execution
- Out-of-order execution
- Massive physical register files and register renaming
- Cache predictors
- and many more, I'm sure.

Speculative execution is the big one for me, just because of the information leakage possible through it. It's there because you'd have to pause fetching new instructions until the result of a conditional branch is known, which has knock-on effects to instruction scheduling... But how big are these effects? Do some certain combinations of features supercharge or work against each other?

I'm sure there's people looking at such things inside Intel and AMD, but it doesn't seem like there's much out there for public consumption.


These CPUs are intended just to run miscellaneous tasks, such as loading AI models or running the cluster operating system. They don't need to be performant, just efficient, as the GPU does all the heavy lifting. NVIDIA also provides an option to swap the Grace chip out for an x86 chip, which could deliver better performance depending on the remaining power budget.


If this is all there is to it, why do they have the high clock frequency and the large L3 cache? Those seem to be optimizing for something, not just a "good enough" configuration for a part that is not the bottleneck.


Data augmentation in CPU-space is often compute-light, but requires rapid access to memory. There are libraries (like NVIDIA's Dali) that can do augmentation on the GPU, but this takes up GPU resources that could be used by training. Having a multi-core CPU with fast caches is a good compromise.
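A minimal sketch of that compromise in generic PyTorch terms, keeping augmentation in CPU worker processes so the GPU stays free for training (nothing here is specific to Grace Hopper; the augmentations are stand-ins):

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class AugmentedDataset(Dataset):
        def __len__(self):
            return 10_000

        def __getitem__(self, idx):
            # CPU-side augmentation: light on compute, heavy on memory traffic.
            img = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in sample
            if np.random.rand() < 0.5:
                img = img[:, :, ::-1].copy()                       # random horizontal flip
            img += np.random.normal(0, 0.01, img.shape).astype(np.float32)  # noise jitter
            return torch.from_numpy(img)

    # Worker processes run __getitem__; pinned memory speeds the host->GPU copy.
    loader = DataLoader(AugmentedDataset(), batch_size=64,
                        num_workers=8, pin_memory=True)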


This kind of hardware makes sense for video games, and I guess GPU heavy workloads like AI might be similar? Most games have middling compute requirements but will take as much GPU power as you can give them if you're trying to run at high resolutions/settings. Although getting smooth gameplay at very high frame rates (~120hz+) does need a decent CPU in a lot of games.

Look at how atrocious the CPUs were in the PS4/Xbone generation for an example of this.


Grace Hopper was not designed for games though.


And yet the PS4 / Xbox One still rule the games console market, because for a large market segment more polygons alone aren't worth buying a PS5 or Xbox Series, hence the declining sales and the attempts to cater to PC gamers as an alternative.


Yes, I was making the point that the CPUs in the PS4/Xbone were terrible (seriously, look up benchmarks - they're basically underclocked pre-Ryzen AMD) and that didn't matter for performance because video game workloads are so heavily skewed towards the GPU. I know they were successful.


Irrelevant, but the intro reminded me that Nvidia also used to dabble in chipsets like nForce, back when there was supplier variety in such things.


I think that stopped when Intel said Nvidia couldn't produce chipsets for some CPU architecture they were coming out with.

I don't know if this was market savvy or a shot in the foot that made their ecosystem weaker.


The transition point was when Intel moved the DRAM controller and PCIe root complex onto the CPU die, merging in the northbridge and leaving the southbridge as the only separate part of the chipset. The disappearance of the Front Side Bus meant Intel platforms no longer had a good place for an integrated GPU other than on the CPU package itself, and it was years before Intel's iGPUs caught up to the Nvidia 9400M iGPU.

In principle, Nvidia could have made chipsets for Intel's newer platforms where the southbridge connects to the CPU over what is essentially four lanes of PCIe, but Intel locked out third parties from that market. But there wasn't much room for Nvidia to provide any significant advantages over Intel's own chipsets, except perhaps by undercutting some of Intel's product segmentation.

(On the AMD side, the DRAM controller was on the CPU starting in 2003, but there was still a separate northbridge for providing AGP/PCIe, with a relatively high-speed HyperTransport link to the CPU. AMD dropped HT starting with their APUs in 2011 and the rest of the desktop processors starting with the introduction of the Ryzen family.)


The argument was before that transition.

AFAIR the contentious point was that Nvidia had a license to the bus for P6 arch (by virtue of Xbox) but did not have a license for the P4 bus.

AMD was also more than happy to have NVDA build chipsets for Hammer/etc., especially due to AMD not having a video core... at the time.

Once the AMD/ATI merger started, that was the real writing on the wall.


Nvidia's chipset line for Intel motherboards started with the Pentium 4. There may have been relationship issues between the two companies that prevented Nvidia from entering the Intel chipset market sooner using a derivative of their Xbox chipset, but none of that has anything to do with what ended the nForce chipsets for Intel.


SoundStorm vs Dolby is such a turning-point story. Nvidia had a 5 billion op/s DSP and Dolby Digital encoding on that chipset. Computers were coming into their own as powerful universal systems that could do anything.

Then Dolby cancelled the license. To this day you still need very fancy sound cards or exotic motherboards to be able to output good surround sound to a large number of AV receivers. There are some open DTS standards that Linux can do too; dunno about Windows/Mac.

But it just felt like we slid so far down, that Dolby went & made everything so much worse.

(Media software can do Dolby pass-through to let the high-quality sound files through, yes. But this means you can't do any effect processing, like audio normalization/compression for example. And if you are playing games, your amp may be getting only basic low-quality surround, not the good many-channel stuff.)


Do you mean AC3? Ffmpeg has been able to do that since forever.

https://en.wikipedia.org/wiki/Dolby_Digital


There's some debate about what patents apply, but even Dolby had to admit defeat as of 2017. So yes, a 640 kbit/s 6-channel format is available for encoding in ffmpeg and some others.

I don't know if games are smart enough to use this?

It also feels like a very low bar. It's not awful bitrate for 6 channels but neither is it great. It's not a pitiful number of channels but again neither is it great.

Last and most crucially, just because one piece of software can emit AC3 doesn't make it particularly useful for a system. I should be able to have multiple different apps doing surround sound, sending notifications to back channels or panning sounds as I prefer. Yes, ffmpeg can encode 5.1 media audio to an AVR, but that doesn't really substitute for an actual surround system.

This is more a software problem, now that the 5.1 AC3 patents are expired. And there have been some stacks in the past where this worked on Linux for example. But it seems like modern hardware (with a Sound Open Firmware) has changed a bit and PipeWire needs to come up with a new way of doing ac3/a52 encoding. https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/32...


I once went down a rabbit hole of trying to get realtime AC3 encoding on my desktop PC, and I broadly failed.

That was a long time ago. It is now 2024.

Do we still need that today? For modern AVRs we have HDMI, with 8 channels worth of up to 24bit 192kHz lossless digital audio baked in.

For old AVRs with multichannel analog inputs, motherboards with 6 or 8 channels of built-in audio are still common-enough, as are separate sound cards with similar functionality.

What's the advantage of realtime AC3 encoding today, do you suppose?


One reason to want Dolby encoding is to play back on your consumer home theater gear that decode it. Alternatively though, just don't use that kind of gear.


I'm bit confused about your last paragraph - what's low quality about Dolby Atmos / DTS:X output you get for games these days?


One place you'll find said chipset is in the OG XBox, where they provided the Southbridge "MCPX" chip as well as the GPU.

https://classic.copetti.org/writings/consoles/xbox/#io


In my tests of a Supermicro ARS-111GL-NHR with an Nvidia GH200 chipset, I found that my benchmarks performed far better with the RHEL 9 aarch64+64k kernel versus the standard aarch64 kernel, particularly with LLM workloads. Which kernel was used in these tests?


"Far better" is a little vague, what was the actual difference?


Not OP but was curious about the "+64k" thing and found this[1] article claiming around 15% increase across several different workloads using GH200.

FWIW for those unaware like me, 64k refers to 64kB pages, in contrast to the typical 4kB.

[1]: https://www.phoronix.com/review/aarch64-64k-kernel-perf
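For anyone wanting to confirm which kernel they're actually on, the page size is easy to query (a quick check, nothing GH200-specific):

    import os
    # 4096 on the standard kernel, 65536 on the 64k-page aarch64 kernel.
    print(os.sysconf("SC_PAGE_SIZE"))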


Side note: The acronym APU was used in the title but not once defined or referenced in the article?


It’s an established term (originally by AMD) for a combination of CPU and GPU on a single die. In other words, it’s a CPU with integrated accelerated graphics (iGPU). APU stands for Accelerated Processing Unit.

Nvidia’s Grace Hopper isn’t quite that (it’s primarily a GPU with a bit of CPU sprinkled in), hence “halfway” I guess.


Here's my reasoning of what an APU is based on letter indices: if A is 1, C is 3 and G is 7, then to get an APU, you need to do what it takes to go from GPU to a CPU, and then apply an extra 50% effort.


This... is technically wrong, but it's the best kind of wrong.


Somewhat tangential, but did Nvidia ever confirm if they cancelled their project to develop custom cores implementing the ARM instruction set (Project Denver, and later Carmel)?

It’s interesting to me that they’ve settled on using standard Neoverse cores, when almost everything else is custom designed and tuned for the expected workloads.


Already in Nvidia Orin, which replaced Xavier (with its Carmel cores) a couple of years ago, the CPU cores have been Cortex-A78AE.

So Nvidia gave up on designing CPU cores some years ago.

The Carmel core had performance similar to the Cortex-A75, even though it launched when the Cortex-A76 was already available. Moreover, Carmel had very low clock frequencies, which diminished its performance even more. Like Qualcomm and Samsung, Nvidia has not been able to keep up with the Arm Holdings design teams. (Now Qualcomm is back in the CPU design business only because they acquired Nuvia.)


> The downside is Genoa-X has more than 1 GB of last level cache, and a single core only allocates into 96 MB of it.

I wonder if AMD could license the IBM Telum cache implementation where one core complex could offer unused cache lines to other cores, increasing overall occupancy.

Would be quite neat; even if cross-complex bandwidth and latency are not awesome, it should still be better than hitting DRAM.


> The first signs of trouble appeared when vi, a simple text editor, took more than several seconds to load.

Can it run vi?


It always made sense to have a single chip instead of two; I just want to buy a single package with both things on the same die.

That might make things much simpler for people who write kernels, drivers, and video games.

The history of CPUs and GPUs prevented that; it was always more profitable for CPU and GPU vendors to sell them separately.

Having two specialized chips makes more sense because it's flexible, but since frequencies are stagnating, having more cores makes sense, and AI means massively parallel things are not only for graphics.

Smartphones are much more modern in that regard. Nobody upgrades their GPU or CPU anymore; might as well have a single, soldered product that lasts a long time instead.

That may not be the end of building your own computer, but I just hope it will make things simpler and in a smaller package.


It's not about profit, it's about power and pin budget. A proper GPU needs lots of memory bandwidth, which means lots of memory-dedicated pins (HBM kinda solves this, but has tons of other issues). And on the power/thermal side, having two chips, each with dedicated power circuits, heatsinks, and radiators, is always better than one. The only reasons NOT to have two chips are either space (that's why we have integrated graphics, and it sucks performance-wise), packaging costs (not really a concern for consumer GPUs/CPUs where we are now), or interconnect costs (but for both gaming and compute, CPU-GPU bandwidth is negligible compared to GPU-RAM).


The article talks about the difference in the prefetcher between the two Neoverse setups (Graviton and Grace Hopper). However, isn't the prefetcher part of the core design in Neoverse? How would they differ?


I believe the difference is in the cache hierarchy (more L3, less L2) and the generally high latency to DRAM, and even higher latency to HBM. This makes the prefetcher behave differently between the two implementations, because the L2 cache isn't able to absorb the latency.


That was my initial read, but they have this line, which made me wonder if it was somehow more than that:

> I suspect Grace has a very aggressive prefetcher willing to queue up a ton of outstanding requests from a single core.


Oh good point, maybe that is configurable as well.


This is good for datacenters, but... Nvidia stopped doing anything for the consumer market.


Yeah so I also benchmarked GH200 yesterday and I am also a bit puzzled TBH:

https://github.com/mag-/gpu_benchmark


I suggest that wherever you write "TFLOPS", you should also write the data type for which they were measured.

Without knowing whether the operations were performed on FP32, FP16, or another data type, all the numbers written on that page are meaningless.
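For example, a rough sketch of measuring matmul TFLOPS per data type with PyTorch (matrix size and iteration count are arbitrary; a real benchmark needs warm-up runs and care with TF32/tensor-core settings):

    import time
    import torch

    def matmul_tflops(dtype, n=8192, iters=20):
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
        return 2 * n**3 * iters / elapsed / 1e12  # 2*n^3 FLOPs per matmul

    for dt in (torch.float32, torch.float16, torch.bfloat16):
        print(dt, f"{matmul_tflops(dt):.1f} TFLOPS")  # results differ widely by dtype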


I’m torn: NVIDIA has a fucking insane braintrust of some of the most elite hackers in both software and extreme cutting edge digital logic. You do not want to meet an NVIDIA greybeard in a dark alley, they will fuck you up.

But this bullshit with Jensen signing girls’ breasts like he’s Robert Plant and telling young people to learn prompt engineering instead of C++ and generally pulling a pump and dump shamelessly while wearing a leather jacket?

Fuck that: if LLMs could write cuDNN-caliber kernels that’s how you would do it.

It’s ok in my book to live the rockstar life for the 15 minutes until someone other than Lisa Su ships an FMA unit.

The 3T cap and the forward PE and the market manipulation and the dated signature apparel are still cringe and if I had the capital and trading facility to LEAP DOOM the stock? I’d want as much as there is.

The fact that your CPU sucks ass just proves this isn’t about real competition just now.


Sir this is a Wendy's


This is Y-Combinator. Garry Tan is still tweeting embarrassing Utopianism to faint applause and @pg is still vaguely endorsing a rapidly decaying pseudo-argument that we’re north of securities fraud.

At Wendy’s I get a burger that’s a little smaller every year.

On this I get Enron but smoothed over by Dustin’s OpenPhilanthropy lobbyism.

I’ll take the burger.

edit:

tinygrad IS brat.

YC is old and quite weird.


> tinygrad IS brat.

pytorch but it's minimal so it's not


Hell to the Yeah, it's filled with old weird posts: https://news.ycombinator.com/item?id=567736


I’m not important enough to do opposition research on, it bewilders me that anyone cares.

I was 25 when I apologized for trolling too much on HN, and frankly I’ve posted worse comments since: it’s a hazard of posting to a noteworthy and highly scrutinized community under one’s own name over decades.

I’d like to renew the apology for the low-quality, low-value comments that have happened since. I answer to the community on that.

To you specifically, I’ll answer in the way you imply to anyone with the minerals to grow up online under their trivially permanent handle.

My job opportunities and livelihood move up and down with the climate on my attitudes in this forum but I never adopted a pseudonym.

In spite of your early join date which I respect in general as a default I remain perplexed at what you’ve wagered to the tune of authenticity.


It’s my hope that this thread is over.

You joined early, I’ve been around even longer.

You can find a penitent post from me about an aspiration of higher quality participation, I don’t have automation set up to cherry-pick your comments in under a minute.

My username is my real name, my profile includes further PII. Your account looks familiar but if anyone recognizes it on sight it’s a regime that post-dates @pg handing the steering wheel to Altman in a “Battery Club” sort of way.

With all the respect to a fellow community member possible, and it’s not much, kindly fuck yourself with something sharp.


Err .. you getting enough sleep there?


There's no drama as far as I'm concerned, I got a sensible chuckle from your comment & figured it deserved a tickle in return; the obvious vector being anyone here since 2008 has earned a tweak for calling the HN crowd 'old' (something many can agree with).

My "opposition research" was entirely two clicks, profile (see account age), Submissions (see oldest).

As for pseudonyms, I've been online since Usenet and have never once felt the need to advertise on the newfangled web (1.0, 2.0, or 3); handles were good enough for Ham Radio, and TWKM - Those Who Know Me Know Who I Am (and it's not at all that interesting unless you like yarns about onions on belts and all that jazz).


I'm pretty autistic; after haggling with Mistral, this is what it says a neurotypical person would say to defuse a conflict:

I want to apologize sincerely for my recent comments, particularly my last response to you. Upon reflection, I realize that my words were hurtful, disrespectful, and completely inappropriate, especially given the light-hearted nature of your previous comment. I am truly sorry for any offense or harm I may have caused.

Your comment was clearly intended as a friendly jest, and I regret that I responded with such hostility. There is no excuse for my behavior, and I am committed to learning from this mistake and ensuring it does not happen again.

I also want to address my earlier comments in this thread. I now understand that my attempts to justify my past behavior and dismiss genuine concerns came across as defensive and disrespectful. Instead of taking responsibility for my actions, I tried to deflect and downplay their impact, which only served to escalate the situation.

I value this community and the opportunity it provides for open dialogue and growth. I understand that my actions have consequences, and I am determined to be more mindful, respectful, and considerate in my future interactions. I promise to strive for higher quality participation and to treat all members of this community with the kindness and respect they deserve.

Once again, I am truly sorry for my offensive remarks and any harm they may have caused. I appreciate the understanding and patience you and the community have shown, and I hope that my future actions will reflect my commitment to change and help rebuild any trust that may have been lost.


Cheers for that, it's a good apology.

Again, no drama - my sincere apologies for inadvertently poking an old issue, there was no intent to be hurtful on my part.

I have a thick skin, I'm Australian, we're frostily polite to those we despise and call you names if we like you - it can be offputting to some. :)


The best hacker I know is from Perth, I picked up the habit of the word “legend” as a result.

You’ve been a good sport legend.



