AMD's Next GPU Is a 3D-Integrated Superchip (ieee.org)
183 points by viewtransform on Dec 15, 2023 | 124 comments



I want one of these a lot. Spec https://www.amd.com/en/products/accelerators/instinct/mi300/....

It's 24 x86-64 cores and 228 CDNA compute units sharing a common 128 GB of memory. Personally I want to run constraint solvers on one. The general approach of doing lots of integer/float work on the GPU and branchy work on the CPU, both hitting the same memory, feels like an order of magnitude capability improvement over current systems.


Unfortunately the BoM on these things is probably super high. Like abnormally high, even for a datacenter GPU.

AMD is coming out with a "Strix Halo" APU that is somewhat appropriate for compute.

https://hothardware.com/news/amd-strix-halo-cpu-rumors


> Unfortunately the BoM on these things is probably super high... even for a datacenter GPU

You would save money on improved yields of chiplets and on using cheaper nodes where appropriate. I imagine that would help offset the increased packaging costs.

However, they are moving to compete in a market where the profit margins are so obscene that a small increase in the cost of materials really wouldn't be an important factor.

> Nvidia Makes Nearly 1,000% Profit on H100 GPUs: Report

https://www.tomshardware.com/news/nvidia-makes-1000-profit-o...


And that's the consequence of giving the market leader a decade and a half head start. No one and I mean no one was preventing AMD or Intel from building a competitive hardware and software ecosystem other than themselves.


It was a David against two Goliaths situation until recently.

Five years ago, when AMD was recovering from near bankruptcy, it was battling Intel (100K+ engineers) in CPUs and Nvidia (20K+ engineers) in GPUs.

It was doing that with just 10K engineers and a fraction of the R&D budget of the giants.

Things have changed now. AMD has more than doubled their headcount and R&D budgets. Within the last year the company has pivoted to AI as their main focus.


In the case of AI computing, my layman's perspective of the last decade is that NVIDIA created a market where one did not previously exist "at scale" as a long bet that is paying off significantly.

It seems unfair to criticize AMD or Intel for placing losing bets on other market segments when things could have just as easily turned out differently.


It could be argued that Nvidia sabotaged OpenCL. Part of the reason that nobody used OpenCL was that it didn't work very well on Nvidia hardware, which was the dominant hardware. So unseating Nvidia was an impossible chicken/egg problem: to unseat Nvidia you needed OpenCL to succeed, but OpenCL couldn't succeed without good support on Nvidia hardware.

OpenCL sucked for other reasons too, but it likely wouldn't have succeeded even if it hadn't sucked.


That'd be a tough argument. Intel and AMD aren't responding to CUDA by pumping resources into OpenCL, they're building completely new libraries. That suggests NVidia was frustrated by its partners in the OpenCL design process and just built their own system that worked.


Yeah, NVIDIA expressed their "frustration" by disabling memory channels (3 concurrent transfers for CUDA, 1 for OpenCL) and by advertising support for newer OpenCL versions without providing real support (builds failing with cryptic compile errors), causing OpenCL code to perform poorly and, in some cases, preventing it from running on their own hardware at all.

Other than locking out the technologies they see as threatening, they're a completely fair company, if you want to believe that.


Yeah, I can't reasonably buy one. But rent one to do some maths? Good odds.

There are some existing APU systems - I use low wattage ones as thin clients. Currently thinking it should be possible to write code that runs slowly on a cheap APU and dramatically faster on a MI300A system. Debug locally, batch compute on AWS (or wherever ends up hosting these things).


BoM?


Bill of Materials.

Also known as the cost of the chip.


Bill of Materials should really only be used for things that require a list of items/labor and aren't sold individually, such as a datacenter, a building, a hardware integration project, a wedding, etc. For things that are sold individually, "cost" will suffice, and BoM is an example of incorrectly using a more complicated term for the sake of seeming smart.


> Also know as cost of the chip.

As a shorthand for expected retail cost, that is terribly misapplied. Or even for internal cost.

The chip would be one line item on the BOM for the entire thing you plan on shipping (but not packaged yet). Even in the case you are "just" selling a chip, the BOM is likely more complicated, and in this context primarily the manufacturing side cares about that. The COGS (cost of goods sold) is something the company as a whole will care more about - this is what it actually costs you to get it out the door. You will hear "BOM cost" referring to the elemental cost of one item on the BOM, but that's not the BOM itself.

None of these are related to the retail (or wholesale) cost in a simple way, other than forming a floor on the long-term sustainable price.

The GGG-whatever post is using this sloppily to suggest that the chips are going to be very expensive to produce, therefore the product is going to be expensive.

You'll also see BOM in a materials and labor type invoice, (like when you get your car serviced) but that's not relevant here.


Yeah fair, my comment was a terrible comment.

What I really meant was total production cost + reasonable amortization cost for the development/tape-out, which of course is huge even if this chip were mass produced.

The point kinda stands though: this thing is collectively way too hard to fab + assemble to sell to consumers at any reasonable cost, especially on top of the massive R&D that AMD, TSMC and everyone along the huge chain put into it.

It's not like a 4090 or Apple Silicon, where production is reasonably cheap but margins are super high because they can be super high.


Bill of Materials


If you can figure out how to effectively utilize this much parallelism for constraint solving, you are likely going to get a Turing award.


I want to run image stacking for my astrophotography habit. 12 hours to drizzle my current data sets is a long wait.


What tool would you advise for finding the best combination of multipliers and dividers in the PLL clocks and dividers in the UART and SPI peripherals, so that the deviation from the target baud rate is as small as possible and the dividers meet a number of constraints, including e.g. the one from the chip errata which says that one PLL must be 2 times faster than the other?


Write a half-page Python script that iterates over all combinations; it won't take long, and you only have to do it once.
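
Something like the sketch below, for instance. All the ranges, the 16x UART oversampling and the 50-200 MHz PLL window are made-up placeholders rather than values from any particular chip, and extra constraints (like the errata's "one PLL must be 2x the other") just become more if-checks inside the loop:

    import itertools

    # Brute-force clock search -- a rough sketch only. The ranges, the 16x UART
    # oversampling and the 50-200 MHz PLL window below are assumptions; replace
    # them with the values from your datasheet and errata.
    XTAL_HZ = 8_000_000
    TARGET_BAUD = 115_200

    best = None
    for pll_mult, pll_div in itertools.product(range(4, 65), range(1, 9)):
        pll_hz = XTAL_HZ * pll_mult / pll_div
        if not 50e6 <= pll_hz <= 200e6:          # example PLL output constraint
            continue
        uart_div = max(1, round(pll_hz / (16 * TARGET_BAUD)))  # nearest divider
        if uart_div > 0xFFFF:                    # assume a 16-bit divider field
            continue
        baud = pll_hz / (16 * uart_div)
        err = abs(baud - TARGET_BAUD) / TARGET_BAUD
        if best is None or err < best[0]:
            best = (err, pll_mult, pll_div, uart_div, baud)

    err, m, d, u, baud = best
    print(f"mult={m} post_div={d} uart_div={u} -> {baud:.1f} baud ({err:.4%} off)")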


"When in doubt, use brute force." - Ken Thompson


Basically the underlying principle of Prolog.


Excel. Conditionally format based on constraint violations and difference from target baudrate. Brute force and scroll.


I've had good experiences with OptaPlanner:

https://www.optaplanner.org/


Try optlang. It mostly does ILP/LP though.


There's a good chance an M3 Ultra will be competitive in that area and will actually be available to people.


Oxide and Friends just did an episode talking about this: https://oxide.computer/podcasts/oxide-and-friends/1643335


I just discovered On the Metal earlier this year and finished it a week ago. I'll probably get to this episode of Oxide & Friends sometime next year.


How do they physically align all the parts? Do they have some kind of self-aligning mechanism, or is it done with external manipulation?

(Or maybe that's TSMC's secret)


Yeah, it is TSMC, see:

https://3dfabric.tsmc.com/english/dedicatedFoundry/technolog...

Older explanation, but lots of this stuff is just now shipping: https://www.anandtech.com/show/16051/3dfabric-the-home-for-t...

Not that AMD doesn't deserve any credit; they have considerable multi-chip experience under their belt and undoubtedly served as a guinea pig/pipe cleaner for TSMC's advanced packaging.


It doesn't sound very complicated compared to, say, aligning the masks.


I think the other parameters are way more difficult to get right than any optical system alignment considerations. Things like statistical process control make photolithography look less scary.

You can check if you correctly patterned the wafer almost immediately. You won't know if the layer is any good until many process steps later. Maybe not for sure until EDS. Tuning the interactions between manufacturing processes is the actual secret sauce that all the manufacturers are trying to protect. How much dose on the EUV machine depends a lot on how you intend to etch the wafer. Imagine iteration cycles measured in months for changing individual floating point variables.


The masks only need to be aligned during manufacturing.

The chiplets need to stay aligned as temperatures change. Much more difficult.


Not very difficult. Mask alignment is harder; solder physics takes over once the oven comes up to temperature and (usually) keeps things locked down.


The MI300 chiplets don't use solder! Crazy stuff. Direct copper-to-copper bonding.


Ask ASML.

The constraints all depend on the scale of the alignment required.

I wonder what wavelength range is used for their interferometers, and what kind of mechanical actuators they use (probably piezoelectric).


Ping me when the software stack for the AMD hardware is as good as CUDA.


What exactly are you missing when hipifying your CUDA codebase? For most of the software I've looked at this has been a breeze, mostly consisting of setting up toolchains.

Or do you mean the profiler tooling?

I hear everyone say that AMD doesn't have the software, but I'm a little confused --- have you tried HIP? And have you tried the automatic CUDA-to-HIP translation? What's missing?


I think they're referring to support in general. Technically HIP exists, but it's a pain to actually use: it has limited hardware support, is far less reliable at supporting older hardware, needs a new binary for every new platform, and so on.

CUDA runs on pretty much every NVIDIA GPU, this year they dropped support for GPUs released 9 years ago, and older binaries are very likely to be forward compatible.

Meanwhile my Radeon VII is already unsupported despite still being pretty capable (especially for FP64), and my 5700XT was never supported at all (I may be mixing this up with support for their math libraries), everyone was just led on with promises of upcoming support for 2 years. So "AMD has the software now" is not really convincing.


I suppose if you're talking about consumer cards, I agree, support is often missing.

But if we're talking datacenter GPUs, the software is there. Data centers are where most GPGPU computing happens, after all.

It's not ideal when it comes to hobby development, but if you're working in a professional capacity I'm assuming you're working with a modern HPC or AI cluster.


Well, to provide my own experience bringing GPU acceleration to scientific computation software, AMD got passed on because even if the software was there at the time (it wasn't), there was no way to justify spending a bunch of money buying everyone workstation AMD GPUs when we could just start by picking up basic consumer NVIDIA cards to work on for anyone who didn't happen to have one already and then worry about buying more suitable cards if needed.

Of course the end goal was to run on a large HPC cluster we had access to, but for efficient development, support on personal machines was necessary. My personal dual 3090 setup has been invaluable for getting through debugging and testing before dealing with the queueing system on the cluster (Side note: it also ended up revealing another important benefit of consumer side support for GPGPU, a 3090 was easily matching the performance of a single node of the CPU-only version of the cluster, thus massively bringing down the cost of entry to an otherwise computationally restrictive topic).


This is a very valid point. 100%, AMD needs better support for consumer cards to get research, university and everyday folks using their SW. The high-end consumer 7900 XTX is supported now, and a lot of other cards not officially supported actually work (I've a colleague who tried it on his home computer and got it working, despite that card not being on the official list). Still, AMD needs to be getting more cards working ASAP.


There is a truly gigantic demand for this - I expect you won't be waiting too long.


Related (published yesterday): Intel CEO attacks Nvidia on AI: 'The entire industry is motivated to eliminate the CUDA market' https://www.tomshardware.com/tech-industry/artificial-intell...


The chip industry is, sure. But are the customers? The customers who cared are jaded by nearly 15 years of Intel and AMD utterly failing to make a compelling alternative, and likely have a large existing investment in CUDA-based GPUs.


Yes, but no customer wants to give Nvidia monopoly money forever either. So like it or not they need alternatives.


> but no customer wants to give Nvidia monopoly money forever either.

From a consumer perspective, I agree. From a datacenter, edge and industrial application perspective though, I think those crowds are content funding an effective monopoly. Hell, even after CUDA gets dethroned for AI, it wouldn't surprise me if the demand continued for supporting older CUDA codebases. AI is just one facet of HPC application.

We'll see where things go in the long-run, but unless someone resurrects OpenCL it feels unlikely that we'll be digging CUDA's grave anytime soon. In the world where GPGPU libraries are splintered and proprietary, the largest stack is king.


I wish I could buy ML cards with Monopoly money.


I support this product that uses off the shelf nvidia GPUs to do <thing> on the computer they use for <big machine>. I see a lot of IT deps and MSPs asking if they can use AMD gpus because apparently a lot of dells come with em or something. I always have to tell them no and to just buy a P1000 or one of the other cheapo quadros instead and I hate it.


Whether it's Microsoft with Win32, AMD with x86-64, or Nvidia with CUDA, the winner is always the guy who enables people to do things with their computers.

Meanwhile, priests like Intel with Itanium, Microsoft with WinRT, FOSS nerds, or AMD with GPUs will continue failing because most people ain't got no time to be preached to about how they achieve something.



It feels like playing software catchup in this fast moving sector is very challenging. NVIDIA+CUDA is kind of a standard at this point. AMD's CPUs still ran Windows. AMD's GPUs still ran DirectX and OpenGL. This feels different.


I’ve been waiting sixteen years.


There has never been more money riding on eliminating the CUDA monopoly than now.


That’s been true for 16 years.


It's now a trillion dollar market (as measured by market cap). This has only been true for a few months.


Well, that's with the 1000% profit from Nvidia's monopoly.


Eh, 16 years ago CUDA was the cheap option, compared to other HPC offerings.

And there wasn't a parts shortage (modulo some cryptocurrency mining, but that impacted both GPU vendors)

And ML models weren't so large as to make 8GB of vram sound meagre.

And there weren't a bunch of venture capitalists throwing money at the work, because the state of the art models were doing uninspiring things. Like trying to tag your holiday photos, but doing it wrong because they couldn't tell a bicycle helmet and a bicycle apart.


HIP is a direct drop-in for CUDA. It's ready, and many folks using it ported CUDA code with little to no effort.

The SW story has been bad for a long time, but right now it is perhaps better than you think.


That's a lot of layers with different thermal expansion coefficients and conductivity. How do they cool all this?


The AMD announcement spent a surprising amount of time on this issue and the ways they're dealing with it.


Can't wait for n-layered chips.


Anyone have insight on:

Why don’t they position the SOC to have the HBM in the center with the CCDs and XCDs on the perimeter?

Seems to me that would yield lower “wire length” through the interconnect for each CCD/XCD to all of the memory.


The XCDs need to talk to each other very quickly, way faster than the HBM, to act like a single chip.

Also, the physical interconnect between the XCDs is different than the HBM.


The memory isn't designed to connect to a lot of different chips. So there's no reason to put it in the center.

The core interconnect systems are at the base, center-most (well, there's a passive interposer too, but it's just wires): the IODs. These mediate connections between the chips on top of them, the IODs next to them, and the memory.


HBM is designed for the wire lengths that you're talking about. The core protocol isn't so terribly different from good old DDR4.

The die-to-die links, on the other hand, have extremely short wire length limits (less than a centimeter iirc for UCIe). At the physical layer, you just waste a bunch of power and area driving a high voltage to make the signal go further. Do that enough and you basically wind up with a PCIe PHY.


Each memory chip is connected to a single DRAM controller. Those DRAM controllers are what you'd want to be close to each other to minimize NUMA effects.


complete and total unfounded guess... thermal expansion?


Maybe thermals? Idk


Priced into the stock yet? Looks like team red is starting to nip at Nvidia's heels again.


It's up around 12% since their event, but some of that is beta; everything is pumping.


Is this actually a GPU? I.e., can it even render graphics and scan out to a monitor?


AMD's compute oriented cards used to come with displayport output, but I haven't seen one of those in a long time. These cards are definitely GPUs in that they can handle graphical workloads, but I don't think anyone is trying to make them work for video games or the like.


I heard a rumor somewhere that Stadia ran on MI25s - not sure if that's true but certainly there have been a lot of them floating around on eBay in the past year.


IIRC the graphics hardware was dropped in CDNA architectures (MI100, MI200, and MI300). For example, I don't think there are texture units.


AMD still sells FirePros (e.g. the W7900) like that.


That's just a variant of the RX 7900; it's an entirely different architecture from the MI300's.


Does this mean some of the AMD-Xilinx FPGAs will start stacking DRAM on-chip?


They've had FPGAs with HBM DRAM for several years.


I do not have much of a clue about chips, but when AMD bought ATI I thought they would come out with some superchip that combines CPU and GPU into one.

This is just for AI? So not for gaming PCs?


> when AMD bought ATI I thought they will come out with some superchip that combines CPU and GPU into one.

That is exactly what did happen. The year was 2006 and it was called the AMD Fusion project. AMD launched the chips, which they call "APUs", in 2011.

Nowadays, this configuration is common in both AMD and Intel "CPUs".


I imagine they'll go for the big money stuff first and it will trickle back to gamers.


That is literally what drives the PS5 and XSX.


"MI300 [is] three slices of silicon high and can sling as much as 17 terabytes of data vertically between those slices"

After you transfer 17 terabytes of data, it is worn out and you can't use it any more


Once the ones wear down below 0.5 they start to look like zeros.


Exactly. Moving vertically against gravity gradually squashes the bits down.


Per second


Probably :-)

The literal interpretation of the article is funnier though.


If not CUDA, then what can you use with those?


HPC admin here.

ROCm, which is AMD's equivalent of CUDA. The thing is you don't have to directly interface with CUDA or ROCm. Once the framework you want to use supports these, you're done.

AMD is consistently getting used on TOP500 machines, and this gives them insane amounts of money to improve ROCm. CUDA has an ecosystem and hardware moat, not because it's vastly superior, but because AMD prioritized processors first and NVIDIA played dirty (OpenCL support and performance shenanigans, anyone?).

This moat is bound to be eaten away by both Intel and AMD, and compute will be commoditized. NVIDIA foresaw this and bought Mellanox and wanted ARM to be a complete black box, but it didn't work out.

The Ethernet consortium woke up, agitated by the fact that the only ultra-low-latency fabric provider is no longer independent, so they've started to build alternatives to the InfiniBand ecosystem.

Interesting times are ahead.

Also there's OneAPI, but I'm not very knowledgeable about it. It's a superAPI which can target many compute platforms, like OpenCL, but takes a different approach. It compiles to platform native artifacts (CUDA/ROCm/Intel/x86/custom, etc.) IIRC.


TOP500 has had all sorts of "odd" systems near the top. Unusual or custom architectures and fabrics are far more viable for "this is our nation's nukeputer" or "this is going to run the nation's weather forecasting model" type of systems compared to e.g. a commercial HPC system where you wanna run all sorts of commercial/proprietary software. AMD being successful in the former niche doesn't mean their software stack is viable for the latter niche.


The systems running AMD cards are not "Nukeputers", and not all "Nukeputers" and "Fusionputers" are running on custom silicon.

When you look at the latest list [0], the #1 system, Frontier, is a Nukeputer, but it's not only a Nukeputer. #5, LUMI, is definitely not a Nukeputer. They are very close to us; we work together with them under a project (we're equals; they operate under a different consortium, and we have stakes in another computer which is also very famous, also in the top 10 of TOP500, and not a Nukeputer). We also have our smaller systems in our own datacenter.

We operate mostly in the long tail of science, and this means we see heavy use of popular software packages, and many special software packages optimized for these long tail problems. ROCm was invisible in this niche before, but with Frontier and LUMI, and with AMD's announcements, this started to change very quickly.

ROCm libraries are more open than their CUDA counterparts, and they have started to land in mainstream distribution repositories directly. This is important for our niche. AMD is sponsoring integration of their libraries into popular packages, and it has already started to pay off.

Also, small independent programmers have started to optimize LLM training routines for AMD cards, getting 99% of the performance of NVIDIA counterparts with way less power consumption.

As a result, AMD is already in a much more visible and capable position compared to last year.

[0]: https://top500.org/lists/top500/list/2023/11/


> they've started to build alternatives to the InfiniBand ecosystem.

Cool, that sounds interesting. Anything you can point at? :)


Discussion on ultrafast ethernet and partners at AMD's December presentation. https://youtu.be/tfSZqjxsr0M?t=5198


UltraEthernet


Thanks. :)


Everything is possible, especially what they have in mind, where they will do a custom implementation. CUDA is important for the existing ecosystem, but that doesn’t make it the only show in town.


Simulating nuclear explosions, effects of decay on nuclear weapon stockpiles, etc

(this is literally what the El Capitan supercomputer mentioned in the article is for)


Wrong reply, please ignore.


You can also delete your wrong comments (within 2 h, I think).


I couldn’t see anything about deleting, only edit. Will check out the web version on a desktop next time, maybe it’s available there.


Not sure if you have to have more karma for it (that would not make sense to me), but for me the delete button is right next to the edit button.


Until you commented...


Within 2h, and if no one has replied.


CUDA is at the API level (ish, I know). There's plenty of room for new silicon that can expose different APIs.

Compare to switching from x86 to ARM.


why no CUDA rosetta?


AMD has released HIP and a tool called HIPIFY which kind of behaves like this but at the source level¹. Rather than try and just translate CUDA to work on AMD compute they are more focused on higher level tooling.

Currently they seem to have a particular focus on AI frameworks and tools like PyTorch/Tensorflow/ONNX. They have sponsored and helped with a lot of PyTorch development for example, so PyTorch support for AMD is much better than it was this time last year².

¹(https://github.com/ROCm/HIP)

²(https://pytorch.org/blog/experience-power-pytorch-2.0/)


Nvidia could build that but doesn't want it to exist. Outside of Nvidia you'd have to reverse engineer their machine code, which would be a massive undertaking. The ISA is published by AMD and Intel; you could build tooling from the docs alone if you wish.


Games? Rendering? Transcoding video?


Video is getting a lot of direct ASIC blocks (look at the latest VPE block in AMD GPUs).

I guess those chips are for the movie/video industry, online or not. For the "consumer", the CPU is already very efficient, and I don't think we would save an interesting amount of battery from a "real usage" perspective. I may be very wrong, but I don't watch hours and hours of ultra-high-quality video in a row on a small screen while off the AC plug, and the battery is unusable within a matter of a few years anyway, if not less.


Encoding video in real time is expensive. You can run your blog on a 5€/month machine. You'll most likely need ten times that money for a single real time stream using nothing but software decoding, rendering and encoding. If you think, "hey I'm going to use a GPU for this" then congratulations, you've increased your cloud bill by a factor of 20.


The latest AMD GPUs do realtime encoding with AV1; there is a dedicated ASIC block for that. Just mmap the video engine command circular buffer, get some dma buffers, get an interrupt circular buffer, and you are good to go (I think the dma-buffer sync framework between drivers and user space is still a WIP, and I dunno how the interrupt circular buffer is handled).


But Apple's M1 chip has 1/6 the power budget of a 4090, and benchmarks show that they are similar. Will Apple eat them?

https://owehrens.com/whisper-nvidia-rtx-4090-vs-m1pro-with-m...


I think AMD's shady marketing, where they claimed 1.4x over the H100, is reason enough to just steer clear of the hype and wait for results.

Summary: they cherry-picked legacy Nvidia SDKs and used Llama batch sizes that are not often used in production...

https://twitter.com/karlfreund/status/1735078641631998271

https://developer.nvidia.com/blog/achieving-top-inference-pe...


If you're looking for fair comparisons, don't ask Nvidia's marketing department; those guys are worse than Intel.

What AMD did was a true comparison, while Nvidia is applying their Transformer Engine, which modifies & optimizes some of the computation to FP8, and they claim no measurable change in output. So yes, Nvidia has some software tricks left up their sleeve and that makes comparisons hard, but the fact remains that their best hardware can't match the MI300X in raw power. Given some time, AMD can apply the same software optimizations, or one of their partners will.

I think AMD will likely hold the hardware advantage for a while; Nvidia doesn't have any product that uses chiplets, while AMD has been developing this technology for years. If the trend continues toward these huge AI chips, AMD has a better hand to economically scale their AI chips.


Not my area, but isn't a lot of NVIDIA's edge over AMD precisely software? NVIDIA seem to employ a lot of software dev (for a hardware company) & made CUDA into the de facto standard for much ML work. Do you know if AMD are closing that gap?


They have improved their software significantly in the last year, but there is a movement that's broader than AMD that wants to get rid of CUDA.

The entire industry is motivated to break the Nvidia monopoly. The cloud providers, various startups & established players like Intel are building their own AI solutions. Simultaneously, CUDA is rarely used directly; typically a higher-level (Python) API is used that can target any low-level API like CUDA, PTX or ROCm.

What AMD is lacking right now is decent support for ROCm on their consumer cards on all platforms. Right now if you don't have one of these MI cards or a rx7900 & you're not running linux you're not going to have a nice time. I believe the reason for this is that they have 2 different architectures, CDNA (the MI cards) and RDNA (the consumer hardware).


> Right now if you don't have one of these MI cards or a rx7900 & you're not running linux you're not going to have a nice time.

Are you saying that having rx7900 + linux = happy path for ML? This is news to me, can you tell more?

I would love to escape cuda & high prices for nvidia gpus.


That's what I have (RX 7900XT on Arch), and ROCm with pytorch has been reasonably stable so far. Certainly more than good enough for my experimentation. Pytorch itself has official support and things are pretty much plug & play.
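
For anyone wondering what "plug & play" looks like in practice: ROCm builds of PyTorch expose the GPU through the usual torch.cuda namespace, so a quick sanity check is the same as on an NVIDIA card. A rough sketch (device names and version strings will obviously vary):

    import torch

    # On a ROCm build of PyTorch the HIP backend is surfaced via torch.cuda,
    # so stock "CUDA" code runs unchanged on a supported Radeon card.
    print(torch.version.hip)              # a version string on ROCm builds, None on CUDA builds
    print(torch.cuda.is_available())      # True if the GPU is detected
    print(torch.cuda.get_device_name(0))  # e.g. "AMD Radeon RX 7900 XT" (varies)

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                             # matmul dispatched to the GPU via ROCm/HIP
    print(y.norm().item())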


> Given some time, AMD can apply the same software optimizations, or one of their partners will.

Except they have been given time, lots of it, and yet AMD is not anywhere close to parity with CUDA. It's almost like you can't just snap your fingers and willy-nilly replicate the billions of dollars and decades of investment that went into CUDA.


That was a year ago. AMD is changing their software ecosystem at a rapid pace with AI software as a #1 priority. Experienced engineers have been reassigned from legacy projects to focus on AI software. They've bought a number of software startups that were already developing software in this space. It also looks like they replaced the previous AMD top level management with directors from Xilinx to reenergize the team.

To get a picture of the current state, which has changed a lot, this MS Ignite presentation from three weeks ago may be of interest. The slides show the drop-in compatibility they have for higher levels of the stack and the tools for translation at the lower levels. Finally, there's a live demo at the end.

https://youtu.be/7jqZBTduhAQ?t=61


The Transformer Engine is a fairly recent development (April this year, I think), so I don't think they're very far behind.


The audacity to claim AMD's marketing is "shady" and then show a plot that compares queries/sec for AMD at batch size 1 against Nvidia at batch size 14.

Rename it to batch_size/sec if you don't see the issue.


Nvidia's response is the shady one; in their response they're using a different batch size.



