The graphs ranking GPUs may not be accurate, as they don't represent real-world results and have factual inaccuracies. For example:
> Shown is raw relative performance of GPUs. For example, an RTX 4090 has about 0.33x performance of a H100 SMX for 8-bit inference. In other words, a H100 SMX is three times faster for 8-bit inference compared to a RTX 4090.
RTX 4090 GPUs are not able to use 8-bit inference (fp8 cores) because NVIDIA has not (yet) made the capability available via CUDA.
> 8-bit Inference and training are much more effective on Ada/Hopper GPUs because of Tensor Memory Accelerator (TMA) which saves a lot of registers
Ada does not have TMA, only Hopper does.
If people are interested I can run my own benchmarks on the latest 4090 and compare it to previous generations.
Well, ever since I read those papers on doing FFTs faster than cuFFT using tensor cores (although FFTs aren't supposed to benefit from more FLOPS, only from better memory bandwidth) and on getting fp32-level accuracy on convolutions with 3xTF32 tensor core sweeps (available in CUTLASS), I'm quite ready to believe some hype about TMA.
Anything improving memory bandwidth for any compute part of the GPU is welcome.
Also I'd like for someone to crack open RT cores and get the ray-triangle intersection acceleration out of OptiX. Have you seen the FLOPS on these things?
I'm not on my desktop, but I used nanoGPT extensively on my RTX 4090. I trained a GPT2-small but with a small context window (125 > 90M params) at batch sizes of 1536 using gradient accumulation (3*512). This runs at just a smidge over 1 it/s. Some notes:
- Gradient checkpointing and 2048 batch size in a single go allows ~10% performance improvement on a per sample basis.
- torch.compile doesn't work for me yet (lowest cuda version I got my 4090 to run on was 11.8 but highest cuda version on which I got the model to compile is 11.7).
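For anyone unfamiliar with the 3*512 trick mentioned above, here is a minimal pure-Python sketch of gradient accumulation. It is not nanoGPT's code: the one-parameter model `y = w * x`, the helper names `grad_mse` and `accumulated_grad`, and the synthetic data are all assumptions made for illustration. It only shows that averaging the gradients of three micro-batches of 512 gives the same gradient as one batch of 1536 when the loss is a mean over samples.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate gradients over equally sized micro-batches (hypothetical helper)."""
    assert len(xs) % micro_batch == 0
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Dividing by `steps` keeps the result a mean over the full batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1536)]   # synthetic inputs
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]     # synthetic targets

full = grad_mse(0.5, xs, ys)                              # one batch of 1536
accum = accumulated_grad(0.5, xs, ys, micro_batch=512)    # 3 * 512
print("difference:", abs(full - accum))
```

The two results agree up to floating-point rounding. The memory saving comes from only holding activations for 512 samples at a time, at the cost of three forward/backward passes per optimizer step.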
It's pretty clear from the article that 8-bit inference is referring to FP8, as the author claims int8 is similar to fp16 inference performance.
> Ada/Hopper also have FP8 support, which makes in particular 8-bit training much more effective. I did not model numbers for 8-bit training because to model that I need to know the latency of L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance can well be 3-4x of 16-bit training performance if the caches are as fast as rumored. For old GPUs, Int8 inference performance for old GPUs is close to the 16-bit inference performance.
I find this article odd with its fixation on computing speed and 8bit.
For most current models, you need 40+ GB of RAM to train them. Gradient accumulation doesn't work with batch norms so you really need that memory.
That means either dual 3090/4090 or one of the extra expensive A100/H100 options. Their table suggests the 3080 would be a good deal, but it's not. It doesn't have enough RAM for most problems.
If you can do 8bit inference, don't use a GPU. CPU will be much cheaper and potentially also lower latency.
Also: Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?
> Gradient accumulation doesn't work with batch norms so you really need that memory.
Last I looked, very few SOTA models are trained with batch normalization. Most of the LLMs use layer norms which can be accumulated?
(precisely because of the need to avoid the memory blowup).
Note also that batch normalization can be done in a memory efficient way: It just requires aggregating the batch statistics outside the gradient aggregation.
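A rough sketch of what aggregating the batch statistics separately could look like, in plain Python. This is my reading of the idea, not code from any framework: `batch_stats` and `aggregated_stats` are hypothetical helpers, and the sketch only covers the statistics, not the second pass a real implementation would need to normalize and backpropagate with them.

```python
def batch_stats(values):
    """Mean and (biased) variance over the whole batch at once."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def aggregated_stats(micro_batches):
    """Same statistics, built from running sums over micro-batches."""
    count, total, total_sq = 0, 0.0, 0.0
    for mb in micro_batches:
        count += len(mb)
        total += sum(mb)
        total_sq += sum(v * v for v in mb)
    mean = total / count
    # E[x^2] - E[x]^2; fine for a sketch, but prone to cancellation
    # when the mean is large relative to the spread.
    var = total_sq / count - mean * mean
    return mean, var

activations = [0.5, -1.0, 2.0, 1.5, 0.0, -0.5, 3.0, 1.0]   # made-up values
micro_batches = [activations[:4], activations[4:]]

print(batch_stats(activations))
print(aggregated_stats(micro_batches))
```

Only three scalars per feature have to survive between micro-batches, which is why the memory cost stays small.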
I) My research involves biological data (protein-protein interactions) and my datasets are tiny (about 30K high-confidence samples). We have to regularize aggressively and use a pretty tiny network to get something that doesn't over-fit horrendously.
II) We want to accommodate many inferences (10^3 to 10^12) on a personal desktop or cheap OVH server in little time so we can serve the model on an online portal.
I'm not sure any of this is accurate. 8 bit inference on a 4090 can do 660 Tflops and on an H100 can do 2 Pflops. Not to mention, there is no native support for FP8 (which is significantly better for deep learning) on existing CPUs.
The memory on a 4090 can serve extremely large models. Currently, int4 is starting to be proven out. With 24GB of memory, you can serve 40 billion parameter models. That, coupled with the fact that GPU memory bandwidth is significantly higher than CPU memory bandwidth, means that CPUs should rarely ever be cheaper / lower latency than GPUs.
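The 40-billion-parameter figure follows from simple arithmetic, sketched below. The function name `weight_memory_gb` is made up for illustration, and the estimate counts weights only: activations, KV cache and quantization metadata (scales, zero points) are ignored, so the real footprint is somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 40e9     # 40 billion parameters, from the comment
VRAM_GB = 24.0      # RTX 4090 memory, from the comment

for name, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb <= VRAM_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {VRAM_GB:.0f} GB")
```

At int4 the weights take 20 GB, which leaves roughly 4 GB of headroom on a 24 GB card for everything else.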
Depends on who you know, but I've seen as low as €799 per new 3090 TI. But you need to waive the right to resell them and there are quotas, for obvious reasons.
Consumer parts are dirt cheap compared to enterprise ones. Most companies are not able to use them at scale due to CUDA license terms. I don't think there is much of a need for rebates here. For hobbyists, it is somewhat of a steep price for the latest cards, but it's already way down from the height of ETH mining a year back.
Yeah I hear it’s common practice now to avoid synchronizing GPU training kernels in order to speed things up, and it has positive regularization benefits and little downside.
They nerfed the heck out of board memory in the 3000 series (3080 20GB was even made in limited quantity... going to miners in China :( ) so color me a bit skeptical.
3060 12GB are the best things you can buy right now. They are cheap, have a ton of memory--which seems to be the issue w/ image generation--and you can fit four of them into the cheapest motherboards.
3060ti 8GB, 3090 24GB, and 4000 series all have performance benefits, but for now this one is off the charts.
Another one to consider is the A4000 16GB. I recently bought an ex-miner card for ~$500 USD. They are around a 3070 in performance, with a decent amount of memory for training scenarios, and are single slot cards. I believe there are a lot of these workstation ex-miner cards which are pretty heavily discounted.
Combine this with a second hand X99 / 2011-v3 platform like the Dell Precision T7910 dual socket Xeons and you can have a pretty decent homelab for ML workloads. The Dell can come with a 1300 watt PSU and can fit 4 of those cards comfortably (5 with reduced PCIe lanes on one) since they are 150W each.
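A quick power-budget check for that build, as a sketch. The PSU and per-card wattages come from the comment above; `REST_WATTS` (dual Xeons, RAM, drives, fans) is purely an assumed placeholder, so plug in your own number.

```python
def psu_headroom_watts(psu_watts, n_gpus, gpu_watts, rest_of_system_watts):
    """Watts left over if the GPUs and everything else draw full power at once."""
    return psu_watts - n_gpus * gpu_watts - rest_of_system_watts

PSU_WATTS = 1300    # from the comment
GPU_WATTS = 150     # per card, from the comment
REST_WATTS = 450    # ASSUMPTION: CPUs, RAM, drives, fans; not a measured value

for n in (4, 5):
    headroom = psu_headroom_watts(PSU_WATTS, n, GPU_WATTS, REST_WATTS)
    print(f"{n} GPUs: {headroom} W headroom")
```

With that assumed system draw, four cards leave 250 W of margin and five leave 100 W, so the fifth card is tighter on power as well as on PCIe lanes.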
For image inference 12GB is ok for now (but you may not be able to use all future models given that T5 language model is becoming popular), for training I'd consider 24GB the bare minimum.
Anyone know if high-RAM Apple silicon such as the 128 GB M1 Ultra is useful for training large models? RAM seems like the limiting factor in DL, so I'm hoping apple can put some pressure on nvidia to offer consumer GPUs with more than 24GB RAM.
Having large amounts of unified memory is useful for training large models, but there are two problems with using Apple Silicon for model training:
1. Apple doesn't provide public interfaces to the ANE or the AMX modules, making it difficult to fully leverage the hardware, even from PyTorch's MPS runtime or Apple's "Tensorflow for macOS" package
2. Even if we could fully leverage the onboard hardware, even the M1 Ultra isn't going to outpace top-of-the-line consumer-grade Nvidia GPUs (3090Ti/4090) because it doesn't have the compute.
The upcoming Mac Pro is rumored to be preserving the configurability of a workstation [1]. If that's the case, then there's a (slim) chance we might see Apple Silicon GPUs in the future, which would then potentially make it possible to build an Apple Silicon Machine that could compete with an Nvidia workstation.
At the end of the day, Nvidia is winning the software war with CUDA. There are far, far more resources which enable software developers to write compute intensive code which runs on Nvidia hardware than any other GPU compute ecosystem out there.
Apple's Metal API, Intel's SYCL, and AMD's ROCm/HIP platform are closing the gap each year, but their success in the ML space is dependent upon how many people they can peel away from the Nvidia Hegemony.
This basically seems to line up with what another tech community (outside of AI / ML) seems to agree on: that Apple hardware is not that great for 3D modeling or animation work. Their M1 and M2 chips rank terribly on the open-source Blender benchmark: https://opendata.blender.org/
This isn't really an Apple specific problem - GPUs with low power budgets are never going to beat GPUs intended to permanently run off of wall power.
(I know there's stuff like the Mac Studio and iMac, but those are effectively still just laptop guts in a (compact) desktop form factor rather than a system designed from the ground up for high performance workstation type use)
I'd love to see a dedicated PCIe GPU developed by Apple with their knowledge from designing the m1/m2, but it's not really their style.
The 4090 is able to draw 450 watts. Current apple silicon are all used on laptops, which cannot possibly sustain or cool that much power. It might change when they make a cooling-optimized desktop chip though.
Not exactly an answer to your question, but from a pytorch standpoint, there are still many operations that are not supported on MPS[1]. I can't recall a circumstance where an architecture I wanted to train was fully supported on MPS, so some of the training ends up happening in CPU at that point.
I have a Mac Studio Ultra with 64 GPU cores and 128 GB of RAM which I use to play around with neural networks (among other things). I can create models larger than what I would be able to on consumer GPU cards but it isn't very fast. On average, I get about 20% of the speed of an RTX 3090.
It is definitely useful but does not beat the top Nvidia cards. [1]
Would be very interesting to retest with the M2 and also in the months/years to come as the software reaches the level of optimisations we see on the PC side.
It's an interesting option. My gut instinct is that if you need 128GB of memory for a giant model, but you don't need much compute - like fine tuning a very large model maybe - you might as well just use a consumer high core CPU and wait 10x as long.
All the frameworks work on CPU. At the time I tried it, the 5950X was about 10x slower than my GPU, which was a 1080Ti or 2080Ti. GAN not a transformer though.
I think they are saying train (or at least fine-tune) on a CPU.
This can work in some cases (years ago I certainly did it for CNNs - was slow but you are fine tuning so anything is an improvement) but I don't know how viable it would be on a transformer.
If you are training a model that requires this much memory, it will also require a lot of compute, so it would be too slow and not cost-effective. It may be useful for inference.
I've recently tried to play with Stable Diffusion. I have an RTX 3060, a base M1 MBP and a laptop with an AMD RX6800m.
- The RTX 3060 just works. Setup was straightforward, performance is good.
- The M1 has those neural engine thingies but it's not compatible with SD. It can run SD on CPU but if you want to make use of the NE you need coreml specific programs and models. Issue here is that it's just not the same stuff. Prompt structure is also different. It doesn't recognize weights and puts a lot more emphasis on prompt order. Most of the time it seems to ignore most of what you wrote. On the positive side running stuff on the NE is very fast and doesn't seem to be taxing to the system.
- Finally the 6800M. It should be the powerhouse of the bunch considering its gaming performance ahead of the 3060. Problem is AMD toolkit kinda sucks. They have this obscure ROCm HIP stuff that acts as some kind of translation layer to the CUDA API. It's complicated, it simply doesn't work without a bunch of obscure environment variables and only works in fp32 mode, which means it uses twice the amount of video RAM for the same thing. Support is iffy as they seem to only support workstation cards. Using it often throws lots of obscure compilation errors. Bugs celebrating anniversary.
To sum things up, Nvidia is far ahead of the curve in both usability and performance. Apple is trying to do its thing but it's too early yet while AMD is in a messy situation. Hope it helps.
AMD's driver is much better on Linux than Nvidia's. And ROCm HIP stuff isn't obscure. It's what's needed to bridge Nvidia's proprietary vendor lock ins.
> AMD's driver is much better on Linux than Nvidia's.
1. Agree AMDGPU is awesome.
2. All tests were done in Linux with the exception of coreml on MacOS.
> And ROCm HIP stuff isn't obscure.
It is from my novice perspective. While Nvidia stuff works out of the box (even on Linux) I had to install ROCm and a special version of PyTorch for ROCm. Then it only kinda worked after setting an assortment of environment variables for the GFX version. And even then it uses twice as much video memory as what torch uses on Nvidia, limiting what I can do.
When you say "out of the box", then it's distros like pop!_os? Or has it become more common to automatically install proprietary drivers? Asking out of curiosity.
> What is the carbon footprint of GPUs? How can I use GPUs without polluting the environment?
> I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
Crusoe Cloud (https://crusoecloud.com) does the same thing; powering GPUs off otherwise flared methane (and behind the meter renewables), to get carbon-reducing GPUs. A year of A100 usage offsets the equivalent emissions of taking a car off the road.
Methane that has been burned to convert it to CO2 is much less bad of a greenhouse gas than methane that has been allowed to just float off into the atmosphere.
Perhaps counterintuitive given how much we usually go around saying burning fossil fuels is bad for the environment, but the science is sound.
Methane has an 80x higher GWP than CO2 over the first 20 years, so ending routine flaring (even if the answer is "complete combustion") is an immensely important problem to help address the effects of climate change. Another source: https://www.edf.org/climate/methane-crucial-opportunity-clim...
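To put rough numbers on why flaring beats venting, here is a small sketch. The 80x figure is the 20-year GWP quoted above; the stoichiometry is CH4 + 2 O2 -> CO2 + 2 H2O with standard molar masses. It assumes complete combustion and ignores any methane that slips through the flare unburned, so treat the ratio as an upper bound.

```python
M_CH4 = 16.04      # g/mol, molar mass of methane
M_CO2 = 44.01      # g/mol, molar mass of carbon dioxide
GWP20_CH4 = 80     # 20-year GWP of methane relative to CO2, as quoted above

def co2e_vented(kg_ch4):
    """kg CO2-equivalent (20-year horizon) if the methane is released as-is."""
    return kg_ch4 * GWP20_CH4

def co2e_flared(kg_ch4):
    """kg CO2 produced if the methane is burned completely."""
    return kg_ch4 * M_CO2 / M_CH4

vented = co2e_vented(1.0)
flared = co2e_flared(1.0)
print(f"vented: {vented:.1f} kg CO2e, flared: {flared:.2f} kg CO2")
print(f"reduction factor: {vented / flared:.0f}x")
```

Each kilogram of methane burned becomes about 2.7 kg of CO2, versus 80 kg CO2e if vented, which is roughly a 29x reduction over a 20-year horizon under these assumptions.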
I really hope AMD cleans up and invests some money into their software stack. Granted it's really hard to catch up to Nvidia, but I think it's doable in ~5 years. The barrier to entry into ROCm compared to CUDA is pretty high, even accounting for the fact that things got better in the last years. AMD has potent hardware for AI/ML (see Instinct), they just don't have it in the consumer space. However one of the key factors of getting adoption in the consumer space is the aforementioned software stack, which I recall was a pain to set up. The fact that they're going FOSS for ROCm shows promise in this regard.
In current state Nvidia has no competition, and you buy AMD only because you hate Nvidia, not because it's better. I had tons of AMD cards my first being ATI 9000 but if you want to do more than game the hassle is not worth it. My last AMD was Radeon VII, once I got a bit serious into Blender there's just no comparison, even with enterprise drivers the crashes, random issues and slowness comparatively is just not worth it. What took me 3 minutes to render on Radeon VII takes 10s on 3090 Ti, StableDiffusion renders take 5-6s without any hassle playing with rocm, gaming is also no comparison with RTX (I don't even use DLSS). Fun fact I sold my RVII after 3 years and added $500 for a new 3090 Ti. Nvidia sux for their business practices but technically they dominate because they invested in software really early on and established themselves with Cuda, OptiX, RTX, DLSS. Older AMD cards are nice for hackintosh if you're into that though, Apple dropped Nvidia hard. Also the Linux driver blob thing if you want to be a purist, but IIRC it's supposedly changing (sorry don't have a source right now).
Is a 4090 practically better than a 3090? I just built a new home DL PC with two 3090s because I knew I could fit them both in the case, whereas with the 4090 it seems more than one could be difficult. Also wondering if I can pool the RAM somehow, nvlink won't work because the 3090s are different sizes, and apparently nvlink doesn't do much more than pcie anyway.
Basically 'it depends' is the answer to all your questions, but dual 3090s is a perfectly fine choice. Though ideally you would have NVLINK since it is an advantage over the 4090. In some specialized situations it is possible to have NVLINK act a lot like 48GB of memory, but if you don't already know if you can leverage NVLINK, you very likely aren't in that situation.
tangential but the size of 4090 seems to be a mistake and hindering usecases like this. I believe NVIDIA changed to the samsung process a bit late and it produces less heat, but they have communicated to OEMs about the cooling requirements so nobody wants to redesign their card. I expect some aftermarket brands to create "slimmer" 4090 coolers to enable aircooled dense 4090 workstations.
Gigabyte produced a dual slot 3090 for a little while before Nvidia pressured them to discontinue production[0].
I suppose we might see "slimmer" 4090s at some point but even if the design is (somehow) possible it's clear Nvidia won't allow their partners to manufacture dual slot versions of RTX cards that could possibly compete with higher end cards.
Flow chart was nice but I am not an organization and I like training multi billion param models. My next two cards were going to be the RTX 6000 Ada. The memory capacity alone almost makes it necessary.
Interesting that there is no mention or discussion of FPGAs for DL neural networks.
"Our enhanced NPU on Stratix 10 NX delivers 24× and 12× higher core compute performance on average compared to the T4 and V100 GPUs at batch-6, despite the smaller NX die size."
"Results show that the Stratix 10 NX NPU running batch 6 inference achieves 12-16× and 8-12× higher average energy efficiency (i.e. TOPS/Watt) on the studied workloads compared to the T4 and V100 GPUs, respectively."
FPGAs are awesome, but are even less usable than AMD GPUs for ML by comparison - you may have to write a kernel to get a new net to work, and that really limits adoption. Software is the #1 thing that will enable you to get research done.
Disclaimer - I work on the team that was originally behind Brainwave at Microsoft.
I'll take a retired crypto mining card over a random desktop/gamer card any day.
Crypto cards are almost always run with lower power limits, managed operating temperature ranges, fixed fan speeds, cleaner environments, etc. They're also typically installed, configured, and managed by "professionals". Eventual resell of cards is part of the profit/business model of miners so they're generally much better about care and feeding of the cards, storing boxes/accessories/packing supplies, etc.
Compared to a desktop/gamer card with sporadic usage (more hot/cold cycles), power maxed out for a couple more FPS, running in unknown environmental conditions (temperature control for humans, vape/smoke residue, cat hair, who knows) and installed by anyone.
Hard disagree, honestly. A mining card might be undervolted, but it will always have lived under sustained VRAM temps of 80c+. That's awful for the lifespan of the GPU (even relative to bursty gaming workloads) and once the memory dies, it's game over for the card. Used GPUs are always a gamble, but mostly because it depends on how used they are. No matter how you slice it, a mining card is more likely to hit the bathtub curve than a gaming one.
The common lore after the last mining crash (~2018) was to avoid mining cards at all costs. We're far enough along at this point where even most of the people at /r/pcmasterrace no longer subscribe to the "mining card bad" viewpoint - presumably because plenty of people chime in with "bought an ex-mining card in 2018, was skeptical, still running strong".
Time will tell again in this latest post-mining era.
While you're not wrong, PCMR is basically the single largest central of survivorship bias on the internet.
There probably are people with functional mining cards, but like I said in my original comment, hardware failure runs along a bathtub curve[0]. The chance of mechanical failure increases proportionally with use; leaving mining GPUs particularly affected. How strong that influence is can be debated, but people are right to highlight an increased failure rate in mining cards.
Do you know what also contributes to hardware failure? Thermal cycles, because over time they stress the solder joints.
If a component has been powered on and running continuously under load within its rated temperature band, it'll have fewer heat cycles. This seems preferable to a second-hand card from a gamer which gets run at random loads and powered down a lot.
I'd think the most common failure point on a GFX card is the cooling fan, so if they have been run continuously for years the bearings on the fan are probably getting pretty worn out.
Hopefully manufacturers aren't still gluing the fans on the boards. I lost a couple of GPUs thanks to that, turning what should be an easy fix into a replace the whole damn card situation.
Not sure I'd go as far as "single largest central of survivorship bias on the internet" but I understand the sentiment.
As another commenter noted used RTX 3xxx series cards are at most two years old. Given the anticipated lifetime of a card as the Bathtub Curve applies I'd argue any surviving used card is pretty firmly in "escaped infant mortality" territory. This applies to gamer cards as well (of course) but this is where my other points apply.
Anecdotal but in 2017 I had 200 GTX 1xxx cards for a personal mining project (it was interesting). From what I remember I experienced <5 card failures within the first few months but smooth sailing otherwise. I kept a few for personal use and eventually sold the rest with a handful of them (small sample size, I know) going to personal friends. They're all still running strong.
For the remainder that were sold on eBay I only had one return (it was fine, of course) and all of the reviews were impeccable - along the lines of "I know this card was used to mine but I'd swear it's new 10/10". Note that I was very clear and transparent that they were used for mining in the item description and I even included pictures of the mining environment in the listings. My eBay account is still active and I haven't heard anything from a single customer over the past three years (I assume they'd come back in the event of a card failure). I think 200 cards is a large enough sample size to bolster my argument.
As another commenter also noted many of them are being sold as new. This is typical crypto ecosystem shadiness but for a miner to be able to even pass a card off as new speaks to the high level of care higher-volume professional mining operations employ.
For modern Nvidia GPUs (even going back to Pascal) I'd argue anything used for less than two years is essentially burn-in testing.
A 30xx series GPU is at most about 2 years old. Silicon processes are often qualified for a 10 year *continuous* operating lifetime, and that’s for junction temperatures that are much higher than 80C.
For example this TI report mentions a typical lifetime of 10 years at 105C.
I hope I'm not getting downvoted into complete white on white for asking:
Is there a good resource to learn myself a little ML for greater good, if I'm a complete math idiot? Something very practical, start with {this} and build an image recognition to tell birds from dogs. Or start with {that} and build an algo trading machine that will make me a trillionaire.
I do have a 4090 in a windows machine which I can turn into a linux machine.
If you're interested in Deep Learning specifically, the fast.ai "Practical Deep Learning for coders" course [1] is often recommended. It says "You don’t need any university math either — we’ll teach you the calculus and linear algebra you need during the course".
Thanks! I don't know if it's DL or not, but what I'm mostly interested in is "finding patterns". Like I have this very long stream of numbers (or pair of numbers, like coordinates) and I use such streams to train. Then I have a shorter stream and the model infers the rest. Not sure if I'm talking complete nonsense or not :-)
Math isn't an absolute requirement for training an AI, even programming isn't a requirement, as there are quite a few pre-built tools. I would recommend browsing through some of the many, many books available and see which one(s) speak to your current experience level. You won't compete against world class systems without going deeper, but as long as you're willing to start small, you have to start somewhere just to see if it's something you like doing.
What I'm looking for is more like Rust Book. Concepts explained with examples of more or less real world problems. Like they are not just telling you there's this awkward thing like "interior mutability" but why you may need it.
What's very interesting to me, if I read the "slowdown vs power limit" graph right, is that you can get 74% of the performance for half the power consumption?
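If those two numbers are read correctly, the efficiency gain can be worked out as below. This is a back-of-the-envelope sketch: it assumes the GPU actually draws its power limit the whole time and ignores the rest of the system (CPU, fans, PSU losses), which keeps drawing power for the longer runtime.

```python
def perf_per_watt_gain(relative_perf, relative_power):
    """How much more work is done per watt compared to the stock setting."""
    return relative_perf / relative_power

def relative_runtime(relative_perf):
    """A fixed job takes 1 / perf times as long."""
    return 1.0 / relative_perf

def relative_energy_per_job(relative_perf, relative_power):
    """Energy = power * time, both relative to the stock setting."""
    return relative_power * relative_runtime(relative_perf)

REL_PERF = 0.74     # 74% of stock performance, as read off the graph
REL_POWER = 0.50    # half the stock power limit

print(f"perf per watt: {perf_per_watt_gain(REL_PERF, REL_POWER):.2f}x")
print(f"runtime:       {relative_runtime(REL_PERF):.2f}x")
print(f"GPU energy:    {relative_energy_per_job(REL_PERF, REL_POWER):.2f}x")
```

So a fixed training job would take about 35% longer but use roughly two thirds of the GPU energy, i.e. about 1.48x the work per watt.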
And since I know nothing about deep learning:
Can you do these neural networks on AMD GPUs at all, regardless of performance? Because the hive mind says AMD GPUs are better if you run Linux.
> Can you do these neural networks on AMD GPUs at all, regardless of performance?
In principle yes, in practice it's not worth the bother. Scroll to the FAQ section at the end of the article, they cover the AMD situation pretty well.
> Because the hive mind says AMD GPUs are better if you run Linux.
As usual the hivemind just regurgitates memes without understanding.
> Can you do these neural networks on Linux at all?
Almost all NN work is done on Linux. Recent WSL2 developments have supposedly made Windows viable as well, but I don't know how popular it is.
> > Because the hive mind says AMD GPUs are better if you run Linux.
> As usual the hivemind just regurgitates memes without understanding.
I can't edit the OP any more, but I meant 3d drivers wise generally, not ML wise.
I haven't cared about 3d on Linux in ages, but now I do (thanks Apple for the architecture change) and need to decide what video card to get in the next year as prices go back to a reasonable-ish level. As in AMD or NVidia, not what price/performance tier. And my general impression from the internets(tm) is amd drivers are more stable. Am I wrong?
AMD drivers are not more stable. They are however open-source and included in the kernel, which has both advantages and disadvantages:
- AMD has better support for the "blessed" Linux 3D stack. This manifests itself as more mature support in Wayland compositors for example, which is still rockier with nvidia.
- With AMD you need to care about running recent kernels, because improvements to the drivers are tied to the kernel, as is support for newly released cards. With nvidia it doesn't matter - I can run a stable distro like Ubuntu 20.04 LTS and still use the latest driver.
- Nvidia support for switching between discrete and integrated graphics in laptops used to be crap. I'm not sure if it's still the case as I don't follow this closely, but given historical record, it's likely still crap.
- Nvidia driver quality is historically better than AMD (across the board, not just in Linux). If gaming is your primary use, things are still more likely to "just work" on nvidia than AMD, although AMD will usually fix problems eventually.
Is the "Will AMD GPUs + ROCm ever catch up with NVIDIA GPUs + CUDA?" section accurate? It looks like at least partly an old version as it mentions upcoming 2020 hardware.
So how does training performance work when you don't have enough memory on the GPU? If you just use RAM instead, is the performance hit as simple as PCIE_Bandwidth divided by GPU_Memory_Bandwidth?
Example with 4090 would be 64GB/s and 1008GB/s respectively, so one could expect about 6% performance even when dealing with a model that is 48GB large (or more)?
Or are there second-order effects that cut this speed down even more?
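The naive estimate from the question can be written out as follows. It is only the first-order model described above, assuming every byte of the model has to cross PCIe once per pass and that the pass is purely bandwidth-bound; both bandwidth figures are taken from the comment as-is, and the function names are made up. If the effective one-way PCIe rate is lower than 64 GB/s, the fraction shrinks accordingly.

```python
def naive_offload_fraction(pcie_gb_s, vram_gb_s):
    """Throughput relative to an all-in-VRAM run, if PCIe is the only bottleneck."""
    return pcie_gb_s / vram_gb_s

def seconds_per_pass(model_gb, bandwidth_gb_s):
    """Time to stream the whole model once at the given bandwidth."""
    return model_gb / bandwidth_gb_s

PCIE_GB_S = 64.0      # figure from the comment
VRAM_GB_S = 1008.0    # RTX 4090 memory bandwidth, from the comment
MODEL_GB = 48.0       # model size from the comment

fraction = naive_offload_fraction(PCIE_GB_S, VRAM_GB_S)
print(f"naive relative speed: {fraction:.1%}")
print(f"one pass over PCIe:   {seconds_per_pass(MODEL_GB, PCIE_GB_S):.2f} s")
print(f"one pass from VRAM:   {seconds_per_pass(MODEL_GB, VRAM_GB_S):.3f} s (hypothetical, 48 GB does not fit)")
```

That gives about 6.3%, matching the estimate in the question. Whatever portion of the model stays resident in VRAM, and any overlap of transfers with compute, would push the real number up; transfer latency and synchronization would push it down.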
A bit of a shock. I purchased two 1080 Tis based on his recommendation many years ago. Looking at the chart, are these cards really still above the middle? Is something wrong? Even the memory is not that bad.
Yeah, I've got a dual 1080ti setup and was wondering about that myself. It seems the recommendation is "Used 3060" but then this benchmark says 1080ti is better despite it being two generations older?? I was considering upgrading but now I don't know what to do...
Why are transformers and large language models (which AFAIK use transformers, e.g Generative Pre-trained Transformer) in different boxes in the decision tree? They yield the same conclusion so I guess it's fine.
Thanks to Tim for writing and sharing this awesome analysis. I stumbled upon the previously updated version this summer, and so glad to see it updated with new hardware. Really great stuff and I learned a ton.
is there more context for:
"Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training"
2x what? CPU, then it's hardly worth it. 2x per core, so #cores x 2. I'd imagine the level of sparsity is also important, but there is no mention of it.
I can't really judge the accuracy of the rest of the article, but if they can't even get the names of the GPUs they're recommending correct, it's tough to trust the rest of their supposedly thorough analysis. Throughout the whole article, the author refers to the "NVIDIA RTX 40 Ampere series", which is wrong. The RTX 30 series is Ampere [1] and the RTX 40 series is Ada [2].
It's not "throughout the whole article". It looks like he has updated a previous post and made one editing mistake, which he then copied incorrectly once. In all other mentions he correctly describes RTX 40 as Ada.
The analysis is thorough, even if the editing is imperfect.
The author also says that AMD doesn't have Tensor cores. No shit, Tensor core is an Nvidia trademark. AMD has matrix multiplication units in their CDNA GPUs.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting
Accessibility and usability issues can and should be reported. I'm not sure if the guideline applies in this case, as the issue has significant impact and the article author might want to be aware of it.
Sure, they should. The problem is, you're not reporting them to the site. You're commenting about them here. Virtually everybody who has to read your report can't do anything about it. Which is, in large part, why we have a rule against writing these kinds of comments.
By all means, report your accessibility problems with sites on HN... to the site owner. Email is a good first move!
I understand your point. But discussing these issues in a public forum has an additional benefit in that it also raises general awareness that they're a problem.
Also, it is an interesting question. Why do we keep seeing blog designs with thin gray text, even though people repeatedly complain about it, and even though text is the centerpiece of a blog? It's not really that difficult to pick a different weight or color. So what causes this problem again and again?
Is it really the case that the rule says "We don't want to hear about accessibility problems because they are boring"? That's just awful.
If they are truly "boring" and that's the only problem, why not let the downvotes take care of it? Why flag?
If the answer is "that's the kind of community we would like to create" let me just point out my third paragraph again and say that that's truly sad.
> I'm not sure if the guideline applies in this case
It does, there are lots of mod comments about cases exactly like this. The problem with these is not that accessibility is unimportant or unreadable pages aren't annoying but that they're repetitive.