> Shown is raw relative performance of GPUs. For example, an RTX 4090 has about 0.33x the performance of an H100 SXM for 8-bit inference. In other words, an H100 SXM is three times faster for 8-bit inference compared to an RTX 4090.
RTX 4090 GPUs are not able to use 8-bit inference (fp8 cores) because NVIDIA has not (yet) made the capability available via CUDA.
> 8-bit Inference and training are much more effective on Ada/Hopper GPUs because of Tensor Memory Accelerator (TMA) which saves a lot of registers
Ada does not have TMA, only Hopper does.
If people are interested I can run my own benchmarks on the latest 4090 and compare it to previous generations.
I think TMA may not matter as much for consumer cards given the disproportionate amount of fp32 / int32 compute that they have.
Would be interesting to see how close to theoretical folks are able to get once CUDA support comes through.
Anything improving memory bandwidth for any compute part of the GPU is welcome.
Also I'd like for someone to crack open RT cores and get the ray-triangle intersection acceleration out of OptiX. Have you seen the FLOPS on these things?
And the fp32-gemm-with-tf32
- Gradient checkpointing and a 2048 batch size in a single go allow a ~10% performance improvement on a per-sample basis (see the sketch after this list).
- torch.compile doesn't work for me yet (the lowest CUDA version I got my 4090 to run on was 11.8, but the highest CUDA version on which I got the model to compile is 11.7).
- I did the optimizations in https://arxiv.org/abs/2212.14034
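A minimal sketch of the first two bullets on a toy transformer-ish block (my own example, not the parent's actual training script; dimensions, optimizer, and loss are made up):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Block(nn.Module):
        def __init__(self, dim=1024):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            # Gradient checkpointing: recompute the feed-forward activations in the
            # backward pass instead of storing them (trades compute for memory).
            return x + checkpoint(self.ff, x, use_reentrant=False)

    model = nn.Sequential(*[Block() for _ in range(12)]).cuda().bfloat16()
    model = torch.compile(model)   # needs PyTorch 2.x with a working CUDA toolchain

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)  # one 2048-sample step
    loss = model(x).square().mean()   # dummy loss just to exercise the backward pass
    loss.backward()
    opt.step()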
Do you train in fp16/bf16?
Have you tried fp8?
> Ada/Hopper also have FP8 support, which makes 8-bit training in particular much more effective. I did not model numbers for 8-bit training because to model that I need to know the latency of L1 and L2 caches on Hopper/Ada GPUs, and they are unknown and I do not have access to such GPUs. On Hopper/Ada, 8-bit training performance can well be 3-4x of 16-bit training performance if the caches are as fast as rumored. For old GPUs, Int8 inference performance is close to the 16-bit inference performance.
For most current models, you need 40+ GB of RAM to train them. Gradient accumulation doesn't work with batch norms so you really need that memory.
That means either dual 3090/4090 or one of the extra expensive A100/H100 options. Their table suggests the 3080 would be a good deal, but it's not. It doesn't have enough RAM for most problems.
If you can do 8-bit inference, don't use a GPU. A CPU will be much cheaper and potentially also lower latency (a minimal sketch follows below).
Also: Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?
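For concreteness, here is a minimal sketch of one cheap 8-bit CPU inference path: PyTorch dynamic quantization on Linear layers. The model here is a stand-in for illustration, not anything from the article or the parent's workload.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()

    # Linear weights are stored and multiplied in int8; activations are quantized on the fly.
    qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 4096)
    with torch.inference_mode():
        y = qmodel(x)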
Last I looked, very few SOTA models are trained with batch normalization. Most of the LLMs use layer norms which can be accumulated?
(precisely because of the need to avoid the memory blowup).
Note also that batch normalization can be done in a memory efficient way: It just requires aggregating the batch statistics outside the gradient aggregation.
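A rough sketch of that idea (a toy illustration, not a drop-in replacement for nn.BatchNorm, and it ignores how gradients flow through the statistics): accumulate per-feature count, sum, and sum-of-squares over the micro-batches, then normalize each micro-batch with the combined statistics, so the effective batch statistics match the large batch without ever holding it in memory.

    import torch

    def combined_stats(micro_batches):
        # micro_batches: iterable of (batch, features) tensors
        n, s, ss = 0, 0.0, 0.0
        for xb in micro_batches:
            n += xb.shape[0]
            s = s + xb.sum(dim=0)
            ss = ss + (xb * xb).sum(dim=0)
        mean = s / n
        var = ss / n - mean * mean
        return mean, var

    # 8 micro-batches of 256 behave like one effective batch of 2048
    micro_batches = [torch.randn(256, 64) for _ in range(8)]
    mean, var = combined_stats(micro_batches)
    normalized = [(xb - mean) / torch.sqrt(var + 1e-5) for xb in micro_batches]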
It might not be as glamorous or make as many headlines, but there is plenty of research that goes on below 40 GB.
While I most commonly use A100s for my research, all my models fit on my personal RTX 2080.
I) My research involves biological data (protein-protein interactions) and my datasets are tiny (about 30K high-confidence samples). We have to regularize aggressively and use a pretty tiny network to get something that doesn't over-fit horrendously.
II) We want to accommodate many inferences (10^3 to 10^12) on a personal desktop or cheap OVH server in little time so we can serve the model on an online portal.
The memory on a 4090 can serve extremely large models. Currently, int4 is starting to be proven out. With 24 GB of memory, you can serve 40-billion-parameter models. Coupled with the fact that GPU memory bandwidth is significantly higher than CPU memory bandwidth, this means that CPUs should rarely ever be cheaper or lower latency than GPUs.
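A quick back-of-the-envelope check of that claim (approximate; it ignores quantization scales/zero-points and the memory needed for activations and the KV cache):

    params = 40e9                      # 40 billion parameters
    weight_bytes = params * 4 / 8      # int4 = half a byte per parameter
    print(weight_bytes / 2**30)        # ~18.6 GiB of weights

    # That leaves only a few GiB of a 24 GB card for activations and the KV cache,
    # so it's tight but plausible for short-context inference.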
They need to advertise it better. First time I hear about it.
What are the prices like there? GPUs/workstations?
There's a decision tree chart in the article that addresses this - as it points out there are plenty of models that are much smaller than this.
Not everything is a large language model.
So maybe they were including information for the hobbyists/students who do not need or cannot afford the latest and greatest professional cards?
Good advice. Does that mean that I can install like 64 gb ram on a PC and run those models in comparable time?
And for a datacenter, a few $100 AMD CPUs will beat a single $20k NVIDIA A100 at throughput per dollar.
Out of curiosity, does that also apply for consumer grade GPUs?
3060ti 8GB, 3090 24GB, and 4000 series all have performance benefits, but for now this one is off the charts.
Combine this with a second hand X99 / 2011-v3 platform like the Dell Precision T7910 dual socket Xeons and you can have a pretty decent homelab for ML workloads. The Dell can come with a 1300 watt PSU and can fit 4 of those cards comfortably (5 with reduced PCIe lanes on one) since they are 150W each.
1. Apple doesn't provide public interfaces to the ANE or the AMX modules, making it difficult to fully leverage the hardware, even from PyTorch's MPS runtime or Apple's "Tensorflow for macOS" package
2. Even if we could fully leverage the onboard hardware, even the M1 Ultra isn't going to outpace top-of-the-line consumer-grade Nvidia GPUs (3090Ti/4090) because it doesn't have the compute.
The upcoming Mac Pro is rumored to be preserving the configurability of a workstation. If that's the case, then there's a (slim) chance we might see Apple Silicon GPUs in the future, which would potentially make it possible to build an Apple Silicon machine that could compete with an Nvidia workstation.
At the end of the day, Nvidia is winning the software war with CUDA. There are far, far more resources which enable software developers to write compute intensive code which runs on Nvidia hardware than any other GPU compute ecosystem out there.
Apple's Metal API, Intel's SYCL, and AMD's ROCm/HIP platform are closing the gap each year, but their success in the ML space is dependent upon how many people they can peel away from the Nvidia Hegemony.
Despite Apple engineers contributing hardware support to Blender: https://9to5mac.com/2022/03/09/blender-3-1-update-adds-metal...
So looks like NVidia (and AMD to some extent) are winning the 3D usage war as well for now.
But... the GTX by itself has a 145W TDP, whereas the whole laptop has a TDP of 20W!
This is super impressive imho, but of course not in the raw power department.
This isn't really an Apple specific problem - GPUs with low power budgets are never going to beat GPUs intended to permanently run off of wall power.
(I know there's stuff like the Mac Studio and iMac, but those are effectively still just laptop guts in a (compact) desktop form factor rather than a system designed from the ground up for high performance workstation type use)
I'd love to see a dedicated PCIe GPU developed by Apple with their knowledge from designing the m1/m2, but it's not really their style.
1 = https://github.com/pytorch/pytorch/issues/77764
Would be very interesting to retest with the M2 and also in the months/years to come as the software reaches the level of optimisations we see on the PC side.
5950X CPU ($500) with 128GB of memory ($400).
AFAIK it flat out won't work with the DL frameworks.
If I'm mistaken please do speak up.
Edit: Thank you JonathanFly, I didn't know this!
This can work in some cases (years ago I certainly did it for CNNs - was slow but you are fine tuning so anything is an improvement) but I don't know how viable it would be on a transformer.
I've recently tried to play with Stable Diffusion. I have an RTX 3060, a base M1 MBP and a laptop with an AMD RX6800m.
- The RTX 3060 just works. Setup was straightforward, performance is good.
- The M1 has those neural engine thingies but it's not compatible with SD. It can run SD on the CPU, but if you want to make use of the NE you need Core ML-specific programs and models. The issue here is that it's just not the same stuff. Prompt structure is also different: it doesn't recognize weights and puts a lot more emphasis on prompt order. Most of the time it seems to ignore most of what you wrote. On the positive side, running stuff on the NE is very fast and doesn't seem to be taxing to the system.
- Finally the 6800M. It should be the powerhouse of the bunch considering its gaming performance is ahead of the 3060. The problem is that AMD's toolkit kinda sucks. They have this obscure ROCm HIP stuff that acts as some kind of translation layer for the CUDA API. It's complicated, it simply doesn't work without a bunch of obscure environment variables, and it only works in fp32 mode, which means it uses twice the amount of video RAM for the same thing. Support is iffy as they seem to only support workstation cards. Using it often throws lots of obscure compilation errors. Bugs celebrating anniversaries.
To sum things up, Nvidia is far ahead of the curve in both usability and performance. Apple is trying to do its thing but it's too early yet while AMD is in a messy situation. Hope it helps.
Nvidia is literally cancer in this space.
1. Agree AMDGPU is awesome.
2. All tests were done in Linux with the exception of coreml on MacOS.
> And ROCm HIP stuff isn't obscure.
It is from my novice perspective. While the Nvidia stuff works out of the box (even on Linux), I had to install ROCm and a special version of PyTorch for ROCm. Then it only kinda worked after an assortment of environment variables to set the GFX version (see the sketch below). And even then it uses twice as much video memory as what torch uses on Nvidia, limiting what I can do.
Compare, e.g., rocFFT to cuFFT. They're still worlds apart.
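To make "assortment of environment variables" concrete, here's a rough sketch of the workaround commonly passed around for RDNA2 consumer cards like the 6800M. The exact GFX value and the ROCm wheel version depend on your card and driver, so treat every value here as an assumption, not a recipe.

    import os
    # Pretend to be a supported gfx1030 part; must be set before the ROCm runtime starts.
    # 10.3.0 is the commonly shared value for RDNA2 cards - verify for your GPU.
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

    import torch
    print(torch.cuda.is_available())        # ROCm builds expose the GPU through the "cuda" API
    print(torch.cuda.get_device_name(0))

    # The ROCm build of PyTorch itself comes from a separate wheel index, e.g.
    #   pip install torch --extra-index-url https://download.pytorch.org/whl/rocm5.2
    # (the rocm version in the URL changes between releases)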
> I worked on a project that produced carbon offsets about ten years ago. The carbon offsets were generated by burning leaking methane from mines in China. UN officials tracked the process, and they required clean digital data and physical inspections of the project site. In that case, the carbon offsets that were produced were highly reliable. I believe many other projects have similar quality standards.
Crusoe Cloud (https://crusoecloud.com) does the same thing; powering GPUs off otherwise flared methane (and behind the meter renewables), to get carbon-reducing GPUs. A year of A100 usage offsets the equivalent emissions of taking a car off the road.
Disclosure: I run product at Crusoe Cloud
Probably not the first thing that comes to mind for people when they hear about carbon offsets. Things have gone further than I thought.
Perhaps counterintuitive given how much we usually go around saying burning fossil fuels is bad for the environment, but the science is sound.
Some other info we've published: https://www.crusoeenergy.com/digital-flare-mitigation
I suppose we might see "slimmer" 4090s at some point but even if the design is (somehow) possible it's clear Nvidia won't allow their partners to manufacture dual slot versions of RTX cards that could possibly compete with higher end cards.
 - https://www.crn.com/news/components-peripherals/gigabyte-axe...
"Our enhanced NPU on Stratix 10 NX delivers 24× and 12× higher core compute performance on average compared to the T4 and V100 GPUs at batch-6, despite the smaller NX die size."
"Results show that the Stratix 10 NX NPU running batch
6 inference achieves 12-16× and 8-12× higher average energy
efficiency (i.e. TOPS/Watt) on the studied workloads compared to the T4 and V100 GPUs, respectively."
Disclaimer - I work on the team that was originally behind Brainwave at Microsoft.
Crypto cards are almost always run with lower power limits, managed operating temperature ranges, fixed fan speeds, cleaner environments, etc. They're also typically installed, configured, and managed by "professionals". Eventual resell of cards is part of the profit/business model of miners so they're generally much better about care and feeding of the cards, storing boxes/accessories/packing supplies, etc.
Compared to a desktop/gamer card with sporadic usage (more hot/cold cycles), power maxed out for a couple more FPS, running in unknown environmental conditions (temperature control for humans, vape/smoke residue, cat hair, who knows) and installed by anyone.
The common lore after the last mining crash (~2018) was to avoid mining cards at all costs. We're far enough along at this point where even most of the people at /r/pcmasterrace no longer subscribe to the "mining card bad" viewpoint - presumably because plenty of people chime in with "bought an ex-mining card in 2018, was skeptical, still running strong".
Time will tell again in this latest post-mining era.
There probably are people with functional mining cards, but like I said in my original comment, hardware failure runs along a bathtub curve. The chance of mechanical failure increases proportionally with use; leaving mining GPUs particularly affected. How strong that influence is can be debated, but people are right to highlight an increased failure rate in mining cards.
If a component has been powered on and running continuously under load within its rated temperature band, it'll have fewer heat cycles. This seems preferable to a second-hand card from a gamer which gets run at random loads and powered down a lot.
Hopefully manufacturers aren't still gluing the fans on the boards. I lost a couple of GPUs thanks to that, turning what should be an easy fix into a replace the whole damn card situation.
As another commenter noted, used RTX 3xxx series cards are at most two years old. Given the anticipated lifetime of a card and how the bathtub curve applies, I'd argue any surviving used card is pretty firmly in "escaped infant mortality" territory. This applies to gamer cards as well (of course), but this is where my other points apply.
Anecdotal but in 2017 I had 200 GTX 1xxx cards for a personal mining project (it was interesting). From what I remember I experienced <5 card failures within the first few months but smooth sailing otherwise. I kept a few for personal use and eventually sold the rest with a handful of them (small sample size, I know) going to personal friends. They're all still running strong.
For the remainder that were sold on eBay I only had one return (it was fine, of course) and all of the reviews were impeccable - along the lines of "I know this card was used to mine but I'd swear it's new 10/10". Note that I was very clear and transparent that they were used for mining in the item description and I even included pictures of the mining environment in the listings. My eBay account is still active and I haven't heard anything from a single customer over the past three years (I assume they'd come back in the event of a card failure). I think 200 cards is a large enough sample size to bolster my argument.
As another commenter also noted many of them are being sold as new. This is typical crypto ecosystem shadiness but for a miner to be able to even pass a card off as new speaks to the high level of care higher-volume professional mining operations employ.
For modern Nvidia GPUs (even going back to Pascal) I'd argue anything used for less than two years is essentially burn-in testing.
For example, this TI report mentions a typical lifetime of 10 years at 105°C.
Basically "go in MSI Afterburner and crank it up until the machine crashes".
If you’re a bit more comfortable with command line stuff, the fast.ai course is good.
See, then, e.g.,
And since I know nothing about deep learning:
Can you do these neural networks on AMD GPUs at all, regardless of performance? Because the hive mind says AMD GPUs are better if you run Linux.
Can you do these neural networks on Linux at all?
In principle yes, in practice it's not worth the bother. Scroll to the FAQ section at the end of the article, they cover the AMD situation pretty well.
> Because the hive mind says AMD GPUs are better if you run Linux.
As usual the hivemind just regurgitates memes without understanding.
> Can you do these neural networks on Linux at all?
Almost all NN work is done on Linux. Recent WSL2 developments have supposedly made Windows viable as well, but I don't know how popular it is.
> As usual the hivemind just regurgitates memes without understanding.
I can't edit the OP any more, but I meant 3D drivers generally, not ML-wise.
I haven't cared about 3d on Linux in ages, but now I do (thanks Apple for the architecture change) and need to decide what video card to get in the next year as prices go back to a reasonable-ish level. As in AMD or NVidia, not what price/performance tier. And my general impression from the internets(tm) is amd drivers are more stable. Am I wrong?
- AMD has better support for the "blessed" Linux 3D stack. This manifests itself as more mature support in Wayland compositors for example, which is still rockier with nvidia.
- With AMD you need to care about running recent kernels, because improvements to the drivers are tied to the kernel, as is support for newly released cards. With nvidia it doesn't matter - I can run a stable distro like Ubuntu 20.04 LTS and still use the latest driver.
- Nvidia support for switching between discrete and integrated graphics in laptops used to be crap. I'm not sure if it's still the case as I don't follow this closely, but given historical record, it's likely still crap.
- Nvidia driver quality is historically better than AMD (across the board, not just in Linux). If gaming is your primary use, things are still more likely to "just work" on nvidia than AMD, although AMD will usually fix problems eventually.
> Nvidia support for switching between discrete and integrated graphics in laptops used to be crap.
That's okay, I'm staying with Apple for laptops. I'm just setting up a Linux/x86 box in addition to the Apple desktop.
An example with the 4090 would be 64 GB/s and 1008 GB/s respectively, so one could expect about 6% of the performance even when dealing with a model that is 48 GB large (or more)?
Or are there second-order effects that cut this speed down even more?
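A quick back-of-the-envelope check using the parent's numbers, treating 64 GB/s as the bandwidth of whatever slower pool the weights spill into and 1008 GB/s as 4090 VRAM bandwidth (the exact interpretation of the 64 GB/s figure is an assumption):

    slow_pool_gbps = 64      # parent's figure for the slower memory pool
    vram_gbps = 1008         # RTX 4090 memory bandwidth

    print(slow_pool_gbps / vram_gbps)   # ~0.063 -> roughly the 6% figure above

    # If a 48 GB model has to be streamed over the slow link once per forward pass,
    # that alone takes 48 / 64 = 0.75 s per pass - before any second-order effects
    # (latency, lack of transfer/compute overlap, driver overhead) make it worse.
    print(48 / slow_pool_gbps)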
It's cheaper to rent AWS instances and diagonalize a working set than to throw down coin for the latest/greatest GPU.
Seems the move to 24GB meant a 3090 …
The analysis is thorough, even if the editing is imperfect.
By all means, report your accessibility problems with sites on HN... to the site owner. Email is a good first move!
Also, it is an interesting question. Why do we keep seeing blog designs with thin gray text, even though people repeatedly complain about it, and even though text is the centerpiece of a blog? It's not really that difficult to pick a different weight or color. So what causes this problem again and again?
Is it really the case that the rule says "We don't want to hear about accessibility problems because they are boring"? That's just awful.
If they are truly "boring" and that's the only problem, why not let the downvotes take care of it? Why flag?
If the answer is "that's the kind of community we would like to create", let me just point out my third paragraph again and say that that's truly sad.
I hope you re-assess this...
It does, there are lots of mod comments about cases exactly like this. The problem with these is not that accessibility is unimportant or unreadable pages aren't annoying but that they're repetitive.
With a little bit of searching, it looks like the reader mode extension does what I want (in fewer steps).