ChatGPT Hardware: A look at 8x NVIDIA A100 powering the tool (servethehome.com)
169 points by giuliomagnifico on Feb 14, 2023 | 107 comments



A100 has 1555 GB/sec of memory bandwidth. 50% more than Radeon 7900 XTX, but the AMD GPU costs 10x less.

nVidia’s top management did an amazing job with their strategy. They have spent decades, and millions of dollars, developing and promoting CUDA. Now this motivates people to spend that kind of money on nVidia’s hardware.

I don’t believe the trend continues for ML inference, at least not for long. Unlike traditional HPC stuff like FEM or fluid dynamics, the GPU code of ML inference is not that complicated. The complexity lives in the data being processed, i.e. the ML models themselves.

It’s not terribly hard to port the inference to another hardware architecture or software stack, reusing the models. I think it’s only a matter of time until AI companies like OpenAI start porting their code from CUDA to Vulkan compute or some other vendor-agnostic stack, to avoid buying or renting such $100k computers. Don’t get me wrong, these computers are awesome, just very expensive.


There's also the problem of model size: the 7900 XTX maxes out at 24GB, when a lot of new models require the higher-end 80GB A100 variants. The A100 is also a server-grade card, so even spec-for-spec it's not a fair comparison.


> a lot of new models require the higher-end 80GB A100 variants

For training, yeah. But at least in some cases, inference can be optimized to use less memory.

For their large Whisper model, OpenAI says they need 10GB of VRAM: https://github.com/openai/whisper#available-models-and-langu...

My DirectCompute port of that uses slightly over 4GB of VRAM, see the last column in that table: https://github.com/Const-me/Whisper/blob/master/SampleClips/... I haven’t actually optimized memory usage; I only did the DirectCompute port. The memory savings were achieved by Georgi Gerganov in the whisper.cpp project, which I ported to Windows.


apparently GPT-J needs ~25GB RAM min (for inference only), i.e. a 32GB card ...and GPT-3 is an order of magnitude bigger again

https://nlpcloud.com/gpt-3-open-source-alternatives-gpt-j-gp...

though TBH just making cards with more RAM to fit these kinds of models seems easier than being at the absolute cutting edge of performance

I also wonder how this metric translates to something like Apple Silicon 'unified memory'... latest MBP have 32GB RAM, I wonder if anyone managed to run GPT-J there yet?
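
For a rough sense of scale, a back-of-the-envelope sketch (just parameters × bytes per parameter, ignoring activations and runtime overhead, so real usage is somewhat higher) lines up with those numbers:

    # Rough VRAM estimate: parameter count x bytes per parameter.
    # Ignores activations, KV cache and framework overhead, so real usage is higher.
    def model_gib(params, bytes_per_param):
        return params * bytes_per_param / 2**30

    for name, params in [("GPT-J 6B", 6e9), ("GPT-3 175B", 175e9)]:
        for dtype, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
            print(f"{name} @ {dtype}: ~{model_gib(params, nbytes):.0f} GiB")
    # GPT-J 6B:   ~22 GiB fp32, ~11 GiB fp16,  ~6 GiB int8
    # GPT-3 175B: ~652 GiB fp32, ~326 GiB fp16, ~163 GiB int8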


There's a float16 version of GPT-J rewritten in C++ here that apparently fits into 16GB...

https://github.com/ggerganov/ggml/tree/master/examples/gpt-j

it's CPU-only and runs inference at "about ~6 words per second" on an M1 MBP


That's for the max-size model (6B parameters). GPT-Neo has pruned models 125m and 1.3b that fit into 1.5gb and 6gb of memory respectively.

These models are obviously nerfed, but you can run them on budget ARM hardware and get legible, paragraph-length responses in less than 10 seconds.


There's a 20B model now


The 20B model was the original I believe, and the other models were all pruned from it.


This is incorrect, though I thought it would be interesting to explain why. Pruning usually removes some parameters from the model (setting certain ones to zero, for instance). The smaller models all have fewer layers, so they weren’t pruned; they were smaller to begin with.


You can also use int8 quantization, which comes for free with Hugging Face and isn’t a huge cost to accuracy. That cuts the memory usage almost in half again (almost, because some layers might stay 32-bit, though they make up a tiny part of the model).
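
A minimal sketch of what that looks like (assuming the transformers, accelerate and bitsandbytes packages are installed; GPT-J here is just a stand-in model):

    # Sketch: load a causal LM with 8-bit weights via bitsandbytes.
    # Most linear layers get quantized to int8; a few pieces (norms, embeddings,
    # outlier features) stay in higher precision, hence "almost" half of fp16.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "EleutherAI/gpt-j-6B"  # example model, not the only option
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # let accelerate place layers on the available GPU(s)
        load_in_8bit=True,   # int8 weight quantization (needs bitsandbytes)
    )

    inputs = tokenizer("The A100 has", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))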


They have a link at the bottom for something similar, but with GPT-2. The Hugging Face docs have hints for using FP16, and for avoiding doubled memory requirements for model and checkpoint data.


I am running GPT-J-6B inference on 13GB of VRAM. FP16 is a must at this scale. I've also loaded GPT-NeoX-20B using int8 quantization onto a single RTX 3090 (24GB VRAM) before, however inference latency was quite high.
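
For reference, loading GPT-J in fp16 with Hugging Face looks roughly like this (a sketch; if I remember right, the float16 revision and these flags come from the GPT-J model card):

    # Sketch: GPT-J-6B in fp16, roughly 12GB of weights plus some runtime overhead.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B",
        revision="float16",           # fp16 branch of the checkpoint
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,       # avoid materializing fp32 weights on the CPU
    ).to("cuda")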


Actually, the MacBook Pro with M2 Max CPU can be configured with 96GB of ram. The Mac Studio with M1 Ultra can have up to 128GB of ram.


Ah great, I hadn't seen that... seems like that should be plenty for some of these models!


I've been running the stock implementation that OpenAI put out, I didn't know people had made so many improvements!

Thank you for your work.


As someone who trains large models, this is pretty much the only number I care about. Batching saves more time, and gets better results, than clock speed.

For the A100s, A6000s, and H100s you are paying for the memory and the upcharge for being a server card. AMD isn't really competing in this space, but they are moving in that direction if you look at the exascale machines [0,1].

[0] https://www.top500.org/system/180047/

[1] https://www.amd.com/en/products/server-accelerators/instinct...


> Batching saves more time, and gets better results, than clock speed.

How much time is saved?

If you've got a 12GB card at 5x speed vs. a 96GB card at 1x speed (so 5x slower), is the former still competitive? How much faster does the 12GB card need to be to get the same quality results with less memory (and smaller batches)?

What if the faster card has 24GB of memory?


Hard to say tbh. Could be infinite, could be zero.

In classification, batch size is going to help a ton (including with stability), and you may not even get top results without a large batch, so infinite in that case. More realistically, consider that batching also helps with GPU I/O. I/O slows things down quite a lot, so if you're moving stuff in and out of the accelerator a lot then you can easily be I/O bound.

For generation the conversation gets more complicated. For GANs the smaller card could do better if you can fit a proper batch size there (probably not, tbh, but GAN batches are often closer to 16 or 24, so it may be better here); distributed training tends to help more in that case. VAEs, diffusion models, autoregressive models, etc. are more stable and can also hugely benefit from large batches.

So the truth is an unsatisfying "it depends." I know that may not be the answer you're looking for, but it is the real one. It takes a lot of work and tricks to scale (anyone working in non-ML HPC will tell you very similar stories about scaling). But also note that the gap in clock speeds isn't as big as you suggest, though the memory gaps are.


Well, batch size also helps with making training more stable. A larger batch size computes the gradient across more examples, so it ends up being more stable than only using a batch size of, say, one. A batch size of one is only fixing the error on one example rather than many.

A larger batch size can also help keep the GPU fed, but usually that's not a problem with these larger models.
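
A toy PyTorch sketch of that point (purely illustrative, tiny linear model): the batch gradient is the mean of the per-example gradients, so its spread shrinks as the batch grows.

    # Toy illustration: gradient estimates from small batches scatter much more
    # than estimates from large batches, which is the "stability" described above.
    import torch

    torch.manual_seed(0)
    w = torch.randn(10, requires_grad=True)
    data = torch.randn(256, 10)
    target = torch.randn(256)

    def grad_for_batch(batch_size):
        idx = torch.randint(0, len(data), (batch_size,))
        loss = ((data[idx] @ w - target[idx]) ** 2).mean()
        (g,) = torch.autograd.grad(loss, w)
        return g

    for bs in (1, 8, 64):
        grads = torch.stack([grad_for_batch(bs) for _ in range(100)])
        print(bs, grads.std(dim=0).mean().item())  # spread drops as bs grows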


Could you educate me on what's really different about a card that is "server-grade" vs. one that is not?


For these GPUs, it partially comes down to form factor (being able to drop 8 in a chassis with unified cooling). I’d say that’s the primary difference here. It applies maybe a little more noticeably to the rest of the computer around the A100s, with ECC RAM, hot-swappable components, and redundant power supplies. The 8x A100 rigs are absolute beasts, difficult for even two people to move without removing some components. There is also price discrimination here, where manufacturers aim for higher margins on server components (as nvidia clearly does).

Between the two mentioned cards, though, it really does come down to performance. The cards are simply not interchangeable for large scale model training.


Two of the biggest differences are ECC memory and virtualization support.


Yeah, and while ECC is mostly transparent to the software layer, virtualization support is in itself a huge software layer that comes with this kind of equipment: MIG ( https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ ), plus software made to expose GPUs inside containers.

This software is mostly supplied by the vendor, mostly cares about the vendor’s tech, and will usually require special features exposed by the hardware, so it won’t work with consumer-grade GPUs.


well, and it's a specialized die. A100 ~ 2x size of 3090


You should probably compare the H100:

https://www.techpowerup.com/gpu-specs/h100-sxm5-96-gb.c3974

With the MI300, which has 128GB and more compute power (at least on paper):

https://www.techpowerup.com/gpu-specs/radeon-instinct-mi300....

A100 can be compared with MI250, perhaps. It also has 128GB.


The MI300 will compete more with Grace Hopper than with the H100 SXM5. We will be covering and showing both later this year.


An Azure 8x MI250X OAM enclosure was out and about at the SC22 conference late last year (second half of the linked article).

https://www.servethehome.com/microsoft-azure-at-sc22-and-the...


What is the pricing like for the MI300's?


I would guess $12k-$16k, but they haven't released the price yet.


> 50% more than Radeon 7900 XTX, but the AMD GPU costs 10x less

You could say that about some of nvidia's other GPUs too. A better comparison would be AMD's server offerings, which are priced comparably.

> Unlike traditional HPC stuff like FEM or fluid dynamics, the GPU code of ML inference is not that complicated.

It's complicated enough relative to the problem ML people are trying to solve. That's why in ML, all these tools have emerged specifically to let practitioners not care about the GPU code. People writing pytorch code don't want to think about CUDA vs vulkan at all. It's more of a question for the people writing the foundational libraries like pytorch.

I sincerely hope that tools like openai's triton make it easier to support a variety of GPUs.


> You could say that about some of nvidia's other GPUs too.

Technically, yes. Practically, nVidia forbids deploying their other GPUs in data centers, in the EULA of their drivers: https://www.datacenterdynamics.com/en/news/nvidia-updates-ge...

May or may not be enforceable, probably depends on jurisdiction, but I’m pretty sure that EULA is gonna scare some people.

> People writing pytorch code don't want to think about CUDA vs vulkan at all.

Well, there are also other people. OpenAI’s CEO doesn’t necessarily write pytorch code, but he described the compute costs of ChatGPT as “eye-watering”.

> I sincerely hope that tools like openai's triton make it easier to support a variety of GPUs.

I’ll believe it when I see it. Performance portability across GPU architectures ain’t trivial. For instance, in my Whisper implementation, the most expensive compute shader is the one which computes the product of two matrices.

nVidia version: https://github.com/Const-me/Whisper/blob/master/ComputeShade...

AMD version: https://github.com/Const-me/Whisper/blob/master/ComputeShade...

These two compute shaders have little in common; they even use a different memory layout for the arguments, despite both compute shaders doing the same math.
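
If it helps, here’s a toy Python sketch of what “same math, different blocking” means (nothing to do with the actual HLSL, just the idea: the result is identical for any tile size, but the tile sizes and memory layout are exactly what gets tuned per GPU):

    # Toy blocked matrix product: identical results for any tile size.
    # On a real GPU the tile size / layout is tuned per architecture, which is
    # why the two shaders look so different while computing the same product.
    import numpy as np

    def blocked_matmul(a, b, tile):
        m, k = a.shape
        _, n = b.shape
        c = np.zeros((m, n), dtype=np.float32)
        for i in range(0, m, tile):
            for j in range(0, n, tile):
                for p in range(0, k, tile):
                    c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
        return c

    a = np.random.rand(128, 96).astype(np.float32)
    b = np.random.rand(96, 64).astype(np.float32)
    assert np.allclose(blocked_matmul(a, b, 16), blocked_matmul(a, b, 32), atol=1e-3)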


This TB/sec of memory bandwidth is only when the data is on the GPU. The real problem is that GPU memory is expensive and models are getting bigger.

What we would like to do is what has usually been done with the cache hierarchy: bring the data to the GPU as it’s needed. That would require a lot less GPU memory, since we could stream parameters from RAM or disk on demand.

But there is a bottleneck at the PCIe interface (32GB/s for PCIe 4.0 x16), and you can’t feed the GPU fast enough, which means you can’t leverage cheap memory directly.
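
To put rough numbers on it (a back-of-the-envelope sketch using the 1555 GB/s and 32 GB/s figures from upthread, and assuming a 175B-parameter model in fp16 whose weights all have to be touched once per generated token):

    # Back-of-the-envelope: bandwidth sets a floor on how fast you can sweep the weights.
    params = 175e9                  # GPT-3-class model (assumption)
    weights_gb = params * 2 / 1e9   # fp16 -> ~350 GB

    for name, gb_per_s in [("PCIe 4.0 x16", 32), ("A100 HBM2", 1555)]:
        print(f"{name}: ~{weights_gb / gb_per_s:.2f} s per full pass over the weights")
    # PCIe 4.0 x16: ~10.9 s;  A100 HBM2: ~0.23 s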


Indeed, the CUDA software is the moat for Nvidia, though its hardware is still great.


At the scale of investment OpenAI are making, the software shouldn't be the moat anymore - you can hire a lot of very good software engineers for ten percent of that hardware budget.


It's way more than that. Building an ecosystem takes time, like a decade or more: lots of tooling, libraries, optimizations etc. have to be done, and they have to be done in a way users see as either similar to CUDA or simpler. The moat is very wide and deep as far as I can tell.


I don’t think OpenAI needs a new ecosystem. I think they just need a cost-efficient way to run inference of their ML models, for large userbase.

That smaller task is way more manageable. It shouldn’t take a decade, maybe a month or two. Vulkan ain’t some new tech, the API arrived years ago and is widely used by game developers.

Thanks to Valve’s work on DXVK 2.0, there’s even an implementation of the MS DirectCompute GPU API on top of Vulkan on Linux. I’m not saying that’s necessarily the best approach for the job, just one of the many options.


I feel like very few ML researchers actually use CUDA directly. You'd likely get a long way with an optimised backend for XLA to cover tf/jax, and probably something similar for pytorch. That's a lot of work, but I'm not sure it's a whole decade.


Not sure if that's entirely true. CUDA had 85% or more of the market share last time I checked; Intel's oneAPI and AMD's ROCm are so far behind that they made their machine learning hardware totally uninteresting for ML-related purchases, sadly.


What are the chances someone pulls a Toy Story and Nvidia gets SGI’d by clusters of consumer-grade graphics cards? Knowing greedy nvidia they’d probably try to nerf their gaming cards for that task, but it’s not like they are a monopoly.


I don’t know what the chances are, but over time it might happen. Economy of scale should not be underestimated.

30 years ago, there was a consensus that the x86 architecture was only good for small computers, while servers needed faster, specially-designed processors like IBM POWER or DEC Alpha. Then AMD64 was introduced, and after a while x86 took the server market too.

15 years ago, Intel was making specialized HPC co-processors called Xeon Phi. The chips were actually good, exceeded 1 TFlop of FP64, and weren’t terribly hard to program. I think it was consumer-grade graphics cards which made them obsolete. Despite being much harder to program, their raw performance numbers were better, and their cost efficiency in flops/$ was an order of magnitude better.


> It’s not terribly hard to port the inference to another hardware architecture or software stack

Sounds like a job for AI. Universal translator and all that.


With Nvidia's sparsity support, memory bandwidth can be effectively doubled for inference, if you can sparsify.


I recently saw a rack server of 8x SXM2 P100's (so, the 2016 equivalent of this) for sale for $3600 on ebay. And you can buy individual P100 pci-e versions for about $300 now. They retailed for about $7600 each back in 2016.

I wonder how long it'll be until A100's depreciate that amount.

I see articles from "Serve The Home" pop up now often and I've yet to see one discussing hardware that I've ever come across in... someone's home.


STH has been around for 14 years and so we have tested these big servers from the P100 days (see our DeepLearning12 build) and newer. Home was the /home/ directory in Linux.

We do have some content for homes. Check out our TinyMiniMicro series as an example.



>> Home was the /home/ directory in Linux.

I have never (until reading this) thought of it in that context. Was that the original meaning for the name, or did it shift over time?


It is in the "About" page :-)

I always get confused by why folks ask why we have server content. The Wall Street Journal covers more than just one road in NYC.



Now that is a throwback! That old logo was created with GIMP at a table outside NetApp. STH started at a time when the SMB market (and VERY high-end homes) were using Dell, HP, and IBM rackmount gear off-lease. The original STH idea came from me learning Linux as an alternative to Windows for that segment. STH was what I used to learn the rest of the market outside of work. The first product we were sent for a review was a rackmount case and that started us on the path of reviewing new gear ~2010-2011 when it was a part time blog instead of having a team of folks working on the site. I still try to keep 15-20% of our content in the SMB/ high-end home arena.


Those P100s are so cheap because they're more trouble than they're worth. They don't mount easily in most cases, will consume a lot of power at idle, generate lots of heat without active cooling and require a Linux machine to use properly.

It's probably possible to engineer a home-scale inferencing network like that, but I've had plenty of success just running GPT-Neo on off-the-shelf ARM chips. It's not worth the extra $800, exploding power bill and engineering effort to get diminishing returns.


Oh I know for sure, it's still just amusing to me to look up these just slightly older cards and see them going for so little compared to what they cost new.

I do also look at various benchmarks and comparisons over and over because I would like to build a home gpu server for some machine learning fun and I find it interesting that the TFLOPS/$ numbers always come out very much linear... A card that gives 2x performance costs 2x more.... if cards depreciate 50%, it's because new cards have been launched that have double performance for the same price... etc... Even simpler, the CUDA cores-per-dollar is pretty much a constant.

What isn't taken into account there is the various optimizations new cards have that speed up training or inference, like better use of the tensor cores. At some point, incompatibility with newer CUDA compute capability levels makes the older cards much less usable.


Not directly applicable to large language models, but I was doing some searching for cards with lots of double-precision compute performance. Nvidia has been leaving that out of all of the consumer versions of its hardware; cheap P100's seem to be by far the highest DP FLOPS per dollar solution available right now.


I have an interest in doing lattice Boltzmann fluid dynamics simulation on GPUs, and this is something I've noticed with the P100's as well. That, and memory bandwidth is super important for such calcs, along with total memory. Commercial codes I've seen aren't using any ML-style compute optimizations or tensor cores or anything like that; it's mostly just OpenCL based, and thus the P100's throughput per dollar is pretty high.


MI25

Faster, cheaper, crippled by AMD's poor software support.


Agree, I have 14 V100's in my home but it's pretty rare. I use them with fiber backhaul to supply services for marketplaces like vast.ai.


With residential power costs, is it economically viable to do this? What's your utilization look like?


In Washington power can be 2.95¢ per kWh so yes, very!


Where? Is there also 1Gb or better (5Gb+) fiber?



Dang, you are making me deeply regret not buying a house on the lake there a few years ago.


It’s a beautiful place too. It’s never too late ;)

They even have community fiber! https://www.chelanpud.org/my-pud-services/residential-servic...


You can get 1Gb, 2Gb and 5Gb fiber to the home from AT&T, TDS, Google etc. for backhaul.


I mean in the area with 3c/kWh electricity that the above poster mentioned.


From the utility? That's crazy.



Depends on how many marketplaces and the performance of the entire rig; power is $0.139/kWh where I am.


"I see articles from "Serve The Home" pop up now often and I've yet to see one discussing hardware that I've ever come across in... someone's home."

You're just not visiting the right homes ...


It might land in someone's home... 7 years after launch, like the P100s.


Insanely misleading title. The author is basically showing off their A100 GPUs. The ChatGPT part is just clickbait; the only relevance is that ChatGPT supposedly is also using A100s (definitely not only eight of them though).


The real question is whether ChatGPT needs that kind of hardware for a single user, i.e. a client-side version.


Yes it does.


When Stable Diffusion came out, my 8GB GPU could handle 1 image at 512x512. Now I can generate 1000x800 with batch size 4. I wonder if ChatGPT has the same unused optimisation potential.


Stable Diffusion is a lot smaller than GPT3. It has a bit less than a billion parameters while GPT3 has 175 billion parameters.

I would bet that we can do better than GPT3 with less computing resources but that would need a different model architecture.


> I wonder if ChatGPT has the same unused optimisation potential

Doubtful. As an example, OpenAI released Triton, a programming language explicitly for writing fast GPU kernels. At the scale they're at, they'd be crazy not to throw an engineer or two at optimizing the models they're going to send to production.


Do a Google search for "hardware overhang".

Our brains outperform GPT-3 with only ~30W of power. The potential for software optimization may span many orders of magnitude.


I wonder where that 30W figure comes from. I found claims that a chess player can use 6,000 calories per day, and that doesn't seem to tie in with the number.


The brain mostly uses the same amount of calories all the time. It's the large muscles that can influence energy use.


what packages are you using?

I think I have DiffusionBee installed but I haven't updated it for months


Are there articles detailing the tradeoffs between an A100 system and a consumer system in terms of perf/$? E.g., if you're a startup trying to be scrappy and your choice is between a $100k+ A100 system and an off-the-shelf RTX 40x0 networked cluster, what would be the tradeoff? For embarrassingly parallel use cases like ray-tracing, or a compute backend that services many clients and doesn't need FP64, the consumer system would win. For other use cases that have a hard requirement for NVLinked access to 80GB of RAM per card, the A100 system would win.

Has anyone done benchmarks to show what the tradeoff calculations are for various use cases?


There are a few big ones:

- The CUDA license does not allow you to use GeForce in the data center. In the US it has become less popular, but if you look at our Inspur AIStation piece, that was a cluster located in China with GeForce cards. So it still happens, but less so.

- The memory capacity is another big challenge. Newer models have 80GB, which dwarfs the 24GB on a 4090. We just got the RTX 6000 Ada in, so that is an option for more memory.

- For higher-end training, one of the big challenges is interconnect, so having NVLink and InfiniBand or 100GbE+/InfiniBand NICs is important. The HGX A100 platform is designed for that with its NVSwitch and PCIe switch topology.

With all of that said, you are 100% right that many startups have used consumer cards for years. For example, Andrej Karpathy talked about how our DeepLearning11 build (8x 1080 Ti's) had a ~3 month payback period versus AWS https://twitter.com/karpathy/status/924340245478256640


Andrej Karpathy talked about how our DeepLearning11 build (8x 1080 Ti's) had a ~3 month payback period versus AWS

In 2017. Currently you can rent an 8xA100 server for $8.8/hr: https://lambdalabs.com/service/gpu-cloud

At this price the payback stretches to about 3 years (taking into account average energy costs in the US, and assuming 24/7 operation for the whole 3 years).


A bit of extra context, that's $8.8/hr for the 40GB A100's

The 80GB A100's will run you $12.0/hr for 8 of them.


That's $24k over 3 months, which would buy and power eight 4090 GPUs by my reckoning. Of course those eight 4090s wouldn't have enough memory to run ChatGPT, so maybe AWS is good value after all.


This is Lambda Labs' pricing, which is significantly cheaper than AWS.


Not for whole systems, but the single GPUs (A100/A40) are significantly better in performance per watt (almost by a factor of 2) than RTX cards for machine learning workloads.

You can watch this LTT video to get an idea: https://youtu.be/zBAxiQi2nPc

Probably if you run them full time in an enterprise environment, the power consumption will be the bulk of your cost?


Unless you are planning to have a full system on all the time, renting a loaded 8xA100 machine is almost certainly the way to go for a startup.

As others have pointed out, VRAM is the killer here. This big of a machine is mostly for VRAM coherence, which is a specific use case.

For almost every other type of job lots of single GPUs with shared disk storage will do (particle sim, rendering, oil and gas, etc.)


I was trying to build an ML training machine myself recently. I don't really need the RTX's FP64 or its 4K graphics rendering and gaming stuff; all I need is ML training (mostly FP16, with lots of CUDA cores).

There is no such thing: you either use multiple RTX GPU cards and 'bend' them for ML training, or you buy an A100 40GB at about 125K or an A100 80GB at about 250K, which is far beyond my budget.

Google's TPU could be used to assist ML training, but it does not have the CUDA ecosystem for software; plus, Google keeps its newest TPUs for in-house use only as far as I can tell.

AWS ML EC2 instances are very expensive for me, which is why I want to build my own for mid-sized ML training runs.


Historically Nvidia's professional-grade and consumer products were artificially segmented. The only difference was in their amount of VRAM, a recent example being the Pascal-generation GeForce 1080 vs. the P100, which someone else has compared in detail: https://medium.com/@alexbaldo/a-comparison-between-nvidias-g...

But in recent years Nvidia has added harder gating between their products, the biggest being SLI/NVLink support, which was reduced starting with Pascal and has now been entirely dropped in the latest Ada Lovelace generation, effectively forcing you to buy the professional chips.

A different angle - how long before Google, Microsoft, and Amazon design and deploy their own "GPUs" instead of buying from Nvidia? They've all started on their own ARM SoCs, so surely it's inevitable.


Tim Dettmers [0] has been updating the linked article for a while now. It is a great primer -> deep dive into the world of GPUs for ML/DL tasks. I don't need 4x RTX 4090s but I really want them. As it stands now, I use an old M40 that I modified for water cooling. The only tricky part was configuring Win10 to run the card in WDDM mode as opposed to TCC mode without onboard graphics. To be honest, I am not entirely sure how it works, as the pass-through is an older Quadro card.

How do I justify the purchase of a DGX-1 as a hobby programmer? Call it my mid-life crisis purchase? I mean, it's not that much more than a mid-range 2023 Corvette Stingray.

[0] https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...


Is this the hardware used to build the model, or to run the model? Does this single $100,000 computer run the entire ChatGPT service concurrently for all users around the world? If so, this seems incredibly cheap.


Supercomputer used for training, top 5 in the world apparently:

https://news.microsoft.com/source/features/ai/openai-azure-s...


I would guess that's the minimum needed to run an 'instance' of the model for inference

not sure how many concurrent users that could support, but I'd imagine the public service has a whole bunch of these racks

from what I read... GPT-J at 6B params is too big to load on any consumer GPU (needs ~25GB)

and then GPT-3 is 175B params or ~29x larger, so maybe needing 700-800GB of RAM

8x 80GB A100s would give 640GB of RAM, so it's approximately the right ballpark


> GPT-J at 6B params is too big to load on any consumer GPU (needs ~25GB)

You can load that on M1/M2 Macs, which have unified GPU memory up to 128GB.


The magnificent part is that the 8xA100 is only a blade and you can fill a whole rack with them. Yet one is already incredibly powerful.

Also the successor H100 is already released and at least 2x as powerful.


Also the successor H100 is already released and at least 2x as powerful.

Checks eBay for used A100s...will not be getting one soon. :-)


In machine learning, generally the expensive part is training the model, not running the model.


At this point the costs of hosting the trained model probably surpass the training cost.


It might, I'm not sure. It depends on many factors: how much cloud discount are they getting, for example? And do you count the cost of training all previous models before chatGPT, or just the cost of training chatGPT itself?

But you might be right, chatGPT is insanely popular right now.


Does anyone have numbers on how many queries per second ChatGPT can serve at inference time?


Now we just need to let the models get big enough until they can make themselves smaller again ;)


5kW peak power for the system pictured. I wonder how many watt-hours per query are used on average? Is this $100k investment enough to roll your own ChatGPT-level bot using open-source software?


Not exactly... OpenAI apparently have the 5th most powerful supercomputer in the world. This is probably what they used for training:

https://news.microsoft.com/source/features/ai/openai-azure-s...

https://www.zdnet.com/article/microsoft-builds-a-supercomput...


Thanks for the links. I wonder what percentage of the energy expenditure is the initial training vs. operating the service?


That depends how much it's used, I suppose. Ever-increasing, it looks like. At some point it might be worth investing in more training to reduce operating expense. Pretty standard CAPEX vs OPEX calculation.

On the other hand, we're not counting the energy saved by people using it. What if some scientist uses it and ends up solving fusion power? Or, say, it makes copywriters and programmers redundant (who I assume would go on to become subsistence farmers) - many energy savings to be had!


Soon nvidia & OpenAI will be the only ones with any money so that's fun :)


"Over 80% of this assembly is a heatsink to dissipate massive power."

Not what heatsinks do.


Good catch. Fixed



