
The biggest LLaMA model retains near 100% fidelity (it's around 99.3%) at 4-bit quantization, which lets it fit on any 40 GB or 48 GB GPU, which you can get for $3500.

Or at about a 10x speed reduction you can run it on 128 GB of RAM for only around $250.
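
For a rough sanity check on the memory math, here's a back-of-the-envelope sketch (the 65B parameter count is the published size of the largest LLaMA model; the 10% overhead factor is just an assumption):

    # Back-of-the-envelope memory estimate for LLaMA-65B at 4-bit quantization.
    params = 65e9                                      # largest LLaMA model
    bits_per_weight = 4
    weights_gb = params * bits_per_weight / 8 / 1e9    # ~32.5 GB of weights
    total_gb = weights_gb * 1.1                        # ~36 GB with an assumed 10% overhead
    print(f"~{weights_gb:.1f} GB weights, ~{total_gb:.0f} GB with overhead")
    # Small enough for a 40 GB or 48 GB GPU, and comfortable in 128 GB of system RAM.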

The story is not anywhere near as bleak as you paint.




A $3500 GPU requirement is far from democratization of AI.


$3500 is less than one week of a developer salary for most companies. It wouldn't pay a month's rent for most commercial office space.

It's a lower cost of entry than almost any other industry I can think of. A cargo van with a logo on it (for a delivery business or painting business, for example) would easily cost 10-20x as much.


1. I don't know what kind of world you live in to think that USD 3500 is "less than one week of a developer salary for most companies." I think you really just mean FAANG (or whatever the current acronym is) or potentially SV / offices in cities with very high COL.

2. The problem is scaling. To support billions of search queries you would have to invest in a lot more than a single GPU. You also wouldn't only need a single van, but once you take scaling into account even at $3500 the GPUs will be much more expensive.

That said, costs will come down eventually. The question in my mind is whether OpenAI (which already has the hardware resources and is backed by Microsoft funding to boot) will be able to dominate the market to the extent that Google can't make a comeback by the time they're able to scale.


> 1. I don't know what kind of world you live in to think that USD 3500 is "less than one week of a developer salary for most companies." I think you really just mean FAANG (or whatever the current acronym is) or potentially SV / offices in cities with very high COL.

I live in the real world, at a small company with <100 employees, a thousand miles away from SV.

$3,500 * 52 ≈ $180k a year, which covers a $120k salary plus $60k for taxes, social security, insurance, and other benefits. That isn't anywhere near FAANG level.

Even if you cut it in half and say it's 2 weeks of dev salary, or 3 weeks after taxes, it's not unreasonable as a business expense. It's less than a single license for some CAD software.

> 2. The problem is scaling. To support billions of search queries you would have to invest in a lot more than a single GPU. You also wouldn't only need a single van, but once you take scaling into account even at $3500 the GPUs will be much more expensive.

Sure, but you don't start out with a fleet of vans, and you wouldn't start out with a "fleet" of GPUs. A smart business would start small and use their income to grow.


You are correct sir!


I'm only going to comment on the salary bit.

GP is thinking in company terms. The cost of a developer to a company is the developer's salary as stated in the contract, plus some taxes, health insurance, pension, whatever, plus the office rent for the developer's desk/office, plus the hardware used, plus a fraction of the cost of HR staff and offices, cleaning staff, lunch staff... it adds up. $3500 isn't a lot for a week.


Most of these items are paid for by the company, and most people would not consider the separate salary of the janitorial or HR staff to be part of their own salary.


I agree, most people wouldn't. This leads to a lot of misunderstandings, when some people think in terms of what they earn and others in terms of what the same people cost their employers.

So you get situations where someone names a number and someone else reacts by thinking it's horribly, unrealistically high: The former person thinks in employer terms, the latter in employee terms.


1 - Yes, I agree on this, but even so, most developers are already investing in SOTA GPUs for other reasons (so it's not as much of a barrier as purported).

2 - Is scaling not a problem in other industries? If you want to scale your food truck business, you will need more food trucks; this doesn't really do anything for your point.

GGML and GPTQ have already revolutionised the situation, and now there are also tiny models with insane quality that can run on a conventional CPU.

I don't think you have any idea what is happening around you, and this is not me being nasty: just go and take a look at how exponential this development is and you will realise that you need to get in on it before it's too late.


You seem to be in a very particular bubble if you think most developers can trivially afford high-end GPUs and are already investing in SOTA GPUs. I know a lot of devs from a wide spectrum of industries and regions, and I can think of only one person who might be in your suggested demographic.


Perhaps I should clarify that when I say SOTA GPU, I mean an RTX 3060 (midrange), which has 12 GB of VRAM and is a good starting point to climb into the LLM market. I have been playing with LLMs for months now, and for long stretches had no access to a GPU due to daily scheduled rolling blackouts in our country.

Even so, I am able to produce insane results locally with open-source efforts on my RTX 3060, and now I am starting to feel confident enough that I could take this to the next level by either using the cloud (computerender.com for images) or something like vast.ai to run my inference (or even training, if I spend more time learning). If that goes well I will feel confident taking the next step, which is getting an actual SOTA GPU. But that will only happen once I have gained sufficient confidence that the investment will be worthwhile. Regardless, apologies for suggesting the RTX 3060 is SOTA, but to me in a third-world country, being able to run Vicuna-13B entirely on my 3060 at reasonable inference rates is revolutionary.


* in the US


For reference, a basic office computer in the 1980s cost upwards of $8000. If you factor in inflation, a $3500 GPU for cutting-edge tech is a steal.


And hardly anyone had them in the 1980s.


I think a more relevant comparison may be a peripheral: the $7,000 LaserWriter which kicked off the desktop publishing revolution in 1985.


Yes, but Moore's law ain't what it used to be.


Moore is not helping here. Software and algorithms will fix this up, which is already happening at a frightening rate. Not too long ago, like months, we were still debating if it was ever even possible to run LLMs locally.


There is going to be a computational complexity floor on where this can go, just from a Kolmogorov complexity argument. Very hard to tell how far away the floor is exactly but things are going so fast now I suspect we'll see diminishing returns in a few months as we asymptote towards some sort of efficiency boundary and the easy wins all get hoovered up.


Yes indeed and it’ll be interesting to see where that line is.

I still think there is a lot to be gained from just properly and efficiently composing the parts we already have (like how the community handled Stable Diffusion) and exposing them in an accessible manner. I think that’ll take years even if the low-hanging algorithmic fruit starts thinning out.


We've reached an inflection point; the new version would be: Nvidia can sell twice as many transistors for twice the price every 18 months.


This is very true, however there is a long way to go in terms of chip design specific to DL architectures. I’m sure we’ll see lots of players release chips that are an order of magnitude more efficient for certain model types, but still fabricated on the same process node.


Moore's law isn't dead; only Dennard scaling is. See slide 13 here[0] (2021). Moore's law states that the number of transistors per area doubles every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...

...that's probably Koomey's law[1][3], which looks well on track to hold for the rest of our careers. But eventually as computing approaches the Landauer limit[2] it must asymptotically level off as well. Probably starting around year 2050. Then we'll need to actually start "doing more with less" and minimizing the number of computations done for specific tasks. That will begin a very very productive time for custom silicon that is very task-specialized and low-level algorithmic optimization.

[0] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. That's equivalent to a whole OpenAI/DeepMind-worth of power in every smartphone.

The neural engine in the A16 Bionic on the latest iPhones can perform 17 TOPS. The A100 is about 1250 TOPS. Both of these performance metrics are very sensitive to how you measure them, and I'm absolutely not sure I'm comparing apples to bananas properly. However, we'd expect the iPhone has reached its maximum thermal load. So without increasing power use, it should match the A100 in about 6 to 7 doublings, which would be about 11 years. In 20 years the iPhone would be expected to reach the performance of approximately 1000 A100s.

At which point anyone will be able to train a GPT-4 in their pocket in a matter of days.
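
The doubling arithmetic behind that projection, as a quick sketch (the 17 and 1250 TOPS figures are from the comparison above; the ~1.6-year doubling period is the classic Koomey's-law rate, assumed here):

    import math

    # How many performance-per-watt doublings separate the A16's neural engine from an A100?
    a16_tops, a100_tops = 17, 1250
    doublings = math.log2(a100_tops / a16_tops)    # ~6.2 doublings
    years = doublings * 1.6                        # assumed ~1.6-year Koomey doubling period
    print(f"{doublings:.1f} doublings, roughly {years:.0f} years at constant power")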

There's some argument to be made that Koomey himself declared in 2016 that his law was dead[4], but that was during a particularly "slump-y" era of semiconductor manufacturing. IMHO, the 2016 analysis misses the A11 Bionic through A16 Bionic and M1 and M2 processors -- which instantly blew way past their competitors, breaking the temporary slump around 2016 and reverting us back to the mean slope. Mainly note that now they're analyzing only "supercomputers" and honestly that arena has changed, where quite a bit of the HPC work has moved to the cloud [e.g. Graviton] (not all of it, but a lot), and I don't think they're analyzing TPU pods, which also probably have far better TOPS/watt than traditional supercomputers like the ones on top500.org.

0: (Slide 13) https://www.sec.gov/Archives/edgar/data/937966/0001193125212...

1: "The constant rate of doubling of the number of computations per joule of energy dissipated" https://en.wikipedia.org/wiki/Koomey%27s_law

2: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation." https://en.wikipedia.org/wiki/Landauer%27s_principle

3: https://www.koomey.com/post/14466436072

4: https://www.koomey.com/post/153838038643


You're right, it's been debunked and misquoted for decades.


Virtually no office had them in 1980.

By the mid-1980s, personal computers cost less than $500.


This is false. Created account just to rebut.

While it may have been true that it was technically possible to assemble a PC for $500... good luck. In the real world people were spending $1500-$2500+ for PCs, and that price point held remarkably constant. By the time you were done buying a monitor, external drives, printer, etc., $3000+ was likely.

https://en.m.wikipedia.org/wiki/IBM_Personal_Computer#:~:tex....

Or see the Apple Mac 512K, introduced at approximately $2800. One reason this was interesting (if you could afford it) was the physical integration and elimination of "PC" cable spaghetti.

https://en.m.wikipedia.org/wiki/Macintosh_512K

But again, having worked my way through college with an 8 MB external hard drive (which was a huge improvement over having to swap and preload floppies in twin external disk drives just to play stupid (awesome) early video games), all of this stuff cost a lot more than you're saying. And it continued to, well into the 90s.

Of course there are examples of computers that cost less. I got a TI-99/4a for Christmas which cost my uncle about $500-600. But then you needed a TV to hook it up to, and a bunch of tapes too. And unless you were a nerd and wanted to program, it didn't really DO anything. I spent months manually recreating arcade video games for myself on that. Good times. Conversely, if you bought an IBM or Apple computer, odds were you were also going to spend another $1000 or more buying shrinkwrap software to run on it. Rather than writing your own.

Source: I remember.


> This is false.

The CPC 464 was the first personal home computer built by Amstrad, in 1984. It was one of the best-selling and most-produced microcomputers, with more than 2 million units sold in Europe.

Price: £199 (with green monitor), £299 (with colour monitor)

> But again having worked my way thru college with an 8 MB external hard drive

that was a minicomputer at the time, not a PC (personal computer)

> Source: I remember.

Source: I owned, used and programmed PCs in the 80s


The Commodore 64 launched in 1982 for 595 bucks. Wikipedia says "During 1983, however, a trickle of software turned into a flood and sales began rapidly climbing, especially with price cuts from $595 to just $300 (equivalent to $1600 to $800 in 2021)."

I believe this was the PC of the ordinary person (In the "personal computer" sense of the word.)


Yeah, I bet people won't get cars, either; they're a lot more expensive than that.


And that $3500 worth of kit will be a couple of hundred bucks on eBay in 5 years.


I haven't seen any repos or guides to running LLaMA on that amount of RAM, which is something I do have. Any pointers?


Run text-generation-webui with llama.cpp: https://github.com/oobabooga/text-generation-webui
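
If you'd rather script it than use the web UI, a minimal sketch with the llama-cpp-python bindings looks something like this (the model path is hypothetical; any 4-bit GGML file that fits in your RAM should work):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Load a 4-bit GGML model entirely into system RAM (path is hypothetical).
    llm = Llama(model_path="./models/llama-65b.ggml.q4_0.bin", n_ctx=2048)

    out = llm("Q: Why does 4-bit quantization reduce memory use? A:", max_tokens=128)
    print(out["choices"][0]["text"])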



Something I haven't figured out: should I think about these memory requirements as comparable to the baseline memory an app uses, or like per-request overhead? If I needed to process 10 prompts at once, do I need 10x those memory figures?


It’s like a database, I imagine, so the answer is probably that it’s unlikely you need extra memory per request; instead you run out of cores to handle requests?

You need to load the data so the graphics cards (where the compute is) can use it to answer queries. But you don’t need a separate copy of the data for each GPU core, and, though slower, cards can share RAM. Even with parallel cores, your server can only process so many queries at a time before it runs out of compute resources. Each query isn’t instant either, given how GPT-4 answers stream in real time yet still take a minute or so. Plus, the way the cores work, it likely takes more than one core to answer a given question, likely hundreds of cores computing probabilities in parallel or something.

I don’t actually know any of the details myself, but I did do some CUDA programming back in the day. The expensive part is often because the GPU doesn’t share memory with the CPU, and to get any value at all from the GPU to process data at speed you have to transfer all the data to GPU RAM before doing anything with the GPU cores…

Things probably change quite a bit with a system on a chip design, where memory and CPU/GPU cores are closer, of course. The slow part for basic replacement of CPU with GPU always seemed to be transferring data to the GPU, hence why some have suggested the GPU be embedded directly on the motherboard, replacing it, and just put the CPU and USB on the graphics card directly.

Come to think of it, an easier answer is how much work can you do in parallel on your laptop before you need another computer to scale the workload? It’s probably like that. It’s likely that requests take different amounts of computation - some words might be easier to compute than others, maybe data is local and faster to access or the probability is 100% or something. I bet it’s been easier to use the cloud to toss more machines at the problem than to work on how it might scale more efficiently too.


Does that mean an iGPU would be better than a dGPU? A beefier version than those of today though.


Sort of. The problem with most integrated GPUs is that they don’t have as many dedicated processing cores and the RAM, shared with the system, is often slower than on dedicated graphics cards. Also… with the exception of system on a chip designs, traditional integrated graphics reserved a chunk of memory for graphics use and still had to copy to/from it. I believe with newer system-on-a-chip designs we’ve seen graphics APIs e.g. on macOS that can work with data in a zero-copy fashion. But the trade off between fewer, larger system integrated graphics cores vs the many hundreds or thousands or tens of thousands of graphics cores, well, lots of cores tends to scale better than fewer. So there’s a limit to how far two dozen beefy cores can take you vs tens of thousands of dedicated tiny gfx cores.

The theoretical best approach would be to integrate lots of GPU cores on the motherboard alongside very fast memory/storage combos such as Octane, but reality is very different because we also want portable, replaceable parts and need to worry about silly things like cooling trade offs between placing things closer for data efficiency vs keeping things spaced apart enough so the metal doesn’t melt from the power demands in such a small space. And whenever someone says “this is the best graphics card,” someone inevitably comes up with a newer arrangement of transistors that is even faster.


You need roughly model size + n * (prompt + generated text), where n is the number of parallel users/requests.


It should be noted that that last part has a pretty large factor to it that also scales with model size, because to run transformers efficiently you cache some of the intermediate activations from the attention block.

The factor is basically 2 * number of layers * number of embedding values (stored in e.g. fp16) per token.
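
To put rough numbers on that, here's a sketch using LLaMA-65B's published shape (80 layers, 8192-wide hidden state) and an assumed fp16 cache:

    # Per token, the KV cache holds 2 (keys + values) * n_layers * hidden_dim values.
    n_layers, hidden_dim = 80, 8192    # LLaMA-65B's published shape
    bytes_per_value = 2                # fp16 cache (assumed)

    per_token = 2 * n_layers * hidden_dim * bytes_per_value    # ~2.6 MB per token
    per_sequence = per_token * 2048                            # a full 2048-token context
    print(f"{per_token / 1e6:.1f} MB per token, {per_sequence / 1e9:.1f} GB per full-context sequence")
    # So each concurrent request can add a few GB on top of the model weights.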


That's just to run a model already trained by a multi-billion dollar company. And we are "lucky" a corporation gave it to the public. Training such a model requires tons of compute power and electricity.



