The Inference Cost of Search Disruption – Large Language Model Cost Analysis (semianalysis.com)
78 points by lxm on Feb 10, 2023 | hide | past | favorite | 35 comments



Few things to note:

- Google qps is closer to 100k than 320k [1]

- Not every query has to run on an LLM. Probably only 10% would benefit from it

- This means 10,000 queries per second, each needing 5 A100s to run, so 50,000 A100s are sufficient. Cost for that is $500MM, quadruple that to $2B with CPU/RAM/storage/network. That is peanuts for Google. (Back-of-the-envelope sketch below the list.)

- Let's say AI unlocks new, never-before-seen queries ('do more people live in Madrid or Tel Aviv') and those are 10x in volume. So capex is now $20B, still peanuts.

- Latency, not cost, is a bigger issue. This should be addressed soon by H100 and newer chips.

- The main issue for the consumer remains the business model. Who will pay for LLMs (user or the advertiser) and will Google and Microsoft stuff ads into LLM responses?

[1] https://www.internetlivestats.com/one-second/#google-band
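For reference, a minimal sketch of that back-of-envelope arithmetic (the ~$10k per-A100 price is only implied by the $500MM / 50k figure above, so treat it as an assumption):

    # Back-of-envelope capex, using the numbers in the list above
    llm_qps = 10_000            # ~10% of ~100k total qps
    a100s_per_query = 5
    a100_unit_price = 10_000    # USD, assumed (implied by $500MM for 50k GPUs)

    a100_count = llm_qps * a100s_per_query            # 50,000 GPUs
    gpu_capex = a100_count * a100_unit_price           # ~$500MM
    system_capex = gpu_capex * 4                        # + CPU/RAM/storage/network -> ~$2B
    with_new_queries = system_capex * 10                # if AI unlocks 10x new query volume -> ~$20B

    print(f"GPUs: {a100_count:,}, capex: ${system_capex/1e9:.0f}B -> ${with_new_queries/1e9:.0f}B")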


- Google qps is closer to 100k than 320k [1]

That number is wrong. I have a figure from a Googler, not from livestats, which can't have Google's internal data.

- Not every query has to run on an LLM. Probably only 10% would benefit from it

Agree, I have something different coming up that looks into this more; 10% may be too low. I know I used 100%, which isn't right, and I explicitly say that.

- This means 10,000 queries per second, each needing 5 A100s to run, so 50,000 A100s are sufficient. Cost for that is $500MM, quadruple that to $2B with CPU/RAM/storage/network. That is peanuts for Google.

50k A100s plus networking cost way more than $2B once you factor in realistic hardware utilization rates.

- Latency, not cost, is a bigger issue. This should be addressed soon by H100 and newer chips.

That's discussed in the subscriber section. It's both, but yes, latency is the bigger issue. The H100 helps but doesn't solve it.


> Who will pay for LLMs (user or the advertiser) and will Google and Microsoft stuff ads into LLM responses?

Capex: $20B

Driving Google's margin to zero and feasting on their remains: priceless

https://www.ft.com/content/2d48d982-80b2-49f3-8a83-f5afef98e...

> “From now on, the [gross margin] of search is going to drop forever,” Nadella said in an interview with the Financial Times.

> “There is such margin in search, which for us is incremental. For Google it’s not, they have to defend it all,” he added, referring to the competition against Google as “asymmetric”.


The continued decay of search quality will be painful, as the economics make "good" consumer-oriented results uneconomical. I also fear whatever monetization approaches Google comes up with in search of margin. Corporations do really dumb things in service of monetization (see Hasbro / Wizards of the Coast with the D&D controversy of late).


I would expect dedicated hardware acceleration for LLMs if they become very relevant. That would improve latency and costs.

I use ChatGPT throughout the day as a widely knowledgeable coworker that periodically eats a bag of mushrooms, and have paid the $20 / month. I’d note this description matches all the most useful coworkers I’ve had, except that they cost a ton more per month.


Before spending $$$ on hardware accelerators, there is still a lot to achieve using software optimizations for inference and training workloads. We optimized GPT-J, and others, for real. https://centml.ai/benchmarks/


You could also port that code from Nvidia CUDA to something else. If you wanna run on servers, Vulkan Compute will probably do the trick. I have ported ML inference to Windows desktops with DirectCompute: https://github.com/Const-me/Whisper

The NVIDIA A40 GPUs mentioned on the page you've linked cost $4,000 (Amazon) to $10,000 (Dell) and deliver similar performance to an Intel A770, which costs $350. That's an order of magnitude difference in cost efficiency.


What specific hardware acceleration is possible? I think TPUs are already as good as you can get for accelerating large MLPs (half of what an LLM is). Is there that much overhead for the attention blocks?


The TPU is a pretty conservative design; there are 2x-5x speedups available with full-stack optimization, and then some.


No, TPUs are getting better architecturally.


Better to treat it as a devious and eloquent coworker committing corporate sabotage for a competing company, who is paid by the factoid to get you to believe incorrect things. Google failed at this by not being nearly paranoid enough when putting their Bard demo together. Everyone laughed at them, but ChatGPT does exactly the same thing.

(Someone will probably claim ChatGPT would be less likely than Bard to commit the sort of fabrication spotted in the demo, but it's not. I gave ChatGPT the exact same question but replaced JWST with CERN and it told me antimatter was discovered at CERN when according to CERN's own website [1] it was both predicted and observed decades before CERN existed. I didn't have to fish for this. It was the first result for basically the exact same question and was a highly believable fabrication of the exact same kind, attributing a discovery that had been made earlier. I switched JWST for CERN just in case there has been any special training regarding JWST in light of the publicity - no idea if there has been, of course.)

[1] https://timeline.web.cern.ch/timeline-header/86


Did you read the response carefully? I know there's a randomness factor in the response, but the response I got to your prompt (it would be helpful to provide the exact prompt) said that CERN studies antimatter, not that it discovered it. I also asked it how antimatter was discovered and it accurately explained the timeline.

What new discoveries from the CERN can I tell my 9 year old about?

Here are a few recent discoveries from CERN that you can explain to your 9-year-old:

Higgs Boson: CERN scientists discovered the Higgs Boson particle in 2012, which helps explain how other particles have mass. You can explain it to your child as a particle that helps give everything in the universe its size and weight.

Antimatter: CERN scientists study antimatter, which is a type of matter that has the opposite charge of normal matter. This helps us understand more about the basic building blocks of the universe.

Dark Matter: CERN scientists are searching for evidence of dark matter, which is a mysterious substance that makes up most of the universe but cannot be seen directly. Scientists believe that dark matter could help explain the behavior of galaxies and other celestial objects.

Neutrinos: CERN scientists study neutrinos, which are tiny particles that are produced in the sun and other celestial objects. These particles can travel through the universe and can help us understand more about how the universe works.

Remember, it's important to keep the explanations simple and use examples that your child can relate to, like using a ball as a representation of a particle.


I always read what ChatGPT says carefully :-D The conversation is copied below. (In case it's unclear, when I mentioned it was the "first response" I meant I didn't cheat by getting it to regenerate the answer until I saw an untruth. This is the output it produced on the first try. The erroneous item is the second one on its list. Its mini-ELI5 of what antimatter is, is also misleading, though technically not untrue.)

What CERN discoveries can I tell my 9yo about?

Here are some CERN discoveries that you can explain to your 9-year-old:

1. Higgs boson particle: This particle was discovered at CERN and helps explain why other particles have mass. Think of it like an invisible glue that holds everything together.

2. Antimatter: CERN is also known for discovering antimatter, a type of matter that has the opposite charge of normal matter. When matter and antimatter meet, they annihilate each other and release a huge amount of energy.

3. The Large Hadron Collider (LHC): This is a huge machine that CERN built to study particles and the fundamental nature of the universe. It's like a giant microscope for particles!

4. The internet: CERN played a major role in the development of the internet, which allows people to communicate and share information all over the world.

5. Dark matter and dark energy: CERN scientists are working to understand these mysterious substances that make up most of the universe. Dark matter and dark energy don't interact with light, so they are invisible to us. But scientists can detect their presence by observing their effects on other objects.

I hope this information helps you explain some of the exciting discoveries that CERN has made!


So, I use ChatGPT as a wingman while I program, to give me explanations and snippets. I don't think it's malicious, and I'd say it's not a whole lot worse than my human coworkers in terms of coming up with malarkey. Humans say and do all sorts of wacky shit with full confidence. I stay credulous, and I'm not asking it to do my work; I'm asking it to explain edges I'm unfamiliar with.

If I find it hallucinated an answer, it usually means what I want to do isn't actually possible, but it contrived a plausible way to answer that could be true. If I push around the edges and correct it but it keeps coming up with garbage, I can be pretty certain my line of reasoning is at a dead end and I need to backtrack or accept I need a different solution to a different question. I'll verify this using Kagi, and it's usually pretty trivial to verify I'm right. This feels exactly like my experience over the last 30 years developing software with human coworkers.


A service that gets 100k qps on average can expect to get 200k qps at peak due to the human diurnal cycle. Services also need regional and global failover capacity and generally would not be run right at the redline all the time, so you can figure in more slop factors. Now you are closer to 320k than to 100k.

Also, the main factor in your analysis, 10%, was just completely made up by you.
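
As a rough sketch (the peak, failover, and headroom multipliers here are illustrative guesses, not measured figures):

    # Illustrative provisioning math: average load vs. capacity you actually build
    avg_qps = 100_000
    peak_factor = 2.0       # diurnal peak vs. average (assumed)
    failover_factor = 1.3   # regional/global failover capacity (assumed)
    headroom_factor = 1.25  # don't run at the redline (assumed)

    provisioned_qps = avg_qps * peak_factor * failover_factor * headroom_factor
    print(f"Provision for ~{provisioned_qps:,.0f} qps")  # ~325,000 -- closer to 320k than 100k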


Fair points.


> The main issue for the consumer remains the business model. Who will pay for LLMs (user or the advertiser)

I had a really weird idea the other day. What if the government invested in compute infrastructure like it does in roads? Give each consumer X compute credits that they can use directly at cloud providers or trade in to services to help offset (or completely pay for) their own use.

I would use it to run my open source infra. I imagine my parents would use it to decrease their Netflix bill (for example).

I think it could unlock interesting non-marketing based “free” services. But it’s a hard concept to pitch to the average person who wouldn’t know why they would want 1000 CPU hours (for example).


Government should be building this infra for the benefit of citizens, for vital services such as search. There should be a public search engine. Public libraries were good enough for that role but not anymore.


I think public competition for digital services would be amazing. I also think they should be a player, not take over the game.

I think the answer would be for them to host a competing service AND allow people to use their credits to choose that service or a different vendor.


This article is based on wild speculations of how much things cost, and doesn’t account for ways to make things cheaper over time. We already have papers that suggest most big models are undertrained and smaller models can get the same accuracy.

Google should easily spend the $33B if it keeps them on top while finding ways to make things cheaper, more accurate, smarter.


> This article is based on wild speculations of how much things cost

Huh? It literally uses real throughput figures.

> doesn’t account for ways to make things cheaper over time

It does in the subscriber section, and the free section says that it does.

> We already have papers that suggest most big models are undertrained and smaller models can get the same accuracy.

Why assume a 2023 model is the same as the 175B-parameter GPT-3 from 2020?


How much is OpenAI able to cache query results? There's a major difference in speed on some queries, and in my experience the queries on which it is slow to respond are almost always very unusual and difficult-to-answer queries. I feel that caching responses would be very difficult, and the answers ChatGPT gives vary quite a bit for the same query, but it also doesn't make sense to me why response times should differ so significantly based on query difficulty if all it's doing is inference in a language model.


It's not about caching. Beam searching through a search space with no obvious "answers" is naturally hard.
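
To illustrate (a toy model only; the per-token latency and beam width values are made up, not OpenAI's serving numbers): autoregressive decoding pays a roughly fixed cost per generated token, so longer answers and wider searches take proportionally longer regardless of caching.

    # Toy model of why response time tracks answer length / search width (made-up numbers)
    def estimated_latency_ms(output_tokens, beam_width=1, ms_per_token=40.0):
        # each decoding step scores beam_width candidate continuations and emits one token
        return output_tokens * beam_width * ms_per_token

    print(estimated_latency_ms(50))                  # short factual answer: ~2,000 ms
    print(estimated_latency_ms(500, beam_width=4))   # long, unusual answer: ~80,000 ms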


Dang, Satya Nadella isn't pulling any punches here. I love those quotes. "I want people to know that we made [Google] dance"! Can you imagine Sundar ever saying something like that?

Anyway, this is the reason Google invested in TPUs early on. They may have to start spending more on building out TPU clusters, but at least they don't have to pay Nvidia's profit margins to do it.


This is all too soon to decide:

- We don't know what costs will look like after intensive optimization.

- The results of ChatGPT are not trustworthy. If someone solves or greatly improves on that, they'll win.

- Even if it means a reduction in gross margin, Google has no choice. They'll accept that fact sooner or later.


Fascinating article. The takeaway for me, above everything else, is that search is going to change, significantly. Deploying the lighter model sounds like a mistake to me regardless of margins.


Agreed, deploying the lighter model seems like the quickest way to leave a bad impression. Wonder if doing a slow "exclusive" rollout with invites (like with Gmail) would be better from a marketing perspective.


I think the mistake is actually conflating the use cases here. Google has a nice lucrative business model with search. AI offers new business models on top of that. Some of these are going to be extremely valuable to users. To the point where companies might charge users for that and get lots of users to pay up.

The mistake here is assuming the business models will stay the same (i.e. free usage supported by ads). I think that's a mistake.

The second mistake is assuming people use search engines to answer questions or generate content. They might. Or they might just be looking for a particular web page or article. The former is kind of a new use case and business model. The latter is what Google is already doing quite well.

If I were Google, I would make this a paid feature with an initially generous freemium layer to lure people in and try before they buy. That would allow them to control the pace of rolling this out and allow them to focus on not having to cripple the experience.

MS is doing this indirectly via Office 365 and Teams, which are already paid products, and by partnering with OpenAI. Bing is a relatively minor business for them at this point, so they can experiment there without risking much. MS is not going to be able to absorb most of Google's users anytime soon. My guess is that even MS does not believe that is actually going to happen. What might happen is that a lot of early adopters will come along and then find their way to their paid products. From a revenue point of view that seems like a good plan to me.

The issue with Google is that right now they are just responding to what MS is doing. They don't seem to yet have a coherent strategy of their own beyond just coasting along on ad revenue.


I feel like the majority of queries won't benefit from LLM responses, and a huge chunk of those that do could be easily pre-computed and don't need to hit the model on each search; pretty much all queries about historical figures and events just need to be summarized once.


I think questions will be different once we can pose human questions rather than weird cryptic search terms to trick an information retrieval system into finding text blobs that might be relevant. For instance, instead of searching for "George Washington" I could ask "What did George Washington write in his journals about the Continental Congress?" I wouldn't expect this answer to be precomputed. FYI, ChatGPT answered that he didn't keep journals about this, which I know to be true despite his being a prolific journal keeper. It answered that he did, however, write letters, and that they were about his personal affairs but also his inspiration regarding American independence. I then asked it who he wrote to. It provided a bunch of folks, including General Gates, which is all correct. I then asked it for details on what Gates and Washington discussed. Etc.

The value isn't that it can summarize a biography of George Washington. The value is I can get a direct answer to my direct questions, then refine my questions to the depth and subjects I'm interested in. Information retrieval systems can just give me potentially relevant text blobs I need to sift, parse, and analyze with iterative queries until I give up in disgust at the state of SEO.


"and $1 cost per GPU hour."

"We estimate the cost per query to be 0.36 cents."

"<2000ms latency"

"50% hardware utilization rates"

Assuming all queries take 2s this implies that ( 1 * 3600 / ( 2 / 0.5 ) ) * 0.36 = 324 GPUs would be used for each query. This seems implausibly high.


Your math is completely wrong dude.

It's 2000ms per token not for the whole query.

Hardware utilization rates and MFU are not the same thing, you forgot the latter.

You are pretending it's perfectly parallelized on 1 GPU too. I use 8x GPU box throughput.


0.36 cents or $0.0036. Your math should be 3.24 GPU/query.


You’re 100% right, my bad!

And that puts the estimate entirely in the realm of the possible. Whoops!
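
For anyone following along, a sketch of the corrected arithmetic under the same simplifying assumptions as the original comment ($1 per GPU-hour, 2 s per query, 50% utilization, 0.36 cents per query):

    # Corrected back-of-envelope: implied GPUs busy per query
    gpu_cost_per_hour = 1.00     # USD (article's assumption)
    query_seconds = 2.0          # assumed whole-query latency, as in the original comment
    utilization = 0.5
    cost_per_query = 0.0036      # 0.36 cents, in dollars

    queries_per_gpu_hour = 3600 / (query_seconds / utilization)               # 900
    gpus_per_query = cost_per_query / (gpu_cost_per_hour / queries_per_gpu_hour)
    print(f"{gpus_per_query:.2f} GPUs per query")                             # ~3.24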


I think we got something right here. Deep analysis tech journalism is back.

Great piece.



