Ask HN: How does LLM serving infra work?
26 points by Maro 3 months ago | 2 comments
I've been wondering about this for a while. Although building a competitive production-grade LLM [like ChatGPT-4o] is very far from trivial and the devil is in the details, the basic ideas and architecture are fairly well understood.

What I've been wondering about is how production serving of very large LLMs [like ChatGPT-4o] works. How many copies/instances of the model are running? How big are the GPU clusters? How many GPUs go into one instance? Is execution somehow interleaved, so that different layers can execute different steps of different users' queries at the same time, to avoid K-1 layers of a K-layer NN idling? How is the conversation state stored and re-applied each time a new query comes in? What kind of replication and fault-tolerance considerations are there?




> How is the conversation state stored and re-applied each time

It's not; these systems are almost all stateless. Your entire chat history is fed in with each prompt, and any conversational memory comes purely from it being in the prompt.
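
To make that concrete, here's a minimal client-side sketch in Python (assuming an OpenAI-style chat-completions HTTP API; the URL, key, and model name are placeholders, not anything official). The client owns the history and re-sends all of it on every turn:

    # Sketch only: a stateless chat loop where the client keeps the history
    # and re-sends the whole thing with every request.
    import requests  # assumes an OpenAI-compatible /v1/chat/completions endpoint

    API_URL = "https://api.openai.com/v1/chat/completions"  # placeholder endpoint
    API_KEY = "sk-..."                                       # placeholder key

    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def ask(user_text):
        history.append({"role": "user", "content": user_text})
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "gpt-4o", "messages": history},  # full history every call
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": answer})
        return answer

    # Each call sends the whole conversation so far; the server keeps no state.
    print(ask("What is a KV cache?"))
    print(ask("And how big does it get?"))  # "it" resolves only because turn 1 is re-sent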

OpenAI does have its memory feature and custom instructions you can set up, but these are just injected into the prompt somewhere as well.
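
Roughly like this (the names and structure below are my guess at how it could work, not OpenAI's actual code):

    # Hypothetical sketch of "memory" injection: stored facts and custom
    # instructions are prepended to the message list before the model sees it.
    stored_memories = [
        "User's name is Maro.",
        "User prefers concise answers.",
    ]
    custom_instructions = "Answer like a systems engineer."

    def build_messages(history):
        system_text = (
            custom_instructions
            + "\n\nKnown about the user:\n"
            + "\n".join(f"- {m}" for m in stored_memories)
        )
        # The model never "remembers" anything itself; it only ever sees this text.
        return [{"role": "system", "content": system_text}] + history

    print(build_messages([{"role": "user", "content": "Hi"}])[0]["content"])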


Generally, having to unload data from GPU RAM and load a new set of weights in is quite expensive, so my guess is that the backend is built so that an incoming request gets a reservation on some cluster that already holds the right weights, based on some ordering, and the batch is run through.
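
A toy version of that pattern might look like the following (hypothetical names; real serving stacks do continuous batching at the token level, which is more involved than this). The point is that the weights are loaded once per worker and never swapped out, while a router groups requests into batches:

    # Sketch: one worker per GPU keeps its weights resident forever;
    # incoming prompts queue up and are run through in small batches.
    import queue, threading, time

    request_q = queue.Queue()  # holds (prompt, reply_queue) pairs

    def load_model():
        # Stand-in for loading many GB of weights into GPU RAM, done once.
        return lambda prompts: [f"echo: {p}" for p in prompts]

    def worker(max_batch=8, max_wait_s=0.01):
        model = load_model()  # expensive, done once, never unloaded
        while True:
            prompt, reply_q = request_q.get()          # block for the first request
            batch, replies = [prompt], [reply_q]
            deadline = time.time() + max_wait_s
            while len(batch) < max_batch and time.time() < deadline:
                try:
                    prompt, reply_q = request_q.get(timeout=max(0.0, deadline - time.time()))
                    batch.append(prompt)
                    replies.append(reply_q)
                except queue.Empty:
                    break
            outputs = model(batch)                     # one forward pass for the whole batch
            for out, rq in zip(outputs, replies):
                rq.put(out)

    threading.Thread(target=worker, daemon=True).start()

    def submit(prompt):
        rq = queue.Queue(maxsize=1)
        request_q.put((prompt, rq))
        return rq.get()

    print(submit("hello"))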



