I've been wondering about this for a while. Although building a competitive production-grade LLM [like ChatGPT-4o] is very far from trivial and the devil is in the details, the basic ideas and architecture are fairly well understood.
For a while, I've been wondering how production serving of very large LLMs [like ChatGPT-4o] works. How many copies/instances of LLMs are running? How big are the GPU clusters? How many GPUs go into one instance? Is execution somehow interleaved, so different layers can execute different steps of different users' queries at the same time, to avoid K-1 layers of a K-layer NN idling? How is the conversation state stored and re-applied each time a new query comes in? What kind of replication and fault-tolerance considerations are there?
It's not; the serving instances are almost all stateless. Your entire chat history is fed in with each prompt. Any conversational memory comes just from it being in the prompt.
OpenAI does have its memories feature and system prompt instructions you can set up. These are just injected into the prompt somewhere as well.
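A minimal sketch of what that looks like from the client side, assuming the standard chat-completions-style API; the model name and "memories" text are placeholders. The point is that the full history, plus any injected memory/system text, is re-sent on every turn:

```python
# Stateless chat: the server keeps no conversation state, so the client
# re-sends the entire history (plus injected "memories") each request.
from openai import OpenAI

client = OpenAI()

# Persistent "memories" / custom instructions are just text placed into
# a system message at the front of the prompt. (Placeholder contents.)
system_prompt = (
    "You are a helpful assistant.\n"
    "User memories: prefers concise answers; time zone is UTC."
)

history = [{"role": "system", "content": system_prompt}]

def ask(user_message: str) -> str:
    # Append the new user turn, then send the ENTIRE history each time.
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",       # placeholder model name
        messages=history,     # full conversation so far, every request
    )
    answer = response.choices[0].message.content
    # Keep the assistant's reply client-side so the next turn includes it.
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What is KV caching?"))
print(ask("How does that relate to what you just said?"))  # context comes only from `history`
```

So the "state" lives entirely in the prompt the client (or OpenAI's frontend) reconstructs each turn; any given GPU instance can serve the next request for that conversation without having seen the previous ones.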