In some experiments with a Discord bot, the non-deterministic responses you get from an LLM were the big killer for that use case. Ask the model to summarize the same 100 lines of chat 10 times and you'll get 10 different outputs, each worded subtly differently and each missing different headlines. (Even on state-of-the-art GPT-4.)
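The variation comes from sampling: at temperature > 0 the model draws each token from a probability distribution rather than always taking the most likely one, so repeated runs diverge even on identical input. A minimal toy sketch (not a real LLM, just a softmax sampler over hypothetical next-token logits) shows the effect:

```python
import math
import random

def softmax(logits, temperature):
    """Convert logits to a probability distribution, sharpened or flattened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    """Greedy argmax at temperature 0, otherwise draw from the softmax distribution."""
    if temperature < 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical next-token logits for one decoding step.
logits = [2.0, 1.5, 1.4, 0.5]
rng = random.Random(0)  # seeded here only so the demo is repeatable

sampled = {sample(logits, temperature=1.0, rng=rng) for _ in range(50)}
greedy = {sample(logits, temperature=0.0, rng=rng) for _ in range(50)}

print(sorted(sampled))  # several distinct tokens: sampling at T=1.0 varies run to run
print(greedy)           # only token 0: greedy decoding is deterministic
```

Multiply that per-token randomness across a few hundred tokens of a summary and you get a different wording, and sometimes different content, every time; setting temperature to 0 (or near it) collapses each step to the argmax, which is why it's the usual first lever for making outputs more repeatable.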