There is no latency, because the inference is done locally, on a server at the customer's site with a big GPU.
Every chat bot I was ever forced to use has built-in latency, together with an animated "…" typing indicator to simulate a real user typing. It's the worst of all worlds.
The models return a realtime stream of tokens.
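A minimal sketch of what consuming such a stream can look like, assuming the local model is served behind an OpenAI-compatible endpoint (as llama.cpp's server and similar local stacks provide); the URL, port, and model name here are illustrative assumptions, not the actual deployment:

```python
import json
import requests

# Hypothetical local endpoint; llama.cpp's server and similar local stacks
# expose an OpenAI-compatible /v1/chat/completions route on the same machine.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "local-model",  # placeholder name, depends on the deployment
    "messages": [{"role": "user", "content": "Summarize my open tickets."}],
    "stream": True,          # request a server-sent-events token stream
}

with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # SSE frames arrive as lines prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        # Print each token fragment as soon as it arrives -- no artificial typing delay.
        print(delta.get("content", ""), end="", flush=True)
```

Because the GPU sits on the same box, the tokens can be pushed to the UI the moment they are generated, with no simulated typing animation in between.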
We sell our users a strong server where they have all their data and all their services. The LLM is local, and trained by us.