While not actually speeding things up, I wonder if it's possible to run a very small autocomplete model on the client side to guess the next sentence from the LLM ahead of time, to at least keep the perceived latency low.
Thus if there is a delay, you still get the stream of text coming at a steady rate, but you may have to backtrack and correct for mispredictions.
Yes, that sort of speculative execution has long existed for NNs, and has enjoyed a recent resurgence in popularity as people speculate that OA has been using that heavily for its APIs & services. It can work very well for outputs that vary greatly in their predictability (like natural language), but can also backfire - nobody likes seeing the predictions thrash around unpredictably, but if you aren't careful, you accept bad predictions or wind up not saving latency or compute at all.
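The accept-or-backtrack step being described can be sketched in a few lines (a toy illustration of the idea only, all names hypothetical; real speculative decoding verifies draft tokens inside the model rather than in the UI):

```python
# Toy sketch of client-side speculative display: a cheap draft predictor
# guesses upcoming tokens so the UI renders at a steady rate, then the
# client backtracks when the real model's tokens arrive and disagree.

def longest_common_prefix(a, b):
    """Number of leading tokens on which the draft and real output agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reconcile(displayed_guess, real_tokens):
    """Return (tokens_to_keep, tokens_to_retract).

    Everything past the common prefix was a misprediction and must be
    visibly corrected -- the "thrashing" mentioned above.
    """
    keep = longest_common_prefix(displayed_guess, real_tokens)
    return real_tokens[:keep], displayed_guess[keep:]

kept, retracted = reconcile(
    ["The", "cat", "sat", "down"],   # what the draft model already showed
    ["The", "cat", "slept"],         # what the LLM actually streamed
)
```

If the draft model is accurate, `retracted` is usually empty and the user never notices; if it's not, the visible corrections can cost more goodwill than the latency they hide.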
Also noticed that MLC+Vulkan has near-zero CPU use, while llama.cpp+OpenCL on the GPU still ends up with a fair bit of CPU use. So it seems to be taking a fundamentally different route.
This was on a Rockchip RK3588, so not exactly conventional AI hardware.
Edit: I suppose I actually haven't documented my prompt template usage with the OpenAI API endpoint...
There are some special roles that enable some extra features with this model. Using LangChain (but this works with the OpenAI API directly):
from datetime import datetime
from langchain.prompts import ChatMessagePromptTemplate

ChatMessagePromptTemplate.from_template(role="meta-task_name", template="summary")
ChatMessagePromptTemplate.from_template(role="meta-current_date", template=datetime.now().strftime("%Y-%m-%d"))
ChatMessagePromptTemplate.from_template(role="system", template="You are a helpful assistant...")
ChatMessagePromptTemplate.from_template(role="user_context", template="Some long document that you want summarized...")
ChatMessagePromptTemplate.from_template(role="human", template="Please create a summary of the document")
ChatMessagePromptTemplate.from_template(role="ai", template="Okay, first let's gather some key information about the document:")
The last "ai" message is optional. The endpoint will complete this message if you include it. You could just not include this and the model will start the response however it wants.
llama.cpp has faster single-threaded inference, but I actually haven't tried its batched inference yet, so I don't know what throughput I can get from it.