Optimizing LLM Latency (hamel.dev)
40 points by freediver 8 months ago | 11 comments



While it wouldn't actually speed things up, I wonder if it's possible to run a very small autocomplete model on the client side to guess the LLM's next sentence ahead of time, just to keep the perceived latency low.

That way, if there is a delay, you still get the stream of text coming in at a steady rate, but you may have to backtrack and correct for mispredictions.
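
Roughly what that could look like on the client, as a toy sketch (none of this is from the article; draft_guess and the incoming chunk stream are hypothetical stand-ins): render a cheap local guess after the confirmed text, and erase it whenever the real tokens disagree.

  import sys

  def draft_guess(prefix: str) -> str:
      """Hypothetical tiny client-side model: guess the next few words."""
      return " and then ..."  # placeholder prediction

  def render(confirmed: str, speculative: str) -> None:
      # Redraw the line: confirmed text, then the current guess in dim grey.
      sys.stdout.write("\r\033[K" + confirmed + "\033[90m" + speculative + "\033[0m")
      sys.stdout.flush()

  def stream_with_guess(real_chunks) -> str:
      confirmed, guess = "", draft_guess("")
      for chunk in real_chunks:           # chunks from the actual LLM stream
          if guess.startswith(chunk):
              guess = guess[len(chunk):]  # guess was right so far, consume it
          else:
              guess = ""                  # misprediction: backtrack the guess
          confirmed += chunk
          if not guess:
              guess = draft_guess(confirmed)  # refill while waiting for more
          render(confirmed, guess)
      render(confirmed, "")               # stream finished, drop leftover guess
      print()
      return confirmed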


Yes, that sort of speculative execution has long existed for NNs, and it has enjoyed a recent resurgence in popularity as people speculate that OA has been using it heavily for its APIs & services. It can work very well for outputs that vary greatly in their predictability (like natural language), but it can also backfire: nobody likes seeing the predictions thrash around unpredictably, and if you aren't careful, you either accept bad predictions or wind up not saving latency or compute at all.
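
For reference, the server-side version of this trick (speculative decoding) boils down to an accept/reject loop like the greedy sketch below. draft_model and target_model are hypothetical callables mapping a token list to a {token: probability} dict; real implementations score all k positions in one target-model forward pass and accept probabilistically so the output distribution exactly matches the target model.

  def speculative_step(prefix, draft_model, target_model, k=4):
      # 1. The small draft model proposes k tokens greedily (cheap).
      proposed, ctx = [], list(prefix)
      for _ in range(k):
          dist = draft_model(ctx)
          tok = max(dist, key=dist.get)
          proposed.append(tok)
          ctx.append(tok)

      # 2. The big target model checks the proposal; keep the longest prefix
      #    it agrees with, then emit one token of its own as a correction.
      accepted, ctx = [], list(prefix)
      for tok in proposed:
          target = target_model(ctx)
          if tok == max(target, key=target.get):
              accepted.append(tok)
              ctx.append(tok)
          else:
              break
      target = target_model(ctx)
      correction = max(target, key=target.get)
      return accepted + [correction]  # always >= 1 token per target-model pass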


> mlc is the fastest

Also noticed that mlc+Vulkan has near-zero CPU use, while llama.cpp with OpenCL on the GPU still ends up with a fair bit of CPU use. So it seems to be taking a fundamentally different route.

This was on a Rockchip RK3588, so not exactly conventional AI hardware.


Keen to see how this compares with Llama.cpp on the same system.


Yeah, seems like a glaring omission?


Article discussing LLM performance omits the two most performant libraries: exllama and llama.cpp??


They aren't the fastest. I get over 1k tokens/sec on Llama 2 13B on my 2x 3090s with vLLM.

They both need to fix their batch inference, and then they'll be more competitive. They're still competitive, though.


Any more information on how you are running this?


Nothing special, other than that I am running my fork of vLLM, which supports HF prompt templates: https://github.com/vllm-project/vllm/pull/1365

Using --tensor-parallel-size 2 since I have 2 GPUs on my machine.

Running my llama2 13b finetune for summarization and knowledge graph tasks: https://huggingface.co/Tostino/Inkbot-13B-8k-0.2
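
For context, this kind of throughput is straightforward to reproduce with the stock vLLM Python API; a minimal sketch (upstream vLLM rather than the fork above, and the sampling settings and prompts are just placeholders):

  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Tostino/Inkbot-13B-8k-0.2",  # the finetune linked above
      tensor_parallel_size=2,             # shard across the 2x 3090s
  )

  params = SamplingParams(temperature=0.7, max_tokens=256)
  prompts = [f"Summarize document {i}: ..." for i in range(64)]  # toy batch

  # Continuous batching across many prompts at once is where the high
  # tokens/sec numbers come from.
  for output in llm.generate(prompts, params):
      print(output.outputs[0].text[:80])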

Edit: I suppose I actually haven't documented my prompt template usage with the OpenAI API endpoint...

There are some special roles that enable some extra features with this model. Using LangChain here (but it works with the OpenAI API directly too):

  from datetime import datetime
  from langchain.prompts import ChatMessagePromptTemplate

  ChatMessagePromptTemplate.from_template(role='meta-task_name', template='summary')
  ChatMessagePromptTemplate.from_template(role='meta-current_date', template=datetime.now().strftime('%Y-%m-%d'))
  ChatMessagePromptTemplate.from_template(role='system', template='You are a helpful assistant...')
  ChatMessagePromptTemplate.from_template(role="user_context", template="Some long document that you want summarized...")
  ChatMessagePromptTemplate.from_template(role="human", template="Please create a summary of the document")
  ChatMessagePromptTemplate.from_template(role="ai", template="Okay, first let's gather some key information about the document:")
The last "ai" message is optional; the endpoint will complete it if you include it. If you leave it out, the model will start the response however it wants.
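
For the "works with the OpenAI API directly" part, something along these lines should do it, assuming the forked server is running locally on vLLM's default port and accepts the custom roles (the base_url, api_key, and date are placeholders):

  from openai import OpenAI

  # Point the client at the local vLLM OpenAI-compatible server.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="Tostino/Inkbot-13B-8k-0.2",
      messages=[
          {"role": "meta-task_name", "content": "summary"},
          {"role": "meta-current_date", "content": "2023-11-01"},
          {"role": "system", "content": "You are a helpful assistant..."},
          {"role": "user_context", "content": "Some long document that you want summarized..."},
          {"role": "user", "content": "Please create a summary of the document"},
      ],
  )
  print(resp.choices[0].message.content)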


Thanks, that's awesome. Would you say vLLM on one GPU is as fast as or faster than llama.cpp? Have you tried mlc-llm?


Llama.cpp has faster single-request (unbatched) inference, but I actually haven't tried its batched inference yet, so I don't know what throughput I can get out of it.



