
Could someone please summarize the differences (or similarities) between the LLM part and a TGWUI + llama.cpp setup with layers offloaded to the GPU?

Asking because an 8x7B Q4_K_M (25GB GGUF) doesn't seem to be "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can imagine getting that latency with a 7-13GB model (I did, but... it's a small model), or with 2x P40 or something. I'm not sure what assumptions the README makes. Am I missing something? Can you try it without the TTS part?
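
(For reference, the setup I'm comparing against is roughly the sketch below, via llama-cpp-python; the model path is a placeholder and n_gpu_layers is just whatever fits in 12GB.)

    # Rough sketch of the TGWUI/llama.cpp-style setup I mean (llama-cpp-python).
    # Model path and n_gpu_layers are placeholders; tune for your VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # ~25GB GGUF
        n_gpu_layers=12,  # only part of the model fits in 12GB VRAM, rest stays in RAM
        n_ctx=4096,
    )

    out = llm("Q: Why is partial offloading slow for a 25GB model? A:", max_tokens=128)
    print(out["choices"][0]["text"])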




The video example uses Phi-2, which is a 2.7B-parameter network. I think that's part of how they're achieving the low latency here!
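
(If you want to sanity-check that on your own hardware, here's a rough timing sketch with plain Hugging Face transformers rather than the TensorRT-LLM path this project uses, so treat the numbers as a ballpark only.)

    # Rough Phi-2 latency check with plain transformers (not TensorRT-LLM).
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2", torch_dtype=torch.float16, device_map="cuda"
    )

    inputs = tok("Explain latency in one sentence.", return_tensors="pt").to("cuda")
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=64)
    elapsed = time.time() - start
    n_new = out.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{n_new} tokens in {elapsed:.2f}s ({n_new / elapsed:.1f} tok/s)")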

Has anybody fine-tuned Phi-2? I haven't found any good resources for that yet.
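
(The closest generic recipe I can think of is LoRA via peft, something like the untested sketch below; the target_modules names are my guess at Phi-2's layer naming.)

    # Untested sketch: generic LoRA fine-tuning setup for Phi-2 (transformers + peft).
    # The target_modules names are assumptions about Phi-2's layer naming.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2", torch_dtype=torch.float16, device_map="cuda"
    )

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # ...then train with transformers.Trainer or trl's SFTTrainer on your dataset.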


We tested https://huggingface.co/cognitivecomputations/dolphin-2_6-phi... as well; on some tasks it performs better. That said, you can also use Mistral; we support a few models through TensorRT-LLM.
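
(As a rough illustration only: recent TensorRT-LLM releases expose a high-level LLM API along these lines. This is a sketch, not necessarily how the pipeline here is wired up, and exact argument names vary between releases; the model name is just an example.)

    # Sketch of TensorRT-LLM's high-level LLM API (recent releases).
    # Exact names may differ by version; the model below is an example only.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(["What is low-latency inference?"], params):
        print(output.outputs[0].text)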



