Could someone please summarize how the LLM part here differs from (or matches) a TGWUI + llama.cpp setup with layers offloaded to the GPU?
Asking because an 8x7B Q4_K_M model (25 GB, GGUF) doesn't seem to be "ultra-low latency" on my 12 GB of VRAM plus system RAM. Like, at all. I can imagine getting that latency from a 7-13 GB model (I did, but... that's a small model), or with 2x P40s or something. I'm not sure what assumptions the README is making. Am I missing something? Could you try it without the TTS part?
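For reference, this is the kind of partial-offload setup I'm comparing against. A minimal sketch with llama-cpp-python; the model path and layer count are placeholders, not from the README:

```python
# Partial GPU offload: with 12 GB VRAM only part of a 25 GB 8x7B
# Q4_K_M model fits on the GPU; the remaining layers run from system
# RAM, which is what dominates token latency in my setup.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=20,  # example value: however many layers fit in 12 GB
    n_ctx=4096,
)

out = llm("How fast is a response on this hardware?", max_tokens=64)
print(out["choices"][0]["text"])
```

With `n_gpu_layers` set to -1 (fully offloaded) a model that fits in VRAM is fast; once layers spill to RAM, per-token latency drops off sharply, which is why I'm puzzled by the "ultra-low latency" claim.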