Super excited to test these out.
The 20B's benchmarks are blowing away major >500B models. Insane.
On my hardware: 43 tokens/sec.
I got an error when turning flash attention on. Can't it run with flash attention?
31,000 context is the max it will allow, or the model won't load.
No K-cache or V-cache quantization either.