Hacker News

I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?
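For context, the quantization being discussed (INT8, per the correction downthread) usually means mapping floats to 8-bit integers with a shared scale. A toy sketch of symmetric per-tensor INT8 quantization, not tied to any particular library; the function names are my own:

```python
# Toy symmetric per-tensor INT8 quantization: one scale (absmax / 127)
# shared by all values. Illustrative only, not any library's actual API.

def quantize_int8(values):
    """Map a list of floats to int8 values plus a single scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid div-by-zero on all-zeros
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; error is bounded by scale / 2 per value."""
    return [x * scale for x in q]
```

The round trip loses at most half a quantization step per value, which is why 8-bit weight quantization is often close to lossless in practice.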



Yes. It's speculative decoding, but instead of generating just a few sequential tokens with the draft model, it generates a whole tree of candidates in an (approximately) optimal shape, covering hundreds of possible sequences.

It ends up being somewhat faster than regular speculative decoding in the normal GPU-only setting. If you are doing CPU offloading, it's massively faster.



> Can this be combined with quantization?

It's listed in the TODO section of https://github.com/Infini-AI-Lab/Sequoia/tree/main


INT8, not FP8



