I should note that our linear layers are not the same as Microsoft's; in fact, we think Microsoft made a mistake in the code they uploaded. When I have time later today, I'll link to where I think the mistake is.
I've been following TriLLM. They've achieved great results, and I'm really impressed with the llama.cpp contributors already getting the models integrated.
Funnily enough, our ML engineer, Eddy, did a hackathon project working with Procyon to make a neural network with a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.
Edit: I don't think the company exists in its current form anymore
Have you sat in on my conversations with my cofounder?
The end plan is to have a single chip and flash all weights onto it at initialization. Because our layer is a single line of code that is Torch-compatible (and hence HF-compatible), no other part of the codebase should need to change.
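Roughly, the swap looks like this (a minimal sketch; swap_linears and TernaryLinear are illustrative names, not our actual API). Because the replacement keeps nn.Linear's interface, the rest of the Hugging Face stack is untouched:

    import torch.nn as nn

    def swap_linears(module, make_layer):
        """Recursively replace every nn.Linear with a drop-in stand-in that
        keeps nn.Linear's interface, so nothing else in the model changes."""
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, make_layer(child))
            else:
                swap_linears(child, make_layer)

    # Usage with any Hugging Face model (TernaryLinear.from_linear is hypothetical):
    #   model = AutoModelForCausalLM.from_pretrained("...")
    #   swap_linears(model, TernaryLinear.from_linear)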
I've not, but that sounds cool!
I would point out, though, that in terms of mind share (how memorable, relatable, and useful the product feels), it might help to have ways of directly showing the application to the kinds of people buying GPUs for inference and training, or renting cloud compute for it: people who would love to stop fighting their ATX case in a hot, sweaty corner while repeatedly dropping screwdrivers and recalculating how much RAM the 405B needs as llama.cpp recompiles yet again. I think people would throw money at that.
I'd be happy to listen in or have a chat some time!
We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.
To be more explicit, each weight takes a value in {-1, 0, 1}. When using two bits to encode these weights, one of the four possible states goes unused (a packing sketch follows the mapping):
10 => 1
01 => 0
00 => -1
11 => ? (unused)
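Here's a toy version of that packing (illustrative, not our production kernel): four 2-bit weights per uint8 byte, with a round-trip check at the end.

    import torch

    # 0b00 -> -1, 0b01 -> 0, 0b10 -> +1; 0b11 is never produced, so one
    # of the four 2-bit states is wasted. Four weights fit in each byte.
    def pack_ternary(w):
        """w: 1-D tensor with values in {-1, 0, 1}."""
        codes = (w + 1).to(torch.uint8)   # -1,0,1 -> 0,1,2
        pad = (-codes.numel()) % 4        # pad length to a multiple of 4
        codes = torch.cat([codes, torch.ones(pad, dtype=torch.uint8)])  # pad with zero-weights
        shifts = torch.arange(4, dtype=torch.uint8) * 2
        return (codes.view(-1, 4) << shifts).sum(dim=1).to(torch.uint8)

    def unpack_ternary(packed, n):
        shifts = torch.arange(4) * 2
        fields = (packed.unsqueeze(-1).int() >> shifts) & 0b11
        return fields.flatten()[:n].float() - 1.0

    w = torch.tensor([-1.0, 0.0, 1.0, 1.0, 0.0])
    assert torch.equal(unpack_ternary(pack_ternary(w), 5), w)

The unpack is just shifts and masks, which is why it keeps throughput high even though it leaves compression on the table.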
I think selecting the optimal radix economy will come into play more on custom silicon, where we can design hardware and instructions to rapidly decompress weights, or to operate on the compressed weights directly.
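For intuition on the radix-economy point (a toy sketch of the arithmetic, not anything we ship): base-3 packing fits five trits in one byte, since 3^5 = 243 <= 256, i.e. 1.6 bits per weight instead of 2.0. The catch is that unpacking needs a divide/modulo-by-3 chain, which is cheap as a fixed-function decode path in silicon but expensive in a GPU inner loop.

    # Base-3 packing: five ternary weights per byte (3**5 = 243 <= 256).
    # 1.6 bits/weight vs 2.0 for the 2-bit scheme, at the cost of div/mod by 3.
    def pack5(trits):
        """trits: five values in {-1, 0, 1} -> one byte (0..242)."""
        byte = 0
        for t in reversed(trits):
            byte = byte * 3 + (t + 1)
        return byte

    def unpack5(byte):
        trits = []
        for _ in range(5):
            byte, code = divmod(byte, 3)
            trits.append(code - 1)
        return trits

    assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]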
It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth. However, no one cared. The bet we're taking is that Transformers will dominate for at least a few more years, but our bet could be wrong.