areddyyt's comments

Our CPU implementation for x86-64/AMD64 uses AVX-512 or AVX2 instructions where possible. We're experimenting with ARM support using NEON.


I should note that our linear layers are not the same as Microsoft's; in fact, we think Microsoft made a mistake in the code they uploaded. When I have time later today, I'll link to where I think the mistake is.

I've been following TriLLM. They've achieved great results, and I'm really impressed that the llama.cpp contributors have already gotten the models integrated.


We do quantization-aware training, so the model minimizes the loss with respect to the ternary weights during training; hence there should be no degradation in performance.
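For illustration, here's a minimal PyTorch sketch of ternary quantization-aware training using a straight-through estimator. It's a generic example, not our actual training code; the `ternarize` heuristic and the `QATTernaryLinear` module are made up for the sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Map full-precision weights to {-1, 0, +1} with a per-tensor scale.
        # The 0.7 * mean(|w|) threshold is a common heuristic, used here only
        # for illustration.
        delta = 0.7 * w.abs().mean()
        t = torch.zeros_like(w)
        t[w > delta] = 1.0
        t[w < -delta] = -1.0
        mask = t != 0
        scale = w.abs()[mask].mean() if mask.any() else w.new_tensor(1.0)
        return scale * t

    class QATTernaryLinear(nn.Module):
        # Linear layer trained against its ternarized weights. The
        # straight-through estimator lets gradients flow to the latent
        # full-precision weights while the forward pass sees ternary values,
        # so the loss is minimized w.r.t. the quantized network.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))

        def forward(self, x):
            w_q = ternarize(self.weight)
            w_ste = self.weight + (w_q - self.weight).detach()  # STE trick
            return F.linear(x, w_ste, self.bias)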


There was another founder who said exactly the same thing. We'll definitely look into it, especially as we train more ViTs.


Funnily enough, our ML engineer, Eddy, did a hackathon project with Procyon to build a neural network on a photonic chip. Unfortunately, I think Lightmatter beat us to the punch.

Edit: I don't think the company exists in its current form anymore


Have you sat in on my conversations with my cofounder?

The end plan is to have a single chip and flush all weights onto it at initialization. Because our integration is a single line of code that's Torch-compatible (and hence HF-compatible), no other part of the codebase should need to change.
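To illustrate what that integration could look like, here's a rough PyTorch sketch that swaps every nn.Linear for a ternary replacement in place. The `TernaryLinear` and `swap_linears` names are hypothetical, not our actual API.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TernaryLinear(nn.Module):
        # Illustrative inference-time stand-in for nn.Linear with {-1, 0, +1}
        # weights and a single scale (not our actual implementation).
        def __init__(self, linear: nn.Linear):
            super().__init__()
            w = linear.weight.data
            delta = 0.7 * w.abs().mean()
            mask = w.abs() > delta
            self.register_buffer("w_t", torch.sign(w) * mask)
            self.register_buffer("scale", w.abs()[mask].mean() if mask.any() else torch.tensor(1.0))
            self.bias = linear.bias

        def forward(self, x):
            return F.linear(x, self.w_t * self.scale, self.bias)

    def swap_linears(module: nn.Module) -> nn.Module:
        # Recursively replace every nn.Linear so no other code has to change.
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, TernaryLinear(child))
            else:
                swap_linears(child)
        return module

    # From the caller's side it's one call on any existing Torch/HF model, e.g.:
    #   model = swap_linears(AutoModelForCausalLM.from_pretrained(...))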


I've not, but that sounds cool! I would point out, though, in terms of mind share, memorability, and how relatable and useful the products are: it might help to directly show the application to the kinds of people buying GPUs for inference and training (or using the cloud for it) who would love not to have to fight their ATX case in a hot, sweaty corner while repeatedly dropping screwdrivers and calculating how much RAM they need for the 405B while llama.cpp recompiles yet again... I think people would throw money at that. I'd be happy to listen in or have a chat some time!


We don't achieve peak compression efficiency because more complex weight unpacking mechanisms kill throughput.

To be more explicit, the weight matrix's values belong to the set {-1, 0, 1}. When using two bits to encode these weights, one possible state goes unused:

10 => 1, 01 => 0, 00 => -1, 11 => ?
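For illustration, here's a minimal NumPy sketch of packing/unpacking with that mapping, four weights per byte. It just spells out the encoding; it's not our actual kernel.

    import numpy as np

    # Mapping from above: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10 (0b11 unused).

    def pack_ternary(w):
        # -1,0,+1 -> 0,1,2; assumes len(w) is a multiple of 4
        codes = (w.astype(np.int32) + 1).reshape(-1, 4)
        shifts = np.array([0, 2, 4, 6])
        return (codes << shifts).sum(axis=1).astype(np.uint8)

    def unpack_ternary(packed):
        shifts = np.array([0, 2, 4, 6])
        codes = (packed[:, None].astype(np.int32) >> shifts) & 0b11
        return (codes - 1).reshape(-1).astype(np.int8)

    w = np.array([1, 0, -1, 1, -1, -1, 0, 1], dtype=np.int8)
    assert np.array_equal(unpack_ternary(pack_ternary(w)), w)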

I think selecting the optimal radix economy will matter more on custom silicon, where we can implement dedicated circuitry and instructions to rapidly decompress weights, or to work with the compressed weights directly.
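To make the radix-economy trade-off concrete: the information-theoretic floor is log2(3) ≈ 1.585 bits per weight, and packing five ternary values into one byte (3^5 = 243 <= 256) gets you to 8/5 = 1.6 bits per weight, but decoding then needs divides/mods or a lookup table, which is cheap in dedicated hardware and costly in a hot SIMD loop. A toy sketch, purely illustrative:

    import math

    print(math.log2(3))  # ~1.585 bits per weight, the theoretical minimum
    # Plain 2-bit encoding (as above): 2.0 bits per weight.
    # Base-3 packing of 5 weights per byte: 1.6 bits per weight.

    def pack5(trits):
        # Pack five values from {-1, 0, +1} into one byte via base 3.
        b = 0
        for t in reversed(trits):
            b = b * 3 + (t + 1)
        return b

    def unpack5(byte):
        # Invert pack5.
        out = []
        for _ in range(5):
            out.append(byte % 3 - 1)
            byte //= 3
        return out

    assert unpack5(pack5([1, -1, 0, 1, 1])) == [1, -1, 0, 1, 1]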


Thank you, and good catch.

We recently acquired deepsilicon.com, and it looks like the forwarding hasn't been registered yet. abhi@deepsilicon.net should work.


We were actually thinking about this for flushing the weights in at initialization.


Cool... if you want to make a general-purpose compute engine out of it, you could go full BitGrid [1]. ;-)

[1] https://bitgrid.blogspot.com/2005/03/bitgrid-story.html


This seems super cool. I'll have my cofounder look into it :)


It's always possible, but transformers have been around since 2017 and don't seem to be going anywhere. I was bullish on Mamba and researched extended context for structured state-space models at Dartmouth; however, no one cared. The bet we're taking is that transformers will dominate for at least a few more years, but we could be wrong.


