Hey Hacker News! We’re Abhi and Alex from deepsilicon (
https://deepsilicon.com). We are building software and hardware for training and running inference on ternary transformer models. Here's a video of the software:
https://www.youtube.com/watch?v=VqBn-I5D6pk.
Transformer-based models get bigger every generation, making the hardware needed for inference more and more expensive. Running large transformer models on-device is even harder: they typically need trillions of FLOPs to run at decent speeds, and they use too much energy and space.
Our solution is to train ternary transformer models. There are two advantages to using ternary values. The first is that each weight can be stored in two bits (or even less) instead of 16 bits. That's an almost 8x compression ratio for every weight matrix in the transformer (slightly less because of the float16 scaling value and the extra norm, but that's negligible). The second advantage is that the arithmetic itself gets simpler: in a dot product between ternary weights and INT8 activations, we add the INT8 value if the ternary weight is 1, subtract it if the ternary weight is -1, or do nothing if the ternary weight is 0. There are numerous ways to take advantage of this change in arithmetic, from lookup tables to bit-mask reductions. As for why ternary and not binary or quaternary: in our experiments, ternary hits a sweet spot of compression and (symmetric) representational value for the weights.
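To make that concrete, here is a minimal numpy sketch, written just to illustrate the two points above (it is not our actual kernel code): packing ternary weights into 2-bit codes, and a dot product against INT8 activations that only ever adds, subtracts, or skips.

    import numpy as np

    def pack_ternary(w):
        # w: values in {-1, 0, +1}; map to 2-bit codes (the code assignment here is arbitrary)
        codes = np.select([w == 1, w == -1], [1, 2], default=0).astype(np.uint8)
        # Pack 4 weights per byte -> ~8x smaller than FP16 (before the scale/norm overhead)
        codes = codes.reshape(-1, 4)
        return (codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

    def ternary_dot(w, x):
        # No multiplications: add where w == +1, subtract where w == -1, skip zeros.
        acc = 0
        for wi, xi in zip(w, x):
            if wi == 1:
                acc += int(xi)
            elif wi == -1:
                acc -= int(xi)
        return acc

    w = np.random.choice([-1, 0, 1], size=16)
    x = np.random.randint(-128, 128, size=16, dtype=np.int8)
    assert ternary_dot(w, x) == int(w.astype(np.int32) @ x.astype(np.int32))
    print(len(pack_ternary(w)), "bytes for", len(w), "weights")   # 4 bytes vs 32 bytes in FP16

Those 16 weights pack into 4 bytes versus 32 bytes of FP16, which is where the roughly 8x figure comes from.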
Current hardware is not really optimized for extreme low-bit-width matrix operations (multiplication or otherwise). We've tried various kernel implementations on both CPUs and GPUs (really only NVIDIA GPUs). We don't come close to the theoretical maximum speed for our kernels, largely because the architecture of existing hardware isn't built for the operations we want it to do. Custom silicon for ternary LLMs can accelerate inference by implementing algorithms and circuits designed specifically for ternary LLMs. Unlike most hardware companies, which need silicon in hand to show improvements, we can already show improvements in active VRAM usage and throughput with our custom kernels on existing hardware. That sets a pretty strong lower bound for what the custom silicon can do.
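As a toy illustration of why this maps well onto simpler circuits, the same dot product can also be written as two masked reductions with no multiplies anywhere. This is just numpy showing the general bit-mask-reduction-style formulation, not our kernel code; in a real kernel the masks come straight from the packed 2-bit codes.

    import numpy as np

    def ternary_dot_masked(w, x):
        # Sum the activations where the weight is +1, subtract the sum where it is -1.
        # Only selects and adds: the kind of structure dedicated silicon can exploit directly.
        x = x.astype(np.int32)
        return int(x[w == 1].sum() - x[w == -1].sum())

    w = np.random.choice([-1, 0, 1], size=1024)
    x = np.random.randint(-128, 128, size=1024, dtype=np.int8)
    assert ternary_dot_masked(w, x) == int(w.astype(np.int32) @ x.astype(np.int32))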
We originally started working on this after reading the BitNet paper from Microsoft, and were disenchanted that we couldn't run SOTA models on our consumer hardware (a 3090 and a 3070M). Both Alex and I worked on research at Dartmouth: I worked more on the ML/model-architecture side, while Alex worked on randNLA CUDA kernels to accelerate training. That research experience, and the opportunity to talk to professors, made us realize that if we could pull off ternary transformers, it could solve the large-scale inference problem on both the edge and the cloud.
First, we either retrain or pretrain a model with our custom linear layers, based on the BitNet 1.58 layers (we're working on open-sourcing our framework for training, data labelling, and synthetic data generation here: https://github.com/deepsilicon/Sila). The model is trained with FP16 weights, but the forward pass uses quantized weights; the quantization function is detached from the computational graph so gradients can still flow, and the loss is measured w.r.t. the quantized weights. Once the model converges, we can run inference with our custom kernels written for CPUs or GPUs (Inferentia and TPU support is in the works). The end goal is purpose-built custom silicon for the ternary weights, with better compression, throughput, latency, and energy than our kernels on existing hardware can deliver.
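For anyone curious what that training trick looks like, here is a rough PyTorch sketch of a BitNet-1.58-style linear layer with a straight-through estimator. It's a simplification of the published recipe (no activation quantization or norm) and not our actual training code; our kernels replace the dequantized matmul at inference time.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TernaryLinear(nn.Module):
        """Simplified BitNet-1.58-style linear layer (illustration only)."""
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

        def quantize(self, w):
            # Absmean scaling, then round each weight to {-1, 0, +1}.
            scale = w.abs().mean().clamp(min=1e-5)
            return torch.clamp((w / scale).round(), -1, 1) * scale

        def forward(self, x):
            w = self.weight
            # Straight-through estimator: the forward pass sees ternary weights,
            # but the rounding is detached so gradients flow to the FP weights.
            w_q = w + (self.quantize(w) - w).detach()
            return F.linear(x, w_q)

    layer = TernaryLinear(512, 512)
    y = layer(torch.randn(4, 512))
    y.sum().backward()   # gradients reach layer.weight despite the rounding

The loss is computed against the quantized weights, but because the rounding step is detached, the gradients update the full-precision master weights, which is what lets the model converge.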
We know this is a highly challenging problem, both technically and on the market side. Plenty of hardware companies have tried to accelerate inference, and most are not profitable. The biggest problem in the ML hardware market, perplexingly, is software: it's hard to convince companies to switch to new hardware when their entire infrastructure and software stack has been configured for something else. On the technical side, we must support various deployment options and model architectures to make large-scale custom silicon production worthwhile. This is compounded by the fact that we want a single line of code to handle everything, abstracting what we're doing away from the ML engineers. So we need to handle everything on the technical side: compiling the right kernels for your platform, generating the right bindings for ONNX/TensorRT, tuning the kernels, setting the mode to training or inference, and so on.
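To give a feel for what "a single line of code" means in practice, here is a toy sketch of a drop-in conversion pass in PyTorch, reusing the TernaryLinear sketch above. The function name and behavior here are just an illustration of the shape of the interface, not our actual API.

    import torch.nn as nn

    def convert_to_ternary(model: nn.Module) -> nn.Module:
        # Toy illustration of a one-call conversion: walk the module tree and swap
        # every nn.Linear for a ternary-friendly replacement (bias handling omitted).
        # A real version would also pick the right kernels/bindings per platform
        # (CUDA, ONNX/TensorRT, ...) and set training vs. inference mode.
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                replacement = TernaryLinear(child.in_features, child.out_features)
                replacement.weight.data.copy_(child.weight.data)
                setattr(model, name, replacement)
            else:
                convert_to_ternary(child)
        return model

    # Usage: model = convert_to_ternary(model)  # then fine-tune or run inference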
We’d love to hear your opinions about ASICs for transformer inference - and if you know anyone who might be interested in deploying these models, my email is abhi@deepsilicon.com. We can’t wait to hear what you all think!