I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?
People are working on that [1]. In some sense, it's a step back to analog computing. Add/multiply is possible to do directly in memory with voltages, but it's less versatile (and stable) than digital computing. So you can't do all calculations in a neural network that way, meaning some digital components will always be necessary. But I'm pretty sure analog will make a comeback for AI chips sooner or later.
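A toy sketch of that trade-off (the function names and the Gaussian noise model are my illustrative assumptions, not any real chip's behavior): an in-memory analog multiply-accumulate picks up noise on every product, while the digital version is exact.

```python
import random

def analog_mac(weights, inputs, noise_sigma=0.01):
    """Hypothetical analog multiply-accumulate: each product picks up
    voltage noise, loosely modeling an in-memory (crossbar-style) compute."""
    total = 0.0
    for w, x in zip(weights, inputs):
        total += w * x + random.gauss(0.0, noise_sigma)
    return total

def digital_mac(weights, inputs):
    """Exact digital reference."""
    return sum(w * x for w, x in zip(weights, inputs))

random.seed(0)
w = [0.5, -0.25, 1.0]
x = [1.0, 2.0, -0.5]
exact = digital_mac(w, x)   # -0.5
noisy = analog_mac(w, x)    # -0.5 plus a small random error
print(exact, noisy)
```

The add/multiply itself works either way; the point is that the analog result is only ever approximately right, which is why the digital parts can't be eliminated.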
Ternary, however, is an interesting middle ground. People built ternary hardware long ago; it feels like you could make natively ternary hardware for something like this, and it might even be quite a win.
People haven't built reliable ternary electronics, though. The Soviets tried with Setun, but they eventually had to resort to emulating each trit with two hardware bits (wasting one of the four possible states).
My friend was working on this in the mid-90s at Texas Instruments. Not sure what the underlying semiconductors were, but it did involve making ternary logic via voltage levels. Just searched a bit and found this TI datasheet which might be an example of it (high logic, low logic, high impedance), but maybe not:
https://www.ti.com/lit/ds/symlink/sn74act534.pdf
Hadn't thought about it this way before, but given that LLMs are autoregressive (they feed their own output back in as the next input), they're sensitive to error drift in ways that are rather similar to analog computers.
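That sensitivity is easy to demonstrate with a toy autoregressive loop (the logistic-map update rule and the noise level below are arbitrary stand-ins for a real model, chosen only because the map amplifies small perturbations):

```python
import random

def autoregressive_drift(steps=50, noise=1e-3, seed=42):
    """Feed a slightly noisy output back in as the next input, and track
    how far the noisy trajectory drifts from the noise-free one."""
    random.seed(seed)
    f = lambda v: 3.7 * v * (1.0 - v)   # placeholder nonlinear update rule
    exact = noisy = 0.1
    drift = []
    for _ in range(steps):
        exact = f(exact)
        noisy = f(noisy) + random.gauss(0.0, noise)  # tiny per-step error
        drift.append(abs(noisy - exact))
    return drift

d = autoregressive_drift()
print(d[0], max(d))   # tiny at first, orders of magnitude larger later
```

Each step's error is minuscule, but because every step consumes the previous step's output, the errors compound instead of averaging out, which is exactly the analog-computer failure mode.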
Analog computing for neural networks is always very tempting.
> We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range?
Transistors that are fully closed or fully open use basically no energy: they either have approximately zero current or approximately zero resistance.
Transistors that are partially open dissipate a lot of energy, because they have some current flowing at some resistance. They get hot.
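With illustrative round numbers (not a real device model), the three regimes look like this:

```python
# Dissipation P = V * I in the three regimes described above.
# All figures below are made-up round numbers for illustration.
supply = 1.0  # volts

def power(v_drop, current):
    return v_drop * current

off     = power(supply, 1e-12)  # fully off: ~zero current       -> ~1 pW
on      = power(1e-3, 1e-3)     # fully on: ~zero voltage drop   -> ~1 nW
partial = power(0.5, 1e-3)      # partially open: both V and I   -> 0.5 mW

print(off, on, partial)
```

Only the partially-open case has both a substantial voltage drop and a substantial current at once, so it dissipates orders of magnitude more than either saturated state.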
In addition, modern transistors are so small and so fast that the number of electrons (or holes) flowing through them in a clock cycle is perhaps in the range of a few dozen to a hundred. So that gives you at most 7 bits (~log_2(128)) of precision to work with in an analog setting. In practice, quite a bit less, because there's a lot of thermal noise. Say perhaps 4 bits.
Going from 1 bit per transistor to 4 bits (of analog precision) is not worth the drastically higher energy consumption, nor the deviation from the mainstream of semiconductor technological advances.
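The arithmetic above, as a quick sketch (the shot-noise model in the second function is my assumption for where the "quite a bit less" comes from):

```python
import math

def analog_bits(n_carriers):
    """Ideal bound: distinguishable levels if every carrier could be
    counted exactly -> log2(N) bits."""
    return math.log2(n_carriers)

def analog_bits_shot_noise(n_carriers):
    """Noise-limited estimate (assumption): counting noise ~ sqrt(N),
    so only N / sqrt(N) = sqrt(N) levels are distinguishable."""
    return math.log2(math.sqrt(n_carriers))

print(analog_bits(128))             # 7.0 -- the "at most 7 bits" figure
print(analog_bits_shot_noise(128))  # 3.5 -- close to the "say 4 bits" guess
```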
As someone who knows almost nothing about electronics I assume you’d want a transistor which can open in two ways: with positive and negative voltage. I’ve seen TNAND built out of normal transistors, not sure if such exotic ones would help even if they were physically possible.
The reason digital/numeric processing won is power loss in the analog world.
When you design an analog circuit, each processing stage you add at the end has an impact on the ones before it (loading).
That demands higher skill from the engineers/consumers.
If you want to avoid that, you need to add op-amps with a gain of 1 at the boundary of each stage; this also takes care of the power loss at each stage.
The other part is that there's a limit to the amount of useful information/computation you can do with analog processing once you take voltage noise into account. When you do the comparison, there are places where analog wins but also places where digital wins.
I'll edit this later with a link to some papers that discuss these topics, if I manage to find them in my mess.
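One way to make the noise limit concrete is the Shannon capacity of a single analog sample; the SNR figure below is just an illustrative assumption:

```python
import math

def usable_bits(signal_power, noise_power):
    """Shannon limit for one analog sample: 0.5 * log2(1 + SNR)."""
    return 0.5 * math.log2(1.0 + signal_power / noise_power)

# Even with noise at 1/1000th of signal power (a generous 30 dB SNR),
# one analog voltage carries only ~5 bits of information.
print(usable_bits(1.0, 0.001))
```

So an analog wire beats a single digital wire per sample, but not by much, and the gap shrinks fast as noise grows.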
Good explanation. When I was working at a semiconductor manufacturer, our logic thresholds were something like 0-0.2 V for low and 0.8-1.0 V for high. Additionally, if you look at QLC SSDs, their longevity is hugely degraded. Analog computing is non-trivial, to say the least.
They visit Texas company Mythic AI to discuss how they use flash memory for machine learning. There's a California company named Syntiant doing something similar.
It would be something of a full circle, I feel, if we went back to dedicated circuits for NNs - that's how they began life, when Rosenblatt built his Perceptron.
I remember reading a review of the history in grad school (can't remember the paper) where the author stated that one of the military's initial interests in NNs was their distributed nature. Even back then, people realized you could remove a neuron or break a connection and they would still work (and even today, dropout is a way of regularizing the network). The thinking was that being able to build a computer or automated device that could be damaged (radiation flipping bits, an impact destroying part of the circuit, etc.) and still work would be an advantage, given the perceived inevitability of nuclear war.
Compare that to a normal von Neumann machine, which is very fault intolerant - remove the CPU and there's no processing; no memory, no useful calculation; etc. One reason people may have avoided further attempts at physical neural networks is that they're intrinsically more complex than von Neumann machines, since the processing and memory are intertwined (the NN is the processor, the program, and the memory at the same time).
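The fault-tolerance point can be sketched in a few lines (the redundant 20-neuron voting layer below is a contrived example, not any historical design):

```python
import random

def layer(x, weights):
    """One dense layer of step-function (perceptron-style) neurons."""
    return [1.0 if sum(w * xi for w, xi in zip(row, x)) > 0 else 0.0
            for row in weights]

random.seed(1)
# A redundant hidden layer: 20 slightly different neurons all detecting
# the same feature (weights near +1 and -1, with some variation).
x = [0.8, -0.3]
hidden_w = [[random.gauss(1.0, 0.1), random.gauss(-1.0, 0.1)]
            for _ in range(20)]
h = layer(x, hidden_w)
vote = sum(h) / len(h)                  # healthy network's decision

# "Damage" the network: destroy 5 of the 20 neurons (cf. dropout).
h_damaged = h[:15]
vote_damaged = sum(h_damaged) / len(h_damaged)
print(vote, vote_damaged)               # the majority decision survives
```

Delete the one CPU of a von Neumann machine and nothing runs; delete a quarter of these neurons and the answer doesn't change.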
The US military’s interest in network robustness led to the internet if I’m not mistaken.
Also preceding the perceptron was the McCulloch & Pitts neuron, which is basically a digital gate. NNs and computing indeed have a long history together.
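A McCulloch-Pitts neuron really is just a threshold gate; a minimal sketch:

```python
def mp_neuron(inputs, threshold):
    """McCulloch-Pitts neuron: fires iff enough inputs are active
    (all weights fixed at 1, binary output)."""
    return 1 if sum(inputs) >= threshold else 0

# Threshold 2 of 2 inputs gives AND; threshold 1 of 2 gives OR.
AND = lambda a, b: mp_neuron([a, b], 2)
OR = lambda a, b: mp_neuron([a, b], 1)

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
```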
>maybe NAND gates are not the ideal fundamental building block here?
It's my long-held opinion that LUTs (lookup tables) are the basis of computation for the future. I've been pondering this for a long time, since George Gilder told us that wasting transistors was the winning strategy. What could be more wasteful than just making a huge grid of LUTs that all interconnect, with NO routing hardware?
As time goes by, the idea seems to have more and more merit. Imagine a grid of 4x4 bit look up tables, each connected to its neighbors, and clocked in 2 phases, to prevent race conditions. You eliminate the high speed long lines across chips that cause so much grief (except the clock signals, and bits to load the tables, which don't happen often).
What you lose in performance (in terms of latency), you make up for with a homogeneous architecture that is easy to think about, can route around bad cells, and can be compiled for almost instantly, thanks to the lack of special cases. You also never have to worry about unpredictable latency; it's constant.
No routing, and no fast lines that cut across the chip - those lines cut way down on latency, but they make FPGAs harder to build, and especially hard to compile for once you want to use them.
All that routing hardware, and the special function units featured in many FPGAs, are things you have to optimize the usage of and route to. You end up using solvers, simulated annealing, etc., instead of a straight compile to binary expressions and a mapping onto the grid.
Latency minimization is the key to getting a design to run fast in an FPGA. In a BitGrid, you know the clock speed, so you know the latency by just counting the steps through the graph. BitGrid performance is determined by how many answers/second you can get from a given chip. If you had a 1 GHz rack of BitGrid chips that could run GPT-4 with a latency of 1 ms per token, you'd think that was horrible, but you could run a million such streams in parallel.
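A minimal software sketch of such a grid (my simplifications: one output bit per cell broadcast to all neighbors rather than a full 4-in/4-out table, toroidal edges, and a parity truth table as the arbitrary cell program):

```python
# Toy model of the grid-of-LUTs idea above: each cell is a lookup table
# addressed by its four neighbors' bits, clocked in two checkerboard
# phases so no cell ever reads a neighbor updated in the same phase.

def step(grid, lut, phase):
    """Update only cells whose checkerboard color matches `phase`."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if (r + c) % 2 == phase:
                nbrs = [grid[(r - 1) % rows][c], grid[(r + 1) % rows][c],
                        grid[r][(c - 1) % cols], grid[r][(c + 1) % cols]]
                addr = sum(b << i for i, b in enumerate(nbrs))  # 4-bit address
                new[r][c] = lut[addr]
    return new

# A 16-entry truth table standing in for one cell's programmed contents:
parity_lut = [bin(a).count("1") % 2 for a in range(16)]

grid = [[0] * 4 for _ in range(4)]
grid[0][1] = 1                       # inject one bit
for phase in (0, 1):                 # one full two-phase clock cycle
    grid = step(grid, parity_lut, phase)
print(grid)                          # the bit has spread to nearby cells
```

Counting latency is then literal: a result that is k cells away from its inputs is ready after k phase steps, with no placement or routing solver involved.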
I have heard of people trying to build analog AI devices, but that was years ago, and no news has come out about it recently. Maybe it is harder than it seems. I bet it is expensive to regulate voltage so precisely, and it's not a flexible enough scheme to support training neural networks like the ones we have now, which are highly reconfigurable. I've also heard of people trying to use analog computing for more mundane things. But no devices have hit the market after so many years, so I'm assuming it is a super hard problem, maybe even intractable.
Perhaps another variation on the idea is to allow a higher error rate. For example, if a 0.01% error rate were acceptable in AI, perhaps the voltage range between states could be lowered (dynamic power scales quadratically with voltage) and the clock speed could increase.
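The quadratic relationship is the standard CMOS dynamic-power formula; the capacitance and frequency figures below are made up purely for illustration:

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic switching power of CMOS logic: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz

# Halving the supply voltage cuts switching power 4x, which could buy
# back a doubled clock while still halving total switching power.
p_full = dynamic_power(1e-9, 1.0, 1e9)   # ~1.0 W
p_low = dynamic_power(1e-9, 0.5, 2e9)    # ~0.5 W at twice the clock
print(p_full, p_low)
```

The catch the comment alludes to: the smaller voltage margin makes bit flips more likely, so this trade only works where occasional errors are tolerable.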
That only prevents analog copy degradation. It doesn't give you reproducibility.
Reproducibility means running the same process twice with the same inputs and getting the same outputs.
E.g. to later prove that something came from an LLM and not a human you could store the random seed and the input and then reproduce the output. But that only works if the network is digital.
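A sketch of why the seed-replay scheme needs determinism (both samplers below are toy stand-ins for an LLM, and the "device noise" term is my assumption about how an analog network would behave):

```python
import random

def generate(seed, prompt, n=5):
    """Stand-in for a fully digital sampler: the seed captures all the
    randomness, so the same seed and prompt replay the same tokens.
    (`prompt` is unused in this toy; it's only the interface shape.)"""
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(n)]   # fake token ids

def generate_analog(seed, prompt, n=5):
    """Same sampler plus un-seeded 'device noise' that the stored seed
    cannot capture, as in an analog network."""
    rng = random.Random(seed)
    return [rng.randrange(1000) + random.random() * 1e-6 for _ in range(n)]

print(generate(42, "hello") == generate(42, "hello"))                # True
print(generate_analog(42, "hello") == generate_analog(42, "hello"))  # False
```

With the digital sampler, storing (seed, input) is enough to reproduce and verify the output later; with the analog one, replay fails even with the seed in hand.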
This reminds me of an article[1] recently linked on HN about how Intel had an analog chip for neural nets in the 90s, if I understood correctly.