But in dedicated hardware you can just gang your operations into dataflows where the output of one stage feeds into the physically adjacent next stage with no need to make a trip through the register file or bypass network.
A lot of the benefit of hardware vector operations over scalar operations is in the dispatch cost, but most of the benefit from hardware matrix operations over hardware vector operations is from reduced data movement.
EDIT: Of course, the post is from 2012, back when nobody was doing hardware matrix multiplication, so it's understandable.
Still not as fast as doing 16-bit fixed-point maths, which I used back in the day for a toy 3D system. https://github.com/pjc50/ancient-3d-for-turboc
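For anyone curious what that looks like, here's a rough Python sketch of 16-bit fixed-point multiplication in an assumed 8.8 format (8 integer bits, 8 fractional bits); the linked repo's actual format and conventions may differ.

```python
# Sketch of 16-bit fixed-point math in an 8.8 format (assumed, not taken
# from the linked repo): values are stored as integers scaled by 2^8.

FRAC_BITS = 8
ONE = 1 << FRAC_BITS  # 1.0 is represented as 256

def to_fix(x):
    """Convert a float to 8.8 fixed point."""
    return int(round(x * ONE))

def fix_mul(a, b):
    # The product of two 8.8 numbers is a 16.16 number;
    # shift right to bring it back to 8.8.
    return (a * b) >> FRAC_BITS

def to_float(a):
    """Convert 8.8 fixed point back to a float."""
    return a / ONE

a = to_fix(1.5)   # 384
b = to_fix(2.25)  # 576
print(to_float(fix_mul(a, b)))  # 3.375
```

All the work is integer multiplies and shifts, which is why this was so much faster than floating point on CPUs of that era.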
>> the post is from 2012 back when nobody was doing hardware matrix multiplication so it's understandable.
Missed that first time round: off by a decade or two! https://en.wikipedia.org/wiki/3dfx_Interactive / https://en.wikipedia.org/wiki/Silicon_Graphics /
The difference for the purpose of this discussion is in the dispatch (data movement) cost per useful operation.
Both GEMV and GEMM can be described as performing (m,k,n) matrix multiplication of an mxk matrix by a kxn matrix. GEMV is simply the case n=1.
The number of useful operations is m * k * n, while the size of the input data is m * k + k * n. So a (4,4,4) GEMM does 64 useful operations while moving 32 input values. Implementing the same GEMM as 4xGEMV also does 64 useful operations, but at the cost of moving 20 input values per GEMV, or 80 overall.
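To make the counting concrete, here's a throwaway Python sketch of the bookkeeping above (the helper name is just for illustration):

```python
# Back-of-the-envelope counting of useful operations vs. input values
# moved for an (m,k,n) matrix multiplication.

def gemm_counts(m, k, n):
    """Multiply-accumulates performed and input values read for one (m,k,n) GEMM."""
    ops = m * k * n        # one multiply-accumulate per (i, j, l) triple
    data = m * k + k * n   # A is m*k values, B is k*n values
    return ops, data

# A single (4,4,4) GEMM:
print(gemm_counts(4, 4, 4))  # (64, 32)

# The same product done as 4 separate GEMVs (n=1), re-reading A each time:
ops_gemv, data_gemv = gemm_counts(4, 4, 1)
print(4 * ops_gemv, 4 * data_gemv)  # 64 80
```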
That's where the benefit of hardware GEMM comes from.
Computer graphics doesn't require high throughput of matrix-matrix multiplication. You might need a few matrix-matrix multiplications to set up your transformation matrices, but you do that once for matrices that are then applied to many vertices, so there's not much to be gained by optimizing those. The high throughput matrix-vector multiplies happen on the GPU, but you don't need GEMM for that and so GPUs traditionally didn't offer it.
I guess you could argue that if you multiply one matrix by many vectors, for processing many vertices of a model, then you do in fact have an implied GEMM if you group your vertices accordingly. It seems that for some reason, the computer graphics folks never quite saw it that way, maybe because you also do stuff like animation blending which breaks the GEMM analogy.
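As a quick numpy sketch of that grouping (toy data, nothing from the actual graphics pipeline):

```python
# Transforming many vertices with one matrix *is* a GEMM if you stack
# the vertices as columns. Toy 3x3 case with made-up data.
import numpy as np

M = np.arange(9.0).reshape(3, 3)        # one transform matrix
verts = np.arange(12.0).reshape(3, 4)   # 4 vertices as columns

# One GEMV per vertex...
per_vertex = np.stack([M @ verts[:, j] for j in range(4)], axis=1)

# ...is exactly the GEMM M @ verts.
assert np.allclose(per_vertex, M @ verts)
```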
PS: Got a real chuckle out of that 1908s typo. I would probably keep it.
Yep! I have a sad 586 (Pentium) with an empty motherboard slot labelled "CACHE MODULE" sitting behind me in the office. (It's L2 — the 586 has a small on-die L1.) You can theoretically still buy these "Cache on a stick" modules on eBay and the like; who knows if they work, and they're definitely not worth it in an economic sense.
I missed the on-die L2 by one generation (Pentium Pro was the 686)!
I read an essay on the failure of a RISC design. The reason was that it didn't have separate registers for the floating-point unit. The extra bus loading from the floating-point unit limited the speed at which the registers could be accessed, and thus the clock speed. So it was never able to meet its performance specs, and there was no way to fix it.
I hadn't really thought until then that there is a trade-off between clock speed and the number and usage of registers. The PDP-11 and 68000 have a lot of flat generic registers, whereas x86 had a limited number of specialized registers. x86 could probably clock faster at the expense of higher register pressure.
You were a teen for at least 82 years. Talk about a protracted adolescence!
In addition to TFA's reason, bluntly, because we have the transistors. Dennard scaling has ended, which means we can't continue to increase clock frequencies. However, transistor counts have continued to increase, which has basically forced CPU manufacturers to focus on multicore.
Also, big/little, gating off unused silicon and other approaches can save energy even as they use more transistors.
Depending on what your idle-power savings are, it can make sense either to run long and slow, or to burst at full speed and then drop into a deep sleep state.
If you look at specialized processors for say, convolutions, the majority of the benefits are coming from data locality being exploited.
(And no I've never heard the term "dispatch" be used for data movement)
Something not done directly in hardware is done in software. That means it's done using more hardware resources compared to directly in hardware.
QED; directly in hardware is cheaper.
Cheaper to operate, anyway, not necessarily cheaper to produce. You have to move a decent volume before it becomes economic to optimize a solution into hardware. Also, a mistake discovered in the field in hardware is more costly than a mistake in upgradable software.
As an extreme case, to do a simple 32-bit add, you light up tens of millions of transistors if the addition goes through a CPU pipeline. The adder itself of course only requires a few hundred transistors...
Saying that "specialization saves dispatching costs" is minimizing the savings by orders of magnitude. Of course, the article is correct in pointing out that hardware doesn't make things free.
[source: my day job]
Source: my day job
My point stands that from a silicon area perspective, 99% of the CPU is overhead when all you need is a fixed function.
Based on your background, I know you know that. A lot of details in the article show that you know what you are talking about and have specific use cases in mind. I can guess those and for those the article is correct.
You can use Google search to add 0x9 + 0x2 and get hexadecimal 0xb... however, that involves dozens of layers of abstraction and endless formatting and parsing that are fundamentally useless in the long run for something like a GPU display.
The 4th reason hardware is vastly cheaper is it needs less testing.
In the example above you can either trust your FPGA/ASIC tooling to implement a byte-wide full adder properly, because that's kind of a basic task for that technology, or you can whack the byte-wide adder with all possible test cases in a couple of ns on real hardware: all possible binary inputs and outputs are well known and trivial to enumerate. When you ask Google, or worse, Alexa, to add two hexadecimal digits, there are an uncountable number of theoretical buffer overflows, MITM attacks, possible spyware/virus infections, and similar nonsense at multiple layers you probably are not even aware of.
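To illustrate how cheap exhaustive verification is at this scale, here's a Python model of a byte-wide adder whacked with every possible input (a software stand-in for the hardware, obviously — real hardware would do this in nanoseconds, not milliseconds):

```python
# Exhaustively checking a model of a byte-wide adder: the full 256x256
# input space is tiny, which is the point being made above.

def byte_add(a, b):
    """Model of an 8-bit adder: returns (sum mod 256, carry-out)."""
    s = a + b
    return s & 0xFF, s >> 8

for a in range(256):
    for b in range(256):
        s, carry = byte_add(a, b)
        assert (carry << 8) | s == a + b

print("all 65536 cases pass")
```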
The 5th reason hardware is vastly cheaper is environmentalism and energy costs. I have trouble estimating the energy cost of a byte-wide adder in an ASIC or a CPU; surely it can't be more than charging and discharging a couple of sub-pF capacitors. It takes billions of transistors switching like crazy to dump the 100 watts a server motherboard can dump, and a full adder doesn't take many transistors. On the other hand, the infrastructure and environmental damage required to ask Alexa to add two hex digits is very high. You can piggyback on it by passing the buck: well, we need that environmental damage and economic cost to enable Netflix, at which point asking Alexa questions is a drop in the bucket. But people have polluted for centuries on the same argument (well, it's just a little extra lost plutonium, and compared to above-ground nuclear testing it's a drop in the bucket, etc.)
This goes against all experience I had with hardware, and everything I have ever heard from every single embedded/electronics engineer.
Troubleshooting everything that could theoretically go wrong when asking Alexa to add two numbers is simply impossible or incredibly expensive.
Space shuttle computers for autolanding are possible although testing suites are expensive, as you state. Implementing that process to the same level of reliability using a vast distributed software technology like Alexa would be essentially an infinite cost.
It's a given-level-of-reliability problem; you can't compare the apples to oranges of extremely unreliable software (or, even worse, networked) solutions to something as relatively cheaply reliable as hardware.
That said, I suspect it is also a bit misleading. If you are relying on hardware functionality for esoteric things, you test it heavily.
At the application level, less testing is needed: the spec will be met.
But, humans are big and laggy, and I don't know if I could type in the question to Google or even a terminal faster than getting the answer from Alexa.