“It's done in hardware so it's cheap” (2012) (yosefk.com)
132 points by bibyte 9 days ago | 38 comments

Another benefit of specialized hardware, besides dispatch costs, is the cost of moving data around. As your chip gets wider you want more physical (rather than architectural) registers with more ports meaning super linear growth. And your bypass network also grows quadratically in transistor terms. And as your core gets bigger physically you lose more power moving data longer distances.

But in dedicated hardware you can just gang your operations into dataflows where the output of one stage feeds into the physically adjacent next stage with no need to make a trip through the register file or bypass network.

A lot of the benefit of hardware vector operations over scalar operations is in the dispatch cost, but most of the benefit from hardware matrix operations over hardware vector operations is from reduced data movement.

EDIT: Of course, the post is from 2012 back when nobody was doing hardware matrix multiplication so it's understandable.

A lot of people including "me" (TFA author, of course "me" = larger team) were doing hardware matrix multiplication for many years before 2012 [let's say before deep learning.] I count "moving data around" as "dispatching costs" [dispatching = figuring out what to do, on what data and where to put results as opposed to actually doing it.]

It's entirely possible my terminology is wrong here. I was thinking of the cost of decode, register renaming, and scheduling but those aren't really part of dispatch, are they? So it looks like I more got things backwards but I think the point that there are different sorts of cost savings worth distinguishing through using more capable instructions is still salvageable.

I recall a demo app from my teen years (late 1908s to early 90s) that demonstrated, after installing your coprocessor, improved 3D performance by rotating a 3D sprite on the screen at "high" speed. This was an Intel 80386 with an 80387 coprocessor - I thought it was doing matrix math. Nope, the 80387 was an external floating-point unit that arrived two years after the processor with which it was paired. Cringey.

It probably was doing matrix math - the fact that the FPU was external is somewhat irrelevant, it was still decoding out of the same instruction stream and could inter-communicate with the integer unit registers.

Still not as fast as doing 16-bit fixed-point maths, which I used back in the day for a toy 3D system. https://github.com/pjc50/ancient-3d-for-turboc
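For anyone who hasn't seen the trick, the core of 16-bit fixed-point math looks something like this (a minimal Python sketch of a hypothetical Q8.8 format; the linked repo uses its own conventions):

```python
FRAC_BITS = 8  # Q8.8: 8 integer bits, 8 fractional bits

def to_fixed(x):
    # scale a real number into the fixed-point representation
    return int(round(x * (1 << FRAC_BITS)))

def fx_mul(a, b):
    # multiply, then shift back down; in C the intermediate
    # product needs double width (32 bits for 16-bit operands)
    return (a * b) >> FRAC_BITS

a = to_fixed(1.5)    # 0x0180
b = to_fixed(2.25)   # 0x0240
c = fx_mul(a, b)     # represents 3.375 in Q8.8
```

On a 386 without an FPU, these integer shifts and multiplies were far cheaper than emulated (or even coprocessor) floating point.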

>> the post is from 2012 back when nobody was doing hardware matrix multiplication so it's understandable.

Missed that first time round: off by a decade or two! https://en.wikipedia.org/wiki/3dfx_Interactive / https://en.wikipedia.org/wiki/Silicon_Graphics /

GPUs traditionally don't do matrix multiplication, though. They do matrix-vector multiplications only, which is related but still a somewhat different beast.

GEMVs can be used to do GEMMs


The difference for the purpose of this discussion is in the dispatch (data movement) cost per useful operation.

Both GEMV and GEMM can be described as performing (m,k,n) matrix multiplication of an mxk matrix by a kxn matrix. GEMV is simply the case n=1.

The number of useful operations is m * k * n, while the size of the input data is m * k + k * n. So a (4,4,4) GEMM does 64 useful operations while moving 32 input values. Implementing the same GEMM as 4xGEMV also does 64 useful operations, but at the cost of moving 20 input values per GEMV, or 80 overall.

That's where the benefit of hardware GEMM comes from.
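To spell out the arithmetic above (a small Python sketch; the function and variable names are mine):

```python
def ops_and_traffic(m, k, n):
    """Useful MACs and input values moved for one (m,k,n) matrix multiply."""
    return m * k * n, m * k + k * n

# One (4,4,4) GEMM: 64 MACs, 32 input values moved.
gemm_ops, gemm_traffic = ops_and_traffic(4, 4, 4)

# Same result as 4 separate (4,4,1) GEMVs: still 64 MACs total,
# but the 4x4 matrix is re-fetched for every GEMV.
gemv_ops, gemv_traffic = ops_and_traffic(4, 4, 1)
total_gemv_ops = 4 * gemv_ops          # 64
total_gemv_traffic = 4 * gemv_traffic  # 80
```

Same useful work, 2.5x the input traffic when done as GEMVs.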

Computer graphics doesn't require high throughput of matrix-matrix multiplication. You might need a few matrix-matrix multiplications to set up your transformation matrices, but you do that once for matrices that are then applied to many vertices, so there's not much to be gained by optimizing those. The high throughput matrix-vector multiplies happen on the GPU, but you don't need GEMM for that and so GPUs traditionally didn't offer it.

I guess you could argue that if you multiply one matrix by many vectors, for processing many vertices of a model, then you do in fact have an implied GEMM if you group your vertices accordingly. It seems that for some reason, the computer graphics folks never quite saw it that way, maybe because you also do stuff like animation blending which breaks the GEMM analogy.

Math coprocessors worked fairly well because the much slower clock speeds minimized the latency issues. Off-chip level 2/3 cache was still a thing into 486 days. And x86 ASM still treats floating point as if it’s on a separate processor.

PS: Got a real chuckle out of that 1908s typo. I would probably keep it.

> Off chip level 2/3 cache was still a thing into 486 days.

Yep! I have a sad 586 (Pentium) with an empty motherboard slot labelled "CACHE MODULE" sitting behind me in the office. (It's L2 — the 586 has a small on-die L1.) You can theoretically still buy these "Cache on a stick"[1] modules on ebay and the likes; who knows if they work, and they're definitely not worth it in an economic sense.

I missed the on-die L2 by one generation (Pentium Pro was the 686)!

[1]: https://en.wikipedia.org/wiki/Cache_on_a_stick

> ASM still treats floating point as if it’s on a separate processor

I read an essay on the failure of a RISC design. The reason was that it didn't have separate registers for the floating-point unit. The extra bus loading from the floating-point unit limited the speed at which the registers could be accessed, and thus the clock speed. So it was never able to meet its performance specs, and there was no way to fix it.

I hadn't really thought until then that there is a trade-off between clock speed and the number and usage of registers. The PDP-11 and 68000 have a lot of flat, generic registers, whereas x86 had a limited number of specialized registers. x86 could probably clock faster at the expense of higher register pressure.

L2 cache only got packaged with the chip in the Pentium 2, and even then, wasn't on chip, hence the weird packaging.

They were packaged closely together in the Pentium Pro (two dies in the same processor package). The Pentium II was a cheaper downgrade that preserved the separate bus but just put the cache next to the CPU in the weird packaging.

> I recall a demo app from my teen years (late 1908s to early 90s)

You were a teen for at least 82 years. Talk about a protracted adolescence!

Ha! Keyboard lysdexia strikes again.

My teenage years were bad enough. I don't think I could have handled 82 years of them.

Reminds me of when I was doing firmware development and the ASIC team would ask if they could toss in an extra Cortex-M3 core to solve specific control problems. Those cores would be used as programmable state machines. For the ASIC team, tossing in an extra core was free compared to custom logic design. For the firmware team, however, it was another job to write and test that firmware. We had designs with upwards of 10 Cortex-M3 cores. A friend at another employer had something like 32 such cores, and it was a pain to debug.

Why have multicore chips if it saves no energy?

In addition to TFA's reason, bluntly, because we have the transistors. Dennard scaling has ended which has meant that we can't continue to increase clock frequencies. However, transistor counts have continued to increase. This has basically forced CPU manufacturers to focus on multicore because we have the transistors.

Also, big/little, gating off unused silicon and other approaches can save energy even as they use more transistors.

There is also a heat aspect to this. You can run many cores at lower frequency to save on heat. This makes perf more consistent as you can avoid thermal throttle vs a single bursty core.

Yeah, to run at higher switching frequencies (F) you need higher voltages (V), so your power/heat goes up by a non-linear function of F and V (dynamic power is roughly C·V²·F).
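A toy illustration with made-up normalized numbers, using the classic C·V²·F dynamic-power model:

```python
def dynamic_power(c, v, f):
    # classic CMOS dynamic power model: P ~ C * V^2 * F
    return c * v**2 * f

C = 1.0  # normalized switching capacitance (illustrative)

# one core running fast needs a higher supply voltage
one_fast = dynamic_power(C, v=1.2, f=2.0)   # 2.88 units

# two cores at half the clock can run at a lower voltage,
# delivering the same total throughput (2 * 1.0 units of work)
two_slow = 2 * dynamic_power(C, v=0.9, f=1.0)  # 1.62 units
```

The exact voltages are invented, but the shape of the trade-off is why many slow cores beat one bursty fast core on heat.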

Depending on what your savings are for idle power it can either make sense to run long and slow or try and burst to a deep sleep state.

Imho the article misses a key point which is data locality.

If you look at specialized processors for say, convolutions, the majority of the benefits are coming from data locality being exploited.

(And no I've never heard the term "dispatch" be used for data movement)

The "extract bits 3 to 7 and multiply by 13" example is about data locality, to some extent. It's cheaper to keep data in a local circuit, than to ship it around between general-purpose registers.
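In software terms the whole example is just this (Python sketch), yet on a CPU each step makes a round trip through the register file:

```python
def extract_and_scale(x):
    # "extract bits 3 to 7 and multiply by 13":
    # shift, mask off 5 bits, multiply by a constant.
    # In dedicated hardware these collapse into a tiny local
    # circuit: the shift and mask are just wiring, and the
    # constant multiply is a few adders.
    return ((x >> 3) & 0x1F) * 13
```

Three trivial operations, but three instructions' worth of fetch/decode/dispatch and register traffic on a general-purpose core.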

"Done in hardware" means "done directly in hardware"; the directly part is understood, because everyone knows that everything is ultimately done in hardware.

Something not done directly in hardware is done in software. That means it's done using more hardware resources compared to directly in hardware.

QED; directly in hardware is cheaper.

Cheaper to operate, anyway, not necessarily cheaper to produce. You have to move a decent volume before it becomes economic to optimize a solution into hardware. Also, a mistake discovered in the field in hardware is more costly than a mistake in upgradable software.

I immediately thought of Nvidia RTX when I saw the title. "Faster" does not always mean "fast enough".

I don't agree with the tone of the article. Doing complex functions in hardware is a lot cheaper, often by 100X, compared to doing them in software.

As an extreme case, to do a simple 32-bit add, you light up tens of millions of transistors if the addition goes through a CPU pipeline. The adder itself of course only requires a few transistors...

Saying that "specialization saves dispatching costs" is minimizing the savings by orders of magnitude. Of course, the article is correct in pointing out that hardware doesn't make things free.

[source: my day job]

TFA author. How does the word "dispatching" hint at the order of magnitude of anything? What the orders of magnitude do depend on is your competition. Outdoing a CPU at operation X is easier than outdoing a GPU at X [assuming the GPU does X reasonably well], which is easier than outdoing a DSP at X [assuming the DSP does X reasonably well]. If your competition is reasonably optimized programmable accelerators, your opportunities to beat it start shrinking.

Source: my day job

For me, dispatching refers to a very specific piece of the CPU micro-architecture, by far not the most complex or largest area. So by focusing on "dispatching", you make it sound like the overhead is small. Maybe you meant to use another word.

My point stands that from a silicon area perspective, 99% of the CPU is overhead when all you need is a fixed function.

Based on your background, I know you know that. A lot of details in the article show that you know what you are talking about and have specific use cases in mind. I can guess those and for those the article is correct.

It's done in hardware because it's faster, and tech support can just switch it on and off in case of panic; flicking the switch is a moment of anxiety relief for those with a tendency to flick a switch in times of absolute panic.

It's a good article, but the author missed the third reason hardware can be vastly cheaper, which is the lack of abstraction.

You can use google search to add 0x9 + 0x2 and get hexadecimal 0xb... however that involves dozens of layers of abstraction and endless formatting and parsing that are fundamentally useless in the long run for something like a GPU display.
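Stripped of the network, rendering, and everything else, that query reduces to something like this (Python sketch; the real stack adds dozens more layers on top of each of these steps):

```python
def hex_add(lhs: str, rhs: str) -> str:
    # parse two hex strings, add, format the result back as hex
    return hex(int(lhs, 16) + int(rhs, 16))

# the actual addition is a single machine instruction; the string
# parsing and formatting around it (and, in the search-engine case,
# the whole service stack) is pure abstraction overhead
result = hex_add("0x9", "0x2")  # "0xb"
```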

The 4th reason hardware is vastly cheaper is it needs less testing.

In the example above, you can either trust your FPGA/ASIC toolchain to implement a byte-wide full adder properly, because that's kind of a basic task for that technology, or you can whack the byte-wide adder with all possible test cases in a couple of ns on real hardware; all possible binary inputs and outputs are quite well known and trivial. When you ask Google, or worse, Alexa, to add two hexadecimal digits, there are an uncountable number of theoretical buffer overflows, MITM attacks, possible spyware/virus infections, and similar nonsense at multiple layers you probably aren't even aware of.
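And the exhaustive test really is trivial to write down (Python sketch of a behavioral model; real hardware would run the cases in a few ns, here it's just a loop):

```python
def full_adder_8bit(a, b, carry_in=0):
    # behavioral model of a byte-wide adder with carry in/out
    total = a + b + carry_in
    return total & 0xFF, (total >> 8) & 1  # (sum, carry_out)

# exhaustively verify all 2^8 * 2^8 * 2 = 131072 input combinations
for a in range(256):
    for b in range(256):
        for cin in (0, 1):
            s, cout = full_adder_8bit(a, b, cin)
            assert s + (cout << 8) == a + b + cin
```

Complete input coverage in a fraction of a second; no software service of any complexity can be tested that way.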

The 5th reason hardware is vastly cheaper is environmentalism and energy costs. I have trouble estimating the energy cost of a byte-wide adder in an ASIC or a CPU; surely it can't be more than charging and discharging a couple of sub-pF capacitors. It takes billions of transistors switching like crazy to dump the 100 watts a server motherboard can dump, and a full adder doesn't take many transistors. On the other hand, the infrastructure and environmental damage required to ask Alexa to add two hex digits is very high. You can piggyback on it by passing the buck: well, we need that environmental damage and economic cost to enable Netflix, at which point asking Alexa questions is a drop in the bucket. But people have polluted for centuries on the same argument (well, it's just a little extra lost plutonium, and compared to above-ground nuclear testing it's a drop in the bucket, etc.)

> The 4th reason hardware is vastly cheaper is it needs less testing.

This goes against all experience I had with hardware, and everything I have ever heard from every single embedded/electronics engineer.

It may be because, outside of a few domains, we just accept software failing all the time, and hence do little testing of it.

Testing every possible failure mode of a simple full byte wide adder is in fact pretty non-trivial and hard work.

Troubleshooting everything that could theoretically go wrong when asking Alexa to add two numbers is simply impossible or incredibly expensive.

Space shuttle computers for autolanding are possible although testing suites are expensive, as you state. Implementing that process to the same level of reliability using a vast distributed software technology like Alexa would be essentially an infinite cost.

It's a given-level-of-reliability problem; you can't compare the apples and oranges of extremely unreliable software (or, even worse, networked) solutions to something as relatively cheaply reliable as hardware.

I think the implication was that there is less testing needed from the software side.

That said, I suspect it is also a bit misleading. If you are relying on functionality of the hardware for esoteric things, you test them heavily.

It moves the testing around: you don't have to test it yourself, because a huge amount of effort went into that before starting to cut IC masks and each unit is individually tested off the production line.

Gotta test the shit out of hardware features. True.

At the application level, less is needed. Spec will be met.

The Alexa bit really shows that processing speed or "cheapness" doesn't always matter. The VM in AWS that eventually does that hex addition could have done 10^N more of those same additions in the time it takes Alexa to hear the question and respond.

But, humans are big and laggy, and I don't know if I could type in the question to Google or even a terminal faster than getting the answer from Alexa.

I don't think it misses abstraction. It is covered under specialization costs.
