That churn also means end-to-end optimization, from the hardware all the way up to program execution, doesn't happen. Instead we plow through layer upon layer of conceptual abstraction and actual software execution barriers.
Also, why the hell aren't standardized libraries more ... standardized? I get that languages differ in mechanics and syntax, but a shared standard library set could be optimized behind the interface over and over, including at the hardware/software level, etc.
Maybe that's what he means by "DSLs", but to me DSLs are an order of magnitude more complex in infrastructure and coordination if we're talking about dedicated hardware for dedicated processing tasks while still keeping general-purpose capability. DSLs just seem to constrain too much freedom.
LLVM/Clang makes it much easier to bootstrap support for a new architecture, and to add the necessary architecture-specific intrinsics for your new architecture, but it doesn't really make it possible to make architecture-agnostic code work well on weirder architectures.
The problem with developing a good processor architecture is that you always have to maintain legacy compatibility without sacrificing performance - because, you know, software.
With each passing processor generation, this leaves layers of extra HW lying around to support some legacy code.
When I buy a machine, I am now perfectly happy buying an old CPU, and I think this shows you why. You can buy something from as far back as 2012, and you're okay.
However, I do look for fast memory and storage: SSDs at least, and I wish he had added a slide about the drop in memory speed. Am I at an inflection point?
Perhaps the future is: you buy an old laptop with specs like today and then you buy one additional piece of hardware (TPU, ASIC, Graphics for gaming etc).
In niches, of course, CPU speed matters. I want to compile something in half the time. I want to train my AI model faster. But what really bugs me is when I have to wait for a refresh on some site; that really makes me lose focus (and then I come to HN to see some news and time goes by).
Only because most software is conceptually stuck in the 50s and Intel fell asleep at the wheel. Having extra cores can let you move away from kludgy and insanely complicated solutions to much simpler code. In turn, this enables faster development and experimentation.
I'm personally working on a toy actor library to test whether easy but still reasonably efficient parallelization is possible even with heavy communication needs. I see a 2.3x slowdown on a single core versus a lock-based baseline, but it reaches parity at higher levels of parallelism and contention: 10 million actors, 32 bytes each, running on 4 threads.
As soon as you need to acquire two locks that are only known at runtime it starts to become increasingly difficult to find a solution. It will usually be highly specific to the problem at hand and not generalize at all when the problem changes slightly.
If you limit yourself to only acquiring one lock at a time, then generalizing becomes simpler, but you now have to implement something very similar to transaction rollback. It isn't the end of the world and it is definitely possible, but compared to not worrying about it at all in single-threaded code it's a massive headache.
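To make that concrete, here is a minimal sketch (my own toy example, not from the parent's library) of the two-runtime-locks problem and the usual fix, a global acquisition order:

    import threading

    class Account:
        def __init__(self, balance):
            self.balance = balance
            self.lock = threading.Lock()

    def transfer(src, dst, amount):
        # The naive "with src.lock: with dst.lock:" deadlocks as soon as one
        # thread runs transfer(a, b) while another runs transfer(b, a): each
        # holds its first lock and waits forever for the other.
        # Fix: always acquire in a globally consistent order (here, by id()).
        first, second = (src, dst) if id(src) < id(dst) else (dst, src)
        with first.lock:
            with second.lock:
                src.balance -= amount
                dst.balance += amount

The ordering trick only works because both locks are known before either is taken; spread the acquisitions across separate calls and you're back to rollback-style schemes.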
>As soon as you need to acquire two locks that are only known at runtime
Why would you need to acquire two locks that are only known at runtime when using the actor model?
That said, Moore's law is irrelevant. I don't care about transistor density, or even CPU count. I care about processing power. Give me a 4-socket motherboard for AMD Threadrippers ... I would consider that as an option. What matters to me is price.
I understand your point, but when I think of the "average user", I mean users who wouldn't dare to open a case, and wouldn't even know where the CPU is or why there are so many coolers inside.
This is nonsense pushed forward by large corporations who want to own all your data and computational capacity.
Point one: Indeed, untold millions of dollars can be saved by improving data center efficiency. But if the processors are in desktops instead, where the price pain isn't actually enough to drive improvements in efficiency, then we as a society just pay a higher power bill... Furthermore, data center power tends to be greener than general purpose city grids.
Point two: they could just as easily have said GPUs to avoid the fear of data-centre-only proprietary lock-in... The point is the same: we seem to have hit the point where specialized linear algebra coprocessors make a huge difference.
You have something reversed here.
Hardware companies that make consumer devices are pumping barrels of money in improving CPU power usage to enable longer battery life on mobile devices.
Google, on the other hand, promotes insanely power-hungry neural nets as the future of computing, because that's the kind of future that gives them the deciding edge in the market.
His chart shows TPU with ~80x the performance/watt of a CPU, indicating the potential advantages of domain-specific architectures over general purpose for specific applications.
None of this should be surprising to anyone who uses a GPU.
What limits CPU speed?
More efficiency means more overclocking. Also, better cooling/dissipation methods and less gate delay.
We may be hitting the limits of wire thinness, but I get a strong feeling (i.e., not an expert) that we've got a decent way to go before we hit the limits of clock speed.
It's about circuit behaviour with respect to frequency: past a given point, a capacitance starts to look like an inductor at high frequencies. The charge and discharge sloshing happens fast enough that real power gets wasted as heat or other leakage (but mostly heat for the things we build).
My personal gut feeling is that faster Hz probably isn't happening without some kind of radically different / exotic design or process. Further, if that does happen, it's unlikely to scale down. That kind of system might be useful for AIs, large governments/corporations, and maybe game consoles that need a lot of power and are wall-powered so they don't care.
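A first-order model of why raw clock scaling stalled (the textbook CMOS dynamic-power formula; the numbers below are illustrative, not measured):

    # Dynamic (switching) power: P ~ alpha * C * V^2 * f. Raising f reliably
    # usually means raising V too, so power grows superlinearly with clock.
    def dynamic_power_watts(alpha, cap_farads, volts, freq_hz):
        return alpha * cap_farads * volts ** 2 * freq_hz

    base = dynamic_power_watts(0.1, 1e-9, 1.0, 3e9)  # ~3 GHz at 1.0 V
    fast = dynamic_power_watts(0.1, 1e-9, 1.3, 5e9)  # ~5 GHz, needs more voltage
    print(f"{fast / base:.1f}x the heat for {5 / 3:.1f}x the clock")  # ~2.8x

That quadratic voltage term is a big part of why the transistor budget goes into more cores at modest clocks rather than one very fast core.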
SSDs are by far the most cost-effective upgrade you can get nowadays. HDDs tend to be the bottleneck for boot times and general "snappiness".
IIRC the Sandy Bridge generation could only do PCIe 2.0.
NVMe is the only thing that will cause a noticeable improvement for me though, seeing as I still game on a 1080p60 monitor and generally don't need that sort of speed from any USB peripheral.
Still, the processor itself kicks ass, and I think the only reason most people would need to upgrade is for newer peripherals.
or something to that effect. Not sure where I read it tho. Maybe on here
If Google has its way, the future will be: you buy a Chromebook and do all the real work on Google Cloud. Note that John Hennessy is the chairman of Alphabet. And in case someone here forgot, TPUs were developed by Google, and so was TensorFlow.
And it doesn't scale:
And in many cases, if you normalize all the metrics (e.g. precision, process node, etc.), you'll find that the advantage of ASICs is greatly exaggerated and is often within ~2-4x of the more general-purpose processor. E.g. the small GEMM cores in the Volta GPU actually beat the TPUv2 on a per-chip basis, and Anton 2, normalized for process, is within ~5x of manycore MIMD processors in energy efficiency.
In other cases, e.g. the marquee example of bitcoin ASICs, that only works because of extremely low memory and memory bandwidth requirements.
During WWII, it was observed that each doubling of output reduced labour costs (or increased labour efficiency) by 20%.
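That 20%-per-doubling observation is usually called Wright's law, and it has a simple closed form; a toy sketch:

    import math

    def unit_cost(first_unit_cost, cumulative_units, learning_rate=0.20):
        # Wright's law: cost(n) = cost(1) * n ** log2(1 - learning_rate),
        # so each doubling of cumulative output cuts unit cost by learning_rate.
        b = math.log2(1 - learning_rate)  # ~ -0.322 for a 20% learning rate
        return first_unit_cost * cumulative_units ** b

    # After 8 doublings (256x output), unit cost is 0.8**8, i.e. ~17% of the original.
    print(unit_cost(100.0, 256))  # ~16.8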
Moore's law is about the density of transistors (the count doubles for a given cost every two years). Increased density => increased computing power, efficiency, and speed.
Chip design is dependent on numerous factors: feature size (e.g., 14nm vs. 9nm photolithography), silicon purity, fab cleanliness (much as with cascade refrigeration, chip fabs now have multiple concentric zones of increasing cleanliness), and the power and capacity of the software used in chip modelling itself.
The law is also not entirely exogenous as it relies on market forces and demand: need for increased computing power tends to proceed at a predictable rate, and the ability to make use of more capacity is also constrained by existing practices, software, programmer skill, etc.
Then there are the other non-CPU bottlenecks. Disk and memory have long been the main ones; increasingly it's networking. The tendency of old technology and layers not to die, but to be buried in ever deeper levels of encapsulation, means that efficiencies which might be gained aren't, because of multiple transitions and translations; it's the reason a 1980 CP/M or Apple II system had a faster response than today's digital wireless Bluetooth keyboard talking to a rendered graphical display. Bufferbloat, in the network stack, is another example.
But: the main driver for Moore's law is increased density leading to increased efficiency (the same centralising tendency present in virtually all networks), bound and limited by the ability to get power in and heat out (Amdahl's observation that all problems ultimately break down to plumbing).
The transistor can only get so small before it stops working.
There are many issues with the required extreme ultraviolet light sources (lasers) and the allowed amount of impurities in the silicon wafer. And the R&D cost for each iteration of lithography is getting higher while bringing fewer benefits.
From slide 21:

    Function               Energy (pJ)
    8-bit add              0.03
    32-bit add             0.1
    16-bit FP multiply     1.1
    32-bit FP multiply     3.7
    Register file access   6
    L1 cache access        10
    L2 cache access        20
    L3 cache access        100
    Off-chip DRAM access   1,300-2,600
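The gap in that table is the whole argument for data locality and domain-specific memory hierarchies. A back-of-the-envelope sketch using those slide-21 numbers (illustrative only; real access patterns amortize DRAM bursts across cache lines):

    # Energy to process a million fp32 values, costs in pJ per event.
    FP32_MUL = 3.7
    L1_ACCESS = 10
    DRAM_ACCESS = 1300  # low end of the 1,300-2,600 range

    n = 1_000_000
    print(f"compute:   {n * FP32_MUL / 1e6:.1f} uJ")     # 3.7 uJ
    print(f"from L1:   {n * L1_ACCESS / 1e6:.1f} uJ")    # 10.0 uJ
    print(f"from DRAM: {n * DRAM_ACCESS / 1e6:.1f} uJ")  # 1300.0 uJ

Fetching each operand from DRAM costs roughly 350x the arithmetic itself, so an accelerator that keeps data on-chip wins on energy before it wins on anything else.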
Scaling certainly isn't dead. There will still be chips developed at 5nm and 3nm, primarily because you need to put more and different types of processors/accelerators and memories on a die. But this isn't just about scaling of logic and memory for power, performance and area reasons, as defined by Moore's Law.

The big problem now is that some of the new AI/ML chips are larger than reticle size, which means you have to stitch multiple die together; shrinking allows you to put all of this on a single die. These are basically massively parallel architectures on a chip. Scaling provides the means to make this happen, but by itself it is a small part of the total power/performance improvement. At 3nm, you'd be lucky to get 20% P/P improvements, and even that will require new materials like cobalt and a new transistor structure like gate-all-around FETs.

A lot of these new chips are promising orders-of-magnitude improvements, 100 to 1,000x, and you can't achieve that with scaling alone. That requires other chips, like HBM memory, with a high-speed interconnect like an interposer or a bridge, as well as more efficient/sparser algorithms. So scaling is still important, but not for the same reasons it used to be.
The raw computational capabilities of the TPU don't really prove anything. Of course co-design wins. Whether it is vision or NLP, NN training has dominant characteristics. The arithmetic is known: GEMM. The control is known: SGD. Tailoring control and the memory hierarchy to this is a no-brainer, and of course the economic incentives at Google push them in this direction, and of course the expertise available at Google powered this success. For other applications it is not so clear.
Finding similar dominance in other applications is trickier. To accelerate an application with a specialized architecture you need dominating characteristics in the app's memory-access, computational, and control profiles.
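For concreteness, "the arithmetic is known: GEMM" means nearly all the training FLOPs reduce to one kernel. A NumPy sketch of it (the TPU essentially replaces this one call with a systolic array in hardware):

    import numpy as np

    # A dense layer's forward pass is a single GEMM, Y = X @ W; backprop is
    # two more (dX = dY @ W.T, dW = X.T @ dY). One well-fed matrix unit
    # therefore covers the dominant compute profile of NN training.
    X = np.random.rand(64, 512).astype(np.float32)   # batch of activations
    W = np.random.rand(512, 256).astype(np.float32)  # layer weights
    Y = X @ W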
Another interesting aspect of moving to a more efficient substrate would be that power requirements for the devices will also fall, as per Koomey's law: https://en.wikipedia.org/wiki/Koomey%27s_law
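For reference, Koomey's original data had computations per joule doubling roughly every 1.57 years; a toy projection, assuming (optimistically) that the historical rate holds:

    def efficiency_multiplier(years, doubling_period=1.57):
        # Koomey's law: computations/joule doubled every ~1.57 years (1946-2009).
        return 2 ** (years / doubling_period)

    print(f"{efficiency_multiplier(10):.0f}x computations/joule in a decade")  # ~83x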
Wires can't get smaller without compromising RC delay (and thus speed). Horrifyingly, this is far more of an issue than the transistor.
Graphene and photonics don't help this at all. It isn't a matter of how small a tube you can make: you physically need 5nm to insulate and 5nm of functional material, so a 5nm device with a 5nm spacer and a 5nm gap to the next device is about it. The smallest pitch of any physical device is ~20nm. The critical pitches on a wafer today are about 30nm and 40nm, so in an ideal world we can gain about 3x, ever. It doesn't matter which material you choose.
And yeah, you can stack up, but not in quite the way you dream, and thermal and processing issues make this hard in most domains. When I build, I deposit at temperature, which affects the underlying layers, so stacking doesn't quite work as you might expect. Again, real materials in a real flow behave differently, and not in a trivial 'just make it work' reducible fashion.
Memristors may not really exist, and are only useful in the context of high-speed memory. That has real physical challenges, and people have spent billions over decades on this problem.
Anyway, this is missing some background, but the presentation is great.
You must have missed slide 41, which has a "beyond silicon" bullet.
I'm not saying these other computing substrates won't work, just pointing out that the simple fact that we are exploring them does not mean that they will. Technological progress is neither guaranteed nor automatic.
40% wasted work - does that mean they checked the branch predictor and found that 40% of the time was spent on (wrongly) speculated branches?
It also suggests that, for all of the power-efficiency faults of branch predictors (i.e., running power-consuming computations that turn out to be unnecessary), the best you could do is maybe a 40% reduction in power consumption (hardly anything else is 40% inefficient).
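The arithmetic behind that bound, as I read it (my own toy model, not from the slides):

    def energy_per_useful_op(squashed_fraction):
        # If a fraction m of executed work is later squashed, each retired
        # instruction effectively costs 1 / (1 - m) units of energy.
        return 1.0 / (1.0 - squashed_fraction)

    print(f"{energy_per_useful_op(0.40):.2f}x")  # ~1.67x with 40% waste

So perfect speculation could cut dynamic energy by at most ~40%, before counting the throughput you'd lose without any prediction at all.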
When someone says Intel i5 or i7, I immediately wonder if they're talking about 2008 i7 or 2019 model.
Intel would be smart to retire whole i3/i5/i7/i9 branding. People seem to think every i5 or i7 is the same.
Unfortunately, this is a feature, not a bug. Intel wants their branding to have this effect... the lay-person isn't supposed to understand Sandy Bridge (i7-2700k) vs Skylake (i7-6700k)
Before that era people didn't know much about the details either, but they did understand 800 MHz was faster than 533 MHz.
The limits they are running up against are indeed crises, but they'll probably find that they can copy whatever it is that biology is doing and squeeze out quite a bit more. The tradeoffs will get a lot weirder though.
I can do far less than 0.0001 single precision floating point operations per second, so whatever the context for "1 exaflops" is, it isn't general purpose computation.
EDIT: this seems sort of like saying that throwing a brick through a window achieves many exaflops because simulating the physics in real time would require that performance. I'd like to read more about this value and how someone came up with it, but googling just gives me that same scienceabc article and stuff referencing it.
0.01 floating point operations per second seems harder, but perhaps humanly doable.
The basic form of computing is becoming distributed. More are coming.
Also, can someone tell me what P4 is? It looks like almost every company and a bunch of universities are "contributors" there.
Basically P4 allows you to (re)program your network data plane to do whatever you want, and you can create new network protocols or change the way existing ones work without having to change your hardware and without losing line rate performance.
It's also somewhat like eBPF, but it compiles to hardware as well as software.
P4 is a domain-specific programming language for accelerated packet header processing in switches and NICs.
I think consumer facing performance processors will fade.
Data centers will continue to push for more performance. It could mean less rack space, less power consumption, and less to manage.
Cell phone/tablet focused processors will become powerful enough to handle the majority of daily tasks while enjoying extended battery life.
All Moore's law talks about is the density of transistors on a chip, and it's never been a linear progression of numbers. Recently I've seen news articles about some research into 5nm processes and other methods for increasing density of components on silicon, so it seems Moore's law (really Moore's rule of thumb or Moore's casual observation) isn't done yet.