The post-exponential era of AI and Moore’s Law

m3at · on Nov 11, 2019

The "Bitter Lesson" post from Rich Sutton from earlier this year [1] seems a very good complement to this article: he explains how all of the big improvements in the field came from new methods that leveraged the much larger compute available from Moore's law, instead of progressive buildup over existing methods.

A great quote from McCarthy also regularly referenced by Sutton is "Intelligence is the computational part of the ability to achieve goals", which (IMO) helps picture the tight link between compute growth and AI.

It's only a few minutes' read; I highly recommend it:

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

_bxg1 · on Nov 11, 2019

It could be that the slowing of Moore's Law results in a more diverse array of specialized computing hardware. After all, the resurgence of ML didn't happen because x86 got fast enough, but because video games - of all things - funded the maturation of a whole new category of massively-parallel chips.

Veedrac · on Nov 11, 2019

Moore's law never slowed. At worst, one can rightly say Intel had issues with 10nm.

https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrh...

codesushi42 · on Nov 11, 2019

the resurgence of ML ... because video games - of all things - funded the maturation of a whole new category of massively-parallel chips

Fake news. It has far more to do with the rise of distributed computing than the existence of GPUs.

adwn · on Nov 11, 2019

Don't call a statement you disagree with "fake news". It a) doesn't even make sense in this case, and b) distracts from the discussion.

jchw · on Nov 11, 2019

I don’t really see this; all of the currently popular machine learning frameworks either support GPUs or are oriented around GPU-based execution. For most developers working on AI it is still the de facto standard.

sgt101 · on Nov 11, 2019

I think.. both. Distributed computing (Hadoop) was why a lot of data got collected and made available(ish).

GPUs are the engines that made CNNs (in particular) tractable, opened up a bunch of applications for many companies, and opened up a reasonable route to results for a generation of researchers.

unkulunkulu · on Nov 11, 2019

Can anybody elaborate on why this is downvoted? This would be my guess as well: the SIMD parallelism of GPUs solves only part of the challenge; you still need a general-purpose data-crunching machine to prepare and handle the training data.

fspeech · on Nov 11, 2019

For one, the GPU speedup over CPU isn’t that dramatic for small to medium sized problems, e.g. MNIST or CIFAR, that one would try algorithm ideas on. So I think it’s a stretch to see GPUs as essential to the new algorithms. On the other hand, for large problems like the original AlphaGo you need to figure out the distributed computing to really scale.

This isn’t to say that GPUs aren’t nice. They do save time or for the same amount of time let you produce more polished results, which means in a competitive environment everyone would use them.

codesushi42 · on Nov 12, 2019

Exactly. GPUs are necessary now, but did not originally herald the deep learning revolution.

codesushi42 · on Nov 11, 2019

Except that's "currently". NNs can be, and originally were, trained on CPUs.

solidasparagus · on Nov 11, 2019

I don't think this is very true. You can trace the deep learning revolution back to VGG and a fundamental driver in the success of the first multi-level networks was the ability to train in semi-reasonable amounts of time using GPUs.

Even today distributed training is relatively uncommon while pretty much everyone uses NVIDIA GPUs.

codesushi42 · on Nov 11, 2019

VGG was a long time after the first deep learning revolution.

solidasparagus · on Nov 13, 2019

You could say it started with ImageNet 2012 where AlexNet showed that deep networks were a newly promising area of study - however the actual performance of AlexNet was very very far from human performance. I tend to say the revolution started at ImageNet 2014 with VGG/GoogLeNet, the first human-caliber performances. Or you could say it was ImageNet 2015, when the first ResNet had better-than-human performance.

YeGoblynQueenne · on Nov 11, 2019

Great read. To follow up also read a comment on that piece titled "A better lesson" by Rodney Brooks:

https://rodneybrooks.com/a-better-lesson/

djokkataja · on Nov 11, 2019

> One of the most celebrated successes of Deep Learning is image labeling, using CNNs, Convolutional Neural Networks, but the very essence of CNNs is that the front end of the network is designed by humans to manage translational invariance, the idea that objects can appear anywhere in the frame. To have a Deep Learning network also have to learn that seems pedantic to the extreme, and will drive up the computational costs of the learning by many orders of magnitude.

Thanks for the link, this is a pretty strong rebuttal.

That said...

> Massive data sets are not at all what humans need to learn things so something is missing. Today’s data sets can have billions of examples, where a human may only require a handful to learn the same thing.

I've come across this idea before, the notion that we should be able to create an AI that can learn something from just a handful of examples because humans can do that. This assumes some rough level of equality in lack of related knowledge / capability between a tabula rasa AI system and a human adult. But a human adult has been awake and continuously learning & testing conscious and unconscious predictions about everything they interact with for at least 18 years (1 year = 31536000 seconds * 2/3 (accounting for sleep) = 21024000 seconds awake per year, * 18 years = 378 432 000 seconds). In other words, a human adult is already a highly trained system. And even very young human children have been awake and learning for quite a long time compared to anything we might imagine to be a tabula rasa system.

So I don't think that the human capacity for learning supports the idea that a tabula rasa AI system should be able to learn any particular thing with just a handful of examples.
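
For anyone who wants to check the arithmetic above, here is a minimal sketch. The function name and the fixed 2/3 awake fraction are just illustrative assumptions taken from the estimate in this comment; leap years are ignored.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignores leap years)


def awake_seconds(years, awake_fraction=Fraction(2, 3)):
    """Seconds spent awake over `years`, assuming a fixed awake fraction."""
    return int(SECONDS_PER_YEAR * awake_fraction * years)


if __name__ == "__main__":
    print(f"awake per year:  {awake_seconds(1):,} s")
    print(f"awake by age 18: {awake_seconds(18):,} s")
```

Exact fractions are used so the 2/3 factor does not introduce floating-point rounding.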

YeGoblynQueenne · on Nov 11, 2019

>> So I don't think that the human capacity for learning supports the idea that a tabula rasa AI system should be able to learn any particular thing with just a handful of examples.

I don't know if Brooks is talking about learning everything from scratch here. I for one don't think that humans come into the world as tabulae rasae. I think instead that we come endowed with a rich knowledge of the world already, as well as strong inductive biases that allow us to learn new concepts from a combination of existing knowledge and new observations, with our characteristically high sample efficiency that remains unmatched by neural nets.

For instance, one theory about our ability to learn human language as infants is that we are born with a language endowment, Chomsky's "universal grammar", that predisposes us to pick out and learn human language, without having to figure it out from scratch.

I wonder then if what Brooks is saying is missing from neural nets is the ability to use background knowledge and strong inductive biases as effectively and efficiently as we do, so that they don't need hundreds of thousands of examples before they can learn a new concept.

Terr_ · on Nov 11, 2019

I often imagine young babies are staring around not in "childlike wonder" as much as an LSD drug trip, with their little minds laboring to discover any kind of correlation in an assault of data.

KKKKkkkk1 · on Nov 11, 2019

Sutton argues that algorithm development needs to be shaped by the assumption that compute power will continue growing exponentially into the future. At this point, it is commonly believed that this is not the case.

> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

Veedrac · on Nov 11, 2019

I'm fairly sure the consensus from the hardware side, especially considering how companies like NVIDIA, Intel and Arm are reacting, is that we still have a lot of exponential growth to go, especially for AI. A factor 1000 increase in model sizes seems within reason within 10ish years; I go over some arguments in the link below.

https://www.reddit.com/r/MachineLearning/comments/ds1xvc/d_d...

ChuckMcM · on Nov 11, 2019

Thank you for that, it was excellent.

buboard · on Nov 11, 2019

Those are great arguments, except for vision. Knowledge about the physiology of the visual system DID go into designing convolutional networks. That probably doesn't extend to other domains though, and things prove once again that our intuitions may be misleading.

perlpimp · on Nov 11, 2019

What a great article. It gives one a way to introspect how and why these methods work, which helps when learning machine learning/AI going forward.

ChuckMcM · on Nov 11, 2019

The article discusses one of the impacts of compute performance growth rates slowing while compute demand (training AI models) is growing exponentially.

The 'end of Moore's law' (as a measure of performance, not density) is probably the most significant "Tech" story of the next decade. Why? Because it is going to demand engineers who can write fast code over engineers who write code fast.

A lot of frameworks and abstractions will be ripped out and thrown away as 'taking too much time for not enough benefit.'

caseymarquis · on Nov 11, 2019

In my experience, poor performance is only very occasionally the fault of abstractions and frameworks. The problem is near universally between the chair and the keyboard.

I don't think we'll be throwing away the old abstractions, but adding new ones in beside them.

As an example, I made a demo a couple years back which processed and rendered about 64 MB of CNC G-code (a programming language for telling robots how to move and spin) in a web browser to match CNC toolpaths (the path of the robot's motion) to their realtime run statistics (show me a realtime 3d map of where we can speed up the robot). To do this, you need to simulate the movements of the machine to build a 3d model, and then pair the model to collected data. You then need to cull/average data which is too fine to be visible, and ship this processed data to the browser as arrays of floating point numbers, where it gets rendered as -effectively- thousands of colored 3d arcs.

The naive implementation took minutes to run. The final version took about 1 second. I didn't have to throw away any abstractions or frameworks to make things faster. I had to use my language's new abstractions for massively parallel processing and stack allocated data to speed up server side processing, and then use libraries for decoding base64 binary into floating point arrays on the browser side to speed up data transfer and parsing.

I guess my point is that the end of Moore's law doesn't mean throwing old things away, it means using existing solutions when appropriate, and approaching new problems in different ways.
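
To make the data-transfer trick concrete, here is a rough sketch of shipping floats as base64-encoded binary instead of text. This is not the code from the demo (which used a different stack); the function names and the choice of little-endian float32 are assumptions for illustration.

```python
import base64
import struct


def encode_floats(values):
    """Pack floats as little-endian float32 and base64-encode the bytes."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_floats(text):
    """Reverse of encode_floats: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


if __name__ == "__main__":
    # Sample values chosen to be exactly representable in float32.
    points = [0.0, 1.5, -2.25, 1024.0]
    payload = encode_floats(points)
    print(payload, decode_floats(payload))
```

On the browser side the equivalent step would be decoding the base64 string into a typed array, which avoids parsing each number from text.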

coldtea · on Nov 11, 2019

>Why? Because it is going to demand engineers who can write fast code over engineers who write code fast.

Or just making slower things more orthogonally parallelized and throwing more hardware units at the problem (as opposed to more powerful hardware or faster code).

Most code by engineers "who write code fast" is not in performance critical domains...

jerf · on Nov 11, 2019

A lot of the ways programmers write slow code end up with multiplicative factors you can't practically recover by simply throwing more hardware at it, because it's really easy for us to write something 100x or 1000x slower (even in the same big-O class) by using a slow language with poor cache locality that makes it easy to do more work than was necessary, by poorly using threading resources, etc. There are a lot of multiplicative factors that can stack together fairly quickly.

On the plus side, it turns out recovering most of that isn't that hard with a bit of careful language selection and just a bit of care in writing code. I think one of the somewhat subtle reasons for Go's success is that it trims away most of the slow things that a lot of "scripting" languages do without throwing away too much of the power, and I anticipate the continued entry and success of other languages like Nim into this space on the language chart in the next few years. To get to screaming fast may take a lot of effort, but "pretty fast, let's parallelize what we have instead of hyperoptimizing it" isn't too hard.

Because of all of these multiplicative slowdowns, even at very small scales it's still generally a better idea to start by optimizing and removing some of your multiplicative slowdowns before you just throw more hardware at it. Compute is merely cheap, not free. It's the rare bit of code that nobody's spent any time optimizing, but is also free of these multiplicative slowdown problems.
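
A toy illustration of how these factors stack. The individual factors below are invented for the example, not measurements, and the machine count assumes perfect parallel scaling, which is optimistic.

```python
from math import prod

# Hypothetical slowdown factors, invented purely for illustration.
SLOWDOWNS = {
    "interpreted language": 10,
    "poor cache locality": 5,
    "redundant work": 2,
}


def combined_slowdown(factors):
    """Independent slowdowns multiply rather than add."""
    return prod(factors)


def machines_needed(baseline_machines, factors):
    """Machines required to match throughput, assuming perfect scaling."""
    return baseline_machines * combined_slowdown(factors)


if __name__ == "__main__":
    total = combined_slowdown(SLOWDOWNS.values())
    print(f"combined slowdown: {total}x")
    print(f"machines to replace 2 fast ones: {machines_needed(2, SLOWDOWNS.values())}")
```

Three modest-looking factors already multiply out to two orders of magnitude, which is the range where buying hardware stops being the cheap option.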

brudgers · on Nov 11, 2019

I agree mostly. But how long can engineers throw more and more and exponentially more hardware at the phone that fits in my pocket?

coldtea · on Nov 11, 2019

>But how long can engineers throw more and more and exponentially more hardware at the phone that fits in my pocket?

Why would they need to? They can sell improved cameras and other additions instead of improved CPUs...

brudgers · on Nov 11, 2019

I agree there's no need. What phone camera improvements do you envision that don't involve more transistors on the CMOS sensor, GPU, CPU etc? Microfilm?

oblio · on Nov 11, 2019

If my experience with the IT world has taught me anything, you'd be surprised for how long :-)

Shorel · on Nov 11, 2019

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%."

The end of Moore's law is the end of passing up opportunities in that critical 3%. There's a lot of performance debt accumulated.

It's only 3%. Optimize when it matters!

pmoriarty · on Nov 11, 2019

It's not just training AI models that's demanding exponentially increasing compute power, but also streaming, storing, and processing the enormous amounts of scientific data coming from (for example) high-end physics experiments at world-class laboratories like CERN or the LHC, and there's also the tremendous amounts of surveillance data and data processing that's sprung up in recent decades, with an insatiable appetite of various agencies and companies to track, record, and analyze everyone and everything.

ben_w · on Nov 11, 2019

True, but from what people were telling me before I left university, scientific processing always produced more data than could be analysed, even after accounting for the effort put into efficient algorithms.

fulafel · on Nov 12, 2019

Is there a basis for thinking that the relative demand for fast-code programmers will increase vs code-fast programmers?

There doesn't seem to be anything on the horizon that would make performance a more pressing concern in the work of the 99% of developers who do make business software and consumer web/mobile apps. They just sometimes call into libraries that have had performance work put into them and get by with a rudimentary understanding of what seems to cause slowdowns.

In the other dimension, there does seem to be a continued focus and resulting improvements in velocity.

The recent ML popularity is all about leveraging packaged optimized code (TensorFlow etc.) with the vast majority of programmers freed from the engine-level details and detailed understanding of GPU perf work or parallel/distributed programming.

carlmr · on Nov 11, 2019

>Because it is going to demand engineers who can write fast code over engineers who write code fast.

I surely hope so, the waste mentality has to end.

brazzy · on Nov 11, 2019

Premature optimization will never become the right thing to do. Most code has absolutely no reason to be particularly fast because it doesn't get executed very frequently.

carlmr · on Nov 11, 2019

Nobody said that. Just that you should think about which algorithms you use, profile, and not waste so much. The wasted power at this point is significant.

brazzy · on Nov 11, 2019

That is sensible, but I've also heard people basically call any code that is not micro-optimized wasteful and harp on about how back in the day it was second nature for programmers to care about bytes and cycles in every line of code they wrote, and modern day programmers are just lazy and incompetent.

carlmr · on Nov 12, 2019

Yeah, I get where you're coming from. There's as always a sweet spot between the two extremes. But needing a new computer for running Word or Excel smoothly just seems like there's a lot of low hanging fruit left.

WizardAustralis · on Nov 11, 2019

We will probably start to see this a lot in the video game console field. On the hardware front, the early specs of the PS5, for instance, show them emphasizing faster storage access more than added CPU/GPU/RAM.

Consoles were always more likely to see speed optimizations, but this had slid aside a bit in the last two decades.

foota · on Nov 11, 2019

I'm not sure that's a given, who's to say we won't be happy with the current performance we have?

skohan · on Nov 11, 2019

There are still problems, like protein folding, or training larger and larger deep learning models, which have nowhere near the performance you'd actually want in a perfect world. There's always going to be a bigger target in some of these domains, and it would be a shame to imagine we'll reach a ceiling in some areas of scientific inquiry because of hardware limitations.

foota · on Nov 12, 2019

This is true, but imo most of the gains outside of traditional CPU improvements will be either in specialized hardware or algorithms; I don't see much room for improvement in, like, optimizing code there.

cft · on Nov 11, 2019

Rust will fare well. Ruby won't.

csomar · on Nov 11, 2019

That has a lot to do with why I took on learning Rust. My MacBook Pro from last year has a 6-core CPU. At this rate, we should have 32-64-core CPUs for consumers in a few years. Not using all the cores means your program only makes use of about 3-1.5% of the available power.

My prediction is that, in a few years, lots of programming languages and software developers will simply become obsolete because what the other guys can produce is heaps better in performance, security and manageability.

But I might be wrong, so make your career decisions carefully.
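
The utilization figure above is just 1/N for a single-threaded program on N cores. A minimal sketch, assuming identical cores and a purely CPU-bound program (the function name is mine):

```python
def single_thread_share(cores):
    """Fraction of total CPU capacity one busy thread can use on `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1 / cores


if __name__ == "__main__":
    for cores in (6, 32, 64):
        print(f"{cores:>2} cores: {single_thread_share(cores):.1%} of capacity")
```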

oblio · on Nov 11, 2019

I hope you're right.

However Intel Core Duo was released in January 2006 and it had 2 cores. So in a bit more than 13 years we only have 3x as many cores. Core 2 Quad was released in January 2007, so if you count from that one, a bit more than 12 years for 1.5x as many cores.

I wouldn't bet on widespread 64 core CPUs for consumers in the next 10 years or so.

Most commercial software just doesn't use it efficiently and consumers don't need it for existing software.

We need a software breakthrough, a killer app, something that people really want and that really needs a ton of cores.

For now, I don't see it. But then again, that's how killer apps are, nobody sees them until they're there.
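
A back-of-the-envelope version of the extrapolation, using only the two data points mentioned above (2 cores in 2006, 6 cores roughly 13 years later). It assumes steady exponential growth, which is a strong assumption, and the result is very sensitive to which endpoints are picked. Function names are illustrative.

```python
from math import log2


def doubling_time(start_cores, end_cores, years):
    """Years per doubling of core count, assuming steady exponential growth."""
    return years / log2(end_cores / start_cores)


def years_to_reach(current_cores, target_cores, doubling_years):
    """Years until `target_cores` at a given doubling time."""
    return log2(target_cores / current_cores) * doubling_years


if __name__ == "__main__":
    from_duo = doubling_time(2, 6, 13)   # Core Duo as the starting point
    from_quad = doubling_time(4, 6, 12)  # Core 2 Quad as the starting point
    for label, t in (("from Core Duo", from_duo), ("from Core 2 Quad", from_quad)):
        print(f"{label}: doubling every {t:.1f} y, "
              f"64 cores in about {years_to_reach(6, 64, t):.0f} y")
```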

csomar · on Nov 11, 2019

> However Intel Core Duo was released in January 2006 and it had 2 cores. So in a bit more than 13 years we only have 3x as many cores. Core 2 Quad was released in January 2007, so if you count from that one, a bit more than 12 years for 1.5x as many cores.

https://xkcd.com/605/

If you are going to extrapolate, don't pick the date. Take all the numbers you have and trace the average value. If you are picking the numbers, you can extrapolate to any point you want.

> I wouldn't bet on widespread 64 core CPUs for consumers in the next 10 years or so.

We already have 32-core processors for servers. Imo, the biggest challenge will be the laptop and heat management rather than putting a bunch of processors together.

> We need a software breakthrough, a killer app, something that people really want and that really needs a ton of cores.

We need that now. Most software I'm using is slow. Part of why I picked the iPhone (and I bet many people did) was speed. It is fast. People care about speed and if you introduce them to something faster then they'll get hooked. Also, you are already using all of your cores because of multi-tasking. It makes your computer experience overall better. At 6 cores, multi-threading for an application doesn't give much noticeable experience, since you probably have 5-6 things running on your computer doing other stuff like network or an audio-player.

> For now, I don't see it. But then again, that's how killer apps are, nobody sees them until they're there.

> I hope you're right.

Don't worry, just hang in there. The start of multi-cores was slow but then that's all there is right now. Seems Intel/AMD have mastered how to add more cores but the CPU frequency is pretty much capped going forward!

fouc · on Nov 11, 2019

Ruby isn't in the critical path, so that's a bad example.

In general, the real issue is the layers of abstractions between the hardware and the end user. The overcomplicated architectures. The constant re-invention of databases/operating systems/virtual machines at each layer.

clarry · on Nov 11, 2019

> The constant re-invention of databases/operating systems/virtual machines

I wish that were the case but operating systems and systems software in general seem to be the most stagnant field. Everyone just buys the same FLOSS stack for $0 and compatibility is king, so there's very little research going on in this field and much less of it ends up in any product you're ever going to see in use.

I would love to see a vibrant scene of competing operating systems and databases with fresh new ideas and research.

fouc · on Nov 12, 2019

I meant that it is re-implemented at every layer.. Like a browser is practically an OS on its own.

oblio · on Nov 11, 2019

> The constant re-invention of databases/operating systems/virtual machines at each layer.

Those are because of the human struggle, the political and economic fights. They're not because of purely technical problems.

Each layer is usually backed by a party with a very clear interest, usually control, and each layer on top tries to escape that control. It's been that way since the times of p-code.

ivalm · on Nov 11, 2019

I don't have a horse in this race, but why would Ruby do well or poorly based on this? It's not like basic webapps need to be much more efficient, and no one was writing high-performance compute in Ruby anyway.

DennisP · on Nov 11, 2019

Looking over the TechEmpower benchmarks, the fastest web frameworks are over fifty times faster than Ruby/Rails. If you can replace a hundred web servers with two, that matters.

ghettoimp · on Nov 11, 2019

Sure. But an awful lot of folks don't have a hundred web servers. If you're running something simple, you can pretty much use whatever you want.

ativzzz · on Nov 11, 2019

The vast majority of web apps are CRUD apps, which can achieve sub 100ms responses in any language with caching, background workers and minor optimizations.

There will always be problems where you need faster responses, and the solutions to those problems are already not using Ruby. You don't use a hammer when a screwdriver is appropriate, but nails work pretty well in most places and screws are unnecessary.

DennisP · on Nov 11, 2019

100ms with how many responses per second? Throughput matters too. And techempower does some pretty realistic-looking benchmarks. For example:

> In this test, the framework's ORM is used to fetch all rows from a database table containing an unknown number of Unix fortune cookie messages (the table has 12 rows, but the code cannot have foreknowledge of the table's size). An additional fortune cookie message is inserted into the list at runtime and then the list is sorted by the message text. Finally, the list is delivered to the client using a server-side HTML template. The message text must be considered untrusted and properly escaped and the UTF-8 fortune messages must be rendered properly.

https://www.techempower.com/benchmarks/
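
For a sense of how little work that test actually asks for, here is a rough sketch of the steps in the description. It is not TechEmpower's reference implementation: the in-memory list stands in for the ORM query, and the rows, names and markup are made up for illustration.

```python
import html

# Hypothetical stand-in for the database table (the real test goes through an ORM).
FORTUNES = [
    (1, "A bad workman blames his tools."),
    (2, "<script>alert('xss')</script>"),
    (3, "フレームワークのベンチマーク"),
]


def fetch_all_fortunes():
    """Placeholder for the ORM call that returns every row in the table."""
    return list(FORTUNES)


def render_fortunes():
    """Fetch rows, add one at request time, sort by message, render escaped HTML."""
    rows = fetch_all_fortunes()
    rows.append((0, "Additional fortune added at request time."))
    rows.sort(key=lambda row: row[1])
    body = "".join(
        f"<tr><td>{row_id}</td><td>{html.escape(message)}</td></tr>"
        for row_id, message in rows
    )
    return f"<table><tr><th>id</th><th>message</th></tr>{body}</table>"


if __name__ == "__main__":
    print(render_fortunes())
```

The untrusted message text goes through `html.escape`, and non-ASCII text passes through unchanged as a normal Python string.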

ativzzz · on Nov 12, 2019

> 100ms with how many responses per second

As always, it depends. Benchmarks are cool, and yes we know that Rust and Go will always be faster than Ruby out of the box. Once the number of responses goes up, you throw caching at the problem. But the benchmark says

> The requirements were clarified to specify caching or memoization of the output from JSON serialization is not permitted.

> the table has 12 rows

Which is low and should be quick regardless, but regardless of your tech stack if you are creating a "modern" UI, you handle this update asynchronously with JS anyway, and if you're not, do you really care about sub 100 ms responses?

If you want performance out of the box, yes you will use Go, but like I said earlier, you use a hammer for nails and a screwdriver for screws. If you want performance and type safety, you choose one, if you want faster prototyping and a more mature web dev ecosystem, you choose another.

buzzkillington · on Nov 11, 2019

FPGAs will fare well, general processors won't.

You know we've reached the end of Moore's law when balanced ternary computers become the norm.

streetcat1 · on Nov 11, 2019

Correct. As long as your bottleneck is the CPU.

Symmetry · on Nov 11, 2019

For a long time we had a situation where transistors got smaller and cheaper and faster and more power efficient all at the same time through the wonders of Dennard scaling and we gestured broadly at the whole thing and called it all "Moore's Law" without needing to distinguish which exponential improvement the term referred to. But in the mid 2000s Dennard scaling broke down and now it looks like transistors are still getting smaller and cheaper and more power efficient but they aren't getting exponentially faster any more. So we've mostly settled on the smaller bit as being the true "Moore's Law" and that's more or less kept going but might be running out of steam. It's still very nice from the perspective of very parallel tasks but as consumers we don't see as much benefit from it in our computers except in graphics. But sooner or later Moore's Law will run out too; transistors can only get so small as long as they're made of atoms.

Luckily the laws of physics have precise limits for just how efficient computation can be, Landauer's principle [1], that are just as binding as the laws for how efficient a heat engine can be. Progress in engine efficiency did indeed form a converging sigmoid on what's allowed by Carnot. But we're still quite a distance away from Landauer's limit and if transistors can't get us there we have every reason to believe that other computational substrates are possible and can.

In the meantime we might be looking at an interregnum. But while AI is gobbling up computational cycles because they're available there's no reason to think that efficiency gains aren't possible - just look at the orders of magnitude improvement in training resources from AlphaGo to AlphaZero. I don't see that this should stop progress, though it might slow it.

[1] https://en.wikipedia.org/wiki/Landauer%27s_principle
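
To put numbers on the two limits mentioned above: Landauer's bound is k_B * T * ln 2 per erased bit, and the Carnot bound is 1 - T_cold / T_hot. A minimal sketch; the 1e-15 J per-operation figure in the demo is a placeholder I made up, not a measured value for any real chip.

```python
from math import log

BOLTZMANN = 1.380649e-23  # J/K (exact value in the SI since 2019)


def landauer_limit(temperature_k):
    """Minimum energy in joules to erase one bit at the given temperature."""
    return BOLTZMANN * temperature_k * log(2)


def carnot_efficiency(t_hot_k, t_cold_k):
    """Maximum efficiency of a heat engine between two reservoirs (kelvin)."""
    return 1 - t_cold_k / t_hot_k


def headroom(joules_per_bit_op, temperature_k=300.0):
    """How many times above the Landauer bound a given energy cost sits."""
    return joules_per_bit_op / landauer_limit(temperature_k)


if __name__ == "__main__":
    print(f"Landauer limit at 300 K: {landauer_limit(300):.3e} J per bit")
    print(f"Carnot limit, 600 K -> 300 K: {carnot_efficiency(600, 300):.0%}")
    # Placeholder energy per bit operation, for illustration only.
    print(f"headroom at 1e-15 J per op: {headroom(1e-15):.1e}x")
```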

rrss · on Nov 11, 2019

Transistors haven't really been getting smaller for the last few nodes, they've mostly just been getting denser.

AFAIK nobody currently knows how to make FinFETs much smaller without making them awful.

marcosdumay · on Nov 11, 2019

Transistors haven't been getting cheaper either. New processes have had a higher cost per transistor for a while.

Or better, as old processes get improved, transistors are getting either more compactly packed or cheaper, but not both. The expensive ones are getting a little bit faster too, but not in proportion to the reduced die size.

sprash · on Nov 11, 2019

I predict there will be much more assembly programming required in the future to squeeze out as much performance as possible because the end of Moore's Law is already very apparent for several applications that cannot be easily parallelized. It is a complete myth that you "can't beat the C-compiler" as it is claimed so often. The compiler can't know many things you know about the problem at hand. So far I have been able to beat gcc on a regular basis while just relying on x86_64 (which means using no instructions/registers beyond SSE2).

Right now I'm looking for a C/Shader-like language for x86_64/Linux which allows you to be much closer to the metal without requiring you to go down to the cumbersome level of ASM syntax and at the same time shedding libc and using syscalls instead (e.g. if you have specific knowledge about the data structures you want to map you can use the heap much more efficiently than plain old malloc). So far I have found nothing.

Shorel · on Nov 11, 2019

"Right now I'm looking for a C/Shader-like language for x86_64/Linux"

You seem like the ideal candidate for writing such a language. And the book about it.

ilaksh · on Nov 11, 2019

General purpose AI is not waiting for more compute power. It's waiting for algorithms that can do it.

I'm not 100% sure deep learning can get all the way there. We might need a new approach. But the deep learning leaders are taking it very seriously and making some progress.

For example, Yann LeCun is talking about self-supervised learning of models of the world. MILA (Bengio's group) is talking about "state representation learning, or the ability to capture latent generative factors of an environment". Hinton now has capsule networks. In my opinion these types of approaches are very promising for general purpose AI.

jononor · on Nov 11, 2019

Yeah, many ML applications are bottlenecked on availability of labels, not by compute. Especially once outside very well-defined and established tasks. I think self-supervised learning has large potential, at least in expanding ML applications towards "more general".

skohan · on Nov 11, 2019

I predict that the slowing of Moore's law is going to make HPC and optimization in general a much more valuable skillset in the next few decades. In the past 15 years or so, we've been more or less happy to treat CPU cycles as a limitless resource, and as a result modern software stacks have a lot of fat in them. At the end of the era of free speed increases, trimming the fat is going to be a lot more important.

pjmlp · on Nov 11, 2019

You already see that ongoing with Java and .NET, refocusing on having AOT as part of the toolchain instead of some third party toolchain, more language features for lowlevel control and GPGPU access as well.

Or the pendulum going back to compiled languages after the scripting craziness of the last decade.

streetcat1 · on Nov 11, 2019

CPU cycles are limitless. Most of the time the CPU is waiting for the memory system.

dragontamer · on Nov 11, 2019

We're talking about 16-bit floats, or even INT8 or INT4, as the basis of neural nets these days, pushing the problem back to CPU-cycles.

CNNs in particular recycle the same set of weights over-and-over again, fitting inside of the tiny caches (or shared-memory in GPUs), allowing for the compute-portion of the hardware to really work the data.

> CPU cycles are limitless. Most of the time the CPU is waiting for the memory system.

CPUs go out-of-order, deeply pipelines, and speculative so that they have work to do even while waiting for the memory system.

The typical CPU has over 200+ instructions in flight in parallel these days. (200+ sized reorder buffers and "shadow registers" to support this hidden parallelism), and that's split between two threads for better efficiency ("Hyperthreading").

GPUs can have 8x warps / wavefronts per SM (NVidia) or CU (AMD) waiting for memory. If one warp/wavefront (a group of 32 or 64 threads) is waiting for memory, the GPU will switch to another "ready to run" warp/wavefront.

It takes some programmer effort to understand this process and write high-performance code. But it's doable with some practice.

streetcat1 · on Nov 11, 2019

So I assume that we are talking about Moore's law for CPU/GPU and not for memory.

The bottleneck for GPU today is the amount of memory on board and not compute. Especially with large size models.

In addition, the problem with deep learning, in general, is the seq nature of the alg. I.e it is parallel within the layer, but not between layers. And for multi gpu setup, again it is the communication link between the GPUs.

So I think that the nature of the current state of the art optimization alg are what matter.

dragontamer · on Nov 11, 2019

> The bottleneck for GPU today is the amount of memory on board and not compute. Especially with large size models.

Well... everything "depends on the model". Some models will be GPU-compute limited, others will be bandwidth-limited, and others will be capacity limited.

> In addition, the problem with deep learning, in general, is the seq nature of the alg. I.e it is parallel within the layer, but not between layers. And for multi gpu setup, again it is the communication link between the GPUs.

You should increase the size of the model to increase parallelism. Any problem has sequential-bits and parallel bits, so we can't stop the sequential nature of problems.

But the idea is to think NOT in terms of Amdahl's law, but instead in terms of Gustafson's law. When you have a computer that's twice-as-parallel, you need to double the work done.

You can't reasonably expect things to get faster-and-faster (Amdahl's law). Instead, you load up more-and-more work the wider-and-wider machines get.
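For anyone who wants the two laws side by side, here is a minimal sketch. The 95% parallel fraction is an arbitrary illustration, and note that the fraction means slightly different things in each law: for Amdahl it is measured on the fixed-size serial run, for Gustafson on the scaled parallel run.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup for a FIXED problem size (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def gustafson_speedup(parallel_fraction, n_processors):
    """Scaled speedup when the problem GROWS with the machine (Gustafson's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * n_processors


if __name__ == "__main__":
    p = 0.95  # hypothetical parallel fraction
    for n in (2, 16, 128, 1024):
        print(f"n={n:5d}  Amdahl={amdahl_speedup(p, n):7.2f}  "
              f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

With a fixed problem the speedup can never exceed 1 / (1 - p), here 20x, no matter how wide the machine is. If the workload grows with the machine, the scaled speedup keeps growing roughly linearly in the processor count.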

In the case of neural nets, instead of doing 128x128x5 sized kernels, you upgrade to 256x256x5 sized kernels as GPUs get thicker.

Moore's law was never about the "speed of computers", but instead about "the number of transistors". As such, it's Gustafson's law that best scales with Moore's law. We've got another 5 to 10 years left in Moore's law by the way: denser memory, denser GPU compute, 5nm-class chips (probably networked as a bunch of chiplets).

We will have more CPU / GPU power, as well as more RAM density, in the future. The question is how to organize our code so that we'll be ready 5 years from now for the next-gen of hardware.

streetcat1 · on Nov 11, 2019

Thank you for your insights. Good info.

solidasparagus · on Nov 11, 2019

> The bottleneck for GPU today is the amount of memory on board and not compute. Especially with large size models

This is somewhat true, but not as clear cut as you make it seem. Most deep learning training I've been part of is realistically bottlenecked by compute. Increased memory usage is a way to improve the compute efficiency (less time moving between GPU and CPU). I haven't worked hands-on with the recent big NLP models so I could be mistaken in that domain, but there is a lot of evidence that model parallelism and hybrid parallelism work (technically - financially is less clear) so the memory limits of a single GPU/ASIC are less important than previously.

One day I hope to see metaoptimization - for example particle swarm optimization - used to parallelize training so it is truly horizontally scalable, but right now the cost/hardware availability is a blocker.

zozbot234 · on Nov 11, 2019

True enough, but modern software stacks also have a lot of wasteful memory use.

streetcat1 · on Nov 11, 2019

Right. Hence Moore's law is irrelevant. I.e. the CPU is NOT the bottleneck.

For example, who cares if you have a 64-core, 5 GHz CPU if you're still doing I/O from an HDD?

oblio · on Nov 11, 2019

It depends where this is going to happen. Libraries, OSes, RDBMSes, etc. will have to be optimized for sure. But actual apps? There are way more app writers than library writers.

And most line-of-business apps don't need to be faster, they need to be easier to use and more stable.

fyp · on Nov 11, 2019

Since AI is heavily parallelizable, it only matters that cost(as in dollars) will keep exponentially decreasing.

It doesn't matter if you can't double the transistor density of a single cpu if you can just double the number of machines. At the end of the day you still managed to double performance for the same price.

See https://en.wikipedia.org/wiki/FLOPS#Hardware_costs (note: in another thread someone noted that this wiki is outdated/inaccurate. If anyone has the relevant expertise they should help edit it)

4NDR10D · on Nov 11, 2019

Hardware Acceleration/Parallelization is the next frontier. We've already seen the benefit of some pretty simple ASICs (TPU was built to be simple) as well as more general purpose accelerators. Hardware architects used to have a hard time, because often the best option was to simply wait for CPUs to get faster. Now that we've seen CPU power begin to stall it makes economic sense not only to invest in more parallel software but also in more application specific accelerators.

CPUs/GPUs are beasts of hardware architecture, being complex mostly due to their flexibility. We can achieve higher performance with dedicated hardware (or FPGAs), and it looks like the economic reasons to do so are slowly becoming more certain.

7thaccount · on Nov 11, 2019

Some problems don't parallelize well (some mixed integer programming problems and some matrix operations). We look at the end of Moore's law with horror.

jacobcammack · on Nov 11, 2019

I don’t agree fundamentally with this article. Correct details, but missing the forest for the trees. There are too many elements at work evolutionarily speaking to ignore the emergence of God knows what. Plus our AI (not to mention our most basic computing axioms) is absolutely juvenile, as we are brand new as a species to be developing anything at all that has to do with computing. That doubly goes for “AI”... whatever that means.

nynx · on Nov 11, 2019

The topic is quite controversial, but there is a path forward. There are several theoretical computing technologies that can get very close to the theoretical maximum allowed by physics (as well as being reversible), but we can't build them yet because nanofactories/molecular assemblers/whatever-you-call-them don't exist yet.

narrator · on Nov 11, 2019

It'd be really weird if we could keep expanding Moore's law and surpass the brain on a performance per watt ratio.

hyko · on Nov 11, 2019

shusson · on Nov 11, 2019

> The takeaway is that, even if we assume great efficiency breakthroughs and performance improvements to reduce the rate of doubling, AI progress seems to be increasingly compute-limited at a time when our collective growth in computing power is beginning to falter

A similar conclusion can be made for genomics too.

e_carra · on Nov 11, 2019

Modern hardware has a lot to improve: memories and transmission lines have been the main bottlenecks for more than a decade now, and bigger caches helped but are not enough. Solutions are being developed, like on-circuit optical fiber transmission lines, but it takes time.

vagab0nd · on Nov 11, 2019

> A couple of years ago I was talking to the CEO of an AI company who argued that AI progress was basically an S-curve, and we had already reached its top for sound processing, were nearing it for image and video, but were only halfway up the curve for text. No prize for guessing which one his company specialized in — but he seems to have been entirely correct.

This feels a little off. To me, it feels like image has made the most progress, then text, then sound, then video.

Does anyone know which company they were referring to?

CRUDite · on Nov 11, 2019

Moore's law may be dead, but it is still mind boggling to project it forwards. Seth Lloyd, in his paper [1] on the limits of computation, mentions that in 250 years computational density will equal that of a black hole (kilogram sized). [1] https://cds.cern.ch/record/396654/files/9908043.pdf

hyperpallium · on Nov 11, 2019

I am optimistic that Moore's Law will eventually recover, with a new technology, perhaps silicon-based, perhaps not. Information processing is not intrinsically limited by silicon - for example, mammalian brains are more powerful.

But that's long-term, big-picture. Technologies can remain stagnant longer than you can remain alive.

giacaglia · on Nov 11, 2019

It's funny that the article mentions that processors have not increased their speed, and therefore Moore's law is dead. In fact, as many people know, Intel has been struggling with their 10nm chip, and now with a new CEO, this might change. All other processor manufacturers are catching up and some are moving ahead of Intel. That's the whole reason Apple is trying to move away from x86 with their Macs.

_bxg1 · on Nov 11, 2019

Honestly, this is a bit of a relief. All of the AI nightmare scenarios (be they the Terminator kind, or the more realistic hyper-empowerment-of-a-few-elites kind) rely on that exponential growth continuing unimpeded. If there's no exponential growth, there's no runaway AI that swiftly outpaces human understanding.

npo9 · on Nov 11, 2019

Let me fuel your paranoia a little.

It’s likely that cloud computing will continue to see decreases in costs, which means the availability of computing will continue to rise after Moore's Law has stopped giving.

Quantum computers might play a role in increasing computational power in the next decade.

Right now a lot of AI research is low hanging fruit recently made available with advances in CPU and GPU. After the advances dry up maybe more fundamental research will produce much better results with our current technology.

sgt101 · on Nov 11, 2019

>cloud computing will continue to see decreases in costs

cloud computing costs much more than on prem; it's available on demand, which is why it's attractive to some companies. For everyone else it's an outsourcing play that unlocks capex for revenue generation.

>Quantum computers might play a role in increasing computational power in the next decade.

Nope. No chance. Best case is that QC will provide improved molecular simulations in 10-20 years. But that will be a very specialised result.

>Right now a lot of AI research is low hanging fruit recently made available with advances in CPU and GPU. After the advances dry up maybe more fundamental research will produce much better results with our current technology.

100%, but the key word is maybe. One problem I anticipate is that the current skill set of grad students and the current management mind set in academia and commercial research is not well fitted to the kind of fundamental research that includes maybe and risk. I think that a more likely turn is that there will be a focus on how to better exploit the tech for data science and ML that we have. Currently we're pretty hit and miss and also the results are rather unpredictable and brittle. When did you last find an ML project on Github with unit tests?

emteycz · on Nov 11, 2019

> cloud computing costs much more than on prem;

Yeah, and that will continue to go down, exactly because of what you stated in the next sentence.

jayd16 · on Nov 11, 2019

It's not that computers in the cloud are cheap, it's that the capacity will continue to grow and they're fully networked such that our AI overlords can take full advantage of them.

the8472 · on Nov 11, 2019

> Quantum computers might play a role in increasing computational power in the next decade.

It won't increase your FLOP/s in any way. It will mostly[0] help solving narrow problem classes, those outside P but in BQP[1]. To my understanding none of the problems encountered in NN training falls into that.

[0] ignoring polynomial speedups such as grover's algorithm [1] https://en.wikipedia.org/wiki/BQP
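To give a feel for what a "polynomial speedup" like Grover's buys you, here is a small sketch comparing query counts for unstructured search with a single marked item. It uses the standard estimate of roughly (pi/4) * sqrt(N) Grover iterations; the function names are hypothetical, and this only counts oracle queries, not wall-clock time or error correction overhead.

```python
import math


def classical_expected_queries(n_items):
    """Expected lookups to find one marked item by checking in random order."""
    return (n_items + 1) / 2


def grover_iterations(n_items):
    """Approximate Grover iterations for one marked item: floor(pi/4 * sqrt(N))."""
    return math.floor(math.pi / 4 * math.sqrt(n_items))


if __name__ == "__main__":
    for bits in (10, 20, 30):
        n = 2 ** bits
        print(f"N=2^{bits}: classical ~{classical_expected_queries(n):.0f}, "
              f"Grover ~{grover_iterations(n)}")
```

The gap is quadratic, not exponential, which is why it is usually set aside when talking about problems outside P but in BQP.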

martinpw · on Nov 11, 2019

> It’s likely that cloud computing will continue to see decreases in costs, which means the availability of computing will continue to rise after Moore's Law has stopped giving.

Why would cloud computing continue to see significant decreases in costs once Moore's law stops? Clearly there will be some ongoing decline in infrastructure costs, but if your main requirement is for lots of compute (for AI) then it seems hard to maintain an exponential rate of improvement overall if the silicon is no longer getting faster and cheaper, and the exponential is what you need, else your costs get out of control quickly.

npo9 · on Nov 11, 2019

I was speaking about linear decreases. Cheaper power, more refined manufacturing processes, and economy of scale benefits are where I expect some of the linear benefits to come from. We can also expect CPU manufacturers to compete for a while, getting increasingly marginal improvements.

Linear improvements will get us there in the end.

dogma1138 · on Nov 11, 2019

Why would power become cheaper? Also manufacturing refinement? Sure, it happens, but not to any meaningful degree these days; it’s more about bringing a new process to the same cost/efficiency as your older manufacturing processes.

As far as economy of scales goes it’s also not a given.

Large cloud providers are much more adept at reusing old hardware for their PaaS and weaker instances.

I once had a chance to talk to a few reps from Dell, and while cloud providers buy a lot of servers, they also upgrade them much less often than large enterprises that run their own DC. This is because Amazon can always continue to make money on 3-5 year old hardware by shifting it off to cheaper tiers or by running things like SQS or Lambdas on it in the background.

On the other hand, if you have a 3-5 year old server and you need more horsepower, then if you manage your own DC or rack space the hosting costs make it uneconomical to keep underperforming hardware.

Not to mention other costs such as support are quite different, if an Amazon server dies they don’t care. Traditional self hosted environments will require extended support because they can’t have the same redundancy as Amazon.

If you delegate a 5 year old server to some back office applications while you use your new ones for client facing apps the old server is still a mission critical box even if it doesn’t directly run revenue generating services.

Basically Amazon buys to grow, on-prem orgs buy to both grow and upgrade.

If you have an organization the size of Amazon that is completely self hosted they will end up buying much more hardware over a decade than Amazon.

martinpw · on Nov 11, 2019

> Linear improvements will get us there in the end.

No, they won't. That's the point. To quote directly from the article:

we’re talking about exponential rates of growth here, linear expense adjustments won’t move the needle.
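To put rough numbers on the exponential-versus-linear point, here is a toy model. Everything in it is an assumption for illustration: compute demand doubles every `doubling_months`, while the price per unit of compute drops by a fixed `linear_cut_per_year` of the starting price each year. Neither figure comes from the article.

```python
def relative_spend(years, doubling_months=12.0, linear_cut_per_year=0.10):
    """Total spend relative to year 0 under the toy assumptions above."""
    compute_needed = 2.0 ** (years * 12.0 / doubling_months)
    # Linear price decline, floored at zero so the price never goes negative.
    unit_cost = max(1.0 - linear_cut_per_year * years, 0.0)
    return compute_needed * unit_cost


if __name__ == "__main__":
    for year in range(0, 9, 2):
        print(f"year {year}: spend x{relative_spend(year):.1f}")
```

In this toy setup, halving the unit price over five years still leaves the bill many times larger than where it started, because the demand side has grown 32x in the same period.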

_bxg1 · on Nov 11, 2019

Thanks :P

xyproto · on Nov 11, 2019

Yet AI beat humans at every strategic game out there, from Chess and Go to StarCraft2. We don't have general AI, but in some fields, human understanding has already been outpaced.

tsimionescu · on Nov 11, 2019

Specialized fields like games aren't necessarily a problem - computers have been beating us at arithmetic for decades, and it isn't exactly world shattering.

Also, a minor nitpick: AI has not succeeded at beating the best players at SC2, though it did beat the vast majority of the ladder.

ben_w · on Nov 11, 2019

> Specialized fields like games aren't necessarily a problem - computers have been beating us at arithmetic for decades, and it isn't exactly world shattering.

I would argue that it is, because that’s the driver for industrial automation.

pmoriarty · on Nov 11, 2019

Those three games are very far from "every strategic game out there", and even they can be trivially made much more difficult for AIs by growing their board size, for example, making more types of pieces with different moves, or making other changes to their rules. There are hundreds of chess variants alone, for instance.

FartyMcFarter · on Nov 11, 2019

> growing their board size, for example, making more types of pieces with different moves, or making other changes to their rules

This would also make the games more difficult for humans, and it would erase much of the accumulated knowledge about strategy.

_bxg1 · on Nov 11, 2019

Each of those has fixed rules, and each took lots of effort from human researchers to hand-craft and hand-train the models. Not that it isn't impressive, but it's still fundamentally a different thing than general intelligence. I don't think it's just a matter of doing more of the same.

ivalm · on Nov 11, 2019

In complete information games like chess and go the AI does reign supreme. In SC2, the AI plays highly abusable strategies that would not win with proper prep vs top players. Same thing with Dota bots by OpenAI.

DennisP · on Nov 11, 2019

It's not just complete information games. AI recently overcame the best humans in six-player no-limit Texas hold'em.

https://www.newscientist.com/article/2209631-ai-beats-profes...

networkimprov · on Nov 11, 2019

The heck with all that. Show me the software that beats Roger, Rafa, and Nole at tennis!

pmoriarty · on Nov 11, 2019

I'm not sure if you'll have that long to wait, considering the kind of success Boston Dynamics has been having.

I would be surprised if by the end of the century robots couldn't beat humans at every physical sport in existence.

It's likely that robots and robotically-enhanced humans will both be banned from competing directly with humans in sports, because it'll be no contest -- in the robots' favor.

p1esk · on Nov 11, 2019

Ping pong is almost solved: https://www.youtube.com/watch?v=kZzL2rDNSJk Give it a year or two and it will probably beat the world champion. Tennis is more challenging, but still, the surface is flat, no obstacles, so a platform with wheels carrying an arm with a racket could be fast and stable.

I'm not even sure what would be the hardest part to make a champion tennis robot.

AnimalMuppet · on Nov 11, 2019

The hardest part might be to create one that can store enough energy to last through a five-hour match.

pmoriarty · on Nov 11, 2019

Why couldn't it just change batteries as needed, or just be wired up to power for that matter?

I'd expect something like wrestling would be much more challenging than non-contact sports, assuming, of course, that the robot's strength and torque was limited to human levels.

Wrestling is actually a pretty interesting case, now that I think about it for a bit, as robots wouldn't need to be limited to human-like bodies. In principle, they could have, say, a starfish-like or octopus-like body, where its tentacles would allow it to grip its human opponent in ways that may be impossible to escape, even were the robot limited to human strength.

So I suspect that in sports like wrestling (and maybe all other sports), robots that are allowed to compete would likely be limited to human-like forms, as other forms could give them too great an advantage.

AnimalMuppet · on Nov 11, 2019

> Why couldn't it just change batteries as needed, or just be wired up to power for that matter?

Because the human it's playing can't. In a sport where endurance is a significant component, wiring one competitor up to power, while the other has to use only the fuel that he/she has stored, doesn't seem like an equitable competition.

p1esk · on Nov 11, 2019

I agree that it must be battery powered, but keep in mind that human players eat snacks during long matches.

pmoriarty · on Nov 11, 2019

Humans make the rules, and it's doubtless that some human's sense of fairness will determine what rules they'd compete under, rather than simply letting the robot compete using its full capability, which we can already see will be superior in many ways.

AnimalMuppet · on Nov 11, 2019

Well, a human with a finite energy capacity against a robot with an infinite energy capacity (because it has a feed) is clearly an unequal match on that basis.

Similarly, a human player against a robot that spans the entire width of the court, and that has 48 rackets, is clearly not an equal match on that basis.

In the same way, we don't race Formula 1 cars against NASCAR.

Andrew_nenakhov · on Nov 11, 2019

Are you suggesting to ban human players from drinking water during the match?

AnimalMuppet · on Nov 11, 2019

No, I wasn't. I don't think that a drink of water is the same as a battery swap.

Andrew_nenakhov · on Nov 12, 2019

Why not the same? If you impose the restriction on competitors to use only the fuel that he/she has stored, it would be very unfair to allow one player to break this rule and refill himself.

AnimalMuppet · on Nov 12, 2019

Water isn't an energy source, though. Letting the robot top off on, say, hydraulic fluid or coolant or whatever, might be the same. Even battery water.

Andrew_nenakhov · on Nov 12, 2019

But tennis players' drinks are definitely an energy source: https://www.quora.com/What-kind-of-sports-drink-do-professio...

AnimalMuppet · on Nov 12, 2019

True. On the other hand, five comments up-thread, you specifically said "water".

Andrew_nenakhov · on Nov 12, 2019

Yes, for brevity's sake. Anyway, banning "fuels" but allowing other types of refills is a slippery slope. Can a machine have maintenance if it breaks? If no, can a human player be treated by a doctor if he has a muscle spasm?

orasis · on Nov 11, 2019

One factor: the computational cost of training any extremely expensive model is amortized across all uses of that model.
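As a quick sketch of the amortization arithmetic (all numbers and names here are made up for illustration):

```python
def cost_per_use(training_cost, inference_cost, uses):
    """Average cost of one use: the one-off training cost is spread over all uses."""
    if uses <= 0:
        raise ValueError("uses must be positive")
    return training_cost / uses + inference_cost


if __name__ == "__main__":
    training = 1_000_000.0  # hypothetical one-off training bill
    inference = 0.01        # hypothetical cost of serving one request
    for uses in (1_000, 1_000_000, 1_000_000_000):
        print(f"{uses:>13,} uses -> {cost_per_use(training, inference, uses):.4f} per use")
```

With enough uses the per-use cost is dominated by inference rather than training, which is the sense in which an expensive training run can still pay off.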

sdoken123 · on Nov 11, 2019

Could it be that with Quantum Computing computational power will continue to increase exponentially?

yarg · on Nov 11, 2019

For some problems, perhaps - but it could also be the case that maintaining entanglement becomes exponentially harder with an increasing number of qubits.

npo9 · on Nov 11, 2019

Is Quantum Computing computational power increasing exponentially now?

buzzkillington · on Nov 11, 2019

Yes, we have gone from being able to factor the number 1 to being able to factor the number 2 in the last 20 years.

yarg · on Nov 11, 2019

Yes - and on two fronts. The number of qubits that can be entangled is currently going up at an exponential rate, and with each additional qubit the power of the machine (in a certain sense) doubles.
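One concrete reading of "in a certain sense" is the cost of simulating the machine classically: a full state vector for n qubits has 2^n complex amplitudes, so the memory needed doubles with every qubit. A minimal sketch, assuming 16 bytes per amplitude (two 64-bit floats); this says nothing about how much useful speedup the extra qubit buys.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector of n qubits (2**n complex amplitudes)."""
    return (2 ** n_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    for n in (20, 30, 40, 50):
        gib = statevector_bytes(n) / 2 ** 30
        print(f"{n} qubits: {gib:,.3f} GiB")
```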

hyperpallium · on Nov 11, 2019

Why can't we just have bigger dies? Same density, more transistors. If yields are too low, connect smaller dies somehow.

clarry · on Nov 11, 2019

If you stop scaling node size down and start scaling die size up, the result will be higher prices and power consumption. If you don't mind, go buy a Threadripper or EPYC. Might consider getting multiple sockets while at it. Caveat: won't fit in your pocket.

I'm sure there'll be additional interconnect issues to worry about as the physical cluster of dies gets larger and larger.
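The price side of this can be sketched with the simple Poisson yield model, where the chance that a die has no defects is exp(-area * defect_density). The defect density below is a hypothetical number, and the model ignores wafer edge loss, packaging, and interconnect cost, so treat it as an illustration only.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of dies with zero defects under the Poisson yield model."""
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_die(area_cm2, defects_per_cm2):
    """Silicon area consumed per working die (area divided by yield)."""
    return area_cm2 / poisson_yield(area_cm2, defects_per_cm2)


if __name__ == "__main__":
    d0 = 0.1  # hypothetical defects per cm^2
    monolithic = silicon_per_good_die(8.0, d0)      # one 8 cm^2 die
    chiplets = 8 * silicon_per_good_die(1.0, d0)    # eight 1 cm^2 dies
    print(f"monolithic: {monolithic:.2f} cm^2 of silicon per working chip")
    print(f"8 chiplets: {chiplets:.2f} cm^2 of silicon per working set")
```

Under this model a big die wastes more silicon per working part than the same area split into small dies, which is the argument for connecting smaller dies, at the price of the interconnect issues mentioned above.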

boyadjian · on Nov 11, 2019

Maybe it is a good thing, so the singularity will be avoided. There are so many wrong uses of AI.

mikorym · on Nov 11, 2019

Mathematician here, I don't think the singularity as popularised through people like Ray Kurzweil would happen.

In particular, their metaphor of an "infinity" point seems to me like the incorrect application of mathematical ideas to social contexts.

I think the question of artificial general intelligence is a different question than a singularity, much like how the Chinese room addresses a different question than the Turing test.

emteycz · on Nov 11, 2019

Why is it a different question? AFAIK the assumption is that if the computer can improve itself (which it can if it has IQ 100), there is nothing holding the singularity back.

TheOtherHobbes · on Nov 11, 2019

Which is dead wrong, because not only is there no usable definition of "improve itself", but there isn't even any understanding of the kinds of skills required to create a usable definition.

It's the difference between a computer that is taught how to compose okay-ish music, and a computer that learns spontaneously how to compose really really great music and do all of the social, cultural, and financial things required to create a career for itself as a notable composer and then does something entirely new and surprising given that starting point.

They're completely different problem classes, operating on completely different levels of sophistication and insight.

A lot of "real" AI problems are cultural, social, psychological, and semantic, and are going to need entirely new forms of meta-computation.

You're not going to get there with any current form of ML, no matter how fast it runs, because no current form of ML can represent the problems that need to be solved to operate in those domains - never mind spontaneously generate effective solutions for those problems.

emteycz · on Nov 11, 2019

> Which is dead wrong, because not only is there no usable definition of "improve itself", but there isn't even any understanding of the kinds of skills required to create a usable definition.

I disagree. A program improves itself when it reacts to a problem and implements a solution. Obviously that is very general, but enough. A human of IQ 100 certainly can develop software; a program of IQ 100 should be able to do the same, and then you scale horizontally.

thfuran · on Nov 11, 2019

Have you taken an IQ test? They only test a few classes of problem. Performance on these for a human is deemed a workable proxy for intelligence but, for something that approaches the problems very differently, it may not be at all indicative of general intelligence. I think we probably already have reached or are near the point where we could train systems to achieve human-level performance on each of those basic tasks. We are, however, not seemingly near a humanlike AGI.

TheOtherHobbes · on Nov 11, 2019

I don't mean to be rude, but I'm impressed that you seem to have completely skated over the content of my comment.

Please read it again. You need to understand what a "problem" and a "solution" both are - in detail - because otherwise you have nothing to work with.

And a human of IQ 100 will only ever develop poor software. If you scale horizontally, you won't get game-changers - you'll just get a flood of equally poor software more quickly.

ThrowawayR2 · on Nov 11, 2019

Many here have an IQ well over 100 and have no idea how to rebuild an improved version of themselves or even how their consciousness works internally. Seems rather likely any sapient AI we cook up will have just as little idea how itself works as we do how we work. QED, no singularity.

AnimalMuppet · on Nov 11, 2019

A human with IQ 100 can develop software. Can they develop it well enough to improve AI software? Or can they just adequately develop general software?

Can a human with IQ 100 write general AI software that a human with IQ 100 can debug?

mikorym · on Nov 12, 2019

There are different ways of explaining this (like conversely talking about what computers and computation are) but my personal preference is to explain it in terms of time.

We don't understand physics well enough to know whether time (and space [1]) can be arbitrarily halved over and over.

For an actual "infinity point" you would need such halving of time of computation up to a point where you actually have infinitely small time increments. The way you would define this would be via a limit and again the main premise is arbitrary halving, something that we have not seen in nature and which is a mathematical tool, rather than a hard physical fact.

Since you have physical computers, the composition of such a computer would have to lead to computations in these infinitesimal time brackets and in turn, for example, may necessitate infinitesimal matter. The way I understand the Planck length, at least as a concept, is that after some really small length of space then essentially you don't get anything smaller.

In any case, talking about a "singularity point" is an activity done by social personalities who like to sound clever. But they are not saying anything remotely clear.

As I mentioned, the other argument that one can make is to challenge the ability to indefinitely "improve yourself" as a machine, another dubious claim.

[1] https://en.wikipedia.org/wiki/Planck_length

Edit: I am not a physicist, so there could be better ways to think of Planck length and discuss its validity.

AnimalMuppet · on Nov 11, 2019

See my comment below about whether IQ 100 is enough. But also, there still can be plenty holding the singularity back.

Remember Zeno's paradox. "Improving" could mean "incrementally approaching a limit". It doesn't necessarily mean "marching off to infinity".
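A toy illustration of "improving forever without marching off to infinity": if each round of self-improvement yields a fixed fraction of the previous round's gain, the total is a geometric series with a finite limit. The starting level, first gain, and ratio below are arbitrary assumptions, not a model of any real system.

```python
def capability_after(steps, start=1.0, first_gain=1.0, ratio=0.5):
    """Capability after `steps` improvements, each gain being `ratio` times the last."""
    total = start
    gain = first_gain
    for _ in range(steps):
        total += gain
        gain *= ratio
    return total


def capability_limit(start=1.0, first_gain=1.0, ratio=0.5):
    """Limit of the geometric series, valid for 0 <= ratio < 1."""
    return start + first_gain / (1.0 - ratio)


if __name__ == "__main__":
    for steps in (1, 5, 10, 50):
        print(f"after {steps:2d} steps: {capability_after(steps):.6f} "
              f"(limit {capability_limit():.1f})")
```

Every step is a strict improvement, yet with these parameters the system never gets past 3x its starting point.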

Then, there's the matter of hardware. An infinitely smart AI running on finite hardware seems likely to be a contradiction. Is the AI going to be able to not just improve its code, but improve its hardware? Without depending on slow, ignorant humans, who might even be having second thoughts about the wisdom of giving this AI even more hardware? That takes considerably more than IQ 100; that takes robots carrying out the AI's wishes, which some factory needs to make. And an IQ of 100 probably isn't going to cut it to hack all the factory control systems, to take them over and have them start producing what the AI wants.

jeremydeanlakey · on Nov 11, 2019

I wonder does the compute power per dollar have to level off so soon?

jobseeker990 · on Nov 11, 2019

We could make AI more efficient by switching to analog. Otherwise we're encoding the real world into digital and then back into analog again.