A great quote from McCarthy, also regularly referenced by Sutton, is "Intelligence is the computational part of the ability to achieve goals", which (IMO) helps picture the tight link between compute growth and AI.
It's only a few minutes' read, I highly recommend it:
Fake news. It has far more to do with the rise of distributed computing than the existence of GPUs.
GPUs are the engines that made CNNs (in particular) tractable, and opened up a bunch of applications for many companies, and opened up a reasonable route to results for a generation of researchers.
This isn’t to say that GPUs aren’t nice. They do save time, or, for the same amount of time, let you produce more polished results, which means that in a competitive environment everyone would use them.
Even today distributed training is relatively uncommon while pretty much everyone uses NVIDIA GPUs.
Thanks for the link, this is a pretty strong rebuttal.
> Massive data sets are not at all what humans need to learn things so something is missing. Today’s data sets can have billions of examples, where a human may only require a handful to learn the same thing.
I've come across this idea before, the notion that we should be able to create an AI that can learn something from just a handful of examples because humans can do that. This assumes some rough level of equality in lack of related knowledge / capability between a tabula rasa AI system and a human adult. But a human adult has been awake and continuously learning & testing conscious and unconscious predictions about everything they interact with for at least 18 years (1 year = 31,536,000 seconds; × 2/3 to account for sleep = 21,024,000 seconds awake per year; × 18 years = 378,432,000 seconds). In other words, a human adult is already a highly trained system. And even very young human children have been awake and learning for quite a long time compared to anything we might imagine to be a tabula rasa system.
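As a sanity check, the same arithmetic in code:

    seconds_per_year = 365 * 24 * 60 * 60        # 31,536,000
    awake_per_year = seconds_per_year * 2 // 3   # 21,024,000 (two thirds awake)
    print(awake_per_year * 18)                   # 378,432,000 waking seconds by age 18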
So I don't think that the human capacity for learning supports the idea that a tabula rasa AI system should be able to learn any particular thing with just a handful of examples.
I don't know if Brooks is talking about learning everything from scratch here.
I for one don't think that humans come into the world as tabulae rasae. I think instead that we come endowed with a rich knowledge of the world already, as well as strong inductive biases that allow us to learn new concepts from a combination of existing knowledge and new observations, with our characteristically high sample efficiency that remains unmatched by neural networks.
For instance, one theory about our ability to learn human language as infants is that we are born with a language endowment, Chomsky's "universal grammar", that predisposes us to pick out and learn human language, without having to figure it out from scratch.
I wonder then if what Brooks is saying is missing from neural nets is the ability to use background knowledge and strong inductive biases as effectively and efficiently as we do, so that they don't need hundreds of thousands of examples before they can learn a new concept.
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
The 'end of Moore's law' (as a measure of performance, not density) is probably the most significant "Tech" story of the next decade. Why? Because it is going to demand engineers who can write fast code over engineers who write code fast.
A lot of frameworks and abstractions will be ripped out and thrown away as 'taking too much time for not enough benefit.'
I don't think we'll be throwing away the old abstractions, but adding new ones in beside them.
As an example, I made a demo a couple years back which processed and rendered about 64 MB of CNC G Code (a programming language for telling robots how to move and spin) in a web browser, to match CNC toolpaths (the path of the robot's motion) to their realtime run statistics (show me a realtime 3d map of where we can speed up the robot). To do this, you need to simulate the movements of the machine to build a 3d model, and then pair the model with collected data. You then need to cull/average data which is too fine to be visible, and ship this processed data to the browser as arrays of floating point numbers, where it gets rendered as, effectively, thousands of colored 3d arcs.
The naive implementation took minutes to run. The final version took about 1 second. I didn't have to throw away any abstractions or frameworks to make things faster. I had to use my language's new abstractions for massively parallel processing and stack allocated data to speed up server side processing, and then use libraries for decoding base64 binary into floating point arrays on the browser side to speed up data transfer and parsing.
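The server side of that data path looks roughly like this (a minimal sketch, not the demo's actual code; the flat float32 layout and the pack_arcs name are illustrative assumptions):

    import base64
    import struct

    def pack_arcs(points):
        # points: a flat list of floats (x, y, z, x, y, z, ...).
        # Pack as little-endian float32, then base64 for transport; the
        # browser undoes this with atob() and a Float32Array view,
        # skipping per-number JSON parsing entirely.
        raw = struct.pack(f"<{len(points)}f", *points)
        return base64.b64encode(raw).decode("ascii")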
I guess my point is that the end of Moore's law doesn't mean throwing old things away, it means using existing solutions when appropriate, and approaching new problems in different ways.
Or just making slower things more orthogonally parallelized and throwing more hardware units at the problem (as opposed to more powerful hardware or faster code).
Most code by engineers "who write code fast" is not in performance critical domains...
On the plus side, it turns out recovering most of that isn't that hard with a bit of careful language selection and just a bit of care in writing code. I think this is one of the somewhat subtle reasons for Go's success: it trims away most of the slow things that a lot of "scripting" languages do without throwing away too much of the power, and I anticipate the continued entry and success of other languages like Nim in this space in the next few years. To get to screaming fast may take a lot of effort, but "pretty fast, let's parallelize what we have instead of hyperoptimizing it" isn't too hard.
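The low-effort version of "parallelize what we have" is often just a worker pool; a minimal sketch (process_item is a stand-in for whatever per-item work you already have):

    from multiprocessing import Pool

    def process_item(item):
        # Stand-in for the existing, unoptimized per-item work.
        return item * item

    if __name__ == "__main__":
        with Pool() as pool:  # defaults to one worker per CPU core
            results = pool.map(process_item, range(1_000_000))
            print(sum(results))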
Because of all of these multiplicative slowdowns, even at very small scales it's still generally a better idea to start by optimizing and removing some of your multiplicative slowdowns before you just throw more hardware at it. Compute is merely cheap, not free. It's the rare bit of code that nobody's spent any time optimizing, but is also free of these multiplicative slowdown problems.
Why would they need to? They can sell improved cameras and other additions instead of improved CPUs...
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
The end of Moore's law is the end of passing up opportunities in that critical 3%. There's a lot of performance debt accumulated.
It's only 3%. Optimize when it matters!
There doesn't seem to be anything on the horizon that would make performance a more pressing concern in the work of the 99% of developers who make business software and consumer web/mobile apps. They just sometimes call into libraries that have had performance work put into them and get by with a rudimentary understanding of what seems to cause slowdowns.
In the other dimension, there does seem to be a continued focus and resulting improvements in velocity.
The recent ML popularity is all about leveraging packaged optimized code (TensorFlow etc.), with the vast majority of programmers freed from the engine-level details and any detailed understanding of GPU perf work or parallel/distributed programming.
I surely hope so, the waste mentality has to end.
Consoles were always more likely to see speed optimizations, but this has slipped a bit over the last two decades.
My prediction is that, in a few years, lots of programming languages and software developers will simply become obsolete, because what the other guys can produce will have heaps of performance, security, and manageability on top.
But I might be wrong, so make your career decisions carefully.
However, the Intel Core Duo was released in January 2006 and it had 2 cores. So in a bit more than 13 years we only got 3x as many cores. The Core 2 Quad was released in January 2007, so if you count from that one, it's a bit more than 12 years for 1.5x as many cores.
I wouldn't bet on widespread 64 core CPUs for consumers in the next 10 years or so.
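A quick extrapolation from those numbers (a sketch, not a forecast):

    growth = (6 / 2) ** (1 / 13)          # 2 cores (2006) -> 6 cores (2019): ~8.8%/yr
    cores_in_10_years = 6 * growth ** 10  # ~14 cores
    print(growth, cores_in_10_years)      # nowhere near 64 at this rate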
Most commercial software just doesn't use it efficiently and consumers don't need it for existing software.
We need a software breakthrough, a killer app, something that people really want and that really needs a ton of cores.
For now, I don't see it. But then again, that's how killer apps are, nobody sees them until they're there.
If you are going to extrapolate, don't cherry-pick the dates. Take all the numbers you have and trace the trend through them. If you get to pick the data points, you can extrapolate to any conclusion you want.
> I wouldn't bet on widespread 64 core CPUs for consumers in the next 10 years or so.
We already have 32-core processors for servers. IMO the biggest challenge will be heat management in laptops, rather than putting a bunch of cores together.
> We need a software breakthrough, a killer app, something that people really want and that really needs a ton of cores.
We need that now. Most software I'm using is slow. Part of why I picked the iPhone (and I bet many people did) was speed. It is fast. People care about speed, and if you introduce them to something faster they'll get hooked. Also, you are already using all of your cores because of multitasking, which makes your computer experience better overall. At 6 cores, multi-threading a single application doesn't give much of a noticeable improvement, since you probably have 5-6 other things running on your computer, doing stuff like networking or playing audio.
> For now, I don't see it. But then again, that's how killer apps are, nobody sees them until they're there.
> I hope you're right.
Don't worry, just hang in there. The start of multi-core was slow, but now that's all there is. It seems Intel/AMD have mastered how to add more cores, but CPU frequency is pretty much capped going forward!
In general, the real issue is the layers of abstractions between the hardware and the end user. The over complicated architectures. The constant re-invention of databases/operating systems/virtual machines at each layer.
I wish that were the case but operating systems and systems software in general seems to be the most stagnant field. Everyone just buys the same FLOSS stack for $0 and compatibility is king, so there's very little research going in this field and much less of it ends up in any product you're ever going to see in use.
I would love to see a vibrant scene of competing operating systems and databases with fresh new ideas and research.
Those are because of the human struggle, the political and economic fights. They're not because of purely technical problems.
Each layer is usually backed by a party with a very clear interest, usually control, and each layer on top tries to escape that control. It's been that way since the times of p-code.
There will always be problems where you need faster responses, and the solutions to those problems are already not using Ruby. You don't use a hammer when a screwdriver is appropriate, but nails work pretty well in most places and screws are unnecessary.
> In this test, the framework's ORM is used to fetch all rows from a database table containing an unknown number of Unix fortune cookie messages (the table has 12 rows, but the code cannot have foreknowledge of the table's size). An additional fortune cookie message is inserted into the list at runtime and then the list is sorted by the message text. Finally, the list is delivered to the client using a server-side HTML template. The message text must be considered untrusted and properly escaped and the UTF-8 fortune messages must be rendered properly.
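Roughly, every framework has to implement the equivalent of this (a minimal sketch with Flask, sqlite3, and markupsafe as stand-ins; the real benchmark goes through each framework's ORM):

    import sqlite3
    from flask import Flask
    from markupsafe import escape  # the message text must be treated as untrusted

    app = Flask(__name__)

    @app.route("/fortunes")
    def fortunes():
        # No LIMIT clause: the code may not assume it knows the table's size.
        rows = sqlite3.connect("fortunes.db").execute(
            "SELECT id, message FROM fortune").fetchall()
        rows.append((0, "Additional fortune added at request time."))
        rows.sort(key=lambda r: r[1])  # sort by message text
        body = "".join(f"<tr><td>{r[0]}</td><td>{escape(r[1])}</td></tr>"
                       for r in rows)
        return f"<table>{body}</table>"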
As always, it depends. Benchmarks are cool, and yes we know that Rust and Go will always be faster than Ruby out of the box. Once the number of responses goes up, you throw caching at the problem. But the benchmark says
> The requirements were clarified to specify caching or memoization of the output from JSON serialization is not permitted.
> the table has 12 rows
Which is low and should be quick in any case. But whatever your tech stack, if you are creating a "modern" UI you handle this update asynchronously with JS anyway, and if you're not, do you really care about sub-100 ms responses?
If you want performance out of the box, yes you will use Go, but like I said earlier, you use a hammer for nails and a screwdriver for screws. If you want performance and type safety, you choose one, if you want faster prototyping and a more mature web dev ecosystem, you choose another.
You know we've reached the end of Moore's law when balanced ternary computers become the norm.
Luckily the laws of physics have precise limits for just how efficient computation can be, Landauer's principle, that are just as binding as the laws for how efficient a heat engine can be. Progress in engine efficiency did indeed form a converging sigmoid on what's allowed by Carnot. But we're still quite a distance away from Landauer's limit, and if transistors can't get us there we have every reason to believe that other computational substrates are possible and can get us there.
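For a sense of the distance (back-of-the-envelope; the per-bit figure for current silicon is an assumed round number):

    import math

    k = 1.380649e-23                 # Boltzmann constant, J/K
    T = 300.0                        # room temperature, K
    landauer = k * T * math.log(2)   # minimum energy to erase one bit: ~2.9e-21 J

    current = 1e-15                  # assume ~1 fJ per bit-level switch today
    print(f"Landauer limit: {landauer:.2e} J/bit")
    print(f"Headroom left:  ~{current / landauer:,.0f}x")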
In the meantime we might be looking at an interregnum. But while AI is gobbling up computational cycles because they're available there's no reason to think that efficiency gains aren't possible - just look at the orders of magnitude improvement in training resources from AlphaGo to AlphaZero. I don't see that this should stop progress, though it might slow it.
AFAIK nobody currently knows how to make FinFETs much smaller without making them awful.
Or better, as old processes get improved, transistors are getting either more compactly packed or cheaper, but not both. The expensive ones are getting a little bit faster too, but not in proportion to the reduced die size.
Right now I'm looking for a C/shader-like language for x86_64/Linux which lets you get much closer to the metal without requiring you to go down to the cumbersome level of ASM syntax, while at the same time shedding libc and using syscalls instead (e.g. if you have specific knowledge about the data structures you want to map, you can use the heap much more efficiently than plain old malloc). So far I've found nothing.
You seem like the ideal candidate for writing such a language. And the book about it.
I'm not 100% sure deep learning can get all the way there. We might need a new approach. But the deep learning leaders are taking it very seriously and making some progress.
For example, Yann LeCun is talking about self-supervised, learning models of the world. MILA (Bengio's group) is talking about "state representation learning, or the ability to capture latent generative factors of an environment". Hinton now has capsule networks. In my opinion these types of approaches are very promising for general purpose AI.
Or the pendulum going back to compiled languages after the scripting craziness of the last decade.
CNNs in particular recycle the same set of weights over and over again, fitting inside the tiny caches (or shared memory on GPUs), allowing the compute portion of the hardware to really work the data.
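Concretely, with an assumed layer shape:

    weights = 3 * 3 * 64 * 64   # a 3x3 conv, 64 channels in/out: 36,864 weights
    kib = weights * 4 / 1024    # 144 KiB as float32: cache/shared-memory sized
    reuses = 224 * 224          # applied at every position of a 224x224 feature map
    print(kib, reuses)          # a tiny weight set, reused ~50,000 times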
> CPU cycles are limitless. Most of the time the CPU is waiting for the memory system.
CPUs go out-of-order, deeply pipelined, and speculative so that they have work to do even while waiting for the memory system.
The typical CPU has 200+ instructions in flight these days (200+ sized reorder buffers and "shadow registers" to support this hidden parallelism), and that's split between two threads for better efficiency ("Hyperthreading").
GPUs can have 8x warps / wavefronts per SM (NVidia) or CU (AMD) waiting for memory. If one warp/wavefront (a group of 32 or 64 threads) is waiting for memory, the GPU will switch to another "ready to run" warp/wavefront.
It takes some programmer effort to understand this process and write high-performance code. But its doable with some practice.
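The arithmetic behind that juggling is roughly this (all numbers are illustrative assumptions, not specs of any particular GPU):

    mem_latency_cycles = 400    # assumed DRAM latency seen by a stalled warp
    work_cycles_per_warp = 50   # assumed independent ALU work queued per warp

    # Resident warps needed so the scheduler always has a "ready" one to swap in:
    warps_needed = mem_latency_cycles / work_cycles_per_warp
    print(f"~{warps_needed:.0f} resident warps per SM hide the memory latency")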
The bottleneck for GPUs today is the amount of memory on board, not compute. Especially with large models.
In addition, the problem with deep learning in general is the sequential nature of the algorithm: it is parallel within a layer, but not between layers. And for multi-GPU setups, again, the bottleneck is the communication link between the GPUs.
So I think that the nature of the current state-of-the-art optimization algorithms is what matters.
Well... everything "depends on the model". Some models will be GPU-compute limited, others will be bandwidth-limited, and others will be capacity limited.
> In addition, the problem with deep learning in general is the sequential nature of the algorithm: it is parallel within a layer, but not between layers. And for multi-GPU setups, again, the bottleneck is the communication link between the GPUs.
You should increase the size of the model to increase parallelism. Any problem has sequential bits and parallel bits, so we can't escape the sequential nature of problems.
But the idea is to think NOT in terms of Amdahl's law, but instead in terms of Gustafson's law. When you have a computer that's twice as parallel, you need to double the work done.
You can't reasonably expect things to get faster and faster (Amdahl's law). Instead, you load up more and more work as the machines get wider and wider.
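The difference between the two laws in code (p is the parallel fraction, n the processor count):

    def amdahl_speedup(p, n):
        # Fixed problem size: the serial fraction (1 - p) caps the speedup.
        return 1.0 / ((1.0 - p) + p / n)

    def gustafson_speedup(p, n):
        # Scaled problem size: grow the parallel work with the machine.
        return (1.0 - p) + p * n

    p, n = 0.95, 64
    print(f"Amdahl:    {amdahl_speedup(p, n):.1f}x")    # ~15.4x, hits a wall
    print(f"Gustafson: {gustafson_speedup(p, n):.1f}x") # ~60.9x, keeps scaling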
In the case of neural nets, instead of doing 128x128x5 sized kernels, you upgrade to 256x256x5 sized kernels as GPUs get thicker.
Moore's law was never about the "speed of computers", but instead about "the number of transistors". As such, it's Gustafson's law that best scales with Moore's law. We've got another 5 to 10 years left in Moore's law by the way: denser memory, denser GPU compute, 5nm-class chips (probably networked as a bunch of chiplets).
We will have more CPU / GPU power, as well as more RAM density, in the future. The question is how to organize our code so that we'll be ready 5 years from now for the next-gen of hardware.
This is somewhat true, but not as clear cut as you make it seem. Most deep learning training I've been part of is realistically bottlenecked by compute. Increased memory usage is a way to improve compute efficiency (less time moving between GPU and CPU). I haven't worked hands-on with the recent big NLP models so I could be mistaken in that domain, but there is a lot of evidence that model parallelism and hybrid parallelism work (technically; financially is less clear), so the memory limits of a single GPU/ASIC are less important than previously.
One day I hope to see metaoptimization - for example particle swarm optimization - used to parallelize training so it is truly horizontally scalable, but right now the cost/hardware availability is a blocker.
For example, who cares if you have a 64-core, 5 GHz CPU if you're still doing I/O from an HDD.
And most line-of-business apps don't need to be faster, they need to be easier to use and more stable.
It doesn't matter if you can't double the transistor density of a single cpu if you can just double the number of machines. At the end of the day you still managed to double performance for the same price.
See https://en.wikipedia.org/wiki/FLOPS#Hardware_costs (note: in another thread someone noted that this wiki is outdated/inaccurate. If anyone has the relevant expertise they should help edit it)
CPUs/GPUs are beasts of hardware architecture, being complex mostly due to their flexibility. We can achieve higher performance with dedicated hardware (or FPGAs), and it looks like the economic reasons to do so are slowly becoming more certain.
A similar conclusion can be made for genomics too.
This feels a little off. To me, it feels like image has made the most progress, then text, then sound, then video.
Anyone knows which company they were referring to?
But that's long-term, big-picture. Technologies can remain stagnant longer than you can remain alive.
It’s likely that cloud computing will continue to see decreases in costs, which means the availability of computing will continue to rise after Moore's Law has stopped giving.
Quantum computers might play a role in increasing computational power in the next decade.
Right now a lot of AI research is low hanging fruit recently made available with advances in CPU and GPU. After the advances dry up maybe more fundamental research will produce much better results with our current technology.
Cloud computing costs much more than on-prem; it's available on demand, which is why it's attractive to some companies. For everyone else it's an outsourcing play that unlocks capex for revenue generation.
>Quantum computers might play a role in increasing computational power in the next decade.
Nope. No chance. Best case is that QC will provide improved molecular simulations in 10-20 years. But that will be a very specialised result.
>Right now a lot of AI research is low hanging fruit recently made available with advances in CPU and GPU. After the advances dry up maybe more fundamental research will produce much better results with our current technology.
100%, but the key word is maybe. One problem I anticipate is that the current skill set of grad students and the current management mindset in academia and commercial research are not well suited to the kind of fundamental research that includes maybe and risk. I think a more likely turn is that there will be a focus on how to better exploit the data science and ML tech that we have. Currently we're pretty hit-and-miss, and the results are rather unpredictable and brittle. When did you last find an ML project on GitHub with unit tests?
Yeah, and that will continue to go down, exactly because of what you stated in the next sentence.
It won't increase your FLOP/s in any way. It will mostly help with narrow problem classes, those outside P but in BQP (ignoring polynomial speedups such as Grover's algorithm). To my understanding, none of the problems encountered in NN training fall into that category.
Why would cloud computing continue to see significant decreases in costs once Moore's law stops? Clearly there will be some ongoing decline in infrastructure costs, but if your main requirement is for lots of compute (for AI) then it seems hard to maintain an exponential rate of improvement overall if the silicon is no longer getting faster and cheaper, and the exponential is what you need, else your costs get out of control quickly.
Linear improvements will get us there in the end.
As far as economy of scales goes it’s also not a given.
Large cloud providers are much more adept at reusing old hardware for their PaaS offerings and weaker instances.
I’ve had a chance to talk to a few reps from Dell once, and while cloud providers buy a lot of servers, they also upgrade them much less often than large enterprises that run their own DCs. This is because Amazon can always continue to make money on 3-5 year old hardware, by shifting it off to cheaper tiers or by running things like SQS or Lambdas on it in the background.
On the other hand, if you have a 3-5 year old server and you need more horsepower, and you manage your own DC or rack space, the hosting costs make it uneconomical to keep underperforming hardware.
Not to mention other costs such as support are quite different: if an Amazon server dies, they don't care. Traditional self-hosted environments will require extended support, because they can't have the same redundancy as Amazon.
If you delegate a 5 year old server to some back office applications while you use your new ones for client facing apps the old server is still a mission critical box even if it doesn’t directly run revenue generating services.
Basically Amazon buys to grow, on-prem orgs buy to both grow and upgrade.
If an organization the size of Amazon were completely self-hosted, it would end up buying much more hardware over a decade than Amazon does.
No, they won't. That's the point. To quote directly from the article:
> we’re talking about exponential rates of growth here, linear expense adjustments won’t move the needle.
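A toy illustration of why (all numbers made up):

    for year in range(11):
        demand = 2 ** year                      # compute needed doubles yearly
        unit_cost = max(1 - 0.09 * year, 0.1)   # linear ~9%/yr price decline
        print(year, round(demand * unit_cost))  # total spend still explodes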
Also, a minor nitpick: AI has not succeeded at beating the best players at SC2, though it did beat the vast majority of the ladder.
I would argue that it is, because that’s the driver for industrial automation.
This would also make the games more difficult for humans, and it would erase much of the accumulated knowledge about strategy.
I would be surprised if by the end of the century robots couldn't beat humans at every physical sport in existence.
It's likely that robots and robotically-enhanced humans will both be banned from competing directly with humans in sports, because it'll be no contest -- in the robots' favor.
I'm not even sure what the hardest part of making a champion tennis robot would be.
I'd expect something like wrestling would be much more challenging than non-contact sports, assuming, of course, that the robot's strength and torque was limited to human levels.
Wrestling is actually a pretty interesting case, now that I think about it for a bit, as robots wouldn't need to be limited to human-like bodies. In principle, they could have, say, a starfish-like or octopus-like body, where its tentacles would allow it to grip its human opponent in ways that may be impossible to escape, even were the robot limited to human strength.
So I suspect that in sports like wrestling (and maybe all other sports), robots that are allowed to compete would likely be limited to human-like forms, as other forms could give them too great an advantage.
Because the human it's playing can't. In a sport where endurance is a significant component, wiring one competitor up to power, while the other has to use only the fuel that he/she has stored, doesn't seem like an equitable competition.
Similarly, a human player against a robot that spans the entire width of the court, and that has 48 rackets, is clearly not an equal match on that basis.
In the same way, we don't race Formula 1 cars against NASCAR.
I'm sure there'll be additional interconnect issues to worry about as the physical cluster of dies gets larger and larger.
In particular, their metaphor of an "infinity" point seems to me like the incorrect application of mathematical ideas to social contexts.
I think the question of artificial general intelligence is a different question than a singularity, much like how the Chinese room addresses a different question than the Turing test.
It's the difference between a computer that is taught how to compose okay-ish music, and a computer that learns spontaneously how to compose really really great music and do all of the social, cultural, and financial things required to create a career for itself as a notable composer and then does something entirely new and surprising given that starting point.
They're completely different problem classes, operating on completely different levels of sophistication and insight.
A lot of "real" AI problems are cultural, social, psychological, and semantic, and are going to need entirely new forms of meta-computation.
You're not going to get there with any current form of ML, no matter how fast it runs, because no current form of ML can represent the problems that need to be solved to operate in those domains - never mind spontaneously generate effective solutions for those problems.
I disagree. A program improves itself when it reacts to a problem and implements a solution. Obviously that is very general, but enough. A human of IQ 100 certainly can develop software; a program of IQ 100 should be able to do the same, and then you scale horizontally.
Please read it again. You need to understand what a "problem" and a "solution" both are - in detail - because otherwise you have nothing to work with.
And a human of IQ 100 will only ever develop poor software. If you scale horizontally, you won't get game-changers - you'll just get a flood of equally poor software more quickly.
Can a human with IQ 100 write general AI software that a human with IQ 100 can debug?
We don't understand physics well enough to know whether time (and space) can be arbitrarily halved over and over.
For an actual "infinity point" you would need such halving of computation time to continue down to infinitely small time increments. You would define this via a limit, and again the main premise is arbitrary halving, something we have not seen in nature and which is a mathematical tool rather than a hard physical fact.
Since you have physical computers, the composition of such a computer would have to allow computation in these infinitesimal time brackets, which in turn may, for example, necessitate infinitesimal matter. The way I understand the Planck length, at least as a concept, is that below some really small length of space you essentially don't get anything smaller.
In any case, talking about a "singularity point" is an activity done by social personalities who like to sound clever. But they are not saying anything remotely clear.
As I mentioned, the other argument that one can make is to challenge the ability to indefinitely "improve yourself" as a machine, another dubious claim.
Edit: I am not a physicist, so there could be better ways to think of Planck length and discuss its validity.
Remember Zeno's paradox. "Improving" could mean "incrementally approaching a limit". It doesn't necessarily mean "marching off to infinity".
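A self-improving system can improve at every step and still be bounded; the halving schedule here is just an illustration:

    total, gain = 0.0, 1.0
    for step in range(50):
        total += gain   # every step is a genuine improvement...
        gain /= 2       # ...but each is half the size of the last
    print(total)        # -> 2.0: always improving, never passing the limit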
Then, there's the matter of hardware. An infinitely smart AI running on finite hardware seems likely to be a contradiction. Is the AI going to be able to not just improve its code, but improve its hardware? Without depending on slow, ignorant humans, who might even be having second thoughts about the wisdom of giving this AI even more hardware? That takes considerably more than IQ 100; that takes robots carrying out the AI's wishes, which some factory needs to make. And an IQ of 100 probably isn't going to cut it to hack all the factory control systems, to take them over and have them start producing what the AI wants.