As a comparison, a Google TPUv3 pod with 1024 chips got to 75.2% accuracy with ResNet-50 in 1.8 minutes, and 76.2% accuracy in 2.2 minutes with an optimizer change and distributed batch normalization.
v3 pods don't seem to be publicly available. A 256-chip, 11.5-petaflop v2 pod is $384 per hour, or about $3.366 million per year.
Meanwhile, Google Cloud Volta GPU prices (which are probably inflated relative to building your own cluster, but hopefully close enough for a reasonable ballpark) are $1.736 per hour per GPU, which works out to $7.791 million per year for 512 of them.
Unless Google's GPU prices are really inflated, self-built clusters are legitimately substantially cheaper than cloud GPUs, or these researchers did a poor job, this seems like a good advertisement for TPUs.
 Pod availability / performance / pricing information here: https://cloud.google.com/tpu/
 GPU pricing info: https://cloud.google.com/gpu/
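If you want to check the arithmetic yourself, here's a rough back-of-the-envelope in Python at the list prices above (assuming 24/7 usage over 365.25 days, no discounts or preemptible pricing):

    # Rough yearly cost comparison at the list prices quoted above.
    # Assumes 24/7 usage for 365.25 days; discounts ignored.
    HOURS_PER_YEAR = 24 * 365.25        # ~8,766 hours

    tpu_v2_pod_per_hour = 384.0         # 256-chip, 11.5 petaflop TPUv2 pod
    volta_gpu_per_hour = 1.736          # one Volta GPU on Google Cloud
    num_gpus = 512

    tpu_yearly = tpu_v2_pod_per_hour * HOURS_PER_YEAR
    gpu_yearly = volta_gpu_per_hour * num_gpus * HOURS_PER_YEAR

    print(f"TPUv2 pod:  ${tpu_yearly / 1e6:.2f}M per year")   # ~$3.37M
    print(f"512 Voltas: ${gpu_yearly / 1e6:.2f}M per year")   # ~$7.79M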
They are teaching us how to solve problems with hardware instead of with algorithms, and we, as individuals, will never have access to their computing power.
I work in the field, and the way it works is like this, in this order:
1) Someone works out how to do something
2) Someone works out how to improve the accuracy
3) The accuracy maxes out
4) People improve training efficiency.
We see this over and over again. Take a look at the FastAI results on ImageNet training speed for example.
All of this is within reach of the average developer. If there's any monopoly, it's over training data, which Google and Amazon happen to specialize in. But even large datasets exist as open source.
From what I can tell, machine learning, the precursor to AI, is here, and both knowledge and implementation are fully accessible to the general populace.
First we built awesome high-power single-core chips, then multi-core chips, and continued improving performance per dollar.
A single GPU in 10 years might be able to smoke these 512 GPUs.
Firstly, it will not, because there are physical limitations to silicon-based computing and I don't think we will get a different kind in the next 10-20 years. Secondly, unlike commodity programming, AI is an area where improving algorithm/learning efficiency is infinitely more valuable than figuring out how to throw more hardware at the problem. For starters, any algorithmic improvement makes all further research and experimentation easier for everybody as opposed to a handful of agents who have access to ridiculously powerful hardware. I can list many other reasons, like global energy consumption and the need for learning in embedded devices.
I find it highly annoying that deep learning enthusiasts immediately turn into uber-skeptics whenever the conversation touches on other ML approaches. Makes me wonder how difficult it is to get funding for fundamentally new AI research in this climate.
Improving algorithm/learning efficiency gets much easier when you can iterate faster.
The physical limitations may be reached, but we'll just throw more cores onto the PCB to compensate.
FastAI trained ResNet-50 to 93% accuracy in 18 minutes for $48 using the same code, which can be run on your own GPU machine.
If you want to do it cheaper and faster, you can do the same in 9 minutes for $12 on Google's (publicly available) TPUv2s.
This isn't a monopolization of AI, it is the opposite.
That's what 512 GPUs for 1.5 minutes costs on EC2, using spot p2.16xl instances with 16 GPUs each. The price would be $12 if you had to use on-demand instances.
It does start to add up: if you wanted to do 2 runs per hour, 8 hours a day, 250 days per year (i.e., a “full work year”), it would cost $48k. (And you’d have done 4000 such experiments.)
All of this is to say, I think you might be exaggerating this as the “death of independent researchers”.
A single card can still do meaningful inference on a model trained this way, and can do useful prototyping before you deploy your fanout for a few minutes and $20.
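To spell out that budget arithmetic, here's a quick sketch that just takes the $12 on-demand per-run figure above at face value (spot runs come in lower):

    # A year of heavy experimentation at the ~$12/run on-demand price quoted above.
    cost_per_run = 12.0        # 512 GPUs for ~1.5 minutes, on-demand
    runs_per_hour = 2
    hours_per_day = 8
    days_per_year = 250        # a "full work year"

    runs_per_year = runs_per_hour * hours_per_day * days_per_year
    yearly_cost = runs_per_year * cost_per_run

    print(runs_per_year)               # 4000 experiments
    print(f"${yearly_cost:,.0f}")      # $48,000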
This also ignores all data transfer & storage costs.
There’s no network cost within a region, and you’re probably not downloading terabytes of data out of AWS — so you’re talking about a few dollars a month, as opposed to tens a day in GPU cost. (You’d also have 23TB of RAM in your cluster, if that’s enough for you to use as storage.)
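Rough check on that RAM figure, assuming p2.16xlarge boxes at about 732 GiB each (spec from memory, so treat it as approximate):

    # Rough check of the cluster RAM claim, assuming p2.16xlarge instances
    # (16 GPUs and ~732 GiB of RAM each -- figure from memory, approximate).
    gpus_needed = 512
    gpus_per_instance = 16
    ram_per_instance_gib = 732

    instances = gpus_needed // gpus_per_instance         # 32 instances
    total_ram_gib = instances * ram_per_instance_gib     # 23,424 GiB

    print(instances, total_ram_gib)    # 32 instances, ~23 TB of RAM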
Of course a single Titan RTX can't match the performance of a huge compute cluster, but I can rent a 512-core TPU pod with 11.5 petaFLOPS of compute and 4TB of HBM for $6.40 per minute. Members of the TensorFlow Research Cloud programme can access TPU pods free of charge.
Also, you are talking about a model which is designed specifically for TPUs (the dimensionality of the networks is fine-tuned for them).
And even so, BERT_large still fits in the memory of a single GPU (with a very small batch size); there is a PyTorch implementation. I don't understand: are people complaining that Deep Learning is actually (reasonably) scaling? Isn't that good news?
MS already published another model based on BERT that is even better. It's unlikely that memory x #GPUs will go down in the foreseeable future; it's more likely that everybody will train models as large as their infrastructure allows if they find something that improves their target metrics.
We just submitted a paper, for instance, that is entirely CPU based, and required running 4 CPUs for a few days to reproduce (and even then, you can reproduce 90% of the paper within minutes on a single machine).
I was at an ML conference last year and I asked the panel: if I want to go further in ML, should I do a PhD or work in industry? A professor (!) answered that I should work in industry, as most research groups don't have the funding to be competitive.
They also need idle periods every day, during which they don't even shut down!
You can't power them with PV cells either; instead they rely on carbohydrates produced via a horribly inefficient chemical photosynthesis process.
And if you intend to let your human classifier run for 8 hours a day you better buy at least three of those for error correction.
And I must say this comparison is still quite lenient towards the humans since we're not even comparing them to purpose made silicon entities but generalists.
I assume if I power the silicon with batteries it's going to stop being inferior?
You mean it would take 60,000 people one minute? Doable.