The authors trained ResNet-50 to 75.3% accuracy in 7.3 minutes.
As a comparison, a Google TPUv3 pod with 1024 chips got to 75.2% accuracy with ResNet-50 in 1.8 minutes, and 76.2% accuracy in 2.2 minutes with an optimizer change and distributed batch normalization [1].
A TPUv3 pod is ~107 petaflops (Google's number from your paper). 512 Volta GPUs are ~64 petaflops (Nvidia's number from [1]).
v3 pods don't seem to be publicly available. A 256-chip, 11.5-petaflop v2 pod is $384 per hour, or $3.366 million per year. [2]
Meanwhile, Google Cloud Volta GPU prices (which are probably inflated over building your own cluster, but are hopefully close enough to a reasonable ballpark) are $1.736 per GPU per hour, which would be $7.791 million per year for 512.
Unless Google's GPU prices are really inflated, owning a cluster is legitimately substantially cheaper than cloud GPUs, or these researchers did a poor job, this seems like a good advertisement for TPUs.
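For anyone who wants to sanity-check those yearly figures, the arithmetic is just hours-per-year times the quoted hourly list prices. A rough sketch (assuming 24/7 utilisation at the prices quoted above, which is obviously not how anyone would actually run it):

    # Back-of-the-envelope version of the cost comparison above, at the
    # cloud list prices quoted in this thread, assuming 24/7 usage.
    HOURS_PER_YEAR = 24 * 365

    tpu_v2_pod_per_hour = 384.0    # 256-chip, 11.5-petaflop v2 pod
    gpu_v100_per_hour = 1.736      # one cloud Volta GPU
    num_gpus = 512

    tpu_pod_per_year = tpu_v2_pod_per_hour * HOURS_PER_YEAR
    gpu_cluster_per_year = gpu_v100_per_hour * num_gpus * HOURS_PER_YEAR

    print(f"TPUv2 pod:       ${tpu_pod_per_year / 1e6:.2f}M / year")      # ~$3.36M
    print(f"512 cloud V100s: ${gpu_cluster_per_year / 1e6:.2f}M / year")  # ~$7.79M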
512 GPUs on a 56 Gbps network? I'd rather see researchers exploring potentially more efficient alternatives to traditional neural nets, like XOR Nets, or different architectures like ngForests or probabilistic and logistic circuits, or maybe listen to Vapnik and invest in general statistical learning efficiency.
I have realised that the age of AI is just a movement of big corporations towards a profitable monopoly.
They are teaching us how to solve problems with hardware instead of with algorithms, and we, as individuals, will never have access to their computing power.
You can easily build and train commercial-grade neural nets on consumer hardware. Read about, say, a state-of-the-art image recognition net on arXiv, pull a Python implementation in TensorFlow or Caffe or PyTorch or what have you from GitHub (tons of open source), and with nothing more than a cursory understanding of what you're doing and some basic programming skills, you can run scripts to train and evaluate functional neural networks on your own dataset. I've fully trained numerous modern architectures on a single 1080 Ti to perform image recognition in a matter of days.
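To make that concrete, here is a minimal sketch of that workflow in PyTorch/torchvision: grab a pretrained ResNet-50, swap the head, and fine-tune it on your own images on one consumer GPU. The dataset path ("my_dataset/train") and the 5-epoch budget are placeholders, not a recipe:

    # Fine-tune a pretrained torchvision ResNet-50 on a folder-per-class dataset
    # (e.g. my_dataset/train/cat/*.jpg, my_dataset/train/dog/*.jpg).
    import torch
    from torch import nn, optim
    from torchvision import datasets, models, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tfm = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    train_set = datasets.ImageFolder("my_dataset/train", transform=tfm)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new classification head
    model = model.to(device)

    opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(5):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()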
All of this is within reach of the average developer. If there's any monopoly, it's over training data, which Google and Amazon happen to specialize in. But even large datasets exist as open source.
From what I can tell, machine learning, the precursor to AI, is here, and both knowledge and implementation are fully accessible to the general populace.
>A single GPU in 10 years might be able to smoke these 512 GPUs.
Firstly, it will not, because there are physical limitations to silicon-based computing and I don't think we will get a different kind in the next 10-20 years. Secondly, unlike commodity programming, AI is an area where improving algorithm/learning efficiency is infinitely more valuable than figuring out how to throw more hardware at the problem. For starters, any algorithmic improvement makes all further research and experimentation easier for everybody as opposed to a handful of agents who have access to ridiculously powerful hardware. I can list many other reasons, like global energy consumption and the need for learning in embedded devices.
I find it highly annoying that deep learning enthusiasts immediately turn into uber-skeptics whenever the conversation touches on other ML approaches. Makes me wonder how difficult it is to get funding for fundamentally new AI research in this climate.
We can't even do that; the biggest GPUs today are at the reticle limit of foundries. They literally can't be made bigger. Of course we can still go the chiplet route, but...
... but power requirements would put a limit on that as well. The RTX 2080 Ti already has a 250 W TDP. Put a few of those on a single card and you are looking at >1000 W. Cooling and power become very hard, as we'd effectively be trying to run and cool a space heater at the same time.
Oh well, this is the death of democratic AI and the end of independent researchers :-( There goes any hope of a single Titan RTX producing meaningful commercial models.
That’s what 512 GPUs for 1.5 minutes costs on EC2, using spot p2.16xlarge instances with 16 GPUs each. The price would be $12 if you had to use on-demand instances.
It does start to add up: if you wanted to do 2 runs per hour, 8 hours a day, 250 days per year (i.e., a “full work year”), it would cost $48k. (And you’d have done 4000 such experiments.)
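Spelled out (a trivial check, using the $12 on-demand price per run quoted above):

    runs_per_year = 2 * 8 * 250   # 2 runs/hour, 8 h/day, 250 days/year
    cost_per_run = 12             # USD, on-demand
    print(runs_per_year, "runs ->", runs_per_year * cost_per_run, "USD")  # 4000 runs -> 48000 USD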
All of this is to say, I think you might be exaggerating this as the “death of independent researchers”.
A single card can still do meaningful inference on a model trained this way, and can do useful prototyping before you deploy your fanout for a few minutes and $20.
It’s 25 Gbps between instances, and NVLink between GPUs on the same instance. It’s really a pretty comparable setup.
There’s no network cost within a region, and you’re probably not downloading terabytes of data out of AWS — so you’re talking about a few dollars a month, as opposed to tens a day in GPU cost. (You’d also have 23TB of RAM in your cluster, if that’s enough for you to use as storage.)
GPUs and TPUs are readily available in the cloud. Big institutions have always had massive mainframes or compute clusters, but cloud services have democratised access to those resources.
Of course a single Titan RTX can't match the performance of a huge compute cluster, but I can rent a 512-core TPU pod with 11.5 petaFLOPS of compute and 4 TB of HBM for $6.40 per minute. Members of the TensorFlow Research Cloud programme can access TPU pods free of charge.
Sorry, but even though the big companies produce a lot of interesting research, I challenge you to go through recent publications (the majority coming from academia) and not find interesting models trained on a single GPU. Actually, it's very rare to find a paper where large-scale distributed training is necessary (i.e., the training would fail or take unreasonably long otherwise). Yes, having more money helps you scale your experiments; that's nothing new, and it's not specific to AI.
A trivial example is BERT_large: it won't fit into 12/16 GB and takes ~a year to train from scratch on a single 15 TFLOPS machine. It's now a base model for transfer learning in NLP.
I'm not saying there are no very big models, just that they're a minority of publications; for any trivial example of a big model I can show you 10x as many trivial examples of relevant non-big models.
Also, you are talking about a model that is specifically designed for TPUs (the dimensionality of the network is especially fine-tuned for them).
And even so, BERT_large still fits in the memory of a single GPU (for a very small batch), and there is a PyTorch implementation. I don't understand: are people complaining that deep learning is actually (reasonably) scaling? Isn't that good news?
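For what it's worth, the usual single-GPU workaround for exactly this situation is gradient accumulation: run tiny micro-batches and only step the optimizer every N of them, so the effective batch is larger than what fits in memory. A minimal sketch (it assumes the Hugging Face transformers package and uses random toy tensors in place of a real tokenized dataset; it addresses the memory point only, not the ~year of training time):

    # Simulate a batch of 32 on a card that can only hold micro-batches of 2,
    # by accumulating gradients over 16 steps before each optimizer update.
    import torch
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-large-uncased")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Toy stand-in for a real tokenized dataset: random token ids, binary labels.
    input_ids = torch.randint(0, 30000, (64, 128))
    labels = torch.randint(0, 2, (64,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(input_ids, labels), batch_size=2)

    accum_steps = 16
    model.train()
    for step, (ids, y) in enumerate(loader):
        loss = model(input_ids=ids, labels=y).loss / accum_steps
        loss.backward()                       # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()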
You need to study the state of the art a bit more. You can't reproduce the results BERT_large's authors achieved with TPUs on a Titan V/Tesla P100, because getting there requires substantially larger batch sizes that won't fit into 12/16 GB. If you get a V100/Titan RTX, it would fit, but you'd wait ~a year for a single training session (40 epochs) to finish.
Microsoft has already published another model based on BERT that is even better. It's unlikely that memory × #GPUs will go down in the foreseeable future; it's more likely that everybody will train models as large as their infrastructure allows once they find something that improves their target metrics.
Yes, absolutely. I'm an AI researcher, and most of my colleagues just use their desktop computers with a standard GPU to do their work. Of course, long-running jobs get put on the cloud, as do large distributed jobs, but those are surprisingly rare.
We just submitted a paper, for instance, that is entirely CPU based, and required running 4 CPUs for a few days to reproduce (and even then, you can reproduce 90% of the paper within minutes on a single machine).
If independent researchers don't like that, they should come up with a way to train ResNet in 1 minute on a single GPU. Unless you think the ultimate best method for training neural nets has already been discovered? Somehow, 30 years ago, researchers managed to invent CNNs and RNNs using hardware a million times slower than what you can buy today for a few thousand bucks.
If each experiment takes you literally 400 times longer than it takes a Google researcher, your chances of figuring out anything new drop dramatically.
I was at an ML conference last year and asked the panel: if I want to go forward in ML, should I do a PhD or work in industry? A professor (!) answered that I should work in industry, as most research groups don't have the funding to be competitive.
Silicon is inferior to chemical energy. A human is orders of magnitude more efficient than today's best GPUs. However, speed != efficiency. Classifying ImageNet with a human would take ~1000 hours.
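For context, the rough arithmetic behind that figure (the ~3 seconds per image is my assumption, applied to the ILSVRC-2012 training set):

    images = 1_281_167          # ILSVRC-2012 training images
    seconds_per_image = 2.8     # hypothetical human labelling pace
    print(images * seconds_per_image / 3600, "hours")   # ~996 hours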
On the other hand GPUs are made for computing and will crunch those numbers for you until they die. If you want a human to classify things you have to consider the lifetime cost of making said human and keeping it entertained.
They also need idle periods every day, during which they don't even shut down!
You can't power them with PV cells either, instead they rely on carbohydrates produced via a horribly inefficient chemical photosynthesis process.
And if you intend to let your human classifier run for 8 hours a day you better buy at least three of those for error correction.
And I must say this comparison is still quite lenient towards the humans since we're not even comparing them to purpose made silicon entities but generalists.
[1]: https://arxiv.org/abs/1811.06992