Is this surprising? I think it's expected, given that most published numbers from Google suggest this is only 2x over DistBelief, which Adam Coates improved upon by 6x (http://www.cs.stanford.edu/~acoates/papers/CoatesHuvalWangWu...) in 2013. Coming from an HPC background, DistBelief (and sometimes MapReduce in general) has always felt like a square peg being forced into a round hole. Sure, it "scales", but it's not how you would build a system from scratch to meet the same computational needs. Moving data around is much slower than doing extra math.
DNNs provide some interesting opportunities for domain-specific optimization with lossy math, because empirically the solutions don't seem to require much precision. A general-purpose tensor library holds you back from tricks like synchronization-free Hogwild!, which wouldn't be appropriate in scenarios where more exact computation is necessary.
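For the curious, the Hogwild! idea is tiny: several threads apply SGD updates to shared weights with no locking at all, and for sparse gradients the occasional clobbered update barely hurts convergence. A toy least-squares sketch in plain NumPy (not how any framework actually implements it):

    import threading
    import numpy as np

    def hogwild_sgd(X, y, num_threads=4, lr=0.01, epochs=5):
        # Lock-free asynchronous SGD in the spirit of Hogwild!: every thread
        # updates the shared weight vector without any synchronization.
        w = np.zeros(X.shape[1])

        def worker(rows):
            for _ in range(epochs):
                for i in rows:
                    err = X[i] @ w - y[i]
                    w[:] -= lr * err * X[i]   # racy, unlocked update of shared weights

        threads = [threading.Thread(target=worker, args=(range(t, len(X), num_threads),))
                   for t in range(num_threads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return w

    # Usage on synthetic data: sparse-ish inputs are where Hogwild! shines.
    rng = np.random.default_rng(0)
    X = rng.random((2000, 50)) * (rng.random((2000, 50)) < 0.1)
    y = X @ rng.random(50)
    w = hogwild_sgd(X, y)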
One interesting step in this direction is their implementation of lossy compression with a custom 16-bit half-float representation. However, a lot of people are dialing the precision down even further, including one of the authors of TensorFlow (http://petewarden.com/2015/05/23/why-are-eight-bits-enough-f...).
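As a rough illustration of what dropping precision looks like - generic float16 casting and linear 8-bit quantization in NumPy, not TensorFlow's actual compression code or Warden's scheme:

    import numpy as np

    # Weights stored in float32 during training.
    w32 = np.random.randn(1024, 1024).astype(np.float32)

    # 16-bit: just cast; updates would still be accumulated in float32.
    w16 = w32.astype(np.float16)

    # 8-bit: a simple linear (affine) quantization over the observed range.
    lo, hi = w32.min(), w32.max()
    scale = (hi - lo) / 255.0
    w8 = np.round((w32 - lo) / scale).astype(np.uint8)
    w_restored = w8.astype(np.float32) * scale + lo

    print("float16 max abs error:", np.abs(w32 - w16.astype(np.float32)).max())
    print("uint8   max abs error:", np.abs(w32 - w_restored).max())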
Scott Gray, one of the principal engineers at Nervana Systems, has a very wise reply in the GitHub thread: they aren't doing much to optimize for memory locality, and won't have amazing speedups until that happens.
From Google's perspective it is probably more about how TensorFlow scales out horizontally. If a researcher fires off a Borg run (or whatever they use now) and the job takes a few thousand CPUs, no problem, at least for research.
They must have better optimizations for running in production, such as in-place operations.
I'm going to say something unpopular, but horizontally-scaled deep learning is overkill for most applications.
Can anyone here present a use case where they have personally needed horizontal scaling because a Titan X couldn't fit what they were trying to do? It's the thing I hear talked about the most and used the least.
The biggest misunderstanding I've heard is "I have petabytes of data so I need multi-GPU", but NNs are trained on batches of data, so the total dataset size doesn't matter. It's the model that must fit in memory, and 12GB can fit a big model.
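Back-of-envelope, for parameters alone (the parameter count and optimizer-state assumptions here are illustrative, not tied to any particular model):

    # Rough memory budget for model parameters alone (activations not included).
    num_params = 138_000_000        # e.g. a VGG-16-sized model
    bytes_per  = 4                  # float32
    copies     = 3                  # weights + gradients + momentum buffer

    gb = num_params * bytes_per * copies / 1024**3
    print(f"{gb:.2f} GB of a 12 GB card")   # ~1.54 GB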
The whole "who can build the biggest net" thing is like building the fastest car: it's for show; the slower ones are way more cost-effective and get you there just as well.
Keras is my present deep learning library recommendation--it's like Torch7, but in Python on top of Theano. Multi-GPU is on the roadmap because there's a lot more going on in deep learning that acts as a bigger differentiator than horizontal scaling.
For the record, when I say scaling, I'm not talking about having multiple machines serve multiple requests (that's a good reason to purchase more machines), I'm talking about spreading your gradient computations and/or model weights across machines or GPUs.
I suspect this is a just-so story, because the capability is so new and the GPU programming skill required to implement it efficiently is so rare. TL;DR: just say no to parameter servers (most of the time).
But IMO the real obstacle to horizontal scaling is the communication between servers, not the usefulness of doing so. Within a server, one can pack 8 Titan X/M40 GPUs with high-speed 13 GB/s P2P communication between them (and up to 16 GPUs in various unproven science-project servers). That's ~50 TFLOPs and 96 GB in a box. That rocks. Just ask the guys at mindori (if they ever ship, that is).
But between servers lies a sippy straw of 100 Gb/s InfiniBand at best or, worst case, ~1 Gb/s on AWS GPU servers with freaky nearest-neighbor weather. If you can't make efficient use of 8 GPUs in a single box, I agree: don't bother breaking out to the next server.
That said, frameworks like mxnet have opened the floodgates to experimenting with larger models and more distributed training algorithms. Time will tell if this pans out. But 2 years ago, Andrew Ng's group showed 12 GTX 680s distributed across 3 servers kicking the crap out of Google Brain. I expect more of this, not less, in the near future.
For big LSTMs and long-ish sequences, the intermediate gradients can take up a huge amount of memory - often more than the model parameters themselves. In my experience it is mostly big LSTMs that need the 12GB+ GPUs. You can reduce the batch size to help this a bit, or train using truncated BPTT, but RNN training is already a slow, sequential business.
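Some back-of-envelope arithmetic on why that happens; the layer sizes and the per-step activation factor are illustrative guesses, not measurements from any framework:

    # Why BPTT memory dwarfs parameter memory for big LSTMs (float32 throughout).
    hidden, layers, batch, seq_len = 2048, 4, 128, 500

    # 4 gates, each with input weights, recurrent weights, and a bias.
    params_per_layer = 4 * (hidden * hidden + hidden * hidden + hidden)
    param_gb = layers * params_per_layer * 4 / 1024**3

    # Backprop through time keeps every timestep's activations for the backward
    # pass (roughly: 4 gates + cell + hidden state per layer, per example).
    acts_per_step = layers * batch * hidden * 6
    act_gb = seq_len * acts_per_step * 4 / 1024**3

    print(f"parameters: {param_gb:.2f} GB, stored activations: {act_gb:.2f} GB")
    # ~0.5 GB of parameters vs ~11.7 GB of activations; truncated BPTT shrinks seq_len.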
Of course, there are no clear computational wins (generally, losses) from scaling RNNs horizontally across GPUs - but sometimes you really do need more than 12GB.
1. No clear wins in horizontal scaling
2. Reduce batch size
3. Use truncated backprop
4. *Search for better hyperparameters*
I usually do 2-4. After those, have you really seen scaling result in significant accuracy gains? And what percent of the time is that necessary? Genuinely interested--and you guys rock btw!
One case where I see 1 as being necessary is the softmax size / vocabulary boost in the seq2seq paper (8 GPUs IIRC, and 4 were dedicated to the softmax!). There are other ways to handle this (hierarchical softmax, sampled softmax, the skip-thoughts trick of using the word2vec vocabulary), but every time someone figures out how to have a larger vocabulary, neural machine translation results seem to improve. In general, 1 is a last resort for me - but maybe this is due to current tooling and availability of hardware as much as anything?
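For reference, the sampled-softmax idea itself is small: score the target against a handful of randomly drawn negative classes instead of the whole vocabulary. A toy NumPy sketch (uniform negative sampling for brevity; real implementations use log-uniform sampling and drop accidental hits on the target):

    import numpy as np

    def sampled_softmax_loss(hidden, W, b, target, num_sampled, rng):
        vocab_size = W.shape[0]
        negatives = rng.choice(vocab_size, size=num_sampled, replace=False)
        classes = np.concatenate(([target], negatives))
        logits = W[classes] @ hidden + b[classes]   # score only the sampled classes
        logits -= logits.max()                      # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return -np.log(probs[0])                    # target sits at index 0

    # Usage: a 512-dim decoder state against an 80k vocabulary, scoring 65 classes.
    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((80000, 512)) * 0.01, np.zeros(80000)
    loss = sampled_softmax_loss(rng.standard_normal(512), W, b,
                                target=1234, num_sampled=64, rng=rng)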
> horizontally-scaled deep learning is overkill for most applications
It's difficult to say what "most applications" are, but many algos currently seem bottlenecked by how fast you can load data into your GFX card (hint: it's slow). So what you're saying about batching is correct, but it does matter.
> many algos currently seem bottlenecked by how fast you can load data into your GFX card
All of them are, but that's a PCIe issue and horizontal scaling doesn't fix it (unless you're using NVLink or similar, AFAIK, but then you face the fact that current horizontal-scaling schemes aren't very effective at increasing model accuracy anyway).
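The usual mitigation is to hide the transfer behind compute by prefetching batches on a background thread. A generic sketch - `load_batch` and `train_step` here are hypothetical stand-ins, and no actual GPU copy is shown:

    import queue, threading
    import numpy as np

    def prefetch_batches(load_batch, num_batches, capacity=4):
        q = queue.Queue(maxsize=capacity)

        def producer():
            for i in range(num_batches):
                q.put(load_batch(i))   # read + decode + augment on the CPU
            q.put(None)                # sentinel: no more data

        threading.Thread(target=producer, daemon=True).start()
        while True:
            batch = q.get()
            if batch is None:
                break
            yield batch

    # Usage with dummy stand-ins for the loader and the GPU work.
    load = lambda i: np.random.rand(128, 3, 224, 224).astype(np.float32)
    train_step = lambda x: x.mean()
    for batch in prefetch_batches(load, num_batches=10):
        train_step(batch)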
> It's difficult to say what "most applications are"
Never mind "most applications"--so far, all I've heard is one, that being the absolute bleeding edge of RNN research, and only if you're using a huge softmax instead of an alternative.
My point remains: multi-GPU is way down the list of features a DL framework should have, because very, very few people need it.
kajecounterhack, do you use multi-GPU for your DL work? If so, how often?
>do you use multi-GPU for your DL work? If so, how often?
I'm using it right now, to great effect. I can't really say what for, but methinks I'll be using it every hour of every day for the foreseeable future.
Our experiments might support this.
With 4 or 8 Titan Xs, training time is shortened from more than one week to one or two days.
Code is available at https://github.com/yjxiong/caffe
Data parallelism is easy to implement and gives near-linear training-time speedups, as in going from 1 week to 2 days with 4x the hardware. But it doesn't give you bigger models.
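The core of data parallelism fits in a few lines: shard the batch, compute gradients per shard, average them, and apply the same update on every replica. A single-process sketch with a toy least-squares objective standing in for the real model:

    import numpy as np

    def data_parallel_step(params, grad_fn, batch, num_workers, lr=0.01):
        shards = np.array_split(batch, num_workers)
        grads = [grad_fn(params, shard) for shard in shards]  # would run on separate GPUs
        avg_grad = sum(grads) / num_workers                   # the all-reduce step
        return params - lr * avg_grad                         # every replica applies this

    # Usage with a toy objective: gradient of ||X w - 1||^2 with respect to w.
    grad_fn = lambda w, X: 2 * X.T @ (X @ w - 1) / len(X)
    w = np.zeros(8)
    w = data_parallel_step(w, grad_fn, np.random.rand(256, 8), num_workers=4)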
Model parallelism is what gets you bigger models, and it is what I've been referring to. It is overkill and doesn't work well anyway. Frameworks shouldn't be expected to support it because there are more interesting topics in deep learning they could support instead.
Better research and methods will come out eventually, and then this calculus will change, but not yet! Today it is the very definition of premature optimization in nearly all cases.
It's a good example of something people think they need multi-GPU for, but no, I don't think it's needed. For the same reason as "I've got petabytes of data".
The rest of us don't have "a few thousand CPUs", though. Seen this way, TensorFlow is largely inaccessible, and it makes more sense to focus on existing libraries optimized for single CPUs.
I don't think you understand what this library is for. You can pick up a graphics card for $100 and voilà, now you've got thousands of cores. In your defense, the comment you answered seems to have mixed a few random things together too.
As far as I know, TensorFlow's scaling behavior has not been published.
For its predecessor, DistBelief, here are some quotes from the NIPS 2012 paper:
"The moderately sized speech model runs fastest on 8 machines, computing 2.2x faster than using a single machine. (Models were configured to use no more than 20 cores per machine.) Partitioning the model on more than 8 machines actually slows training, as network overhead starts to dominate in the fully-connected network structure and there is less work for each machine to perform with more partitions." ("The moderately sized" model here has 42 million model parameters. Check the paper for details.)
"In contrast, the much larger, locally-connected image models can benefit from using many more machines per model replica. The largest model, with 1.7 billion parameters benefits the most, giving a speedup of more than 12x using 81 machines. For these large models using more machines continues to increase speed, but with diminishing returns."
Until now, I've seen two responses to Google's TensorFlow from Facebook employees. Yann LeCun seemed to really challenge Jeff Dean about TensorFlow's scalability [1], and this benchmark puts TensorFlow near the bottom in all the measures it tested. I can't ignore the possibility that this criticism of TensorFlow from Facebook employees (while factually correct and constructive) might be driven by some competition and jealousy.
I'm the one who runs the benchmarks. It's sad that you put such a twist to the whole thing.
I've been running convnet-benchmarks forever now, and I've been running them independently on separate personal hardware. I do this as a hobby.
I've done an apples-to-apples comparison, and my benchmark review only puts the facts forward; I don't attack them.
If you read my other social media comments, I've been pretty positive about TensorFlow, and I've even put in some groundwork to write Torch bindings for it.
Please stop spewing nonsense interpolated from like 2 super-weak data points.
This benchmark basically shows that releasing TensorFlow with a cuDNN v2 backend hurts - v2 is quite a bit slower than v3 (current) and v4 (upcoming). TF has announced that they will update to v4 support, which should help quite a bit - but when many hobbyists and researchers are developing on one or two GPUs, performance at that scale is more important (to them) than infinite scalability.
It is not surprising that a tool developed and focused on "Google scale" work has some imperfections in a wildly different setting. The question is - will they (or some dedicated contributor) speed up a use case the business itself may not have a use for? My gut feeling is that they will, but these things usually don't happen overnight. Torch, Theano, and Caffe have years of work put into them and have largely been focused on the one to two GPU case.
My point is more that even if they were on the same footing from benchmark timings, v2 is still far behind what is supported in Torch, Caffe, and Theano right now (v3 in all IIRC). Your comparison is very fair, and it is good insight!
You might be seeing the final gasps of Google's longstanding and now-reversed anti-GPU stance in TF's initial GPU performance.
NVIDIA support alone will make sure TF knocks it out of the park down the road. IMO it's crazy to consider the first release the final say on TF's GPU performance. Caffe, Torch, and Theano have a huge head start (and lots of pre-existing technical support from NVIDIA).
The biggest limitation I see right now is that their multi-GPU algorithms are really simple and inefficient. That will change I'm sure now that they're getting benchmarked against everyone else.
I don't think anyone is claiming TF won't get much, much faster (probably very soon). But claiming that right now - today - TF kills existing toolkits like Torch, Caffe, and Theano (which I have seen here and elsewhere - though not claiming you are in this camp) is a bit premature.
Even with v4 support, which is coming soon, the typical setup users have is one or two GPUs in one machine. This means adding the right in-place operations can have a huge impact on performance, and that is probably a use case Google has not focused on given their internal infrastructure. I am sure they will normalize to "at or slightly above" the performance of other toolkits, but the question is when?
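For anyone unfamiliar with the term, the distinction in plain NumPy (nothing to do with TensorFlow's actual API) - the in-place form reuses the input buffer instead of allocating a fresh one, which matters when activations are hundreds of megabytes:

    import numpy as np

    x = np.random.randn(4096, 4096).astype(np.float32)   # ~64 MB buffer

    y = np.maximum(x, 0)            # out-of-place ReLU: allocates another ~64 MB
    np.maximum(x, 0, out=x)         # in-place ReLU: overwrites x, no new allocation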
This benchmark doesn't have anything to do with multi-GPU as far as I am aware - these are single machine, single GPU results. I would wager 90% of the deep learning hobbyist and research communities run in this setting, so benchmarking this is really important.
For people with huge amounts of networked and distributed resources, distributed support will be amazing!
In-place is a nice late-phase optimization, but the root problem here appears to be that Google's convolution kernels are crap compared to those in cuDNN v3 and Neon (I asked Scott Gray about this directly and I trust his wisdom).
No surprises there whatsoever. The TensorFlow engine is the most gold-plated POS I've seen in a long time. If it were running on 1000+ servers, I'd get the level of overengineering they've applied here. But single-server pthreads? WTF?
Also, a parameter server is dumb unless you sweat the implementation of the gather/reduction ops, but I digress.
Do we know if this is just a reference implementation or what Google uses in production? My guess is the numbers will come down pretty quickly. From what I can see of mxnet versus Theano versus Torch, the GPU is really the final determinant of speed, not the framework.
I think I'll give the benefit of the doubt to, you know, the pioneer of deep learning, inventor of convolutional neural networks, and (co)-inventor of the backpropagation algorithm.
Just because you invented an algorithm doesn't mean you know how to implement it in the most efficient way possible.
I think the key to TensorFlow is not how fast it runs on 1 machine, but how fast it runs on 10,000.
Consider map-reduce (Hadoop). Sure, you can sort 1GB of data on a single machine 10x faster (using /usr/bin/sort) than using Hadoop on that machine; but make the data 1TB and add 1000 machines, and now let's see how fast you can sort with /usr/bin/sort!
I'm just pointing out how ridiculous it is to cast aspersions on the judgment of one of the world's leading experts on deep learning.
It's also getting out of hand because everyone has already decided TensorFlow must be amazing, so everyone is extolling virtues that, right now, are only speculation and PR.
>I'm just pointing out how ridiculous it is to cast aspersions on the judgment of one of the world's leading experts on deep learning.
A position of authority doesn't mean the person is immune to jealousy. Someone in a high status position is going to be much more likely to have a knee-jerk defensive reaction to news that threatens their image.
Yann LeCun has been working on hardware implementations of convnets almost since the beginning. LeNet-5 (the check reader at AT&T, circa the early '90s) needed a dedicated, specialized hardware implementation, IIRC. It could be knee-jerk, but he definitely has the background to make a fair assessment - and this benchmark seems to back his claims.
These benchmarks evaluate single-node performance. LeCun's remarks were concerning distributed training (specifically that bandwidth between machines is a limiting factor to scalability) -- which we can't test yet since the current version of TF is single-node only.
Dean's response in the video - "it depends on your [computer] network" - is an interesting one :).
Both Google and FB (from what I understand) have a ton of tricks to help this, but I expect Yann was speaking generally even with all these tricks, and he is right from what I have seen. The old paper on DistBelief talks about topping out at ~80 machines due to network overhead - it would be great if they talk more about distributed TF in an upcoming paper.
If TF really has a way to make general, networked, distributed training efficient (beyond 1-bit weight updates, low-precision weights, and all the other crazy tricks which already exist) - that is truly remarkable and they rightly deserve huge kudos.
If they are faster in distributed training only on Google machines or with Google's network architecture, that isn't really a useful datapoint for the general public.
We could wire everything with 10GB/s NICs, change to jumbo frames, pull all the other bandwidth-reducing tricks, trick out the Linux kernel, etc., and then your network probably won't matter - but that isn't really general or cheap. The key question is: what is the minimum effort necessary to avoid network bottlenecking, and does TF lower that minimum compared to existing solutions?
Machine learning on 1 machine is way more common than machine learning on 10,000 machines! I doubt even Google uses 10,000 machines for Google Translate.
It's more realistic to discuss libraries for 1 machine / 1 GPU.
The responses seem to show that the way you implement things can make a big difference in runtime. Perhaps the scripts used for benchmarking can be further optimized?
That said, the lack of in-place operations might be surprising (although it has been said that they are coming)
Interesting benchmarks. One hopefully constructive critique: if you say things go out of memory, it'd be really useful to know what your setup is. Maybe you've got a big array of massive GPUs, or you're running it on a more normal consumer GPU and box.
According to https://github.com/soumith/convnet-benchmarks, it's an NVIDIA Titan X (12GB of GPU memory), which is pretty much the top-of-the-line GPU for training neural nets. From my limited experience, most deep learning in research - including much of the state of the art - is done on single GPUs. (My guess is that if your model doesn't fit in 12GB, it has way too many parameters to practically train anyway.)
It's almost like Google wanted everyone to use slow obsolete software and keep the really good stuff for itself, while still making it look like they're doing a great thing for the community.
a) It's Google; they can throw hardware at a problem such as running out of memory
b) TensorFlow makes methods development so much easier that it's worth the loss of performance
c) It's early days, and the compute-graph scheduler is designed for optimization - with lots of opportunities for it - in a more flexible fashion than other frameworks.
When I work on developing methods for scientific code, I worry more about whether the code is bug free/easy to understand and that it's giving the right answer. I don't usually worry about performance unless I'm actually not able to run things. Especially since when developing stuff you waste way more time on runs with bugs - I dread to think what my (published / unpublished) CPU hour ratio is. If the new approach allows less buggy implementations then that's a resource win.
> b) TensorFlow makes methods development so much easier that it's worth the loss of performance
Indeed, if TensorFlow means I can try out an idea with 1 day of coding and 2 days of training rather than 3 days of coding and 1 day of training then I can spend a day drinking cocktails and reading books and still be finished sooner.
Based on the tutorials it seems like I'd be able to pretty quickly build a translation pipeline, and in fact there's an implementation of that I think I'll try. If it takes a week or two to train, that's fine by me, I've got other things to be getting on with.
Seeing as Google has one of the most powerful computing networks in the world, I bet it's more that they can just throw resources at it until it's fast enough. This helps justify the trade off between speed and ease of use.