In terms of "how is CNTK better than TensorFlow," CNTK trains 5-10x faster at minimum on pre-2.0 LSTM benchmarks, which is big since LSTMs are used for a lot nowadays. (https://arxiv.org/abs/1608.07249)
Disclaimer: I work at Microsoft.
The "they" who chose not to use the cuDNN bindings were the authors of the benchmark. Some of the Torch folks filed a bug with the HKbench folks for the same error, but with respect to Torch: https://github.com/hclhkbu/dlbench/issues/14
Pull a researcher aside, and I bet you every one will have at least one experience of trying to reproduce some reported--sometimes even lauded--result but being unable to.
The good news is that this kind of cultural fix is really independent of compute platform or modeling framework. It's just nice that MSFT appears to be making it a priority.
Classic hacker news right here.
I'm fairly sure you don't realize what a major problem reproducibility is for the scientific and ML community, but it's big. A commitment to it and to transparency is a pretty big deal.
"...It was made very clear from the first day of TensorFlow’s announcement, that Google created two TensorFlow versions: a public version and an internal version. As a TensorFlow user, one either must tolerate the slow speed of the public version, or pay to run the TensorFlow job on Google’s cloud.
There is one TensorFlow. The differences between using TF internally and externally have primarily to do with which RPC bindings it uses (the external one uses gRPC, which is open-source, and the internal one uses the internal RPC framework, which is tied in with all of the internal cluster stuff and authentication and whatnot), and things like filesystems that only exist in Google. The other difference is that there are linkages to use TPUs, instead of just GPUs, which is hardware that doesn't exist outside of Google. The final differences are just in how the BUILD files link against library files -- the external version downloads protobuf for you, the internal version assumes it's there to use. yadda yadda yadda.
You can see all of this in the code. It leaks out in places, such as:
Yes, it's that _super secret_ use of a different integral_types.h header. (/sarcasm). If you look through for things like PLATFORM_GOOGLE in the defines, you'll see a lot of the things that differ, and they're incredibly boring. The core of TensorFlow performance-related stuff is Eigen (or, thanks to Intel's recent contributions, Intel MKL) for executing Tensor ops on CPU, or cuDNN for executing Tensor ops on GPU. Just like every other freakin' framework out there. There's a reason that all of these things tend to reduce to the performance of cuDNN...
See also Pete Warden's article: https://www.oreilly.com/ideas/how-the-tensorflow-team-handle...
("we use almost exactly the same code base inside Google that we make available on GitHub").
(Source: I'm a part-time hanger-on on the Brain team, which develops TensorFlow. I'm also a Carnegie Mellon professor most of the time, and I despise marketing getting in the way of truth.)
The benchmark numbers on tensorflow.org/performance are generated using GCP, AWS, and an NVIDIA DGX-1, all using exactly the capabilities any ordinary user has on those platforms. The K80 distributed training results are from AWS.
There's also a very useful set of suggestions for how to tune TensorFlow for best performance, as well as scripts that reproduce the benchmarking results: https://www.tensorflow.org/performance/
I see that since my comment, Microsoft has updated the claims in the cited page. It's still not true that there are two versions, but I'm glad you're trying to provide more detail. I'd like to stick a big [citation needed] on the claim that the internal version is much faster.
At the time Mu Li did his performance analysis of MXNet vs TensorFlow, we hypothesized that gRPC overhead was one of the reasons that MXNet was showing better scaling numbers than TF. That turned out not to be very correct: the TF team identified several things that narrowed the scalability gap considerably around the 1.0 release. I don't feel confident that gRPC is much of an impediment to scalability. (I'm also not saying that it isn't -- just that I don't think there's a lot of evidence one way or another.)
I'd love it if the CNTK team or someone else were to publish high-quality, head-to-head scalability numbers using the best practices and scripts identified in the TensorFlow performance guide, and using the equivalent CNTK best practices. It benefits everyone when Microsoft and Google work hard to out-do each other. :) (And throw in MXNet as well, with Amazon's best guidance.)
MXNet has a highly efficient network stack that's open source; Caffe2 uses Gloo, which is open source; CNTK primarily uses Open MPI and NCCL, and soon NCCL 2.0. I think it's fair to ask that Google also open-source its internal network stack, because it is the key to scaling.
Most convolutional networks are not a stress test for scaling because the ratio of model size to computation is too low. Use a speech model that has many fully connected layers, or VGG16/19, and the communication cost will dominate; that's when CNTK's 1-bit SGD and Block Momentum really shine.
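To make the communication argument concrete, here is a minimal sketch of the general idea behind 1-bit gradient quantization with error feedback. It is an illustration under stated assumptions, not CNTK's implementation: the function names, the choice of per-sign mean values for reconstruction, and the flat list-of-floats gradient are all hypothetical simplifications.

```python
def one_bit_decode(bits, pos_value, neg_value):
    """Rebuild an approximate gradient from one bit per element plus two floats."""
    return [pos_value if b else neg_value for b in bits]


def one_bit_quantize(gradient, residual):
    """Quantize (gradient + residual) to one bit per element.

    Returns (bits, pos_value, neg_value, new_residual). Only the bits and the
    two reconstruction values would need to be sent over the network; the
    residual stays local and is added to the next step's gradient.
    """
    # Error feedback: add back whatever the previous quantization step lost.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [v >= 0.0 for v in corrected]

    # Assumption: reconstruct each sign group with the mean of its members.
    positives = [v for v in corrected if v >= 0.0]
    negatives = [v for v in corrected if v < 0.0]
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0

    decoded = one_bit_decode(bits, pos_value, neg_value)
    # The quantization error is carried forward instead of being dropped.
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, pos_value, neg_value, new_residual
```

With 32-bit gradients this cuts the payload to roughly one bit per value plus two floats, which matters most for models dominated by large fully connected layers.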
Again, I work at Microsoft.
Using NCCL is great; TF now supports it, as of about a month and a half ago (though I don't know how tightly integrated it is): https://github.com/tensorflow/tensorflow/blob/master/tensorf...
From the benchmarks available, and not knowing what your in-house experiments show, I don't believe that the "internal network stack" is key to scaling. The scalability numbers shown on tensorflow.org/performance are very reasonable: from 902 images/sec to 1783 (1.97x) going from 32->64 K80 GPUs on Amazon for Inception v3, and 565->981 (1.7x) for ResNet-152. I'd love to be proved wrong.
That 1.7x scaling on ResNet-152 would be a great point of comparison, for example. From my student Hyeontaek's results, I actually suspect that there are scheduling improvements that could make up some of that difference, not networking improvements.
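For anyone who wants to check the arithmetic behind those scaling figures, here is a small sketch. The throughput values are the ones quoted in the comment above; the function name and the "efficiency" definition (measured speedup divided by ideal linear speedup) are my own illustrative choices.

```python
def scaling_report(throughput_before, throughput_after, gpus_before, gpus_after):
    """Return (speedup, efficiency) for a change in GPU count.

    speedup    = measured throughput ratio
    efficiency = speedup relative to perfect linear scaling
    """
    speedup = throughput_after / throughput_before
    ideal = gpus_after / gpus_before
    return speedup, speedup / ideal


# Images/sec going from 32 to 64 K80 GPUs, as quoted in the comment.
for name, before, after in [("Inception v3", 902, 1783), ("ResNet", 565, 981)]:
    speedup, efficiency = scaling_report(before, after, 32, 64)
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} of linear")
```

An efficiency close to 100% means doubling the GPUs nearly doubled throughput; the gap from 100% is the part that scheduling or networking improvements could still recover.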
As I'm sure you know, and are just fishing for, the reason the external code links against gRPC is that trying to extract Google's internal networking code from the full internal software codebase would be ridiculous. I think it's far more likely that things go in the other direction, with everything settling on gRPC -- gRPC is actually newer, and in general more featureful, than Stubby: https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Int...
Disclaimer: I work at Google.
Obviously, this is taking the word of someone who is incentivized to get as many people using TF as possible, but I haven't been given a reason to not believe him.
I believe that the public TensorFlow repo is very close to what they use internally. That said, I'm sure there is a huge amount of internal tooling (for easily spinning up clusters internally, profiling, probably an automatic device placer) that we don't get to see. But that has more to do with the fact that it's designed with Google's specific infrastructure in mind than it has to do with "hoarding the good stuff".
I personally don't care what tools or frameworks people use to get work done and have repeatedly suggested people use whatever works best for them.
It also wouldn't make any sense to hoard any good stuff internally if we wanted to provide a useful framework that people wanted to use externally.
We actually don't have a huge amount of tooling internally that we hold back. The only tooling I really use is the [timeline](https://github.com/tensorflow/tensorflow/blob/f488419cd6d925...) for debugging performance (not the EEG tool in the whitepaper, I've never used it). That's available externally though, as you can see.
On some of the comments in this thread in general, I'm pretty sad at the lack of scientific rigor in the community, and that goes for any person who publishes code that differs from the results they claim, regardless of affiliation. I am happy about projects like OpenAI's RL baselines, as I know others are too.
Most of the papers and articles comparing performance of frameworks have lots of bugs and aren't even comparing the same model computation between frameworks. In fact, CNTK's article points to an external benchmark showing CNTK in a good light, but those benchmarks have bugs in them rendering the comparison incorrect (we've been sending PRs to fix them). I find it disappointing that the culture of the organization promotes calling others out for being 'irresponsible' except when it suits them.
The TensorFlow team hasn't published many articles comparing performance directly to others because it's honestly a lot of hard work to verify that you are comparing fairly. Of course, I do think TensorFlow needs to improve performance out of the box for people, and the team is working on that.
I think part of the conspiracy theorizing is due to the misconceived notion that Google has some "secret sauce" that allows it to do what it does, as opposed to many talented engineers spending a lot of man-hours on a problem. There has also been a fair amount of negative Google sentiment in the community recently, and the story that Google is holding out on developers feeds into this narrative.
Benchmarking has always been low-hanging fruit for community members to latch onto for the sake of attacking/defending a particular framework. However, the practical differences between these frameworks (assuming each is configured properly) seem to be within a margin of error and are constantly changing (not to mention the inconsistencies you mentioned), so choosing a framework solely on its benchmarking scores is narrow-minded.
From what I've seen, benchmarking has been more useful as a discovery mechanism for areas in a codebase that can be improved. The TensorFlow team has done an excellent job of using various benchmarks to guide development, and I imagine other frameworks are doing the same.
I have an Electrical Engineering degree and am good at math (math is a passion for me).
I am extremely interested in ML and would like to start by doing stuff, rather than theoretical aspects.
From the material I have read on ML so far, there seems to be a huge number of variables governing the outcome of a particular method / algorithm (number of data points, number of learning iterations, etc.).
If I had to pick a toolkit to get started with ML, is this a good one? (I am aware of scikit-learn, TensorFlow, etc.)
If this is the one, what book/books can I keep as a reference while working with the toolkit?
I usually select a project, work out the human-machine interaction (UI, backend stuff, etc) on a functional level and then select a stack for implementing the project. I also change the functional aspects of my original design if the stack I have selected offers some commonly used functions.
My initial project is to develop a machine learning system that can detect various QR codes in an image and get their contents.
If you want to go more high-level, use something like Keras. You define your network structure as a series of layer types, or maybe in other ways, and it does most of the logic for you, and has already implemented most of the commonly used deep learning techniques. So you concentrate more on the network structure, on what techniques you want to use, etc. Keras also supports several backends, such as Theano and TensorFlow, and CNTK support is a work in progress. As a user you won't notice much difference, except that one backend may be faster than another or may not support some specific functionality.
Since I have a pre-set project at hand, I want to use the system first, so as per your suggestion, I'll go with Keras.
However, I want to understand the system at a deeper level, purely as a curiosity.
One is the 1-bit SGD that has long been in CNTK and has been criticized for its weird license. I am not a lawyer, but my understanding is it says something like you cannot use this unless you call it from inside CNTK. The terms are not that bad and you don't have to use 1-bit SGD if you don't like them.
CNTK 2.0 has another 1-bit thing going on as well which is binary convolution. This uses the Halide compiler to generate code that is 10x faster than optimized 32-bit convolution. This still seems to be at a proof of concept stage.
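For intuition on why binarized convolution can be so much cheaper, here is a minimal sketch of the commonly used XNOR/popcount formulation, in plain Python rather than Halide. This is an assumption about the general technique, not a description of CNTK's generated code; the function names and the 1-D layout are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of numbers into an int: bit i is 1 if values[i] >= 0 (+1), else 0 (-1)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors stored as packed bits.

    Matching bits contribute +1 and mismatching bits contribute -1, so the
    result is n - 2 * (number of mismatches), computed with XOR and a popcount.
    """
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches


def binary_conv1d(signal, kernel):
    """Valid-mode 1-D correlation where signal and kernel are both binarized by sign."""
    n = len(kernel)
    kernel_bits = pack_signs(kernel)
    return [
        binary_dot(pack_signs(signal[start:start + n]), kernel_bits, n)
        for start in range(len(signal) - n + 1)
    ]
```

On real hardware a single XOR plus popcount instruction handles 32 or 64 multiply-accumulates at once, which is where the large speedups over 32-bit floating-point convolution come from.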
I ask because when CNTK was first shared it was also under a non-commercial license. It seemed to subsequently drop off the radar for many people (e.g., it wasn't mentioned in Stanford's ConvNet course or in Udacity's Machine Learning course).
I wonder how their distributed training setup compares to MXNet's ps-lite (in terms of performance and licensing).
There is also the v7 benchmark, run on a lot more hardware, where TensorFlow fares a bit better - http://dlbench.comp.hkbu.edu.hk/?v=v7
Does anyone know whether TF had a performance regression between v0.11 and v1.0 or if it was just lucky on benchmark v7 and unlucky on v8?
Also, how does CNTK manage to be that much better than anyone else on LSTMs? Its ability to scale to bigger batch sizes is unreal. An order of magnitude faster than other frameworks.
TensorFlow has multiple LSTM implementations, such as LSTMCell, BasicLSTMCell, LSTMBlockCell, and also one wrapper for cuDNN, and maybe more. I'm quite confident that in this benchmark, for TensorFlow, they did not use the cuDNN wrapper, which is a bit unfair I would say. The cuDNN wrapper in TensorFlow does not support sequences of different lengths, but you could overcome this by padding and just ignoring the unused frames. See here for some more details:
Note that you could also provide your own LSTM kernel for TensorFlow, which is what we do in our framework, and then you can get really fast, although our benchmarks are a bit outdated.
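The "ignore the unused frames" workaround mentioned above usually means padding every sequence in a batch to the same length and masking the padded positions out of the loss. A minimal framework-free sketch, with hypothetical helper names and plain Python lists standing in for tensors:

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length sequences to a common length and build a 0/1 mask.

    mask[i][t] is 1.0 for real frames and 0.0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1.0] * len(seq) + [0.0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_mean_loss(per_frame_loss, mask):
    """Average the loss over real frames only, so padded frames contribute nothing."""
    total = sum(
        loss * m
        for loss_row, mask_row in zip(per_frame_loss, mask)
        for loss, m in zip(loss_row, mask_row)
    )
    count = sum(m for mask_row in mask for m in mask_row)
    return total / count
```

The fixed-length kernel still does wasted work on the padded frames, and for a backward-direction LSTM the padding is processed before the real frames, so masking the loss alone may not be enough there.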
Disclosure: I work on Google Cloud (but not directly in/on TensorFlow).
Glad to see Frank Seide (and others I guess?) are still working on that!
> Speed. CNTK is in general much faster than TensorFlow, and it can be 5-10x faster on recurrent networks.
This is because CNTK uses the CuDNN LSTM kernel, while the TensorFlow side of the comparison probably did not use TensorFlow's CuDNN LSTM kernel. See here for some more details: https://news.ycombinator.com/item?id=14473234
> Accuracy. CNTK can be used to train deep learning models with state-of-the-art accuracy.
As can all other frameworks.
> API design. CNTK has a very powerful C++ API, and it also has both low-level and easy to use high-level Python APIs that are designed with a functional programming paradigm.
TensorFlow also has a C++ API.
> Scalability. CNTK can be easily scaled over thousands of GPUs.
> Inference. CNTK has C#/.NET/Java inference support that makes it easy to integrate CNTK evaluation into user applications.
TensorFlow also has many bindings for other languages.
> Extensibility. CNTK can be easily extended from Python for layers and learners.
TensorFlow can very easily be extended. I did that a lot in our framework (https://github.com/rwth-i6/returnn).
> Built-in readers. CNTK has efficient built in data readers that also support distributed learning.
Just like TensorFlow.
> Identical internal and external toolkit. You would not be compromised in any way because the same toolkit is used by internal product groups at Microsoft.
Ok, maybe here they are better in some sense, although I am not sure that the Google internal version of TensorFlow differs so much. As far as I know, it just has some stuff added for their data centers, for TPU, etc.
Also, the licence of 1-bit SGD in CNTK is strange. I'm not sure what the current state of this is. Last time I looked, you could not really use it.
I don't want to downplay CNTK. I really like it. I think it's great that they have really well-working CuDNN LSTM wrappers in their default LSTM implementation, and it's better than all the other wrappers (see here: https://stackoverflow.com/questions/41461670/cudnnrnnforward...). The team behind CNTK is really strong. So thank you and congratulations on releasing CNTK 2.0!
Does it mean I can integrate deep learning models into Android applications as well?