Microsoft Cognitive Toolkit 2.0

minimaxir · on June 2, 2017

The important feature is Keras compatibility (although it doesn't seem to be in the official repo yet: https://github.com/fchollet/keras/pull/6800)

In terms of "how is CNTK better than TensorFlow," CNTK trains 5-10x faster at minimum on pre-2.0 LSTM benchmarks, which is big since LSTMs are used for a lot nowadays. (https://arxiv.org/abs/1608.07249)

albertzeyer · on June 2, 2017

This is because they use the CuDNN LSTM kernel in CNTK but chose not to use the CuDNN LSTM kernel in TensorFlow, which might be a bit unfair because they could have used it. The same goes for Torch. See here for more details: https://news.ycombinator.com/item?id=14473234

justnikos · on June 2, 2017

While it is true that CNTK can use CuDNN LSTM, if the recurrence does not fall into the 4 recurrences that CuDNN supports, CNTK is still much faster. The simplest way to verify this is to take a Keras script that uses whatever recurrent network you want and run it (on a GPU) with the TensorFlow backend and with the CNTK backend. Some anecdotal evidence suggests an easy 3x speedup.

Disclaimer: I work at Microsoft.
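
For anyone who wants to try the comparison described above, here is a minimal sketch, not a rigorous benchmark. It assumes a multi-backend Keras install that picks its backend from the `KERAS_BACKEND` environment variable; the helper name `run_with_backend` and the script name `train_rnn.py` are hypothetical and only there for illustration.

```python
import os
import subprocess
import time


def run_with_backend(command, backend):
    """Run `command` with KERAS_BACKEND set; return (wall-clock seconds, exit code)."""
    # Copy the current environment and override only the backend selection.
    env = dict(os.environ, KERAS_BACKEND=backend)
    start = time.perf_counter()
    result = subprocess.run(command, env=env)
    return time.perf_counter() - start, result.returncode


# Hypothetical usage, assuming train_rnn.py is your own Keras training script:
#   for backend in ("tensorflow", "cntk"):
#       seconds, code = run_with_backend(["python", "train_rnn.py"], backend)
#       print(backend, round(seconds, 1), "s, exit code", code)
```

Wall-clock time measured this way includes process startup and library import, so it is only meaningful for runs that train long enough to make that overhead negligible.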

moron4hire · on June 3, 2017

Why would that be "a bit unfair"? It sounds like the TensorFlow team just hasn't been able to make as many or as good of optimizations as the CNTK team. If they could have, but chose not to do the same things, what tradeoff is there to be able to say it's "a bit unfair"?

dgacmu · on June 3, 2017

The binding of the word "they" is unclear in the comment you're replying to.

The "they" who chose not to use the cuDNN bindings were the authors of the benchmark. Some of the Torch folks filed a bug with the HKbench folks for the same error, but with respect to Torch: https://github.com/hclhkbu/dlbench/issues/14

moron4hire · on June 3, 2017

thanks, that makes more sense

ronjouch · on June 2, 2017

Also, "Reasons to Switch from TensorFlow to CNTK": https://docs.microsoft.com/en-us/cognitive-toolkit/reasons-t...

thr0waway1239 · on June 2, 2017

[flagged]

eob · on June 2, 2017

As a former ML researcher, I read your quoted text and came away with the exact opposite conclusion. There is an incredible need as a community to work toward reproducible results.

Pull a researcher aside, and I bet you every one will have at least one experience of trying to reproduce some reported--sometimes even lauded--result but being unable to.

The good news is this kind of cultural fix is really invariant of compute platform or modeling framework. It's just nice that MSFT appears to be making it a priority.

cscurmudgeon · on June 2, 2017

I have to agree with your comment fully. Read the comment above and went WTF.

KirinDave · on June 2, 2017

Uhh...

Classic hacker news right here.

I'm fairly sure you don't realize what a major problem reproducible results is for the scientific and ML community, but it's big. A commitment to this and transparency is a pretty big deal.

75dvtwin · on June 2, 2017

I did not know about the situation below with TensorFlow. MS contributions to OSS, at least in this instance, appear a lot more transparent and less self-centered than Google's.

"...It was made very clear from the first day of TensorFlow’s announcement, that Google created two TensorFlow versions: a public version and an internal version. As a TensorFlow user, one either must tolerate the slow speed of the public version, or pay to run the TensorFlow job on Google’s cloud. ..."

dgacmu · on June 2, 2017

This is baloney. In fact, it's offensive baloney.

There is one TensorFlow. The differences between using TF internally and externally have primarily to do with which RPC bindings it uses (the external one uses gRPC, which is open-source, and the internal one uses the internal RPC framework, which is tied in with all of the internal cluster stuff and authentication and whatnot), and things like filesystems that only exist in Google. The other difference is that there are linkages to use TPUs, instead of just GPUs, which is hardware that doesn't exist outside of Google. The final differences are just in how the BUILD files link against library files -- the external version downloads protobuf for you, the internal version assumes it's there to use. yadda yadda yadda.

You can see all of this in the code. It leaks out in places, such as:

https://github.com/tensorflow/tensorflow/blob/d0d975f8c3330b...

Yes, it's that _super secret_ use of a different integral_types.h header. (/sarcasm). If you look through for things like PLATFORM_GOOGLE in the defines, you'll see a lot of the things that differ, and they're incredibly boring. The core of TensorFlow performance-related stuff is Eigen (or, thanks to Intel's recent contributions, Intel MKL) for executing Tensor ops on CPU, or cuDNN for executing Tensor ops on GPU. Just like every other freakin' framework out there. There's a reason that all of these things tend to reduce to the performance of cuDNN...

See also Pete Warden's article: https://www.oreilly.com/ideas/how-the-tensorflow-team-handle...

("we use almost exactly the same code base inside Google that we make available on GitHub").

(Source: I'm a part-time hanger-on on the Brain team, which develops TensorFlow. I'm also a Carnegie Mellon professor most of the time, and I despise marketing getting in the way of truth.)

wishallbest · on June 3, 2017

Scalability is part of TensorFlow's claimed advantages. If someone adopts TF on their own cluster, would they get the same scalability story as marketed?

Disclaimer: I work at Microsoft.

dgacmu · on June 3, 2017

Martin already replied, but to provide a bit more detail, the benchmark results published at: https://www.tensorflow.org/performance/benchmarks

are generated using GCP, AWS, and an NVidia DGX-1, all using exactly the capabilities any ordinary user has on those platforms. The K80 distributed training results are AWS.

There's also a very useful set of suggestions for how to tune TensorFlow for best performance, and scripts that repeat the benchmarking results: https://www.tensorflow.org/performance/

I see that since my comment, Microsoft has updated the claims in the cited page. It's still not true that there are two versions, but I'm glad you're trying to provide more detail. I'd like to stick a big [citation needed] on the claim that the internal version is much faster.

At the time Mu Li did his performance analysis of MXNet vs TensorFlow, we hypothesized that gRPC overhead was one of the reasons that MXNet was showing better scaling numbers than TF. That turns out to not have been very correct - there were several things that the TF team identified that closed the scalability gap to a pretty narrow degree around the 1.0 release. I don't feel confident that gRPC is much of an impediment to scalability. (I'm also not saying that it isn't -- just that I don't think there's a lot of evidence one way or another).

I'd love it if the CNTK team or someone else were to publish high-quality, head-to-head scalability numbers using the best practices and scripts identified in the TensorFlow performance guide, and using the equivalent CNTK best practices. It benefits everyone when Microsoft and Google work hard to out-do each other. :) (And throw in MXNet as well, with Amazon's best guidance.)

wishallbest · on June 3, 2017

Thanks for the clarification. gRPC is slow. We have in-house experiments showing that, on RDMA-capable networks, an optimized implementation can achieve a significant speedup over gRPC. And I bet Google's internal version is even faster.

MXNet has a highly efficient network stack that's open source; Caffe2 uses gloo, which is open source; CNTK primarily uses Open MPI, NCCL and soon NCCL 2.0. I think it's fair that Google also open source the internal network stack because it is the key to scaling.

Most convolutional networks are not a stress test for scaling because the model size/computation ratio is too low. Use a speech model that has many fully connected layers, or VGG16/19, and the communication cost will dominate; that's when CNTK's 1-bit SGD and Block Momentum really shine.

Again, I work at Microsoft.
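
To put rough numbers on why communication cost matters for large models, here is a back-of-the-envelope sketch. The parameter count and link bandwidth are hypothetical values chosen for illustration, and the calculation ignores all-reduce topology, overlap of communication with computation, and any per-message overhead.

```python
def gradient_bytes(num_params, bits_per_value=32):
    """Bytes needed to send one gradient value per parameter."""
    return num_params * bits_per_value / 8


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized time to push num_bytes over a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical setup: 100M parameters, 10 Gbit/s link (1.25e9 bytes/s).
params = 100_000_000
link = 1.25e9
for bits in (32, 1):
    size = gradient_bytes(params, bits)
    print(f"{bits}-bit gradients: {size:.0f} bytes, {transfer_seconds(size, link):.3f} s per exchange")
```

Under these assumptions, going from 32 bits to 1 bit per value shrinks each gradient exchange by a factor of 32, which only pays off when the exchange is a large share of the step time.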

dgacmu · on June 3, 2017

Publish those results? It'd be very interesting to see. And, it sounds like you think there are benchmarks missing from the existing common set of things people are measuring -- what's a very specific network you'd like to see added to the mix? VGG16 doesn't fall into my radar of "modern and applicable" in the days of ResNet.

Using NCCL is great; TF now supports it, as of about a month and a half ago (though I don't know how tightly integrated it is): https://github.com/tensorflow/tensorflow/blob/master/tensorf...

From the benchmarks available, and not knowing what your in-house experiments show, I don't believe that the "internal network stack" is key to scaling. The scalability numbers shown on tensorflow.org/performance are very reasonable: From 902 images/sec to 1783 (1.97x) going from 32->64 K80 GPUs on Amazon for Inception v3, and 565->981 (1.7x) for ResNet-152. I'd love to be proved wrong.

That 1.7x scaling on ResNet-152 would be a great point of comparison, for example. From my student Hyeontaek's results, I actually suspect that there are scheduling improvements that could make up some of that difference, not networking improvements.
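
The scaling figures quoted above can be rechecked with a few lines of arithmetic. This sketch only uses the throughput numbers from the comment (images/sec at 32 and 64 GPUs); the helper name `scaling` is made up for illustration.

```python
def scaling(throughput_small, throughput_large, gpus_small, gpus_large):
    """Return (speedup, fraction of ideal linear scaling)."""
    speedup = throughput_large / throughput_small
    ideal = gpus_large / gpus_small
    return speedup, speedup / ideal


# Throughputs in images/sec as quoted in the comment, 32 -> 64 K80 GPUs.
for name, small, large in (("Inception v3", 902, 1783), ("ResNet", 565, 981)):
    speedup, efficiency = scaling(small, large, 32, 64)
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} of linear")
```

Doubling the GPU count would ideally double throughput, so these work out to roughly 99% and 87% of linear scaling for the two models.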

As I'm sure you know, of course, and are just fishing for, the reason that code links against gRPC externally is because trying to extract Google's internal networking code from the full internal software codebase would be ridiculous. I think it's far more likely to see the other direction, with everything settling on gRPC -- gRPC is actually newer, and in general, more feature-ful, than Stubby: https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Int...

ma2rten · on June 3, 2017

Yes.

Disclaimer: I work at Google.

sja · on June 2, 2017

I think that part is misleading. Vijay Vasudevan (a member of the TensorFlow team) has repeatedly put down the notion that the internal TensorFlow code is significantly different than what we see:

https://www.reddit.com/r/MachineLearning/comments/696dzy/d_i...

Obviously, this is taking the word of someone who is incentivized to get as many people using TF as possible, but I haven't been given a reason to not believe him.

I believe that the public TensorFlow repo is very close to what they use internally. That said, I'm sure there is a huge amount of internal tooling (for easily spinning up clusters internally, profiling, probably an automatic device placer) that we don't get to see. But that has more to do with the fact that it's designed with Google's specific infrastructure in mind than it has to do with "hoarding the good stuff".

vrv · on June 3, 2017

Thanks for mentioning that Sam, I appreciate it. Speaking for just myself:

I personally don't care what tools or frameworks people use to get work done and have repeatedly suggested people use whatever works best for them.

It also wouldn't make any sense to hoard any good stuff internally if we wanted to provide a useful framework that people wanted to use externally.

We actually don't have a huge amount of tooling internally that we hold back. The only tooling I really use is the [timeline](https://github.com/tensorflow/tensorflow/blob/f488419cd6d925...) for debugging performance (not the EEG tool in the whitepaper, I've never used it). That's available externally though, as you can see.

On some of the comments in this thread in general, I'm pretty sad at the lack of scientific rigor in the community, and that goes for any person who publishes code that differs from the results they claim, regardless of affiliation. I am happy about projects like OpenAI's RL baselines, as I know others are too.

Most of the papers and articles comparing performance of frameworks have lots of bugs and aren't even comparing the same model computation between frameworks. In fact, CNTK's article points to an external benchmark showing CNTK in a good light, but those benchmarks have bugs in them rendering the comparison incorrect (we've been sending PRs to fix them). I find it disappointing that the culture of the organization promotes calling others out for being 'irresponsible' except when it suits them.

The TensorFlow team hasn't published many articles comparing performance directly to others because it's honestly a lot of hard work to verify that you are comparing fairly. Of course, I do think TensorFlow needs to improve performance out of the box for people, and the team is working on that.

sja · on June 3, 2017

Thanks for the response, Vijay. I didn't mean to insinuate that I thought the team was holding back tooling for the public (rather that such tooling wouldn't make sense to release), but it's reassuring to hear that the total TensorFlow experience is pretty much the same internally and externally.

I think part of the conspiracy theorizing is due to the misconceived notion that Google has some "secret sauce" that allows it to do what it does, as opposed to many talented engineers spending a lot of man-hours on a problem. There has also been a fair amount of negative Google sentiment in the community recently, and the story that Google is holding out on developers feeds into this narrative.

Benchmarking has always been low-hanging fruit for community members to latch onto for the sake of attacking/defending a particular framework. However, the practical differences between these frameworks (assuming each is configured properly) seem to be within a margin of error and are constantly changing (not to mention the inconsistencies you mentioned), so choosing a framework solely on its benchmarking scores is narrow-minded.

From what I've seen, benchmarking has been more useful as a discovery mechanism for areas in a codebase that can be improved. The TensorFlow team has done an excellent job of using various benchmarks to guide development, and I imagine other frameworks are doing the same.

75dvtwin · on June 3, 2017

Thx for all the follow-ups and clarifications. I am not a user of either toolkit yet, but I am following their progress. Perhaps submitting a bug to ask MS to change the above misinformation is appropriate.

kumarvvr · on June 3, 2017

I am new to Machine Learning, but fairly confident with programming.

I have an Electrical Engineering degree and am good at math (math is a passion for me).

I am extremely interested in ML and would like to start by doing stuff, rather than theoretical aspects.

With the material so far that I have read on ML, there seems to be a huge number of variables governing the outcome of a particular method / algorithm (no. of data points, no. of learning iterations, etc)

If I had to pick a toolkit to get started with ML, is this a good one? (I am aware of scikit-learn, TensorFlow, etc.)

If this is the one, what book/books can I keep as a reference while working with the toolkit?

I usually select a project, work out the human-machine interaction (UI, backend stuff, etc) on a functional level and then select a stack for implementing the project. I also change the functional aspects of my original design if the stack I have selected offers some commonly used functions.

My initial project is to develop a machine learning system that can detect various QR codes in an image and get their contents.

albertzeyer · on June 3, 2017

On what level do you want to develop and understand the system? If you work directly with Theano/TensorFlow/CNTK/MXNet, you are pretty low-level. You more or less write down the formulas from papers / books, you let the framework take the gradient of some loss, and you take care of everything, like updating the parameters according to some update rule / optimization method like SGD. See some of the tutorials of those frameworks and just decide on your own what you prefer.
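
To make the low-level workflow concrete, here is a tiny sketch in plain Python rather than any of the frameworks named above. It fits a single weight `w` so that `w * x` matches `y`. The gradient of the mean squared error is written out by hand here, which is the step a framework would do for you automatically, and the update uses the full batch on every step for simplicity rather than true stochastic minibatches. The function names are made up for illustration.

```python
def mse_loss_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w, derived by hand."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(w, xs, ys, lr=0.1, steps=50):
    """Apply the update rule w <- w - lr * gradient for a fixed number of steps."""
    for _ in range(steps):
        w -= lr * mse_loss_grad(w, xs, ys)
    return w


# Toy data generated from y = 2 * x, so the fitted weight should approach 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(gradient_descent(0.0, xs, ys))
```

With a real framework you would write only the loss and let it derive the gradient, but the parameter update loop looks much the same.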

If you want to go more high-level, use something like Keras. You define your network structure as a series of layer types, or maybe in other ways, and it does most of the logic for you, and has already implemented most of the commonly used Deep Learning techniques. So you concentrate more on the network structure, on what techniques you want to use, etc. Keras actually supports several backends such as Theano and TensorFlow, and CNTK support is a work in progress. As a user, you won't notice much difference, except that maybe one backend is faster than the other or does not support some specific functionality.

kumarvvr · on June 3, 2017

Thanks a lot for your suggestion. Never heard of keras, will check it out.

Since I have a pre-set project at hand, I want to use the system first, so as per your suggestion, I'll go with keras.

However, I want to understand the system at a deeper level, purely as a curiosity.

johnsmith21006 · on June 3, 2017

Would recommend using TensorFlow instead. TF already has 59k stars on GitHub, and it is going to be easier to find answers to questions and find tutorials, books, etc. Would say TF is already close to being the canonical ML framework. Think MS was just too late.

phren0logy · on June 2, 2017

Wow, adding Keras support is really slick! It's nice to be able to use Keras with a few different back ends.

iraphael · on June 2, 2017

And they also have builtin support for 1-bit SGD. Compresses a model "down to 1 bit per weight" [0]. Seems to be a general technique for model compression, but it's nice for deployment to have it built in. This also doesn't seem to be a new addition to CNTK, just something I didn't know before.

[0] https://www.microsoft.com/en-us/research/publication/1-bit-s...

justnikos · on June 2, 2017

There are actually two 1-bit things going on with CNTK 2.0.

One is the 1-bit SGD that has long been in CNTK and has been criticized for its weird license. I am not a lawyer, but my understanding is it says something like you cannot use this unless you call it from inside CNTK. The terms are not that bad and you don't have to use 1-bit SGD if you don't like them.

CNTK 2.0 has another 1-bit thing going on as well which is binary convolution. This uses the Halide compiler to generate code that is 10x faster than optimized 32-bit convolution. This still seems to be at a proof of concept stage.

Disclaimer: I work at Microsoft.
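
For readers unfamiliar with the idea behind 1-bit SGD: as I understand the paper linked above, it is the gradients exchanged between workers that get quantized to one bit per value, with the quantization error carried over to the next minibatch (error feedback). The sketch below is a simplified illustration in plain Python, not CNTK's implementation. Using the mean of the non-negative and the mean of the negative entries as the two reconstruction levels is an assumption made for this example, and the function names are hypothetical.

```python
def one_bit_quantize(grad, residual):
    """Quantize (grad + residual) to one bit per value with error feedback.

    Returns (bits, (low, high), new_residual).
    """
    # Add back the quantization error left over from the previous step.
    corrected = [g + r for g, r in zip(grad, residual)]
    bits = [v >= 0.0 for v in corrected]
    pos = [v for v in corrected if v >= 0.0]
    neg = [v for v in corrected if v < 0.0]
    # Two reconstruction levels shared by all values (assumption of this sketch).
    high = sum(pos) / len(pos) if pos else 0.0
    low = sum(neg) / len(neg) if neg else 0.0
    decoded = one_bit_decode(bits, (low, high))
    # Whatever the one-bit code could not represent is kept for the next step.
    new_residual = [v - d for v, d in zip(corrected, decoded)]
    return bits, (low, high), new_residual


def one_bit_decode(bits, levels):
    """Rebuild approximate gradient values from the bits and the two levels."""
    low, high = levels
    return [high if b else low for b in bits]


grad = [0.5, -0.25, 1.5, -0.75]
bits, levels, residual = one_bit_quantize(grad, [0.0] * len(grad))
print(bits, levels, residual)
```

Only the bits and the two levels would need to be sent over the network; the residual stays local to the worker and is added to the next gradient before quantizing again.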

minimaxir · on June 2, 2017

1-bit SGD is cutting-edge deep learning tech. (so cutting edge it has a different license than CNTK itself: https://docs.microsoft.com/en-us/cognitive-toolkit/CNTK-1bit...)

Permit · on June 2, 2017

Is this common? Do libraries such as Tensorflow occasionally license portions of themselves only for non-commercial use?

I ask because when CNTK was first shared it was also under a non-commercial license. It seemed to subsequently drop off the radar for many people (eg. It wasn't mentioned in Stanford's ConvNet course or in Udacity's Machine Learning course).

rabidsnail · on June 2, 2017

"How is this better than TensorFlow" is the new "How is this better than Hadoop".

I wonder how their distributed training setup compares to MXNet's ps-lite (in terms of performance and licensing).

cromulen · on June 2, 2017

They refer to this benchmark in the blog post - http://dlbench.comp.hkbu.edu.hk/

There is also the v7 benchmark done on a lot more hardware and where tensorflow fares a bit better - http://dlbench.comp.hkbu.edu.hk/?v=v7

Does anyone know whether TF had a performance regression between v0.11 and v1.0 or if it was just lucky on benchmark v7 and unlucky on v8?

Also, how does CNTK manage to be that much better than anyone else on LSTMs? Its ability to scale to bigger batch sizes is unreal. Order of magnitude faster than other frameworks.

albertzeyer · on June 2, 2017

CNTK uses the LSTM implementation by CuDNN in their official LSTM layer.

TensorFlow has multiple LSTM implementations, such as LSTMCell, BasicLSTMCell, LSTMBlockCell, and also one wrapper for CuDNN, and maybe more. I'm quite confident that in this benchmark, for TensorFlow, they did not use the CuDNN wrapper, which is a bit unfair I would say. The CuDNN wrapper in TensorFlow does not support sequences of different lengths, but you could overcome this by just ignoring the non-used frames. See here for some more details:

https://stackoverflow.com/questions/41461670/cudnnrnnforward...
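
The "ignore the non-used frames" workaround mentioned above usually comes down to padding every sequence to a common length and masking the padded frames out of the loss. Here is a framework-free sketch of that idea in plain Python; it does not use the TensorFlow or CuDNN APIs, and the function names are made up for illustration.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad sequences to the longest length; return (padded, mask).

    mask holds 1.0 for real frames and 0.0 for padding.
    """
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in sequences]
    return padded, mask


def masked_mean(per_frame_values, mask):
    """Average per-frame values over real frames only, ignoring padding."""
    total = sum(
        v * m
        for row_v, row_m in zip(per_frame_values, mask)
        for v, m in zip(row_v, row_m)
    )
    count = sum(m for row in mask for m in row)
    return total / count


# Two sequences of different lengths; the values stand in for per-frame losses.
padded, mask = pad_batch([[1.0, 2.0, 3.0], [4.0]])
print(padded, mask, masked_mean(padded, mask))
```

The fixed-length kernel still does the work for the padded frames, so this trades some wasted computation for being able to use the faster kernel.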

Note that you could also provide your own LSTM kernel for TensorFlow, which is what we do in our framework, and then you can get really fast, although our benchmarks are a bit outdated.

https://github.com/rwth-i6/returnn

sayanpa · on June 2, 2017

CNTK's roots are in speech-type data, which inherently has a notion of time. The architecture of the toolkit supports efficient recurrence from the ground up. Also, the toolkit focuses on handling large production-scale data workloads, which implies additional engineering efficiencies built into the toolkit. I am a Microsoft employee.

verdverm · on June 2, 2017

Love the pure (realistic?) linear scaling of "projected" performance in the graph

cbasoglu · on June 2, 2017

The only reason for projection here was the newness of the Volta HW (announced 2 weeks ago). The linear scaling is proven on Pascal and Maxwell HW and due to the communication library and algorithms like Block Momentum. Note: I am a MSFT employee.

projectorlochsa · on June 2, 2017

They use http://halide-lang.org/ for some convolutions. Very interesting.

boulos · on June 3, 2017

Cool! It's too bad about the cuDNN comparison bit people mention (as with dgacmu, I contend that all frameworks now devolve into the low level kernel used), but I'm both impressed by the release and appreciate all the Microsoft folks clearly disclaiming their affiliation.

Disclosure: I work on Google Cloud (but not directly in/on TensorFlow).

maga · on June 2, 2017

The C++ first API design is what appeals to me most. Not enough to switch from Tensorflow, though.

seanmcdirmid · on June 2, 2017

> After training a model using either Python or BrainScript, Cognitive Toolkit had always provided many ways to evaluate the model in either Python, BrainScript, or C#

Glad to see Frank Seide (and others I guess?) are still working on that!

albertzeyer · on June 2, 2017

I think their comparison is a bit unfair.

> Speed. CNTK is in general much faster than TensorFlow, and it can be 5-10x faster on recurrent networks.

This is because they use the CuDNN LSTM kernel but for the TensorFlow comparison they probably did not use the CuDNN LSTM kernel in TensorFlow. See here for some more details: https://news.ycombinator.com/item?id=14473234

> Accuracy. CNTK can be used to train deep learning models with state-of-the-art accuracy.

As can all the other frameworks.

> API design. CNTK has a very powerful C++ API, and it also has both low-level and easy to use high-level Python APIs that are designed with a functional programming paradigm.

TensorFlow also has a C++ API.

> Scalability. CNTK can be easily scaled over thousands of GPUs.

Like TensorFlow.

> Inference. CNTK has C#/.NET/Java inference support that makes it easy to integrate CNTK evaluation into user applications.

TensorFlow also has many bindings for other languages.

> Extensibility. CNTK can be easily extended from Python for layers and learners.

TensorFlow can very easily be extended. I did that a lot in our framework (https://github.com/rwth-i6/returnn).

> Built-in readers. CNTK has efficient built in data readers that also support distributed learning.

Just like TensorFlow.

> Identical internal and external toolkit. You would not be compromised in any way because the same toolkit is used by internal product groups at Microsoft.

Ok, maybe here they are better in some sense, although I am not sure that the Google internal version of TensorFlow differs so much. As far as I know, it just has some stuff added for their data centers, for TPU, etc.

Also, the licence of 1-bit SGD is strange in CNTK. Not sure what the state about this is. Last time you could not really use that.

I don't want to downplay CNTK. I really like it. I think it's great that they have really good working CuDNN LSTM wrappers in their default LSTM implementation, and it's better than all the other wrappers (see here: https://stackoverflow.com/questions/41461670/cudnnrnnforward...). The team behind CNTK is really strong. So thank you and congratulations for releasing CNTK 2.0!

tarlinian · on June 3, 2017

The CNTK C++ API is much more straightforward to use than TF's (especially on Windows) and is much better suited to deploying serialized models into production as a part of existing C++ applications. The provided TF examples require way more boilerplate than an equivalent CNTK application.

namelezz · on June 2, 2017

> Java Bindings

Does it mean I can integrate deep learning models into Android applications as well?

it_learnses · on June 2, 2017

I am a .NET developer, and would like to learn it, but I'm not sure how. Any suggestions on where I can access a dataset and the computing power?

mtw · on June 2, 2017

I suggest following a machine learning / deep learning course first, then learning this.

it_learnses · on June 2, 2017

I actually did take a data mining course a couple years ago in uni, and also at the time I was following Andrew Ng's course. TBH, I learn better by working on a project. Otherwise I lose interest midway.

sayanpa · on June 2, 2017

If a course is of interest, we will be launching a deep learning course shortly that would help bridge the gaps. Stay tuned. I am a Microsoft employee.

mcintyre1994 · on June 2, 2017

Will this be based on Azure notebooks or otherwise cloud based + free?

sayanpa · on June 2, 2017

We are working through the details. It will follow the standard MOOC like Coursera, Edx, Udacity. You should be able to use the Azure Notebook (that is the goal) but there may be small caveats. Stay tuned as we work through the details.

mcintyre1994 · on June 2, 2017

This sounds awesome, thank you for doing this! Is there anywhere I can give my email to get updates/a date I can put a reminder in my calendar to check this out?

mtgx · on June 2, 2017

I wonder if Tensorflow 2.0 is about to drop soon, too, with Caffe and Cognitive Toolkit now reaching version 2. I also assume it's going to be a lot more than a version change, too.