
Microsoft Cognitive Toolkit 2.0 - ronjouch
https://www.microsoft.com/en-us/cognitive-toolkit/blog/2017/06/microsofts-high-performance-open-source-deep-learning-toolkit-now-generally-available/
======
minimaxir
The important feature is Keras compatibility (although it doesn't seem to be
in the official repo yet:
[https://github.com/fchollet/keras/pull/6800](https://github.com/fchollet/keras/pull/6800))

In terms of "how is CNTK better than TensorFlow," CNTK trains 5-10x faster _at
minimum_ on pre-2.0 LSTM benchmarks, which is big since LSTMs are used for a
lot nowadays.
([https://arxiv.org/abs/1608.07249](https://arxiv.org/abs/1608.07249))

~~~
albertzeyer
This is because they use the CuDNN LSTM kernel in CNTK and in TensorFlow they
chose to not use the CuDNN LSTM kernel which might be a bit unfair because
they could have used it. The same for Torch. See here for more details:
[https://news.ycombinator.com/item?id=14473234](https://news.ycombinator.com/item?id=14473234)

~~~
moron4hire
Why would that be "a bit unfair"? It sounds like the TensorFlow team just
hasn't been able to make as many or as good of optimizations as the CNTK team.
If they could have, but chose not to do the same things, what tradeoff is
there to be able to say it's "a bit unfair"?

~~~
dgacmu
The binding of the word "they" is unclear in the comment you're replying to.

The "they" who chose not to use the cuDNN bindings were the authors of the
benchmark. Some of the Torch folks filed a bug with the HKbench folks for the
same error, but with respect to Torch:
[https://github.com/hclhkbu/dlbench/issues/14](https://github.com/hclhkbu/dlbench/issues/14)

~~~
moron4hire
thanks, that makes more sense

------
ronjouch
Also, "Reasons to Switch from TensorFlow to CNTK":
[https://docs.microsoft.com/en-us/cognitive-toolkit/reasons-t...](https://docs.microsoft.com/en-us/cognitive-toolkit/reasons-to-switch-from-tensorflow-to-cntk)

~~~
75dvtwin
I did not know about the below situation with TensorFlow. MS contributions
to OSS, at least in this instance, appear a lot more transparent and less
self-centered compared to Google's:

"...It was made very clear from the first day of TensorFlow’s announcement,
that Google created two TensorFlow versions: a public version and an internal
version. As a TensorFlow user, one either must tolerate the slow speed of the
public version, or pay to run the TensorFlow job on Google’s cloud. ..."

~~~
dgacmu
This is baloney. In fact, it's _offensive_ baloney.

There is _one_ TensorFlow. The differences between using TF internally and
externally have primarily to do with which RPC bindings it uses (the external
one uses gRPC, which is open-source, and the internal one uses the internal
RPC framework, which is tied in with all of the internal cluster stuff and
authentication and whatnot), and things like filesystems that only exist in
Google. The other difference is that there are linkages to use TPUs, instead
of just GPUs, which is hardware that doesn't exist outside of Google. The
final differences are just in how the BUILD files link against library files
-- the external version downloads protobuf for you, the internal version
assumes it's there to use. yadda yadda yadda.

You can see all of this in the code. It leaks out in places, such as:

[https://github.com/tensorflow/tensorflow/blob/d0d975f8c3330b...](https://github.com/tensorflow/tensorflow/blob/d0d975f8c3330b5402263b2356b038bc8af919a2/tensorflow/core/platform/types.h)

Yes, it's that _super secret_ use of a different integral_types.h header.
(/sarcasm). If you look through for things like PLATFORM_GOOGLE in the
defines, you'll see a lot of the things that differ, and they're incredibly
boring. The core of TensorFlow performance-related stuff is Eigen (or, thanks
to Intel's recent contributions, Intel MKL) for executing Tensor ops on CPU,
or cuDNN for executing Tensor ops on GPU. Just like every other freakin'
framework out there. There's a reason that all of these things tend to reduce
to the performance of cuDNN...

See also Pete Warden's article: [https://www.oreilly.com/ideas/how-the-tensorflow-team-handle...](https://www.oreilly.com/ideas/how-the-tensorflow-team-handles-open-source-support)

("we use almost exactly the same code base inside Google that we make
available on GitHub").

(Source: I'm a part-time hanger-on on the Brain team, which develops
TensorFlow. I'm also a Carnegie Mellon professor most of the time, and I
despise marketing getting in the way of truth.)

~~~
wishallbest
Scalability is part of TensorFlow's claimed advantages. If someone adopts TF
on their own cluster, would they get the same scalability story as marketed?

Disclaimer: I work at Microsoft.

~~~
dgacmu
Martin already replied, but to provide a bit more detail, the benchmark
results published at:
[https://www.tensorflow.org/performance/benchmarks](https://www.tensorflow.org/performance/benchmarks)

are generated using GCP, AWS, and an NVidia DGX-1, all using exactly the
capabilities any ordinary user has on those platforms. The K80 distributed
training results are AWS.

There's also a very useful set of suggestions for how to tune TensorFlow for
best performance, along with scripts that reproduce the benchmarking results:
[https://www.tensorflow.org/performance/](https://www.tensorflow.org/performance/)

I see that since my comment, Microsoft has updated the claims in the cited
page. It's still not true that there are two versions, but I'm glad you're
trying to provide more detail. I'd like to stick a big [citation needed] on
the claim that the internal version is much faster.

At the time Mu Li did his performance analysis of MXnet vs Tensorflow, we
_hypothesized_ that gRPC overhead was one of the reasons that MXnet was
showing better scaling numbers than TF. That turned out not to be quite
right -- there were several things the TF team identified around the 1.0
release that narrowed the scalability gap considerably. I don't feel
confident that gRPC is much of an impediment to scalability. (I'm also not
saying that it _isn't_ -- just that I don't think there's a lot of evidence
one way or the other.)

I'd love it if the CNTK team or someone else were to publish high-quality,
head-to-head scalability numbers using the best practices and scripts
identified in the TensorFlow performance guide, and using the equivalent CNTK
best practices. It benefits everyone when Microsoft and Google work hard to
out-do each other. :) (And throw in MXNet as well, with Amazon's best
guidance.)

~~~
wishallbest
Thanks for the clarification. gRPC is slow. We have in-house experiments
showing that, on RDMA-capable networks, an optimized implementation can
achieve a significant speedup over gRPC. And I bet Google's internal version
is even faster.

MXNet has a highly efficient network stack that's open source; Caffe2 uses
gloo, which is open source; CNTK primarily uses Open MPI, NCCL and soon
NCCL 2.0. I think it's only fair that Google also open-source its internal
network stack, because it is the key to scaling.

Most convolutional networks are not a stress test for scaling because their
model-size-to-computation ratio is too low. Use a speech model with many
fully connected layers, or VGG16/19, and the communication cost will
dominate; that's when CNTK's 1-bit SGD and Block Momentum really shine.
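As a rough back-of-the-envelope on that model-size-to-computation point, here is a sketch using approximate, widely quoted figures (my numbers, not from this thread; treat them as order-of-magnitude only):

```python
# Approximate, commonly cited figures: parameter count (millions) and
# per-image forward-pass GFLOPs. Order-of-magnitude numbers only.
models = {
    "VGG16":     {"params_m": 138.0, "gflops": 15.5},
    "ResNet-50": {"params_m": 25.6,  "gflops": 4.1},
}

ratios = {}
for name, m in models.items():
    # Data-parallel SGD exchanges one gradient value per parameter each
    # step, so params/FLOPs is a crude communication-to-computation proxy.
    ratios[name] = (m["params_m"] * 1e6) / (m["gflops"] * 1e9)

for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.4f} gradient values per forward FLOP")
```

VGG16's ratio comes out several times higher than ResNet-50's, which is why it stresses the network far more than modern convnets do.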

Again, I work at Microsoft.

~~~
dgacmu
Publish those results? It'd be very interesting to see. And, it sounds like
you think there are benchmarks missing from the existing common set of things
people are measuring -- what's a very specific network you'd like to see added
to the mix? VGG16 isn't on my radar as "modern and applicable" in the days
of ResNet.

Using NCCL is great; TF now supports it, as of about a month and a half ago
(though I don't know how tightly integrated it is):
[https://github.com/tensorflow/tensorflow/blob/master/tensorf...](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/nccl/python/ops/nccl_ops.py)

From the benchmarks available, and not knowing what your in-house experiments
show, I don't believe that the "internal network stack" is key to scaling. The
scalability numbers shown on tensorflow.org/performance are very reasonable:
From 902 images/sec to 1783 (1.97x) going from 32->64 K80 GPUs on Amazon for
Inception v3, and 565->981 (1.7x) for ResNet-512. I'd love to be proved wrong.
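Those speedup figures are simple throughput ratios; for clarity, the arithmetic behind them (using the numbers quoted above):

```python
def scaling(before, after, gpu_factor=2.0):
    """Throughput speedup and parallel efficiency when scaling up GPUs."""
    speedup = after / before
    return speedup, speedup / gpu_factor

# Figures quoted above for 32 -> 64 K80 GPUs on AWS (images/sec):
inception = scaling(902, 1783)   # roughly 1.98x speedup, ~99% efficiency
resnet = scaling(565, 981)       # roughly 1.74x speedup, ~87% efficiency
print(inception, resnet)
```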

That 1.7x scaling on ResNet-512 would be a great point of comparison, for
example. From my student Hyeontaek's results, I actually suspect that there
are scheduling improvements that could make up some of that difference, _not_
networking improvements.

As I'm sure you know, of course, and are just fishing for, the reason that
code links against gRPC externally is because trying to extract Google's
internal networking code from the full internal software codebase would be
ridiculous. I think it's far more likely to see the other direction, with
everything settling on gRPC -- gRPC is actually newer, and in general, more
feature-ful, than Stubby: [https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Int...](https://cloudplatform.googleblog.com/2016/08/gRPC-a-true-Internet-scale-RPC-framework-is-now-1-and-ready-for-production-deployments.html)

------
kumarvvr
I am new to Machine Learning, but fairly confident with programming.

I have an Electrical Engineering degree and am good at math ( Math is a
passion for me).

I am extremely interested in ML and would like to start by doing stuff, rather
than theoretical aspects.

With the material so far that I have read on ML, there seems to be a huge
number of variables governing the outcome of a particular method / algorithm
(no. of data points, no. of learning iterations, etc)

If I had to pick a toolkit to get started with ML, is this a good one? (I am
aware of scikit-learn, TensorFlow, etc.)

If this is the one, what book/books can I keep as a reference while working
with the toolkit?

I usually select a project, work out the human-machine interaction (UI,
backend stuff, etc) on a functional level and then select a stack for
implementing the project. I also change the functional aspects of my original
design if the stack I have selected offers some commonly used functions.

My initial project is to develop a machine learning system that can detect
various QR codes in an image and get their contents.

~~~
albertzeyer
On what level do you want to develop and understand the system? If you work
directly with Theano/TensorFlow/CNTK/MXNet, you are pretty low-level. You
more or less write down the formulas from papers / books, you let the
framework take the gradient of some loss, and you take care of everything
else, like updating the parameters according to some update rule /
optimization method such as SGD. See some of the tutorials of those
frameworks and decide on your own what you prefer.
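To make concrete what "taking care of everything yourself" looks like, here is a minimal NumPy sketch of a manual training loop -- hand-derived gradients plus the plain SGD update rule. (A framework like TensorFlow or CNTK would derive the gradients for you; this just shows the shape of the loop.)

```python
import numpy as np

# Toy data: y = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=100)

w, b = 0.0, 0.0   # model parameters
lr = 0.1          # SGD learning rate

for step in range(200):
    pred = w * X[:, 0] + b
    err = pred - y
    # Gradients of the mean-squared-error loss, written out by hand;
    # a framework would compute these via automatic differentiation.
    grad_w = 2.0 * np.mean(err * X[:, 0])
    grad_b = 2.0 * np.mean(err)
    # The plain SGD update rule.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches 3.0 and 1.0
```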

If you want to go more high-level, use something like Keras. You define your
network structure as a series of layer types (or in other ways), and it does
most of the logic for you and has already implemented most of the commonly
used Deep Learning techniques. So you concentrate more on the network
structure and on what techniques you want to use. Keras also supports
several backends, such as Theano and TensorFlow (with CNTK support a work in
progress), although as a user you won't notice much difference, except that
one backend may be faster than another or may not support some specific
functionality.
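The high-level idea -- a network as just a stack of layer objects -- can be sketched in a few lines of plain NumPy. (This only mimics the shape of a Keras-style API for illustration; it is not real Keras code.)

```python
import numpy as np

class Dense:
    """One fully connected layer: y = relu(x @ W + b)."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return np.maximum(0.0, x @ self.W + self.b)

class Sequential:
    """Keras-style container: the network is just a list of layers."""
    def __init__(self, layers):
        self.layers = layers
    def predict(self, x):
        for layer in self.layers:   # feed the output of each layer
            x = layer(x)            # into the next one
        return x

rng = np.random.default_rng(0)
model = Sequential([Dense(4, 8, rng), Dense(8, 2, rng)])
out = model.predict(rng.normal(size=(3, 4)))
print(out.shape)  # (3, 2)
```

In a real high-level framework the container would also own the backward pass, the optimizer, and the training loop; that is exactly the "logic it does for you."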

~~~
kumarvvr
Thanks a lot for your suggestion. Never heard of Keras, will check it out.

Since I have a pre-set project at hand, I want to use the system first, so as
per your suggestion, I'll go with Keras.

However, I want to understand the system at a deeper level, purely as a
curiosity.

------
phren0logy
Wow, adding Keras support is really slick! It's nice to be able to use Keras
with a few different back ends.

~~~
iraphael
And they also have built-in support for 1-bit SGD. Compresses a model "down to
1 bit per weight" [0]. Seems to be a general technique for model compression,
but it's nice for deployment to have it built in. This also doesn't seem to be
a new addition to CNTK, just something I didn't know before.

[0] [https://www.microsoft.com/en-us/research/publication/1-bit-s...](https://www.microsoft.com/en-us/research/publication/1-bit-stochastic-gradient-descent-and-application-to-data-parallel-distributed-training-of-speech-dnns/)
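Per the linked paper's title, the 1-bit technique quantizes the gradients exchanged during data-parallel training rather than the stored model. Here is a simplified NumPy sketch of the general idea -- sign quantization plus the error-feedback residual -- not the exact quantizer from the paper or CNTK's actual implementation:

```python
import numpy as np

def one_bit_step(grad, residual):
    """Quantize a gradient to 1 bit per value, with error feedback.

    Each value is transmitted as just its sign (times one shared
    scale), and the quantization error is kept locally and folded
    into the next step's gradient -- the error-feedback trick that
    keeps 1-bit SGD from diverging.
    """
    g = grad + residual             # fold in the previous step's error
    scale = np.mean(np.abs(g))      # one shared magnitude per tensor
    quantized = np.sign(g) * scale  # what actually goes over the wire
    return quantized, g - quantized # new local residual

rng = np.random.default_rng(0)
grad = rng.normal(size=5)
quantized, residual = one_bit_step(grad, np.zeros(5))
print(quantized)  # only two distinct values: +scale and -scale
```

Note that `quantized + residual` reconstructs the original gradient exactly, which is why no information is permanently lost across steps.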

~~~
minimaxir
1-bit SGD is cutting-edge deep learning tech (so cutting-edge it has a
different license than CNTK itself: [https://docs.microsoft.com/en-us/cognitive-toolkit/CNTK-1bit...](https://docs.microsoft.com/en-us/cognitive-toolkit/CNTK-1bit-SGD-License))

~~~
Permit
Is this common? Do libraries such as TensorFlow occasionally license portions
of themselves only for non-commercial use?

I ask because when CNTK was first shared, it was also under a non-commercial
license. It seemed to subsequently drop off the radar for many people (e.g.
it wasn't mentioned in Stanford's ConvNet course or in Udacity's Machine
Learning course).

------
rabidsnail
"How is this better than TensorFlow" is the new "How is this better than
Hadoop".

I wonder how their distributed training setup compares to MXNet's ps-lite (in
terms of performance and licensing).

------
cromulen
They refer to this benchmark in the blog post -
[http://dlbench.comp.hkbu.edu.hk/](http://dlbench.comp.hkbu.edu.hk/)

There is also the v7 benchmark, done on a lot more hardware, where
TensorFlow fares a bit better -
[http://dlbench.comp.hkbu.edu.hk/?v=v7](http://dlbench.comp.hkbu.edu.hk/?v=v7)

Does anyone know whether TF had a performance regression between v0.11 and
v1.0 or if it was just lucky on benchmark v7 and unlucky on v8?

Also, how does CNTK manage to be that much better than anyone else on LSTMs?
Its ability to scale to bigger batch sizes is unreal -- an order of magnitude
faster than other frameworks.

~~~
albertzeyer
CNTK uses the LSTM implementation by CuDNN in their official LSTM layer.

TensorFlow has multiple LSTM implementations, such as LSTMCell, BasicLSTMCell,
LSTMBlockCell, and also a wrapper for CuDNN, and maybe more. I'm quite
confident that in this benchmark they did not use the CuDNN wrapper for
TensorFlow, which is a bit unfair, I would say. The CuDNN wrapper in
TensorFlow does not support sequences of different lengths, but you can
overcome this by simply ignoring the unused frames. See here for some more
details:

[https://stackoverflow.com/questions/41461670/cudnnrnnforward...](https://stackoverflow.com/questions/41461670/cudnnrnnforwardtraining-seqlength-xdesc-usage)
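The "ignore the unused frames" workaround amounts to masking the padded positions out of the loss. A hypothetical NumPy sketch (shapes and numbers are made up for illustration):

```python
import numpy as np

# A batch of 3 sequences padded to length 5; seq_lens are the true lengths.
seq_lens = np.array([5, 3, 2])
T = 5
# Per-frame losses as an RNN would produce them, shape (batch, time).
frame_loss = np.arange(15, dtype=float).reshape(3, T)

# Boolean mask: True for real frames, False for padding.
mask = np.arange(T)[None, :] < seq_lens[:, None]   # shape (3, 5)

# Zero out the padded frames and average only over the real ones.
loss = (frame_loss * mask).sum() / mask.sum()
print(loss)  # 4.9
```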

Note that you can also provide your own LSTM kernel for TensorFlow, which is
what we do in our framework, and then it can be really fast, although our
benchmarks are a bit outdated.

[https://github.com/rwth-i6/returnn](https://github.com/rwth-i6/returnn)

------
verdverm
Love the pure (realistic?) linear scaling of "projected" performance in the
graph.

~~~
cbasoglu
The only reason for projection here was the newness of the Volta HW
(announced 2 weeks ago). The linear scaling is proven on Pascal and Maxwell
HW, thanks to the communication library and algorithms like Block Momentum.
Note: I am a MSFT employee.

------
projectorlochsa
They use [http://halide-lang.org/](http://halide-lang.org/) for some
convolutions. Very interesting.

------
boulos
Cool! It's too bad about the cuDNN comparison bit people mention (as with
dgacmu, I contend that all frameworks now devolve into the low level kernel
used), but I'm both impressed by the release _and_ appreciate all the
Microsoft folks clearly disclaiming their affiliation.

Disclosure: I work on Google Cloud (but not directly in/on TensorFlow).

------
maga
The C++-first API design is what appeals to me most. Not enough to switch
from TensorFlow, though.

------
seanmcdirmid
> After training a model using either Python or _BrainScript_ , Cognitive
> Toolkit had always provided many ways to evaluate the model in either
> Python, BrainScript, or C#

Glad to see Frank Seide (and others I guess?) are still working on that!

------
albertzeyer
I think their comparison is a bit unfair.

> Speed. CNTK is in general much faster than TensorFlow, and it can be 5-10x
> faster on recurrent networks.

This is because they use the CuDNN LSTM kernel but for the TensorFlow
comparison they probably did not use the CuDNN LSTM kernel in TensorFlow. See
here for some more details:
[https://news.ycombinator.com/item?id=14473234](https://news.ycombinator.com/item?id=14473234)

> Accuracy. CNTK can be used to train deep learning models with state-of-the-
> art accuracy.

As well as all other frameworks can do.

> API design. CNTK has a very powerful C++ API, and it also has both low-level
> and easy to use high-level Python APIs that are designed with a functional
> programming paradigm.

TensorFlow also has a C++ API.

> Scalability. CNTK can be easily scaled over thousands of GPUs.

Like TensorFlow.

> Inference. CNTK has C#/.NET/Java inference support that makes it easy to
> integrate CNTK evaluation into user applications.

TensorFlow also has many bindings for other languages.

> Extensibility. CNTK can be easily extended from Python for layers and
> learners.

TensorFlow can very easily be extended. I did that a lot in our framework
([https://github.com/rwth-i6/returnn](https://github.com/rwth-i6/returnn)).

> Built-in readers. CNTK has efficient built in data readers that also support
> distributed learning.

Just like TensorFlow.

> Identical internal and external toolkit. You would not be compromised in any
> way because the same toolkit is used by internal product groups at
> Microsoft.

Ok, maybe here they are better in some sense, although I am not sure that the
Google internal version of TensorFlow differs so much. As far as I know, it
just has some stuff added for their data centers, for TPU, etc.

Also, the license of 1-bit SGD in CNTK is strange. I'm not sure what the
current state is; last time I checked, you could not really use it.

I don't want to downplay CNTK. I really like it. I think it's great that they
have really good working CuDNN LSTM wrappers in their default LSTM
implementation, and it's better than all the other wrappers (see here:
[https://stackoverflow.com/questions/41461670/cudnnrnnforward...](https://stackoverflow.com/questions/41461670/cudnnrnnforwardtraining-seqlength-xdesc-usage)).
The team behind CNTK is really strong. So thank you and congratulations on
releasing CNTK 2.0!

~~~
tarlinian
The CNTK C++ API is much more straightforward to use than TF's (especially
on Windows) and is much better suited to deploying serialized models into
production as part of existing C++ applications. The TF examples provided
require far more boilerplate than an equivalent CNTK application.

------
namelezz
> Java Bindings

Does it mean I can integrate deep learning models into Android applications as
well?

------
it_learnses
I am a .NET developer and would like to learn it, but I'm not sure how. Any
suggestions on where I can access a dataset and the computing power?

~~~
mtw
I suggest following a machine learning / deep learning course first, then
learning this.

~~~
it_learnses
I actually did take a data mining course a couple years ago in uni, and also
at the time I was following Andrew Ng's course. TBH, I learn better by working
on a project. Otherwise I lose interest midway.

~~~
sayanpa
If a course is of interest, we will be launching a deep learning course
shortly that would help bridge the gaps. Stay tuned. I am a Microsoft
employee.

~~~
mcintyre1994
Will this be based on Azure notebooks or otherwise cloud based + free?

~~~
sayanpa
We are working through the details. It will follow the standard MOOC format,
like Coursera, edX or Udacity. You should be able to use Azure Notebooks
(that is the goal), but there may be small caveats. Stay tuned as we work
through the details.

~~~
mcintyre1994
This sounds awesome, thank you for doing this! Is there anywhere I can give
my email to get updates, or a date I can put as a reminder in my calendar to
check this out?

------
mtgx
I wonder if TensorFlow 2.0 is about to drop soon too, with Caffe and
Cognitive Toolkit now reaching version 2. I assume it's going to be a lot
more than a version-number change.

