
Benchmarking CNTK on Keras: Is It Better at Deep Learning Than TensorFlow? - minimaxir
http://minimaxir.com/2017/06/keras-cntk/
======
sirfz
This comment [1] in TensorFlow's discussion group gives some valid points
explaining some of the crazy numbers reported for CNTK.

[1]
[https://groups.google.com/a/tensorflow.org/d/msg/discuss/Dhy...](https://groups.google.com/a/tensorflow.org/d/msg/discuss/Dhy9MseSXQI/naoy_EElBAAJ)

~~~
aub3bhat
Thanks, great to see TF recognizing the issue with pipelines/queues and
simplifying them. I wish they did the NCHW conversion automatically by
detecting the GPU.
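
For reference, a minimal sketch of what the manual layout choice looks like in
the TF 1.x-era API (data_format is a real argument; the shapes here are just
illustrative):

    # Convolutions accept a data_format argument; NCHW ("channels_first")
    # is usually faster on NVIDIA GPUs with cuDNN, but the user has to
    # request it rather than TF detecting the GPU.
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 3, 224, 224])  # NCHW layout
    y = tf.layers.conv2d(x, filters=64, kernel_size=3,
                         data_format='channels_first')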

I think TensorFlow is developed in a very principled manner, focusing on the
ability to add more ops and platforms in the future. While this has hurt TF's
usability in the short term, I am bullish on its future.

------
option
PyTorch is also much faster than Tensorflow on LSTMs
[http://deeplearningathome.com/2017/06/PyTorch-vs-
Tensorflow-...](http://deeplearningathome.com/2017/06/PyTorch-vs-Tensorflow-
lstm-language-model.html). There is also a reason for that.

~~~
1024core
> There is also a reason for that

Don't be shy. Do share what that reason is.

~~~
IanCal
From the linked article, to help others like me who mostly scan the comments

> PyTorch LSTM network is faster because, by default, it uses cuDNN’s LSTM
> implementation which fuses layers, steps and point-wise operations. See
> blog-post on this here.

> Tensorflow’s RNNs (in r1.2), by default, do not use cuDNN’s RNN, and their
> ‘call’ function describes only one time-step of computation, hence a lot of
> optimization opportunities are lost. On the flip side, though, this gives
> the user much more flexibility, provided that the user knows what he is doing.
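
For the curious, a minimal sketch of the fused path (assuming PyTorch >= 0.4
with CUDA available; the shapes are illustrative):

    # nn.LSTM on CUDA tensors dispatches to cuDNN's fused kernel,
    # running all layers and time steps in a single call, instead of
    # describing one time step of computation at a time.
    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2).cuda()
    x = torch.randn(35, 20, 128).cuda()   # (seq_len, batch, features)
    out, (h, c) = lstm(x)                 # one fused cuDNN call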

------
Analemma_
How much of this is solely due to 1-bit gradient descent? It's a genuine
breakthrough and MSR (or whoever came up with it) deserves all the credit,
but, assuming it's not patented, I imagine Google will add it to TF sooner or
later, and that will close most of the gap.

EDIT: My bad, I did not see at the end of the article that 1-bit SGD is not
enabled on Keras yet, so the performance wins are coming from somewhere else.
Neato.
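
(For anyone unfamiliar: the core trick in 1-bit SGD is quantizing each
gradient to one bit per value, its sign times a scale, while feeding the
quantization error back into the next step. A toy sketch of that idea follows;
it is not CNTK's actual implementation, and the single per-tensor scale is an
assumption:)

    import numpy as np

    def one_bit_sgd_step(grad, residual):
        # Add the quantization error carried over from the previous step.
        adjusted = grad + residual
        # Quantize to one bit per weight: sign times a shared scale.
        scale = np.mean(np.abs(adjusted))
        quantized = np.where(adjusted >= 0.0, scale, -scale)
        # Keep the new quantization error for the next step (error feedback).
        new_residual = adjusted - quantized
        return quantized, new_residual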

~~~
minimaxir
The container specifically uses the non-1-bit-SGD version of CNTK, just to be
sure, after I learned that 1-bit SGD does not work here.

MSFT employees talk about LSTM gains on the original submission:
[https://news.ycombinator.com/item?id=14473255](https://news.ycombinator.com/item?id=14473255)

------
smortaz
If you haven't tried CNTK and want to get an overview w/o installing anything,
try:

[https://notebooks.azure.com/cntk/libraries/tutorials](https://notebooks.azure.com/cntk/libraries/tutorials)

Cheers.

------
ntenenz
Assessing accuracy? If you're running the same architecture with the same
training regime, why would you expect differing accuracy numbers (other than a
bug of some sort)? Seems a bit strange to include.

~~~
minimaxir
From the CNTK vs. TensorFlow page: [https://docs.microsoft.com/en-
us/cognitive-toolkit/reasons-t...](https://docs.microsoft.com/en-us/cognitive-
toolkit/reasons-to-switch-from-tensorflow-to-cntk)

> TensorFlow shared the training script for Inception V3, and offered pre-
> trained models to download. However, it is difficult to retrain the model
> and achieve the same accuracy, because that requires additional
> understanding of details such as data pre-processing and augmentation. The
> best accuracy that was achieved by a third party (Keras in this case) is
> about 0.6% worse than what the original paper reported. Researchers in the
> CNTK team worked hard and were able to train a CNTK Inception V3 model with
> 5.972% top-5 error, even better than the original paper reported!

This suggests that an improvement in accuracy is _possible_ by switching to
CNTK, which is why I included an accuracy metric from both frameworks (and
also for sanity checking, as you note).

~~~
benjismith
No, if you re-read the first sentence there, it says that the different
results are attributed to differences in "pre-processing and augmentation".
The choice of NN framework is essentially irrelevant.

~~~
dgacmu
Note, though, that the preprocessing and augmentation are (at least in TF) done
within the framework itself. I helped debug the pure-TensorFlow version of the
Inception input pipeline, and getting it to match the earlier DistBelief
version was agonizing -- it really shows all of the differences (and bugs) in
the image processing ops. And there can be subtle effects -- differences in
which image resizing algorithm you use, for example.

But it's worth noting that this code is all released:

[https://github.com/tensorflow/models/blob/master/inception/i...](https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L198)

It may be hard to replicate that across all platforms, though -- as an
example, the distortions include using four different image resizing
algorithms.
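
As a concrete illustration (TF 1.x-era API; the shapes are arbitrary), two of
those resizing methods already disagree pixel-for-pixel:

    # Different interpolation methods yield different pixels, so a
    # reimplementation that picks a different resize can silently
    # shift the final accuracy.
    import tensorflow as tf

    img = tf.random_uniform([1, 300, 300, 3])
    a = tf.image.resize_images(img, [224, 224],
                               method=tf.image.ResizeMethod.BILINEAR)
    b = tf.image.resize_images(img, [224, 224],
                               method=tf.image.ResizeMethod.BICUBIC)
    max_diff = tf.reduce_max(tf.abs(a - b))   # nonzero in general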

Some of it was _true_ preprocessing, i.e., cleaning up the imagenet data. I
wrote a bit about that here: [https://da-data.blogspot.com/2016/02/cleaning-
imagenet-datas...](https://da-data.blogspot.com/2016/02/cleaning-imagenet-
dataset-collected.html)

(tl;dr - there are some invalid images and bboxes, etc., and some papers chose
to deal with the "blacklisted" images differently.)

------
johnsmith21006
The thing is, TensorFlow is racking up stars on GitHub at 5x the rate of CNTK.
It has all the momentum, and it's hard to see MS being able to slow it down.

------
eggie5
Although it's fine for your relative benchmarking, did you notice whether
Docker introduces noticeable overhead to training time vs. non-Docker?

~~~
minimaxir
From my testing before starting the benchmarks, no. I also searched around the
internet beforehand, and others mention that there is no noticeable overhead.

~~~
eggie5
It would be interesting to quantify this.
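
One rough way to do that (a toy sketch, assuming Keras is installed: run the
identical script inside and outside the container and compare wall times):

    import time
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Small synthetic workload; a real benchmark should use the actual
    # training job, but the comparison method is the same.
    x = np.random.rand(10000, 100)
    y = np.random.randint(0, 2, (10000, 1))
    model = Sequential([Dense(64, activation='relu', input_dim=100),
                        Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    start = time.time()
    model.fit(x, y, epochs=3, verbose=0)
    print('wall time: %.1fs' % (time.time() - start))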

I did go through the rigmarole of building TensorFlow for GPU training in GCE
(building TF, installing NVIDIA drivers, cuDNN, etc.), and Docker would
definitely have been a boon! It would also be nice if GCE had an image
marketplace like AWS.

------
ipunchghosts
I wonder if this will start a flame war with fchollet.

~~~
minimaxir
In the PR he is very supportive of the addition of CNTK:
[https://github.com/fchollet/keras/pull/6800](https://github.com/fchollet/keras/pull/6800)
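
(For anyone wanting to try it: Keras picks its backend from
~/.keras/keras.json or the KERAS_BACKEND environment variable, so the same
model code runs on either framework. A minimal sketch:)

    import os

    # Select the backend before Keras is imported; 'tensorflow' and
    # 'cntk' are both valid values once that PR landed.
    os.environ['KERAS_BACKEND'] = 'cntk'
    import keras   # logs which backend it is using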

