
MXNet – Deep Learning Framework of Choice at AWS - werner
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html
======
cs702
Translation from corporatespeak: "We don't have an internally developed
framework that can compete with TensorFlow, which is controlled by Google, so
we are throwing our weight behind MXNet."

As others have commented here, there is no evidence that MXNet is _that_ much
better (or worse) than the other frameworks.

~~~
werner
Amazon has been building technology based on ML&DL for over 20 years and has
developed several frameworks. You must have missed the announcement of this
open source framework earlier in the year:
[https://github.com/amznlabs/amazon-dsstne](https://github.com/amznlabs/amazon-dsstne).

~~~
cs702
I saw that when it was announced. DSSTNE has failed to capture the hearts and
minds of developers. In my experience, it doesn't come up in any conversations
about which frameworks to bet on for new product development.

And I'm rooting for Amazon (and Facebook, and Microsoft...). TensorFlow needs
competition for the hearts and minds of developers.

------
fpgaminer
It seems more prevalent now than it used to be that frameworks/libraries are
being used as weapons in a sort of mindshare war between the world's
megacorps. Or perhaps I'm misremembering history. And I don't mean just AI;
just look at Angular (Google) vs. React (Facebook).

It's a bit of a double-edged sword. As developers, this war gives us free
access to well-funded and heavily developed tools. The world has been
fundamentally changed by their availability. But at the same time we need to
understand that the primary reason they exist is to lock developers into a
particular vendor. It's most transparent with Google's TensorFlow, where they
were obvious about their intentions to offer TensorFlow services on their
cloud platform.

This article, more than most, exemplifies their desperate attempts. For now it
seems to remain mostly that, desperate attempts, with the tools staying more
or less platform-agnostic. But I foresee a grim future where our best
libraries and tools are tied inextricably to a commercial ecosystem.

~~~
deepnotderp
Then utilize Torch.

~~~
mastazi
Isn't Torch actively supported by Facebook?
[https://research.facebook.com/research/torch/](https://research.facebook.com/research/torch/)

------
oneshot908
Using 3-year-old GPUs on a much deeper network than the other guys(tm) to
demonstrate awesome scaling efficiency == Intel-level FUD. Note also the
absence of the overall batch size.

Wonder what would happen to that scaling efficiency if those GPUs were P40s?

See also the absence of equivalent AlexNet numbers to further obscure attempts
at comparing this to the other guys(tm).

Can't wait for Intel's response to this.

~~~
piiswrong
Amazon probably used P2 because they want to advertise it. We can get almost
linear speedup on 10 8xM40 machines using MXNet. Batch size is increased
linearly with the number of machines, but empirically it doesn't hurt
convergence, at least on ImageNet.
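
As a back-of-the-envelope illustration of what "batch size is increased
linearly with the number of machines" means in this weak-scaling setup, here
is a small sketch; every concrete number in it (the per-GPU batch of 32, the
throughput figures) is made up for illustration and is not from the post or
this comment:

    # Weak scaling: the per-device batch stays fixed, so the global batch
    # grows linearly with the number of machines.
    per_gpu_batch = 32          # hypothetical per-GPU batch size
    gpus_per_machine = 8        # 8x M40 per machine, as in the comment
    machines = 10

    global_batch = per_gpu_batch * gpus_per_machine * machines
    print("effective global batch size:", global_batch)        # 2560

    # Scaling efficiency = measured throughput / (single-machine throughput * machines)
    single_machine_imgs_per_sec = 1000.0   # hypothetical baseline
    cluster_imgs_per_sec = 9300.0          # hypothetical "almost linear" result
    efficiency = cluster_imgs_per_sec / (single_machine_imgs_per_sec * machines)
    print("scaling efficiency: %.0f%%" % (efficiency * 100))   # 93%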

I mean, who cares about AlexNet anymore? It's 2016 already. It trains in under
2h on a single machine. Distributing it doesn't make much sense.

~~~
p1esk
2 hours to train Alexnet on a single machine? Link please.

~~~
piiswrong
[https://developer.nvidia.com/cudnn](https://developer.nvidia.com/cudnn) Alex
did it on 2x 580s in 2012. It took him 1 week. It's 60x faster now, even
compared to a K40.

------
deepnotderp
Okay, with all due respect, this is BS. I love MXNet and think it's
underappreciated as well. But pretty much its best feature is the memory
mirror (see oneshot908's comment).

------
imh
This reads weirdly. He talks about how MXNet is the best choice without
comparing it to other frameworks. That's the whole point of choosing between
things. I'm sure they did the legwork to make this decision, and some insight
into that choice might help others follow. Without that, my distrust radar is
blinking.

------
AlexCoventry
From the OP:

> a Deep Learning AMI, which comes pre-installed with the popular open source
> deep learning frameworks mentioned earlier; GPU-acceleration through CUDA
> drivers which are already installed, pre-configured, and ready to rock

You might want to clarify that the negative reviews [0] are from earlier
versions which did not include the CUDA drivers. I recently considered this
AMI and rejected it for a class [1] because of these reviews.

[0] [https://aws.amazon.com/marketplace/reviews/product-reviews?a...](https://aws.amazon.com/marketplace/reviews/product-reviews?asin=B01M0AXXQB)

[1] [https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...](https://www.meetup.com/Cambridge-Artificial-Intelligence-Meetup/events/235496478/)

~~~
mli
The deep learning AMI now has both CUDA and cuDNN installed.

------
eva1984
> we have concluded that MXNet is the most scalable framework

Without backing that up with any benchmarks? This claim is lazy.

------
bsfjgngdnxy
> MXNet can consume as little as 4 GB of memory when serving _deep networks with as many as 1000 layers_.

So perhaps I'm not well-versed enough in deep learning, but does this mean
that they solved the vanishing gradient problem? How are they managing to do
this?

~~~
ogrisel
For deep convnets the vanishing gradient problem can mostly be solved by
using residual architectures. See:
[https://arxiv.org/abs/1603.05027](https://arxiv.org/abs/1603.05027)

This is kind of related to solving the vanishing gradient issue in RNNs by
using additive recurrent architectures like LSTMs and GRUs.

Alternatively it's possible to use concatenative skip connections as in
DenseNets:
[https://arxiv.org/abs/1608.06993](https://arxiv.org/abs/1608.06993)

Still, using 1000 layers is useless in practice. State-of-the-art image
classification models are in the range of 30-100 layers, with residual
connections and varying numbers of channels per layer depending on the depth,
so as to keep a tractable total number of trainable parameters. The 1000-layer
nets are just interesting as a memory scalability benchmark for DL frameworks
and to empirically validate the feasibility of the optimization problem, but
they are of no practical use otherwise (as far as I know).
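
For readers who haven't seen the two kinds of skip connections, here is a
minimal numpy sketch of the difference described above: a residual block adds
its transformation back onto the input, while a DenseNet-style block
concatenates the two. The `transform` function is a toy stand-in for a real
conv/BN/ReLU stack:

    import numpy as np

    def transform(x):
        # Stand-in for a conv/BN/ReLU stack; any differentiable function works here.
        return np.maximum(0.0, 0.5 * x + 0.1)

    x = np.random.randn(4, 16)   # a batch of 4 feature vectors with 16 channels

    # Residual (additive) skip connection, as in ResNets (arXiv:1603.05027):
    # the identity path lets gradients flow straight through the addition.
    residual_out = x + transform(x)                          # shape stays (4, 16)

    # Concatenative skip connection, as in DenseNets (arXiv:1608.06993):
    # the input is kept alongside the new features, so the channel count grows.
    dense_out = np.concatenate([x, transform(x)], axis=1)    # shape (4, 32)

    print(residual_out.shape, dense_out.shape)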

~~~
bsfjgngdnxy
Thank you!

------
mrdrozdov
Did not realize you could use MXNet declaratively (like TensorFlow/Theano) and
imperatively (like Torch/Chainer). Can anyone speak more about their
imperative usage of MXNet?

~~~
billconan
Does "declaratively" mean the use of expression templates in C++?

I learned about them last week; I don't see too much benefit if the goal is
good performance.

~~~
ogrisel
No, it means writing a program that defines the structure of a computation
graph lazily (without executing the nodes when defining the model) so as to
reuse that compute graph in a later step of the program.

The computation graph is an in-memory data structure that can be introspected
by the program itself at runtime so as to do symbolic operations (e.g. compute
the gradient of one node in the graph with respect to any ancestor input
node).

Theano implements this in pure Python and can generate C or CUDA code from
string templates (in Python). TensorFlow has a Python API to assemble pre-
built operators which are mainly written in C++ and use the Eigen linear
algebra library.

~~~
billconan
"defines structure of a computation graph lazily (without executing the nodes
when defining the model)"

But this sounds exactly like expression templates.

~~~
ogrisel
But neither of the declarative DL toolkits (Theano & TensorFlow) uses that C++
language feature: the computation graph is typically defined by writing a
Python script that assembles building blocks dynamically at runtime.

Once the graph is defined, it can be passed along with concrete values for the
input nodes to the runtime framework to execute the section of the graph of
interest (possibly with code generation + compilation).
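
A toy illustration of the distinction, in plain Python rather than any real
framework's API: the first part only builds a graph of node objects (no
arithmetic happens), and only the second part walks that graph with concrete
input values. Real toolkits (Theano, TensorFlow, MXNet's symbolic API) do the
same thing with far more machinery, including introspecting the graph to build
gradient expressions:

    # Declarative style: constructing the graph does no arithmetic at all.
    class Node:
        def __init__(self, op, inputs=(), name=None):
            self.op, self.inputs, self.name = op, inputs, name
        def __add__(self, other): return Node('add', (self, other))
        def __mul__(self, other): return Node('mul', (self, other))

    def var(name):
        return Node('input', name=name)

    def evaluate(node, feed):
        """Execute the lazily defined graph given concrete input values."""
        if node.op == 'input':
            return feed[node.name]
        a, b = (evaluate(i, feed) for i in node.inputs)
        return a + b if node.op == 'add' else a * b

    x, w = var('x'), var('w')
    y = x * w + x          # still just a data structure; nothing has run yet

    # Imperative style would instead compute x * w + x immediately on concrete
    # numbers, leaving no graph object to introspect afterwards.
    print(evaluate(y, {'x': 3.0, 'w': 2.0}))   # 9.0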

------
turingbook
Li Mu, the core developer behind MXNet, recently joined Amazon.

------
partycoder
[offtopic] I think presentations with ascending bar charts are sort of a
cliché.

------
egeozcan
> Machine learning (...) is being employed in a range of computing tasks where
> programming explicit algorithms is infeasible.

I found this comment interesting. Is this really the summary of what machine
learning is about?

~~~
Analog24
Image classification is a classic example of such a task. How exactly would
you go about writing an algorithm to tell the difference between a picture of
a cat and a picture of a dog?

~~~
samcodes
Well, this might be cheating, but I would apply a bunch of different filters
for things like edge detection, etc. Then I would come up with a statistical
model that, for each feature, gave the likelihood that the image under
consideration was a dog. Then I would aggregate all those results into a final
likelihood.

Not trying to be sarcastic, I just can't think of any way other than the ML
way.
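
Roughly what that hand-built pipeline could look like, as a toy numpy sketch:
the Sobel edge filter is a real hand-designed filter, but the two features and
the logistic "model" at the end are made up, and choosing weights by hand that
actually separate cats from dogs is exactly the part nobody knows how to do:

    import numpy as np

    def sobel_edges(img):
        # Classic hand-designed horizontal-gradient edge filter, applied naively.
        k = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        h, w = img.shape
        out = np.zeros((h - 2, w - 2))
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
        return np.abs(out)

    img = np.random.rand(64, 64)   # stand-in for a grayscale photo

    # Two crude hand-picked features: edge density and mean brightness.
    features = np.array([sobel_edges(img).mean(), img.mean()])

    # Hypothetical hand-tuned weights; picking these so they distinguish
    # cats from dogs is the part that is infeasible without learning them.
    weights, bias = np.array([2.0, -1.0]), 0.1
    p_dog = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
    print("P(dog) =", p_dog)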

~~~
Analog24
To further the point: what filters would you choose? What features could you
choose heuristically to distinguish between the two? They both have fur, they
both have four legs, they both have two eyes, they both come in a wide variety
of colors and patterns... Most dogs have an elongated snout, but not all of
them do (pugs, bulldogs, etc.).

I would be extremely impressed if someone developed an algorithm that could
accomplish this task without using any type of statistical/machine learning.

------
blahi
MXNet is the only deep learning framework that has proper support for R.
That's why I use it and it is pretty nice IMO.

~~~
fnl
Isn't TF available in R as of late, too, from the RStudio guys? Still
incomplete?

~~~
blahi
Atrocious syntax.

------
gnipgnip
Can someone please spell out for us muggles what sets these frameworks
(Theano, TensorFlow, Torch, CNTK, MXNet) apart? They all seem to be
essentially doing the same thing underneath.

~~~
politician
Cloud vendor feature signaling, mostly.

Microsoft wants you to use CNTK on Azure. Amazon wants you to use Mxnet on
AWS. Google wants you to use Tensorflow on GCP.

It's irrelevant whether these frameworks can be used outside their home
platform by broke college students. That's a red herring. The cloud vendors
are looking to sell enterprise contracts, and they need to check all of the
boxes.

This strategy makes complete sense from a business perspective, and you really
cannot fault them for doing it.

