
Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine - j_juggernaut
https://github.com/amznlabs/amazon-dsstne
======
waleedka
At a glance:

  - Only supports fully connected layers for now. No convnets or RNNs.
  - Requires a GPU. No option to run on CPU, not even for development.
  - Setup instructions for Ubuntu only. No Mac or Windows.
  - Uses JSON to define the network architecture, which limits what you can build.
  - Takes in data in NetCDF format only.
  - Very little documentation.
  - The name is bad. I'm not going to remember how to spell DSSTNE.
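For reference, a network definition looks roughly like the sketch below (abridged from the repo's examples; the exact field names are illustrative from memory, so check the docs before relying on them). Everything you can build has to fit this schema, which is where the expressiveness limit comes from:

```json
{
  "Version": 0.7,
  "Name": "Sparse Recommender",
  "Kind": "FeedForward",
  "ErrorFunction": "ScaledMarginalCrossEntropy",
  "Layers": [
    { "Name": "Input",  "Kind": "Input",  "N": "auto", "DataSet": "gl_input",  "Sparse": true },
    { "Name": "Hidden", "Kind": "Hidden", "Type": "FullyConnected", "N": 128, "Activation": "Sigmoid" },
    { "Name": "Output", "Kind": "Output", "Type": "FullyConnected", "N": "auto", "DataSet": "gl_output", "Activation": "Sigmoid", "Sparse": true }
  ]
}
```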

It seems like a very early proof of concept. I wouldn't expect it to be useful
to most people at this point. Built-in support for sparse vectors is
interesting, but not a strong selling point by itself. I hope Amazon continues
to develop it or, even better, contributes to one of the existing, more mature
frameworks.

~~~
scottlegrand
It's more than that, and it's in use in production at Amazon. 8 Titan X GPUs
can contain networks with up to 6 billion weights. As Geoffrey Hinton once
said:

"My belief is that we’re not going to get human-level abilities until we have
systems that have the same number of parameters in them as the brain."

And you're right that it's a specialized framework/engine. But IMO making it
more general purpose is a matter of cutting and pasting the right cuDNN code,
or we can double down on emphasizing sparse data. IMO, Amazon OSSed this
partly to see what people would want here.

~~~
jrapdx3
> "My belief is that we’re not going to get human-level abilities until we
> have systems that have the same number of parameters in them as the brain."

An interesting quote.

Replicating functioning of the brain, or some major subsystem of it, is no
doubt going to require far more than just billions of parameters. The cortex
contains >15 billion neurons, but there are also the neurons contained in all
the other brain structures. Furthermore, neurons connect via dense dendritic
trees, the human brain having on the order of 100 trillion synapses.

Adding to the complexity, neurons have numerous "communication ports",
including numerous pre- and postsynaptic neurotransmitter receptors and a
wide range of receptors for endocrine, immune-system and other types of
signals. Message propagation typically also involves a layer of complex
intracellular "second-messenger" transformations.

While it's highly probable that future NNs will be developed that do even more
amazing things than are now possible, I think the challenge of equaling what
real brains do is, to say the least, enormously daunting.

Somebody smarter than me could probably figure out the magnitude, how many
nodes or weights it takes for a NN to function like the brain, though I
imagine it will be a really impressive number.
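As a crude lower bound, assuming a single 32-bit weight could stand in for a synapse (which the receptor complexity above suggests is wildly optimistic), the arithmetic already gets impressive:

```python
# Back-of-the-envelope: memory to store one parameter per synapse.
# Crude assumption: one fp32 weight can stand in for one synapse.
synapses = 100e12          # ~100 trillion synapses in a human brain
bytes_per_param = 4        # fp32
total_bytes = synapses * bytes_per_param

terabytes = total_bytes / 1e12
print(terabytes)           # 400 TB of weights alone

# For scale: a 12 GB GPU holds ~3 billion fp32 parameters.
gpus_needed = synapses / (12e9 / bytes_per_param)
print(int(gpus_needed))    # tens of thousands of such GPUs, ignoring
                           # activations and optimizer state entirely
```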

Edit: typos

~~~
fauigerzigerk
_> Replicating functioning of the brain, or some major subsystem of it, is no
doubt going to require far more than just billions of parameters._

Maybe, but we shouldn't forget that computers do not suddenly lose their
capability to function as exact, deterministic, programmable machines just
because they happen to run an ANN.

What I mean is that there may be shortcuts to reduce the number of required
nodes dramatically.

If you take the state of an ANN after it was trained to perform some specific
task, you can ask the question whether there is a simpler function, i.e. one
with much fewer parameters, that approximates the learned function.

Sort of like a human with the Occam's razor gene. I think the fact that the
number of neurons does not correlate perfectly with intelligence in animals is
an indication that there is room for optimization.
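A toy illustration of the idea, with purely made-up numbers: take a "teacher" whose learned behavior happens to be nearly affine, and fit a two-parameter "student" to the teacher's outputs rather than to the original data.

```python
# Toy sketch of the "simpler function" idea: a function with many hidden
# parameters turns out to be well-approximated by a 2-parameter student.
# Names and numbers are purely illustrative.

def teacher(x):
    # Pretend this is a big trained net; its learned behavior here
    # happens to be (almost) affine: y = 3x + 1 plus tiny wiggles.
    return 3.0 * x + 1.0 + 0.001 * ((x * 7) % 1 - 0.5)

# "Distill": fit student y = a*x + b to the teacher's outputs by least squares.
xs = [i / 10.0 for i in range(-50, 51)]
ys = [teacher(x) for x in xs]
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# The 2-parameter student tracks the teacher closely everywhere sampled.
err = max(abs((a * x + b) - teacher(x)) for x in xs)
print(round(a, 2), round(b, 2), err < 0.01)
```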

~~~
scottlegrand
Absolutely 100% agree, but at the same time, I think we will ultimately need
to build and evaluate models that can span the memory of more than one
processor. I don't think a single GTX Titan X, GTX 1080 or even a server is
enough here.

Additionally, data parallelization broadly disallows these larger models
(yes, I know about send/receive nodes in TensorFlow, but they're not
general or automatic enough for researchers IMO), while ASGD makes horribly
inefficient use of the very limited bandwidth between processors. All IMO of
course. There are hacks and tricks here, but I think those should be late-
stage optimizations, not requirements to achieve scaling.

Finally, I'm a stickler for deterministic computation as someone who spent a
decade writing graphics drivers before joining the CUDA team in 2006, but
that's pretty much a "hear me now, believe me later" opinion of mine after
tracking down too many bizarro race conditions late into the night in that
former life :-). Of course, one person's race condition can sometimes be an
ANN's regularizer, but I digress.

I also agree we'll do some amazing things with far fewer neurons and weights
than an actual human brain, but I'll bet you good money we end up needing more
than 12GB to do it. AlphaGo alone was 200+ GPUs, right?

------
throwaway6497
Amazon is turning over a new leaf. They stopped publishing at major
conferences after their last significant paper, Dynamo.

My perception of Amazon is that they take everything from open-source but
don't actively give back. Amazon and open-source never went hand-in-hand.
Making their deep learning frameworks open-source is cool. Kudos to the team
which managed to do this. I am sure internally, it must have been a huge
struggle to get the approval from execs.

[Edit: Grammar]

~~~
throwaway6497
For a second, a thought crossed my mind that Amazon is actively trying to
change its external perception after the NY Times article and is trying to
cozy up to developers. I found this on Glassdoor. Apparently, it will take a
long time for them to make their culture less toxic.

===From Glassdoor===

Cons

====

The management process is abusive, and I'm currently a manager. I've seen too
much "behind the wall" and hate how our individual performers can be treated.
You are forced to ride people and stack rank employees...I've been forced to
give good employees bad overall ratings because of politics and stack ranking.
Advice to Management: Don't pretend that the recent NY Times article was all
about "isolated incidents". The culture IS abusive and it WILL backfire once
stock value starts to drop. I'm an 8 year veteran and I no longer recommend
former peers to interview with Amazon.

== [Edit: Formatted to make it clear what was pulled from Glassdoor]

~~~
eranation
I just joined AWS ProServ and I really don't see any of these things. Pretty
amazing team and one of the best work life balance I've seen in a tech company
so far. I have 4 other friends who work at AWS and all seem very happy so far.
I found the Glassdoor comment and it seems to be from an engineering manager.
I have a friend who manages one of the AWS products and he seems to be pretty
happy.

I just joined, so I really am not a statistically significant case, but so far
it's nowhere near what was in that NYT article.

Edit: I can't read apparently :) thanks heuving for clarifying and the
commenter for reformatting

~~~
inopinatus
I suspect an inverse survivorship bias in the public representation of the
company by ex-employees. Those of us with positive recollections tend to say
very little, and (in my case at least) that's due to respect for Amazon's
culture.

~~~
jsolson
Eh, I had a pretty good ride there myself. I believe every incident in that NY
Times article happened.

If you were at Amazon for any length of time and didn't notice the existence
of toxic teams and the random chance element of being hired into one of them,
you weren't paying attention.

------
ktamura
First TensorFlow and now this. Tensor is quickly becoming a mathematical-term-
that-sounds-familiar-to-developers-but-most-don't-know-what-it-is-actually.

Another example is topology =)

~~~
rdtsc
Another one is isomorphic. Anything that sounds sciency or mathy will be
adopted. There is no other way ;-)

~~~
ryanobjc
my new programming language has isomorphic tensors built in as a first class
language feature :-)

~~~
rdtsc
Sold!

I'll use it to write microservices for my new IoT application.

------
scottlegrand
Lead author of DSSTNE here...

1\. DSSTNE was designed two years ago specifically for product recommendations
from Amazon's catalog. At that time, there was no TensorFlow, only Theano and
Torch. DSSTNE differentiated itself from those two frameworks by optimizing for
sparse data and for neural networks spanning multiple GPUs. What it currently is
not is another framework for running AlexNet/VGG/GoogLeNet etc., but about 500
lines of code plus cuDNN could change that if the demand exists. Implementing
Krizhevsky's "one weird trick" is mostly trivial since the harder model-parallel
part has already been written.

2\. DSSTNE does not yet explicitly support RNNs, but it does have support for
shared weights, and that's more than enough to build an unrolled RNN. We tried
a few, in fact. cuDNN 5 can be used to add LSTM support in a couple hundred
lines of code. But since (I believe) the LSTM in cuDNN is a black box, it
cannot be spread across multiple GPUs. Not too hard to write from the ground
up, though.
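Roughly, the shared-weights trick looks like this in throwaway Python (scalar state purely for clarity; DSSTNE itself is C++/CUDA): the same parameters are reused at every timestep, so an engine that lets stacked layers point at one weight set can express the unrolled network directly.

```python
import math

# Minimal sketch of an unrolled RNN via shared weights: the parameters
# (w_in, w_rec, b) are reused at every timestep, so T timesteps become
# T stacked layers that all reference a single shared weight set.

def rnn_unrolled(inputs, w_in, w_rec, b):
    h = 0.0                      # initial hidden state (scalar for clarity)
    for x in inputs:             # one "layer" per timestep, shared weights
        h = math.tanh(w_in * x + w_rec * h + b)
    return h

h = rnn_unrolled([1.0, 0.5, -0.25], w_in=0.8, w_rec=0.3, b=0.1)
print(round(h, 4))               # final hidden state, bounded by tanh
```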

3\. There are a huge number of collaborators and people behind the scenes that
made this happen. I'd love to acknowledge them openly, but I'm not sure they
want their names known.

4\. Say what you want about Amazon, and they're not perfect, but they let us
build this from the ground up and now they have given it away. Google, which
hired me away from NVIDIA in 2011 (another one of those offers I couldn't
refuse), OTOH blind-allocated me into search and would not let me work with
GPUs, despite my being one of the founding members of NVIDIA's CUDA team,
because they had not yet seen them as useful. I didn't stay there long. DSSTNE
is 100% fresh code, warts and all, and I thank Amazon both for letting me work
on a project like this and for OSSing the code.

5\. NetCDF is a nice efficient format for big data files. What other formats
would you suggest we support here?

6\. I was boarding a plane when they finally released this. I will be
benchmarking it in the next few days. TLDR spoilers: near-perfect scaling for
hidden layers with 1000 or so hidden units per GPU in use, and effectively
free sparse input layers because both activation and weight gradient
calculation have custom sparse kernels.
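To sketch why sparse input layers come out effectively free (plain Python, illustrative only): with inputs stored as (index, value) pairs, the kernels only touch nonzero entries, so the work scales with the number of nonzeros rather than the input width.

```python
# Why sparse input layers are cheap: the input-to-hidden product over a
# sparse vector costs O(nnz * hidden) instead of O(n_in * hidden).

def dense_forward(x, W):                 # O(n_in * hidden)
    n_hidden = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_hidden)]

def sparse_forward(nz, W, n_hidden):     # O(nnz * hidden); nz = [(i, v), ...]
    out = [0.0] * n_hidden
    for i, v in nz:                      # touch only the nonzero inputs
        row = W[i]
        for j in range(n_hidden):
            out[j] += v * row[j]
    return out

# A 10,000-dim input with 3 nonzeros (e.g. a user who touched 3 items).
n_in, n_hidden = 10_000, 4
W = [[(i + j) % 5 * 0.1 for j in range(n_hidden)] for i in range(n_in)]
nz = [(7, 1.0), (42, 2.0), (9001, 0.5)]
x = [0.0] * n_in
for i, v in nz:
    x[i] = v

# Same answer, ~3 rows of work instead of 10,000.
assert dense_forward(x, W) == sparse_forward(nz, W, n_hidden)
```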

7\. The JSON format made sense in 2014, but IMO what this engine needs now is
a TensorFlow graph importer. Since the engine builds networks from a rather
simple underlying C struct, this isn't particularly hard, but it does require
supporting some additional functionality to be 100% compatible.

8\. I left Amazon 4 months ago after getting an offer I couldn't refuse. I was
the sole GPU coder on this project. I can count the number of people I'd trust
with an engine like this on two hands, and most of them are already building
deep learning engines elsewhere. I'm happy to add whatever functionality is
desired here. CNN and RNN support seem like two good first steps, and the spec
already accounts for this.

9\. Ditto for a Python interface, easily implemented IMO through the Python
C/C++ extension mechanism:
[https://docs.python.org/2/extending/extending.html](https://docs.python.org/2/extending/extending.html)
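For the impatient, ctypes is a lighter-weight route to the same effect without writing a full extension module. Sketch below; libc stands in for a hypothetical flat C API the engine could expose (something like `dsstne_train(...)` is pure speculation on my part):

```python
import ctypes
import ctypes.util

# Lighter-weight alternative to a C/C++ extension module: load a shared
# library and call its C functions directly via ctypes. libc stands in
# here for a hypothetical C API exported by the engine.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes marshals arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"DSSTNE"))  # calls into compiled C code: 6
```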

Anyway, it's late, and it's turned out to be a fantastic day to see the
project on which I spent nearly two years go OSS.

~~~
shoyer
Thanks for sharing your story!

Let me comment on file formats as someone familiar with both netCDF and deep
learning.

I agree that netCDF is a sane binary file format for this application. It's
designed for efficient serialization of large arrays of numbers. One downside
is that netCDF does not support streaming without writing the data to
intermediate files on disk.

Keep in mind that netCDF v4 is itself just a thin wrapper around HDF5. Given
that your input format is basically a custom file format written in netCDF, I
would have just used HDF5 directly. The API is about as convenient, and this
would skip one layer of indirection.

The native file format for TensorFlow is its own custom TFRecords file format,
but it also supports a number of other file formats. TFRecords is _much_
simpler technology than NetCDF/HDF5. It's basically just a bunch of serialized
protocol buffers [1]. About all you can do with a TFRecords file is pull out
examples -- it doesn't support the fancy multi-dimensional indexing or
hierarchical structure of netCDF/HDF5. But that's also most of what you need
for building machine learning models, and it's quite straightforward to
read/write them in a streaming fashion, which makes it a natural fit for
technologies like map-reduce.

[1]
[https://www.tensorflow.org/versions/r0.8/api_docs/python/python_io.html#tfrecords-format-details](https://www.tensorflow.org/versions/r0.8/api_docs/python/python_io.html#tfrecords-format-details)
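To illustrate the framing (illustrative sketch only: real TFRecords uses masked CRC32C in the two checksum fields, which are stubbed with zeros here), each record is just length-prefixed bytes, which is why the format streams so naturally:

```python
import io
import struct

# TFRecords-style framing: each record on the wire is
#   uint64 length | uint32 crc(length) | data bytes | uint32 crc(data)
# The CRC fields are stubbed with zeros here; the point is that reading
# never needs random access, only "read the next record".

def write_records(stream, records):
    for data in records:
        stream.write(struct.pack("<Q", len(data)))  # length
        stream.write(struct.pack("<I", 0))          # crc of length (stubbed)
        stream.write(data)                          # payload (e.g. a protobuf)
        stream.write(struct.pack("<I", 0))          # crc of data (stubbed)

def read_records(stream):
    while True:
        header = stream.read(8)
        if not header:                              # clean end of stream
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                              # skip length crc
        data = stream.read(length)
        stream.read(4)                              # skip data crc
        yield data

buf = io.BytesIO()
write_records(buf, [b"example-1", b"example-2"])
buf.seek(0)
print(list(read_records(buf)))  # [b'example-1', b'example-2']
```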

~~~
scottlegrand
Thanks for that! And boy, I wish I had the resources the TensorFlow team has
to build standards like this and also to write their own custom CUDA compiler.

I do want the multi-dimensional indexing for RNN data, though. Maybe
supporting HDF5 directly is the path forward.

Thanks again!

------
jbandela1
Deep Learning systems are becoming C++11's halo projects. Here are some deep
learning libraries from the Internet Big 4.

Amazon DSSTNE - [https://github.com/amznlabs/amazon-dsstne](https://github.com/amznlabs/amazon-dsstne)

Google TensorFlow -
[https://github.com/tensorflow/tensorflow/](https://github.com/tensorflow/tensorflow/)

Microsoft CNTK -
[https://github.com/Microsoft/CNTK/](https://github.com/Microsoft/CNTK/)

Facebook fbcunn -
[https://github.com/facebook/fbcunn/](https://github.com/facebook/fbcunn/)

They all utilize C++11 or later. Just as Hadoop pushed Java into the big-data,
map-reduce realm, I think these libraries will push C++11 into the deep
learning realm.

------
vr3690
I get that the acronym is easy to pronounce with the suggested word, but why
not just use the suggested word (destiny) as the name instead of the acronym?
So much easier to read and write. They could explain the name's origin in the
Readme.md.

~~~
abtinf
"Destiny" would also be ungooglable.

~~~
oh_sigh
Meanwhile, DSSTNE is completely unmemorable, so even if you wanted to google
it, you're going to end up typing "amazon destiny machine learning" or
something

~~~
meepmorp
I don't know about you, but I'm much more likely to be googling a project I'm
already working with than googling for general information purposes. In that
context, "DSSTNE [problem keywords]" seems more useful to me.

~~~
oh_sigh
That's true... I was mostly thinking about the case where you aren't using it,
but remember hearing about it or want to check it out.

------
nate_martin
Maybe someone who works on deep learning could comment on what this provides
vs other open source systems like Theano, TensorFlow, Torch, etc.

~~~
curuinor
They claim it's twice as fast as TensorFlow, which is not blow-you-out-of-the-
water (compare that to the ~50x speedup a GPU gets you in most places), but
it's a solid speedup.

It's easily parallelizable across GPUs, or so the claim goes.

Its configuration language is much, much shorter than Caffe's, but upon
inspection it looks like the configuration language is also much less flexible
than Caffe's, and they implemented a damn sight less stuff. No recurrent
anything, for example, no LSTM, no gating stuff that you would need if you
were doing LSTM, no residual net stuff, just off the top of my head.

The docs look much, much less complete in comparison to TF and Theano and
things. Note that the dropout probability is given in the user docs, but the
actual documentation for the dropout feature is hidden away inside the repo.

The important thing, however, is that they claim a significant improvement
when training on extraordinarily sparse datasets, like recommender systems and
things like that. It seems very specialized for that specific exact purpose:
see, for example, its only accepting NetCDF-format data, which is common
enough in climatology-land but less common in machine learning-land proper.

The test coverage... To a first approximation, there is no test coverage. It
seems quite research project-y.

------
Giorgi
Soo... what is the application for this (other than buzzwords)?

~~~
romerocesar
srsly? all the discussion above 10hrs+ before your comment and that's your
question?

RTFM: [https://github.com/amznlabs/amazon-dsstne/blob/master/FAQ.md](https://github.com/amznlabs/amazon-dsstne/blob/master/FAQ.md)

