
TensorFlow Fold: Deep Learning with Dynamic Computation Graphs - moshe
https://research.googleblog.com/2017/02/announcing-tensorflow-fold-deep.html
======
mad44
Here is a summary of the TensorFlow Fold paper.
[http://muratbuffalo.blogspot.com/2017/01/deep-learning-with-dynamic-computation.html](http://muratbuffalo.blogspot.com/2017/01/deep-learning-with-dynamic-computation.html)

------
imh
For anyone interested in really flexible differentiable graphs, Chainer is the
most flexible, convenient library I've used. It's all I use for prototyping
neural nets anymore, and I'm surprised not to see more adoption. It feels like
working in numpy.
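
A minimal sketch of that define-by-run feel (the layer sizes and data here are
made up, just to show the flavor):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Define-by-run: the graph is recorded as ordinary Python executes, so
# control flow (loops, ifs) can depend on the data itself.
layer = L.Linear(4, 3)
x = chainer.Variable(np.random.randn(2, 4).astype(np.float32))
h = F.relu(layer(x))      # graph recorded here, no separate "compile" step
loss = F.sum(h)
loss.backward()           # backprop through whatever graph was recorded
print(layer.W.grad.shape) # (3, 4)
```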

~~~
curuinor
In large part, on the CPU side, it _is_ working in numpy. Of the neural
network libraries, I think Chainer is made by the people who most enjoy
actually writing code.

I mean, for example, a lot of TensorFlow's type checking gets done in Eigen,
where it's done by C++ template metaprogramming (that's how it worked when I
last looked, anyhow); the equivalent Chainer checks just get done by runtime
inspection.

Which one is faster? TF, by far. Which one would you rather have in _your_
codebase?

Edit: after reading the damned thing, they do add in more runtime type checks.
And after looking over TF again, it still has this hybrid thing going on where
it's some Eigen stuff and some runtime stuff. I mean....

------
chewxy
The paper is dense and I'm on a train. Can anyone summarize the difference
between TensorFlow Fold and Chainer?

Also, self-promotion: Gorgonia
([https://github.com/chewxy/gorgonia](https://github.com/chewxy/gorgonia)) has
supported dynamic computation graphs à la Chainer since day one... however,
batched computation remains difficult to implement.

~~~
moshe
TensorFlow Fold provides a TensorFlow implementation of the dynamic batching
algorithm (described in detail in our paper [1]). Dynamic batching is an
execution strategy for computation graphs; you could also implement it in
PyTorch or Chainer or any other framework.

Our particular implementation of dynamic batching uses the TF while loop,
which means that you don't need to make run-time modifications to the actual
TF computation graph. At runtime, we essentially encode the computation graph
for (say) a parse tree as a serialized protocol buffer (tf.string), so instead
of varying the computation graph itself we vary the input to a static
computation graph. This particular implementation strategy is very much a
byproduct of how TensorFlow works (static computation graph, heavy lifting
happens in ops implemented in C++).

[1] Deep Learning with Dynamic Computation Graphs,
[https://openreview.net/pdf?id=ryrGawqex](https://openreview.net/pdf?id=ryrGawqex)
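
To give a rough feel for the idea, here is a toy numpy sketch of the
gather / batched-op / scatter pattern: this is not our actual implementation,
and the tree shapes, names, and tanh merge cell are invented for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
D = 4
W = rng.randn(2 * D, D) * 0.1          # one shared "merge" cell for every tree

# Two differently shaped binary trees over made-up leaf embeddings:
#   tree 1: ((a b) c)        tree 2: ((d e) (f f))
leaves = {name: rng.randn(D) for name in "abcdef"}
results = dict(leaves)                 # name -> vector computed so far

# The "schedule": for each depth, which already-computed results to gather
# and merge. The schedule varies per batch; the ops below never do.
steps = [
    [("ab", "a", "b"), ("de", "d", "e"), ("ff", "f", "f")],   # deepest merges
    [("abc", "ab", "c"), ("root2", "de", "ff")],              # the two roots
]

for step in steps:
    lefts = np.stack([results[l] for _, l, _ in step])        # gather
    rights = np.stack([results[r] for _, _, r in step])
    merged = np.tanh(np.concatenate([lefts, rights], 1) @ W)  # ONE batched op
    for (name, _, _), vec in zip(step, merged):               # scatter
        results[name] = vec

print(results["abc"].shape, results["root2"].shape)           # (4,) (4,)
```

In Fold the schedule is what gets serialized into the proto and driven by the
TF while loop; the static graph only ever contains the gather, batched op, and
scatter.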

~~~
taliesinb
Congratulations on the nice work! It is very elegant to use combinators to
formulate and solve this problem. Though my worry with combinators is that
they make it awkward to set up 'long range' DAG edges: you can end up having
to do a lot of shuttling of things around manually through tuples and whatnot.
I'm not sure how it is with your framework.

Am I right in thinking that there is no bucketing going on here? In other
words, each batch is a fixed size, and a short-lived DAG is planned and then
simulated with tf.while to accommodate the set of shapes in that particular
batch? Are there any problems when the input shapes are wildly different in
cost? For example, imagine a size-agnostic convnet. Maybe some of the images
in the training set are small and others are large; how would that look in
your framework, if it can be done? Is junk padding part of the picture, to
help match almost-equal tensors so they can be batched?

~~~
moshe
Thanks, insightful questions.

You're absolutely right about combinators and long-range dependencies. We have
a special block type in the high-level API
([https://github.com/tensorflow/fold/blob/master/tensorflow_fo...](https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/py/td.md#td.metric))
for accumulating results here without having to explicitly shuttle them
around; not perfect but very handy in many cases.

Regarding your second question, the equivalent of padding in dynamic batching
is "pass-through" do-nothing ops, which are introduced transparently but are
worth being aware of if you want to understand the machinery. The worst-case
scenario here is chains of wildly varying lengths, where we need to add pass-
throughs to the shorter chains to match the length of the longest chain.
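
A contrived sketch of what those pass-throughs buy you (again, not our code,
just the idea, with an invented step function and chain lengths):

```python
import numpy as np

rng = np.random.RandomState(0)
D = 4
W = rng.randn(D, D) * 0.1

def step(x):                      # the "real" op applied at each depth
    return np.tanh(x @ W)

# Chain a needs 5 steps, chain b only 2; b gets pass-throughs for the last
# three iterations so both chains march through the same batched loop.
a, b = rng.randn(D), rng.randn(D)
plan = [(True, True), (True, True), (True, False), (True, False), (True, False)]

for do_a, do_b in plan:
    active = [v for v, do in ((a, do_a), (b, do_b)) if do]
    out = iter(step(np.stack(active)))          # one batched op per iteration
    a = next(out) if do_a else a                # pass-through: value carried
    b = next(out) if do_b else b                # forward unchanged
```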

------
iraphael
Github link:
[https://github.com/tensorflow/fold](https://github.com/tensorflow/fold)

Paper link:
[https://openreview.net/pdf?id=ryrGawqex](https://openreview.net/pdf?id=ryrGawqex)

------
kyloon
This is great news; I was just wondering when TensorFlow would support this
after reading about PyTorch.

~~~
zump
They got scooped and pushed to publish the intern's project.

~~~
superfx
I believe the TensorFlow Fold paper came out before PyTorch.

------
kriro
The concept seems interesting. I've stopped closely investigating the stack I
use below the "Keras level" and mostly treat everything under it as a black
box. I'm defaulting to Theano since I only have one GPU to work with, but as
far as I can tell switching to TensorFlow is basically a small config change.
I've only browsed this, but since I mostly do NLP (and virtually no image
recognition) I suppose it could be worthwhile to switch. I guess I'll need to
open the black boxes a bit and see what Theano does :)
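
(For the record, the config change I mean is just pointing Keras at a
different backend; something like this, assuming a standard Keras install:)

```python
import os

# The "small config change": pick the Keras backend before importing keras.
# The other way is editing the "backend" field in ~/.keras/keras.json.
os.environ["KERAS_BACKEND"] = "tensorflow"    # or "theano"

import keras                                  # prints "Using TensorFlow backend."
print(keras.backend.backend())                # -> "tensorflow"
```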

------
superfx
Why does the GitHub page say this is not an official Google project, yet it's
on the Google blog?

~~~
moshe
Please note that the GitHub page says "not an official Google product", rather
than "project". An official Google product would be something like Gmail.

~~~
superfx
I guess I was confused because the main TensorFlow page doesn't have that
phrase. It made it sound like this won't be officially supported by Google.

------
congerous
The "leading" DL framework is playing catch-up to Chainer, PyTorch and DyNet.
Another Google product development bungle.

~~~
general_ai
The way I see it, TF is about to pull _way_ ahead thanks to XLA JIT/AOT. All
of a sudden you get the ability to fuse things at a much more granular level,
which could reduce memory bandwidth requirements by a lot. Frameworks like
Torch can't do any fusing at all, since their computation is fully imperative.
Tactical win for imperative frameworks, I suppose, but strategically a
functional graph is the way to go. DB people realized this in the 70s; ML
people are realizing it now.
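
Roughly, opting into the JIT looks something like this today (a sketch;
assumes a TF 1.x build with XLA compiled in, and details may vary):

```python
import tensorflow as tf

# Opt the whole session into XLA JIT: TF clusters fusable ops and compiles
# them together, which is where the fusion / memory-bandwidth wins come from.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.random_normal([128, 256])
w = tf.random_normal([256, 64])
y = tf.nn.relu(tf.matmul(x, w) + 1.0)        # add + relu are fusion candidates

with tf.Session(config=config) as sess:
    print(sess.run(y).shape)                 # (128, 64)
```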

~~~
congerous
TF is way behind on UI, which is why it's making Keras its front end. It's
fairly slow on multiple GPUs compared to Torch and neon. It might pull ahead
in performance on GCE, but that's just for lock-in.

~~~
general_ai
TF is in a fortunate position of having several UIs at this point. It's a
lower level framework with a lot of power. If you don't need all that power,
Keras or TFLearn or Slim are pretty great. If you do, it's there for you. I
see no evidence that Google's goal with TF is to lock you into anything, and
especially GCE. I'm a former Google employee, and I can tell you unequivocally:
that's not how Google actually works.

