Dynamic computation graphs arise whenever the amount of work that needs to be done is variable. This may be when we're processing text, one example being a few words while another being paragraphs of text, or when we are performing operations against a tree structure of variable size. This problem is particularly prominent in particular subfields, such as natural language processing, where I spend most of my time.
PyTorch tackles this very well, as do Chainer and DyNet. Indeed, PyTorch construction was directly informed from Chainer, though re-architected and designed to be even faster still. I have seen all of these receive renewed interest in recent months, particularly amongst many researchers performing cutting edge research in the domain. When you're working with new architectures, you want the most flexibility possible, and these frameworks allow for that.
As a counterpoint, TensorFlow does not handle these dynamic graph cases well at all. There are some primitive dynamic constructs but they're not flexible and usually quite limiting. In the near future there are plans to allow TensorFlow to become more dynamic, but adding it in after the fact is going to be a challenge, especially to do efficiently.
Disclosure: My team at Salesforce Research use Chainer extensively and my colleague James Bradbury was a contributor to PyTorch whilst it was in stealth mode. We're planning to transition from Chainer to PyTorch for future work.
The primary issue is that the computation graph is not imperative - you define it explicitly. Chainer describes this as the difference between "Define-and-Run" frameworks and "Define-by-Run" frameworks.
TensorFlow is "Define-and-Run". For loops and conditionals end up needing to be defined and injected into the graph structure before it's run. This means there are "tf.while_loop" operations for example - you can't use a "while" loop as it exists in Python or C++. This makes debugging difficult as the process of defining the computation graph is separate to the usage of it and also restricts the flexibility of the model.
In comparison, both Chainer, PyTorch, and DyNet are "Define-by-Run", meaning the graph structure is defined on-the-fly via the actual forward computation. This is a far more natural style of programming. If you perform a for loop in Python, you're actually performing a for loop in the graph structure as well.
This has been a large enough issue that, very recently, a team at Google created "TensorFlow Fold", still unreleased and unpublished, that handles dynamic computation graphs. In it they tackle specifically dynamic batching within the tree structured LSTM architecture.
If you compare the best example of recursive neural networks in TensorFlow (quite complex and finicky in the details) to the example that comes with Chainer, which is perfectly Pythonic and standard code, it's pretty clear why one might prefer "Define-by-Run" ;)
I personally think the approach used in TensorFlow is preferable – having a static graph enables a lot of convenient operations, such as storing a fixed graph data structure, shipping models that are independent of code, performing graph transformations. But you're right that it entails a bit more complexity, and that implementing something like recursive neural networks, while totally possible in a neat way, ends up taking a bit more effort. I think that the trade-off is worth it in the long run, and that the design of TensorFlow is very much influenced by the long-run view (at the expense of immediate simplicity...).
The ops underlying TensorFlow's `tf.while_loop` are actually quite flexible, so I imagine you can create a lot of different looping constructs with them, including ones that easily handle recursive neural networks.
Thanks for pointing out a problem that I haven't really thought about before!
Would you mind providing a concrete example to relate to if you dont mind? Again, intrigued by PT so want to learn more about it vs TF...
In fact, with DyNet or PyTorth, you still need to bookkeeping the graph you traversed (tape) because no one is doing forward AD. If that's the case, why not have a good library to do symbolic computation graph and build dynamic feature on top of it. (I am not saying Tensorflow is a good symbolic computation graph library to build upon just arguing that start with a define-compile-run library doesn't necessarily hinder your ability to support dynamic graphs).
Once you have a reasonable symbolic computation graph library, you don't need to explicitly build a "tape" because the symbolic representation will record the order of execution and reverse AD even graph optimization (applying CSE etc) come naturally as well.
This won't be the same for TensorFlow as it was written with the concept of a static computation graph at its core. I'm certainly not saying it's impossible to re-architect - and many smart people in the community and at Google are devoting thinking and code to it - but simply that the process will be far more painful as it was not written with this as an intended purpose.
To note - there are many advantages to static computation graphs. Of particular interest to Google is that they distribute their computations very effectively over large amounts of hardware. Being able to do this with a dynamic computation graph would be far more problematic.
Does the upcoming XLA interact with this as well? I.e. compilation would be too costly for dynamic graphs, and so it would only make sense for static graphs?