This book explains and executes every single line of code interactively, from low-level operations to high-level networks that do everything automatically. The code is built on the state-of-the-art performance operations of oneDNN (Intel, CPU) and cuDNN (CUDA, GPU). It is very concise, readable, and understandable by humans.
which I suppose can best be described as Lisp and Python having a baby. It was immense fun to code neural networks from scratch in it. I hope Clojure can find a bigger place in the world of ML.
In those days there was a lovely LuaJIT-based tensor manipulation language, torch7 [1,2], developed by Leon Bottou. It later became the basis for PyTorch. I still believe that Lua in general, and LuaJIT in particular, are much superior to Python for deep learning.
Another student of LeCun from NYU here. Can attest that lush is adorable. For example:
For high-performance parts of your code, a subset of Lush would generate C code and compile it. I imagined that this was what it was like to write the first version of C++, the one that generated C code.
If only C++ supported an interactive REPL and the rest of the Clojure/Lisp goodies, that might be possible.
However, the code is CLOSELY related to the actual CUDA/C++ API. It's a lot simpler and more concise, and I explain everything so that you can use the relevant parts with the cuDNN and DNNL APIs in whatever language you're most proficient in.
Have you used it? When I last tried Cling, not that long ago, it wasn't even alpha-quality software, and given that it has been around for a while, my default assumption would be that this hasn't suddenly improved.
Please read a few of the tutorials from my blog. Most programmers in your situation told me that they had no problem following them; the series only gradually introduces advanced Clojure concepts, and the code snippets are usually extremely short and completely executable interactively as-is.
This is a great site. A few pieces of unsolicited feedback from a marketing perspective:
* Always have the call to action repeated at the bottom of the page; you did a great job having it above the fold, but I got to the end and had to scroll back up to express intent. That's a flow-breaking design.
* This is purely stylistic, but I strongly prefer the first letter of each line to be capitalized. I find the stylistic inconsistency around capitalization off-putting.
* You have pretty nicely made docs; consider making the link to view the sample chapters a clickable picture of the first diagram page in your chapter.
Thanks! I have this slated in my to-do tabs (which I actually do, lol). Just looking through it, it looks like you did a lot of work and took a lot of time on this, so I just wanted to say thanks.
I am not sure I am that enthusiastic. The problem with Lisp is not the number of parentheses but where they are and what their role is. In C-like languages, parentheses help the parser and compiler, but they also help humans read the code. In the case of Lisp, they are there just for the sake of the parser.
Let's look at the code:
Python:
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=(28, 28, 1)))
Which one is more readable? Looking at the Clojure code, I see 128 1 28 28 thrown at me; without digging into the documentation, I have no idea what's happening.
Now, granted, 128 1 28 28 can be difficult to understand without documentation, but that is not due to Lisp's fully parenthesized prefix notation. The Clojure code would be just as readable if it had used keyword arguments.
Are you sure you are not confounding familiarity with readability? With Lisp, after a while, the parentheses become invisible to the programmer.
I really enjoy your work. I bought your book recently and appreciate your approach of building an understanding of the library based on "first principles". Really appreciate this performant and elegant option for working with deep-learning in Clojure - Thank you!
How is "conv" arbitrary? There is a function object that represents a convolutional layer in the network. It is bound to two symbols (because why not). You can either use "convolution" if you prefer full names, or "conv" if you prefer shorter. It doesn't represent the operation, but the layer. There are functions (with longer names) representing the convolution operation, which follow cuDNN and DNNL naming schemes.
Regarding the magic, I believe you haven't read my writings related to this. Exactly the opposite - there is no magic other than usual Clojure-fu, which I explain in a layered way.
But it's difficult to reply to your critique exactly, because you haven't given any example of an approach that would be good Clojure. OK, give me an example of how you would do it in a comprehensible way (if what I provide is incomprehensible). You don't have to actually implement it. Show a non-working alternative. What would it look like?
Concision is a style choice to be used with care. Spending screen space on additional characters and descriptions detracts from the ability to fit more logic on the screen at once and grok the larger flow. Splashing symbolic alphabet soup into your IDE in the name of concision isn't usually a good idea, but naming something "conv" in the immediate local context of a convolutional layer doesn't seem so bad.
Does 'convo' refer to a 2D convolution or a 1D convolution? Given the large number [1] of arguments that a convolution can take, which ones are being specified? I can probably guess, since only 2 are given, but if there were more, which order would they be in and which would refer to which?
The code is on GitHub [2, 3]; see for yourself whether you think it's more or less obvious than the Python equivalent of a 'trivial' network.
I would say 'conv2d' is probably reasonably standard in meaning; I refer to 'convo' as arbitrary, because it is. Either (ideally) avoid abbreviations, or use standard ones.
I know nothing about neural networks, but it looks to me like convo is smart enough to create a convolution of the correct dimension based on the arguments, whereas Keras seems to force you to use a different constructor for different dimensions. That explains why it's called convo, since it creates convolutions of any dimension.
Also, it looks like most options are provided to convo as a map as well, so you'd have similarly named arguments for convo once you get to defining optional things like padding and strides.
convo is smart enough to cover 1D, 2D, 3D, and any other convolution layer that the backend can support. You only need to specify the data that it can't figure out, but you can specify more if you want.
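In Python terms, the idea is roughly this kind of rank-based dispatch. Here is a hypothetical sketch using PyTorch layer classes (not Deep Diamond's actual implementation), just to show the shape of it:

    # Hypothetical convo(): pick the n-D convolution class from the rank of
    # the kernel shape, so one entry point covers 1D, 2D, and 3D layers.
    import torch.nn as nn

    _CONV_BY_RANK = {1: nn.Conv1d, 2: nn.Conv2d, 3: nn.Conv3d}

    def convo(in_channels, out_channels, kernel_size, **opts):
        # Dispatch on len(kernel_size); extra options pass through as a map.
        return _CONV_BY_RANK[len(kernel_size)](in_channels, out_channels,
                                               kernel_size, **opts)

    layer = convo(1, 32, (3, 3), padding=1)   # resolves to nn.Conv2d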
Generate fully vectorized, stand-alone, human-readable C99 code for neural net inference, and understand exactly what's happening. For example, watch the code run with Linux's perf top and see the relative costs of each layer of the computation. Total transparency, no dependencies outside the C POSIX library
(which is a set of 4 lines that appear in the middle of an ~800 line function).
That's not "human readable".
Sure you can use asan or gdb, but if gdb profiles slowly, what can you do? You're still at the mercy of the code generator to be able to optimize things.
Google those _mm512_... intrinsics (they are part of GCC) to see what they mean. The code you pasted is converting single-precision floats to half-precision floats, and storing the half-precision floats to memory, 32 at a time. That's filter packing, which happens during initialization (and never during inference)
I agree, if you don't know anything about how convolution is implemented (filter packing, data packing, matrix multiplication, sum unpacking), you could be lost. But it's very shallow compared to a JIT or CUDA library scheme, and a knowledgeable ML performance engineer would have no difficulty
The inference function (at the end of the C file) is a series of blocks, each block corresponding to a convolution or other complex operation. It's straightforward to see which, by looking at where the weights come from (a field in a struct that has the same name as the layer in your graph)
If you use perf top (for example) you can see which convolution was most expensive, and why. Does the shape of the tensor produce many small partial blocks around the edge, so the packing is inefficient (a lot of tile overhang), for example? You can see that by glancing at the code and seeing that there are many optimized blocks around the edges. As a rule, if NN-512 generates small code for a tensor (few edge cases) you have chosen an efficient tensor shape, with respect to the tile
Or you might find that batch normalization is being done at inference time (as in DenseNet), instead of being integrated into the convolution weights (as in ResNet), because there's fanout from the source and a ReLU in between. You can see that easily in the generated code (the batch norm fmadd instructions will appear in the packing or unpacking code)
Is the matrix multiplication slow because there are too few channels per group (as in ResNeXt)? Easy to see in perf, make your groups bigger. Are you using an inefficient filter shape, so we have to fall back to a slower general purpose convolution? You can easily see whether Winograd or Fourier was used
> Probably they’re afraid because it might be related to their day job :/
A slightly more common scenario is an employer that insists on "we own everything, related to your job or not, that you do even on your own time and equipment" clauses in employee contracts even though such clauses don't happen to be enforceable in the relevant jurisdiction.
Rather than having to "clear through your manager and legal" every little thing to get it added to your contract's personal IP whitelist, publishing anonymously makes perfect sense, where the plan is to de-anonymize after employment ends, at which point (should said now-former-employer have a hissy fit), their own counsel will eventually inform them they don't have a leg to stand on. After sending at least one threatening letter, of course.
Another solution is to spam your manager (and legal) with every trivial 'invention' that pops into your head until they relent [0][1], but that can burn through political capital you may prefer to use for other purposes, and will probably only narrow the scope rather than remove the unenforceable clause.
It need not be related to your job. Some employers might argue that since you're skilled enough to do such a thing, you should have been performing extraordinarily on the job, even if you are already delivering what the job asks for and are just as good as your peers. At worst, some struggling, poorly managed startup might even "turn around" and eventually you don't own your side passion project anymore.
The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud compute instances
For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)
In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)
As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation
I'm not disagreeing with you. I acknowledge that there may be a market for CPU-only NN tasks.
I think a thorough benchmark, either by you or by someone else, will only help your case, by giving a clear picture to those who need to make a decision.
Fun fact, GPUs are massively under-utilized during NN training. So it's quite possible NN on a good CPU might be only slightly slower.
GPU underutilization depends on exactly what model you're training. It's not unreasonable to hit 80% or more CUDA core usage on non-recurrent models like convnets, given sufficiently fast data pipelines and a reasonable batch size. Transformers and other recurrent functions hit 100% CUDA core utilization for large portions of each epoch, with the low-utilization stretch being the comparatively short weight update at the end. As well, the current rule of thumb is that at the same price point (say, a Xeon 4114 and an Nvidia Titan RTX) the GPU completes each epoch in 10% of the time the CPU takes, given the same compute graph... So it's highly unlikely that training will be anywhere close to as fast on a CPU as it is on a GPU.
Why not? You've got thousands of tensor cores, or teraflops, at your disposal, with already-developed APIs, and if you're not too latency-sensitive you can batch a lot. Since you'll be doing the same inference operation millions of times, you don't have to re-prepare kernels and such; use CUDA graphs or whatever is the flavour of the day for low-overhead, repetitive computation. And if you want to scale a bit, you can add some GPUs before all the PCIe lanes are saturated, right? Apart from Myriad X and TPUs, I'm not sure what could be more useful.
I just want to say that I'm very interested in this library and have commented on it before. I'd really like to see it reach feature parity with PyTorch or Theano and emit your C++ code on the backend.
For example, I am not aware that one can currently use your library to implement Wavenet, other audio generative models like Wavegrad, or transformers.
> with dynamically generated graphs, the computational graph is never actually defined anywhere: the computation is traced out on the fly and behind the scene. You can no longer do anything interesting with the computational graph: for example, if the computation is slow, you can’t reason about what parts of the graph are slow.
Hmm, my experience is the opposite. When I used TensorFlow, there was no way I could figure out why something was slow or required huge memory. All I had was a gigantic black box.
Meanwhile, in PyTorch, all I have to do is run it with CUDA_LAUNCH_BLOCKING=1, and it will give me an accurate picture of exactly how many milliseconds each line is taking! (Just print the current time before/after the line.) With nvprof it will even tell you which CUDA kernels are executing.
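A minimal sketch of what I mean (it assumes a CUDA-capable machine; with the env var set, kernel launches block, so wall-clock timing around a line is meaningful):

    # CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized,
    # i.e. before the first CUDA call.
    import os, time
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    import torch

    x = torch.randn(4096, 4096, device="cuda")
    t0 = time.time()
    y = x @ x                      # the line being profiled
    torch.cuda.synchronize()       # redundant with the env var, but harmless
    print(f"matmul: {(time.time() - t0) * 1e3:.1f} ms")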
* Disclaimer: Haven't dabbled in ML for ~a year, so my view might be outdated now.
Eh. I love pytorch, but it can definitely be difficult to reason about at times. For instance, due to async dispatch on GPU, you could get assertion errors where a line fails, but the real error was actually several lines above.
I'm a Theano diehard, and I'll never get over how Google came along, introduced a shittier version of Theano, garnered worldwide acclaim for it, and killed the better library in the process.
Having written and debugged both Theano and TF plenty in the past, I think this is a somewhat uncharitable take, esp. recalling the absolutely enormous Theano compile times. :) I think Theano was genius, but a system that relied on python-string-based C++ code-emitters was always going to have trouble with long-term sustainability.
I am one of the authors of the Theano work. I am happy to hear that the Theano project is now being maintained again.
I will agree with alevskaya that the compilation times were an issue in my particular research ten years ago. I was trying to build neural networks for parsing that were created at run-time. Since each parse tree had a different computation graph, I was not able to use Theano, because it required compiling every single type of parse-tree computation graph it encountered during training.
[edit if you want more details: There is really interesting old-school work called "Recursive distributed representations" and later "Labelling recursive auto-associative memory" that used auto-encoders to consume a variable length sequence, e.g. text string, in a sequential fashion. My work with Yoshua Bengio---incomplete---was based upon the idea of doing unsupervised binary parsing of sentences using a hierarchical RAAM-style approach: At any given point in time, greedily find the two adjacent tokens that could be most easily compressed into one token with low reconstruction error. However, once you apply this recursively and end up with auto-encoding binary parse trees, you end up with a variety of different computation graphs, each of which required separate compilation.]
TensorFlow 1.0 has its roots in how Theano was built: the same thing, a statically built graph that is run through a compilation step, with a numpy-like API. So what makes Theano such an ingenious concept while TF is regarded as "programming through a keyhole"?
Here's my take about TF (in general, not particularly 1.x or 2.x):
Like many things from Google, I always had the impression that the library, while better than alternatives at the time, is too tailored to Google use cases. And if you fall outside of them, bad luck.
Still, at work we find it easier to deploy and interoperate with other tools than PyTorch. Hell, we have a guy working in PyTorch who converts his work to ONNX so that we can then connect those models to some tooling we already have from back when TF was our only backend.
Could there be a better way? Perhaps. But we have to ship models and TF "just* works" (with a big asterisk, yeah).
I recently used TF 1.0 (former Theano author, current PyTorch user) and found TF 1.0 to be hellaciously difficult to grok and seemed to include a lot of unnecessary abstractions.
There was existing TF 1.0 code I was trying to extract gradients through (nsynth-wavenet). I spent over 8 hours on it unsuccessfully; I asked for help from a friend at Google who worked on TF and he couldn't figure it out either. I emailed the original author of the code and he acknowledged that he didn't know how to do it either, and he had an old notebook he could dig up that kinda would work with a lot of fixes.
My coworker said that he basically started from this article[0] and then adapted a few things to his workflow. He also said that learnopencv "covers like 70% of what you really have to do and you have to figure the rest out, not hard but may take you some time".
Are these libraries ever useful in non-deep learning applications? It sounds like Theano is a bit more general purpose, but why would I ever need it outside of a deep learning context?
I wonder if it could be used for something crazy, e.g. setting up a graph that generates shadertoy-like images on the GPU.
They are. Lots of numerical code benefits from GPU and lots of numerical code benefits from derivatives. Simulations, solvers, numerical optimization, good old fashioned statistics.
Libraries like this enable differentiable programming, which lets you backprop through more than just neural networks. For instance, people have built a differentiable raytracer and plugged a physics engine into reinforcement learning to accelerate training.
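As a toy illustration of the idea (a sketch in JAX; the raytracer and physics-engine examples are of course far more involved), you can take gradients of an ordinary physics-style function:

    # Differentiate the ideal projectile range with respect to launch angle.
    import jax
    import jax.numpy as jnp

    def projectile_range(angle, v0=10.0, g=9.81):
        return (v0 ** 2) * jnp.sin(2.0 * angle) / g

    print(jax.grad(projectile_range)(jnp.pi / 6))   # d(range)/d(angle)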
Idk about using these libraries, but it's almost impossible to find generic graph libraries that aren't designed around either ML or batch scheduling. One such example is my own: https://github.com/timkpaine/tributary
Interesting library & idea, almost like its own programming paradigm when you abstract away all the specificity for building software or running ETL jobs or whatever.
But this is a completely different kind of graph. The graphs being discussed here are differentiable DAGs of mathematical computations.
I wonder whether any of them has proper Windows support, i.e. DirectCompute?
CUDA is Nvidia-only, and vendor lock-in is bad for end users. CUDA, OpenCL, and VK all require large runtimes which are not included in the OS; software vendors like me need to redistribute and support them, and I tend to avoid deploying libraries when I can.
Seems to have missed the existence of jax.jit, which basically constructs an XLA program (call it a graph if you like) from your Python function which can then be optimized.
The author gives that quote (from the JAX documentation) but does not seem to internalize it, as his conclusion says:
> This is the niche that Theano (or rather, Theano-PyMC/Aesara) fills that other contemporary tensor computation libraries do not: the promise is that if you take the time to specify your computation up front and all at once, Theano can optimize the living daylight out of your computation - whether by graph manipulation, efficient compilation or something else entirely - and that this is something you would only need to do once.
That is exactly what JAX does. There is a computational graph in JAX (it's encoded in XLA and specified with their numpy-like syntax); it is built once, optimized, and then run on the GPU.
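For instance, a small sketch (jax.make_jaxpr prints the traced graph that XLA then compiles):

    import jax
    import jax.numpy as jnp

    def predict(w, x):
        return jnp.tanh(x @ w)

    w, x = jnp.ones((3, 2)), jnp.ones((4, 3))
    print(jax.make_jaxpr(predict)(w, x))   # the traced computational graph
    fast = jax.jit(predict)                # compiled by XLA on first call
    fast(w, x)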
Not even close. jax.jit allows you to compute almost anything, using lax.fori_loop, lax.cond, and other lax and jax constructs; PyTorch's JIT does not allow that, it's just extra optimization for static PyTorch functions.
JAX autograd will work on most any jitted fn - the control-flow limitation is that there's no autograd for code with for/while loops, since there's a statically unknowable trip count through the loop body. Much looping code can be handled differentiably using a "scan", though.
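A minimal sketch of that, assuming nothing beyond stock JAX:

    import jax
    import jax.numpy as jnp
    from jax import lax

    def total_loss(w, xs):
        def step(carry, x):                 # one loop iteration
            carry = carry + jnp.tanh(w * x)
            return carry, carry             # (new carry, per-step output)
        total, _ = lax.scan(step, 0.0, xs)
        return total

    print(jax.grad(total_loss)(0.1, jnp.arange(1.0, 5.0)))   # d(total)/dw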
Can someone ELI5 what the differences between the different libraries are? The article uses a lot of jargon, and something that frustrates me about getting into machine learning is that teaching material will either abstract away what the internals do or assume that you already know how the internals work.
Some specific questions:
> They provide ways of specifying and building computational graphs
Is the article talking about neural networks? As in, arrays of arrays of weights, where input values go through successive layers, and for each layer the same instruction is applied to some values with the respective weight?
Or is it talking about a graph as in a functional graph, where manually written functions call other manually written functions? (Hence why a later paragraph talks about if-else statements and for loops.)
> Almost all tensor computation libraries support autodifferentiation in some capacity (either forward-mode, backward-mode, or both).
What are those?
From the wikipedia article, it sounds like autodifferentiation basically means running f(x+dx)-f(x), but if there are entire frameworks handling it, then there's probably something fancier going on.
> According to the JAX quickstart, JAX bills itself as “NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research”. Hence, its focus is heavily on autodifferentiation.
The earlier description makes it sound like JAX does some cutting-edge compilation stuff to transform semi-arbitrary functions (with ifs and elses and loops and stuff) into a function that returns its derivative.
So how can that stuff run on the GPU? It sounds like there would be a lot of branching code.
And how is that related to machine learning / neural networks?
- In this specific case, it's also a problem of the API: theano.scan would return the whole sequence. But if you only need the last entry, i.e. y[-1], there is a very complicated optimization rule which checks for that. Basically many optimizations around theano.scan are very complicated because of that.
- The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. But if you have big graphs, even just building up the graph can take time, and the optimization passes will take much longer. This was one of the most annoying problems when working with Theano. The startup time to build the graph could easily take up some minutes. I also doubt that you can optimize this very much in pure Python -- I think you would need to reimplement that in C++ or so. When switching to TensorFlow, building the graph felt almost instant in comparison. I wonder if they have any plans on this in this fork.
- On the other side, the optimizations on the graph are quite nice. You don't really have to care too much when writing code like log(softmax(z)) -- it will also be optimized to be numerically stable (see the sketch after this list).
- The optimizations also went so far to check if some op can work inplace on its input. Which made writing ops more complicated, because if you want to have nice performance, you would write two versions, one which works inplace on the tensor, and another one not. And then again 2 further versions if you want CUDA as well.
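Here is the sketch referenced in the log(softmax(z)) point above (it uses the old `theano` module names; the rewrite to the numerically stable log-softmax happens when the function is compiled):

    import theano
    import theano.tensor as tt

    z = tt.matrix("z")
    y = tt.log(tt.nnet.softmax(z))     # naively unstable formulation
    f = theano.function([z], y)        # graph optimization runs here
    theano.printing.debugprint(f)      # shows the fused, stable log-softmax op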
Re. the last point (in-place ops), I was trying to think of computations where 1) an efficient in-place version is possible, and 2) the most efficient out-of-place version is significantly faster than copying the input and executing the in-place version.
In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.
In Big-O notation, there will not be any difference, because copying the data will just be O(N), and whatever you do in the op will be at least O(N), so no change.
But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x.copy(); y += 1. I would expect that the former is slightly faster. But actually I'm not really sure.
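A crude way to check at the NumPy level (just a sketch; Theano's ops are C, so this only hints at the cost of the extra copy):

    import timeit
    import numpy as np

    x = np.random.rand(10_000_000)

    def out_of_place():
        return x + 1.0        # allocates the result directly

    def copy_then_inplace():
        y = x.copy()          # extra pass over memory ...
        y += 1.0              # ... then the in-place add
        return y

    print(timeit.timeit(out_of_place, number=50))
    print(timeit.timeit(copy_then_inplace, number=50))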
Actually, I implemented most of my native ops exactly in this way, i.e. I implemented the inplace version, and the non-inplace version would just additionally copy it and then call the inplace version.
Hello, I'm the person spearheading this Theano fork! Your comments match my experience with the old Theano very well, so I have to respond.
> Apparently, the main new feature for Theano will be the JAX backend.
The JAX transpilation feature arose as a quick example of how flexible Theano can be, both in terms of its "hackability" and its simple yet effective foundation (i.e. "static" graphs). It's definitely not the main focus of the fork, but it is easily the newest feature that stands out at the user-level.
The points you raised about the old Theano are actually the main focus, and we've already made large internal changes that address a few of them directly. At the very least, nearly all of them are on the roadmap toward our new library named "Aesara".
The `Scan` `Op` and its optimizations are definitely going to change, and I have no intention of sacrificing improvements for backward compatibility, or anything else that would constrain the extent of improvements. I too have dealt with the difficulties involved in writing Scan optimizations (e.g. https://github.com/pymc-devs/symbolic-pymc/blob/master/symbo...) and am painfully aware of how unnecessary most of them are.
> - The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. ...
The most important graph optimization performance problems are not actually related to Python performance; they're demonstrably design and implementation induced. That is unless you're talking exclusively about graphs so large they reach the "natural" limits of Python performance by definition. Even then, a nearly one-to-one C translation isn't likely to solve those scaling problems.
For example, the graph optimization/rewriting framework would require entire graphs to be copied at multiple points in the process, and this was almost completely due to some design oddities. We've already made all of the large-scale changes needed in order to remedy this design constraint, so we're well on our way to fixing that. See https://github.com/pymc-devs/Theano-PyMC/pull/158
The rewriting process also doesn't track or use node information very well (or at all), so the whole optimization process itself can take an unnecessary number of passes through a graph. For instance, its "local" optimizations have a "tracking" option that specifies the `Op` types to which they apply; however, that feature isn't even used unless the local optimizations are applied by a `LocalOptGroup`. I've noticed at least a few instances in which these local optimizations are applied to inapplicable `Op`s on each visit to a node. Worse yet, within `LocalOptGroup` those local optimizations aren't applied directly to the relevant `Op`s, even though the requisite `Op` type-to-node information is readily available. In other words, optimizations could be directly applied to the relevant nodes in these cases and dramatically reduce the amount of blind graph traversals performed.
At best, a reimplementation in a language with a better compiler, like C, would largely amount to a questionable brute-force attempt at performance, and the ease of manipulating graphs and developing graph rewrites would suffer. With Aesara, we're going for the opposite. We want a smarter framework and _more_ focus on domain-specific optimizations (e.g. linear/tensor algebra, statistics, computer science) from the domain experts themselves, so code transparency and ease of development really matters. When we need raw performance in specific areas of the code, we'll pinpoint those areas and write C extensions, in standard Python fashion.
> ... When switching to TensorFlow, building the graph felt almost instant in comparison. ...
Last I checked, TensorFlow had almost no default graph optimizations, aside from some basic CSE and minor canonicalization and algebraic simplifications in the `grappler` module, so it absolutely should be instantaneous. More importantly, TensorFlow isn't designed for graph rewriting, and definitely not at the Python level where rapid prototyping and testing is possible outside of Google.
Otherwise, if you're talking about initially _building_ a graph and not calling `theano.function`, there are no optimizations involved. Latency in that case would be something entirely different and well worth reproducing for an issue. If what you were observing was the effect of calling `theano.function`, the latency was most likely due to the C transpilation and subsequent compilation. That's a feature that necessarily takes time, but produces code that's often faster than TensorFlow even today.
In summary, the changes we're most focused on right now are for developers like yourself who have had to deal with the core of Theano, so, please, stop by the fork and help us make a better `Scan`!
By graph building, I actually meant graph compilation. In TF the first `session.run`, or in Theano the `theano.function`.
I did not get too much into the internals of the graph compilation and optimization (despite writing a couple of simple optimization passes of my own), so I don't really know whether something there is done really inefficiently, but I can easily believe that. I agree: if something is inefficient there, it should be rewritten in a more efficient way. But I also think that even if you have it as efficient as it can be, it would still be slow compared to a C/C++/Rust implementation, easily by a factor of 100 or so. And even in C/C++ it can still be slow, when I consider how much time LLVM or GCC take in their optimization passes.
Yes, TensorFlow does not have much optimization, although I think the idea was always to extend that. But then, as you say, this also is one of the reasons the graph compilation is so fast. But comparing the runtime performance of Theano vs TF, in most cases, TF was just as fast or faster (which is likely dependent on the specific model; but as far as I remember, that was the general observation by the community). So because of that, I was questioning whether all that heavy graph optimization is really worth it. Numerical stability is another topic, of course. But you can also have some simple logic for that, e.g. implement your own `safe_log`, which checks if the input is `softmax(x)`, and then directly returns `log_softmax(x)`. See e.g. here: https://github.com/rwth-i6/returnn/blob/6cd6b7b3b3d3beb33140...
Btw, graph rewriting in TF is certainly also possible, and not so complicated. But it's not really optimized for that. You cannot rewrite parts of the graph inplace. You would need to create a new copy. (Although, technically, I think it would not be too complicated to allow for more graph rewriting, also inplace. But it was/is just not a high priority.)
About `Scan`: I think the main problem is the API itself. I think it would be easier if the underlying op were `WhileLoop` or so, very similar to `tf.while_loop`. Then everything becomes very natural. However, then you would need some good way to accumulate your outputs if you actually want the logic of `scan` - something like `ys = concat(ys, [y])` inside the loop. And then it is probably necessary to have specific optimizations on that to make it efficient, or to introduce something like `TensorArray`. But in both cases, I think this is easier than working with `Scan` as the underlying op for loops.
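Roughly the shape I mean, sketched in TF terms (not a proposal for the actual Theano API):

    import tensorflow as tf

    xs = tf.constant([1.0, 2.0, 3.0, 4.0])
    n = tf.shape(xs)[0]
    ta = tf.TensorArray(tf.float32, size=n)

    def cond(i, acc, ta):
        return i < n

    def body(i, acc, ta):
        acc = acc + xs[i]                    # the per-step computation
        return i + 1, acc, ta.write(i, acc)  # accumulate the per-step output

    _, total, ta = tf.while_loop(cond, body, [0, 0.0, ta])
    ys = ta.stack()                          # all per-step outputs, like scan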
Btw, in the blog post, it is written that TF is focusing on dynamic graphs now. While this indeed was an important focus when TF2 was introduced, I'm not sure whether they might take a step back again. Of course this is just speculation. But I think even internally, they are seeing the problems with dynamic graphs, and many groups still use the non-eager mode with static graphs and don't have any intention to switch away from that.
> I get confused with tensor computation libraries (or computational graph libraries, or symbolic algebra libraries, or whatever they’re marketing themselves as these days).
Aren't tensors a sort of generalisation of matrices? How are they equivalent to graphs?
The word tensor in this context refers to a multidimensional array, not to a tensor in the mathematical sense. The computation graph is simply a representation of a sequence of arithmetic operations that you're performing on some data.
The last arguments about why you would want a static graph, and even its drawbacks and complaints, sound basically similar to the arguments for why you would want to do functional programming.
Here's the book:
https://aiprobook.com/deep-learning-for-programmers/
Here's the open source library built throughout the book:
https://github.com/uncomplicate/deep-diamond
Some chapters from the beginning of the book are available on my blog, as a tutorial series:
https://dragan.rocks