
> backpropagation, and its polynomial time complexity

How do you reconcile the NP-completeness result in [1] about training neural networks with your claim?

[1] A. L. Blum, R. L. Rivest, Training a 3-Node Neural Network is NP-Complete. https://proceedings.neurips.cc/paper/1988/file/3def184ad8f47...




Thanks, wow, that's a great question to check my assumptions. It turns out I don't know where my claim that backpropagation's time complexity is polynomial comes from! I am repeating what I know from the time of my Master's in 2014, when I studied neural nets. It was most likely something discussed with my tutors during the course (I had some excellent tutors).

In my defense, it certainly doesn't seem to be just my own personal belief. For example, here are lecture notes from a course on backpropagation, which conclude with a sketch proof of linear time complexity, O(|V| + |E|), for the computation of a partial derivative over a network with V units and E connections, if I got that right:

https://www.cs.princeton.edu/~rlivni/cos511/lectures/lect16....

There are also other, similar calculations I could find floating around on the internet. Those generally seem to look at the complexity of calculating partial derivatives by automatic differentiation, basically.
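
To illustrate what those calculations are getting at, here's a toy sketch of my own (not from the lecture notes linked above; the names and structure are mine) of reverse-mode automatic differentiation over a small computational graph. The point is that one gradient evaluation touches each node and each edge a constant number of times, which is where the O(|V| + |E|) figure comes from:

    # Toy sketch of reverse-mode autodiff (my own illustration, not from the
    # notes above). One gradient evaluation visits each node once forward and
    # once backward, with O(1) work per edge, hence O(|V| + |E|) per example.
    import math

    class Node:
        def __init__(self, value, parents=(), local_grads=()):
            self.value = value               # forward value
            self.parents = parents           # incoming nodes (graph edges)
            self.local_grads = local_grads   # d(self)/d(parent), per parent
            self.grad = 0.0                  # accumulated derivative

    def add(a, b):
        return Node(a.value + b.value, (a, b), (1.0, 1.0))

    def mul(a, b):
        return Node(a.value * b.value, (a, b), (b.value, a.value))

    def sigmoid(a):
        s = 1.0 / (1.0 + math.exp(-a.value))
        return Node(s, (a,), (s * (1.0 - s),))

    def backward(output, topo_order):
        # topo_order lists every node after its parents; sweeping it in
        # reverse uses each edge (node -> parent) exactly once.
        output.grad = 1.0
        for node in reversed(topo_order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local

    # Tiny "network": y = sigmoid(w*x + b)
    x, w, b = Node(2.0), Node(-0.5), Node(0.1)
    m = mul(w, x)
    h = add(m, b)
    y = sigmoid(h)
    backward(y, [x, w, b, m, h, y])
    print(w.grad, b.grad)  # partial derivatives of y w.r.t. w and b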

Then there are all the scholarly articles claiming that training neural nets by backpropagation is not efficient (which is not the same as claiming that automatic differentiation is not efficient) and that the corresponding decision problem is somewhere in the class NP, or worse. I found a whole bunch of such papers while investigating my assumption of polynomiality just now, thanks to your question, and I even found some of them on my hard drive, which I didn't remember having!
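
To make that distinction concrete: the efficient thing is a single gradient evaluation; the hard thing is the decision problem that Blum & Rivest study (their "loading" problem). From memory, and glossing over the details of their 3-node construction, it is roughly:

    \textsc{Loading}: \text{ given a sample } S \subseteq \{0,1\}^n \times \{0,1\},
    \text{ decide whether } \exists\, \mathbf{w} \text{ such that } N_{\mathbf{w}}(x) = y
    \text{ for all } (x, y) \in S

where N_w is their fixed, small threshold network. The per-step cost of backpropagation says nothing about how many steps (or restarts) are needed to find such a w, or whether one exists at all.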

Well, it seems there is an ongoing debate, continuing to this day, that goes rather further back than the Blum and Rivest paper. Most results I could find seem to be on the side of NP-completeness (or worse).

However, all those results must be examined side-by-side with the irrefutable empirical evidence that training deep neural nets with thousands of layers and millions of units, on large datasets no less, is common in practice.

I think the answer lies in the observation that, to compute any complexity result, one must make certain axiomatic assumptions about the structure of the machine (in the abstract sense) under investigation. These assumptions may well differ from the assumptions made in common practice for the same general kind of machine. So, for example, it seems that the Blum & Rivest paper was criticised, even dismissed as irrelevant, at the time, because it assumed a discrete activation function, when continuous activation functions were already the norm (that was in 1992).
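
For what it's worth, that modelling difference is easy to see in code. The following is just my own illustration of the two kinds of unit (the names are mine; nothing here is from the paper): a hard-threshold unit, like those in the Blum & Rivest construction, has a derivative of zero almost everywhere, so gradient-based training doesn't directly apply to it, while the continuous unit is what backpropagation assumes:

    # My own toy illustration of the two activation regimes (not from the paper).
    import math

    def threshold_unit(weights, bias, inputs):
        # Discrete activation, as in the Blum & Rivest 3-node network:
        # output is 0 or 1, derivative is 0 almost everywhere.
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 if s >= 0.0 else 0.0

    def sigmoid_unit(weights, bias, inputs):
        # Continuous activation: differentiable everywhere, which is what
        # backpropagation's gradient computation relies on.
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-s))

    print(threshold_unit([0.5, -0.3], 0.1, [1.0, 1.0]))  # -> 1.0
    print(sigmoid_unit([0.5, -0.3], 0.1, [1.0, 1.0]))    # -> ~0.574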

Especially with neural nets, it seems that the wild variety of architectures makes it very hard to derive results with general applicability. The following letter to the editor of the journal Neural Networks, from 1997, makes this point. It also notes that the popular assumption that a result for a simpler architecture can be generalised to more complex architectures is contradicted by the observation that adding more layers or units can _sometimes_, counterintuitively, improve the efficiency of training (though that is not, itself, an observation that holds in general):

https://dl.acm.org/doi/10.1016/S0893-6080%2897%2900041-5

Unfortunately this is behind a paywall, but email me at the address associated with this github account: https://github.com/stassa and I can send you a copy (totally clandestinely! We'll break the law together :)

The bottom line is: I have no idea whether my claim above about the polynomial time complexity of backpropagation is right or wrong. _But_ it is certainly possible, _in practice_, to train deep neural nets on large datasets, _and_ this was possible even before the use of GPUs was common.

Which is not to say that hardware was not a very important factor in the current domination of deep neural nets in AI research. Hardware, and data.



