Coding the History of Deep Learning 256 points by saip on Sept 22, 2017 | 50 comments

 If you want a more nuanced account of the history of deep learning in neural networks, here is an excellent historical survey paper: https://arxiv.org/abs/1404.7828
 I'd also recommend Andrey Kurenkov's well-written multi-part series on the history of neural nets: https://news.ycombinator.com/item?id=10910887 [The author of the paper mentioned in the parent comment is Jürgen Schmidhuber, co-inventor of LSTMs and a very colorful character in neural land. The NYTimes did a nice profile on him a while back: https://www.nytimes.com/2016/11/27/technology/artificial-int... HN Discussion: https://news.ycombinator.com/item?id=13066646]
 Least squares, gradient descent and linear regression separately? I get that he wants to point out the profundity and universality of the ideas encompassed in those techniques (& models; least squares and gradient descent are rightly thought of as (numerical) techniques, whereas the linear regression model is, well, a model) but that is like saying that arithmetic is fundamental to deep learning. Essentially, this "history" only takes you to 1947 and Minsky.
 Also why do linear regression (OLS) models need gradient descent at all? Can't you calculate the parameters directly? y = X β + ε ...and a few assumptions give you... β* = (X^t X)^-1 X^t y. I might be missing something in the blog post.
 Others have pointed out that matrix inversion is O(n^3) and hence computationally infeasible. It is also worth considering that the conditioning of X^t X, k(X^t X) can be as large as k(X)^2, so solving in this way can be very unstable.
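A small numerical check of the conditioning point, as a sketch only. It assumes NumPy and reuses the three example points quoted elsewhere in this thread as a design matrix with an intercept column; that choice of data is an illustration, not something taken from the blog post.

```python
import numpy as np

# Design matrix for the points (3, 6), (6, 9), (12, 18):
# an intercept column of ones plus the x values.
X = np.array([[1.0, 3.0],
              [1.0, 6.0],
              [1.0, 12.0]])

cond_X = np.linalg.cond(X)            # 2-norm condition number of X
cond_gram = np.linalg.cond(X.T @ X)   # condition number of the normal-equations matrix

# In the 2-norm, the second number is the square of the first.
print(cond_X, cond_gram, cond_X ** 2)
```

Even on this tiny, well-behaved problem, forming X^t X squares the condition number; with nearly collinear predictors the same squaring is what makes the normal equations fragile.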
 While this is correct, you do not need to compute the inverse to solve this kind of least squares problem (and you do not need the inverse to solve any linear system of equations).
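To make "no inverse needed" concrete, here is a minimal sketch assuming NumPy. It fits the three example points from this thread two ways: by solving the normal equations as a linear system, and by calling a least squares solver on X directly. The data and variable names are illustrative assumptions.

```python
import numpy as np

# Intercept column plus x values for the points (3, 6), (6, 9), (12, 18).
X = np.array([[1.0, 3.0],
              [1.0, 6.0],
              [1.0, 12.0]])
y = np.array([6.0, 9.0, 18.0])

# Option 1: solve (X^t X) beta = X^t y as a linear system, without forming an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: a least squares solver that works on X itself, so X^t X is never formed.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # [intercept, slope]
print(beta_lstsq)
```

Both calls return the same intercept and slope here; the second avoids squaring the condition number because it never builds X^t X.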
 You can, yes, but inverting a matrix is computationally expensive, and for a large X, numerical optimization methods can be much more time/space efficient.
 From a pedagogical point of view, I think it's a very strange choice to go with gradient descent in this case. It makes linear regression look like something much more complicated than it actually is. People might be misled into thinking they need to hand code gradient descent every time they do a regression, for their 100 point dataset.
 I upvoted the replies about computational cost and stability, because they raise an important point. But what I was trying to talk about (as you are) was the presentation of the idea. Linear models are much older than computers, dating back to Gauss at least, and they do not have anything to do with gradient descent.
 Really quickly: matrix inversion is ~O(n^3); gradient descent is ~O(np) per iteration, where p is the number of predictors and n is the number of observations (X is an n x p matrix). For lasso, the derivative of the penalty term does not exist at every point, so coordinate descent is used.
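A minimal sketch of where the O(np) per-iteration cost comes from, assuming NumPy. The learning rate and iteration count are assumptions picked for this toy data, which is the three-point example from elsewhere in the thread.

```python
import numpy as np

# n = 3 observations, p = 2 parameters (intercept and slope).
X = np.array([[1.0, 3.0],
              [1.0, 6.0],
              [1.0, 12.0]])
y = np.array([6.0, 9.0, 18.0])

beta = np.zeros(2)
learning_rate = 0.002   # assumed; must be small enough for this unscaled data
n_iterations = 20000    # assumed; generous so the slow direction converges

for _ in range(n_iterations):
    residual = X @ beta - y          # one matrix-vector product: O(n p)
    gradient = 2 * X.T @ residual    # another matrix-vector product: O(n p)
    beta -= learning_rate * gradient

print(beta)  # approaches the least squares intercept and slope
```

Each step is cheap, but many steps are needed; on a problem this small the direct solve is clearly the better tool, which is the pedagogical complaint raised above.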
 I know it is popular to say that these techniques are based on how the brain works, but when I read about them, I have my doubts. Can anyone take a real world example of human behavior and show me how it relates to how these techniques predict humans will behave? I love the field but feel like there is a temptation to take giant leaps not supported by other observations.
 We say ANNs are "based on how the brain works" because the original mathematical model was an attempt by McCulloch and Pitts to explain how complex behavior arises from networks of simple neurons. A neuron is either activated or not and each of the many inputs can be either excitatory (encourages activation) or inhibitory (discourages activation). McCulloch and Pitts formalized this as a weighted average of the inputs that was then thresholded to 0 or 1. And they showed some basic theoretical results from that which gave it some credit as a model for how intelligence can arise from neurons. Essentially they said behavior can be described as a classifier. AFAIK, they didn't go much into how the weights were actually learned. Different strategies were tried, but we ultimately started to soften the threshold function into the logistic function (to make the network differentiable) and solve for the weights by gradient descent. Modern Deep Learning makes the additional assumption that neurons in the same layer are not interconnected. This assumption, along with the fact that we're just dealing with weighted averages, allows us to describe networks in matrix form, allows us to compute the gradients with backprop, and allows efficient simulation on the GPU. This assumption is more practical than biological. > show me how it relates to how [...] humans will behave? [This page][1] attempts to connect the dots between the McCulloch and Pitts model, the resulting classifiers, and behavior. Essentially, the theory was that neurons can be formalized into classifiers, and behavior is just the output of these classifiers. I don't know too much about modern neuroscience, but given the amazing results we are seeing these days in vision, language, and planning, I'd say the central ideas of the theory are still credible.
 First of all, neurons don't have just one activation function. Each dendrite has one. So, anything from dozens to thousands. Second, that definition doesn't cover the entire issue of multiple feedback loops. Third, this doesn't cover memory effects at structural (cytoskeleton) and local levels (vesicles), much less generic levels (RNA and your genes). And then we haven't even gotten into metabolomic and epigenetic wiring in your neurons ... Calling those chained regressions similar to the brain is about as correct as saying that a 3-year-old's drawing of a car is similar to a real Tesla...
 I mean, doi. McCulloch and Pitts published in the 1940s. Of course we know more about the brain now. If I were to ask you "How does intelligence arise from a network of activations?" would you genuinely say that it has nothing to do with the McCulloch and Pitts theory?
 I would honestly say we really have no clue, and maybe add that as far as we can tell, activations play as much of a role in intelligence as a myriad of other factors. But more generally, I am just so tired of this "brain metaphor" on deep learning. It is a funny way to wake up your students (well, at least 10 years ago it was...), but trying to stretch this metaphor much more than that is just painful. Heck, even the activating "functions" (plural, as we now know) in a neuron aren't really a set (!) of (singular, independent) functions, it's just a top level name for a mind-boggling number of things happening as neurons "fire", with a mathematical formalism to approximate what's going on. In fact, calling an activation a "function" is probably belittling the biological processes behind them.
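The two kinds of unit described in the comment above can be sketched in a few lines of plain Python. The function names, the AND-gate weights, and the bias value are illustrative assumptions, not taken from McCulloch and Pitts or from the blog post.

```python
import math

def threshold_unit(inputs, weights, threshold):
    """Hard-threshold unit: fires (1) when the weighted sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def logistic_unit(inputs, weights, bias):
    """Softened version: a differentiable output between 0 and 1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Two excitatory inputs (positive weights) wired as a logical AND.
print(threshold_unit([1, 1], [1, 1], threshold=2))   # fires
print(threshold_unit([1, 0], [1, 1], threshold=2))   # does not fire

# An inhibitory input (negative weight) can veto the activation.
print(threshold_unit([1, 1, 1], [1, 1, -2], threshold=2))

# The logistic unit gives a graded response instead of a hard 0/1.
print(logistic_unit([1, 1], [1, 1], bias=-2.0))
```

The hard threshold has zero gradient almost everywhere, which is why the softened unit is the one that can be trained by gradient descent.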
 "Inspired by biology" is typically a better way to think about it. Airplanes have wings inspired by birds, and they share some structural similarities, but in practice they serve very different functions.
 I would even say that this is somewhat revisionist history. From my perspective, this all started from an attempt by Kolmogorov to solve Hilbert's 13th problem: https://en.wikipedia.org/wiki/Hilbert%27s_thirteenth_problem Kolmogorov authored a paper titled "On Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of Smaller Number of Variables," that basically solved this in 1961. This led to a nice back and forth series of papers between Kolmogorov and Arnold, but the one that becomes more important is Kolmogorov's paper, "On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition," in 1963. What this paper proves is that any continuous function defined on the n-dimensional unit cube can be represented by the superposition of 2n one dimensional continuous functions: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr... Now, the problem with this theorem is that it doesn't say how to find these magical 2n functions. However, in 1989 Cybenko published the paper, "Approximation by Superpositions of Sigmoidal Functions," which both extends and weakens the above result. Basically, he loses the 2n bound, but gives a way to construct these functions by using a linear projection inside of a superposition of sigmoids. This led to the universal approximation theorem: https://en.wikipedia.org/wiki/Universal_approximation_theore... and I would contend the underpinnings for the modern neural net models. Now, is there any biology in there? No. It's a long series of function approximation papers. That said, I don't know the authors involved or what inspired them to write these papers. However, given that we have a documented history of dry function approximation papers that give us the mathematical power that we need to begin to justify these models, I tend to feel that the biological connections are oversold.
 That timeline seems to miss that Yann LeCun was already working on ConvNets in 1988. I don't think anyone waited for the Universal Approximation theorem to start building neural architectures, it was just a tangentially interesting mathematical result.
 Which paper are you speaking about? Certainly, I'm always interested in a more complete history. I'm currently on LeCun's page and can't figure out which paper you're speaking to: http://yann.lecun.com/exdb/publis/index.html More generally, a common trope in NN papers and books is to draw a graph for matrix-vector multiplication and then draw the analogy that these are like neurons in the brain and this represents their connectivity. This is an example of the kind of backwalking biological analogies that frustrate me. Again, certainly, I don't know the motivations behind everyone in the field, but I do contend that many of the more powerful theorems have nothing to do with biology and have other origins.
 It would probably be more accurate to say "these techniques are based on how we thought the brain works". For a historical summary, it's not the worst sin in the world.
 Would love to see mention of several of the main contributors to deep learning, such as Geoffrey Hinton, the “father” of deep learning, Andrew Ng and Demis Hassabis in future posts.
 Geoff Hinton - surely. But I think most experts will disagree on the other two. In terms of deep fundamental contributions I don't think the other two have made much. I think Andrew Ng has been a great popularizer/marketing guy - primarily with that Cats project. Likewise Demis Hassabis has been a great application creator - with amazing results of course - AlphaGo, Atari, etc. On a side note: I lost all respect for Andrew Ng after the Baidu cheating scandal. https://www.nytimes.com/2015/06/04/technology/computer-scien... I felt he got away too easy on that, without any apology or even a public statement - especially considering he is a former academic. (On top of that, he silently deleted his Google+ posts.) Imagine if something like that had happened at a Google research team - I am pretty sure Jeff Dean or Peter Norvig would have stepped down.
 I didn't even realize this happened. Thank you for posting. Sucks because now I have less respect for Andrew. I don't understand why you'd want to cheat for a competition like this? I get it, people cheat all the time, but the field of machine learning is built on a foundation of open and shared research, and trust.
 On skim-reading the article it seems they were banned from submitting entries to a competition server for 12 months because they made a significant number of submissions. It's arguable that they were gaming the system somewhat, but unless a limit was explicitly defined then this just seems like they were doing a lot of exploration in the area. Imagine if you published some research showing you'd made something that did something cool, but then people lost respect for you because you'd made a lot of previous attempts.
 The specified limit was 2 submissions per week, according to the people organizing the competition:
 1. Third paragraph: "twice a week" 2. They registered multiple accounts to get around the limit. That removes almost all doubt that the submitter knew he/she was cheating
 From a purely technical point of view, I agree with you. But there is no doubt they have all played an important role in popularising deep learning. I am fascinated by the history of deep learning and how it went from a field no one cared about to what it is today.
 IMHO the ImageNet 2012 competition and the winning solution AlexNet (Krizhevsky et al.) was the pivotal moment in which (deep) neural nets went from a field only a few wizened academics cared about to becoming today's buzz word. It did 3 things: 1. Provided a usable solution for what was previously an intractable real world problem, large multiclass image classification, with decent accuracy. 2. Crushed the prior benchmark on this task. 3. Found a practical workaround to what was the biggest bottleneck, computation time, by utilizing GPUs (and made Nvidia stock explode /s). The subsequent ImageNet competitions then provided the perfect catalyst for refining deep neural nets and making them mainstream. In parallel, the sudden interest from everyone else, who in turn started applying neural nets to pretty much every domain under the sun, was what I think ultimately made deep learning what it is today.
 The article mentions that GPUs are on average 50-200 times faster for deep learning, I’m curious how he came to that number. It has a lot to do with the code and the frameworks used. I haven’t come across a good comparison, most figures seem to be taken out of the blue.
 From my experience the speedup is more around 20x-50x.
 I find this history confusing. Legendre guessing by hand? Spaghetti on the wall? No mention of the massive work of Laplace and others that led up to Legendre and Gauss, or Gauss's connection of the notion to probability? This is truly a bizarre view. And then the idea that numerical optimization accounting for the slope was novel. How does he think that mathematicians calculated for the preceding centuries? Linear regression springs fully formed in the 1950's and '60's? What happened to Fisher and Student and Pearson and all the rest? Where's Hopfield? Where's Potts? Where's an awareness of the history of mathematics in general?
 This seems like a great introduction to the history. I have a problem with it, though. In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from? Later in that same example, there is an "Error = 4^2 + (-1)^2 + 6^2". Where did those values come from? Later, there's another form: "Error = x^5 - 2x^3 -2". What about these? There seem to be magic formulae everywhere, with no real explanation in the article about where they came from. Without that, I have no way of actually understanding this. Am I missing something fundamental here?
 I'd also like to see more of a "teaching" post that can walk through the math incrementally. Many of the deep learning courses assume "high school math", but my school must have skipped matrices, so I've been watching Khan Academy videos. Are there any good posts / books on walking through the math of deep learning from a true beginner's perspective?
 The other replies are already telling you that these are just examples. I want to stress that these are completely unrelated examples, which is bad form IMO. If the first example had been kept, then the second would have been "Error = (6 - (2·3 + 1))² + (9 - (2·6 + 1))² + (18 - (2·12 + 1))² = (-1)² + (-4)² + (-7)² = 66", which is what compute_error_for_line_given_points evaluates to. The third would have been "Error = (6 - (m·3 + b))² + (9 - (m·6 + b))² + (18 - (m·12 + b))² = 3·b² + 42·b·m - 66·b + 189·m² - 576·m + 441" and its derivative would have to be taken in two directions, giving "dError/dm = 42·b + 378·m - 576" and "dError/db = 6·b + 42·m - 66". Visualizing that slope would require a 3D plot.
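The arithmetic in the comment above can be checked with a short sketch. The function below is a reconstruction based only on how the thread describes the call (arguments 1, 2 and the list of points, defining the line y = 2x + 1); it is not the article's actual code, and the gradient helper is a hypothetical addition.

```python
def compute_error_for_line_given_points(b, m, points):
    """Sum of squared vertical distances between each point and the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

def compute_gradient(b, m, points):
    """Partial derivatives of the summed squared error with respect to b and m."""
    d_b = sum(-2 * (y - (m * x + b)) for x, y in points)
    d_m = sum(-2 * x * (y - (m * x + b)) for x, y in points)
    return d_b, d_m

points = [[3, 6], [6, 9], [12, 18]]

print(compute_error_for_line_given_points(1, 2, points))  # 66, matching the comment
print(compute_gradient(1, 2, points))                     # (24, 222)

# The closed-form derivatives from the comment give the same numbers at b = 1, m = 2.
b, m = 1, 2
print(6 * b + 42 * m - 66, 42 * b + 378 * m - 576)        # 24 222
```

Both partial derivatives are positive at (b = 1, m = 2), so a gradient descent step would decrease both the intercept and the slope.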
 >Am I missing something fundamental here? Yeah, these aren't magic formulae, they are just examples. >In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from? It's an example. The first two arguments define a line y = 2x + 1, the pairs are (x,y) points being used to compute the error. "To play with this, let’s assume that the error function is Error=x^5−2x^3−2" This is just an example of a function used as exposition to talk about derivatives. It isn't even an error function though. An error function has to be a function of at least two variables.
 Good point. They are all example data. The "[[3,6],[6,9],[12,18]]" can be thought of as the coordinates of a comet, and 2 is your predicted correlation, the slope, followed by 1, your predicted constant, the y-intercept. In this case, you want to change 2 and 1 to find the combination that results in the lowest error. It's the same with "Error = 4^2 + (-1)^2 + 6^2", it's an example of an error function. Does that make sense?
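For the one-variable example quoted in this sub-thread, Error = x^5 - 2x^3 - 2, following the slope downhill can be sketched as below. The starting point, learning rate, and iteration count are assumptions chosen for illustration; the function is unbounded below for very negative x, so this only finds the local minimum to the right of zero.

```python
import math

def error(x):
    return x ** 5 - 2 * x ** 3 - 2

def slope(x):
    # Derivative of x^5 - 2x^3 - 2.
    return 5 * x ** 4 - 6 * x ** 2

x = 2.0               # assumed starting guess
learning_rate = 0.01  # assumed step size

for _ in range(1000):
    x -= learning_rate * slope(x)  # step against the slope

# The slope is zero where 5x^2 = 6, i.e. at x = sqrt(6/5) for positive x.
print(x, math.sqrt(6 / 5), error(x))
```

This is the same update rule as in the two-parameter regression case, just with a single number being adjusted.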
 Spot on. I struggled with the mainstream deep learning/machine learning MOOCs. I felt like they were too math heavy. However, I'm struggling on how to learn deep learning. I get polarized advice on it. Some argue that you need a degree or certificates from established MOOCs, others keep recommending me to do Kaggle challenges. Has anyone managed to land a decent deep learning job without formal CS/machine learning training? How did you approach it?
 "I felt like they were too math heavy. However, I'm struggling on how to learn deep learning." These statements are in contention. You will never really understand machine learning without learning a fair bit of the math. I do think a lot can be done on the presentation of the material, and certainly don't think much of credentialism. Honestly, in your shoes I would look for a position where you can learn from people internally, rather than try and qualify yourself first. Even if you do a bunch of online learning and toy problems, you are going to flail about if you don't have a strong mentor in your first position. What related/supportive skills do you have to bring to a group that is doing ML? Edit: I should add that you don't really have to understand much these days to integrate (some) ML into a system, but you aren't going to get very far into modeling or understanding issues without some background. You can only get so far with black boxes.
 Thanks for your reply. I do agree with you, in general, and have been trying to get myself involved in more ML projects at my current work. I have around 8 years of professional software experience (C++/C#) and have fiddled around with some rudimentary machine learning for work, like linear regression, k-means clustering, etc. I have a decent idea of how/why they work, but have fallen flat on my face when learning the theory behind more complicated algorithms, e.g. Hessians from Andrew Ng's class. In my experience, many classes tend to focus on a ground up approach. With higher level frameworks like Keras, how necessary is this?
 >With higher level frameworks like Keras, how necessary is this? I would wager that you've heard this line before, but it all depends on the particulars of what you are trying to do. If you want to develop a first principles understanding of what's going on it's probably important. It will be less important if you just need to see the empirical performance of an established method on your new dataset. >but have fallen flat on my face when learning the theory behind more complicated algorithms, e.g. Hessians from Andrew Ng's class Reading in between the lines, maybe this is a question about Newton's method? One of the general strategies shared between software development and "mathematical" (for lack of a better word) science and engineering is to reduce a complex problem to a known use case. If you've got a grasp on linear regression, take a look at Newton's method in this case. You may be pleasantly surprised to see that the Hessian is constant. This might make it easier to make the connection to relevant topics such as the convergence rate of the method and the connection to the uncertainty in the fit.
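A minimal sketch of the Newton's method suggestion, assuming NumPy and reusing the three example points from earlier in the thread; the starting guess is arbitrary. For a squared-error objective the Hessian is 2 X^t X, which does not depend on the parameters, so a single Newton step lands on the least squares solution.

```python
import numpy as np

# Intercept column plus x values for the points (3, 6), (6, 9), (12, 18).
X = np.array([[1.0, 3.0],
              [1.0, 6.0],
              [1.0, 12.0]])
y = np.array([6.0, 9.0, 18.0])

def gradient(beta):
    # Gradient of the summed squared error with respect to [intercept, slope].
    return 2 * X.T @ (X @ beta - y)

# The Hessian of the summed squared error is constant in beta.
hessian = 2 * X.T @ X

beta_start = np.array([1.0, 2.0])  # the line y = 2x + 1 from the thread
beta_newton = beta_start - np.linalg.solve(hessian, gradient(beta_start))

print(hessian)                # [[  6.  42.] [ 42. 378.]]
print(beta_newton)            # intercept and slope after one step
print(gradient(beta_newton))  # approximately zero: already at the minimum
```

The entries of the Hessian are the same coefficients that appear in the hand-derived dError/db and dError/dm expressions earlier in the thread.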
 This is something I've also struggled with. I find it hard to read deep learning papers because I need to translate each piece of math notation, and so I struggle to get the bigger picture. I'm fond of the bottom-up approach, e.g. I started by mastering C and wrote my own libraries. But for deep learning I lean towards the opposite, starting with high-level libraries. When I want to understand the theory I search for simple python code that I can implement from scratch. This way I can understand the logic, without having to understand all the math behind it. I've mostly focused on doing Kaggle type of problems and used MOOCs when I get stuck. I've had little interest from larger companies, but I've managed to get a few offers from startups. Startups often have a couple of people with PhD-level knowledge but are also looking for programmers that can code the models.
 "Machine Learning Engineer" is a title we're going to see more and more of (and we're already seeing a lot).Its one thing to know the math and theory to design, train, and tune the algorithm your company needs. But implementing it into production, at scale? That's not the same person.Ideally, you have Person/Team A, who designs but knows enough about implementation to keep that in mind during their process, and Person/Team B who implements it into the software but knows enough about the design to make it work.
 Truly ideally you have someone/team who actually can do both properly. However, very few people can. And if you have one, you may not be able to justify their time on all aspects. So the compromise is usually as you describe, but you bear the cost of translation issues no matter how you do this. It's worth remembering that it is a compromise. I think systems like tensorflow are implicitly a recognition of this, allowing lower impedance between the groups.
 There's a difference between people who can implement models and those that can create them -- startups could use people who do the former, and many don't actually need the latter.
 “It’s been used in Andrew Karpathy’s deep learning course at Stanford,” Andrew is now working at Tesla. I believe this is his course:
 Yes, you can find it here: https://youtu.be/i94OvYb6noo?t=59m6s
 All of the scientists I love (even Heaviside!) in one article. I am so pleased!
 No mention of who invented convnets?
 Which is a long-winded way to show that deep learning isn't much more than a concatenation of glorified regression functions... :-) Edit: There we go with the downvotes, I knew that deep learning guys can't stand this claim (but it's true, as the post itself goes to show at great length... :-))
