[The author of the paper mentioned in the parent comment is Jürgen Schmidhuber, inventor of LSTMs and a very colorful character in neural land.
The NYTimes did a nice profile on him a while back:
HN Discussion: https://news.ycombinator.com/item?id=13066646]
y = X β + ε ...and a few assumptions give you... β* = (X^T X)^-1 X^T y
I might be missing something in the blog post.
Linear models are much older than computers, dating back to Gauss at least, and they do not have anything to do with gradient descent.
solving the normal equations means inverting the p x p matrix X^T X, which is ~O(p^3) (plus ~O(np^2) to form it)
gradient descent is ~O(np) per iteration, where p is the number of predictors and n the number of observations (an n x p design matrix).
for lasso, the L1 penalty is not differentiable at zero, so coordinate descent is used instead.
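To make the comparison concrete, here's a minimal sketch (the synthetic data, sizes, and step size are made up for illustration) of the closed-form solution next to gradient descent:

    import numpy as np

    # Synthetic data: n observations, p predictors (sizes and coefficients are arbitrary).
    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = rng.normal(size=(n, p))
    true_beta = np.array([2.0, -1.0, 0.5])
    y = X @ true_beta + rng.normal(scale=0.1, size=n)

    # Closed form (normal equations): beta = (X^T X)^-1 X^T y.
    # Solving the p x p system costs roughly O(n p^2 + p^3).
    beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent: each step costs O(n p) to evaluate the gradient of the squared error.
    beta_gd = np.zeros(p)
    lr = 0.01
    for _ in range(5000):
        grad = (2.0 / n) * X.T @ (X @ beta_gd - y)
        beta_gd -= lr * grad

    print(beta_closed, beta_gd)  # both should be close to true_beta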
Can anyone take a real world example of human behavior and show me how it relates to how these techniques predict humans will behave?
I love the field but feel like there is a temptation to take giant leaps not supported by other observations.
A neuron is either activated or not, and each of its many inputs can be either excitatory (encourages activation) or inhibitory (discourages activation). McCulloch and Pitts formalized this as a weighted sum of the inputs that was then thresholded to 0 or 1. And they proved some basic theoretical results that gave it credibility as a model for how intelligence can arise from neurons. Essentially they said behavior can be described as a classifier.
AFAIK, they didn't go much into how the weights were actually learned. Different strategies were tried, but we ultimately started to soften the threshold function into the logistic function (to make the network differentiable) and solve for the weights by gradient descent.
Modern deep learning makes the additional assumption that neurons in the same layer are not interconnected. This assumption, along with the fact that we're just dealing with weighted sums, allows us to describe networks in matrix form, to compute the gradients with backprop, and to simulate them efficiently on the GPU. The assumption is more practical than biological.
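As a toy sketch of those three ideas (the weights, thresholds, and inputs below are made up): a hard-threshold McCulloch–Pitts unit, its logistic-softened version, and a whole layer written as a matrix product:

    import numpy as np

    def mcculloch_pitts_neuron(x, w, threshold):
        # Hard-threshold unit: fire (1) iff the weighted sum of inputs clears the threshold.
        return 1 if np.dot(w, x) >= threshold else 0

    def logistic_neuron(x, w, b):
        # Softened version: replace the step with a logistic so the output is differentiable.
        return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

    def dense_layer(X, W, b):
        # With no intra-layer connections, a whole layer is just a matrix product plus a nonlinearity.
        return 1.0 / (1.0 + np.exp(-(X @ W + b)))

    x = np.array([1.0, 0.0, 1.0])   # three binary-ish inputs
    w = np.array([0.6, -0.4, 0.5])  # positive weight = excitatory, negative = inhibitory
    print(mcculloch_pitts_neuron(x, w, 1.0))  # 1: weighted sum 1.1 >= 1.0
    print(logistic_neuron(x, w, -1.0))        # ~0.52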
> show me how it relates to how [...] humans will behave?
[This page] attempts to connect the dots between the McCulloch and Pitts model, the resulting classifiers, and behavior. Essentially, the theory was that neurons can be formalized into classifiers, and behavior is just the output of these classifiers. I don't know too much about modern neuroscience, but given the amazing results we are seeing these days in vision, language, and planning, I'd say the central ideas of the theory are still credible.
Calling those chained regressions similar to the brain is about as correct as saying that a 3-year-old's drawing of a car is similar to a real Tesla...
McCulloch and Pitts published in 1943. Of course we know more about the brain now.
If I were to ask you, "How does intelligence arise from a network of activations?", would you genuinely say that it has nothing to do with the McCulloch and Pitts theory?
But more generally, I am just so tired of this "brain metaphor" in deep learning. It is a funny way to wake up your students (well, at least 10 years ago it was...), but trying to stretch the metaphor much further than that is just painful. Heck, even the activation "functions" (plural, as we now know) in a neuron aren't really a set (!) of (singular, independent) functions; it's just a top-level name for a mind-boggling number of things happening as neurons "fire", with a mathematical formalism to approximate what's going on. In fact, calling an activation a "function" is probably belittling the biological processes behind it.
Kolmogorov authored a paper titled "On Representation of Continuous Functions of Several Variables by Superpositions of Continuous Functions of a Smaller Number of Variables," which basically solved this in 1961. This led to a nice back-and-forth series of papers between Kolmogorov and Arnold, but the one that became more important is Kolmogorov's paper, "On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition," in 1963. What this paper proves is that any continuous function defined on the n-dimensional unit cube can be represented as a superposition of 2n + 1 one-dimensional continuous functions (composed with sums of one-dimensional inner functions):
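(In the usual statement of the theorem: f(x_1, ..., x_n) = Σ_{q=0}^{2n} Φ_q(Σ_{p=1}^{n} φ_{q,p}(x_p)), where the Φ_q and φ_{q,p} are continuous functions of a single variable.)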
Now, the problem with this theorem is that it doesn't say how to find these magical functions. However, in 1989 Cybenko published the paper "Approximation by Superpositions of a Sigmoidal Function," which both extends and weakens the above result. Basically, he loses the 2n + 1 bound, but gives a concrete form for the approximating functions: a superposition of sigmoids applied to linear projections. This led to the universal approximation theorem:
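(Cybenko's statement, roughly: finite sums of the form G(x) = Σ_{j=1}^{N} α_j σ(w_j·x + θ_j), with σ any continuous sigmoidal function, are dense in C([0,1]^n), i.e. they can approximate any continuous function on the unit cube to arbitrary accuracy.)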
and, I would contend, the underpinnings of modern neural net models. Now, is there any biology in there? No. It's a long series of function-approximation papers. That said, I don't know the authors involved or what inspired them to write these papers. However, given that we have a documented history of dry function-approximation papers that give us the mathematical power we need to begin to justify these models, I tend to feel that the biological connections are oversold.
More generally, a common trope in NN papers and books is to draw a graph for matrix-vector multiplication and then draw the analogy that these are like neurons in the brain and this represents their connectivity. This is an example of the kind of backwalking biological analogies that frustrate me. Again, certainly, I don't know the motivations behind everyone in the field, but I do contend that many of the more powerful theorems have nothing to do with biology and have other origins.
On a side note:
I lost all respect for Andrew Ng after the Baidu cheating scandal.
I felt he got away too easily on that, without any apology or even a public statement - especially considering he is a former academic. (On top of that, he quietly deleted his Google+ posts.)
Imagine if something like that had happened at a Google research team - I am pretty sure Jeff Dean or Peter Norvig would have stepped down.
I don't understand why you'd want to cheat in a competition like this. I get it, people cheat all the time, but the field of machine learning is built on a foundation of open, shared research and trust.
It's arguable that they were gaming the system somewhat, but unless a limit was explicitly defined, this just seems like they were doing a lot of exploration in the area.
Imagine if you published some research showing you'd made something that did something cool, but then people lost respect for you because you'd made a lot of previous attempts.
It did 3 things:
1. Provided a usable solution for what was previously an intractable real-world problem, large multiclass image classification, with decent accuracy
2. Crushed the prior benchmark on this task
3. Found a practical workaround to what was the biggest bottleneck, computation time, by utilizing GPUs (and made Nvidia stock explode /s)
The subsequent ImageNet competitions then provided the perfect catalyst for refining deep neural nets and making them mainstream. In parallel, the sudden interest from everyone else, who in turn started applying neural nets to pretty much every domain under the sun, is what I think ultimately made deep learning what it is today.
And then there's the idea that numerical optimization accounting for the slope was novel. How does he think mathematicians did their calculations for the preceding centuries?
Linear regression springs fully formed out of the 1950s and '60s? What happened to Fisher and Student and Pearson and all the rest?
Where's Hopfield? Where's Potts? Where's an awareness of the history of mathematics in general?
In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from?
Later in that same example, there is an "Error = 4^2 + (-1)^2 + 6^2". Where did those values come from?
Later, there's another form: "Error = x^5 - 2x^3 -2" What about these?
There seem to be magic formulae everywhere, with no real explanation in the article about where they came from. Without that, I have no way of actually understanding this.
Am I missing something fundamental here?
Many of the deep learning courses assume "high school math", but my school must have skipped matrices, so I've been watching Khan Academy videos.
Are there any good posts / books on walking through the math of deep learning from a true beginner's perspective?
If the first example had been kept, then the second would have been "Error = (6 - (2·3 + 1))² + (9 - (2·6 + 1))² + (18 - (2·12 + 1))² = (-1)² + (-4)² + (-7)² = 66", which is what compute_error_for_line_given_points evaluates to.
The third would have been "Error = (6 - (m·3 + b))² + (9 - (m·6 + b))² + (18 - (m·12 + b))² = 3·b² + 42·b·m - 66·b + 189·m² - 576·m + 441" and its derivative would have to be taken in two directions, giving "dError/dm = 42·b + 378·m - 576" and "dError/db = 6·b + 42·m - 66". Visualizing that slope would require a 3D plot.
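A quick way to double-check that expansion and the two partial derivatives, using sympy (the symbol names just mirror the comment above):

    import sympy as sp

    m, b = sp.symbols('m b')
    points = [(3, 6), (6, 9), (12, 18)]

    # Sum of squared residuals of the line y = m*x + b over the three example points.
    error = sum((y - (m * x + b)) ** 2 for x, y in points)

    print(sp.expand(error))               # 3*b**2 + 42*b*m - 66*b + 189*m**2 - 576*m + 441
    print(sp.expand(sp.diff(error, m)))   # 42*b + 378*m - 576
    print(sp.expand(sp.diff(error, b)))   # 6*b + 42*m - 66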
>In the first example, the method compute_error_for_line_given_points is called with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from?
It's an example. The first two arguments define a line y = 2x + 1,
the pairs are (x,y) points being used to compute the error.
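For completeness, a minimal version of that function consistent with the numbers above (the original post's implementation may differ in detail, e.g. it might average rather than sum):

    def compute_error_for_line_given_points(b, m, points):
        # Sum of squared residuals of the line y = m*x + b over the given (x, y) points.
        return sum((y - (m * x + b)) ** 2 for x, y in points)

    print(compute_error_for_line_given_points(1, 2, [[3, 6], [6, 9], [12, 18]]))  # 66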
"To play with this, let’s assume that the error function is Error=x^5−2x^3−2"
This is just an example of a function used as exposition to talk about derivatives.
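It's only a one-variable toy, but the mechanics the article is getting at look roughly like this (the starting point and step size here are arbitrary):

    # Gradient descent on the one-variable toy function f(x) = x^5 - 2x^3 - 2:
    # f'(x) = 5x^4 - 6x^2, and f has a local minimum at x = sqrt(6/5) ~ 1.095.
    def f_prime(x):
        return 5 * x**4 - 6 * x**2

    x = 1.5           # arbitrary starting point
    for _ in range(1000):
        x -= 0.01 * f_prime(x)

    print(x)          # ~1.095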
It isn't even an error function though. An error function has to be a function of at least two variables.
Has anyone managed to land a decent deep learning job without formal CS/machine learning training? How did you approach it?
I felt like they were too math-heavy. However, I'm struggling with how to learn deep learning.
I do think a lot can be done on the presentation of the material, and certainly don't think much of credentialism.
Honestly, in your shoes I would look for a position where you can learn from people internally, rather than try and qualify yourself first. Even if you do a bunch of online learning and toy problems, you are going to flail about if you don't have a strong mentor in your first position.
What related/supportive skills do you have to bring to a group that is doing ML?
edit: I should add that you don't really have to understand much these days to integrate (some) ML into a system, but you aren't going to get very far into modeling or understanding issues without some background. You can only get so far with black boxes.
I have around 8 years of professional software experience (C++/C#) and have fiddled around with some rudimentary machine learning for work, like linear regression, k-means clustering, etc. I have a decent idea of how/why they work, but have fallen flat on my face when learning the theory behind more complicated algorithms, e.g. Hessians from Andrew Ng's class. In my experience, many classes tend to focus on a ground up approach. With higher level frameworks like Keras, how necessary is this?
I would wager that you've heard this line before, but it all depends on the particulars of what you are trying to do. If you want to develop a first-principles understanding of what's going on, it's probably important. It will be less important if you just need to see the empirical performance of an established method on your new dataset.
>but have fallen flat on my face when learning the theory behind more complicated algorithms, e.g. Hessians from Andrew Ng's class
Reading between the lines, maybe this is a question about Newton's method? One of the general strategies shared between software development and "mathematical" (for lack of a better word) science and engineering is to reduce a complex problem to a known use case. If you've got a grasp on linear regression, take a look at Newton's method in that case. You may be pleasantly surprised to see that the Hessian is constant. This might make it easier to make the connection to related topics such as the convergence rate of the method and the uncertainty in the fit.
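Here's a minimal sketch of what I mean (the data is made up): for ordinary least squares the Hessian doesn't depend on the parameters, so a single Newton step lands on the closed-form solution.

    import numpy as np

    # Newton's method on ordinary least squares: the loss ||y - X b||^2 has
    # gradient -2 X^T (y - X b) and a constant Hessian 2 X^T X, so one
    # Newton step from any starting point reaches the normal-equations solution.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([1.0, -3.0]) + rng.normal(scale=0.1, size=50)

    beta = np.zeros(2)                          # arbitrary starting point
    grad = -2 * X.T @ (y - X @ beta)
    hess = 2 * X.T @ X
    beta = beta - np.linalg.solve(hess, grad)   # one Newton step

    print(beta)
    print(np.linalg.solve(X.T @ X, X.T @ y))    # matches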
It's one thing to know the math and theory to design, train, and tune the algorithm your company needs. But implementing it in production, at scale? That's not the same person.
Ideally, you have Person/Team A, who designs but knows enough about implementation to keep that in mind during their process, and Person/Team B who implements it into the software but knows enough about the design to make it work.
So the compromise is usually as you describe, but you bear the cost of translation issues no matter how you do this. It's worth remembering that it is a compromise.
I think systems like TensorFlow are implicitly a recognition of this, allowing lower impedance between the groups.
Andrew is now working at Tesla. I believe this is his course:
<ducks because="had to get that one out there"/>
Edit: There we go with the downvotes. I knew deep learning guys can't stand this claim (but it's true, as the post itself goes to show at great length... :-))