
Coding the History of Deep Learning - saip
http://blog.floydhub.com/coding-the-history-of-deep-learning
======
iluvmylife
If you want a more nuanced account of the history of deep learning in neural
networks, here is an excellent historical survey paper:
[https://arxiv.org/abs/1404.7828](https://arxiv.org/abs/1404.7828)

~~~
netvarun
I'd also recommend Andrey Kurenkov's well-written multi-part series on the
history of neural nets:
[https://news.ycombinator.com/item?id=10910887](https://news.ycombinator.com/item?id=10910887)

[The author of the paper mentioned in the parent comment is Jürgen Schmidhuber -
inventor of LSTMs and a very colorful character in neural land. The NYTimes
did a nice profile on him a while back:
[https://www.nytimes.com/2016/11/27/technology/artificial-intelligence-pioneer-jurgen-schmidhuber-overlooked.html](https://www.nytimes.com/2016/11/27/technology/artificial-intelligence-pioneer-jurgen-schmidhuber-overlooked.html)
HN discussion:
[https://news.ycombinator.com/item?id=13066646](https://news.ycombinator.com/item?id=13066646)]

------
sun_n_surf
Least squares, gradient descent, and linear regression separately? I get that
he wants to point out the profundity and universality of the ideas encompassed
in those techniques (and models; least squares and gradient descent are rightly
thought of as numerical techniques, whereas linear regression is, well, a
model), but that is like saying that arithmetic is fundamental to deep
learning. Essentially, this "history" only takes you to 1947 and Minsky.

~~~
padthai
Also, why do linear regression (OLS) models need gradient descent at all?
Can't you calculate the parameters directly?

y = Xβ + ε ...and a few assumptions give you... β* = (X^t X)^-1 X^t y

I might be missing something in the blog post.
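
To make the comparison concrete, here is a minimal NumPy sketch (toy data and
names of my own invention, not from the post) contrasting the closed-form
solution with gradient descent on the same squared-error objective:

```python
# Toy comparison: closed-form OLS vs. gradient descent (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

# Closed form via the normal equations: beta = (X^t X)^-1 X^t y,
# solved as a linear system rather than by inverting anything.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error, same objective.
beta_gd = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = 2 * X.T @ (X @ beta_gd - y) / len(y)
    beta_gd -= lr * grad

print(beta_closed, beta_gd)  # both should be close to [1.0, 2.0]
```

The closed form is exact but requires forming and factoring X^t X; gradient
descent trades that for iteration, which is presumably the point of the post,
since iteration is what generalizes to models with no closed form.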

~~~
snakeboy
Others have pointed out that matrix inversion is O(n^3) and hence
computationally infeasible. It is also worth considering that the condition
number of X^t X, κ(X^t X), can be as large as κ(X)^2, so solving this way can
be very unstable.

~~~
dragandj
While this is correct, you do not need to compute the inverse to solve this
kind of least squares problem (and in fact you never need an explicit inverse
to solve a linear system of equations).
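
For instance, a QR factorization reduces least squares to a triangular solve,
and library routines do something similar under the hood. A sketch (setup
assumed, not from the thread):

```python
# Least squares without forming an explicit inverse (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.01, size=200)

# QR route: X = QR, then solve the small triangular system R beta = Q^t y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Library route (SVD-based), same idea: no inverse of X^t X is ever formed.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The squared condition number is the hazard mentioned upthread:
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))  # second ~ square of the first
print(beta_qr, beta_lstsq)
```

Working with X directly (via QR or SVD) keeps the effective condition number
at κ(X) instead of κ(X)^2, which is why the explicit normal-equations route is
avoided in practice.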

------
bluetwo
I know it is popular to say that these techniques are based on how the brain
works, but when I read about them, I have my doubts.

Can anyone take a real-world example of human behavior and show me how it
relates to how these techniques predict humans will behave?

I love the field but feel like there is a temptation to take giant leaps not
supported by other observations.

~~~
cbarrick
We say ANNs are "based on how the brain works" because the original
mathematical model was an attempt by McCulloch and Pitts to explain how
complex behavior arises from networks of simple neurons.

A neuron is either activated or not, and each of its many inputs can be either
excitatory (encourages activation) or inhibitory (discourages activation).
McCulloch and Pitts formalized this as a weighted sum of the inputs that was
then thresholded to 0 or 1, and they showed some basic theoretical results
that gave it credit as a model for how intelligence can arise from neurons.
Essentially, they said behavior can be described as a classifier.

AFAIK, they didn't go much into how the weights were actually learned.
Different strategies were tried, but we ultimately started to soften the
threshold function into the logistic function (to make the network
differentiable) and solve for the weights by gradient descent.

Modern Deep Learning makes the additional assumption that neurons in the same
layer are not interconnected. This assumption, along with the fact that we're
just dealing with weighted averages, allows us to describe networks in matrix
form, allows us to compute the gradients with backprop, and allows efficient
simulation on the GPU. This assumption is more practical than biological.
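
To make that progression concrete, here is a toy sketch (my own illustration,
with arbitrary weights and nothing learned) of a thresholded unit, its
logistic softening, and a layer written as a matrix product:

```python
# Toy illustration of the ideas above; weights are arbitrary, not learned.
import numpy as np

def mcculloch_pitts(x, w, threshold=0.0):
    """Original-style unit: fire (1) iff the weighted sum clears a threshold."""
    return 1 if w @ x > threshold else 0

def logistic_unit(x, w):
    """Softened unit: differentiable, so gradient descent can tune w."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def layer(X, W):
    """With no intra-layer connections, a whole layer is one matrix product."""
    return 1.0 / (1.0 + np.exp(-(X @ W)))

x = np.array([1.0, -0.5, 2.0])   # inputs (positive weights excite, negative inhibit)
w = np.array([0.3, 0.8, -0.2])
print(mcculloch_pitts(x, w), logistic_unit(x, w))
print(layer(x.reshape(1, -1), np.ones((3, 4))))  # a batch of one through a 3->4 layer
```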

> show me how it relates to how [...] humans will behave?

[This page][1] attempts to connect the dots between the McCulloch and Pitts
model, the resulting classifiers, and behavior. Essentially, the theory was
that neurons can be formalized into classifiers, and behavior is just the
output of these classifiers. I don't know too much about modern neuroscience,
but given the amazing results we are seeing these days in vision, language, and
planning, I'd say the central ideas of the theory are still credible.

[1]:
[http://www.mind.ilstu.edu/curriculum/modOverview.php?modGUI=212](http://www.mind.ilstu.edu/curriculum/modOverview.php?modGUI=212)

~~~
fnl
First of all, neurons don't have just one activation function; each dendrite
has its own, so anything from dozens to thousands per neuron. Second, that
definition doesn't cover the entire issue of multiple feedback loops. Third,
it doesn't cover memory effects at the structural (cytoskeleton) and local
(vesicles) levels, much less generic ones (RNA and your genes). And then we
haven't even gotten into the metabolomic and epigenetic wiring in your
neurons...

Calling those chained regressions similar to the brain is about as correct as
saying that a 3-year-old's drawing of a car is similar to a real Tesla...

~~~
cbarrick
I mean, doi.

McCulloch and Pitts published in 1943. Of course we know more about the
brain now.

If I were to ask you "How does intelligence arise from a network of
activations?", would you genuinely say that it has nothing to do with the
McCulloch and Pitts theory?

~~~
fnl
I would honestly say we really have no clue, and maybe add that, as far as we
can tell, activations play as much of a role in intelligence as a myriad of
other factors.

But more generally, I am just so tired of this "brain metaphor" for deep
learning. It is a funny way to wake up your students (well, at least 10 years
ago it was...), but trying to stretch the metaphor much further than that is
just painful. Heck, even the activation "function _s_ " (plural, as we now
know) in a neuron aren't really a set (!) of (singular, independent)
functions; "activation" is just a top-level name for a mind-boggling number of
things happening as neurons "fire", with a mathematical formalism to
_approximate_ what's going on. In fact, calling an activation a "function" is
probably belittling the biological processes behind it.

------
houqp
Would love to see future posts mention several of the main contributors to
deep learning, such as Geoffrey Hinton (the "father" of deep learning), Andrew
Ng, and Demis Hassabis.

~~~
luminati
Geoff Hinton - surely. But I think most experts will disagree on the other
two. In terms of deep fundamental contributions, I don't think the other two
have made much. I think Andrew Ng has been a great popularizer/marketing guy -
primarily with that Cats project. Likewise, Demis Hassabis has been a great
application creator - with amazing results of course: AlphaGo, Atari, etc.

On a side note: I lost all respect for Andrew Ng after the Baidu cheating
scandal:
[https://www.nytimes.com/2015/06/04/technology/computer-scientists-are-astir-after-baidu-team-is-barred-from-ai-competition.html](https://www.nytimes.com/2015/06/04/technology/computer-scientists-are-astir-after-baidu-team-is-barred-from-ai-competition.html)

I felt he got away too easy on that, without any apology or even a public
statement - especially considering he is a former academic. (On top of that,
he silently deleted his Google+ posts.) Imagine if something like that had
happened on a Google research team - I am pretty sure Jeff Dean or Peter
Norvig would have stepped down.

~~~
kahnjw
I didn't even realize this happened. Thank you for posting. It sucks, because
now I have less respect for Andrew.

I don't understand why you'd want to cheat in a competition like this. I get
it, people cheat all the time, but the field of machine learning is built on a
foundation of open, shared research and trust.

~~~
AndrewOMartin
On skim-reading the article it seems they were banned from submitting entries
to a competition server for 12 months because they made a significant number
of submissions.

It's arguable that they were gaming the system somewhat, but unless a limit
was explicitly defined then this just seems like they were doing a lot of
exploration in the area.

Imagine if you published some research showing you'd made something that did
something cool, but then people lost respect for you because you'd made a lot
of previous attempts.

~~~
interknot
The specified limit was 2 submissions per week, according to the people
organizing the competition:

[http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015](http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015)

------
narenst
The article mentions that GPUs are on average 50-200 times faster for deep
learning; I'm curious how he came to that number. It has a lot to do with the
code and the frameworks used. I haven't come across a good comparison; most
figures seem to be pulled out of thin air.

~~~
visarga
In my experience the speedup is more like 20x-50x.

------
madhadron
I find this history confusing. Legendre guessing by hand? Spaghetti on the
wall? No mention of the massive work of Laplace and others that led up to
Legendre and Gauss, or Gauss's connection of the notion to probability? This
is truly a bizarre view.

And then the idea that numerical optimization accounting for the slope was
novel. How does he think that mathematicians calculated for the preceding
centuries?

Linear regression springs fully formed in the 1950's and '60's? What happened
to Fisher and Student and Pearson and all the rest?

Where's Hopfield? Where's Potts? Where's an awareness of the history of
mathematics in general?

------
dpcx
This seems like a great introduction to the history. I have a problem with it,
though.

In the first example, the method compute_error_for_line_given_points is called
with values 1, 2, [[3,6],[6,9],[12,18]]. Where did those values come from?

Later in that same example, there is an "Error = 4^2 + (-1)^2 + 6^2". Where
did _those_ values come from?

Later, there's another form: "Error = x^5 - 2x^3 -2" What about these?

There seem to be magic formulae everywhere, with no real explanation in the
article about where they came from. Without that, I have no way of actually
understanding this.

Am I missing something fundamental here?
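
(For reference: in gradient-descent tutorials, a helper with that name is
usually just the sum of squared residuals of a candidate line y = mx + b. A
hedged reconstruction, not necessarily the post's exact code:

```python
# Reconstruction (assumed, not copied from the post): squared error of a line.
def compute_error_for_line_given_points(b, m, points):
    """Sum of squared residuals of y = m*x + b over (x, y) pairs."""
    total_error = 0.0
    for x, y in points:
        total_error += (y - (m * x + b)) ** 2
    return total_error

# The call in question: candidate intercept b=1, slope m=2, and three
# (x, y) points that the author presumably just picked as sample data.
print(compute_error_for_line_given_points(1, 2, [[3, 6], [6, 9], [12, 18]]))
```

Under that reading, 1 and 2 are a candidate intercept and slope, the point
list is arbitrary sample data, the squared numbers in "Error = 4^2 + (-1)^2 +
6^2" would be the residuals of whichever points and line that figure used, and
"Error = x^5 - 2x^3 - 2" is presumably a separate one-dimensional toy function
for illustrating derivative-based minimization.)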

~~~
twillmas
I'd also like to see more of a "teaching" post that can walk through the math
incrementally.

Many of the deep learning courses assume "high school math", but my school
must have skipped matrices, so I've been watching Khan Academy videos.

Are there any good posts / books on walking through the math of deep learning
from a true beginner's perspective?

------
terrabytes
Spot on. I struggled with the mainstream deep learning/machine learning MOOCs;
I felt like they were too math heavy. However, I'm struggling with how to
learn deep learning, and I get polarized advice on it: some argue that you
need a degree or certificates from established MOOCs, while others keep
recommending I do Kaggle challenges.

Has anyone managed to land a decent deep learning job without formal
CS/machine learning training? How did you approach it?

~~~
emilwallner
This is something I've also struggled with. I find it hard to read deep
learning papers because I need to translate each piece of math notation, and
thus struggle to get the bigger picture. I'm fond of the bottom-up approach,
e.g. I started by mastering C and wrote my own libraries. But for deep
learning I lean towards the opposite, starting with high-level libraries. When
I want to understand the theory, I search for simple Python code that I can
implement from scratch. This way I can understand the logic without having to
understand all the math behind it. I've mostly focused on doing Kaggle-type
problems and used MOOCs when I get stuck. I've had little interest from larger
companies, but I've managed to get a few offers from startups. Startups often
have a couple of people with PhD-level knowledge but are also looking for
programmers who can code the models.

~~~
madenine
"Machine Learning Engineer" is a title we're going to see more and more of
(and we're already seeing a lot).

Its one thing to know the math and theory to design, train, and tune the
algorithm your company needs. But implementing it into production, at scale?
That's not the same person.

Ideally, you have Person/Team A, who designs but knows enough about
implementation to keep that in mind during their process, and Person/Team B
who implements it into the software but knows enough about the design to make
it work.

~~~
ska
Truly ideally, you have someone or a team who can actually do both properly.
However, very few people can, and if you have such a person, you may not be
able to justify their time on all aspects.

So the compromise is usually as you describe, but you bear the cost of
translation issues no matter how you do this. It's worth remembering that it
is a compromise.

I think systems like TensorFlow are implicitly a recognition of this, allowing
lower impedance between the groups.

------
melling
“It’s been used in Andrej Karpathy’s deep learning course at Stanford,”

Andrej is now working at Tesla. I believe this is his course:

[http://cs231n.stanford.edu/syllabus.html](http://cs231n.stanford.edu/syllabus.html)

~~~
emilwallner
Yes, you can find it here:
[https://youtu.be/i94OvYb6noo?t=59m6s](https://youtu.be/i94OvYb6noo?t=59m6s)

------
hamilyon2
All of the scientists I love (even Heaviside!) in one article. I am so
pleased!

------
amelius
No mention of who invented convnets?

------
fnl
Which is a long-winded way to show that deep learning isn't much more than a
concatenation of glorified regression functions... :-)

<ducks because="had to get that one out there"/>

Edit: There we go with the downvotes. I knew deep learning guys can't stand
this claim (but it's true, as the post itself shows at great length... :-))

