
Two big challenges in machine learning [pdf] - jsnell
http://icml.cc/2015/invited/LeonBottouICML2015.pdf
======
joe_the_user
It seems Léon Bottou is an important figure in machine learning. His personal
website has interesting stuff too.

[https://research.facebook.com/researchers/1558013787807218/l...](https://research.facebook.com/researchers/1558013787807218/leon-bottou/)

[http://leon.bottou.org/](http://leon.bottou.org/)

------
chestervonwinch
I wasn't aware Léon Bottou was with Facebook now as well. They're building
quite a research team. Interesting slides.

~~~
joe_the_user
It's fascinating that now that things like deep learning have massive
traction, people like LeCun and Bottou (both now at Facebook), who
apparently pioneered the stuff, are taking a critical position on it -
critical not being negative or dismissive but rather a "we have to see the
limitations and go beyond them" approach.

See LeCun's "What's Wrong With Deep Learning?":
[https://drive.google.com/file/d/0BxKBnD5y2M8NVHRiVXBnOVpiYUk...](https://drive.google.com/file/d/0BxKBnD5y2M8NVHRiVXBnOVpiYUk/view?sle=true)

~~~
chestervonwinch
To be fair, he's been talking about the relationship between theory and
empiricism with regard to neural networks for some time now. See, for
instance, page 12 here [1] (the titles of his talks all seem to be provocative
questions, almost in a tongue-in-cheek sort of way).

One of the problems (in my opinion) with networks is that the analysis has
always been post hoc. Personally, I think it's preferable to build a method
starting from theory [2], which can then be tested empirically to see whether
the assumptions of the theory hold. Then augment the theory, the method, and
experiment again.

Now, there's nothing inherently wrong with post hoc analysis - it's just a
different starting point in the loop of science. However, because we didn't start
from theory, the burden is then to extract some theory from empirical
observation. Again IMO, this can be problematic because:

1) It more easily allows for confirmation bias.

2) It leads to a multitude of fragmented theories.

The second is why everything surrounding neural networks seems so incredibly
ad hoc.

[1]: [https://www.cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex....](https://www.cs.nyu.edu/~yann/talks/lecun-20071207-nonconvex.pdf)

[2]: The principles could be based on statistical learning (see SVM),
neurophysiology (see work by Poggio or Olshausen), mathematical invariants
(see work by Mallat), etc...

~~~
joe_the_user
Thanks for the post, I was hoping for more discussion of this.

I'd be even more pessimistic about one's ability to go forward from empirical
observation of opaque mechanisms.

Aside from your incisive observations, there's the point that if you have a
"good" "working" "theory of how neural networks operate", what is it a "theory
of"? It's dependent on the mechanisms that gather the test data, the sort of
answer that a certain kind of person wants out of the test data, and so forth -
the "epistemological" questions you didn't answer and couldn't answer will
come back to bite you.

I'd add that SVMs do seem more firmly founded, but their ultimate tweak, the
kernel trick plus projection onto feature space, is basically ad hoc too -
though still much closer to a "real" probability model, etc. The problem with SVMs
is that they wind up more or less equivalent to a 1st order neural network and
thus they don't scale - once data becomes truly huge, they require too much
storage.
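
A minimal sketch of that storage point (my own illustration with scikit-learn on synthetic data, not anything from the slides): a kernel SVM has to keep its support vectors around at prediction time, and on noisy data their number tends to grow roughly with the size of the training set.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Fit an RBF-kernel SVM on increasingly large noisy datasets and count how
    # many training points it must store as support vectors.
    for n in (500, 2000, 8000):
        X, y = make_moons(n_samples=n, noise=0.3, random_state=0)
        clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
        print(f"n={n:5d}  support vectors stored: {len(clf.support_):5d}")

The model (and its prediction cost) keeps growing with the data, which is the "doesn't scale" complaint in a nutshell.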

Ironically, I think the best single overall critique of AI efforts was
articulated by Paul Allen[1]. The problem is that in building large systems,
people encounter a "complexity brake" that prevents further progress.
Creating more complex systems to tackle that tends to fail as people wind up
understanding less and less of their own complex systems.

The problem with all the neurophysiological models is that raw neurons are
very complex things, and one doesn't know immediately which parts even carry
meaningful signals - a problem made worse by not having a model of what those
"meaningful signals" might be.

Consider that if aliens looked at human-made microchips and tried to model
them fully, they might get the clock signal and the various nonlinearities in the
transistors right but still make enough computational errors that no program
would run on their model.

Another good argument is that all our methodology hinges on classical
Western epistemology, and a change in that may be necessary[2].

[1] [http://www.technologyreview.com/view/425733/paul-allen-the-s...](http://www.technologyreview.com/view/425733/paul-allen-the-singularity-isnt-near/)

[2] [http://aeon.co/magazine/technology/david-deutsch-artificial-...](http://aeon.co/magazine/technology/david-deutsch-artificial-intelligence/)

~~~
chestervonwinch
> if you have a "good" "working" "theory of how neural networks operate", what
> is it a "theory of"?

I think we should make the distinction between theory pertaining to a task and
theory pertaining to methods that perform (or approximate) the task.
Certainly, the former can be incorporated into the latter, so the boundary is
fuzzy. Actually, that point is quite important, because known principles of
the task can be expressed mathematically and incorporated functionally into
the approximation method. In this sense, a network architecture could arise
naturally. In fact, it sort of does with anything that has a cascade-type pattern.

On the other hand, if we're going to talk about the method (networks, in
particular) independently of the task, this is more difficult. The question now
is: is the network model remarkable in some sense? Meaning: is there some
class of functions which are "best" or "more efficiently represented" by
network approximations, and what are the properties of the class that make
this the case? Yoshua Bengio has touched on this with regard to depth from
the point of view of circuit theory, but the argument is basically: "here are a
couple of circuits which are more efficiently represented by increased depth,
therefore deep = good always". It would be more interesting if there were a
more rigorous analysis from a function approximation view. Perhaps literature
exists on this. I'm not sure - I'm sort of rambling now.
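
(For concreteness, and paraphrasing from memory rather than from these slides: the textbook circuit-theory separation is n-bit parity. Any AND/OR/NOT circuit of depth d computing it needs size

    \text{size} \;\ge\; 2^{\,\Omega\left(n^{1/(d-1)}\right)} \qquad \text{(Håstad's theorem)},

while a balanced tree of two-input XOR gates computes it with n - 1 gates at depth \lceil \log_2 n \rceil. Whether an equally crisp statement exists for real-valued function approximation by networks is exactly the open question.)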

> kernel trick plus projection onto feature space, is basically ad hoc too

The choice of kernel - yes I agree, but the driving theory of the method is to
maximize the margin, not choose the best kernel.
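
To spell that out (this is the standard textbook form, not something from the slides), the soft-margin primal is

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i
    \quad \text{s.t.}\quad y_i\left(\langle w, \phi(x_i)\rangle + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0.

Maximizing the margin is just minimizing \lVert w\rVert; the kernel K(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle only enters when you pass to the dual, and picking it is where the ad-hockery you're pointing at lives.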

> Creating more complex systems to tackle that tends to fail as people wind up
> understanding less and less of their own complex systems.

Interesting. Maybe there's something going on with the relationship between
entropy and complexity.

------
syense
Did I just read a powerpoint presentation?

~~~
joe_the_user
Yeah, the presentation format is annoying, but the argument is good - in fact,
the argument is really important.

Machine learning is about approximate reasoning, but with no real guarantees on
the approximation. Even if the approximation is usually very good, being
occasionally bad can be "deeply problematic" when we don't have control over
exactly how the bad recommendations are created.

~~~
shockzzz
Isn't that what confidence intervals are for?

~~~
RMarcus
I'd say that's more what cost functions are for, but that is neither here nor
there.

I think the biggest takeaways are...

1) Machine learning in large software projects is complex because decisions
made by one algorithm can influence the data seen by other algorithms, creating
massive biases (there's a toy sketch of this feedback loop below, after point 2).

2) Simple cross-validation / hold-out methodology is limited as we expand
what we want machine learning to handle. Reason: big data is too
big, and "correctness" is difficult to evaluate for things like Q&A systems.

