
Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Others - vkn13
http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts
======
Daishiman
Great interview. In my experience it's amazing just _how many_ people are
talking about "Big Data" and just how exactly _none_ of those are the ones
with the necessary PhDs in statistics and algorithms to get anything of any
value done.

In my experience there are very few domains within Machine Learning where you
don't need to be an expert in the field to draw useful conclusions from the
data.

Even if you have a high-level conceptual understanding of the statistical
methods, tuning the parameters to yield something relevant, and even more so
adapting existing algorithms to meet your needs, requires some pretty serious
dedication to the field.

~~~
blauwbilgorgel
> ... how exactly none of those are the ones with the necessary PhDs in
> statistics and algorithms to get anything of any value done.

I see it almost the other way around: Companies strictly demand PhDs for Big
Data jobs and can't find this unicorn. Yet we live in a time where we don't
need a PhD program to receive an education from the likes of Ng, LeCun and
Langford. We live in a time where curiosity and dedication can net you
valuable results. Where CUDA hackers can beat university teams. The entire
field of big data visualization requires innate aptitude and creativity, not
so much an expensive PhD program. I suspect Paul Graham, when solving his spam
problem with ML, benefited more from his philosophy education than his
computer science education.

Of course, having a PhD still shows dedication and talent. But it is no
guarantee of practical ML skills; it can even hamper research and results when
too much power is given to theory and reputation is at stake.

In my experience Machine Learning was locked up in academia, and even within
academia it was subdivided. The idea that "you need to be an ML expert before
you can run an algo" is detrimental to the field and does little to help wider
industry adoption of ML. Those ML experts set the academic benchmarks that
amateurs were able to beat by trying out Random Forests and Gradient Boosting.

I predict that ML will become part of the IT stack, as much as databases have.
Nowadays, you do not need to be a certified DBA to set up a database. It is
helpful and in some cases strongly advisable, but databases now see much wider
adoption by laypeople. This is starting to happen in ML. I think more
hobbyists are toying with convolutional neural networks right now than there
are serious researchers in this area. These hobbyists can surely find and
contribute valuable practical insights.

Tuning parameters is basically a grid search. You can brute-force this: in go
some ranges of parameters, out come the best params found. Fairly easy to
explain to a programmer.
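
A minimal sketch of that brute force in Python, assuming some evaluate(params)
scoring function for whatever model is being tuned (the parameter names below
are made up):

    from itertools import product

    def grid_search(param_grid, evaluate):
        """Try every combination in param_grid, return the best-scoring one."""
        best_params, best_score = None, float("-inf")
        for values in product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            score = evaluate(params)  # the caller decides how a model is scored
            if score > best_score:
                best_params, best_score = params, score
        return best_params, best_score

    # In go some ranges of parameters...
    grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}
    # ...out come the best params found:
    # best, score = grid_search(grid, evaluate)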

Adapting existing algorithms is ML researcher territory. That is a few miles
above the business people extracting valuable/actionable insight from (big or
small or tedious) data. There is also a wide range of big data engineers
making it physically possible for the "necessary" PhDs to extract value from
Big Data.

~~~
yid
While there's some truth in what you're saying, you sort of demonstrate a very
common pitfall:

> Tuning parameters is basically a grid search. You can brute-force this: in
> go some ranges of parameters, out come the best params found.

This sounds so simple. However, if you just do a brute-force grid search and
call it a day, you're most likely going to overfit your model to the data.
This is what I've seen happen when amateurs (for lack of a better word) build
ML systems:

(1) You'll get tremendously good accuracies on your training dataset with grid
search.

(2) Business decisions will be made based on the high accuracy numbers you're
seeing (90%? wow! we've got a helluva product here!)

(3) The model will be deployed to production.

(4) Accuracies will be much lower, perhaps 5-10% lower if you're lucky,
perhaps a lot more.

(5) Scramble to explain low accuracies, various heuristics put in place,
ad-hoc data transforms, retrain models on new data -- all essentially groping
in the dark, because now there's a fire and you can't afford the time to learn
about model regularization and cross-validation techniques.

And eventually you'll have a patchwork of spaghetti that is perhaps ML,
perhaps just heuristics mashed together. So while there's value in being
practical, when ML becomes a commodity enough to be in an IT stack, it is
likely no longer considered ML.
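
For contrast, a minimal sketch of the same kind of search scored with k-fold
cross-validation and checked against a held-out test set (scikit-learn here;
X, y and the SVC parameter grid are placeholders, not a recommendation):

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Hold out a test set that the grid search never sees.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Each parameter combination is scored by 5-fold cross-validation.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=5)
    search.fit(X_train, y_train)

    print("cross-validated score:", search.best_score_)
    print("held-out test score:", search.score(X_test, y_test))  # reality check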

------
diydsp
I was grateful and surprised to see the article start off immediately with a
meta-remark on the collusion between pop science media and academics. It
recalled one of my frustrations during grad school in the late 2000s: student
researchers striving for recognition, and journalists sexing up our stories in
ways that misinformed the public.

This feedback loop explains a great chunk of why we on HN spend so much time
nit-picking through stories on e.g. Wired. What we read is not so much
"reporting," but designs-by-committee of researchers doing things they think
the public wants/needs and reporters bending stories toward what they think
the public wants and needs.

~~~
cdoxsey
All news is like that. When they cover stories we actually know something
about, we see that it's all misinformed BS, but then, for some strange reason,
on other issues we're perfectly happy to have everything voxplained to us (or
to take the NY Times as gospel, if that floats your boat). Michael Crichton:

“Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the
newspaper to an article on some subject you know well. In Murray's case,
physics. In mine, show business. You read the article and see the journalist
has absolutely no understanding of either the facts or the issues. Often, the
article is so wrong it actually presents the story backward—reversing cause
and effect. I call these the "wet streets cause rain" stories. Paper's full of
them. In any case, you read with exasperation or amusement the multiple errors
in a story, and then turn the page to national or international affairs, and
read as if the rest of the newspaper was somehow more accurate about Palestine
than the baloney you just read. You turn the page, and forget what you know.”

------
7Figures2Commas
> When you have large amounts of data, your appetite for hypotheses tends to
> get even larger. And if it’s growing faster than the statistical strength of
> the data, then many of your inferences are likely to be false. They are
> likely to be white noise.

It's actually worse than that. What I see is that when companies have the
ability to store and "analyze" large amounts of data, their appetite for
_data_ tends to increase. So they seek to take in as much data as they can
find. More often than not, the quality of the data is mixed at best.
Frequently, it's horrible, and because the focus is on data acquisition and
not data quality, nobody notices the bad data, missing data and duplicate
data.

The result: even if you manage to come up with decent hypotheses, you can't
trust the data on which you test them.

~~~
laichzeit0
> When you have large amounts of data, your appetite for hypotheses tends to
> get even larger. And if it’s growing faster than the statistical strength of
> the data, then many of your inferences are likely to be false. They are
> likely to be white noise.

This is not necessarily a bad thing. Take the domain of application
performance management. You're collecting hundreds of thousands of metrics
from all over the place, OS, network, middleware, end user. Occasionally there
is a performance problem that is non-obvious. You go through the obvious
metrics and find nothing. It is a great thing at this point to just throw all
this data at some algorithm and let it come back to you with "metric X, Y, Z
looks related". This gives me some hypothesis I can go check that I would
probably never have thought of on my own. And I have a direct way of verifying
if it was a correct hypothesis: oh, it looks like there are 2 disks in this
cluster, one running at 100% and the other at 0%, so the overall utilization
only shows 50%, and I didn't think that was a problem. Investigate. Oh, this
disk has compression enabled and the other doesn't; turn it off, and the
application runs fast now.
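
A minimal sketch of that "throw all the data at some algorithm" step, assuming
the metrics sit in a pandas DataFrame and using nothing fancier than
correlation against the problem metric (real APM tools do more than this):

    import pandas as pd

    def suspect_metrics(metrics: pd.DataFrame, problem_metric: str, top_n: int = 10):
        """Rank every other metric by absolute correlation with the problem metric."""
        corr = metrics.corr()[problem_metric].drop(problem_metric)
        return corr.abs().sort_values(ascending=False).head(top_n)

    # e.g. suspect_metrics(df, "end_user_response_time")
    # -> "metric X, Y, Z looks related"; each one is a hypothesis to go verify.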

------
shmageggy
Relevant but unmentioned in his list of accolades: Jordan is the 2015
Rumelhart Prize winner, which is the equivalent of the Turing Award for the
cognitive sciences.

[http://rumelhartprize.org/](http://rumelhartprize.org/)

------
jostmey
I am glad he pointed out that most artificial neural networks bear only a
superficial resemblance to our own biological ones. But I think he failed to
appreciate the power behind Boltzmann Machines - a type of neural network
designed to create a generative model of a dataset. Personally, I find the
resemblance between these neural networks and the real ones a little uncanny.
And very few people seem to realize that the formalism behind a Boltzmann
Machine can be adapted to fit the activation patterns of real neurons - you
just have to redefine the energy function to match that of real biological
neurons.

~~~
scottlocklin
I've only started looking at RBMs recently, but ... what are you talking
about? Biological neural networks use spikes. RBMs certainly don't. RBMs look
more like HMMs to me than like biological neurons. Don't take this as me
saying, "you're wrong" -I'm curious if there is another way to think about
RBMs (aka, "papers please" -so I have a deeper understanding when I do my own
implementation of RBMs).

~~~
jostmey
The energy function for a Boltzmann machine is usually E = v^T W h, where v
are the visible units, W contains the weights, and h represents the hidden
units.
neuron and the learning rule that goes with it. Now, you can in theory start
off with any energy function you like (this form just happens to be the
simplest). You would then have to re-derive the activation function and
learning rule.
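
A minimal numpy sketch of that simplest case for a binary restricted Boltzmann
machine, written with the conventional minus sign in front of the energy and
with biases omitted; the conditional activations below are what this energy
induces:

    import numpy as np

    def energy(v, W, h):
        """E = -v^T W h for a binary RBM with no bias terms (the simplest form)."""
        return -v @ W @ h

    def p_hidden_given_visible(v, W):
        """Activation rule this energy induces: p(h_j = 1 | v) = sigmoid((W^T v)_j)."""
        return 1.0 / (1.0 + np.exp(-(W.T @ v)))

    def p_visible_given_hidden(h, W):
        """Symmetric reconstruction rule: p(v_i = 1 | h) = sigmoid((W h)_i)."""
        return 1.0 / (1.0 + np.exp(-(W @ h)))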

Just what would the energy function look like for real neurons? I don't know
but we do know that the activation function would have to "spike" in bursts.
So that is a clue. We also have rudimentary ideas about the learning rule used
in biological neural networks, so you would also want to take this into
account when determining the actual energy function. Finally, real neurons do
not send retrograde signals but are instead wired recurrently, which must also
be taken into consideration.

------
conistonwater
I learned about machine learning way after I learned mathematics, so it always
amused me that

back propagation = chain rule = forward differentiation = adjoint
differentiation

and that different disciplines have different words for what is just the chain
rule.

~~~
sherjilozair
None of the parties mentioned actually denies the above equivalence. The
reason backprop is a popular idea in deep learning is that people started
developing continuous models, where the output (and the error) was a
continuous and differentiable function of the input and the weights, which
allowed the chain rule to be used to compute the gradients, which in turn
allowed one to use gradient-descent methods. This shift from discrete units to
continuous units is what was termed error backpropagation, and not just the
chain rule.
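
To make the equivalence concrete, a tiny numpy sketch of one pass through a
single hidden layer; every line of the backward half is just the chain rule
applied to the corresponding line of the forward half (the shapes and the
squared-error loss are arbitrary choices here):

    import numpy as np

    # Forward pass: x -> h = tanh(W1 x) -> y_hat = W2 h, loss = 0.5 * ||y_hat - y||^2
    x, y = np.random.randn(3), np.random.randn(2)
    W1, W2 = np.random.randn(4, 3), np.random.randn(2, 4)
    h = np.tanh(W1 @ x)
    y_hat = W2 @ h

    # Backward pass: every gradient below is one application of the chain rule.
    d_yhat = y_hat - y              # dL/dy_hat
    dW2 = np.outer(d_yhat, h)       # dL/dW2
    d_h = W2.T @ d_yhat             # dL/dh
    d_pre = d_h * (1 - h ** 2)      # dL/d(W1 x), using tanh' = 1 - tanh^2
    dW1 = np.outer(d_pre, x)        # dL/dW1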

~~~
Houshalter
It's kind of unfortunate, as it's forced everything to be continuous. Which is
not very computationally efficient or easily interpreted by humans.

------
arca_vorago
I recently worked for a very cutting edge bioinformatics company, and I
particularly agree with his segment about data sizes growing.

What I would say though, is that I think it is less an issue of the
statistical strength of the data, and has more to do with the methods used to
turn data itself into the statistics. For example, I was working with what by
now (size projections are paramount in sysadmin planning for stuff like this)
should be close to a petabyte's worth of genetic data. The real issue we were
running into was that the traditional tools tend to fall apart on data of this
size.

What we ended up doing was writing a distribution protocol for a certain
application that worked well but wasn't very concurrent, and then every
machine on the network besides the storage/sequencers/backup would crunch the
data, helping even the big servers out. A big server would get 10-30 workers
and a workstation would get 1-4. We turned a 2-day analysis into a 4-hour
analysis.

And once we did the analysis, only one person, the company owner/genius, could
decipher it.

I have to say, as a sysadmin, it was probably one of the most challenging and
most educational positions I ever had. I actually enjoyed always being the
only person in the room without a PhD.

------
tempodox
Indeed, the big-data winter is just waiting to happen, after all the hot air
that has been produced (& continues to be). Anyway, it's very nice to see the
media hype put in perspective for a change.

I can see how some people might feel like they're between a rock and a hard
place: The data firehoses are all in place, our key-value stores are getting
fuller by the hour, and we're supposed to sit and wait for decades before
we'll be able to make any sense of it? I wouldn't be surprised if some will
much rather play roulette today than make a sure bet in 10+ yrs.

------
mathgenius
These two comments seem to contradict each other:

"we have no idea how neurons are storing information, how they are computing,
what the rules are, what the algorithms are, what the representations are, and
the like."

"...you get an output from the end of the layers, and you propagate a signal
backwards through the layers to change all the parameters. It’s pretty clear
the brain doesn’t do something like that. "

So why can't the brain do some kind of backpropagation?

~~~
sherjilozair
This is a reply to multiple sibling comments. There is actually recent work
which shows that deep learning methods can also work WITHOUT any reverse
signals:
[http://s.yosinski.com/dan_cownden_presentation.pdf](http://s.yosinski.com/dan_cownden_presentation.pdf)

~~~
jimduk
Interesting paper. Are there any more details on the architecture of the
feedback connections? Also, I can't tell from the paper where and how the
weights are being updated, e.g. what does "train" mean in this context?

~~~
asdavis
I believe instead of multiplying the delta by W^T to backpropagate the error
from layer l to l-1, you multiply it by a random projection B. It's hard to
dig deeper because there doesn't appear to be any other information on it
except here: [http://isis-innovation.com/licence-details/accelerating-
mach...](http://isis-innovation.com/licence-details/accelerating-machine-
learning/)
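
If that reading is right, the whole trick is a single substitution in the
backward step. A sketch of both versions under that assumption, where delta is
the error at layer l, act_deriv the activation derivative at layer l-1, and B
a fixed random matrix drawn once and never trained:

    import numpy as np

    def backprop_delta(delta, W, act_deriv):
        """Standard backprop: propagate the error through the transposed weights."""
        return (W.T @ delta) * act_deriv

    def feedback_alignment_delta(delta, B, act_deriv):
        """The variant described above: a fixed random projection B replaces W^T."""
        return (B @ delta) * act_deriv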

As an aside, are they _really_ trying to patent a slight twist on
backpropagation? That seems pretty counter-productive to me.

------
tormeh
>In the brain, we have precious little idea how learning is actually taking
place.

It's Hebbian learning. When a post-synaptic neuron fires shortly after a pre-
synaptic one fires, the synapse in question is strengthened (the surface area
actually becomes larger). I hope he's talking about higher level concepts of
learning, because otherwise he's wrong.
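
Written down as a rule, it looks roughly like the sketch below (a simple
spike-timing-dependent flavor of Hebbian learning; the constants are
illustrative, not biological measurements):

    import numpy as np

    def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
        """Strengthen the synapse when the post-synaptic spike follows the
        pre-synaptic one shortly after; weaken it when the order is reversed."""
        dt = t_post - t_pre                      # spike-time difference in ms
        if dt > 0:                               # pre before post -> potentiation
            return w + a_plus * np.exp(-dt / tau)
        return w - a_minus * np.exp(dt / tau)    # post before pre -> depression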

~~~
sushirain
Hebbian learning is the little idea we do have. We don't know much more: how
memories are represented by neurons, control, consciousness, vision, or almost
anything else.

------
hooande
_A lot of people are building things [with big data] hoping that they work,
and sometimes they will ... Eventually, we have to give real guarantees. Civil
engineers eventually learned to build bridges that were guaranteed to stand
up. So with big data, it will take decades, I suspect, to get a real
engineering approach, so that you can say with some assurance that you are
giving out reasonable answers and are quantifying the likelihood of errors._

It seems like the idea is that machine learning and data-driven inference
have to grow up and become a real scientific discipline. "Why can't you be
more like Civil Engineering?" This isn't the best way to look at it. Machine
learning is designed for situations where data is limited and there are no
guarantees. Take Amazon's recommendation engine for example. It's not possible
to peer into someone's mind and come up with a mathematical proof that states
whether they will like or dislike John Grisham novels. A data driven model can
use inference to make predictions based on the person's rating history,
demographic profile, etc. It's true that many machine learning approaches
don't have the scientific heft of civil engineering, but they are still very
useful in many situations.

I'm not disagreeing with the eminence of Michael I. Jordan. I think this is a
philosophical question with no correct answer. Is the world _deterministic_ ,
can we model everything with rigorous physics style equations? Or is it
_probabilistic_ , are we always making inferences based on a limited amount of
data? Both of those views are valid, especially in different contexts. Some of
the most interesting problems are inherently probabilistic, such as predicting
the weather, economic trends and the behavior of our own bodies. "Big Data" is
obviously a stupid buzzword, but the concept of data driven decision making is
very sound. We should put less focus on media hype terms and continue to
encourage people to make use of large amounts of information. Get rid of the
bathwater, keep the baby.

~~~
conistonwater
You misunderstood what he is saying.

> Similarly here, if people use data and inferences they can make with the
> data without any concern about error bars, about heterogeneity, about noisy
> data, about the sampling pattern, about all the kinds of things that you
> have to be serious about if you’re an engineer and a statistician—then you
> will make lots of predictions, and there’s a good chance that you will
> occasionally solve some real interesting problems. But you will occasionally
> have some disastrously bad decisions. And you won’t know the difference a
> priori. You will just produce these outputs and hope for the best.

He is not saying anything about the relative heft of machine learning and
civil engineering. He is saying that if you don't worry about whether your
predictions coming from big data are accurate, and whether you know a priori
that they are accurate, you will still make predictions, but some of them will
be wrong, and you don't know which ones. The analogy with engineering is only
incidental to his point, which is mainly about overfitting.

You can point out afterwards that a certain prediction made using big data was
correct in hindsight by collecting data after the prediction was used to make
some decisions, like Amazon might. But you would _really_ like to know whether
a decision is likely to be a good one before you make it. And he, as a
scientist, is interested in knowing for sure whether his results are correct.

------
elpachuco
> Another example of a good language problem is question answering, like
> “What’s the second-biggest city in California that is not near a river?” If I
> typed that sentence into Google currently, I’m not likely to get a useful
> response.

So I typed that into Google just to see, and indeed I got nothing. I guess
their [1] knowledge graph still has a long way to go.

[1] [http://www.google.com/insidesearch/features/search/knowledge...](http://www.google.com/insidesearch/features/search/knowledge.html)

~~~
nl
Wolfram can handle it:
[http://www.wolframalpha.com/input/?i=2nd+biggest+city+in+Cal...](http://www.wolframalpha.com/input/?i=2nd+biggest+city+in+California)

~~~
dubfan
That misses the key qualifier: "near a river". The challenge there is what is
"near", and what is a "river"?

~~~
nl
You are right of course.

I was playing with it, had to go to a meeting and forgot I'd modified the
question.

------
whoisthemachine
This was an incredibly rational interview.

~~~
ExpiredLink
So their brains worked well?

------
bonchibuji
I feel like I have seen the comment on singularity somewhere else (by MJ
himself). Does anyone remember?

------
fiddlediddlefoo
His comments are way off the mark. The recent advances in neural network
training are not strictly due to convolutional neural networks, but rather to
the discovery that gradient descent works remarkably well for training
multilayer neural networks on modern hardware. All of the best-performing
pattern recognition techniques in speech, image recognition, and natural
language processing now utilize "neural networks". A neural network is nothing
more than a poor name for a non-linear statistical model and, if you like, one
with a hierarchical structure (which is made possible strictly by the
non-linearity).

I don't think that anybody in the research community (except for maybe an
occasional crazy) believes that neural networks have any biological
significance beyond inspiration. NIPS (Neural Information Processing Systems)
has been a reputable venue for work in statistics for some years now with no
confusion over the idea that "Neural" does not mean a precise (or even
imprecise) imitation of biological neurons.

~~~
mborsuk
How quickly did you read this? He says very nearly what you are saying:

"Well, I want to be a little careful here. I think it’s important to
distinguish two areas where the word neural is currently being used.

One of them is in deep learning. And there, each “neuron” is really a cartoon.
It’s a linear-weighted sum that’s passed through a nonlinearity. Anyone in
electrical engineering would recognize those kinds of nonlinear systems.
Calling that a neuron is clearly, at best, a shorthand. It’s really a cartoon.
There is a procedure called logistic regression in statistics that dates from
the 1950s, which had nothing to do with neurons but which is exactly the same
little piece of architecture."
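
That "cartoon" fits in one line of code: a linear-weighted sum pushed through
a logistic nonlinearity, which is exactly the logistic-regression unit he
mentions (a minimal sketch):

    import numpy as np

    def cartoon_neuron(x, w, b):
        """A deep-learning 'neuron': weighted sum passed through a nonlinearity."""
        return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))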

