
The Barbell Effect of Machine Learning - wallflower
https://medium.com/@nickbeim/the-barbell-effect-of-machine-learning-f840106200b9
======
ch4s3
>What allowed Google to rapidly take over the search market was not primarily
its PageRank algorithm or clean interface, but these factors in combination
with its early access to the data sets of AOL and Yahoo, which enabled it to
train PageRank on the best available data on the planet and become
substantially better at determining search relevance than any other product.

This seems like nonsense. Am I wrong in thinking that back when Google still
used PageRank, it wasn't based on ML?

~~~
tensor
PageRank still "learned" from the data, and it wouldn't have been able to do
that without a ton of data. So while the data wasn't the most important thing
in getting Google off the ground, it was arguably as important as the
algorithm.

Though as others have said, after the starting gun went off, a lot of other
factors came into play, because PageRank, like other algorithms, is easily
gamed on its own.

~~~
IshKebab
PageRank doesn't learn from the data. It just calculates metrics _on_ the
data. It's totally different from Machine Learning.
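For what it's worth, the core computation really is a fixed iteration on the link graph, with no labels and no training loop. A minimal sketch (illustrative only, not Google's actual implementation):

```python
# Illustrative PageRank sketch: power iteration on a link graph. The rule
# is fixed up front -- nothing here is "learned" from labeled examples.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page starts each round with the "random jump" share
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" ends up with the most rank: both "a" and "b" link to it
```

The ranks always sum to 1, and convergence follows from running enough power-iteration rounds; real implementations check a tolerance instead of a fixed iteration count.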

~~~
mattkrause
It's definitely different from _supervised learning_. However, I think you
could make a case for it being some form of unsupervised learning, and it's
not as though there's a codified list of what is "Machine Learning" and what
isn't.

~~~
stdbrouw
Well, in that case calculating the sample mean of a dataset is machine
learning too. So is sorting a list alphabetically. Regardless of whether it's
supervised or unsupervised, how can you speak of machine learning if the
metric is fixed before seeing any data? PageRank is human learning: people
look at a bunch of pages they want categorized, figure out an intuitively
appealing way to categorize them, and then encode those rules in an algorithm.

Edit: after thinking about this a bit more, I guess you could in fact think of
e.g. k-means clustering as just a very advanced form of descriptive
statistics, perhaps not fundamentally different from calculating a mean or a
kernel density estimate. And in that sense PageRank would be unsupervised
learning too, but it still feels to me like that's obscuring rather than
clarifying how it works?
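To make that comparison concrete, here's a toy 1-D k-means where each update step is literally just a sample mean over the current clusters (an illustrative sketch, not a production implementation):

```python
# Toy 1-D k-means: alternate between assigning points to the nearest
# center and recomputing each center as a plain sample mean. In that
# sense it's iterated descriptive statistics.

def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for x in xs:
            nearest = min(centers, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        # each new center is just the mean of its cluster
        centers = [sum(pts) / len(pts) if pts else c
                   for c, pts in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # settles near 1.0 and 5.07
```

Whether that counts as "learning" or just "computing a statistic" is exactly the definitional question being argued here.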

------
visarga
I don't think data is as important as they make it out to be. Yes, it was
important at some point when they scaled supervised learning from thousands to
millions or billions of examples. But after that, did they continue scaling up
to trillions of examples? No. Because there are diminishing returns. Maybe the
model can't handle that much data, or the accuracy doesn't increase as much,
or the data is just too big for the current computing power. The kind of data
that is used today even at Google scale is just millions or billions of
examples.

On the other hand, what kind of data could be so secret that only the big
companies could accumulate? If you have access to a service, you can mine it
and construct a dataset. Then you can train your own model to imitate their
results. If this process is done right, it becomes easy to do transfer
learning from other AI systems, even if they keep their algorithms secret.

For example, Google's latest POS tagger SyntaxNet, which made a splash a few
days ago, was trained on the results of the Stanford parser. Interestingly,
the student model surpassed the teacher model - probably because it cancelled
out errors in the training set and generalized better from the examples.
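A toy sketch of that mine-and-imitate idea. The `teacher` here is a made-up stand-in for a proprietary model you can only query (not Stanford's parser or SyntaxNet), and the student never sees any ground truth, only the teacher's answers:

```python
# Toy distillation sketch: query a black-box "teacher" model, collect
# its outputs as labels, and fit a "student" to imitate them.

import random

def teacher(x):
    # hypothetical stand-in for a proprietary model we can only query
    return 1 if x > 0.6 else 0

# "mine" the service: query the teacher on inputs of our choosing
random.seed(0)
xs = [random.random() for _ in range(1000)]
labels = [teacher(x) for x in xs]

# student: fit the threshold that best matches the teacher's answers
best_t, best_acc = 0.0, 0.0
for t in sorted(xs):
    acc = sum((x > t) == bool(y) for x, y in zip(xs, labels)) / len(xs)
    if acc > best_acc:
        best_t, best_acc = t, acc

# best_t lands just under 0.6: the student has recovered the teacher's
# decision rule purely from its outputs
```

With a noisy teacher and a student that averages over many queries, the student can even smooth out some of the teacher's individual errors, which is one way the surpass-the-teacher effect is usually explained.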

If data is the end game for these companies, then it will be a target of
espionage, leaks and public disclosures. A hard drive containing the prized
dataset of some company can be copied in a few minutes, and if one copy
escapes, it circulates and becomes effectively public domain. So it's hard to
protect data. Data likes to flow (remember the OkCupid dataset that was
crawled without permission from behind the login wall?). Flowing data can be
turned back into machine learning models.

~~~
adrianratnapala
I assume people claim IP protection of one kind or another on these datasets,
and even if their claims were spurious, they would still make trouble.

That's not going to stop some hacker from crawling OkCupid and blackmailing
some people. But if Google or Microsoft suspects that FooStartup Inc. is
using one of their datasets in its service, then the lawyers will descend.

------
meeper16
> Proprietary algorithms can help, but they are secondary in importance to the
> data sets themselves.

This is flat wrong. No algo, no intelligence or synthetic cognition.

> The dramatic rise of Google provides a glimpse into what this kind of
> privileged access can enable. What allowed Google to rapidly take over the
> search market was not primarily its PageRank algorithm or clean interface,
> but these factors in combination with its early access to the data sets of
> AOL and Yahoo, which enabled it to train PageRank on the best available data
> on the planet and become substantially better at determining search
> relevance than any other product.

This is so wrong on so many fronts. A) They had open access to the web just
like DMOZ and Yahoo and many others via crawling systems. B) They were
attractive to software engineers, who in turn made the IT departments in
large corps switch to Google as the default search engine. C) They stole the
ad matching algo from Bill Gross, which in turn made them successful. D) Too
many other factors to list.

Let's also not forget that PageRank was a variant of an earlier link analysis
algo that had been around for a while.

~~~
izendejas
On the first statement, I think it's mostly right, and he's merely restating
what Peter Norvig has stated all along, albeit not as precisely.

On the second statement, I believe you both missed the point. By establishing
the partnerships with Yahoo and AOL, Google was able to acquire click-through
data, which helped it train its ranking algorithms better. It was the
implicit feedback, not the crawled data, that helped. That obviously wasn't
the only factor--having very smart engineers/scientists matters. But I think
the OP's thesis is that now that a lot more is open sourced, computing power
is cheaper, etc., data is the hardest thing to come by, and on this I fully
agree.

