
Beyond PageRank: Learning with Content and Networks - prakash
http://measuringmeasures.blogspot.com/2010/01/beyond-pagerank-learning-with-content.html
======
rjurney
It speaks well for Flightcaster that founder Bradford Cross can so effectively
communicate such complex concepts in such a simple way to a general audience.
Excellent post.

~~~
bradfordcross
That is very kind of you.

It would be great if more researchers did some pieces like this from time to
time.

Things are going very fast right now and it is creating a large gap in general
awareness as to what is happening.

This is true not only for the general public, but also for researchers with
different specializations. I think there is also room for a more technical
version of this kind of essay to serve as a cross-specialization review
paper.

------
jonmc12
"I want to switch gears slightly and talk about other ways we use humans in
modern research."

I thought the first argument, regarding the use of humans to conserve
resources associated with time, data and domain expertise, was right on and
very articulate.

I expected this argument to be extended in the '40 years' section of the
article. After all, isn't the shift to computer-driven learning simply a
threshold between the utility / cost of supervised/semi-supervised learning vs
unsupervised learning? In other words, couldn't you say that at some point the
'ways we use humans' become too expensive vs non-human methods? The point where
that threshold is crossed is the point where 'non-biological entities have more
reasoning and processing power than biological entities.' Maybe just another
way of saying the same thing.

In any case, a great post. Worth reading closely in my opinion.

~~~
sdrinf
Um, no.

The trade-off isn't between utilizing humans VS utilizing computers. It's the
R&D cost of building simple, proven systems and paying peanuts into mturk
VS the R&D cost of complex, sophisticated systems multiplied by the probability
of them working.

In other words, our tightest bottleneck is not the cost of supervising, but
rather our understanding of the strategies and abilities needed to tackle a
given problem domain using only the limited tools of a fancy abacus.

And since this is HN, I'd like to digress here and propose that the curse of
researching such strategies and abilities is that, from an investor's POV,
the math simply doesn't add up. The R&D phase can go on for practically as long
as the budget; the probability of even incremental success is slim; and except
for a couple of well-defined domains, applicability is negligible. Investors
have two AI winters[1] to point at and laugh about, which I suspect is the main
reason ML/IR/NLP have had to re-market their terminology. This is a very deep
problem I would pay for humans to tackle.

To deconstruct the second part of your argument, the notion of "having more
reasoning and processing power than biological entities" is naive at best. In
the context of human brains neither "reasoning power", nor "processing power"
could be quantified in any way that would even resemble our quantification of
these properties in computers. It's like comparing the Na'vi[2] to Steve Jobs.

This naivety stems from the common misconception of presuming a
singular/general reasoning capability, instead of a fairly large number of
highly specialized modules, each responsible for a single type of cognition
and reasoning. An important point to note here is that what our brain does VS
what we are building[3] differ significantly: the market is much more
interested in functional diversification than in exchanging pennies of
supervising for pennies of computing power.

In conclusion, the "reasoning and processing power" of humans and
ML/IR/NLP/etc systems are incomparable not only in terms of quantification,
but also in their evolutionary environment, and thus their functional goals.

[1] <http://en.wikipedia.org/wiki/Ai_winter>

[2] <http://en.wikipedia.org/wiki/Navi#Na.27vi>

[3] <http://stackoverflow.com/questions/1050696/the-business-of-artificial-intelligence>

~~~
jonmc12
The article was addressing the validity of the viewpoint that 'non-biological
entities have more reasoning and processing power than biological entities...I
think this is all going to happen in the next 40 years...' I tend to agree
with you that there is no known quantitative measure to answer this question.
Instead (because there is no known quantification method), I proposed the
question: if a given learning problem can be solved most cost-effectively
using only computers, has it satisfied this requirement (of a non-biological
entity achieving superior reasoning on a specific problem or set of problems)?

You said 'no', due to the cost of R&D. However, computer-driven R&D,
particularly in the form of generative hypothesis testing, may one day be able
to perform the job of 'understanding the strategies' more accurately and at a
lower cost than a human.

Let's say your cost of learning is a function of the resource limitations the
article laid out, i.e., Cost of Learning = Cost of R&D + Cost of Data + Cost of
Domain Expertise. What I am suggesting is that if this cost of learning is
lower for a computer alone than it is for a computer + human, then this is
perhaps an adequate test to say 'In this problem area, a non-biological entity
has achieved superior reasoning compared to a biological entity'.
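
As a toy sketch of that test (the cost terms and figures below are purely
illustrative, not from the article):

    # Hypothetical illustration of the "cost of learning" comparison; all figures invented.
    def cost_of_learning(rnd, data, domain_expertise):
        return rnd + data + domain_expertise

    computer_only = cost_of_learning(rnd=500000, data=50000, domain_expertise=0)
    with_humans   = cost_of_learning(rnd=100000, data=50000, domain_expertise=200000)

    # Under the proposed test, the problem area counts as one where a non-biological
    # entity has achieved "superior reasoning" only if the computer-only cost is lower.
    print(computer_only < with_humans)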

btw, I tend to agree with your digression as well. This is a deep problem that
is not really investor friendly, but needs sponsorship one way or another to
advance the state of the art.

~~~
sdrinf
Picking single-aspect reasoning weakens the condition (and thus the strength
of the argument) quite a bit. If you take a single learning problem domain,
there are a number of fully unsupervised ML/IR systems already out there with
superhuman reasoning capabilities.

The second part of your argument is a funky one. Fully autonomous R&D-capable
AIs (AKA self-improving: that is, a problem-solving / reasoning system capable
of creating new, and improving upon existing, problem-solving / reasoning
systems, which, coincidentally, might include itself) are the modern
equivalent of the philosopher's stone. The problem has not only withstood
attacks in ways that few other problem domains have, but seems to be
AI-hard[1] in itself. As I laid out above, market forces do not favour this
kind of R&D; thus I predict this problem will remain unsolved for the
short-term future.

On the other paw, the second parameter of your cost function - "cost of data"
- can be exchanged for "amount of available data", which seems to be in an
explosive boom from where I stand. Thus, while we're talking predictions, I
would hypothesise a short-term future of slow, incremental improvements in
tackling domains, combined with an exponentially growing amount of available
data, making even relatively naive learning methods yield results well within
the optimal range.

It will still require humans. It will still require R&D. But the "cost of
learning" will be so much lower that the process will be routinely employed
(successfully) by developers, thus diversifying the overall software market
even further.

I know it doesn't make for a fancy movie, but throwing away 6.5M years of R&D
by mother evolution doesn't seem to be a wise diversification move.

[1] <http://en.wikipedia.org/wiki/AI-complete>

------
ramanujan
bradfordcross: good overview, with one minor point. I wouldn't think of
Pandora's problem as one mitigated with semi-supervised learning. That's
usually applied to a situation where you have a small number of labeled points
and a whole mass of unlabeled data; often the task is then to determine low
density regions to define boundaries of natural clusters.

In Pandora's case, they have TONS of labeled data. All you'd need to do would
be to run a decision tree (or a categorical-variable version of PCA) to (1)
determine that many of those features are strongly statistically dependent and
(2) reduce the number that need to be populated for any given song.

You could probably also do supervised learning on their massive sound database
to infer lots of these features automatically (i.e., I bet you can pick out
male vs. female vocalists without having someone listen to the track).

Combining these (supervised learning on historical data + decision tree on
historical data) would likely vastly increase their per-song labeling
throughput. Only "global" features like song genre would have to be input by
humans.
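
A rough sketch of what inferring one attribute from already-populated ones
might look like (scikit-learn, with made-up attribute data; Pandora's real
features are proprietary):

    # Infer an expensive-to-label attribute from attributes that are already populated.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Rows = songs, columns = attributes a human would otherwise fill in by hand.
    X_labeled = np.array([[1, 0, 3], [0, 1, 2], [1, 1, 3], [0, 0, 1]])
    y_labeled = np.array([0, 1, 0, 1])   # e.g. vocalist gender, hand-labeled so far

    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X_labeled, y_labeled)

    X_uncurated = np.array([[1, 0, 2], [0, 1, 1]])   # songs not yet hand-curated
    inferred = clf.predict(X_uncurated)              # no 30-minute listen required

The same pattern, repeated per attribute, is what would cut down the number of
fields a human has to populate for each new song.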

~~~
bradfordcross
This is the same point I am making. If they are still manually curating each
song at 30 mins each, they could just stop, use the labels they already have,
and infer the rest through semi-supervised learning, or learning the target
labels based on the destructured tracks.
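
For what it's worth, a minimal sketch of the label-inference step
(scikit-learn's LabelSpreading, with invented data): tracks that already have
labels keep them, the rest are marked -1, and the labels propagate through
feature-space similarity.

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.85, 0.15], [0.2, 0.8]])
    y = np.array([0, 0, 1, -1, -1])   # -1 = track not yet curated by a human

    model = LabelSpreading(kernel='knn', n_neighbors=2)
    model.fit(X, y)
    inferred = model.transduction_    # labels for every track, curated or not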

~~~
yannis
It is a good approach and one I share. NLP has done it for years, once you
have a corpus which is tagged is so much more easier to then work on your
data. I also like your idea of using the Mechanical Turk to gain traction on
the manual tagging, in any way that is probably what would super intelligent
computers might do in the 40-year span - use humans to tag - before they
super-massively carry out with the balance of the calculations! :)

One area which the article did not touch on is how to introduce controls to
identify 'rigging' of the system, i.e., similar to controlling link farms at
Google. This is where, in my opinion, the problem is turned the other way around.

------
lrm242
This is a great article. I posted on the same topic, from a different
perspective, today as well: <http://fitnr.com/filtering-the-web-of-noise/>

~~~
bradfordcross
Thanks! Likewise - the filtering problem needs to be solved.

Funny that we are discussing it on HN, where the paradigm is exactly the link
curation approach you are talking about in your post.

~~~
lrm242
Indeed. From the perspective of a learning algorithm, what would it think
about either of these links? They've both been shared on Twitter, both been
retweeted at least once on Twitter, and one has been curated here on Hacker
News.

Alas, ultimately actions taken by people can always be simulated by
spammers. We talked about this a bit on Fred's post: how do you know the act
of sharing is legitimate and not spam? Hacker News employs votes, but that
data isn't easily extracted by a learning algorithm in a general sense. I can
only assume that part of solving this problem in its totality is coming up with
a way of determining the realness of the human curator. Otherwise, spammers
will attack as soon as any such system gains momentum.

------
felicisvc
Thanks for this great post; it provides a great platform for exploring AI and
its future development. So glad you highlighted that Google's technology is
so much more than just PageRank, which people seem to want to oversimplify
quite often.

