
 To make smarter systems, it’s all about the data - prakash
http://www.cdixon.org/?p=340
======
ovi256
If you are interested in the topic, I highly recommend "The Unreasonable
Effectiveness of data" by Norvig et al. [1]

[1] [http://googleresearch.blogspot.com/2009/03/unreasonable-effe...](http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html)

------
byoung2
I think the next big advance in search AI won't happen until we figure out a
better way of organizing data.

I had an idea in 2000 for a new kind of GPS. The technology existed back then,
but until now the data has been missing (we're gradually getting there). If
every brick-and-mortar store fed an XML feed of inventory and prices to a
central service (probably Google), you could tie that to location data and have
a GPS that lets you search by product and price rather than store name or
address.
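A minimal sketch of what that could look like, with an invented feed format (the element names, attributes, and store data here are all hypothetical, just for illustration):

```python
# Hypothetical sketch: parse a per-store inventory feed (format invented
# for illustration) and search by product name, cheapest first.
import xml.etree.ElementTree as ET

FEED = """
<store name="TJ Maxx" lat="37.77" lon="-122.41">
  <item sku="123" name="iPod charger" price="0.99"/>
  <item sku="456" name="USB cable" price="2.49"/>
</store>
"""

def search(feed_xml, query):
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        if query.lower() in item.get("name").lower():
            hits.append((root.get("name"), item.get("name"),
                         float(item.get("price"))))
    return sorted(hits, key=lambda h: h[2])  # sort by price ascending

print(search(FEED, "ipod"))
# → [('TJ Maxx', 'iPod charger', 0.99)]
```

In practice you'd aggregate feeds from many stores and also sort by distance from the lat/lon, but the hard part is exactly what the comment says: getting the stores to publish the data at all.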

Let's say you're looking for a charging cable for an iPod. Instead of
navigating to the nearest Fry's Electronics, you would type in "iPod charger"
and sort by price or distance. The results might surprise you: it turns out
TJ Maxx has them for $0.99.

Now, if we could just get the stores to give up that data!

~~~
queensnake
I'd think it'd be the losing stores that do that at first (I'm thinking
Borders), since maybe that wouldn't be the first store you'd go to. If/when
that kicked up sales, other stores would have to follow. What a utopia that'd
be, though. But, e.g., Borders can't tell you for sure from their own system
that something _is_ in the store.

------
pasbesoin
aka The importance of context. (Not just in valuing ideas, but also in
generating them.)

As an aside, this is the second time this morning that this concept has come
up in my personal communications. (Dare I note some context in this, itself?)

------
lucifer
I found the article self-contradictory.

In the case of Google, he actually makes a pretty strong case for the
algorithm, not the "data". The data (the links) were always there. It was
precisely the algorithm that _generated information_ from that data.
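That point can be made concrete with a toy power-iteration PageRank (a minimal sketch on an invented three-page graph, not Google's actual implementation): the links alone are inert data, and it is the iteration that turns them into a ranking.

```python
# Minimal PageRank by power iteration over a toy link graph.
# links maps each page to the pages it links to.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}  # uniform start
    for _ in range(iters):
        new = {}
        for p in pages:
            # damping term plus rank flowing in from pages linking to p
            new[p] = (1 - d) / len(pages) + d * sum(
                pr[q] / len(links[q]) for q in pages if p in links[q])
        pr = new
    return pr

# "a" and "b" both link to "c"; "c" links back to "a"
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # "c" collects the most link weight
```

Same input data in every run; the ranking only exists because the algorithm extracts it.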

And a Bayesian net is itself clearly the byproduct of applying an algorithm
to a data space.

The Netflix case is anecdotal, but consider the following (equally anecdotal)
counterexample: the (massive) increase in the data available to humans since
the advent of networks has not contributed to any significant increase in the
general intelligence of the population.

Unless by AI he is referring to the highly narrow case of machine creativity
with little to no data input, algorithms (obviously) do require data sources.

~~~
physcab
I think you're right that the post is slightly contradictory, but I believe
his premise is correct. In all that I have studied of machine learning, in
both academia and start-up land, I have observed that you consistently pick
algorithms to glean the information you need from a particular dataset.

In many cases you can do what my graduate advisor recommends: "keep it simple,
stupid," meaning that perhaps all that is needed is a k-NN approach and
Euclidean distance. But sometimes the data is highly overlapping and complex,
so you have to go with a more rigorous means of classification or whatnot.
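For reference, that "keep it simple" baseline really is only a few lines. A sketch with made-up 2-D toy data (the points and labels are invented for illustration):

```python
# Bare-bones k-nearest-neighbours with Euclidean distance:
# find the k closest training points and take a majority vote.
import math
from collections import Counter

def knn_predict(train, point, k=3):
    # train: list of ((x, y), label) pairs
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5)))  # → "a"
```

When the classes overlap heavily, this majority vote starts failing, which is exactly where the heavier machinery earns its keep.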

Finally, it should be noted that machine learning techniques are relatively
new. Neural networks have been around for quite some time and have well-
documented advantages. ML, by contrast, is still a lot of black magic
(tweaking of various magic parameters and such), so the benefits of various
algorithms are somewhat subjective.

