

Ask HN: Access to the Corpus - What Would You Do? - BrandonWatson

What if you had programmatic access to the entirety of the dataset of Google, Bing or Yahoo: every page they have crawled (including all metadata), all the searches performed, and the ads for display. If you had programmatic access to that dataset, what business would you create?

One of my friends posed this question to me on Friday and my brain seized up. Creating another search engine made no sense, but the massive size of the dataset and the potential possibilities actually made my brain shut down.

What would HN folks do?
======
caffeine
So .. Google _have_ these data. Their hacking chops are not unremarkable. What
have they done with it? Well, basically, you can ask for a word and they'll
find other pages that feature that word ...

And that's it.

They have access to "the world's combined knowledge" and a zillion PhDs, and
that's _all_ they can do?! It's shocking. But it's not, really, because the
data are basically useless without annotations.

So let's go to Disneyland and pretend we have a genius NLP engine or an
annotated web. Then,

1) A 360 on a company/product. In particular: who are _all_ the stakeholders,
and how do they feel? I'd sell this to e.g. analysts. Same thing for people's
online identities.

2) Memetracing. I'd sell this to advertisers. (So, we follow historical
product releases and see exactly what memes spread about them, how, and
through whom. Related to (3) in ismarc's post.)

3) Rumors (this would require your feed to be real-time). I'd probably sell
this to stock traders (the idea here is to monitor e.g. forums frequented by
GE employees and catch the scoop early).

OK, so those are basic. More interesting:

4) Organization-tracing. If you can label social graph edges with influence
levels / information intakes, you can start playing with predicting
organizational decision-making (behavioral economics / game theory; there's a
TED talk about this).

5) Games: procedural content generation that looks really real, i.e. worlds
full of people whose identities are plausible, whose interactions with others
are plausible, etc.

Those all require some analytical / NLP firepower on a scale which I don't
think is really doable at the moment. The problem is that bags of words are
meaningless without a social context - the data are pretty worthless unless
your computer can figure out who it's important for and why.

~~~
timcederman
If you really think that's all they're doing, then you have very little
understanding of just how subtly awesome some of the stuff they're doing is.

Peter Norvig does a better job explaining some of it than I ever could.
<http://www.youtube.com/watch?v=nU8DcBF-qo4>

~~~
caffeine
Peter Norvig says exactly what I said in that video.

He says it when he says "... but the computer doesn't understand physics." Why
not? The internet knows a hell of a lot more physics than I do.

The statistical stuff is cool, but all you can do with it is fancy accounting.
You can never really pull out _meaning_ .. and doing real analogical /
inductive reasoning on statistical data sets is _hard._

You'll notice that suggestion (4) can only be considered doable because the
internet has easily accessible APIs to determine who is real and what
organizations they belong to.

Suggestion (5), meanwhile, is not feasible at all at the moment.

------
ismarc
You have to consider that all that data isn't just a list of web pages. It
includes the information served alongside each page as well as the contents
of the page itself (plus associated metadata).

1) Create a map of the web (what links where and how), enabling an enhanced
"browsing" experience (no more perusing a site for links to other interesting
places)

2) The contents of those pages contain a large volume of technical
documentation as well as a large number of opinions about the technologies
that rose from that documentation. With both, and a long enough history of
when pages were created, documentation was released, and opinions were
presented, a model can be built to predict the success rate of any particular
technology.

3) Given sufficient time, there is a large enough set of text that can be
rendered to audio through speech synthesis (higher quality the better) in
order to train a speech recognition system to a previously unseen level of
accuracy.

4) Given a proper algorithm, sites with security vulnerabilities can be
discovered just from what is crawlable but most likely should not be publicly
accessible. From that list, and given the total number of unique websites in
the database, you can calculate the ratio of harmful to potentially
non-harmful websites and assign a risk threshold to any given link on any
given page.

5) Provide a search engine that uses regular expressions on different sets of
the data (metadata, tags, text, text in a specific tag, etc.) to deliver a
previously unseen level of accuracy. Accuracy of the results would depend on
the individual doing the searching, not on the system's ability to guess that
by std::list I didn't mean "Sexually Transmitted Diseases STD List".
(Seriously,
[http://www.google.com/search?sourceid=chrome&ie=UTF-8...](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=std::list)
has that as the second result.)

Edit: I failed at formatting
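The regex-search idea in (5) can be sketched with a toy in-memory index. Everything here is invented for illustration (the field names, the pages); a real system would run the same matching over the crawler's stored fields:

```python
import re

# Toy index: each crawled page stored as a dict of fields
# (hypothetical schema, not any real engine's).
pages = [
    {"url": "http://example.com/cpp", "title": "std::list reference",
     "text": "std::list is a doubly-linked list container.",
     "tags": ["c++", "stl"]},
    {"url": "http://example.com/health", "title": "STD list",
     "text": "A list of sexually transmitted diseases.",
     "tags": ["health"]},
]

def regex_search(pages, pattern, field="text"):
    """Return URLs of pages whose chosen field matches the raw regex."""
    rx = re.compile(pattern)
    hits = []
    for page in pages:
        value = page[field]
        if isinstance(value, list):          # e.g. the tags field
            value = " ".join(value)
        if rx.search(value):
            hits.append(page["url"])
    return hits

# The literal query distinguishes std::list from "STD list":
print(regex_search(pages, r"std::list", field="text"))
# → ['http://example.com/cpp']
```

Because the user writes the regex, the system never has to guess intent; the trade-off is that regex scans are far more expensive than an inverted-index lookup.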

~~~
BrandonWatson
(5) is an interesting idea. I'm stuck on what else in the meta stream you
would want to search, but certainly giving the flexibility to search the tags
as well could open up some interesting experiences.

~~~
ismarc
Search based on content type, content language, with/without keywords, author.
Honestly, if Google just put out #5 and charged for it, it'd be worth $10 a
month to me.

------
tel
SEO? With something like that you could programmatically find semantic
locations with lower coverage and then sell the knowledge that you could
possibly attack that keyword. For instance, I'd love to see a chart that
plotted frequency of use in an English corpus against some hypothetical
Google-coverage variable. Anything that's a strong outlier from something like
an exponential curve could be a good target.
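A rough sketch of that outlier hunt. The frequency and coverage numbers below are invented, and the test is a crude log-ratio threshold rather than a fitted curve, but the shape of the computation is the same:

```python
import math

# Hypothetical (corpus_frequency, coverage) counts per term.
terms = {
    "mortgage": (1_200_000, 9_000_000),
    "holiday":  (2_500_000, 20_000_000),
    "quinzhee": (  900_000,     40_000),  # high demand, thin supply
}

def underserved(terms, threshold=2.0):
    """Flag terms whose log(coverage/frequency) falls well below the mean,
    i.e. terms people use far more often than the web covers them."""
    ratios = {t: math.log(cov / freq) for t, (freq, cov) in terms.items()}
    mean = sum(ratios.values()) / len(ratios)
    return [t for t, r in ratios.items() if mean - r > threshold]

print(underserved(terms))
# → ['quinzhee']
```

With the full corpus you would fit the actual frequency-coverage curve and rank keywords by residual instead of using a fixed threshold.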

~~~
BrandonWatson
I am not sure I understand in depth enough about SEO to fully understand this
idea, but it sounds good in concept.

------
notaddicted
I'd like to see which search results were clicked on for a given search, and
then associate those sites. And then do an Amazon-style "people who landed at
this site also went to _____". As well as other relations based on user
browsing, not on site links.
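That Amazon-style feature boils down to co-occurrence counting over click sessions. A minimal sketch with made-up logs (each session lists the result sites one user clicked for one query):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical click logs.
sessions = [
    ["news.ycombinator.com", "reddit.com/r/programming"],
    ["news.ycombinator.com", "reddit.com/r/programming", "lobste.rs"],
    ["news.ycombinator.com", "lobste.rs"],
]

def build_cooccurrence(sessions):
    """Count how often two sites were clicked in the same search session."""
    co = defaultdict(Counter)
    for sites in sessions:
        for a, b in combinations(set(sites), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def people_also_visited(co, site, n=3):
    """Top-n sites most often co-clicked with the given site."""
    return [other for other, _ in co[site].most_common(n)]

co = build_cooccurrence(sessions)
print(people_also_visited(co, "news.ycombinator.com"))
```

At search-engine scale you'd want the counts normalized (e.g. by each site's overall click volume) so the recommendations aren't dominated by universally popular pages.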

~~~
nico
I really like this idea, though the other sites would probably be the same
ones shown in the search results. Usually when people are looking for
something, they just open the search results in order.

------
bowman
I would use it to fight crime/corruption. It isn't too hard to identify people
from a few Google searches. Examples:

Searches for food places near X will give you a good indication of where they
live and their income. What they are searching during normal work hours will
often tell you where they work. And searches for themselves.

Then just use this information to hunt them down or expose them.

~~~
nico
How would you know who's a criminal?

~~~
blasdel
From their searches of course!

It comes up in criminal cases all the time -- when you're googling "age of
consent laws" or "how to dispose of a dead body" it's pretty easy to establish
premeditation, though that doesn't help with 'pre-crime'.

The GP may have been focused on finding known fugitives on the run.

------
jhancock
Create new tools, similar to what Google, Yahoo and MS already do when
leveraging this asset. I know this is somewhat of a BS answer, but really,
what other answer is there? It's like asking "What new features should Google,
Yahoo or MS build on top of their search?"

~~~
BrandonWatson
My initial thought was that maybe you could create an information business
along the lines of ComScore. That seemed a bit pedestrian.

Overnight, I began to wonder if you could build a tool much like Farecast,
where you could help ad buyers understand the movement of keyword pricing and
make predictions about when prices will move up and down.
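A buy/wait signal for keyword prices could start as something as simple as comparing short- and long-term moving averages of a keyword's historical cost. The prices below are invented and the rule is deliberately naive; a real predictor would use much richer features:

```python
# Hypothetical daily prices (cost-per-click in dollars) for one keyword.
history = [1.00, 1.02, 1.01, 1.05, 1.08, 1.12, 1.15, 1.20]

def advice(prices, short=3, long=6):
    """Naive signal: if the short-term average already exceeds the
    long-term one, the price is trending up, so buy before it climbs."""
    s = sum(prices[-short:]) / short
    l = sum(prices[-long:]) / long
    return "buy now" if s > l else "wait"

print(advice(history))
# → buy now
```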

------
andyleclair
Download all of the porn on the internet. I'm only half joking.

------
yannis
I would keep the meta data and reverse engineer the algo!

~~~
BrandonWatson
Good luck with that. :) Solving the user acquisition problem may prove
somewhat challenging, no?

------
snitko
Sell it to someone who'd know what to do with it?

~~~
bluedanieru
You know that bit in Office Space where they're asking each other what they'd
do if they had a million dollars and one of the guys replies that he'd invest
half in mutual funds and take the other half to his friend who works in
securities?

You're that guy now :-)

~~~
snitko
Yeah, you got me, OK. But somebody would have said this eventually; I saved
that somebody's ass, didn't I?

------
Diakronik
I wouldn't create a business. I'd write a program that could learn (text-
based) language. Then I'd write it up and submit it to Computational
Linguistics. Then I'd die in obscurity as someone took my idea and figured out
how to monetize it.

------
derefr
Start the world's most underhanded (and successful) SEO firm.

------
trevelyan
Better machine translation.

------
CamperBob
This scenario is reminiscent of what happened a few years back when AOL
released their whole corpus of search queries with associated user-ID hashes.
Most of what was done with that data set amounted to amateur CSI work and
general meanness. I don't recall any profound revelations or can't-miss
business opportunities coming out of it. The value of raw data is overrated.

~~~
BrandonWatson
AOL released the list of searches created by users, purportedly anonymized.
You didn't have access to the index of their crawler. Also, their subset was
really small. I'm talking about a continuously updated stream.

I have started learning about complex event processing (CEP) and am wondering
if you could do anything with that as a model tied to suggesting related
links/sites.
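As a toy version of the CEP angle: treat page visits as a stream of events and emit "related" pairs whenever one user hits two sites within a short window. The event format and data are invented, and real CEP engines express this kind of rule as a stream query rather than hand-rolled code:

```python
from collections import deque

# Hypothetical event stream: (timestamp, user, url) page-visit events.
events = [
    (0,   "u1", "a.com"), (5,   "u1", "b.com"),
    (100, "u2", "a.com"), (103, "u2", "b.com"),
    (500, "u3", "a.com"),
]

def related_within(events, window=30):
    """Emit (url1, url2) pairs when one user visits both within the window."""
    recent = {}   # user -> deque of (ts, url) still inside the window
    pairs = []
    for ts, user, url in events:
        dq = recent.setdefault(user, deque())
        while dq and ts - dq[0][0] > window:   # expire stale visits
            dq.popleft()
        for _, prev in dq:
            pairs.append((prev, url))
        dq.append((ts, url))
    return pairs

print(related_within(events))
# → [('a.com', 'b.com'), ('a.com', 'b.com')]
```

Aggregating those pairs over the live crawl/click stream would give you a continuously updating "related sites" table without ever re-scanning the full corpus.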

