

Stack Overflow Creative Commons Data Dump - nswanberg
http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/

======
nwjsmith
Jeff Atwood gets a ton of flak on here for his blog content, but what he,
Joel, and the rest of the Stack Overflow team are creating is an incredibly
valuable resource (especially for the type that frequents HN). Props.

~~~
biohacker42
As someone who gives Jeff a lot of flak, I wholeheartedly agree with
nwjsmith: Stack Overflow is great stuff.

------
robryan
It would be interesting to see whether someone could incorporate this data
into an intelligent Google Squared/Wolfram Alpha-style search engine for
programming questions.

------
caustic
I would like to see a Hacker News data dump. Would that be possible?

------
petercooper
If you download this and want to unpack this bizarre archive format on OS X:
_port install p7zip_ and then _7za e [filename].zip_

It only took a few minutes to figure this out, but if you're as confused by
this quirky archive format as I was, there you go :) Don't bother trying to
unpack the ZIP file the normal OS X way, as it'll just keep unpacking over
and over without giving you anything useful.

~~~
Zev
I just used EZ 7z (<http://www.macupdate.com/info.php/id/19139>) to do it for
me. No need to tarnish my system with MacPorts.

------
zacharypinter
Very cool, though I hope this doesn't get abused by google spammers.

~~~
antirez
That's impossible to prevent. It's free data, and unfortunately Google rewards
you for this: copy the data onto spamxyz.org and Google will index you, send
you a lot of visitors, and serve AdSense so you can monetize it. Something is
wrong with that, I guess, but this is how it works.

~~~
easyfrag
True enough, but the same could be said of Wikipedia (does Wikipedia offer a
data dump like this?). The catch for spamxyz.org is that Stack Overflow's
answers should show up earlier in the results list due to PageRank.

If spamxyz.org violated the CC terms, would that be enough to complain to
Google?

~~~
roam
Why would you complain to Google? It's doing what it's supposed to do: index
the web. Should we expect Google to know where the data was originally
published? I do, because it's an important metric in determining PageRank,
besides the number of inbound links. But that doesn't mean Google shouldn't
index the copy. It might simply be intended as a mirror of the first site.

In a recent court case in the Netherlands, some company A filed a complaint
against a website because it ran a story about another company B that went
bankrupt and mentioned the plaintiff in an unrelated story on the same page.
Searching for the plaintiff's company name and "bankruptcy", Google would show
you a summary that looked like company A had gone under. Does that make the
website responsible? Is it Google's fault? The judge decided the website
should take responsibility and fix it. I think that you're wasting your time
when you're searching Google to find out whether your company has gone
bankrupt.

In the case of spamxyz.org: report it as spam. In the case of the Dutch court
case: use your common sense. In the case of newspapers crying about summaries
in the search results: use robots.txt. But people should stop pointing at
Google to fix all the problems on the internet. They could do a lot better,
but there are plenty of scenarios in which you don't want Google deciding
on its own whether to show a site in its search results or not.

PS: regarding Wikipedia data:
<http://en.wikipedia.org/wiki/Wikipedia_database>

------
jcdreads
I am cheered by the news that they tested the dump to guard against an
AOL-style incident.

