Stack Overflow Creative Commons Data Dump (stackoverflow.com)
64 points by nswanberg on June 4, 2009 | 14 comments



Jeff Atwood gets a ton of flak on here for his blog content, but he, Joel, and the rest of the Stack Overflow team are creating an incredibly valuable resource (especially for the type that frequents HN). Props.


As someone who gives Jeff a lot of flak, I wholeheartedly agree with nwjsmith: Stack Overflow is great stuff.


He blogged yesterday on a related topic, by the way: http://www.codinghorror.com/blog/archives/001272.html


It would be interesting to see whether someone could incorporate this data into an intelligent Google Squared/Wolfram Alpha-style search engine for programming questions.


I would like to see a Hacker News data dump. Would that be possible?


If you download this and want to unpack this bizarre archive format on OS X: port install p7zip, then 7za e [filename].zip.

It only took a few minutes to figure this out, but if you're as confused by this quirky archive format as I was, there you go :) Don't bother trying to unpack the ZIP file the normal OS X way; it'll just keep unpacking over and over and never give you anything useful.
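If it helps, here's the whole sequence as a sketch (assuming MacPorts is already installed; [filename] stands in for whatever the dump file is actually called):

    # install the p7zip port (may need sudo)
    sudo port install p7zip

    # 'e' extracts everything into the current directory;
    # use 'x' instead to preserve the archive's directory structure
    7za e [filename].zip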


I just used EZ 7z (http://www.macupdate.com/info.php/id/19139) to do it for me. No need to tarnish my system with MacPorts.


Try the Unarchiver: http://wakaba.c3.cx/s/apps/unarchiver.html

It's one of the first things I install on any OS X system. It's a great unarchiver that handles most (all?) of the common archive formats, although I run into the occasional password issue. It beats StuffIt without a doubt.


Very cool, though I hope this doesn't get abused by Google spammers.


That's impossible to prevent. It's free data, and unfortunately Google effectively pays you for this: you copy the data onto spamxyz.org, Google indexes you and sends you a lot of visitors, and AdSense lets you monetize them. There's something wrong there, I guess, but that's how it works.


True enough, but the same could be said of Wikipedia (does Wikipedia offer a dump of data like this?). The catch for spamxyz.org is that Stack Overflow's answer should show up earlier in the results due to PageRank.

If spamxyz.org violated the CC terms, would that be enough to complain to Google?


Why would you complain to Google? It's doing what it's supposed to do: index the web. Should we expect Google to know where the data was originally published? I'd say yes, because origin is an important signal in determining PageRank, alongside the number of inbound links. But that doesn't mean Google shouldn't index the copy; it might simply be intended as a mirror of the original site.

In a recent court case in the Netherlands, company A filed a complaint against a website that had run a story about another company, B, going bankrupt, and had mentioned the plaintiff in an unrelated story on the same page. Searching for the plaintiff's company name plus "bankruptcy", Google would show a summary that made it look like company A had gone under. Does that make the website responsible? Is it Google's fault? The judge decided the website should take responsibility and fix it. Personally, I think you're wasting your time if you're searching Google to find out whether a company has gone bankrupt.

In the case of spamxyz.org: report it as spam. In the case of the Dutch court case: use your common sense. In the case of newspapers crying about summaries in the search results: use robots.txt. But people should stop pointing at Google to fix all the problems on the internet. Google could do a lot better, but there are plenty of scenarios in which you don't want it deciding on its own whether to show a site in the search results or not.
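For example, a newspaper that doesn't want a section of its site crawled could serve something like this (the Disallow path here is just illustrative):

    # served as /robots.txt at the site root
    User-agent: *
    Disallow: /archive/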

PS: regarding Wikipedia data: http://en.wikipedia.org/wiki/Wikipedia_database



I am cheered by the news that they tested the dump for the makings of an AOL-style privacy incident.



