

Six Months of HackerNews Front Page Data  - matt1
http://www.mattmazur.com/2010/03/six-months-of-hackernews-front-page-data/

======
eob
I hacked together a quick script to generate new HN story titles from your
list of existing ones. It is a pretty shoddy job -- no syntax-level modeling
-- but some of the ones it generated are still pretty amusing:

mark cuban: how to say facebook

how ravelry scales to remove ipad stories from wind

google patents its way to copenhagen

bert and rails ecosystem white paper

how phusion built a blog posting

2010 conference makes the last days of android phones

is like sex. it's better when harvard teaches networking

thunderbird and one hell of the illusion of music

customer development and lying with nginx

toddlers develop individualized rules for scalewell startup fund

bill gates sums up massive data failure leads to control robots

the fuel for running our financial system

i have become a programming language

the future of instant approval

scheme that 'cancer-proofs' rodent's cells

the design and getting your business

h.264 to reach 1 billion rows into the expression problem

the bible that runs on your vc "closing" fees

ask pg: quick tips on different sql implementations

the insanely great in the free version of iphone

scalable apps on vetting opportunities

mona lisa's smile a frozen sculpture of programming

coelacanth: lessons from moleskine to rule your code

results with people: do what would never launch
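
For anyone wondering how a generator like this can work, here is a minimal sketch. It assumes a word-level Markov chain (each next word is picked only from words that followed the current one in some real title), which would fit the "no syntax-level modeling" description but may not be what the script above actually does. The `START`/`END` sentinels, the function names, and the sample titles are all made up for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentinel tokens marking title boundaries


def build_chain(titles):
    """Map each word to the list of words seen immediately after it."""
    chain = defaultdict(list)
    for title in titles:
        words = [START] + title.lower().split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate_title(chain, rng=random, max_words=15):
    """Random-walk the chain from START until END or max_words is reached."""
    if START not in chain:
        return ""  # no titles were supplied
    word, out = START, []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    # Made-up sample titles, not taken from the dataset.
    sample = [
        "how to scale a startup",
        "ask hn: how to learn a language",
        "the future of a startup fund",
    ]
    print(generate_title(build_chain(sample)))
```

Because duplicates are kept in each follower list, common word pairs are picked more often, which is roughly why the output reads like plausible fragments stitched together.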

~~~
eob
I put the python scripts here if anyone wants to play with them:

<http://people.csail.mit.edu/eob/files/hn/>

The code wasn't written to be anything more than a quick toy, so don't zing
me for its poor quality :)

------
icefox
Based on that data, it looks like the best time to submit a story is between
12:00 and 16:00 UTC.

Edit: This is from thirty seconds of tossing it through awk. The distribution
is fairly even, so the difference may be insignificant. I only counted articles
that reached 1st place; you should of course parse it yourself rather than take
my word for it. And a graph would be nice.
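
A rough Python equivalent of that kind of tally, as a sketch only: it assumes the dump has already been parsed into `(story_id, unix_timestamp, rank)` tuples, which is a hypothetical layout and not the file's actual schema. It counts the UTC hour at which each story first showed up at rank 1, which is only a proxy for when it was submitted.

```python
from collections import Counter
from datetime import datetime, timezone


def top_spot_hours(rows):
    """Count, per UTC hour, how many distinct stories first reached rank 1.

    `rows` is an iterable of (story_id, unix_timestamp, rank) tuples.
    This layout is an assumption, not the dump's real column order.
    """
    first_seen = {}
    for story_id, ts, rank in rows:
        if rank != 1:
            continue
        # Keep only the earliest snapshot in which the story was at rank 1.
        if story_id not in first_seen or ts < first_seen[story_id]:
            first_seen[story_id] = ts

    hours = Counter()
    for ts in first_seen.values():
        hours[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    return hours
```

Calling `top_spot_hours(rows).most_common(5)` would then list the five busiest hours, which is enough to check whether any window really stands out.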

~~~
joshstaiger
I happened to be playing with R today, so I took a stab at making a chart:

<http://tinyurl.com/hnrank>

~~~
aneesh
Rule of thumb for when to submit seems to be: whenever PST people are awake,
and not eating meals.

------
shmichael
Just today I was discussing the prospects of analyzing HN front page posts
with a friend.

I promise to come up with some interesting results. Thank you.

~~~
shmichael
And my friend didn't even wait up for me.

<http://news.ycombinator.com/item?id=1175223>

------
wvl
Thanks for the dataset. FWIW: %s/&quot;//g takes it from 170M to 100M

Of course, compression negates the saving, but it still seemed odd.

~~~
matt1
Good point. I just went with the default export settings--if I do it again in
the future, I'll definitely do it this way.

------
sandaru1
The current data format is hard to read with the Python csv module. This code
will convert it to Python-compatible CSV: <http://gist.github.com/325195>

It's a bit slow (~15 seconds), but it's a one-time job.

------
revorad
Thanks a lot Matt. That should be one heck of a dataset to play with.

------
paraschopra
Thank you so much. This is all I needed to make my HN points predictor for
newly submitted stories.

Now if only I could find a nice chunk of time on a lazy weekend...

------
aditya
Does anyone have a full dump of HN posts and comments?

~~~
silentbicycle
Some were posted roughly a year ago, but they're no longer up. I might have
them somewhere; give me some time to dig.

------
marcamillion
Hrmm....so now we will see if your tool explodes when a URL to the site
reaches the front page.

Kinda like when you google google.

------
petewailes
Doing statistical coolness now. Will post results later. Stay tuned...

