

Y Combinator Dataset Of Posts - xirium

If anyone is considering making a YCombinator site search or wants to perform an analysis of historical posts then we've made a dataset available at http://www.xirium.com/ycombinator-news20080424.tar.gz

The dataset is 100MB, so only download it if you need it. This dataset may be removed in the next week or so.
======
jrnewton
> so only download it if you need it

> This dataset may be removed in the next week or so

the latter cancels the former.

------
mattjaynes
I've set up a mirror, should be quite a bit faster ;)

<http://weblava.net/ycombinator-news20080424.tar.gz>

~~~
xirium
Thank you. You may also want to mirror user data (
<http://news.ycombinator.com/item?id=173045> ).

~~~
mattjaynes
Sure ;)

<http://weblava.net/ycombinator-news-profile20080424.tar.gz>

~~~
palish
Thank you.

For what it's worth, here are additional mirrors.

Posts:
<http://dl.getdropbox.com/u/315/programming/datasets/ycombinator-news20080424.tar.gz>

Profiles:
<http://dl.getdropbox.com/u/315/programming/datasets/ycombinator-news-profile20080424.tar.gz>

~~~
xirium
Thank you.

You may also want to mirror the update utility (
<http://news.ycombinator.com/item?id=173354> ).

~~~
palish
You got it:
<http://dl.getdropbox.com/u/315/programming/datasets/ycombinator-news-update20080424.tar.gz>

------
sadiq
I'm trying to pull it down to one of our university boxes so we can mirror it.
It's going a little slow at the moment though (eta 10 hours).

I'll update with the link as soon as it's done.

~~~
ivank
I canceled my download, so hopefully part of 2.5KB/s will trickle down to your
connection.

~~~
groovyone
Me too :)

~~~
ra
I've cancelled mine too. Was getting 1.5 kb/s
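
For a sense of scale, here is a rough sketch of the download time for the 100MB archive at the rates quoted in this subthread. It assumes the rates are kilobytes per second and that 1MB = 1024KB; `eta_hours` is a made-up helper name, not part of the dataset tooling.

```python
def eta_hours(size_mb: float, rate_kb_per_s: float) -> float:
    """Hours needed to fetch size_mb megabytes at rate_kb_per_s kilobytes/second."""
    if rate_kb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_mb * 1024 / rate_kb_per_s  # assumption: 1MB = 1024KB
    return seconds / 3600


# Rates quoted in this thread, read as kilobytes per second (an assumption).
for rate in (2.5, 1.5):
    print(f"{rate} KB/s -> {eta_hours(100, rate):.1f} hours")
```

At 2.5KB/s this comes out a little over 11 hours, roughly in line with the 10 hour estimate above.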

------
mariorz
cool! why not set up a torrent and seed?

~~~
xirium
Firstly, that would require installing it :). Secondly, this is being served
by a very old server which, from empirical testing, wouldn't cope with this
type of protocol. Thirdly, I have to keep borrowed bandwidth to a manageable
level during business hours, which ideally means making it tend to zero in the
long term. I was anticipating a mirror to continue serving this data but it
would have been rude to ask.

Anyhow, I was expecting at most 20 concurrent connections before traffic
decayed. I wasn't expecting concurrent connections from 56 unique IP
addresses. The server is in London and it is transferring 1.7MB/s, mostly to
European users. However, latency and routing to US clients seem to
drastically reduce throughput to those users.

Most of the HTTP 206 [Partial] requests seem to be from Internet Explorer,
despite this being a relatively obscure choice on this forum. I can only
suppose that IE is quite inclined to re-establish connections after a
connection briefly stalls. The stalls would be because the httpd state crept
39MB into virtual memory on this 64MB RAM server. This would also be why TCP
window scaling didn't occur.
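
For anyone unfamiliar with HTTP 206: a client that resumes a stalled download sends a `Range` header for the bytes it is still missing, and a server that honours it replies 206 with a `Content-Range` header. A minimal sketch of both sides of that exchange follows; the helper names `range_header` and `parse_content_range` are invented for illustration, and this is not a claim about how IE or this httpd implement it.

```python
import re


def range_header(bytes_on_disk: int) -> dict:
    """Request header asking for everything after the bytes already saved."""
    if bytes_on_disk < 0:
        raise ValueError("byte offset cannot be negative")
    return {"Range": f"bytes={bytes_on_disk}-"}


def parse_content_range(value: str):
    """Parse 'bytes first-last/total' from a 206 response.

    Returns (first, last, total); total is None when the server sends '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised Content-Range: {value!r}")
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total
```

With the standard library, the header can be passed as `urllib.request.Request(url, headers=range_header(n))`. A server that ignores it answers 200 with the whole file, so a resuming client should check the status before appending.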

Anyhow, I knew it was risky to use this server to serve relatively large files
but it will be used in the future for posting smaller tidbits.

~~~
attack
I used to run 50 bittorrent downloads simultaneously off of my kuro just fine.
A wrist watch could handle it.

------
programnature
This is awesome, thanks.

One suggestion: it would be even more useful (for my purposes at least) if you
had another version that only included the full posts, rather than having the
full posts in addition to having separate files for each comment subthread.
The way it is now, there is a lot of data duplication, since a comment of
depth n will appear in n separate files.
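
Until such a version exists, the duplication could be removed on the consumer side. A minimal sketch, assuming each file has already been parsed into a list of dicts and that every comment carries a unique `id` key; the archive's actual file format is not described in this thread, so both the record shape and the name `unique_comments` are hypothetical.

```python
def unique_comments(files):
    """Keep the first occurrence of each comment across all parsed files.

    `files` is an iterable of lists of comment dicts, each with an "id" key
    (an assumed shape, not the dataset's documented format).
    """
    seen = set()
    unique = []
    for comments in files:
        for comment in comments:
            if comment["id"] not in seen:
                seen.add(comment["id"])
                unique.append(comment)
    return unique


# Toy data: comment 2 sits at depth 2, so it shows up in two files.
full_post = [{"id": 1, "text": "parent"}, {"id": 2, "text": "reply"}]
subthread = [{"id": 2, "text": "reply"}]
print(len(unique_comments([full_post, subthread])))
```

Memory use grows with the number of distinct ids, which should be fine for a 100MB archive.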

------
mmcgrana
It would be neat if there were comparable datasets available for other sites.
For example, I'd be excited about getting my hands on a dataset describing
Twitter's user/following graph.

------
gaika
Can I import the posts into jaanix so they are available for searching /
tagging / saving / editing?

~~~
kirubakaran
Why not?

