

Ask HN: HN data access - bashgrep

How can I get access to all of the submissions on news.ycombinator.com? I don't want the comments, just the posts.
It seems like after every HNS outage, settings get added that make it more difficult to access the content on HNS. For example, it seems you now have to sign in to access older posts. Also, the thrift database has only about 4 million records, but the IDs on HNS are in the 5 millions.
======
bashgrep
Seems like HNS is rate limiting connection speed for connections from Amazon
EC2 machines.

HNS/pg, do you not want us scraping HNS? What is the best way to get all the
posts?

------
unimpressive
I think that there are some mirrors floating around. But I wouldn't know where
those mirrors are, or what kind of access they allow.

~~~
bashgrep
Where did you read about them?

~~~
unimpressive
On the original millionshort thread, where somebody pointed out the number of
HN mirrors and linked a few.

~~~
bashgrep
These all lead to dead ends: <https://news.ycombinator.com/item?id=1721105>
<https://news.ycombinator.com/item?id=1881262>

Do you have a link for the post you are talking about?

EDIT: In this thread pg says wait 30 seconds between each request
(<https://news.ycombinator.com/item?id=1702488>), but that doesn't work
either.
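
For reference, the pacing described there could be sketched roughly as below.
This is only an illustration, assuming the item URL format used in the links
above; `fetch_page` and `fetch_all` are hypothetical names, and as noted, a
30-second wait may still not be enough to avoid being limited.

```python
import time
import urllib.request

# URL format taken from the item links in this thread.
ITEM_URL = "https://news.ycombinator.com/item?id={}"


def fetch_page(item_id):
    """Download the raw HTML for one item page (hypothetical helper)."""
    with urllib.request.urlopen(ITEM_URL.format(item_id)) as response:
        return response.read()


def fetch_all(item_ids, fetch=fetch_page, delay_seconds=30, sleep=time.sleep):
    """Fetch items one at a time, pausing between consecutive requests."""
    pages = {}
    for index, item_id in enumerate(item_ids):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[item_id] = fetch(item_id)
    return pages
```

The `fetch` and `sleep` arguments are injectable so the loop can be tried
without touching the network or actually waiting.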

EDIT: "unimpressive", was this what you were referencing?:
[https://docs.google.com/spreadsheet/ccc?key=0AqL8kR005z0QdEN...](https://docs.google.com/spreadsheet/ccc?key=0AqL8kR005z0QdENvNUJJTjYxY2lVa0RqUzJhTHFqT0E&authkey=CIeUndcL&hl=en&authkey=CIeUndcL#gid=0)

~~~
unimpressive
<https://news.ycombinator.com/item?id=3911687>

It appears to be dead.

EDIT: Another one that appears to be among the living.

<http://hackerbra.in/>

------
Mz
I am not sure, but I think the IDs count posts and comments. So I think that
means about 4 million submissions and over a million comments. Not all
submissions draw comments.

Can anyone verify this?

~~~
pasbesoin
Look at the URLs for a submission (found via its "discuss" link on the post
listing page) and for a non-post comment. They are identical in format; only
the ID value varies.

For this to work, they must share the same set of ID values.
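
A rough sketch of what that implies, assuming each scraped record carries a
`kind` field (a hypothetical name, not something the site is known to
provide) and using made-up sample records: the highest ID only bounds the
total number of items, so submissions have to be counted separately.

```python
from collections import Counter


def item_url(item_id):
    """Build the item URL; the format is the same for posts and comments."""
    return f"https://news.ycombinator.com/item?id={item_id}"


def count_by_kind(items):
    """Count records per kind, since one ID sequence covers every kind."""
    return Counter(item["kind"] for item in items)


# Made-up sample records, for illustration only.
sample = [
    {"id": 101, "kind": "submission"},
    {"id": 102, "kind": "comment"},
    {"id": 103, "kind": "comment"},
    {"id": 104, "kind": "submission"},
]

counts = count_by_kind(sample)
highest_id = max(item["id"] for item in sample)
```

Here `highest_id` says nothing about how many of the items are submissions;
only the per-kind count does.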

