
Ask HN: Does anybody have a list of HN story ids including 'dead' submissions? - jacquesm
I'm working on some hobby project and would like to get my hands on this data without bothering Dan (he's got work enough) or re-crawling HN. The various APIs seem to be rather problematic for this (maybe I'm missing something?).
======
smt88
They appear to be sequential. Couldn't you start with the first post and
iterate through them, checking to see if the story is still available?
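For reference, the official Firebase API exposes `/v0/item/<id>.json` (dead items carry a `dead: true` field, and unassigned ids come back as `null`), so the probe can be done without scraping HTML. A minimal serial sketch, with the `classify`/`crawl` helper names being my own:

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def classify(item):
    """Classify one decoded /v0/item/<id>.json payload."""
    if item is None:
        return "missing"        # id never assigned, or not exposed by the API
    if item.get("dead"):
        return "dead"
    if item.get("deleted"):
        return "deleted"
    return "alive"

def fetch(item_id):
    """Fetch a single item's JSON from the Firebase API."""
    with urllib.request.urlopen(f"{API}/item/{item_id}.json") as resp:
        return json.load(resp)

def crawl(start, stop):
    """Serially classify ids in [start, stop) -- slow, but gentle on the API."""
    return {i: classify(fetch(i)) for i in range(start, stop)}
```

Done serially this takes a long time for millions of ids, which is exactly the load/runtime concern raised below.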

~~~
jacquesm
Yes, but given the number of ids in the system that would mean a pretty
hefty load (and a huge runtime).

There has to be a lighter-weight way of doing this, and I'm sure someone on
HN has already built such a list.

~~~
smt88
Well, ideally you'd be doing HEAD requests and HN is serving (almost)
everything from cache, so it should really be no more onerous to their system
than ~12M lookups in an in-memory index (and a few MB of transfer). As for
you... it could definitely be time-consuming if you did it serially. If you
did it in parallel, it wouldn't take long but would be "bad citizen" behavior
toward HN's servers. There might be a happy medium, but only HN knows what
that would be.

I think that if no one speaks up with a database, you should approach HN or
Algolia directly. If they wanted you to have this info, it would be trivial
for them to dump it (and less strenuous on their resources).

In fact, I'd suggest scraping Algolia (sorted by date[1]) instead of the
method I proposed earlier, since it would require far, far less overhead.

1.
[https://hn.algolia.com/?query=&sort=byDate&prefix&page=0&dat...](https://hn.algolia.com/?query=&sort=byDate&prefix&page=0&dateRange=all&type=all)
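The Algolia HN Search API behind that page (`/api/v1/search_by_date`) caps paged results, but it can be walked backwards in time with a `created_at_i` numeric filter instead. A sketch of that pagination, with the network call injected as `fetch_json` so the logic is testable; the function names are mine:

```python
import json
import urllib.request

def page_url(before_ts, per_page=1000):
    """Build a search_by_date request for items created before `before_ts`."""
    return ("https://hn.algolia.com/api/v1/search_by_date"
            f"?tags=story&hitsPerPage={per_page}"
            f"&numericFilters=created_at_i<{before_ts}")

def walk_back(fetch_json, until_ts=0):
    """Yield story ids, walking backwards in time one page per request.

    `fetch_json` maps a URL to decoded JSON, e.g.
    lambda url: json.load(urllib.request.urlopen(url)).
    """
    before = 2**31 - 1                      # start from "now"
    while before > until_ts:
        hits = fetch_json(page_url(before)).get("hits", [])
        if not hits:
            break
        for h in hits:
            yield int(h["objectID"])
        # next page: everything older than the oldest hit we just saw
        before = min(h["created_at_i"] for h in hits)
```

One page per request keeps the request count at roughly (total items / 1000), far below per-id probing.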

~~~
jacquesm
Good points.

I've found something I can use to start with: older dumps of comments and
stories. Any id that appears in neither set is a candidate dead item and
worth crawling individually, since dead items are only a very small number
compared to the rest.
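That diffing step is just a set difference over ids. Assuming the dumps yield the set of known ids and the max id is known, a sketch (the function name is mine):

```python
def dead_candidates(known_ids, max_id):
    """Ids in 1..max_id that appear in none of the dumps.

    `known_ids` is the union of ids from the comment and story dumps;
    anything missing is either dead, deleted, or otherwise absent, and
    is worth fetching individually.
    """
    return sorted(set(range(1, max_id + 1)) - set(known_ids))
```

For example, `dead_candidates({1, 2, 4, 6}, 7)` returns `[3, 5, 7]` -- only those ids need an individual lookup.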

That should give me a base of about 7.5M items cataloged to test with. Once I
have that working I can use Algolia, per your suggestion, to fill in the
blanks.

Thank you very much for the suggestions!

