

Ask HN: What is the best way to scrape HN profiles? - dewey

I'm currently working on a small weekend project where I need to get the HN karma value for a given subset of users. (It's just a fun exercise and I'll share it once it's ready.)

I'm using the HN API provided by Algolia [0] but I realised it's just updating the values every time there's a new post or submission from that user.

Is anyone aware of an API where I could grab this value without getting IP banned by HN, like I probably would if I accessed/scraped 1000 user profiles every hour?

Thanks!

Edit: If it's updated once every 24h, that's perfectly fine for my use case.

[0] https://hn.algolia.com/api/v1/users/dewey
======
minimaxir
The Algolia API allows for 10,000 calls per hour so you wouldn't get IP banned
for your use case.
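
As a rough sketch of what one lookup could look like in Python, using the user endpoint from the original post: the `karma` field name and the helper names here are assumptions for illustration, not something confirmed in this thread.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint taken from the original post; everything else is illustrative.
API_ROOT = "https://hn.algolia.com/api/v1/users/"
CALLS_PER_HOUR = 10_000  # the limit quoted in this comment


def user_url(username):
    """Build the Algolia user URL for one username."""
    return API_ROOT + quote(username, safe="")


def parse_karma(payload):
    """Extract karma from a JSON response body (assumes a 'karma' field)."""
    return int(json.loads(payload)["karma"])


def fetch_karma(username, timeout=10):
    """Fetch and parse one profile. Performs a real network request."""
    with urlopen(user_url(username), timeout=timeout) as resp:
        return parse_karma(resp.read().decode("utf-8"))


def min_seconds_between_calls(calls_per_hour=CALLS_PER_HOUR):
    """Smallest average gap between calls that stays within the hourly limit."""
    return 3600 / calls_per_hour
```

At 10,000 calls per hour the average gap works out to 0.36 seconds, so 1000 lookups per hour use a tenth of the budget.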

~~~
dewey
I'm aware of that, the problem is:

> it's just updating the values every time there's a new post or submission
> from that user.

Source: [http://hn.algolia.com/about](http://hn.algolia.com/about)

~~~
byoung2
So the problem is that the karma count in the API is not updated if a user has
not submitted anything recently? So a comment from 2 days ago that is still
getting upvoted would not update. You could scrape user profiles at a rate of
2 per minute and still respect robots.txt
([https://news.ycombinator.com/robots.txt](https://news.ycombinator.com/robots.txt)).
But to scrape more and be a good citizen, use the API with limitations, or
some combination of the two (e.g. if you have a list of 1000 users, use the
API first, and for users who may not have posted in the last 48 hours, scrape
them directly). You could also try contacting YC staff and asking for
permission to scrape at a higher rate.
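
A minimal sketch of that combination, assuming you already track each user's most recent post time yourself (the `last_activity` mapping and both function names are hypothetical): users active within 48 hours go to the API, everyone else is queued for direct scraping at one profile every 30 seconds, i.e. 2 per minute.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)  # threshold suggested above
CRAWL_DELAY_SECONDS = 30           # 2 profiles per minute


def split_by_freshness(last_activity, now, stale_after=STALE_AFTER):
    """Partition usernames into (fresh, stale).

    last_activity maps username -> datetime of the latest known post,
    or None if unknown. Fresh users can be read from the API; stale
    users need a direct profile scrape.
    """
    fresh, stale = [], []
    for user, last_seen in last_activity.items():
        if last_seen is not None and now - last_seen <= stale_after:
            fresh.append(user)
        else:
            stale.append(user)
    return fresh, stale


def scrape_schedule(stale_users, start, delay_seconds=CRAWL_DELAY_SECONDS):
    """Assign each stale user a scrape time, spaced delay_seconds apart."""
    return [
        (user, start + timedelta(seconds=i * delay_seconds))
        for i, user in enumerate(stale_users)
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    activity = {
        "alice": now - timedelta(hours=3),  # made-up sample data
        "bob": now - timedelta(days=5),
        "carol": None,
    }
    fresh, stale = split_by_freshness(activity, now)
    print("API:", fresh)
    for user, when in scrape_schedule(stale, now):
        print("scrape", user, "at", when.isoformat())
```

Even in the worst case where all 1000 users are stale, 1000 scrapes at 30 seconds each take about 8.3 hours, which fits inside a 24-hour refresh cycle.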

~~~
dewey
> So the problem is that the karma count in the API is not updated if a user
> has not submitted anything recently?

Exactly.

> or some combination of the two (e.g. if you have a list of 1000 users, use
> the API first, and for users who may not have posted in the last 48 hours,
> scrape them directly)

That's a good idea, I'll try that for now. Thanks!

