

Any good API to scrape HN other than this? - kaushikfrnd
https://github.com/karan/HackerNewsAPI
How can I scrape HN other than with
https://github.com/karan/HackerNewsAPI? Any good premade library in
Python?
======
carbocation
The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30

So nominally you should feel free to set up a scraper that crawls one non-
disallowed resource every 30 seconds.
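
A minimal polite-crawler sketch using only the standard library (note
that urllib.robotparser needs Python 3.6+ for crawl_delay, and the page
list here is just illustrative):

    import time
    import urllib.request
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser(
        "https://news.ycombinator.com/robots.txt")
    rp.read()
    delay = rp.crawl_delay("*") or 30  # fall back to the documented 30s

    for url in ["https://news.ycombinator.com/news",
                "https://news.ycombinator.com/newest"]:
        if not rp.can_fetch("*", url):
            continue  # skip anything robots.txt disallows
        html = urllib.request.urlopen(url).read()
        # ... parse `html` here ...
        time.sleep(delay)  # one fetch per crawl-delay window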

~~~
rrpadhy
I see that submitted and threads are also disallowed.

What is a safe rate to crawl this data if I absolutely need it? 30
minutes between users? 1 hour between users?

~~~
kaushikfrnd
I have been following the HNSearch API for a while and found that they
crawl user submission URLs and user detail URLs every 2-4 hours or so.

------
napoleond
Just use [https://www.hnsearch.com](https://www.hnsearch.com), along with
[https://www.hnsearch.com/rss](https://www.hnsearch.com/rss) and
[https://www.hnsearch.com/bigrss](https://www.hnsearch.com/bigrss) if you want
to mimic the front page.

There is rarely a need to scrape HN directly, but if you do, make sure
your bot is polite (especially with respect to rate limits).

~~~
kaushikfrnd
I am trying to fetch all posts and comments, plus all user data. I will
try hnsearch.

------
goldenkey
Yahoo Pipes would work really well if you're willing to write a few
HTML regexes or DOM element selectors.

[http://pipes.yahoo.com/pipes/](http://pipes.yahoo.com/pipes/)

------
jcla1
Not a full-featured API, but a way to scrape all of HN:
[http://jcla1.com/blog/2013/05/13/crawling-hackernews/](http://jcla1.com/blog/2013/05/13/crawling-hackernews/)

Disclaimer: It's my own blog

Edit: it uses HNSearch, so it doesn't violate the robots.txt and can be
crawled faster

~~~
zerd
Did you manage to download the whole database that way? Edit: Also, why didn't
you use the "start" (offset) parameter?

~~~
jcla1
No, I haven't tried to download it all yet. Regarding your question: if
you try to use a start > 999 you get this error: "Validation error: max
limit is 100, max start+limit is 1000", which is why I avoided that
parameter.
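
For reference, something like this stays inside those limits. It's an
untested sketch; the items/_search endpoint and the start/limit
parameters are taken from the HNSearch API base URL mentioned elsewhere
in this thread:

    import json
    import urllib.request

    BASE = "http://api.thriftdb.com/api.hnsearch.com/items/_search"

    def fetch_page(start, limit=100):
        # limit <= 100 and start + limit <= 1000, per the API error
        url = "%s?start=%d&limit=%d" % (BASE, start, limit)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    items = []
    for start in range(0, 1000, 100):
        # "results" as the key for hits is an assumption
        items.extend(fetch_page(start).get("results", []))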

------
obayesshelton
You don't even need an API; all you need is an RSS reader and this
feed: [https://news.ycombinator.com/rss](https://news.ycombinator.com/rss)
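
Reading it takes a few lines of standard-library Python; a rough
sketch:

    import urllib.request
    import xml.etree.ElementTree as ET

    with urllib.request.urlopen("https://news.ycombinator.com/rss") as resp:
        root = ET.parse(resp).getroot()

    # each <item> in the feed carries a story title and link
    for item in root.iter("item"):
        print(item.findtext("title"), "->", item.findtext("link"))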

------
deft
I wrote an alright one in Python for use in my HN app for BlackBerry
10. Not sure how good it is, but check it out here:
[https://github.com/krruzic/Reader-YC/tree/master/app](https://github.com/krruzic/Reader-YC/tree/master/app)

I'm not sure what you're trying to do, though. I used BeautifulSoup
because I couldn't get lxml working on BB10, but if it were switched to
lxml it would be much faster.
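
The swap is small, since BeautifulSoup can use lxml as its backend. A
rough sketch (the td.title selector reflects HN's current markup and is
an assumption):

    from bs4 import BeautifulSoup

    try:
        import lxml  # only checking availability
        PARSER = "lxml"         # fast C-based backend
    except ImportError:
        PARSER = "html.parser"  # pure-Python fallback (e.g. on BB10)

    def titles(html):
        soup = BeautifulSoup(html, PARSER)
        return [a.get_text() for a in soup.select("td.title a")]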

------
shamsulbuddy
[http://hnapp.com/](http://hnapp.com/) \-- This is the best scraped-HN
site; it returns data in JSON/RSS format.

------
mikektung
Depending on what you're trying to do with the data, you may find
[http://diffbot.com/products/automatic/](http://diffbot.com/products/automatic/)
helpful for getting the clean article text and categorization in JSON format.
It can be used as a complement/augmentation to the great suggestions here for
getting the links.

Disclosure: Founder of Diffbot here.
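
A rough example against the v2 article endpoint (the URL, parameter,
and field names here may be off, so check the docs; the token is a
placeholder):

    import json
    import urllib.parse
    import urllib.request

    DIFFBOT_TOKEN = "your-token-here"  # placeholder
    target = "http://example.com/some-article"

    qs = urllib.parse.urlencode({"token": DIFFBOT_TOKEN, "url": target})
    with urllib.request.urlopen(
            "http://api.diffbot.com/v2/article?" + qs) as resp:
        article = json.load(resp)

    print(article.get("title"))
    print(article.get("text", "")[:200])  # clean article text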

------
dmpayton
I wrote a Python wrapper for the iHackerNews API, if that helps.

[https://github.com/dmpayton/python-ihackernews](https://github.com/dmpayton/python-ihackernews)

~~~
kaushikfrnd
I saw your GitHub repo. Wonderful work, but the API was not working; I
got some errors when I tried the link
[http://api.ihackernews.com/by/kaushikfrnd](http://api.ihackernews.com/by/kaushikfrnd).
Can you confirm it will work if I run it on my own server?

~~~
dmpayton
Ah, looks like there's an issue with the iHackerNews API itself, which I don't
have a hand in. You'll want to hit up @ronnieroller on Twitter. Sorry I can't
be of more help. :/

------
droid_w
There's a Twitter feed based on HN -
[https://twitter.com/newsycombinator](https://twitter.com/newsycombinator)

You can use the Twitter API and read from there.
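
With tweepy, that's roughly the following (untested; assumes you have
API credentials from dev.twitter.com, and method names can vary between
tweepy versions):

    import tweepy

    # fill in your own application credentials
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    # latest front-page stories as tweeted by @newsycombinator
    for tweet in api.user_timeline(screen_name="newsycombinator", count=30):
        print(tweet.text)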

------
amirouche
There are hundreds of data sets out there; why must it always be HN?

~~~
zerd
Because quality datasets are hard to get. E.g. on reddit you would just get
cats and memes.

------
mvanveen
I have a Scrapy-based crawler project available at
[http://github.com/mvanveen/hncrawl](http://github.com/mvanveen/hncrawl)

------
cheeaun
I built [https://github.com/cheeaun/node-hnapi](https://github.com/cheeaun/node-hnapi)

------
kaushikfrnd
Can anyone tell me how to get
[https://news.ycombinator.com/news](https://news.ycombinator.com/news)
through the HNSearch API? I want the API link ->
[[http://api.thriftdb.com/api.hnsearch.com/](http://api.thriftdb.com/api.hnsearch.com/)]!

------
rotub
[https://www.hnsearch.com/api](https://www.hnsearch.com/api)

------
jenjenhar
Out of curiosity, why does HN not release an official API?

~~~
code_duck
My impression is that pg wants to encourage the hacker spirit by
providing a bare-bones service which could easily have a 'hacked' API
built upon it.

~~~
taliesinb
_My_ impression is that HN's link and comment data is too valuable for pg to
give away.

Certainly, if I had access to it, I know I could do some pretty useful
sociology on HN's audience (= the pool of startup hire material).

~~~
code_duck
I don't believe that HN restricts or discourages the scraping of HN
content in any way, other than the restrictions here:
[https://news.ycombinator.com/robots.txt](https://news.ycombinator.com/robots.txt)

If you have a fabulous idea for how to use the data contained on this site,
I'm sure everyone will be impressed and interested to see it.

------
fakename
other than this

------
notastartup
I wrote [http://scrape.it](http://scrape.it) and
[http://scrape.ly](http://scrape.ly) to do this.

------
culo
try these

\- [https://www.mashape.com/scrape/scrape-it#!documentation](https://www.mashape.com/scrape/scrape-it#!documentation)

\- [https://www.mashape.com/karangoel/hnify#!documentation](https://www.mashape.com/karangoel/hnify#!documentation)

~~~
notastartup
Haha, good to see someone link it! I am the author of Scrape.it,
currently on Mashape. I also wrote [http://scrape.ly](http://scrape.ly)
for crawling web pages and extracting data.

