I'm starting 'Introduction to CS' from Udacity (http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012), and I'm planning to build a Hacker News search app over the course. What do you think: is that a good idea or not? I know something similar exists, so here is my plan: the search would run over the domains that Hacker News stories link to. Thank you.
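
As a rough illustration of the plan above, here is a minimal sketch of collecting the domains that a list of story URLs points to. The function name `story_domains` and the sample URLs are made up for illustration; how the story URLs are obtained is left open.

```python
from collections import Counter
from urllib.parse import urlparse


def story_domains(story_urls):
    """Count how often each domain appears in a list of story URLs."""
    domains = Counter()
    for url in story_urls:
        host = urlparse(url).hostname  # None for relative or malformed URLs
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com alike
        domains[host] += 1
    return domains


# Hypothetical sample input, not real HN data.
sample = [
    "http://www.example.com/a",
    "https://example.com/b",
    "http://blog.example.org/post",
    "item?id=1",  # relative link, skipped
]
counts = story_domains(sample)
```

The resulting set of domains would then be the scope that the search app crawls or indexes.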

Keep in mind that if you crawl this site your IP will be banned. At least mine was when I was playing around with a web crawler I built.

There is an unofficial API: http://www.hnsearch.com/api (Provided by the very search engine referred to in the OP, haha!)

Unfortunately there is no API for getting access to personal information on HN (i.e. comments I have made, or stories I've upvoted). You're relegated to scraping if you want that information.

Now there is:

Here is how you can pull a specific username's submissions, and you can add filters: http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

And then here is how you can pull the comments for a specific thread/discussion/id: http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

You can now grab a lot of data. They also enlarged the site's RSS feed in hopes of slowing down a few of the scrapers.

There are a few items missing, but they added a lot: http://www.hnsearch.com/api

Btw, that now includes a user bio, as well as things you've upvoted, etc. It's all just done via filters.

They also boosted the RSS feed to help slow down the scrapers.

You can always get around this by throttling your web crawler. It will take a much longer time, but at least you'll be able to read HN in the meantime.

The tricky thing when doing this is knowing what request rate you can run at without getting permanently banned. I built an Android Market crawler two summers ago, and luckily Google only temp-bans (in my experience), so that might be an easier project without any risk.
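
The throttling idea above could be sketched roughly as follows. This is only an illustration: `crawl_throttled`, its parameters, and the 30-second default are assumptions, not a rate known to be safe for HN or any other site. The `fetch` and `sleep` callables are passed in so the loop can be tried without touching the network.

```python
import time
from typing import Callable, Iterable, List


def crawl_throttled(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 30.0,  # assumed value; pick your own conservative rate
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Dry run with stand-ins for the real fetch and sleep functions.
recorded_sleeps = []
pages = crawl_throttled(
    ["a", "b", "c"],
    fetch=lambda u: u.upper(),
    delay_seconds=5,
    sleep=recorded_sleeps.append,
)
```

In real use, `fetch` would be whatever HTTP call the crawler already makes, and the delay would be tuned slowly upward from something very conservative.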

Respecting robots.txt is probably the best plan.
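
For what it's worth, Python's standard library can do the robots.txt check. Below is a minimal sketch using `urllib.robotparser`; the helper name `is_allowed`, the user agent string, and the example rules are hypothetical and are not the actual robots.txt of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

blocked = is_allowed(EXAMPLE_RULES, "my-crawler", "http://example.com/private/x")
allowed = is_allowed(EXAMPLE_RULES, "my-crawler", "http://example.com/news")
```

A crawler would download the site's real robots.txt once, then call a check like this before each request.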

Use disposable IPs.

Why would you be banned? Do you know the reason?

To prevent people from (unintentionally) DDoSing the site.

There are a bunch of similar ones, but I don't think any have been updated for the newly added API features. Go for it! http://www.hnsearch.com/api

Here are most of the other apps that are still up: http://news.ycombinator.com/item?id=2672826

Speaking of this, there should be one. And an "Archive" as well.

Only a handful of topics make it to the top every day, but the majority of the content is high-quality stuff which, if applied in the right place, would help many. What say?

I think a practical way to implement a simple version would be through Google CSE, but you would have more control if you rolled your own search.
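
To give a feel for what "rolling your own" might involve at its simplest, here is a sketch of a tiny inverted index with AND-style keyword matching. Everything in it (function names, sample titles, the tokenization rule) is an assumption for illustration; a real search app would also need ranking, stemming, and persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up story titles, not real HN data.
docs = {
    1: "Show HN: a tiny web crawler",
    2: "Ask HN: best crawler throttling rate?",
    3: "Python web frameworks",
}
index = build_index(docs)
hits = search(index, "web crawler")
```

With the sample titles above, the query "web crawler" matches only document 1, since it is the only title containing both words.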

