
Show HN: I wrote a tiny Python-based HN crawler with scrapy - mvanveen
https://github.com/mvanveen/hncrawl
======
shadowsun7
Here's mine, written on a lazy Sunday evening in January:
<https://github.com/shadowsun7/hacker-news-confidence>

It runs on cron, scrapes HN once every 30 minutes, and sorts it according to
the Wilson score confidence interval for a Bernoulli parameter (the one from
the 'How Not To Sort' article here:
<http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>)

See the results here: <http://hn.elijames.org/>

I've found that the HN articles I like tend to have high vote counts compared
to their comment counts. Mostly this makes sense: an article from Scientific
American gets more votes than comments, whereas an article on _Why PHP Sucks_
(which I find boring) would be highly controversial and draw a lot of
comments.

So I've treated comments as a negative signal. Mostly this works - I find
myself skimming more effectively from my page than from HN itself.
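
One plausible way to wire that up (an assumption on my part, not necessarily
how the linked repo does it) is to feed upvotes in as the positive trials and
comments as the negative ones:

    # Hedged sketch: votes count as successes, comments as failures, reusing
    # the wilson_lower_bound() sketch above; a higher score means fewer
    # comments relative to votes.
    def story_score(votes, comments):
        return wilson_lower_bound(votes, votes + comments)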

~~~
willvarfar
nice page :)

Of course, by commenting here you are actually pushing this article downwards
on your own page ;)

------
jacquesm
Please use this:

<http://api.ihackernews.com/>

HN is slow enough as it is ;)
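
For what it's worth, a hedged sketch of pulling the front page from that API
instead of scraping; the endpoint path and the JSON field names here are
assumptions, so check the API's own docs:

    import requests

    # '/page' and the 'items'/'title'/'url' keys are assumed, not verified.
    resp = requests.get('http://api.ihackernews.com/page')
    resp.raise_for_status()
    for item in resp.json().get('items', []):
        print(item.get('title'), item.get('url'))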

~~~
mvanveen
I wanted to use it! Originally I was targeting this API, but I was getting a
huge number of 500 errors while developing.

The comments on the right of that page all complain about the stability of
the service, and it was just too frustrating to use when I wanted to get
something up fast.

Also, keep in mind that the tool only scrapes one HN page per crawl (the home
page). It's up to the consumer to be polite, but I hope that this tool is used
responsibly.

~~~
jacquesm
I'm sure your intentions are good. Did you take up the errors issue with the
makers of the api?

HN is built using a homebrew stack, it's a miracle it performs as well as it
does.

What you could do is cache the results and point your tool at the cache; that
would already be much better. After all, if your conclusion is that the HN API
is broken, then maybe provide a better API rather than a tool that hits the
source?
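
A minimal local variant of that idea (a sketch, not necessarily what's being
suggested above) is Scrapy's built-in HTTP cache, so repeated runs during
development reuse responses instead of re-hitting HN:

    # settings.py (Scrapy): cache downloaded responses on disk.
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 1800   # reuse cached pages for 30 minutes
    HTTPCACHE_DIR = 'httpcache'        # stored under the project's .scrapy dir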

~~~
willvarfar
> HN is built using a homebrew stack, it's a miracle it performs as well as it
> does.

Surely it'd be sloppy oversight if the HN site couldn't be hosted on a
homebrew stack?

People seem to imagine you need a constellation of mongo and load balancers
and everything else just as a baseline for a helloworld web-app.

Paul rocks.

~~~
jacquesm
HN traffic is pretty high. Also, I believe that RTM had a hand in optimizing
HN:

<http://news.ycombinator.com/item?id=2120756>

And comparing HN to a helloworld web-app is simplifying things a bit; the fact
that it is sparsely designed does not mean there isn't a significant amount of
work done under the hood.

------
jimmy2times
I haven't run the crawler so I'm not sure what else it does, but if it only
parses the home page and fetches the external links, why not read
<http://news.ycombinator.com/rss> (you can use the feedparser module) and
download the pages with urllib? No scraping involved.
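
A rough sketch of that approach (feedparser plus the standard library, using
the Python 3 spelling of urllib):

    import urllib.request

    import feedparser

    feed = feedparser.parse('http://news.ycombinator.com/rss')
    for entry in feed.entries:
        print(entry.title, entry.link)
        html = urllib.request.urlopen(entry.link).read()  # fetch the linked page
        # ... hand `html` off to whatever processing you need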

------
awsedwsq
Scrapy, what a fantastic framework it is!

One thing bothers me with your code though: the following could be replaced
with built-in Scrapy tools.

Edit: The code got wrongly formatted in here, check it out on PasteBin
<http://pastebin.com/2QzWgWxN>

~~~
d0mine
Prefix the code with spaces to format it:

    
    
      # Without using BeautifulSoup: hxs is an HtmlXPathSelector built from
      # the response in the spider's parse() callback, and NewsItem is the
      # project's Item subclass.
      for item in hxs.select('//td[@class="title"]/a'):
          news_item = NewsItem()
          news_item['title'] = item.select('text()').extract()[0]
          news_item['url'] = item.select('@href').extract()[0]

------
wrath
I'm curious why you used BeautifulSoup when scrapy has its own built-in HTML
parser (HtmlXPathSelector).

Is there an advantage to BeautifulSoup or is it just the tool that you're most
comfortable with?

------
DanielRibeiro
Don't forget the Crawl-delay on HN's robots.txt:
<http://news.ycombinator.com/robots.txt>
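
A minimal sketch of honoring that in Scrapy's settings.py (Scrapy doesn't read
Crawl-delay out of robots.txt by itself, so the delay has to be set by hand;
the value should match whatever HN's robots.txt actually specifies):

    # settings.py: throttle requests and obey robots.txt disallow rules.
    DOWNLOAD_DELAY = 30     # seconds between requests; match HN's Crawl-delay
    ROBOTSTXT_OBEY = True   # respects Allow/Disallow, but not Crawl-delay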

~~~
mvanveen
It should only scrape the front page once per run, as the initial request that
kicks off all the others. But thanks for the heads up!

I'll be sure to mention this in the docs.

_Edit_: changes have been posted. Thanks for the suggestion!

------
zalew
relevant: <http://zalew.net/2011/12/08/grab-your-hackernews-stories-and-comments-python/> ;)

~~~
mvanveen
Cool link, thanks!

It looks like he's using methods from an existing Python CLI interface project
to scrape the site. Definitely a creative way to go, although I personally
think starting from scratch wasn't too bad either.

~~~
zalew
yeah, I modified his script a bit. I chose his script because it logs you in,
so you can grab _your_ content.

btw, you have a hardcoded path in there: /Users/mvanveen/root/dev/news/out/

~~~
mvanveen
Nice catch, much appreciated! It's fixed. Pushing it up now.
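
For reference, one common way to avoid that sort of hardcoded path (a sketch;
the constant name is mine, not necessarily what the repo now uses):

    import os

    # Resolve the output directory relative to this module instead of
    # hardcoding an absolute /Users/... path.
    OUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'out')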

------
zerop
On the legal front, what all should I consider before crawling a website,
apart from honoring robots.txt?

~~~
klapinat0r
Depending on your use-case, their TOS. (See IMDb for a relevant example:
<http://app.imdb.com/find?api=v1&appid=iphone1&locale=en_US&q=shrek>;
see the bottom comment of that link.) That is, most websites which provide
unique info will not let you re-distribute this (be it an IMDb score, a
summary of a movie written by one of their users, etc.).

Edit: for personal use there is not much to take into account other than what
you mention. It is not illegal to "see" a webpage; it's more a question of
what you do with the content. In fact, one could argue that scraping is better
than refreshing in your browser (you'll hit them with fewer GET requests if
done properly, since you won't download js/css dependencies).

------
the_cat_kittles
Why not just use requests and lxml? You could do the whole thing in about 10
lines in one file.
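
Something like this, presumably (a sketch; the XPath matches the old HN markup
shown elsewhere in this thread):

    import requests
    from lxml import html

    page = requests.get('http://news.ycombinator.com/')
    tree = html.fromstring(page.content)
    for link in tree.xpath('//td[@class="title"]/a'):
        print(link.text_content(), link.get('href'))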

------
kuromitsu
I think many people will find this useful. I certainly do!

~~~
mvanveen
Thanks, I hope so! Glad you like it.

It's pretty tiny, feel free to fork/contribute!

------
zerop
Good one. How would you find the URLs of all the pages on a website?

~~~
etcet
If you have a URL and want to find all of a site's pages, there's only so much
you can do (without considering external links).

1\. Crawl all the links (a rough sketch follows after this list)

2\. Check for directory listings on all path combinations (those discovered in
hrefs and srcs)

3\. Check robots.txt for any other discoverable pages

4\. Brute force expected or possible URLs

5\. Try and parse more links from any javascript (probably hard)

That's all I can think of.
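
A rough sketch of option 1 (crawling all the internal links), assuming
requests and lxml are available; it only follows same-host links:

    import requests
    from lxml import html
    from urllib.parse import urljoin, urlparse

    def find_pages(start_url):
        # Breadth-first crawl collecting every same-host URL it can reach.
        host = urlparse(start_url).netloc
        seen, queue = set(), [start_url]
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            tree = html.fromstring(requests.get(url).content)
            for href in tree.xpath('//a/@href'):
                link = urljoin(url, href).split('#')[0]
                if urlparse(link).netloc == host and link not in seen:
                    queue.append(link)
        return seen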

~~~
zerop
Thanks...On the legal front, what all should I consider before crawling a
website, apart from honoring robots.txt?

