
HN2JSON: A Ruby gem for Hacker News - jcla1
https://rubygems.org/gems/hn2json
======
dfc
Be careful not to hammer the site. Your IP could be added to the blocklist if
you are too aggressive:

 _"Yes, we block IPs that seem to be crawlers ignoring robots.txt. We've
always blocked abusive IPs, but I tightened up the blocking a few weeks ago. A
lot of people were crawling HN, most of them unnecessarily because they were
doing things they could have done more efficiently through HNSearch's API[1]."
--pg_ [2]

[1] <http://www.hnsearch.com/api>

[2] <http://news.ycombinator.com/item?id=3196298>
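One way to stay on the safe side is to put a fixed pause between requests. Below is a minimal sketch of such a throttle in Ruby. The `Throttle` class, its parameters, and the 30-second interval are illustrative assumptions, not part of HN2JSON or any HN policy quoted above; check robots.txt for the delay the site actually asks for.

```ruby
# Hypothetical helper: enforces at least `min_interval` seconds between calls.
# The clock and sleeper are injectable so the behaviour can be checked
# without actually waiting.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last = nil # time of the previous call, nil before the first one
  end

  # Waits if the previous call was too recent, then runs the block
  # and returns its result.
  def call
    if @last
      wait = @min_interval - (@clock.call - @last)
      @sleeper.call(wait) if wait > 0
    end
    @last = @clock.call
    yield
  end
end

# Usage (assumed interval, not an official figure):
#   throttle = Throttle.new(30)
#   ids.each { |id| throttle.call { HN2JSON.find(id) } }
```

The first call goes through immediately; every later call sleeps only for whatever is left of the interval.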

------
mmackh
I've written a script that extracts HN, which anyone is welcome to use. I use
it for the Hacker News iPhone app:

<http://api.thequeue.org/hn/frontpage.xml>

<http://api.thequeue.org/hn/new.xml>

<http://api.thequeue.org/hn/best.xml>

------
markburns
item = HN2JSON.find 4623690

NoMethodError: undefined method `url=' for #<HN2JSON::Entity:0x007fb84cd63a88>

from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:92:in `block in get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:92:in `add_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:91:in `get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:71:in `get_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:56:in `initialize'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `new'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `find'

~~~
jcla1
Sorry, I forgot to update the gem on rubygems.org. Just install the gem
again now.

~~~
markburns
Cool thanks. Might be nice to override the inspect method to display something
nicer.

~~~
jcla1
Yeah! The idea is to return the object in JSON
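As a rough sketch of both ideas (a friendlier `inspect` and a JSON representation), something like the following could work. The `Item` class and its fields are hypothetical stand-ins for illustration, not the gem's actual `HN2JSON::Entity`.

```ruby
require 'json'

# Hypothetical stand-in for a fetched HN item; not the gem's real class.
class Item
  attr_reader :id, :title, :url

  def initialize(id:, title:, url:)
    @id = id
    @title = title
    @url = url
  end

  # Plain hash view, used as the single source for the JSON output.
  def to_h
    { id: id, title: title, url: url }
  end

  # Lets `item.to_json` and `JSON.generate(item)` both work.
  def to_json(*args)
    to_h.to_json(*args)
  end

  # Short, readable form for irb instead of the default object dump.
  def inspect
    "#<Item id=#{id} title=#{title.inspect}>"
  end
end
```

With this in place, irb shows the short form, while `item.to_json` still gives the full record.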

------
rdudekul
Going through the code on GitHub to see how an HN page is parsed was
informative. I may use this to create a similar parser in Node.js. My interest
is in building an intelligent agent that filters content based on my interests
(for example: coding, customer acquisition, hiring) and notifies me on a
daily or weekly basis.

~~~
jcla1
I have written a program similar to what you just described, also on
GitHub: <https://github.com/jcla1/hackernews>

------
selvan
Check out apify - <http://apify.heroku.com/resources> & scrapify -
<https://github.com/sathish316/scrapify>, libraries to scrape HTML content as
JSON data.

------
mvanveen
I wrote a small, Scrapy-based HN crawler, available at
<http://github.com/mvanveen/hncrawl> in case anyone is interested.

------
qmacro
Excellent! I know I'm biased but I also know you've put a lot of effort into
this. Well done Joseph.

------
why-el
Nice work. Does Chronic have to be a runtime dependency?

~~~
jcla1
Not really, but at the time I didn't want to have to write my own date parser.
(HN doesn't show the date, just things like "x days ago")
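For the narrow "x days ago" case, a small hand-rolled parser might be enough to drop the dependency. The sketch below is an assumption about what such a parser could look like, not the gem's code; the `parse_relative_time` name and the set of supported units (minutes, hours, days) are illustrative.

```ruby
# Seconds per unit for the units assumed to appear in "N unit(s) ago" strings.
UNIT_SECONDS = {
  'minute' => 60,
  'hour'   => 3600,
  'day'    => 86_400
}.freeze

# Converts a string such as "3 days ago" into an approximate Time.
# Returns nil when the text does not match the expected shape.
# `now` is injectable so the result is reproducible.
def parse_relative_time(text, now: Time.now)
  match = text.match(/\A\s*(\d+)\s+(minute|hour|day)s?\s+ago\s*\z/i)
  return nil unless match

  now - match[1].to_i * UNIT_SECONDS.fetch(match[2].downcase)
end
```

The result is only as precise as the page text: "3 days ago" gives a timestamp exactly 72 hours back, which may be off by up to a day from the real posting time.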

