Python API for Hacker News (github.com)
53 points by karangoeluw on Sept 12, 2013 | 32 comments



That library is in many ways deprecated and broken. First, it uses only old-style classes because it doesn't inherit from object explicitly. Furthermore, it uses print in a method; it would be more "Pythonic" to return a str object formatted with str.format.

I think the future is Python 3, and new implementations in Python 2 syntax are simply unnecessary. I would suggest the usage of Python-3-style syntax, which is also valid in Python 2.7 (which isn't hard).
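
To illustrate both points, here is a minimal sketch that should run unchanged on Python 2.7 and Python 3. The Story class, its fields, and the summary method are hypothetical stand-ins, not the library's actual code.

```python
from __future__ import print_function  # makes print a function on Python 2.7 too


class Story(object):  # inheriting from object gives a new-style class on Python 2
    """Hypothetical container for one front-page story."""

    def __init__(self, title, points):
        self.title = title
        self.points = points

    def summary(self):
        # Return the formatted text instead of printing it, so the caller
        # decides whether to print, log, or test the result.
        return "{0} ({1} points)".format(self.title, self.points)


story = Story("Python API for Hacker News", 53)
print(story.summary())
```

Returning the string keeps the method free of side effects; printing becomes the caller's choice.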


> First, it uses only old-style classes because it doesn't inherit from object explicitly.

Please explain this further.

> usage of Python-3-style syntax, which is also valid in Python 2.7

Will do this



Alright. Fixed.


I tried building a REST API once for a challenge if anyone is interested: https://github.com/mapleoin/newhackers


Nice effort. Just a few remarks:

- You should certainly use Requests http://docs.python-requests.org/en/latest/

- The Story class seems somewhat redundant. You could possibly use collections.namedtuple as a container for properties or simply a dictionary. The print_story method could just be the __str__ special method.

- JSON output would be useful.
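
A rough sketch of how the last two remarks could fit together, assuming a story only needs a rank, a title, and a point count (these field names are illustrative, not taken from the library):

```python
import json
from collections import namedtuple


class Story(namedtuple("Story", ["rank", "title", "points"])):
    """Hypothetical immutable record for one story; field names are assumed."""

    __slots__ = ()  # keep instances as lightweight as a plain tuple

    def __str__(self):
        # Takes the place of a separate print_story method.
        return "{0}. {1} ({2} points)".format(self.rank, self.title, self.points)


def to_json(story):
    # _asdict() is the documented namedtuple method that maps field names to values.
    return json.dumps(story._asdict(), sort_keys=True)


story = Story(rank=1, title="Python API for Hacker News", points=53)
print(story)
print(to_json(story))
```

With __str__ defined, print(story) and str(story) give the readable form, while to_json gives machine-readable output from the same record.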


I will try and implement these. Thanks for the suggestions.


Fixed.



Not OP, but from a quick glance at the source it doesn't appear to. It downloads the pages and uses Beautiful Soup (a Python HTML parsing library).


Nope, it's scraping.


Is that an official API? How long has it been around?


Completely unofficial. I started creating it a month ago.


Wow, that's great. I use another one and it's quite unreliable. Thanks!


You can use mine and compare the two, and based on your feedback either I or any other dev can improve it.


It scraped, slices.


I think screen scraping is not allowed by HN. A few tries with these APIs might get your IP banned!


The robots.txt file doesn't seem to disallow scraping. https://news.ycombinator.com/robots.txt


Scraping the listing pages seems allowed though.


I don't think there's a prohibition on screen scraping, but if you make too many requests to the server in a certain amount of time, your IP will be banned to prevent the server from melting.
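
One way a scraper could stay under such a limit is to enforce a minimum delay between requests. The sketch below is only an illustration: the Throttle class is hypothetical, the interval is an arbitrary demo value (the actual threshold HN applies is not stated here), and the fetch itself is left out.

```python
import time


class Throttle(object):
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        # Sleep just long enough that calls are at least min_interval apart.
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now


throttle = Throttle(min_interval=0.05)  # assumed value, for demonstration only
for page in ("news", "newest"):
    throttle.wait()
    # A real scraper would fetch the page here; that call is omitted on purpose.
    print("ok to request", page)
```

Calling wait() before every request spaces them out regardless of how fast the surrounding loop runs.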


Agreed. Also, HN has RSS.


I don't get why you're using a try except block for the num_comments variable. You shouldn't be casting to an int if it doesn't have the attribute.


The meta text on any page can be this:

> 21 points by johns 15 minutes ago | discuss

or

> 152 points by ar7hur 3 hours ago | 58 comments

If the regex matches (case 2), then I cast it to an int. Otherwise (case 1), there are 0 comments.
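
For comparison, the same two cases can be handled by checking the match before casting, with no try/except. This is a sketch of that alternative, not the library's code; the pattern and the num_comments helper are assumptions, including the guess that a single comment is shown as "1 comment".

```python
import re

# Matches "58 comments" (or, by assumption, "1 comment") at the end of the meta line.
COMMENTS_RE = re.compile(r"(\d+)\s+comments?\s*$")


def num_comments(meta_text):
    match = COMMENTS_RE.search(meta_text)
    # No match means the line ends in "discuss", i.e. no comments yet.
    return int(match.group(1)) if match else 0


print(num_comments("21 points by johns 15 minutes ago | discuss"))     # case 1
print(num_comments("152 points by ar7hur 3 hours ago | 58 comments"))  # case 2
```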


It's silly to use BeautifulSoup to parse the page when you could use a simple RegEx:

<td class=\"title\"><a href=\"(.?)\"(.?)>(.?)</a>(.?)</td>


"HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain"

http://stackoverflow.com/questions/1732348/regex-match-open-...


I am willing to sacrifice my soul and everything that is holy.


Regex to parse HTML is probably the single worst thing you can do.


Crafting a wide purpose regex to parse whatever HTML comes in is bad.

Building a regex to extract relevant data from simple, fixed-form page data, bypassing tags irrelevant to the problem at hand is not.


...until the HTML changes.

I haven't looked at their parsing code, so I have no idea if it is any better than using a regex, but if the regex assumes too much, simply reordering the attributes in a tag (or something similar) could break a regex-based solution.
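
A small Python 3 sketch of that failure mode, using only the standard library. It assumes the pattern posted above was meant to read (.*?) in each group; the HTML snippets and the TitleLinks class are made up for illustration and are not HN's real markup or the library's parser.

```python
import re
from html.parser import HTMLParser

# The pattern from the thread, assuming the lost asterisks read ".*?".
TITLE_RE = re.compile(r'<td class="title"><a href="(.*?)"(.*?)>(.*?)</a>(.*?)</td>')


class TitleLinks(HTMLParser):
    """Collect (href, text) for links inside <td class="title"> cells."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_cell = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute order no longer matters
        if tag == "td" and attrs.get("class") == "title":
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._href = attrs.get("href")

    def handle_data(self, data):
        # Pair the first text after an <a> tag with its href.
        if self._href is not None:
            self.links.append((self._href, data))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


original = '<td class="title"><a href="http://example.com">Example</a></td>'
reordered = '<td class="title"><a rel="nofollow" href="http://example.com">Example</a></td>'

for html in (original, reordered):
    parser = TitleLinks()
    parser.feed(html)
    print(bool(TITLE_RE.search(html)), parser.links)
```

The regex matches the first snippet but not the second, where an extra attribute precedes href, while the parser-based version extracts the same link from both.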


Some people, when confronted with a problem... bah you know the rest.


Arg, there should be asterisks after every period.


BeautifulSoup is great, as long as you're using the open-source HTML5 parser from Google. https://github.com/google/gumbo-parser



