HackerNews API: What if HN does not have API? Make API on the fly with APIfy (heroku.com)
130 points by sathish316 690 days ago | comments


fizx 690 days ago | link

Hah! tectonic and I applied to YC with almost exactly this in 2009?!

We went as far as building a browser-based IDE-like environment for generating these, and a language called parsley for expressing the scrapes. If you're interested in this, you could check out some of our related open source libraries:

Edit: I just open-sourced the scraping wiki site we created here: https://github.com/fizx/parselets_com

http://selectorgadget.com

https://github.com/fizx/parsley

https://github.com/fizx/parsley-ruby

https://github.com/fizx/pyparsley

https://github.com/fizx/csvget

    > cat hn.let
    {
      "headlines":[{
        "title": ".title a",
        "link": ".title a @href",
        "comments": "match(.subtext a:nth-child(3), '\\d+')",
        "user": ".subtext a:nth-child(2)",
        "score": "match(.subtext span, '\\d+')",
        "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
      }]
    }
    > csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
    > head data/headlines.csv
    comments,title,time,link,score,user
    4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
    67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
    23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham

-----

sc00ter 690 days ago | link

Oh! Selector gadget is wonderful, and I still use it to this day. It's so easy to forget that there are real people behind the tools we use - so fizx and tectonic, I thank you!

-----

christiangenco 689 days ago | link

Seconded - selector gadget is one of the best scraping tools around. You've saved me many a frustrated evening.

-----

tectonic 689 days ago | link

Thanks for the kind words! Maybe this attention will push me to update the code and make a new screencast.

-----

tluyben2 689 days ago | link

Selectorgadget is great. I haven't checked out your creations in real life yet, but did you consider taping selectorgadget to a proxy so you can scrape sites and store the paths you found in one go? That would massively enhance the process imho :) Maybe APIfy can do that, but I hope they put that on GitHub as well; I'm not a great fan of closed source/cloud development tools.

-----

sathish316 690 days ago | link

Do you mind if I use selector gadget in APIfy? I use firebug and firefinder to get selectors, but I can't expect everyone to have the same tools.

-----

fizx 689 days ago | link

Everything mentioned is MIT-licensed, and like many OSS authors, we would love any ideas or code in there to be widely used in virtually any capacity.

Also, if you're in or near SF, I'd be happy to get coffee sometime.

-----

sathish316 680 days ago | link

Thanks for suggesting SelectorGadget. I ended up building an in-browser firebug+firefinder clone to make the app even easier to use. Please check out the Preview webpage for any API.

-----

tectonic 689 days ago | link

Please do use selectorgadget! And if you'd like to push anything back, or chat about parsing in general, send me a message. I have a more advanced branch of SG that can generate better selectors, but I haven't pushed it out yet. It's all in the repo.

-----

tluyben2 689 days ago | link

I was going to suggest here that you do. And open-source your creation on GitHub so we can all improve it.

-----

pg 690 days ago | link

HN does have an API: http://www.hnsearch.com/api

-----

pbiggar 689 days ago | link

I wish HN had an API which could write - the hnsearch one was read-only when I used it. I tried writing a tool for HN (https://github.com/pbiggar/hackerite) that needed to be able to upvote stories, and although hacks existed to make it work, it wasn't a very pleasant experience.

-----

wslh 689 days ago | link

And you can avoid HNSearch API limits with: https://gist.github.com/1360455

-----

sathish316 690 days ago | link

The original intent of this app was not HN. It was smaller datasets like govt, public service and transit sites which generally don't have any APIs.

-----

6ren 689 days ago | link

Seems to be fried. (Popularity is a good sign.)

So, it's basically a web-scraper, but with a JSON API. The API input is limited to a single parameter, that indexes the record to be scraped. The API output is taken from that indexed record, consisting of a set of scraped elements within that record, and presented as JSON, with attributes named as user specified.
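A rough sketch of that index/show model, as I read it from the description above. This is not APIfy's code: the function names, the `number` key attribute, and the sample records are all hypothetical.

```python
import json


def index_response(records):
    """Index call: every scraped record, serialized as a JSON array."""
    return json.dumps(records)


def show_response(records, key_attribute, key):
    """Show call: the single record whose key attribute matches the URL parameter."""
    for record in records:
        # URL parameters arrive as strings, so compare string forms.
        if str(record.get(key_attribute)) == str(key):
            return json.dumps(record)
    return None  # a real service would presumably answer with a 404 here


# Hypothetical scraped records; attribute names are whatever the user chose.
episodes = [
    {"number": 1, "title": "First episode"},
    {"number": 5, "title": "Fifth episode"},
]
```

Here `show_response(episodes, "number", "5")` plays the role of a request like `/api/<name>/5.json`, and `index_response(episodes)` the role of `/api/<name>.json`.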

Although this is limited to a list of renamed records, it could be extended (if needed), and I really like the concept and UI implementation.

Feedback: As someone who has never used CSS, I found it very tricky to even duplicate the tutorial:

* selectors are sensitive to leading and trailing spaces;

* the selectors given in the tute aren't what's needed (and see BTW below);

* I often got "API call failed: Internal Server Error", which indicates a problem with the selector but not what it is, and ATM the service is often "unavailable" :);

* it's slow switching back and forth between "edit" and "test" (why not include testing on the same page, like HN comment edits: textarea + rendered result?);

* when an attribute is removed, it remains in the JSON (code eg http://apify.heroku.com/resources/4fcb26d7a06a160001000024);

* it takes a long time (30s, 1min) to get a result.

I hate to say it, but it's like my experience with Ruby: it takes so much time and effort to get the tool to basically work that I've used up all my enthusiasm/gumption and have none left for the project I had in mind. But much of this is because of the current traffic spike, my ignorance of CSS, and minor polishing/bugs that can be fixed in vers 1.1 - as I said, I really like the idea and UI.

But a deeper question: why a service, instead of a library? It's cross-language, but has an extra dependency (the service), an extra network jump, processing from many users convening at one point. It's interesting to me, because the world seems to be moving towards services, and this would logically include components that formerly would be libraries. Will this happen? What are the pros and cons? Will Amazon etc provide free computation for users of open-source components, analogous to open-source libraries? Interesting.

BTW: minor typo/bug in active URLs in the tute (http://apify.heroku.com/tutorial/create): an extra "s" in "episodess":

  http://apify.heroku.com/api/big_bang_theory_episodess.json
  http://apify.heroku.com/api/big_bang_theory_episodess/5.json

-----

sathish316 689 days ago | link

Thanks for the feedback. Will fix it in a future version.

The service is just an extension of this library:

https://github.com/sathish316/scrapify

The intent of the service is to make mobile apps without a backend/db, like Parse, for read-only APIs.

-----

DanielRibeiro 690 days ago | link

We have had an HN API for a while now: http://api.ihackernews.com/

-----

zainny 690 days ago | link

This API is extremely unreliable and has a great deal of functionality missing as well (commenting, voting). The original developer is also no longer maintaining it. I tried to build a Hacker News app using it some time ago and abandoned the idea very quickly.

Hacker News needs a real API.

-----

DanielRibeiro 690 days ago | link

I have been using it for a few years now. It does give an error quite often (1 out of 10 requests on average), but for what I'm using it for, it is pretty solid (as long as I retry when these errors occur).

The lack of functionality: it had it. The problem is that it not only required user/password, but it was also caught by HN's safety net, which prevents multiple accounts on the same IP from doing a lot of stuff.

So it can work as a library, but not as a server-side API. Therefore he removed it.

For another project of mine [1], I used the Hacker News search API [2], which is really consistent and really powerful, and is maintained by the YC company that does ThriftDB [3].

[1] http://hnwho.com/

[2] http://www.hnsearch.com/api

[3] http://venturebeatprofiles.com/news/view/y-combinator?articl...
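A minimal sketch of the retry-on-error approach mentioned above, assuming the failures are transient. `call_with_retry` and its defaults are illustrative names, not part of any HN API client, and `fetch` stands for whatever function performs the actual request.

```python
import time


def call_with_retry(fetch, attempts=3, delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that raises on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Back off before the next try: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
    raise last_error
```

With a failure rate around 1 in 10, a few attempts are usually enough; the growing delay is there to avoid hammering a server that is already struggling.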

-----

ronnier 689 days ago | link

The reason that it's so unreliable is that my server's IP address gets banned by YC when the requests are too fast, and I have it hosted on a small cheap server. There are ways around this, but it just isn't worth the time or money.

-----

Jd 690 days ago | link

My problem with HackerNews API (having done something like this -- the Hacker News Filter on Github) is that you get throttled after you hit a certain number of HTTP requests and your IP gets banned for a certain amount of time.

So as nice as this is, it simply won't work here for the many people who would like to use near live data on HN.

-----

sathish316 690 days ago | link

The page contents are cached, and the page is re-fetched every hour. In fact, the cache is expired only for the HackerNews API, from the backend. The expiry feature is still not pushed.
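To illustrate the idea (this is not the APIfy code), an hourly page cache could be sketched like this. `TTLCache` and its parameters are hypothetical names, and `fetch` is assumed to be any function that downloads a page.

```python
import time


class TTLCache:
    """Cache fetched pages and re-fetch them once they are older than ttl seconds."""

    def __init__(self, fetch, ttl=3600, clock=time.monotonic):
        self._fetch = fetch    # callable: url -> page contents
        self._ttl = ttl        # 3600 seconds = the one-hour interval
        self._clock = clock    # injectable so expiry can be tested without waiting
        self._entries = {}     # url -> (fetched_at, contents)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]    # still fresh: no request reaches the origin site
        contents = self._fetch(url)
        self._entries[url] = (now, contents)
        return contents
```

However many API calls come in, the origin site then sees at most one request per URL per hour, which is what keeps the scraper under the throttling limit.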

-----

Jd 690 days ago | link

Might be the best you can do but the problem still exists.

-----

hk_kh 689 days ago | link

Proxy your traffic through tor, and you are set.

This is not saying you should do it.

-----

sathish316 690 days ago | link

Hacker News content is expired every hour.

Hacker News Newest links are also available here: http://apify.heroku.com/resources/4fca651b8526fe0001000002

Other APIs are never expired (the Expire feature is still not pushed).

-----

jc4p 690 days ago | link

Is it broken right now? It just says "API call failed: Internal Server Error" when I hit Test API.

There's also a good API which powers my favorite Android HN app over here: http://hndroidapi.appspot.com/

-----

sathish316 690 days ago | link

I just fixed it. Someone removed an attribute by mistake.

-----

altano 690 days ago | link

Can you add support for CORS (http://en.wikipedia.org/wiki/Cross-origin_resource_sharing)?

Can you add support for taking an existing JSON API (rather than scraping HTML)? This is useful for APIs that are accessible with neither CORS nor JSONP, APIs that are provided by incompetent mental midgets who don't answer emails or participate in their Google Group (cough MBTA cough).

-----

sathish316 690 days ago | link

If I understood correctly, JSONP supports only GET but CORS supports all HTTP methods. APIfy can only provide GET requests (Index and Show) over a static HTML page.

-----

cheeaun 690 days ago | link

That's one of the advantages of CORS. The other advantage is it allows cross-domain XMLHttpRequest requests in the browser, which will have better error handling than JSONP.
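For illustration, basic CORS support on the server side mostly comes down to sending an `Access-Control-Allow-Origin` response header. Below is a minimal sketch as Python WSGI middleware; APIfy itself is built on a Ruby gem, so `add_cors` and the wrapped app are hypothetical, and a full implementation would also handle preflight `OPTIONS` requests.

```python
def add_cors(app, allowed_origin="*"):
    """Wrap a WSGI app so every response carries a CORS allow-origin header."""

    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Append the header that lets browser XMLHttpRequest calls
            # from other origins read the response.
            headers = list(headers) + [("Access-Control-Allow-Origin", allowed_origin)]
            return start_response(status, headers, exc_info)

        return app(environ, cors_start_response)

    return wrapped
```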

-----

roycyang 690 days ago | link

Looks interesting. I just tried to scrape a sample API but got an error with no further information on why it was broken:

http://apify.heroku.com/resources/4fca83088526fe000100011a/e...

-----

roycyang 690 days ago | link

After playing around with it some more, it's working. Question, are you going to introduce REGEX or any other rules or even some helper functions to further process the API? That would allow us to drill down even further. It is really great for bootstrapping and getting some live data quickly. Kudos!

-----

sathish316 690 days ago | link

Regex is not yet pushed to the webapp. Regex support is available in the gem used by APIfy https://github.com/sathish316/scrapify

-----

gildas 689 days ago | link

Does not work with twitter [1].

"API call failed: Internal Server Error"

[1] http://apify.heroku.com/resources/4fcb23c5a06a160001000014

-----

6ren 689 days ago | link

You need to specify a unique index for the attribute - you specified it as "tweet", but not how to scrape it. The error messages are not informative at this stage; it might be something else too.

I tried to fix your e.g. but couldn't get it to work (I tried //span@data-time (xpath) - what is the unique index of a tweet?)

-----

sathish316 689 days ago | link

If the HTML content is loaded using Ajax, you have to give the URL of the Ajax request that returns the HTML. See http://apify.heroku.com/resources/4fc2a234ae684d0001000008/e... for example.

It won't work for #! URLs. Twitter has a nice streaming API for search if that's what you're looking for.

-----

gildas 689 days ago | link

* Ajax responses are often JSON content. How do I apply a CSS/XPath selector on it?

* I guess #! URLs could be transformed into _escaped_fragment_ URLs [1]

* I know twitter has an API. It was just an example. Maybe this example [2] would be more relevant (content could also be fetched with an _escaped_fragment_ URL).

[1] https://developers.google.com/webmasters/ajax-crawling/docs/...

[2] http://apify.heroku.com/resources/4fcb2f7ba06a160001000044
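A sketch of the #! to `_escaped_fragment_` rewrite suggested above, as I understand the AJAX crawling scheme in [1]. The function name is made up, and the exact set of characters that must be escaped should be checked against that document.

```python
from urllib.parse import quote


def escaped_fragment_url(url):
    """Rewrite a #! URL into its _escaped_fragment_ form.

    URLs without a "#!" fragment are returned unchanged.
    """
    base, marker, fragment = url.partition("#!")
    if not marker:
        return url
    # Append to an existing query string if there is one.
    separator = "&" if "?" in base else "?"
    # "=" and "/" stay readable; quote() still percent-encodes characters
    # such as "#", "%", "&", "+" and spaces.
    return base + separator + "_escaped_fragment_=" + quote(fragment, safe="=/")
```

For example, `http://example.com/#!/user/42` becomes `http://example.com/?_escaped_fragment_=/user/42`, which only helps if the site actually serves an HTML snapshot at that URL.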

-----

zafriedman 690 days ago | link

This might be a stupid question and perhaps I didn't look hard enough on your website, but is this open source? I didn't see a GitHub link anywhere. I'm specifically curious as to how you routed Noko or whatever scraping library you're using to do its thing.

-----

sathish316 690 days ago | link

The webapp is not open source, but the underlying library Scrapify is. It internally uses Nokogiri, if that's what you're asking.

https://github.com/sathish316/scrapify#readme

-----

sathish316 690 days ago | link

If you're creating APIs, please add Attributes. To get quick help on CSS or XPath selectors for attributes, press c or x in the site.

-----

premasagar 689 days ago | link

Did anyone ever make an API that could read a user's upvoted/saved articles from HN? It would require some kind of login credentials, as the data is not public.

-----

sktrdie 688 days ago | link

Simply record the HTTP requests happening when you access that page. You probably have to pass the Cookie along through the header for the credentials to work.

But it's just HTTP, so it's basically already an API.
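A minimal sketch of replaying such a recorded request with the session cookie attached. The `/saved?id=...` path and the `user` cookie name are assumptions to be checked against what the browser actually sends; the function names are made up.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_saved_request(username, cookie_value):
    """Build (but do not send) a request for a page that needs a login cookie.

    The URL path and the "user" cookie name are assumptions; copy the real
    ones from the recorded browser request.
    """
    url = "http://news.ycombinator.com/saved?id=" + quote(username)
    return Request(url, headers={"Cookie": "user=" + cookie_value})


def fetch_saved_html(username, cookie_value):
    """Send the request and return the page HTML for a scraper to parse."""
    request = build_saved_request(username, cookie_value)
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

The cookie is equivalent to a password for as long as the session lasts, so it should not be handed to a third-party service lightly.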

-----

Trindaz 690 days ago | link

Is this related to http://www.apifydoc.com/?

-----

sathish316 690 days ago | link

It's not related. APIfy is a web frontend for a small library I've written to scrape webpages:

https://github.com/sathish316/scrapify

The advantage of the app over the library is caching and automatic expiry

-----

temphn 689 days ago | link

Does this work for sites that are behind logins? Didn't see anything related to authentication but may have missed it.

-----

sathish316 689 days ago | link

There is no authentication mechanism right now. Only public sites are supported

-----

sinzone 690 days ago | link

Would be cool if all the APIs created via APIfy were automatically listed on Mashape.com

-----

sathish316 690 days ago | link

Somebody created an API in APIfy to list all APIfy APIs. Recursive and awesome. I just need to fix it and make it work.

-----

sinzone 690 days ago | link

cool - let me know when it's ready

-----



