
HackerNews API: What if HN does not have API? Make API on the fly with APIfy - sathish316
http://apify.heroku.com/resources/4fca535156983f0001000002
======
fizx
Hah! tectonic and I applied to YC with almost exactly this in 2009?!

We went as far as building a browser-based IDE-like environment for generating
these, and a language called parsley for expressing the scrapes. If you're
interested in this, you could check out some of our related open source
libraries:

Edit: I just open-sourced the scraping wiki site we created here:
<https://github.com/fizx/parselets_com>

<http://selectorgadget.com>

<https://github.com/fizx/parsley>

<https://github.com/fizx/parsley-ruby>

<https://github.com/fizx/pyparsley>

<https://github.com/fizx/csvget>

    
    
        > cat hn.let
        {
          "headlines":[{
            "title": ".title a",
            "link": ".title a @href",
            "comments": "match(.subtext a:nth-child(3), '\\d+')",
            "user": ".subtext a:nth-child(2)",
            "score": "match(.subtext span, '\\d+')",
            "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
          }]
        }
        > csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
        > head data/headlines.csv
        comments,title,time,link,score,user
        4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
        67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
        23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham

~~~
sathish316
Do you mind if I use SelectorGadget in APIfy? I use Firebug and FireFinder to
get selectors, but I can't expect everyone to have the same tools.

~~~
fizx
Everything mentioned is MIT-licensed, and like many OSS authors, we would love
any ideas or code in there to be widely used in virtually any capacity.

Also, if you're in or near SF, I'd be happy to get coffee sometime.

~~~
sathish316
Thanks for suggesting SelectorGadget. I ended up building an in-browser
Firebug+FireFinder clone to make the app even easier to use. Please check
out the "Preview webpage" feature for any API.

------
pg
HN does have an API: <http://www.hnsearch.com/api>

~~~
pbiggar
I wish HN had an API that allowed writes; the hnsearch one was read-only
when I used it. I tried writing a tool for HN
(<https://github.com/pbiggar/hackerite>) that needed to be able to upvote
stories, and although hacks existed to make it work, it wasn't a very pleasant
experience.

------
sathish316
Hacker News content expires every hour.

Hacker News "Newest" links are also available here:
<http://apify.heroku.com/resources/4fca651b8526fe0001000002>

Other APIs never expire (the expiry feature hasn't been pushed yet).

------
Jd
My problem with a HackerNews API (having done something like this -- the Hacker
News Filter on GitHub) is that you get throttled after you hit a certain
number of HTTP requests, and your IP gets banned for a certain amount of time.

So as nice as this is, it simply won't work here for the many people who would
like to use near live data on HN.

~~~
sathish316
The page contents are cached, and the page is re-fetched every hour. In fact,
the cache is expired from the backend only for the HackerNews API; the expiry
feature hasn't been pushed yet.
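The caching scheme described above is simple to sketch in Ruby (illustrative only; `HourlyCache` is a hypothetical name, not APIfy's actual code):

```ruby
# Minimal time-based cache: an entry is reused until it is older than TTL,
# after which the block (the actual page fetch) runs again.
class HourlyCache
  TTL = 3600 # one hour, matching the expiry described above

  def initialize
    @store = {}
  end

  # Returns the cached value for key, or re-runs the block when the
  # entry is missing or stale.
  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && Time.now - entry[:at] < TTL

    value = yield
    @store[key] = { value: value, at: Time.now }
    value
  end
end
```

On a cache hit the block is skipped entirely, which is what keeps a service like this from hitting HN more than once an hour.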

~~~
Jd
Might be the best you can do but the problem still exists.

~~~
hk_kh
Proxy your traffic through Tor, and you're set.

(Not that I'm saying you should do it.)

------
DanielRibeiro
We have had an HN API for a while now: <http://api.ihackernews.com/>

~~~
zainny
This API is extremely unreliable and is missing a great deal of functionality
as well (commenting, voting). The original developer is also no longer
maintaining it. I tried to build a Hacker News app using it some time ago and
abandoned the idea very quickly.

Hacker News needs a real API.

~~~
DanielRibeiro
I have been using it for a few years now. It does give an error quite often (1
out of 10 requests on average), but for what I'm using it for, it is pretty
solid (as long as I retry when those errors occur).

On the lack of functionality: it used to have it. The problem is that it not
only required a username/password, but it also got caught in HN's safety net,
which prevents multiple accounts from the same IP from doing a lot of stuff.

So it can work as a library, but not as a server-side API. That's why he
removed it.

For another project of mine[1], I used the Hacker News search API[2], which is
really consistent and really powerful, and is maintained by the YC
company that does ThriftDB[3].

[1]<http://hnwho.com/>

[2] <http://www.hnsearch.com/api>

[3]
[http://venturebeatprofiles.com/news/view/y-combinator?articl...](http://venturebeatprofiles.com/news/view/y-combinator?article=398456)

------
altano
Can you add support for CORS (<http://en.wikipedia.org/wiki/Cross-
origin_resource_sharing>)?

Can you add support for wrapping an existing JSON API (rather than scraping
HTML)? This is useful for APIs that are accessible with neither CORS nor
JSONP, APIs that are provided by incompetent mental midgets who don't answer
emails or participate in their Google Group ( _cough_ MBTA _cough_ ).

~~~
sathish316
If I understood correctly, JSONP supports only GET, but CORS supports all HTTP
methods. APIfy can only provide GET requests (index and show) over a static
HTML page.

~~~
cheeaun
That's one of the advantages of CORS. The other advantage is it allows cross-
domain XMLHttpRequest requests in the browser, which will have better error
handling than JSONP.
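The distinction is easy to see when you construct the two response styles by hand (a sketch; the helper names are made up for illustration, not part of any library):

```ruby
require 'json'

# JSONP: the JSON payload is wrapped in a caller-supplied callback and
# served as a <script> tag, which is why it is limited to GET requests.
def jsonp_body(data, callback)
  "#{callback}(#{data.to_json})"
end

# CORS: the payload stays plain JSON; the server just adds a header
# telling the browser which origins may read it via XMLHttpRequest.
def cors_headers(origin = '*')
  {
    'Content-Type' => 'application/json',
    'Access-Control-Allow-Origin' => origin
  }
end
```

With CORS the browser makes a normal XHR and can inspect status codes and errors; with JSONP a failed script load gives you almost nothing to work with.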

------
6ren
Seems to be fried. (Popularity is a good sign.)

So, it's basically a web-scraper, but with a JSON API. The API input is
limited to a single parameter, that indexes the record to be scraped. The API
output is taken from that indexed record, consisting of a set of scraped
elements within that record, and presented as JSON, with attributes named as
user specified.

Although this is limited to a list of renamed records, it could be extended
(if needed), and I really like the concept and UI implementation.

Feedback: as someone who has never used CSS, I found it very tricky to even
duplicate the tutorial:

* selectors are sensitive to leading and trailing spaces;

* the selectors given in the tutorial aren't what's needed (and see BTW below);

* I often got "API call failed: Internal Server Error", which indicates a problem with the selector but not what it is, and ATM the service is often "unavailable" :);

* it's slow switching back and forth between "edit" and "test" (why not include testing on the same page, like HN comment edits: textarea + rendered result?);

* when an attribute is removed, it remains in the JSON (e.g. <http://apify.heroku.com/resources/4fcb26d7a06a160001000024>);

* it takes a long time (30s, 1min) to get a result.

I hate to say it, but it's like my experience with Ruby: it takes so much time
and effort to get the tool to basically work that I've used up all my
enthusiasm/gumption and have none left for the project I had in mind. But much
of this is because of the current traffic spike, my ignorance of CSS, and
minor polishing/bugs that can be fixed in version 1.1. As I said, I really
like the idea and UI.

But a deeper question: why a service, instead of a library? It's cross-
language, but has an extra dependency (the service), an extra network jump,
processing from many users convening at one point. It's interesting to me,
because the world seems to be moving towards services, and this would
logically include _components that formerly would be libraries_. Will this
happen? What are the pros and cons? Will Amazon etc provide free computation
for users of open-source components, analogous to open-source libraries?
Interesting.

BTW: minor typo/bug in the active URLs in the tutorial
(<http://apify.heroku.com/tutorial/create>): an extra "s" in "episodess":

    
    
      http://apify.heroku.com/api/big_bang_theory_episodess.json
      http://apify.heroku.com/api/big_bang_theory_episodess/5.json

~~~
sathish316
Thanks for the feedback. Will fix it in a future version.

The service is just an extension of this library:

<https://github.com/sathish316/scrapify>

The intent of the service is to make mobile apps possible without a
backend/db, like Parse, for read-only APIs.

------
roycyang
Looks interesting. I just tried to scrape a sample API but got an error with no
further information on why it was broken:

[http://apify.heroku.com/resources/4fca83088526fe000100011a/e...](http://apify.heroku.com/resources/4fca83088526fe000100011a/edit)

~~~
roycyang
After playing around with it some more, it's working. Question: are you going
to introduce regex or other rules, or even some helper functions, to further
process the API? That would allow us to drill down even further. It is really
great for bootstrapping and getting some live data quickly. Kudos!

~~~
sathish316
Regex is not yet pushed to the webapp. Regex support is available in the gem
used by APIfy <https://github.com/sathish316/scrapify>
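A `match`-style regex filter of the kind parsley uses earlier in the thread (e.g. `match(.subtext span, '\d+')`) boils down to something like this (a sketch, not Scrapify's actual API):

```ruby
# Apply a regex pattern to already-scraped text and keep the first match,
# e.g. pulling the point count out of "41 points".
def match_filter(text, pattern)
  m = text.match(Regexp.new(pattern))
  m && m[0]
end

match_filter('41 points', '\d+')            # => "41"
match_filter('2 hours ago', '\d+\s+\w+\s+ago') # => "2 hours ago"
```

A filter like this runs as a post-processing step after the CSS/XPath selector has extracted the raw text.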

------
jc4p
Is it broken right now? It just says "API call failed: Internal Server Error"
when I hit Test API.

There's also a good API which powers my favorite Android HN app over here:
<http://hndroidapi.appspot.com/>

~~~
sathish316
I just fixed it. Someone removed an attribute by mistake.

------
zafriedman
This might be a stupid question and perhaps I didn't look hard enough on your
website, but is this open source? I didn't see a GitHub link anywhere. I'm
specifically curious as to how you routed Noko or whatever scraping library
you're using to do its thing.

~~~
sathish316
The webapp is not open source, but the underlying library, Scrapify, is. It
internally uses Nokogiri, if that's what your question is about.

<https://github.com/sathish316/scrapify#readme>

------
gildas
Does not work with Twitter [1].

"API call failed: Internal Server Error"

[1] <http://apify.heroku.com/resources/4fcb23c5a06a160001000014>

~~~
sathish316
If the HTML content is loaded using Ajax, you have to give the URL of the Ajax
request's HTML. See
[http://apify.heroku.com/resources/4fc2a234ae684d0001000008/e...](http://apify.heroku.com/resources/4fc2a234ae684d0001000008/edit)
for an example.

It won't work for #! URLs. Twitter has a nice streaming API for search if
that's what you're looking for.

~~~
gildas
* Ajax responses are often JSON content. How do I apply a CSS/XPath selector on it?

* I guess #! URLs could be transformed into _escaped_fragment_ URLs [1]

* I know twitter has an API. It was just an example. Maybe this example [2] would be more relevant (content could also be fetched with an _escaped_fragment_ URL).

[1] [https://developers.google.com/webmasters/ajax-
crawling/docs/...](https://developers.google.com/webmasters/ajax-
crawling/docs/specification)

[2] <http://apify.heroku.com/resources/4fcb2f7ba06a160001000044>
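The `#!` rewrite in [1] is mechanical; a sketch in Ruby (the helper name and the example URL are just for illustration):

```ruby
require 'uri'

# Rewrite a #! ("hashbang") URL into the crawlable form from Google's
# Ajax-crawling specification: the fragment moves into a query parameter.
def escaped_fragment_url(url)
  base, fragment = url.split('#!', 2)
  return url unless fragment # no hashbang: leave the URL alone
  sep = base.include?('?') ? '&' : '?'
  "#{base}#{sep}_escaped_fragment_=#{URI.encode_www_form_component(fragment)}"
end

escaped_fragment_url('http://twitter.com/#!/jack')
# => "http://twitter.com/?_escaped_fragment_=%2Fjack"
```

The server is then expected to answer the `_escaped_fragment_` URL with a static HTML snapshot, which is exactly the kind of page a scraper can handle.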

------
premasagar
Did anyone ever make an API that could read a user's upvoted/saved articles
from HN? It would require some kind of login credentials, as the data is not
public.

~~~
sktrdie
Simply record the HTTP requests that happen when you access that page. You
probably have to pass the cookie along in the header for the credentials
to work.

But it's just HTTP, so it's basically already an API.
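In Ruby that amounts to replaying the browser's GET with your session cookie attached (a sketch; the cookie value and username are placeholders you'd copy from your own browser, not real credentials):

```ruby
require 'net/http'
require 'uri'

# Build the same GET the browser makes for the "saved stories" page,
# carrying the session cookie so HN treats the request as logged in.
uri = URI('https://news.ycombinator.com/saved?id=someuser')
req = Net::HTTP::Get.new(uri)
req['Cookie'] = 'user=PASTE_YOUR_SESSION_COOKIE_HERE' # copied from the browser

# To actually send it (commented out here):
# res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
```

The response is the same HTML your browser sees, so the usual CSS-selector scraping applies from there.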

------
sathish316
If you're creating APIs, please add attributes. To get quick help with CSS or
XPath selectors for attributes, press 'c' or 'x' on the site.

------
temphn
Does this work for sites that are behind logins? Didn't see anything related
to authentication but may have missed it.

~~~
sathish316
There is no authentication mechanism right now. Only public sites are
supported.

------
Trindaz
Is this related to <http://www.apifydoc.com/>?

~~~
sathish316
It's not related. APIfy is a web frontend for a small library I've written to
scrape webpages:

<https://github.com/sathish316/scrapify>

The advantage of the app over the library is caching and automatic expiry.

------
sinzone
It would be cool if all the APIs created via APIfy were automatically listed
on Mashape.com.

~~~
sathish316
Somebody created an API in APIfy to list all APIfy APIs. Recursive and
awesome. I just need to fix it and make it work.

~~~
sinzone
cool - let me know when it's ready

