Hah! tectonic and I applied to YC with almost exactly this in 2009?!
We went as far as building a browser-based IDE-like environment for generating these, and a language called parsley for expressing the scrapes. If you're interested in this, you could check out some of our related open source libraries:
Selectorgadget is great. I haven't tried your creations in real life, but did you consider taping Selectorgadget to a proxy, so you can scrape sites and store the paths you found in one go? That would massively enhance the process imho :) Maybe Apify can do that, but I hope they put that on GitHub as well; I'm not a great fan of closed-source/cloud development tools.
Please do use selectorgadget! And if you'd like to push anything back, or chat about parsing in general, send me a message. I have a more advanced branch of SG that can generate better selectors, but I haven't pushed it out yet. It's all in the repo.
I wish HN had an API that could write - the hnsearch one was read-only when I used it. I tried writing a tool for HN (https://github.com/pbiggar/hackerite) that needed to be able to upvote stories, and although hacks existed to make it work, it wasn't a very pleasant experience.
My problem with HackerNews API (having done something like this -- the Hacker News Filter on Github) is that you get throttled after you hit a certain number of HTTP requests and your IP gets banned for a certain amount of time.
So as nice as this is, it simply won't work here for the many people who would like to use near live data on HN.
This API is extremely unreliable and has a great deal of functionality missing as well (commenting, voting). The original developer is also no longer maintaining it. I tried to build a Hacker News app using it some time ago and abandoned the idea very quickly.
I have been using it for a few years now. It does give an error quite often (1 out of 10 requests on average), but for what I'm using it for, it is pretty solid (as long as I retry when these errors occur).
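The retry approach mentioned above could look roughly like the sketch below. It is only an illustration: `fetch_with_retry`, the attempt count, and the backoff delays are assumptions, not part of the API being discussed, and `fetch` stands for whatever function performs the actual request.

```python
import time


def fetch_with_retry(fetch, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that raises on failure. The name,
    the attempt count, and the delays are illustrative assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Back off exponentially: base_delay, then 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

With a failure rate around 1 in 10, a handful of attempts makes a complete failure unlikely, and the growing delay avoids hammering a server that is already struggling.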
As for the lack of functionality: it used to have it. The problem is that it not only required a user/password, but it also got caught in HN's safety net, which prevents multiple accounts on the same IP from doing a lot of stuff.
So it can work as a library, but not as a server-side API. Therefore he removed it.
For another project of mine, I used the Hacker News search API, which is really consistent and really powerful, and is maintained by the YC company that does ThriftDB.
The reason that it's so unreliable is that my server's IP address gets banned by YC when the requests are too fast, and I have it hosted on a small cheap server. There are ways around this, but it just isn't worth the time or money.
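One of the cheaper ways around a ban like this is to space requests out on the client side. The sketch below is a minimal version of that idea; the `Throttle` class and its parameters are hypothetical, and the safe interval for HN is not known from this thread, so the value would have to be found by experiment.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before every outgoing request keeps the request rate below one per `min_interval` seconds, at the cost of slower responses for the API's own users.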
Can you add support for taking an existing JSON API (rather than scraping HTML)? This is useful for APIs that are accessible with neither CORS nor JSONP, APIs that are provided by incompetent mental midgets who don't answer emails or participate in their Google Group (cough MBTA cough).
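To make the request concrete: a service that re-serves an upstream JSON payload would mostly need to add a CORS header or wrap the body in a JSONP callback. The sketch below shows only that step. `proxy_response` is a hypothetical helper, not something Apify is known to provide, and fetching the upstream body is left out.

```python
import json
import re

# Conservative check for a JavaScript identifier used as a JSONP callback.
_CALLBACK = re.compile(r"[A-Za-z_$][\w$]*")


def proxy_response(upstream_body, callback=None):
    """Return (headers, body) for re-serving an upstream JSON payload.

    Without a callback the payload is served as JSON with a permissive
    CORS header; with one, it is wrapped as JSONP.
    """
    payload = json.loads(upstream_body)  # fail early if upstream is not JSON
    body = json.dumps(payload)
    if callback is None:
        headers = {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*",
        }
        return headers, body
    if not _CALLBACK.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return {"Content-Type": "application/javascript"}, f"{callback}({body});"
```

Validating the callback name matters because it is echoed into a script that the caller's browser will execute.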
So, it's basically a web-scraper, but with a JSON API. The API input is limited to a single parameter, that indexes the record to be scraped. The API output is taken from that indexed record, consisting of a set of scraped elements within that record, and presented as JSON, with attributes named as user specified.
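As a rough sketch of the model described above (one index parameter in, one record of renamed elements out as JSON), something like the following would do. It is not the service's implementation: the sample markup, `scrape_record`, and the field names are made up, and it uses the limited XPath subset from Python's standard library in place of CSS selectors so that it runs without third-party packages.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page; real HTML would need a real HTML parser.
PAGE = """
<html><body>
  <div class="story"><a class="title">First post</a><span class="score">10</span></div>
  <div class="story"><a class="title">Second post</a><span class="score">25</span></div>
</body></html>
""".strip()

# User-chosen attribute name -> path to the element inside one record.
FIELDS = {"headline": "a[@class='title']", "points": "span[@class='score']"}


def scrape_record(markup, record_path, fields, index):
    """Return the index-th record as JSON, using the user's attribute names."""
    root = ET.fromstring(markup)
    record = root.findall(record_path)[index]
    result = {}
    for name, path in fields.items():
        node = record.find(path)
        result[name] = node.text if node is not None else None
    return json.dumps(result)


print(scrape_record(PAGE, ".//div[@class='story']", FIELDS, 1))
```

The single `index` argument plays the role of the API's one input parameter, and the keys of `FIELDS` are the attribute names the user specified.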
Although this is limited to a list of renamed records, it could be extended (if needed), and I really like the concept and UI implementation. Feedback: as someone who has never used CSS, I found it very tricky to even duplicate the tutorial. Selectors are sensitive to leading and trailing spaces. The selectors given in the tutorial aren't what's needed (and see BTW below). I often got "API call failed: Internal Server Error", which indicates a problem with the selector but not what it is, and at the moment the service is often "unavailable" :). It's slow switching back and forth between "edit" and "test" (why not include testing on the same page, like HN comment edits: textarea + rendered result?). When an attribute is removed, it remains in the JSON (code example: http://apify.heroku.com/resources/4fcb26d7a06a160001000024). It takes a long time (30s, 1min) to get a result. I hate to say it, but it's like my experience with Ruby: it takes so much time and effort to get the tool to basically work that I've used up all my enthusiasm/gumption and have none left for the project I had in mind. But much of this is because of the current traffic spike, my ignorance of CSS, and minor polishing/bugs that can be fixed in version 1.1. As I said, I really like the idea and UI.
But a deeper question: why a service, instead of a library? It's cross-language, but has an extra dependency (the service), an extra network jump, processing from many users convening at one point. It's interesting to me, because the world seems to be moving towards services, and this would logically include components that formerly would be libraries. Will this happen? What are the pros and cons? Will Amazon etc provide free computation for users of open-source components, analogous to open-source libraries? Interesting.
After playing around with it some more, it's working. Question: are you going to introduce regex or other rules, or even some helper functions, to further process the API output? That would allow us to drill down even further. It is really great for bootstrapping and getting some live data quickly. Kudos!
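For illustration, regex post-processing of this kind could be as small as the sketch below. The helper names, the sample record, and the rule format are all assumptions; nothing here reflects what the service actually offers.

```python
import re


def extract(pattern, text, default=None):
    """Return the first capture group of pattern in text, or default."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


def postprocess(record, rules):
    """Apply per-attribute regex rules to a scraped record (a dict of strings)."""
    cleaned = dict(record)
    for name, pattern in rules.items():
        if cleaned.get(name) is not None:
            cleaned[name] = extract(pattern, cleaned[name])
    return cleaned


# Hypothetical scraped values and the rules that trim them down.
raw = {"points": "25 points", "age": "posted 3 hours ago", "title": "Show HN"}
rules = {"points": r"(\d+)", "age": r"(\d+ \w+) ago"}
print(postprocess(raw, rules))
```

Attributes without a rule pass through unchanged, so rules can be added one field at a time.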
This might be a stupid question and perhaps I didn't look hard enough on your website, but is this open source? I didn't see a GitHub link anywhere. I'm specifically curious as to how you routed Noko or whatever scraping library you're using to do its thing.