I tried Kimono, but it cannot authenticate into the sites I want to pull the data from...
Just grabbed import.io - will see if it can log into sites and grab data from services I am already paying thousands per month for.
EDIT:
To add some context: I pay about $3,000 per month for some monitoring services which do not have any real reporting mechanisms. So for my daily and weekly reports, I have to compile everything manually, screenshot a ton of things, compose an email, and send it.
I want to configure a scraper to automatically grab screenshots of the things I want regularly and email them.
I want a script that will grab many different pieces of data (visual graphs, typically) and put them all into one email.
I am working with my monitoring vendors to get them to add reporting... but until that happens, I am tired of spending a couple of hours per week screen-capping graphs.
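A workflow like the one described above can be sketched in a few lines of stdlib Python. Everything here is illustrative: the screenshot paths, addresses, and subject line are placeholders, and capturing the graphs themselves would still need a headless browser or the scraping service.

```python
from email.message import EmailMessage
from pathlib import Path

def build_report_email(screenshot_paths, sender, recipient,
                       subject="Weekly monitoring report"):
    """Bundle a set of captured graph images into a single email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Attached: this week's monitoring graphs.")
    for path in screenshot_paths:
        data = Path(path).read_bytes()
        # Attach each captured graph as an inline PNG attachment.
        msg.add_attachment(data, maintype="image", subtype="png",
                           filename=Path(path).name)
    return msg
```

Actually sending the report is then one more call, e.g. `smtplib.SMTP(host).send_message(msg)`, which a cron job could run weekly.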
I'm evaluating these to augment a system I'm building on top of casper. This is the first I've seen of this one, but right out of the gate I think I prefer Kimono.
But then his data source is gone, and what he was doing is pointless. Losing the processor for a data source while the data source itself is still available is frustrating.
Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.
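The same "trivial one-off scraper" idea looks roughly like this in stdlib Python (a hypothetical analogue, not goquery itself; goquery is a Go library): subclass the parser, collect what you need, ignore the rest.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect (href, text) pairs from anchor tags -- the kind of
    quick one-off extraction goquery makes trivial in Go."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

scraper = LinkScraper()
scraper.feed('<ul><li><a href="/a">First</a></li>'
             '<li><a href="/b">Second</a></li></ul>')
# scraper.links is now [("/a", "First"), ("/b", "Second")]
```

A CSS-selector library (like goquery's jQuery-style API) makes this even shorter, but the stdlib version shows how little code a custom scraper really needs.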
(sorry for the self-promoting comment, but this is quite on topic)
There are a bunch of comments about rolling your own scraper instead of relying upon a possibly unreliable SaaS app.
That makes me think -- would it be viable to run a service that, instead of running the scraping on their own servers, simply gave you a custom binary to run?
Assuming that you trusted the executable, you would never have to worry about the company failing. It'd just be a one-time fee, and yours to use in perpetuity. Presumably updates would be free.
While their real-time Extractors aren't quite as quick as doing it yourself, we've found them to be particularly useful for sites requiring JavaScript and/or cookies to use.
It's also worth mentioning that it's quick to get started. You can start playing around with real data without having to dig into a site's URL structure, and then write your own scraper later if needed.
Isn't it illegal to scrape without permission? How would import.io handle the case where a large site comes back with legal threats because one of their users has scraped the wrong site? Can they claim non-responsibility?
Also, what happens when sites start blocking their IPs due to repeated scraping - or is this unlikely to happen?
They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an interesting concept; they've come far since their initial presentation, and while the app has its quirks, I have come to use it occasionally for some tasks.
I hope that they'll manage to monetize this properly - I don't see why I should pay for a scraping rule when I can just write the scraper myself, which doesn't cost me much more time.
What kind of legitimate uses are there for something like this? This is not a sarcastic question. It seems like an obvious spam magnet, but if people are using it legitimately, wouldn't their sources already be providing an API or RSS feed?
I have my own use case for it, and it probably mirrors others'. I run my own blog and thus have ads and affiliate links there. The thing is, as good as Google AdSense is, it's shitty for my site and my topic (Web Dev).
What am I left with? Great affiliates like Team Treehouse, Lynda.com, framework themes, and Udemy. The problem is that none of those offer any kind of a good API. All they have is a link and possibly an image that they provide.
By using Kimono, I can scrape (but I don't) all of Udemy's programs, categorize them with custom categories, build a full-text search engine around it, and serve relevant ads per post. For instance, my "Best Bootstrap Themes" post would yield a "Learn Bootstrap" Udemy course and an on-the-fly-but-cached image for it, thus serving relevant ads to my users.
Same goes for Lynda. If someone lands on "Why C# is a great language to learn" (one of my unreleased articles), my custom API built on top of scraped data could serve them an "ASP.NET Essentials" course.
So why use something like this for framework themes? Take Wrapbootstrap.com: they have a great affiliate program. Using Kimono, you can easily get daily refreshes of their main page, which usually has sale-priced themes, featured themes, and new rising themes. This way, you can serve users an ad with up-to-date prices and themes that are hot right now.
What about non-ad uses? You can create custom search weighted according to YOUR metrics, build your own marketplace front end, and aggregate several sources to serve users better content.
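The relevance-matching step described above could be prototyped with something as crude as a keyword-overlap score; the course titles and the scoring function here are made-up illustrations, standing in for a real full-text search engine over the scraped data.

```python
def pick_course(post_title, courses):
    """Return the scraped course whose title shares the most words with
    the blog post title -- a crude stand-in for full-text search."""
    post_words = set(post_title.lower().split())

    def overlap(course):
        return len(post_words & set(course.lower().split()))

    best = max(courses, key=overlap)
    # No shared words at all means no relevant ad; fall back to nothing.
    return best if overlap(best) > 0 else None

courses = ["Learn Bootstrap", "ASP.NET Essentials", "Python for Beginners"]
pick_course("Best Bootstrap Themes", courses)  # -> "Learn Bootstrap"
```

Word overlap obviously won't connect a C# post to an ASP.NET course; that is exactly where custom categories or a proper search index over the scraped catalogue would earn their keep.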
We use scraping to gather product prices from online shops for a price comparison site. We have permission from the sites, which can't be bothered to provide us with a price list in any form other than their public website. Legal and necessary - so there is a market for this, I believe; I am just not sure about its size.
Doing something like http://openstates.org/ is a perfect example. State government data is shitty most of the time and doesn't have a public API you can query, so Open States runs 50+ scrapers to get the data and normalize it.
I suspect the real, top-secret business behind import.io is in training a system to crawl the web and recognize structured data, and/or in gathering over time a very rich crowd-sourced database of structured data.