Web Scraping as a Service (tubes.io)
9 points by aarondf on July 20, 2013 | 8 comments



What's the legality behind web scraping? Can you sell access to data that isn't yours?


A service like this can be hit with a trespass-to-chattels tort. If you have a lot of servers scraping a site, the site can sue you for depriving it of the use of its servers. I simply don't see how something like this could ever be launched as a business. However, it could exist in another form.

Last year I spent a bunch of time on a "BitTorrent for web scraping" project at work, which we ended up scrapping after we pivoted. A p2p approach with a tit-for-tat strategy and headless WebKit scraping was the only long-term way we could see to scrape legally, or at least semi-legally. Our intent was to simply open-source the project and let the rest of the world run with it. All the technology is there to democratize the proprietary data industry, as has happened with music and movies. It's only a matter of time before a lot of that data is freed from silos like Twitter and Craigslist.

The strategy was the following:

(1) Tit-for-tat scraping, where you intentionally avoid scraping websites likely to be in your own jurisdiction (judged by subnet): my machines scrape sites in Europe while your machines scrape sites in the US. You also avoid scraping data from the services whose data you yourself want (a sketch of this job assignment follows the list).

(2) Headless WebKit with various methods of avoiding browser fingerprinting. The key is to look as much like legitimate traffic as possible, so that any attempt to block a scraper is likely to block real users too. That is the most powerful deterrent to blocking.

(3) A 1-to-1 correspondence between the templates a site uses (Handlebars, EJS, server-side templates, etc.) and a database of snippets that describe exactly how to convert those templates back into JSON data. Any service that ships its template definitions as part of a client-side webapp would be trivial to convert into a scraping formula (see the recipe sketch after this list).

(4) All the "recipes" for sites and their templates would probably be handled centrally at first, the same way torrents were. When a recipe breaks because a site has been updated, the system would know, because it periodically checks for good data against a known corpus of previously scraped data. Once a broken recipe has been identified, someone could go fix it (conceivably this could be automated as well, since you have a known corpus of data and could detect how an XPath selector has changed).

(5) A distributed hash table pointing to caches of previously requested data, so that a swarm targeting a particular service doesn't accidentally create a DDoS. Each datum should be scraped only once per interval; after the data expires, it can be requested again (see the cache sketch after the list).
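
To make (1) concrete, here is a minimal sketch of jurisdiction-aware job assignment in Python. The geoip2 library and its calls are real; the job structure and function names are invented for illustration:

    # Tit-for-tat job assignment: never hand a peer a scrape job whose
    # target resolves to the peer's own country. Assumes the (real)
    # geoip2 library with a local GeoLite2 country database; the job
    # dicts are hypothetical.
    import socket
    import geoip2.database

    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    def country_of(host):
        """Best-effort country code for a hostname, or None if unknown."""
        try:
            ip = socket.gethostbyname(host)
            return reader.country(ip).country.iso_code
        except Exception:
            return None

    def eligible_jobs(peer_country, jobs):
        """Drop jobs targeting sites in the peer's own jurisdiction."""
        return [job for job in jobs if country_of(job["host"]) != peer_country]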
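
Likewise for (3) and (4), a sketch of what a recipe and the broken-recipe check might look like; the recipe format and field names are made up, but lxml and its XPath API are real:

    # Hypothetical "recipe": a mapping from output field names to XPath
    # selectors that undo a site's template, using the (real) lxml library.
    from lxml import html

    EXAMPLE_RECIPE = {  # invented selectors, for illustration only
        "title": "//p[@class='row']//a/text()",
        "price": "//span[@class='price']/text()",
    }

    def apply_recipe(page_source, recipe):
        """Turn scraped HTML back into a JSON-ready dict."""
        tree = html.fromstring(page_source)
        return {field: tree.xpath(xpath) for field, xpath in recipe.items()}

    def recipe_is_broken(page_source, recipe):
        """A page that previously yielded good data now yields empty
        fields: the site's templates probably changed."""
        extracted = apply_recipe(page_source, recipe)
        return any(not values for values in extracted.values())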
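
And the expiry rule from (5) is essentially a TTL cache keyed by URL; a single-node stand-in for the DHT might look like this (all names invented):

    # Local stand-in for the DHT cache in (5): each datum is scraped at
    # most once per TTL window; otherwise the cached copy is served. In
    # the real system this dict would be a DHT shared by the swarm.
    import time

    TTL_SECONDS = 3600  # arbitrary example interval
    _cache = {}         # url -> (fetched_at, data)

    def fetch_with_ttl(url, scrape_fn):
        """Return cached data if still fresh, else scrape and cache."""
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < TTL_SECONDS:
            return hit[1]
        data = scrape_fn(url)
        _cache[url] = (now, data)
        return data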

AFAIK such a system leaves you open to, at most, a lawsuit for inducing a third party to breach contract, assuming that third party is even in a jurisdiction where they can be held liable. But such a lawsuit would be tenuous at best. Furthermore, the plaintiff would have to prove that the third party ever visited the site in question through non-automated means: someone who never visits the site in person cannot be held to its terms of service, and therefore can't be liable for breaching a contract they never agreed to.

FWIW IANAL, but I have talked to a bunch of lawyers about this, and the above is the closest I could come up with to a service that scrapes sites without running afoul of the law.

Anyways, such a system is in the realm of black-hat tactics, but to be honest, monopolizing data to extract exorbitant rents and abusing third-party developers is much more despicable. The whole world would be better off if we commoditized access to all sorts of metadata.


I really like the idea; I definitely could have used this in several of my projects over the years. It'd be cool if there were more use cases to give a better picture of everything that's possible with the scraping service; that's what drew me into IFTTT when I first visited their website. The example of getting the title tag from a website wasn't that exciting :). And any chance of a trial period? I'd be more willing to pay after seeing a functioning implementation in my website/app.


Basically anything is possible with the scraping service (provided the target pages don't render themselves via javascript).

It will grab the DOM and you can parse through it and return a nice little JSON object.
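
In generic terms (this is not tubes.io's actual API, just the grab-the-DOM-then-emit-JSON pattern, sketched with the requests and BeautifulSoup libraries), that looks like:

    # Grab a page's DOM, pick out what you want, return JSON.
    import json
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com")
    soup = BeautifulSoup(resp.text, "html.parser")
    result = {
        "title": soup.title.string if soup.title else None,
        "links": [a.get("href") for a in soup.find_all("a")],
    }
    print(json.dumps(result))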


A headless webkit browser like PhantomJS can scrape pages that render themselves via javascript.
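
For instance, via the Selenium Python bindings of the era, which could drive PhantomJS (the URL is a placeholder):

    # Scrape a javascript-rendered page with PhantomJS through Selenium;
    # requires the phantomjs binary on your PATH.
    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("https://example.com")
    print(driver.title)        # title after scripts have run
    html = driver.page_source  # the rendered DOM, ready to parse
    driver.quit()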


I could have a need for this type of service as well, but it's not clear what benefit it provides over PhantomJS. Is it just the alternating of the scraping IP?

A free trial would be helpful, but before I sign up, I need to know what I gain from the service. It seems like I'd still have to write all the same logic, and tubes.io would just take care of making the web request. If that's all, it's not exactly something I'd be willing to pay for.


I'm a tubes.io customer (and a happy one, too), and my main uses for it are the alternating IP and the fact that it lets me scale up scraping operations without having to deal with too much on my backend.

The alternating IP is actually a bit of a big deal: it's a pain to run your own proxies and manage the proxy switching yourself (so painful that I would pay ten bucks a month for someone else to do it for me).
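
For anyone who hasn't felt that pain, hand-rolled proxy rotation looks roughly like this (the proxy addresses are placeholders), and you still have to source, monitor, and replace the proxies yourself:

    # Minimal proxy rotation with the requests library; the proxy
    # addresses below are placeholders, not working endpoints.
    import itertools
    import requests

    PROXIES = itertools.cycle([
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ])

    def fetch(url):
        """Fetch a URL through the next proxy in the rotation."""
        proxy = next(PROXIES)
        return requests.get(url, proxies={"http": proxy, "https": proxy})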


True, alternating IPs are useful, except when you already have agreements with the target site to do the scraping, which we do.

So what would interest me is something that makes scraping easier or more robust. So far, it looks like the service still has some growing to do before it gets there.



